WO2021190078A1 - Method and apparatus for generating a short video, related device, and medium - Google Patents

Method and apparatus for generating a short video, related device, and medium

Info

Publication number
WO2021190078A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
category
probability
semantic
segment
Prior art date
Application number
PCT/CN2021/070391
Other languages
English (en)
Chinese (zh)
Inventor
亢治
胡康康
李超
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2021190078A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Definitions

  • this application provides a short video generation method.
  • the device for generating a short video obtains a target video that includes multiple frames of video images, determines at least one video segment in the target video through semantic analysis, and obtains the start and end times of the at least one video segment together with the probability of the semantic category to which it belongs. A video segment consists of consecutive frames of video images, its number of frames can be equal to or less than the number of frames of the target video, and it belongs to one or more semantic categories, that is, the consecutive frames it contains belong to one or more semantic categories. Then, according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs, segments for short video generation are selected from the at least one video segment and the short video is synthesized.
  • the probability of the scene category output by the video semantic analysis model in this scenario can be the scene category probability of each frame of video image corresponding to the start and end times, or the scene category probability of each frame of the target video.
  • the recognition paths for the scene category and the behavior category are separated. Because scene categories are static rather than dynamic, the conventional single-frame image recognition approach is used for them, that is, scene categories are recognized through CNN10 and the second fully connected layer 50 alone, while FPN20, SPN30 and the first fully connected layer 40 focus on recognizing dynamic behavior categories. Each network thus handles its own processing direction; static scene categories are still added to the output result, while calculation time is saved and recognition accuracy is improved.
  • S103 According to the start and end time of the at least one video segment and the probability of the semantic category to which it belongs, generate a short video corresponding to the target video from the at least one video segment.
  • optionally, the short video generating device may also first obtain a topic keyword entered by the user or taken from the historical record, match the semantic category of the at least one video clip against the topic keyword, determine the video segments whose matching degree meets a threshold as topic video segments, and then generate a short video corresponding to the target video from at least one topic video segment (see the sketch below).
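  • The following Python sketch is one possible reading of that keyword-matching step; the data layout, the containment-based matching rule and the threshold value are illustrative assumptions, not details taken from the disclosure.

```python
def topic_video_segments(segments, topic_keyword, threshold=0.5):
    """Select segments whose semantic categories match a topic keyword.

    segments      : list of (start, end, categories) tuples, where categories
                    maps semantic category names to their probabilities.
    topic_keyword : keyword entered by the user or taken from the history.
    The matching rule (keyword containment, scored by the category probability)
    is an assumption made for illustration only.
    """
    selected = []
    for start, end, categories in segments:
        match = max((p for cat, p in categories.items()
                     if topic_keyword.lower() in cat.lower()), default=0.0)
        if match >= threshold:
            selected.append((start, end))
    return selected

# Hypothetical example: keep segments about "swimming"
segs = [(0.0, 4.0, {"swimming": 0.9, "beach": 0.6}),
        (4.0, 9.0, {"running": 0.8})]
print(topic_video_segments(segs, "swimming"))  # [(0.0, 4.0)]
```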
  • the target video may be subjected to Kernel Temporal Segmentation (KTS).
  • KTS is a change-point detection algorithm based on the kernel method. By examining the consistency of one-dimensional signal features, it detects jump points in the signal and can distinguish whether a jump is caused by noise or by a change of content.
  • KTS can perform statistical analysis on the feature data of each frame of the input target video to detect signal transition points, thereby dividing the target video into several non-overlapping divided segments with different content and yielding the start and end times of at least one divided segment. These are then combined with the start and end times of the at least one video segment to determine at least one overlapping segment between each video segment and each divided segment.
  • a summary video segment can then be determined from the at least one overlapping segment to generate a short video corresponding to the target video; the overlap computation is sketched below.
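  • A minimal sketch of that overlap step, assuming segments are represented as (start, end) time pairs in seconds; the function name and representation are assumptions, not part of the disclosure.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) in seconds

def overlapping_segments(video_segments: List[Interval],
                         kts_segments: List[Interval]) -> List[Interval]:
    """Intersect each semantic video segment with each KTS divided segment.

    The non-empty intersections are the candidate summary segments.
    """
    overlaps = []
    for seg_start, seg_end in video_segments:
        for div_start, div_end in kts_segments:
            start = max(seg_start, div_start)
            end = min(seg_end, div_end)
            if end > start:  # keep only non-empty intersections
                overlaps.append((start, end))
    return overlaps

# Example: one semantic segment [2.0, 9.0) against KTS boundaries at 0, 5 and 12 s
print(overlapping_segments([(2.0, 9.0)], [(0.0, 5.0), (5.0, 12.0)]))
# -> [(2.0, 5.0), (5.0, 9.0)]
```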
  • the probability of the semantic category includes the probability of the behavior category and the probability of the scene category. Because the behavior category probability applies to a whole video clip, whereas the scene category probability applies to each frame of video image within a video segment, the two probabilities can be combined before the summary video segment is selected. In other words, according to the start and end times of each video segment and the probability of its behavior category, together with the scene category probability of each frame of video image in each video segment, the average category probability of the at least one video segment can be determined first, and then, according to the average category probability of the at least one video segment, a short video corresponding to the target video is generated from the at least one video segment.
  • the short video generation device can determine the multiple frames of video images and the number of frames corresponding to a video segment according to the start and end times of the video segment, and take the behavior category probability of the video segment as the behavior category probability of each frame of video image in those frames; that is, the behavior category probability of every frame in the video segment is consistent with the behavior category probability of the entire video segment.
  • the scene category probability of each frame of video image in the multiple frames output by the video semantic analysis model is then obtained, the behavior category probability and the scene category probability of each frame in the multiple frames corresponding to the video segment are summed, and the sum over all frames is divided by the number of frames to obtain the average category probability of the video segment. Following this method, the average category probability of the at least one video segment is finally determined.
  • finally, the short video generating device can sort the video segments by their average category probability, automatically determine the summary video segment or take a user-specified summary video segment, and then synthesize a short video based on the summary video clips; a minimal sketch of this averaging and sorting is given below.
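  • The following sketch illustrates the averaging and sorting described above; the function, the segment names and the probability values are assumptions for illustration, not values from the disclosure.

```python
import numpy as np

def average_category_probability(behavior_prob: float,
                                 scene_probs: np.ndarray) -> float:
    """Average category probability of one video segment.

    behavior_prob : behavior category probability of the whole segment,
                    replicated to every frame as described above.
    scene_probs   : per-frame scene category probabilities of the segment.
    """
    per_frame = behavior_prob + scene_probs           # per-frame sum of the two probabilities
    return float(per_frame.sum() / len(scene_probs))  # divide by the frame count

# Hypothetical 4-frame segments
segments = {
    "seg_a": average_category_probability(0.8, np.array([0.6, 0.7, 0.9, 0.8])),
    "seg_b": average_category_probability(0.5, np.array([0.4, 0.3, 0.6, 0.5])),
}
# Sort candidate segments by their average category probability
summary_order = sorted(segments, key=segments.get, reverse=True)
print(summary_order)  # ['seg_a', 'seg_b']
```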
  • the specific details are similar to the two implementations in the first scenario; refer to the above description, and they are not repeated here.
  • subsequent operations can also be performed based on the above-mentioned overlapping segments after the KTS segmentation, which will not be repeated here.
  • the embodiment of the present application uses the video semantic analysis model to identify video clips belonging to one or more semantic categories in the target video, so that the continuous clips that best reflect the target video content can be extracted directly to synthesize the short video. This not only takes into account the continuity of content between frames in the target video, but also improves the presentation of the short video, so that the short video content better meets the actual needs of users, and the generation efficiency of the short video is improved as well.
  • FIG. 11 is a schematic flowchart of another short video generation method provided by an embodiment of the present application. The method includes but is not limited to the following steps:
  • S202 Obtain the start and end time, the semantic category and the probability of the semantic category of at least one video segment in the target video through semantic analysis.
  • for the specific implementation of S201-S202, refer to the description of S101-S102. The difference is that S102 may output only the probability of the semantic category, whereas S202 outputs both the semantic category and its probability; details are not repeated here.
  • S203 Determine the interest category probability of at least one video clip according to the probability of the semantic category to which each video clip belongs and the category weight corresponding to the semantic category to which it belongs.
  • the category weights can be used to characterize the user's interest in the respective semantic categories.
  • for example, the more images or videos of a semantic category the user has stored, the more interested the user is in that category, and the higher its category weight can be set; likewise, the more images or videos of a semantic category are viewed in the historical operation records, the more attention the user pays to that category, and a higher category weight can also be set.
  • the corresponding category weights can be determined for various semantic categories in advance, and then the category weights corresponding to the semantic categories of each video clip can be directly called.
  • the category weights corresponding to various semantic categories can be determined through the following steps:
  • Step 1 Obtain the media data information in the local database and historical operation records.
  • the local database may be a storage space for storing or processing various types of data, or a database dedicated to storing media data (pictures, videos, etc.), such as a gallery.
  • Historical operation records are the records generated by the user's various operations (browsing, moving, editing, etc.) on the data, such as local log files.
  • Media data information refers to various kinds of information about images and videos; it may include the images and videos themselves, their feature information, their operation information, or statistics on the various categories of images and videos, and so on.
  • Step 2 Determine the category weights corresponding to various semantic categories of the media data according to the media data information.
  • for example, for the statistics of the playing category, the browsing frequency is 2 times/day, the editing frequency is 1 time/day, the sharing frequency is 0.5 times/day, the browsing duration is 20 hours, and the editing duration is 40 hours; accordingly, the overall operation frequency of the playing category is 3.5 times/day and the overall operation duration is 60 hours.
  • the category weight corresponding to each semantic category is calculated according to the preset weight formula, combined with the number of occurrences, operation duration, and operation frequency of each semantic category.
  • the preset weight formula can reflect that the greater the number of occurrences, the operation duration, and the operation frequency, the higher the category weight of the semantic category to which it belongs.
  • count_freq_i, view_freq_i, view_time_i, share_freq_i, and edit_freq_i are, respectively, the number of occurrences, browsing frequency, browsing duration, sharing frequency, and editing frequency of semantic category i in the local database and historical operation records, and their counterparts summed over all h semantic categories identified in the local database and historical operation records are the total number of occurrences, browsing frequency, sharing frequency, and editing frequency; an illustrative computation follows.
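  • A minimal sketch of how such category weights might be computed, assuming an equal-weight average of normalized usage statistics; the preset weight formula itself is not reproduced here, only its stated monotonic behaviour (more occurrences, longer operation duration and higher operation frequency give a higher weight).

```python
import numpy as np

def category_weights(count_freq, view_freq, view_time, share_freq, edit_freq):
    """Illustrative category-weight computation.

    Each argument is a length-h array with one entry per semantic category i
    (count_freq_i, view_freq_i, view_time_i, share_freq_i, edit_freq_i)
    gathered from the local database and historical operation records.
    The equal-weight average of normalized terms is an assumed stand-in for
    the preset weight formula, chosen only to respect its stated monotonicity.
    """
    terms = [np.asarray(t, dtype=float) for t in
             (count_freq, view_freq, view_time, share_freq, edit_freq)]
    normalized = [t / t.sum() if t.sum() > 0 else t for t in terms]
    return sum(normalized) / len(normalized)

# Hypothetical statistics for h = 3 semantic categories
w = category_weights(count_freq=[120, 40, 10],
                     view_freq=[2.0, 0.5, 0.1],
                     view_time=[20.0, 5.0, 1.0],
                     share_freq=[0.5, 0.1, 0.0],
                     edit_freq=[1.0, 0.2, 0.0])
print(w)  # the most-used category receives the highest weight
```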
  • for each video segment, there can be one or more semantic categories.
  • if a video segment belongs to a single semantic category, the category weight of that semantic category can be determined, and the product of the category weight and the probability of belonging to that semantic category is used as the interest category probability of the video segment.
  • if a video segment belongs to multiple semantic categories, the category weight of each semantic category can be determined separately, and the products of each category weight and the corresponding probability are summed to obtain the interest category probability of the video clip.
  • for example, if the semantic categories of video clip A include category 1 and category 2, the probability of category 1 is P_1, the probability of category 2 is P_2, and the category weights corresponding to category 1 and category 2 are w_1 and w_2, then the interest category probability of video segment A is P_w = P_1*w_1 + P_2*w_2.
  • as mentioned above, the semantic categories can include many categories, and these categories can also be grouped into several major categories, so weights can be set for the major categories as well. For example, the smile, cry, and angry categories can all be regarded as expression or face categories, while the swimming, running, and playing categories can all be regarded as behavior categories.
  • different weights can then be set specifically for the two major categories, face and behavior.
  • the specific setting method can be adjusted by the user, or the weight of the major categories can be further determined according to the above-mentioned local database and historical operation records. Since the principle of the method is similar, it will not be repeated here.
  • in this case, the short video generation device may first determine, according to the above method, the category weights corresponding to the scene category probability and the behavior category probability of each frame of video image in each video segment, sum the products of each probability and its category weight to obtain the weighted probability of each frame of video image, and then divide the sum of the per-frame weighted probabilities by the number of frames to obtain the interest category probability of the video segment; a minimal sketch follows.
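  • One possible reading of this per-frame weighting, with illustrative category names, weights and probabilities that are not taken from the disclosure.

```python
def interest_category_probability(frame_probs, category_weights):
    """Interest category probability of one video segment.

    frame_probs      : list of dicts, one per frame, mapping each semantic
                       category (behavior or scene) to its probability.
    category_weights : dict mapping each semantic category to its weight.
    Both the data layout and the example values are illustrative assumptions.
    """
    per_frame = [sum(p * category_weights.get(cat, 0.0)
                     for cat, p in probs.items())
                 for probs in frame_probs]
    return sum(per_frame) / len(per_frame)

# Hypothetical 2-frame segment whose frames belong to "playing" and "lawn"
weights = {"playing": 0.6, "lawn": 0.2}
frames = [{"playing": 0.9, "lawn": 0.7},
          {"playing": 0.8, "lawn": 0.6}]
print(interest_category_probability(frames, weights))
# per-frame weighted sums 0.68 and 0.60, averaged over 2 frames -> 0.64
```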
  • S204 Determine a short video corresponding to the target video from the at least one video segment according to the start and end time of the at least one video segment and the interest category probability.
  • S204 is similar to the two implementations of the first possible implementation scenario in S103; the difference is that S103 is based on the probability of the semantic category to which a segment belongs, whereas S204 is based on the interest category probability. For the specific implementation, refer to S103; it is not repeated here.
  • subsequent operations can also be performed based on the above-mentioned overlapping segments after the KTS segmentation, which will not be repeated here.
  • the interest category probability in S204 comprehensively describes both the importance of a video clip and the user's interest in it. Therefore, after sorting, the summary video clips can be further selected so that the clips presented are as important and as interesting to the user as possible.
  • the embodiments of this application further analyze user preferences based on the local database and historical operation records, so that the video clips selected for synthesizing the short video are more targeted and better match the user's interests, and personalized short videos tailored to each user can be obtained.
  • FIG. 12 shows a schematic structural diagram of a terminal device 100 as the device for generating a short video.
  • the terminal device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the terminal device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the terminal device 100.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the system is improved.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the terminal device 100.
  • the terminal device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
  • the wireless communication function of the terminal device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • the terminal device 100 implements a display function through a GPU, a display screen 194, and an application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • the display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the terminal device 100 may include one or N display screens 194, and N is a positive integer greater than one.
  • the terminal device 100 can implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
  • the ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
  • the ISP can also perform algorithm optimization on image noise, brightness, and skin color, and can also optimize parameters such as the exposure and color temperature of the shooting scene.
  • the ISP may be provided in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the camera 193 includes a camera that collects images required for face recognition, such as an infrared camera or other cameras.
  • the camera that collects the image required for face recognition is generally located on the front of the terminal device, for example, above the touch screen, and may also be located at other positions, which is not limited in the embodiment of the present invention.
  • the terminal device 100 may include other cameras.
  • the terminal device may also include a dot matrix transmitter (not shown in the figure) for emitting light.
  • the camera collects the light reflected by the face to obtain a face image, and the processor processes and analyzes the face image, and compares it with the stored face image information for verification.
  • the digital signal processor is used to process digital signals. In addition to digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • through the NPU, applications such as intelligent cognition of the terminal device 100 can be implemented, for example, image recognition, face recognition, voice recognition, text understanding, and so on.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by running instructions stored in the internal memory 121.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application required for a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.) and so on.
  • the storage data area can store data created during the use of the terminal device 100 (such as face information template data, fingerprint information template, etc.) and the like.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the terminal device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the speaker 170A also called “speaker” is used to convert audio electrical signals into sound signals.
  • the receiver 170B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • the microphone 170C, also called a “mike” or “mic”, is used to convert sound signals into electrical signals.
  • the earphone interface 170D is used to connect wired earphones.
  • the earphone interface 170D may be a USB interface 130, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A may be provided on the display screen 194.
  • the gyro sensor 180B may be used to determine the movement posture of the terminal device 100.
  • the angular velocity of the terminal device 100 around three axes (i.e., the x, y, and z axes) can be determined by the gyro sensor 180B.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the ambient light sensor 180L is used to sense the brightness of the ambient light.
  • the terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived brightness of the ambient light.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, access application lock, fingerprint photography, fingerprint answering calls, and so on.
  • the fingerprint sensor 180H can be arranged under the touch screen; the terminal device 100 can receive a user's touch operation on the touch screen in the area corresponding to the fingerprint sensor, and can collect the fingerprint information of the user's finger in response to the touch operation, so as to realize the fingerprint-based functions involved in the embodiments of this application, such as opening a hidden album after the fingerprint is verified, opening a hidden application after the fingerprint is verified, logging in to an account after the fingerprint is verified, and completing a payment after the fingerprint is verified.
  • the temperature sensor 180J is used to detect temperature.
  • the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy.
  • the touch sensor 180K is also called a “touch panel”.
  • the touch sensor 180K may be disposed on the display screen 194, and the touch screen is composed of the touch sensor 180K and the display screen 194, which is also called a “touch screen”.
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the terminal device 100, which is different from the position of the display screen 194.
  • the button 190 includes a power-on button, a volume button, and so on.
  • the button 190 may be a mechanical button. It can also be a touch button.
  • the terminal device 100 may receive key input, and generate key signal input related to user settings and function control of the terminal device 100.
  • the indicator 192 may be an indicator light, which may be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 195 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the terminal device 100.
  • the terminal device 100 adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal device 100 and cannot be separated from the terminal device 100.
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the terminal device 100 by way of example.
  • FIG. 13 is a block diagram of the software structure of the terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications (also called applications) such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, and a notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the terminal device 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialogue interface.
  • for example, text information is prompted in the status bar, a prompt tone is produced, the terminal device vibrates, or the indicator light flashes.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • FIG. 14 shows a schematic structural diagram of the server 200 as the device for generating short videos.
  • server 200 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the server 200 may include a processor 210 and a memory 220, and the processor 210 may be connected to the memory 220 through a bus.
  • the processor 210 may include one or more processing units.
  • the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the server 200.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 210 for storing instructions and data.
  • the memory in the processor 210 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 210. If the processor 210 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 210 is reduced, and the efficiency of the system is improved.
  • the processor 210 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the server 200.
  • the server 200 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the server 200 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the server 200 may support one or more video codecs. In this way, the server 200 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • through the NPU, applications such as intelligent cognition of the server 200 can be realized, for example, image recognition, face recognition, voice recognition, text understanding, and so on.
  • the memory 220 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 210 executes various functional applications and data processing of the server 200 by running instructions stored in the memory 220.
  • the memory 220 may include a program storage area and a data storage area.
  • the storage program area can store an operating system, at least one application required for a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.) and so on.
  • the storage data area can store data created during the use of the server 200 (such as face information template data, fingerprint information template, etc.) and the like.
  • the memory 220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the aforementioned server 200 may also be a virtualized server, that is, there are multiple virtualized logical servers on the server 200, and each logical server can rely on the software, hardware, and other components of the server 200 to achieve the same data storage and processing functions.
  • FIG. 15 is a schematic structural diagram of a short video generating apparatus 300 in an embodiment of the application.
  • the short video generating apparatus 300 may be applied to the aforementioned terminal device 100 or server 200.
  • the device 300 for generating a short video may include:
  • the video acquisition module 310 is used to acquire the target video
  • the video analysis module 320 is configured to obtain, through semantic analysis, the start and end times of at least one video segment in the target video and the probability of the semantic category to which the segment belongs, wherein each of the video segments belongs to one or more semantic categories;
  • the short video generation module 330 is configured to generate a short video corresponding to the target video from the at least one video segment according to the start and end time of the at least one video segment and the probability of the semantic category to which it belongs.
  • the target video includes m frames of video images, and the m is a positive integer; the video analysis module 320 is specifically configured to:
  • the probability of the semantic category includes the probability of the behavior category and the probability of the scene category;
  • the target video includes m frames of video images, and the m is a positive integer;
  • the video analysis module 320 is specifically used for:
  • the short video generation module 330 is specifically configured to:
  • Determining the probability of the behavior category of the video clip as the probability of the behavior category of each frame of video image in the video clip
  • the sum of the probability of the behavior category and the probability of the scene category of each frame of the video image in the multi-frame video image is divided by the number of frames to obtain the average category probability of the video segment.
  • a short video corresponding to the target video is generated from the at least one video segment.
  • the weight of the category corresponding to each semantic category is calculated.
  • the program can be stored in a computer-readable storage medium; when the program is executed, it may include the procedures of the above-mentioned method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a method and apparatus for generating a short video, a related device, and a medium. The method comprises the steps of: acquiring a target video; obtaining, by means of semantic analysis, the start time and end time of at least one video segment in the target video and the probability of a semantic category to which the video segment belongs, each video segment belonging to one or more semantic categories; and then, according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs, generating a short video corresponding to the target video from the at least one video segment. In a target video, video segments belonging to one or more semantic categories are identified by means of semantic analysis, so that the continuous video segments that best reflect the content of the target video are directly extracted to compose a short video. Consequently, not only is the continuity of content between frames in the target video taken into account, but the efficiency of generating the short video is also improved.
PCT/CN2021/070391 2020-03-26 2021-01-06 Method and apparatus for generating a short video, related device, and medium WO2021190078A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010223607.1 2020-03-26
CN202010223607.1A CN113453040B (zh) 2020-03-26 2020-03-26 短视频的生成方法、装置、相关设备及介质

Publications (1)

Publication Number Publication Date
WO2021190078A1 true WO2021190078A1 (fr) 2021-09-30

Family

ID=77807575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070391 WO2021190078A1 (fr) Method and apparatus for generating a short video, related device, and medium

Country Status (2)

Country Link
CN (1) CN113453040B (fr)
WO (1) WO2021190078A1 (fr)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117096A (zh) * 2021-11-23 2022-03-01 腾讯科技(深圳)有限公司 多媒体数据处理方法及相关设备
CN114390365A (zh) * 2022-01-04 2022-04-22 京东科技信息技术有限公司 用于生成视频信息的方法和装置
CN115119050A (zh) * 2022-06-30 2022-09-27 北京奇艺世纪科技有限公司 一种视频剪辑方法和装置、电子设备和存储介质
CN116074642A (zh) * 2023-03-28 2023-05-05 石家庄铁道大学 基于多目标处理单元的监控视频浓缩方法
CN116708945A (zh) * 2023-04-12 2023-09-05 北京优贝卡科技有限公司 一种媒体编辑方法、装置、设备和存储介质
CN116843643A (zh) * 2023-07-03 2023-10-03 北京语言大学 一种视频美学质量评价数据集构造方法
CN117499745A (zh) * 2023-04-12 2024-02-02 北京优贝卡科技有限公司 一种媒体编辑方法、装置、设备和存储介质
CN117880444A (zh) * 2024-03-12 2024-04-12 之江实验室 一种长短时特征引导的人体康复运动视频数据生成方法
WO2024082914A1 (fr) * 2022-10-20 2024-04-25 华为技术有限公司 Procédé de réponse à une question vidéo et dispositif électronique
WO2024109246A1 (fr) * 2022-11-22 2024-05-30 荣耀终端有限公司 Procédé de détermination de politique pour générer une vidéo, et dispositif électronique

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886957A (zh) * 2023-09-05 2023-10-13 深圳市蓝鲸智联科技股份有限公司 一种一键生成车载短视频vlog方法及系统

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050159956A1 (en) * 1999-09-13 2005-07-21 Microsoft Corporation Annotating programs for automatic summary generation
CN102073864A (zh) * 2010-12-01 2011-05-25 北京邮电大学 四层结构的体育视频中足球项目检测系统及实现
US20140161351A1 (en) * 2006-04-12 2014-06-12 Google Inc. Method and apparatus for automatically summarizing video
CN105138953A (zh) * 2015-07-09 2015-12-09 浙江大学 一种基于连续的多实例学习的视频中动作识别的方法
CN106572387A (zh) * 2016-11-09 2017-04-19 广州视源电子科技股份有限公司 视频序列对齐方法和系统
CN106897714A (zh) * 2017-03-23 2017-06-27 北京大学深圳研究生院 一种基于卷积神经网络的视频动作检测方法
CN108140032A (zh) * 2015-10-28 2018-06-08 英特尔公司 自动视频概括
CN108427713A (zh) * 2018-02-01 2018-08-21 宁波诺丁汉大学 一种用于自制视频的视频摘要方法及系统
CN109697434A (zh) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 一种行为识别方法、装置和存储介质
CN110851621A (zh) * 2019-10-31 2020-02-28 中国科学院自动化研究所 基于知识图谱预测视频精彩级别的方法、装置及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196281A1 (fr) * 2014-06-24 2015-12-30 Sportlogiq Inc. Système et procédé de description d'évènement visuel et d'analyse d'évènement
US10311913B1 (en) * 2018-02-22 2019-06-04 Adobe Inc. Summarizing video content based on memorability of the video content
CN110798752B (zh) * 2018-08-03 2021-10-15 北京京东尚科信息技术有限公司 用于生成视频摘要的方法和系统
CN110287374B (zh) * 2019-06-14 2023-01-03 天津大学 一种基于分布一致性的自注意力视频摘要方法

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050159956A1 (en) * 1999-09-13 2005-07-21 Microsoft Corporation Annotating programs for automatic summary generation
US20140161351A1 (en) * 2006-04-12 2014-06-12 Google Inc. Method and apparatus for automatically summarizing video
CN102073864A (zh) * 2010-12-01 2011-05-25 北京邮电大学 四层结构的体育视频中足球项目检测系统及实现
CN105138953A (zh) * 2015-07-09 2015-12-09 浙江大学 一种基于连续的多实例学习的视频中动作识别的方法
CN108140032A (zh) * 2015-10-28 2018-06-08 英特尔公司 自动视频概括
CN106572387A (zh) * 2016-11-09 2017-04-19 广州视源电子科技股份有限公司 视频序列对齐方法和系统
CN106897714A (zh) * 2017-03-23 2017-06-27 北京大学深圳研究生院 一种基于卷积神经网络的视频动作检测方法
CN108427713A (zh) * 2018-02-01 2018-08-21 宁波诺丁汉大学 一种用于自制视频的视频摘要方法及系统
CN109697434A (zh) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 一种行为识别方法、装置和存储介质
CN110851621A (zh) * 2019-10-31 2020-02-28 中国科学院自动化研究所 基于知识图谱预测视频精彩级别的方法、装置及存储介质

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117096A (zh) * 2021-11-23 2022-03-01 腾讯科技(深圳)有限公司 多媒体数据处理方法及相关设备
CN114390365A (zh) * 2022-01-04 2022-04-22 京东科技信息技术有限公司 用于生成视频信息的方法和装置
CN114390365B (zh) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 用于生成视频信息的方法和装置
CN115119050A (zh) * 2022-06-30 2022-09-27 北京奇艺世纪科技有限公司 一种视频剪辑方法和装置、电子设备和存储介质
CN115119050B (zh) * 2022-06-30 2023-12-15 北京奇艺世纪科技有限公司 一种视频剪辑方法和装置、电子设备和存储介质
WO2024082914A1 (fr) * 2022-10-20 2024-04-25 华为技术有限公司 Procédé de réponse à une question vidéo et dispositif électronique
WO2024109246A1 (fr) * 2022-11-22 2024-05-30 荣耀终端有限公司 Procédé de détermination de politique pour générer une vidéo, et dispositif électronique
CN116074642A (zh) * 2023-03-28 2023-05-05 石家庄铁道大学 基于多目标处理单元的监控视频浓缩方法
CN117499745A (zh) * 2023-04-12 2024-02-02 北京优贝卡科技有限公司 一种媒体编辑方法、装置、设备和存储介质
CN116708945B (zh) * 2023-04-12 2024-04-16 半月谈新媒体科技有限公司 一种媒体编辑方法、装置、设备和存储介质
CN116708945A (zh) * 2023-04-12 2023-09-05 北京优贝卡科技有限公司 一种媒体编辑方法、装置、设备和存储介质
CN116843643B (zh) * 2023-07-03 2024-01-16 北京语言大学 一种视频美学质量评价数据集构造方法
CN116843643A (zh) * 2023-07-03 2023-10-03 北京语言大学 一种视频美学质量评价数据集构造方法
CN117880444A (zh) * 2024-03-12 2024-04-12 之江实验室 一种长短时特征引导的人体康复运动视频数据生成方法
CN117880444B (zh) * 2024-03-12 2024-05-24 之江实验室 一种长短时特征引导的人体康复运动视频数据生成方法

Also Published As

Publication number Publication date
CN113453040A (zh) 2021-09-28
CN113453040B (zh) 2023-03-10

Similar Documents

Publication Publication Date Title
WO2021190078A1 (fr) Method and apparatus for generating a short video, related device, and medium
WO2021013145A1 (fr) Procédé de démarrage d'application rapide et dispositif associé
CN111465918B (zh) 在预览界面中显示业务信息的方法及电子设备
CN110377204B (zh) 一种生成用户头像的方法及电子设备
WO2022100221A1 (fr) Procédé et appareil de traitement de récupération et support de stockage
US20220343648A1 (en) Image selection method and electronic device
US12010257B2 (en) Image classification method and electronic device
WO2023160170A1 (fr) Procédé de photographie et dispositif électronique
WO2024055797A9 (fr) Procédé de capture d'images dans une vidéo, et dispositif électronique
CN113536866A (zh) 一种人物追踪显示方法和电子设备
CN115661912A (zh) 图像处理方法、模型训练方法、电子设备及可读存储介质
WO2020073317A1 (fr) Procédé de gestion de fichier et dispositif électronique
CN115661941B (zh) 手势识别方法和电子设备
WO2022143314A1 (fr) Procédé et appareil d'enregistrement d'objet
CN115086710B (zh) 视频播放方法、终端设备、装置、系统及存储介质
WO2024067442A1 (fr) Procédé de gestion de données et appareil associé
CN116828099B (zh) 一种拍摄方法、介质和电子设备
EP4372518A1 (fr) Système, procédé de génération de liste de chansons et dispositif électronique
US20240062392A1 (en) Method for determining tracking target and electronic device
WO2023246666A1 (fr) Procédé de recherche et dispositif électronique
CN114513575B (zh) 一种收藏处理的方法及相关装置
WO2024082914A1 (fr) Procédé de réponse à une question vidéo et dispositif électronique
WO2022143083A1 (fr) Procédé et dispositif de recherche d'application, et support
CN117112087A (zh) 一种桌面卡片的排序方法、电子设备以及介质
CN115802147A (zh) 一种录像中抓拍图像的方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21774416

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21774416

Country of ref document: EP

Kind code of ref document: A1