WO2021190078A1 - Method and apparatus for generating short video, and related device and medium - Google Patents


Info

Publication number
WO2021190078A1
WO2021190078A1 · PCT/CN2021/070391 · CN2021070391W
Authority
WO
WIPO (PCT)
Prior art keywords
video
category
probability
semantic
segment
Prior art date
Application number
PCT/CN2021/070391
Other languages
French (fr)
Chinese (zh)
Inventor
亢治
胡康康
李超
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021190078A1 publication Critical patent/WO2021190078A1/en


Classifications

    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/23424: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • G06F 18/00: Pattern recognition

Definitions

  • this application provides a short video generation method.
  • the device for generating a short video obtains a target video, where the target video includes multiple frames of video images; determines at least one video segment in the target video through semantic analysis; and obtains the start and end times of the at least one video segment and the probability of the semantic category to which it belongs. A video segment includes consecutive frames of video images, the number of frames in a video segment can be equal to or less than the number of frames in the target video, and a video segment belongs to one or more semantic categories, that is, the consecutive frames of video images it includes belong to one or more semantic categories. Then, according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs, a segment for short video generation is selected from the at least one video segment, and the short video is synthesized.
  • in this scenario, the scene category probability output by the video semantic analysis model can be the scene category probability of each frame of video image within the start and end times, or the scene category probability of every frame of the target video.
  • the recognition paths for scene categories and behavior categories are separated. Scene categories, whether relatively static or dynamic in appearance, are recognized with the conventional single-frame image recognition method, that is, through CNN 10 and the second fully connected layer 50 alone, while FPN 20, SPN 30 and the first fully connected layer 40 focus on recognizing dynamic behavior categories. In this way, each network handles the processing direction it is suited to: static scene categories are still added to the output result, while computation time is saved and recognition accuracy is improved.
  • S103 According to the start and end time of the at least one video segment and the probability of the semantic category to which it belongs, generate a short video corresponding to the target video from the at least one video segment.
  • the short video generating device may also first obtain the topic keyword entered by the user or found in the historical records, match the semantic category of the at least one video clip against the topic keyword, determine the video segments whose matching degree meets a threshold as topic video segments, and then generate a short video corresponding to the target video from the at least one topic video segment (a minimal sketch follows).
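As a rough illustration of this keyword matching, the following Python sketch selects topic video segments whose semantic category matches a topic keyword. It is not taken from the patent; the clip structure, the string-similarity measure, and the 0.6 threshold are all illustrative assumptions.

```python
# Hypothetical sketch of topic-keyword matching; the clip structure,
# similarity measure, and 0.6 threshold are illustrative assumptions.
from difflib import SequenceMatcher

def select_topic_clips(clips, topic_keyword, threshold=0.6):
    """clips: list of dicts like {"start": 3.0, "end": 8.5, "categories": ["swimming"]}."""
    topic_clips = []
    for clip in clips:
        # Matching degree = best string similarity over the clip's semantic categories.
        degree = max(SequenceMatcher(None, cat, topic_keyword).ratio()
                     for cat in clip["categories"])
        if degree >= threshold:
            topic_clips.append(clip)
    return topic_clips
```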
  • the target video may be subjected to Kernel Temporal Segmentation (KTS).
  • KTS is a change point detection algorithm based on the kernel method. It detects the jump point in the signal by focusing on the consistency of one-dimensional signal characteristics, and can distinguish whether the signal jump is caused by noise or content change.
  • KTS can perform statistical analysis on the feature data of each frame of the input target video to detect signal transition points, thereby dividing the video into clips of different content: the target video is divided into several non-overlapping segments, and the start and end times of at least one segment are obtained. Then, combined with the start and end times of the at least one video segment, at least one overlapping segment between each video segment and each divided segment is determined.
  • a summary video segment can be determined from at least one overlapping segment to generate a short video corresponding to the target video.
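The intersection step described above can be pictured with a short sketch. This is not the patent's implementation; KTS itself is assumed to exist elsewhere, and `kts_boundaries` simply stands for its detected change points (in seconds).

```python
# Illustrative sketch: intersect semantically detected segments with the
# non-overlapping KTS segments to obtain the overlapping segments.
def overlapping_segments(video_segments, kts_boundaries):
    """video_segments: list of (start, end) pairs; kts_boundaries: sorted change points."""
    kts_segments = list(zip(kts_boundaries[:-1], kts_boundaries[1:]))
    overlaps = []
    for vs_start, vs_end in video_segments:
        for ks_start, ks_end in kts_segments:
            start, end = max(vs_start, ks_start), min(vs_end, ks_end)
            if start < end:  # keep non-empty intersections only
                overlaps.append((start, end))
    return overlaps

# overlapping_segments([(2.0, 9.0)], [0.0, 5.0, 12.0]) -> [(2.0, 5.0), (5.0, 9.0)]
```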
  • the probability of belonging to a semantic category includes the probability of belonging to a behavior category and the probability of belonging to a scene category. Because the behavior category probability applies to a whole video clip, while the scene category probability applies to each frame of video image in a video segment, the two probabilities can be integrated before the summary video segment is selected. In other words, according to the start and end times of each video segment and its behavior category probability, together with the scene category probability of each frame of video image in each video segment, the average category probability of the at least one video segment can be determined first; then, according to the average category probability of the at least one video segment, a short video corresponding to the target video is generated from the at least one video segment.
  • the short video generation device can determine the multi-frame video images and the frame count corresponding to a video segment according to the segment's start and end times, and take the behavior category probability of the video segment as the behavior category probability of each frame of video image in those multi-frame video images; that is, the behavior category probability of every frame in the video clip is consistent with the behavior category probability of the entire video clip.
  • the scene category probability of each frame of video image in the multi-frame video images output by the video semantic analysis model is obtained; for each frame, the behavior category probability is added to the corresponding scene category probability, and the sum over all frames is divided by the frame count to obtain the average category probability of the video segment. Following this method, the average category probability of the at least one video segment is finally determined (a sketch follows).
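A minimal sketch of this averaging, under the assumption that the clip-level behavior probability is simply broadcast to every frame as described above:

```python
import numpy as np

def average_category_probability(behavior_prob, scene_probs):
    """behavior_prob: clip-level behavior category probability (float);
    scene_probs: per-frame scene category probabilities for the clip."""
    scene_probs = np.asarray(scene_probs, dtype=float)
    num_frames = len(scene_probs)
    # Each frame contributes (behavior probability + its scene probability);
    # the total is divided by the frame count.
    return float((behavior_prob * num_frames + scene_probs.sum()) / num_frames)

# average_category_probability(0.9, [0.6, 0.8, 0.7]) -> 0.9 + 0.7 = 1.6
```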
  • the short video generating device can sort the segments by average category probability and either automatically determine the summary video segment or take a user-specified summary video segment, and then synthesize a short video based on the summary video clip.
  • the specific details are similar to the two implementations in the first scenario; reference may be made to the above description, and details are not repeated here.
  • subsequent operations can also be performed based on the above-mentioned overlapping segments after the KTS segmentation, which will not be repeated here.
  • the embodiment of the present application uses the video semantic analysis model to identify video clips with one or more semantic categories in the target video, so as to directly extract the continuous video clips that best reflect the target video content and synthesize them into a short video. This not only takes into account the continuity of content between frames in the target video and improves the presentation effect of the short video, so that the short video content better meets users' actual needs, but also improves the generation efficiency of the short video.
  • FIG. 11 is a schematic flowchart of another short video generation method provided by an embodiment of the present application. The method includes but is not limited to the following steps:
  • S202 Obtain the start and end time, the semantic category and the probability of the semantic category of at least one video segment in the target video through semantic analysis.
  • for the specific implementation of S201-S202, refer to the description of S101-S102. The difference is that S102 may output only the probability of the semantic category, while S202 outputs both the semantic category and the probability of the semantic category; details are not repeated here.
  • S203 Determine the interest category probability of at least one video clip according to the probability of the semantic category to which each video clip belongs and the category weight corresponding to the semantic category to which it belongs.
  • the category weights can be used to characterize the user's interest in the respective semantic categories.
  • a semantic category with a high occurrence count in the local database indicates that the user has stored many images or videos of that category, that is, the user is more interested in it, so a higher category weight can be set; similarly, the more images or videos of a semantic category are viewed in the historical operation records, the more attention the user pays to that category, and a higher category weight can likewise be set.
  • the corresponding category weights can be determined for various semantic categories in advance, and then the category weights corresponding to the semantic categories of each video clip can be directly called.
  • the category weights corresponding to various semantic categories can be determined through the following steps:
  • Step 1 Obtain the media data information in the local database and historical operation records.
  • the local database may be a storage space for storing or processing various types of data, or a dedicated database dedicated to storing media data (pictures, videos, etc.), such as a gallery.
  • Historical operation records refer to records generated by users of various operations (browsing, moving, editing, etc.) of data, such as local log files.
  • media data information refers to various kinds of information about images and videos; it can include the images and videos themselves, the feature information of images and videos, the operation information of images and videos, and statistics of the various categories of images and videos, etc.
  • Step 2 Determine the category weights corresponding to various semantic categories of the media data according to the media data information.
  • for example, for the "playing" category, the statistics may show a browsing frequency of 2 times/day, an editing frequency of 1 time/day, a sharing frequency of 0.5 times/day, a browsing duration of 20 hours and an editing duration of 40 hours; the overall operating frequency of the "playing" category is then 3.5 times/day and the operating duration is 60 hours.
  • the category weight corresponding to each semantic category is calculated according to the preset weight formula, combined with the number of occurrences, operation duration, and operation frequency of each semantic category.
  • the preset weight formula can reflect that the greater the number of occurrences, the operation duration, and the operation frequency, the higher the category weight of the semantic category to which it belongs.
  • count_freq_i, view_freq_i, view_time_i, share_freq_i and edit_freq_i are, respectively, the number of occurrences, browsing frequency, browsing duration, sharing frequency and editing frequency of semantic category i in the local database and historical operation records; the corresponding totals are taken over all h semantic categories identified in the local database and historical operation records.
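The exact preset weight formula is not reproduced in this text; one plausible form consistent with the variables just listed, in which each statistic of category i is normalized by its total over all h identified categories, would be:

```latex
w_i = \frac{count\_freq_i}{\sum_{j=1}^{h} count\_freq_j}
    + \frac{view\_freq_i}{\sum_{j=1}^{h} view\_freq_j}
    + \frac{view\_time_i}{\sum_{j=1}^{h} view\_time_j}
    + \frac{share\_freq_i}{\sum_{j=1}^{h} share\_freq_j}
    + \frac{edit\_freq_i}{\sum_{j=1}^{h} edit\_freq_j}
```

Any formula of this shape satisfies the stated property that larger occurrence counts, operation durations and operation frequencies yield a higher category weight.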
  • for each video segment, there can be one or more semantic categories.
  • when a video segment belongs to a single semantic category, the category weight of that semantic category can be determined, and the product of the category weight and the probability of belonging to that semantic category is used as the interest category probability of the video segment.
  • when a video segment belongs to multiple semantic categories, the category weight of each semantic category can be determined separately, and the products of each category weight and the corresponding probability can be calculated and summed to obtain the interest category probability of the video clip.
  • for example, if the semantic categories of video clip A include category 1 and category 2, the probability of category 1 is P_1, the probability of category 2 is P_2, and the category weights corresponding to category 1 and category 2 are w_1 and w_2, then the interest category probability of video clip A is P_w = P_1*w_1 + P_2*w_2.
  • the semantic categories can include many categories; as mentioned above, multiple categories can also be grouped into several major categories, so weights for the major categories can also be set. For example, the smile, cry and angry categories can all be regarded as expression (face) categories, while the swimming, running and playing categories can all be regarded as behavior categories.
  • specifically, different weights can be set for the face and behavior major categories.
  • the specific setting method can be adjusted by the user, or the weight of the major categories can be further determined according to the above-mentioned local database and historical operation records. Since the principle of the method is similar, it will not be repeated here.
  • the short video generation device may first determine, for each frame of video image in each video segment, the category weights corresponding to the scene category probability and the behavior category probability according to the above method, sum the products of each probability and its category weight to obtain the weighted probability of each frame, and then divide the sum of the frames' weighted probabilities by the frame count to obtain the interest category probability of the video segment (see the sketch below).
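The following sketch (illustrative only; the dictionary shapes are assumptions) shows both computations: the clip-level weighted sum P_w = Σ P_k·w_k, and the per-frame variant averaged over the frame count:

```python
import numpy as np

def clip_interest_probability(category_probs, category_weights):
    """category_probs: {semantic category: probability} for one clip;
    category_weights: {semantic category: weight}."""
    # P_w = sum over the clip's semantic categories of probability * weight.
    return sum(p * category_weights[c] for c, p in category_probs.items())

def frame_averaged_interest_probability(per_frame_probs, category_weights):
    """per_frame_probs: one {category: probability} dict per frame of the clip."""
    weighted = [clip_interest_probability(fp, category_weights) for fp in per_frame_probs]
    return float(np.mean(weighted))  # summed weighted probabilities / frame count

# clip_interest_probability({"cat1": 0.8, "cat2": 0.5}, {"cat1": 0.6, "cat2": 0.2}) -> 0.58
```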
  • S204 Determine a short video corresponding to the target video from the at least one video segment according to the start and end time of the at least one video segment and the interest category probability.
  • S204 is similar to the two implementations of the first possible implementation scenario in S103; the difference is that S103 is based on the probability of the semantic category to which a segment belongs, while S204 is based on the interest category probability. For the specific implementation, refer to S103; details are not repeated here.
  • subsequent operations can also be performed based on the above-mentioned overlapping segments after the KTS segmentation, which will not be repeated here.
  • the interest category probability in S204 comprehensively describes two dimensions of a video clip: its importance and the user's interest in it. Therefore, after sorting, summary video clips can be further selected so as to present video clips that are as important and as interesting to the user as possible.
  • the embodiments of this application further analyze user preferences based on the local database and historical operation records, so that the video clips selected for synthesizing short videos are more targeted and better match user interests, yielding short videos personalized to each user.
  • FIG. 12 shows a schematic structural diagram of a terminal device 100 as the device for generating a short video.
  • the terminal device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the terminal device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the terminal device 100.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the system is improved.
  • the processor 110 may include one or more interfaces.
  • the interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the terminal device 100.
  • the terminal device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
  • the wireless communication function of the terminal device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • the terminal device 100 implements a display function through a GPU, a display screen 194, and an application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device 100 may include one or N display screens 194, and N is a positive integer greater than one.
  • the terminal device 100 can implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
  • the ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
  • ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the camera 193 includes a camera that collects images required for face recognition, such as an infrared camera or other cameras.
  • the camera that collects the image required for face recognition is generally located on the front of the terminal device, for example, above the touch screen, and may also be located at other positions, which is not limited in the embodiment of the present invention.
  • the terminal device 100 may include other cameras.
  • the terminal device may also include a dot matrix transmitter (not shown in the figure) for emitting light.
  • the camera collects the light reflected by the face to obtain a face image, and the processor processes and analyzes the face image, and compares it with the stored face image information for verification.
  • the digital signal processor is used to process digital signals. In addition to digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • through the NPU, applications such as intelligent cognition of the terminal device 100 can be implemented, for example image recognition, face recognition, speech recognition, text understanding, and so on.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by running instructions stored in the internal memory 121.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application required for a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.) and so on.
  • the storage data area can store data created during the use of the terminal device 100 (such as face information template data, fingerprint information template, etc.) and the like.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the terminal device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the speaker 170A also called “speaker” is used to convert audio electrical signals into sound signals.
  • the receiver 170B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • the microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the earphone interface 170D is used to connect wired earphones.
  • the earphone interface 170D may be a USB interface 130, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A may be provided on the display screen 194.
  • the gyro sensor 180B may be used to determine the movement posture of the terminal device 100.
  • the angular velocity of the terminal device 100 around three axes (i.e., the x, y, and z axes) can be determined by the gyro sensor 180B.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the ambient light sensor 180L is used to sense the brightness of the ambient light.
  • the terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived brightness of the ambient light.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, access application lock, fingerprint photography, fingerprint answering calls, and so on.
  • the fingerprint sensor 180H can be arranged under the touch screen; the terminal device 100 can receive a user's touch operation on the touch screen in the area corresponding to the fingerprint sensor, and in response to the touch operation can collect the fingerprint information of the user's finger, so as to realize the fingerprint recognition involved in the embodiments of this application: opening a hidden album after fingerprint verification passes, opening a hidden application after fingerprint verification passes, logging in to an account after fingerprint verification passes, completing a payment after fingerprint verification passes, and so on.
  • the temperature sensor 180J is used to detect temperature.
  • the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K may be disposed on the display screen 194, and the touch screen is composed of the touch sensor 180K and the display screen 194, which is also called a “touch screen”.
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the terminal device 100, which is different from the position of the display screen 194.
  • the button 190 includes a power-on button, a volume button, and so on.
  • the button 190 may be a mechanical button. It can also be a touch button.
  • the terminal device 100 may receive key input, and generate key signal input related to user settings and function control of the terminal device 100.
  • the indicator 192 may be an indicator light, which may be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 195 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the terminal device 100.
  • the terminal device 100 adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal device 100 and cannot be separated from the terminal device 100.
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the terminal device 100 by way of example.
  • FIG. 13 is a block diagram of the software structure of the terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application packages may include applications (also called apps) such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, and a notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the terminal device 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialogue interface.
  • for example, text messages are prompted in the status bar, a prompt tone is sounded, the terminal device vibrates, or the indicator light flashes.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the set of functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • FIG. 14 shows a schematic structural diagram of the server 200 as the device for generating short videos.
  • server 200 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the server 200 may include a processor 210 and a memory 220, and the processor 210 may be connected to the memory 220 through a bus.
  • the processor 210 may include one or more processing units.
  • the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the server 200.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 210 for storing instructions and data.
  • the memory in the processor 210 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 210. If the processor 210 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 210 is reduced, and the efficiency of the system is improved.
  • the processor 210 may include one or more interfaces.
  • the interface may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, and a universal asynchronous transmitter/receiver (universal asynchronous) interface. receiver/transmitter, UART) interface, and/or universal serial bus (universal serial bus, USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the server 200.
  • the server 200 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the server 200 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the server 200 may support one or more video codecs. In this way, the server 200 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • through the NPU, applications such as intelligent cognition of the server 200 can be realized, for example image recognition, face recognition, speech recognition, text understanding, and so on.
  • the memory 220 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 210 executes various functional applications and data processing of the server 200 by running instructions stored in the memory 220.
  • the memory 220 may include a program storage area and a data storage area.
  • the storage program area can store an operating system, at least one application required for a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.) and so on.
  • the storage data area can store data created during the use of the server 200 (such as face information template data, fingerprint information template, etc.) and the like.
  • the memory 220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the aforementioned server 200 may also be a virtualized server, that is, multiple virtualized logical servers run on the server 200, and each logical server can rely on the software, hardware and other components of the server 200 to implement the same data storage and processing functions.
  • FIG. 15 is a schematic structural diagram of a short video generating apparatus 300 in an embodiment of the application.
  • the short video generating apparatus 300 may be applied to the aforementioned terminal device 100 or server 200.
  • the device 300 for generating a short video may include:
  • the video acquisition module 310 is used to acquire the target video
  • the video analysis module 320 is configured to obtain, through semantic analysis, the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs, where each video segment belongs to one or more semantic categories;
  • the short video generation module 330 is configured to generate a short video corresponding to the target video from the at least one video segment according to the start and end time of the at least one video segment and the probability of the semantic category to which it belongs.
  • the target video includes m frames of video images, and the m is a positive integer; the video analysis module 320 is specifically configured to:
  • the probability of the semantic category includes the probability of the behavior category and the probability of the scene category;
  • the target video includes m frames of video images, and the m is a positive integer;
  • the video analysis module 320 is specifically used for:
  • the short video generation module 330 is specifically configured to:
  • Determining the probability of the behavior category of the video clip as the probability of the behavior category of each frame of video image in the video clip
  • the sum of the probability of the behavior category and the probability of the scene category of each frame of the video image in the multi-frame video image is divided by the number of frames to obtain the average category probability of the video segment.
  • a short video corresponding to the target video is generated from the at least one video segment.
  • the weight of the category corresponding to each semantic category is calculated.
  • the program can be stored in a computer-readable storage medium, and when the program is executed, the procedures of the above-mentioned method embodiments may be included.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Abstract

Provided are a method and apparatus for generating a short video, and a related device and medium. The method comprises: acquiring a target video, and obtaining, by means of semantic analysis, the start and end times of at least one video clip in the target video and the probability of the semantic category to which the video clip belongs, wherein each video clip belongs to one or more semantic categories; and then, according to the start and end times of the at least one video clip and the probability of the semantic category to which the clip belongs, generating, from the at least one video clip, a short video corresponding to the target video. Video clips belonging to one or more semantic categories in a target video are identified by means of semantic analysis, so as to directly extract video clips that best reflect the content of the target video and are continuous, and compose them into a short video; in this way, not only is the continuity of content between frames in the target video considered, but the efficiency of generating the short video is also improved.

Description

Method and apparatus for generating a short video, and related device and medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 26, 2020, with application number 202010223607.1 and entitled "Method and apparatus for generating a short video, and related device and medium", the entire content of which is incorporated in this application by reference.
Technical field
This application relates to video processing technology, and in particular to a method and apparatus for generating a short video, and a related device and medium.
Background
With the continuous optimization of the camera effects of terminal devices, the continuous development of new-media social platforms, and the increasing speed of mobile networks, more and more people like to share their daily lives through short videos. Unlike traditional videos, which are long, short videos generally last only a few seconds to a few minutes; they therefore feature low production cost, fast spread, and strong social attributes, and are loved by a large number of users. At the same time, precisely because the duration is limited, the content of a short video must present its key points within a very short time. Therefore, people usually perform operations such as selecting and editing on long videos to generate a short video with highlighted key points.
At present, some professional video editing software can excerpt and splice videos according to user operations, and some applications can directly cut a video clip of a specified duration from a video, for example the first 10 seconds of a 1-minute video, or a 10-second clip arbitrarily selected by the user. However, of these two approaches, one is too cumbersome, requiring users to learn the software and edit by themselves, while the other is too simple and cannot extract all the highlights of the video. Therefore, a smarter way to automatically extract the key segments of a video and generate a short video is needed.
In some prior-art solutions, the importance of each frame of video image in a video is determined by identifying its feature information, and a subset of video images is then selected according to their importance to generate a short video. Although this method generates short videos intelligently, because it recognizes single frames and ignores the association between frames, the content of the short video easily becomes too fragmented and incoherent to express the thread of the video, making it difficult to meet users' actual needs for short video content. On the other hand, a target video contains a large number of redundant video images; identifying every frame one by one and then comparing them to select the important video images for synthesis leads to excessive computation time and reduces the efficiency of short video generation.
Summary
This application provides a method and apparatus for generating a short video, and a related device and medium. The method can be implemented by a short video generation device, such as a smart terminal or a server. A video semantic analysis model identifies video clips with one or more semantic categories in a target video, so as to directly extract continuous video clips that reflect the target video content for synthesizing a short video. This not only takes into account the continuity of the content between frames in the target video and improves the presentation effect of the short video, so that the short video content better meets users' actual needs, but also improves the generation efficiency of the short video.
The following introduces this application from multiple aspects; it is easy to understand that the implementations of these aspects can be cross-referenced with one another.
In a first aspect, this application provides a method for generating a short video. A short video generation device obtains a target video, where the target video includes multiple frames of video images; determines at least one video segment in the target video through semantic analysis; and obtains the start and end times of the at least one video segment and the probability of the semantic category to which it belongs. A video segment includes consecutive frames of video images, the number of frames in a video segment can be equal to or less than the number of frames in the target video, and a video segment belongs to one or more semantic categories, that is, the consecutive frames of video images it includes belong to one or more semantic categories. Then, according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs, a segment for short video generation is selected from the at least one video segment, and the short video is synthesized.
In this technical solution, semantic analysis is used to identify video segments belonging to one or more semantic categories in the target video, so that the continuous segments that best reflect the content of the target video are extracted directly to synthesize a short video; the short video can serve as a video summary, or condensed version, of the target video. This application not only takes into account the coherence of content between frames of the target video, improving the presentation of the short video so that its content better meets users' actual needs, but also improves the efficiency of short video generation.
In a possible implementation of the first aspect, the target video includes m frames of video images, where m is a positive integer. During the semantic analysis, the short video generation apparatus may specifically extract n-dimensional feature data of each frame of video image in the target video, generate an m*n video feature matrix based on the temporal order of the m frames of video images, convert the video feature matrix into multi-layer feature maps, generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature maps, determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start-end time of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs, where n is a positive integer.
In this technical solution, feature extraction converts the target video, which spans the two dimensions of time and space, into a spatial-dimension feature map that can be presented within a single video feature matrix, laying the foundation for the subsequent segmentation and selection of segments of the target video. When candidate boxes are selected, the video feature matrix takes the place of the original image, so the candidate box generation method originally used for image recognition in the spatial domain is applied to the spatio-temporal domain: instead of delineating an object region in an image, a candidate box delineates a continuous semantic feature sequence in the video feature matrix. This achieves the goal of directly identifying the video segments of the target video that contain semantic categories, without frame-by-frame recognition and screening. Compared with existing recurrent network models that chain each frame of video image in time for sequence modeling, this technical solution is simpler, computes faster, and reduces computation time and resource usage.
In a possible implementation of the first aspect, the probability of the semantic category to which a segment belongs includes a probability of a behavior category and a probability of a scene category. The target video includes m frames of video images, where m is a positive integer. During the semantic analysis, the short video generation apparatus obtains the probability of the behavior category and the probability of the scene category in two separate ways. For the probability of the behavior category, it may specifically extract n-dimensional feature data of each frame of video image in the target video, generate an m*n video feature matrix based on the temporal order of the m frames of video images, convert the video feature matrix into multi-layer feature maps, generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature maps, determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start-end time of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs, where n is a positive integer. For the probability of the scene category, the probability of the scene category of each frame of video image in the target video may be identified and output according to the n-dimensional feature data of that frame.
In this technical solution, the recognition paths for the scene category and the behavior category are separated. The probability of the scene category is obtained by conventional single-frame image recognition, which both adds scene categories to the output and lets the model focus on recognizing dynamic behavior categories, exploiting the processing strengths of the different recognition methods to save computation time and improve recognition accuracy.
In a possible implementation of the first aspect, the width of the at least one candidate box generated on the video feature matrix is fixed.
In this technical solution, because the width of the candidate boxes remains fixed, there is no need to continually adjust the search over spatial ranges of different lengths and widths; searching is required only along the length dimension, which saves search time and thus further reduces the model's computation time and resource usage.
In a possible implementation of the first aspect, the short video generation apparatus determines an average category probability of the at least one video segment according to the start-end time and the probability of the behavior category of each video segment, together with the probability of the scene category of each frame of video image in each video segment; and then generates, according to the average category probability of the at least one video segment, a short video corresponding to the target video from the at least one video segment.
In a possible implementation of the first aspect, the short video generation apparatus may calculate the average category probability for each video segment. Specifically, it may determine, according to the start-end time of the video segment, the multiple frames of video images corresponding to the segment and their frame count; take the probability of the behavior category of the video segment as the probability of the behavior category of each frame of video image in the segment; obtain the probability of the scene category of each frame of video image among the multiple frames; and divide the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the frame count to obtain the average category probability of the video segment.
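For illustration only, the following Python sketch shows one way the averaging described above could be computed; the names (segment, frame_rate, scene_probs) are hypothetical and not part of this application.

```python
def average_category_probability(segment, frame_rate, scene_probs):
    """Average category probability of one video segment (a sketch).

    segment: dict with 'start' and 'end' times in seconds and
             'behavior_prob', the segment-level behavior-category probability.
    scene_probs: per-frame scene-category probabilities for the whole
                 target video, indexed by frame number.
    """
    start_frame = int(segment['start'] * frame_rate)
    end_frame = int(segment['end'] * frame_rate)      # exclusive
    num_frames = end_frame - start_frame              # the segment's frame count

    total = 0.0
    for f in range(start_frame, end_frame):
        # each frame inherits the segment's behavior-category probability
        # and contributes its own scene-category probability
        total += segment['behavior_prob'] + scene_probs[f]
    return total / num_frames
```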
In a possible implementation of the first aspect, the short video generation apparatus determines at least one summary video segment from the at least one video segment in descending order of the probability of the semantic category and according to the start-end times, and then obtains the at least one summary video segment and synthesizes the short video corresponding to the target video.
In this technical solution, the probability of the semantic category of a video segment indicates its importance; therefore, screening the at least one video segment based on this probability allows the more important segments to be presented, as far as possible, within the preset duration of the short video.
In a possible implementation of the first aspect, the short video generation apparatus cuts out each video segment from the target video according to its start-end time; sorts and displays the segments in descending order of the probability of the semantic category; when a selection instruction for any one or more video segments is received, determines the selected segments to be summary video segments; and synthesizes, according to the at least one summary video segment, the short video corresponding to the target video.
In this technical solution, by interacting with the user, the segmented video clips are presented to the user in the order of importance reflected by the probability of the semantic category to which they belong; after the user makes a selection based on his or her own interests or needs, the corresponding short video is generated, so that the short video better meets the user's needs.
In a possible implementation of the first aspect, the short video generation apparatus may determine an interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category, and generate, according to the start-end time and the interest category probability of the at least one video segment, a short video corresponding to the target video from the at least one video segment.
In this technical solution, on the basis of ensuring the coherence of the short video content and the efficiency of short video generation, the category weight corresponding to the semantic category is further taken into account, so that the selection of video segments for synthesizing the short video can be more targeted, for example by picking out video segments of one or more specified semantic categories, satisfying more flexible and diverse user needs.
In a possible implementation of the first aspect, the short video generation apparatus may determine, from media data information in a local database and historical operation records, the category weights respectively corresponding to the various semantic categories of the media data.
In this technical solution, user preferences are analyzed from the local database and historical operation records to determine the category weight of each semantic category, so that the selection of video segments for synthesizing the short video better matches the user's interests, producing short videos personalized to each user.
In a possible implementation of the first aspect, when determining the category weight corresponding to each semantic category, the short video generation apparatus may specifically first determine the semantic categories of the videos and images in the local database and count the number of occurrences of each semantic category; then determine the semantic categories of the videos and images the user has operated on in the historical operation records and count the operation duration and operation frequency of each semantic category; and finally calculate the category weight corresponding to each semantic category from its number of occurrences, operation duration, and operation frequency.
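A minimal sketch of such a weight computation is given below, assuming the three statistics are simply normalized and averaged with equal mixing coefficients (the application does not fix a formula, so the coefficients are an assumption). The resulting weight could then multiply a segment's semantic category probability to obtain its interest category probability, for example `interest = seg_prob * weights[category]`.

```python
from collections import Counter, defaultdict

def category_weights(gallery_categories, operation_log):
    """Sketch of per-category weights from a local gallery and usage history.

    gallery_categories: list of semantic-category labels, one per stored
                        video/image (occurrence statistics).
    operation_log: iterable of (category, duration_seconds) records for media
                   the user has operated on; frequency is the record count.
    """
    occurrences = Counter(gallery_categories)
    duration = defaultdict(float)
    frequency = Counter()
    for category, seconds in operation_log:
        duration[category] += seconds
        frequency[category] += 1

    def normalize(stats):
        total = sum(stats.values()) or 1.0
        return {c: v / total for c, v in stats.items()}

    occ, dur, freq = normalize(occurrences), normalize(duration), normalize(frequency)
    categories = set(occ) | set(dur) | set(freq)
    # equal mixing of the three normalized statistics (an assumption)
    return {c: (occ.get(c, 0) + dur.get(c, 0) + freq.get(c, 0)) / 3
            for c in categories}
```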
In a possible implementation of the first aspect, the short video generation apparatus determines at least one summary video segment from the at least one video segment in descending order of interest category probability and according to the start-end times, and then obtains the at least one summary video segment and synthesizes the short video corresponding to the target video.
In this technical solution, the interest category probability of a video segment indicates both its importance and the user's degree of interest in it; therefore, screening the at least one video segment based on the interest category probability allows the segments that are more important and better match the user's interests to be presented, as far as possible, within the preset duration of the short video.
In a possible implementation of the first aspect, the sum of the segment durations of the at least one summary video segment is not greater than a preset short video duration.
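A simple way to realize the selection in the two preceding implementations is the greedy sketch below, where 'score' stands for either the semantic category probability or the interest category probability; the greedy strategy itself is an assumption made for illustration.

```python
def select_summary_segments(segments, max_duration):
    """Greedy pick of summary segments under a total-duration budget (a sketch).

    segments: list of dicts with 'start', 'end' and 'score'. Segments are
    taken in descending score order while their combined length stays within
    max_duration, then re-sorted by start time so the synthesized short
    video plays in chronological order.
    """
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s['score'], reverse=True):
        length = seg['end'] - seg['start']
        if used + length <= max_duration:
            chosen.append(seg)
            used += length
    return sorted(chosen, key=lambda s: s['start'])
```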
In a possible implementation of the first aspect, the short video generation apparatus cuts out each video segment from the target video according to its start-end time; sorts and displays the segments in descending order of interest category probability; when a selection instruction for any one or more video segments is received, determines the selected segments to be summary video segments; and synthesizes, according to the at least one summary video segment, the short video corresponding to the target video.
In this technical solution, by interacting with the user, the segmented video clips are presented to the user in the combined order of importance and interest reflected by the interest category probability; after the user makes a further selection based on his or her current interests or needs, the corresponding short video is generated, so that the short video better meets the user's immediate needs.
In a possible implementation of the first aspect, the short video generation apparatus may further perform time-domain segmentation on the target video to obtain a start-end time of at least one divided segment; determine at least one overlapping segment between the video segments and the divided segments according to the start-end times of the at least one video segment and of the at least one divided segment; and generate the short video corresponding to the target video from the at least one overlapping segment.
In this technical solution, the divided segments obtained by KTS segmentation have high internal content consistency, while the video segments identified by the video semantic analysis model are segments with semantic categories, which indicate their importance. The overlapping segments obtained by combining the two segmentation methods are therefore high in both content consistency and importance, and the combination also corrects the results of the video semantic analysis model, so that the generated short video is more coherent and better meets user needs.
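As a sketch of how the two segmentations could be combined, the following hypothetical helper intersects the semantically labeled segments with the KTS divided segments; only the interval arithmetic is shown, not the KTS algorithm itself.

```python
def overlapping_segments(semantic_segments, kts_segments):
    """Intersect semantic segments with KTS divided segments (a sketch).

    Both inputs are lists of (start, end) times in seconds; the output is
    every non-empty intersection, which inherits content consistency from
    the KTS side and importance from the semantic-analysis side.
    """
    overlaps = []
    for s_start, s_end in semantic_segments:
        for k_start, k_end in kts_segments:
            start, end = max(s_start, k_start), min(s_end, k_end)
            if start < end:                      # keep non-empty overlaps only
                overlaps.append((start, end))
    return overlaps
```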
According to a second aspect, this application provides a short video generation apparatus. The apparatus may include a video acquisition module, a video analysis module, and a short video generation module. In some implementations, the apparatus may further include an information acquisition module and a category weight determination module. Through these modules, the short video generation apparatus implements some or all of the methods provided by any implementation of the first aspect.
According to a third aspect, this application provides a terminal device. The terminal device includes a memory and a processor, where the memory is configured to store computer-readable instructions (also referred to as a computer program), and the processor is configured to read the computer-readable instructions to implement the method provided by any implementation of the first aspect.
According to a fourth aspect, this application provides a server. The server includes a memory and a processor, where the memory is configured to store computer-readable instructions (also referred to as a computer program), and the processor is configured to read the computer-readable instructions to implement the method provided by any implementation of the first aspect.
According to a fifth aspect, this application provides a computer storage medium, which may be non-volatile. The computer storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the method provided by any implementation of the first aspect is implemented.
According to a sixth aspect, this application provides a computer program product. The computer program product contains computer-readable instructions, and when the computer-readable instructions are executed by a processor, the method provided by any implementation of the first aspect is implemented.
Description of the drawings
FIG. 1 is a schematic diagram of an application scenario of a short video generation method according to an embodiment of this application;
FIG. 2 is a schematic diagram of an application environment of a short video generation method according to an embodiment of this application;
FIG. 3 is a schematic diagram of an application environment of another short video generation method according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a short video generation method according to an embodiment of this application;
FIG. 5 is a schematic diagram of a video feature matrix according to an embodiment of this application;
FIG. 6 is a schematic diagram of a model architecture of a video semantic analysis model according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a feature pyramid according to an embodiment of this application;
FIG. 8 is a schematic diagram of the principle of ResNet50 according to an embodiment of this application;
FIG. 9 is a schematic diagram of the principle of a region proposal network according to an embodiment of this application;
FIG. 10 is a schematic diagram of a model architecture of another video semantic analysis model according to an embodiment of this application;
FIG. 11 is a schematic flowchart of another short video generation method according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of a terminal device according to an embodiment of this application;
FIG. 13 is a schematic diagram of a software architecture of a terminal device according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of a server according to an embodiment of this application;
FIG. 15 is a schematic structural diagram of a short video generation apparatus according to an embodiment of this application.
Detailed description of embodiments
To facilitate understanding of the technical solutions of the embodiments of this application, the application scenarios to which the related technologies of this application are applicable are first introduced.
FIG. 1 is a schematic diagram of an application scenario of the short video generation method provided by the embodiments of this application. The technical solution of this application is applicable to scenarios in which short videos are generated from one or more videos and sent to various application platforms for sharing or storage. The relationship between videos and short videos may be one-to-one, many-to-one, one-to-many, or many-to-many; that is, one or more short videos may be generated from one video, or from multiple videos. In fact, the short video generation method is the same in all of these cases, so the embodiments of this application are described using the example of generating one or more short videos corresponding to a single target video.
The application scenario of the embodiments of this application can give rise to various specific business scenarios for different services. For example, in the video sharing scenario of a social application or short video platform, a user may shoot a video, decide to generate a short video from it, and then share the generated short video with friends on the social application or publish it on the platform. In a driving recorder scenario, a short video can be generated from a recorded driving video and uploaded to a traffic police platform. In a storage cleanup scenario, corresponding short videos can be generated for all videos in a storage space and saved in the album, after which the original videos in the storage space are deleted, compressed, or migrated to save storage space. As another example, for video content such as movies, TV series, and documentaries, a user may want to browse the content through a video summary of a few minutes and select the videos of interest to watch; the technical solution of this application is likewise applicable to generating video summaries, or condensed videos, of such content for convenient browsing.
The short video generation method in the embodiments of this application may be implemented by a short video generation apparatus. The short video generation apparatus in the embodiments of this application may be a terminal device or a server.
When implemented by a terminal device, the terminal device should have the functional modules or chips (for example, a video semantic analysis module and a video playback module) that implement the technical solution to generate short videos; an application installed on the terminal device may also call the terminal device's local functional modules or chips to generate short videos.
When implemented by a server, the server should have the functional modules or chips (for example, a video semantic analysis module) that implement the technical solution to generate short videos. The server may be a storage server for storing data, which can use the technical solution of the embodiments of this application to generate short videos from its stored videos as a kind of video summary, and on that basis organize, classify, retrieve, compress, and migrate video data, improving storage space utilization and data retrieval efficiency. The server may also be a server corresponding to a client or web page that has a short video generation function, where the client may be an application installed on a terminal device or a mini program carried on an application, and the web page may be a page running in a browser. In the scenario shown in FIG. 2, after the terminal device obtains a short video generation instruction triggered by the user, it sends the target video to the server corresponding to the client; the server generates the short video and returns it to the terminal device, and the terminal device shares and stores the short video. For example, user A taps a short video generation instruction on the short video client; the terminal device transmits the target video to the background server for short video generation, and after the server generates the short video and returns it, user A can share the short video with user B or store it in the drafts box, the gallery, or another storage space. In the scenario shown in FIG. 3, the user of terminal device A may trigger a short video sharing instruction carrying a target user identifier; besides returning the short video to the terminal device for sharing and storage, the server may also share it directly with terminal device B corresponding to the target user identifier. For example, user A taps a short video sharing instruction carrying the identifier of target user B on the short video client; the client transmits the target video and the identifier of target user B to the server, and after generating the short video, the server may send it directly to terminal device B corresponding to the identifier of target user B, and may also return it to terminal device A. Further, the terminal device may interact with the server in more detail during short video generation; for example, the server may send the segmented video clips to the terminal device, the terminal device sends the video clips or video clip identifiers selected by the user to the server, and the server then generates the short video according to the user's selection. It can therefore be understood that the foregoing implementation scenarios only show, by way of example, some of the scenarios to which the technical solution of this application is applicable.
Based on the foregoing example scenarios, the terminal device in the embodiments of this application may specifically be a mobile phone, a tablet computer, a laptop computer, a vehicle-mounted device, a wearable device, or the like, and the server may specifically be a physical server, a cloud server, or the like.
In this application scenario, generating the short video corresponding to a video involves three stages: video segmentation, video segment selection, and video segment synthesis. Specifically, the terminal device divides the video into multiple meaningful video segments, then selects from them the important segments that can be used to generate a short video, and finally synthesizes the selected segments to obtain the short video corresponding to the video. The technical solution of the embodiments of this application optimizes these three stages.
It should be understood that the application scenarios described in the embodiments of this application are intended to illustrate the technical solutions of the embodiments more clearly and do not constitute a limitation on them; a person of ordinary skill in the art will know that, as new application scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
Refer to FIG. 4, which is a schematic flowchart of a short video generation method provided by an embodiment of this application. The method includes, but is not limited to, the following steps:
S101: Obtain a target video.
In this embodiment of this application, the target video includes multiple frames of video images and is the video used to generate the short video; it can also be understood as the material from which the short video is generated. For ease of subsequent description, m may denote the number of frames of the target video, that is, the target video includes m frames of video images, where m is a positive integer greater than or equal to 1.
Based on the description of the foregoing application scenarios, the target video may be a video just shot by the terminal device, for example, a video shot after the user turns on the shooting function of a social application or short video platform. The target video may also be a historical video stored in a storage space, for example, a video in the media database of a terminal device or server. The target video may further be a video received from another device, for example, the video carried in a short video generation indication message received by a server from a terminal device.
S102: Obtain, through semantic analysis, the start-end time of at least one video segment in the target video and the probability of the semantic category to which it belongs.
In this embodiment of this application, the semantic analysis may be implemented by a machine learning model, referred to in this application as a video semantic analysis model. The video semantic analysis model can implement the video segmentation stage of the three stages in FIG. 1 and provide probability data to support the video segment selection stage. Video segmentation in this embodiment can be understood as video segmentation based on video semantic analysis, whose purpose is to determine the video segments in the target video that belong to one or more semantic categories, where a video segment refers to k consecutive frames of video images and k is a positive integer less than or equal to m. It can be seen that, unlike video segments formed in the prior art by screening and recombining single frames after per-frame recognition, this embodiment directly segments out the video segments with continuous semantics in the target video, which prevents the final short video from being too jumpy, saves synthesis time, and improves the efficiency of short video generation.
Specifically, the video semantic analysis model may have an image feature extraction function that extracts n-dimensional feature data of each frame of video image in the target video, where n is a positive integer. The n-dimensional feature data can reflect the spatial features of a frame of video image; in this embodiment, the specific feature extraction method is not limited, and each dimension of feature data need not correspond to a specific attribute feature. The extraction may target attribute feature dimensions such as RGB parameters, or yield abstract feature data obtained by fusing multiple features extracted by, for example, a neural network. The video semantic analysis model can then generate an m*n video feature matrix based on the temporal order of the m frames of video images included in the target video. The video feature matrix here can be understood as a kind of spatio-temporal feature map, which reflects both the spatial features of each frame of video image and the temporal order of the frames. FIG. 5 shows an exemplary video feature matrix, in which each row represents the n-dimensional feature data of one frame of video image and the rows are arranged in the temporal order of the target video.
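By way of a non-limiting example, the sketch below builds such an m*n matrix with a standard ResNet-50 backbone (its classification head removed) standing in for the unspecified feature extractor, so n is 2048 here; any other extractor would do equally well.

```python
import torch
from torchvision import models

def video_feature_matrix(frames):
    """Stack per-frame pooled CNN features into an m*n matrix (a sketch).

    frames: float tensor of shape (m, 3, H, W), one normalized image per
    video frame, already ordered by time.
    """
    backbone = models.resnet50(weights=None)   # random weights here; a trained
                                               # extractor would be used in practice
    extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
    extractor.eval()
    with torch.no_grad():
        feats = extractor(frames)              # (m, 2048, 1, 1) after global pooling
    return feats.flatten(1)                    # video feature matrix of shape (m, n=2048)
```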
By performing feature extraction on the target video, the target video, which spans the two dimensions of time and space, can be converted into a spatial-dimension feature map presentable within a single video feature matrix, laying the foundation for the subsequent segmentation and selection of segments of the target video. Compared with existing recurrent network models that chain each frame of video image in time for sequence modeling, the video semantic analysis model of this embodiment can thus be designed more simply, so that it computes faster and reduces computation time and resource usage.
The video semantic analysis model can identify at least one corresponding continuous semantic feature sequence from the video feature matrix. A continuous semantic feature sequence is a continuous feature sequence predicted by the video semantic analysis model to belong to one or more semantic categories and may include the feature data of one frame or of multiple consecutive frames. Taking FIG. 5 as an example again, the feature data enclosed by the first box and the second box correspond to continuous semantic feature sequence a and continuous semantic feature sequence b, respectively. A semantic category may be a broad category such as a behavior category, an expression category, an identity category, or a scene category, or a subordinate category within a broad category, for example a ball-playing category or a handshake category within the behavior category. Understandably, semantic categories can be defined according to actual business needs.
Understandably, each continuous semantic feature sequence can correspond to one video segment. For example, continuous semantic feature sequence a in FIG. 5 corresponds to the consecutive video images of the first and second frames of the target video. It can be seen that the implementation scenario of this embodiment mainly concerns content in the time domain, so one output of the video semantic analysis model is the start-end time of the video segment corresponding to a continuous semantic feature sequence. For example, the start-end time of the video segment corresponding to continuous semantic feature sequence a is the start time t1 of the first frame and the end time t2 of the second frame, output as (t1, t2). In addition, when the video semantic analysis model predicts the semantic category of a continuous semantic feature sequence, it actually predicts the probability that the features of the sequence match each semantic category and determines the best-matching category as the semantic category to which the sequence belongs; that category is also associated with a prediction probability. The video semantic analysis model of this embodiment can output this probability, so the model can determine, for each continuous semantic feature sequence, the start-end time of the corresponding video segment together with the probability of the semantic category to which it belongs.
In a possible implementation scenario, the video semantic analysis model may have the model architecture shown in FIG. 6, specifically including a convolutional neural network (CNN) 10, a feature pyramid network (FPN) 20, a sequence proposal network (SPN) 30, and a first fully connected layer 40. S102 is described in detail below with respect to this model architecture.
First, the obtained target video is input into the CNN. A CNN is a common classification network, generally including an input layer, convolutional layers, pooling layers, and fully connected layers. The function of the convolutional layers is to extract features from the input data; after feature extraction by the convolutional layers, the output feature maps are passed to the pooling layers for feature selection and information filtering, and the retained information consists of scale-invariant features, which best express the image. This embodiment uses the feature extraction functions of these two kinds of layers: the output of the pooling layer serves as the n-dimensional feature data of each frame of video image in the target video, and the m*n video feature matrix is generated based on the temporal order of the m frames of video images included in the target video. It should be noted that this embodiment does not limit the specific model structure of the CNN; classic image classification networks such as ResNet, GoogleNet, and MobileNet are all applicable to the technical solution of this embodiment.
Then, the m*n video feature matrix is passed to the FPN. Generally, when a network is used to detect objects, the shallow layers have high resolution and learn the detailed features of the image, while the deep layers have low resolution and learn more semantic features; for this reason, most object detection algorithms use only the top-level features for prediction. However, because one feature point of the deepest feature map maps to a relatively large region of the original image, small objects cannot be detected, resulting in low detection performance. The detailed features of the shallow layers then become especially important. As shown in FIG. 7, an FPN is a network that fuses features across layers: through top-down lateral connections, it combines the high-level features, which have low resolution and rich semantic information, with the low-level features, which have high resolution and weak semantic information, generating feature maps at multiple scales where every level carries rich semantic information, so recognition becomes more accurate. Because the feature maps become smaller toward the top, the structure takes the shape of a pyramid. In this embodiment, the video feature matrix is converted into such multi-layer feature maps.
The implementation principle of the FPN is illustrated with a 50-layer deep residual network (ResNet50) as an example. As shown in FIG. 8, the network first propagates forward from the bottom up, applying 2x-downsampling convolutions to the lower-layer features in turn to obtain four feature maps C2, C3, C4, and C5. Each feature map then undergoes a 1*1 convolution, followed by top-down lateral connections: starting from M5, 2x upsampling is performed and the result is summed with the 1*1 convolution result of C4 to obtain M4; M4 and C3 are fused in the same way, and so on. Finally, M2, M3, M4, and M5 each undergo a 3*3 convolution to obtain P2, P3, P4, and P5, and M5 is downsampled by a factor of 2 to obtain P6. P2, P3, P4, P5, and P6 form the five-level feature pyramid.
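The following PyTorch sketch mirrors this construction (1*1 lateral convolutions, top-down 2x upsampling and summation, 3*3 output convolutions, and P6 obtained by 2x downsampling of M5); the channel counts (256, 512, 1024, 2048) for C2-C5 of ResNet50 are standard but assumed here for illustration.

```python
import torch.nn.functional as F
from torch import nn

class TinyFPN(nn.Module):
    """Top-down feature pyramid over backbone maps C2..C5 (a sketch)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # one 1*1 lateral conv and one 3*3 output conv per backbone level
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        m5 = self.lateral[3](c5)
        m4 = self.lateral[2](c4) + F.interpolate(m5, scale_factor=2)  # 2x up + sum
        m3 = self.lateral[1](c3) + F.interpolate(m4, scale_factor=2)
        m2 = self.lateral[0](c2) + F.interpolate(m3, scale_factor=2)
        p2, p3, p4, p5 = (conv(m) for conv, m in zip(self.output, (m2, m3, m4, m5)))
        p6 = F.max_pool2d(m5, kernel_size=1, stride=2)   # 2x downsampling of M5
        return p2, p3, p4, p5, p6
```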
The feature pyramid is then passed to the SPN. The SPN can generate corresponding candidate boxes for each level's feature map of the feature pyramid, which are used to determine the continuous semantic feature sequences. For a clearer understanding of the SPN, the region proposal network (RPN) is introduced first.
An RPN is a region proposal network, generally used for object detection in images (object detection, face detection, and the like) to determine the specific region of an object in the image. As shown in FIG. 9, feature extraction on an image yields a feature map, which can be understood as a matrix of feature data characterizing the image's features, with one feature point in the feature map representing one piece of feature data. The feature points in the feature map have a one-to-one mapping relationship with the original image; as in FIG. 9, one feature point maps to a small box in the original image, whose specific size depends on the scale ratio between the original image and the feature map. Using the center point of the small box as an anchor point, a set of anchor boxes can be generated; the number of anchor boxes and the aspect ratio of each anchor box can be preset. For example, the three large boxes shown in FIG. 9 are a set of anchor boxes generated according to a preset number and aspect ratios. Understandably, every feature point in the feature map corresponds to such a set of anchor boxes mapped onto the original image, so p*s anchor boxes are mapped onto the original image, where p is the number of feature points in the feature map and s is the number of anchor boxes in a preset set. While determining the anchor boxes, the RPN also performs foreground/background judgment on the image inside each anchor box, obtaining a foreground score and a background score; the anchor boxes ranked highest by foreground score can be screened out as the actual anchor boxes, and the number selected can be set as needed. In this way, useless background content is filtered out and the anchor boxes are concentrated on the regions with more foreground content, facilitating subsequent category recognition. When training the RPN, the training samples are the center positions and the length-width scales of the ground-truth boxes; training drives the gap between the predicted candidate box and the anchor box to approximate, as closely as possible, the gap between the anchor box and the ground-truth box, so that the candidate boxes output by the model become more accurate. Because training is referenced to the gap between the candidate box and the anchor box, when the RPN is applied to extract candidate boxes, it outputs the offsets of the predicted candidate box relative to the anchor box, namely the translation of the center position (t_x, t_y) and the change in the length-width scales (t_w, t_h).
The SPN in this embodiment is basically similar in principle to the RPN. The main difference is that the feature points in each level's feature map of the feature pyramid are mapped not onto an original image but onto the video feature matrix; the candidate boxes are therefore also generated on the video feature matrix, so that a candidate box extracts a feature sequence rather than a region. In addition, a candidate box generated on the video feature matrix carries both temporal and spatial information, and, as noted above, this embodiment mainly concerns content in the time domain. On the video feature matrix, the length represents the time dimension and the width represents the space dimension; only the length of a candidate box matters, not its width. Therefore, when the length and width of the candidate boxes are preset in this embodiment, the width can remain fixed. The SPN thus does not need to continually adjust the search over spatial ranges of different lengths and widths as an RPN does; it searches only along the length dimension, saving search time and further reducing the model's computation time and resource usage. Specifically, the width can be kept equal to the n dimensions of the video feature matrix so that a candidate box encloses all the features, and what is extracted is full-dimensional feature data over various time periods.
For example, if the P2-level feature map is 256*256 and its stride relative to the video feature matrix is 4, a feature point on P2 corresponds to a 4*4 small box generated on the video feature matrix as its anchor point. If four base sequence-length values {50, 100, 200, 400} are set, then, centered on the anchor point, each feature point generates four anchor boxes with lengths {4*50, 4*100, 4*200, 4*400}, and the width of each anchor box is n, so as to enclose the n-dimensional data.
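A sketch of this anchor generation for one pyramid level follows; since the width is fixed to all n feature dimensions, each anchor is represented only by its start and end positions along the time axis. The helper name and defaults are illustrative.

```python
def generate_anchors(num_points, stride, base_lengths=(50, 100, 200, 400)):
    """Anchors on the video feature matrix for one pyramid level (a sketch).

    Each feature point maps to a stride-sized cell on the time axis; around
    its center we place one anchor per base length, scaled by the stride,
    exactly as in the P2 example above (stride 4 -> lengths 4*50 .. 4*400).
    """
    anchors = []
    for i in range(num_points):
        center = (i + 0.5) * stride            # anchor point on the time axis
        for base in base_lengths:
            length = base * stride
            anchors.append((center - length / 2, center + length / 2))
    return anchors
```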
That is to say, in this embodiment, a change in the center position of a candidate box is only an offset along the length direction, and a change in its scale is only an increase or decrease in length. Therefore, the training samples of the SPN can be feature sequences of multiple semantic categories together with the coordinates, along the length dimension, of the center positions of the labeled ground-truth boxes and their length values. Correspondingly, when the SPN is applied to extract candidate boxes, it outputs the offsets of the predicted candidate box relative to the anchor box along the length direction, namely the translation of the center position along the length direction (t_y) and the change in length (t_h). The candidate box is determined from these offsets, thereby selecting, in the video feature matrix, a continuous sequence containing an object, that is, a continuous semantic feature sequence. It should be noted that, apart from the width coordinate not needing to be considered, the training method of the SPN, including the loss function, classification error, and regression error, is similar to that of the RPN and is not described again here.
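The decoding step could look like the sketch below, which borrows the usual RPN-style box-regression parameterization restricted to the length axis (center shifted by t_y times the anchor length, length rescaled by exp(t_h)); the exact parameterization is an assumption, since the text only names the two offsets.

```python
import math

def decode_segment(anchor, t_y, t_h):
    """Apply the predicted offsets (t_y, t_h) to one anchor (a sketch).

    anchor: (t_start, t_end) along the time axis of the video feature matrix.
    The width dimension is left untouched, matching the fixed-width design.
    """
    a_start, a_end = anchor
    a_len = a_end - a_start
    a_center = (a_start + a_end) / 2
    center = a_center + t_y * a_len        # shift along the length direction
    length = a_len * math.exp(t_h)         # rescale the length
    return center - length / 2, center + length / 2
```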
Understandably, every feature point on every level's feature map of the feature pyramid maps to multiple candidate boxes of preset sizes in the video feature matrix. Such a large number of candidate boxes may overlap one another, which would cause many duplicate sequences to be cut out in the end. Therefore, after the candidate boxes are generated, non-maximum suppression (NMS) can further be applied to filter out overlapping redundant candidate boxes, retaining only the most informative ones. NMS screens according to the intersection-over-union (IoU) between overlapping candidate boxes; since NMS is already a common filtering method for candidate or detection boxes, it is not described in detail here.
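Restricted to the time axis, NMS reduces to the one-dimensional sketch below (the 0.7 threshold is an arbitrary illustrative value, not one specified by the application).

```python
def nms_1d(segments, scores, iou_threshold=0.7):
    """Non-maximum suppression over time segments (a sketch).

    segments: list of (start, end); scores: matching confidence per segment.
    Keeps the highest-scoring segment, drops any remaining segment whose
    temporal IoU with it exceeds the threshold, and repeats.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        survivors = []
        for i in order:
            inter = max(0.0, min(segments[best][1], segments[i][1])
                        - max(segments[best][0], segments[i][0]))
            union = ((segments[best][1] - segments[best][0])
                     + (segments[i][1] - segments[i][0]) - inter)
            if inter / union <= iou_threshold:
                survivors.append(i)
        order = survivors
    return [segments[i] for i in kept]
```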
Further, because each level's feature map in the feature pyramid has a different size ratio relative to the video feature matrix, the continuous semantic feature sequences cropped by the candidate boxes can also differ greatly in size, and it would be difficult for the subsequent fully connected layer to resize them all to the same fixed size before classifying them. Therefore, each continuous semantic feature sequence can be mapped, according to its length, to a particular level of the feature pyramid, so that the sizes of the multiple continuous semantic feature sequences are as close as possible. In this embodiment, the larger the continuous semantic feature sequence, the higher the level of the feature map selected for mapping; the smaller the sequence, the lower the level. Specifically, the level d of the feature map onto which a continuous semantic feature sequence is mapped can be calculated by the following formula:
d = [d_0 + log_2(wh/244)]
where d_0 is the initial level; in the embodiment shown in FIG. 8, P2 is the initial level, so d_0 is 2, and w and h are the width and length, respectively, of the continuous semantic feature sequence in the video feature matrix. Understandably, w remains fixed, and the larger h is, the larger d is, so the level of the feature map used for mapping is higher.
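Numerically, the level assignment is a one-liner; the sketch below also clamps the result to the levels P2-P5, which is an assumption not stated in the formula itself.

```python
import math

def pyramid_level(h, w, d0=2, d_max=5):
    """Pyramid level for a sequence of length h and fixed width w (a sketch).

    Implements d = floor(d0 + log2(w*h / 244)) from the description, with
    the result clamped to the available levels (clamping is an assumption).
    """
    d = int(math.floor(d0 + math.log2(w * h / 244)))
    return max(d0, min(d, d_max))
```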
Afterwards, the continuous semantic feature sequences cropped after mapping onto the feature maps of the corresponding layers can be resized and input into the first fully connected layer 40. The first fully connected layer 40 performs semantic classification on each continuous semantic feature sequence and outputs the probability that the video segment corresponding to the sequence belongs to a semantic category; at the same time, it can also output, according to the center and length offsets of the sequence, the start-stop times of the video segment, that is, the start time and the end time.

From the above description it can be seen that, in this embodiment of the present application, the SPN substitutes the video feature matrix for the original image, applying the candidate-box generation method originally used for image recognition in the spatial domain to the spatio-temporal domain, so that the candidate box changes from enclosing an object region in an image to enclosing a time range in a video. The purpose of directly identifying the video segments containing semantic categories in the target video is thus achieved, without the need to recognize and screen frame by frame.

From the above description it can also be seen that the model architecture of FIG. 6 can be used to recognize dynamic continuous semantics, such as dynamic behaviors, expressions, and scenes. For categories such as static scenes, however, there is little difference between frames; if the model architecture of FIG. 6 were still used, computing time would be wasted and the recognition would not be accurate. Two implementation scenarios can therefore be distinguished.

In the first possible implementation scenario, the model of FIG. 6 is used to identify the semantic category of each video segment, where the semantic category may include any one or more of at least one action category, at least one expression category, at least one identity category, at least one dynamic scene, and the like. It can be seen that in this scenario mainly dynamic semantics such as actions, expressions, faces, and dynamic scenes are recognized, so the video semantic analysis model of FIG. 4 can directly yield the probability of the behavior category of at least one video segment. Specifically, a video segment may belong to a single semantic category; for example, the probability that the video segment with start-stop times t1-t2 belongs to the ball-kicking category is 90%. It may also belong to multiple semantic categories; for example, the probability that the video segment with start-stop times t3-t4 belongs to the ball-kicking category is 90%, to the laughing category 80%, and to a certain face 85%; in this case, the probability of the behavior category of the t3-t4 video segment may be the sum of these three probabilities. In this implementation scenario, the video semantic analysis model mainly recognizes dynamic semantic categories, and this model can be used when dynamic semantic categories already match the user's perception of the importance of video segments.
In the second possible implementation scenario, the probability of the semantic category may include the probability of the behavior category and the probability of the scene category. In this scenario, as shown in FIG. 10, another, second fully connected layer 50 can be introduced after the CNN in the video semantic analysis model to identify, from the n-dimensional feature data of each frame of video image in the target video, the probability of the scene category of that frame. The video semantic analysis model can then output the start-stop times of at least one video segment, the probability of its behavior category, and the probability of its scene category. It can be understood that the scene-category probability output by the model in this scenario may be the scene-category probability of each frame corresponding to the start-stop times, or the scene-category probability of every frame of the target video. In this implementation scenario, the recognition paths of the scene category and the behavior category are separated: the scene category, whether static or dynamic, is recognized in the conventional single-frame manner, that is, solely through the CNN 10 and the second fully connected layer 50, while the FPN 20, the SPN 30, and the first fully connected layer 40 concentrate on recognizing dynamic behavior categories. In this way, the processing strengths of each network are exploited; the categories of static scenes are added to the output while computing time is saved and recognition accuracy is improved.
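By way of a non-limiting illustration, the division of labor in this second scenario can be sketched as follows, where action_path stands in for the trained CNN + FPN + SPN + first-fully-connected path and scene_head for the CNN + second-fully-connected path; both names are placeholders rather than actual implementations.

```python
def analyze_video(frame_features, action_path, scene_head):
    """Route the shared per-frame CNN features to the two heads:
    the action path returns segments with start/end times and
    behavior-category probabilities; the scene head returns a
    scene-category probability for every frame."""
    segments = action_path(frame_features)                  # [(start, end, {category: prob}), ...]
    scene_probs = [scene_head(f) for f in frame_features]   # one {scene: prob} dict per frame
    return segments, scene_probs
```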
S103: Generate, according to the start-stop times of the at least one video segment and the probability of its semantic category, a short video corresponding to the target video from the at least one video segment.

According to the start-stop times of the at least one video segment, the short-video generation apparatus can determine the video segments with semantic categories in the target video, then screen out the segments that meet the requirements according to the probability of the semantic category in combination with set screening rules, and finally generate the short video corresponding to the target video. The screening rule may be a preset short-video duration or number of frames, or the user's interest in various semantic categories, among others.

It can be understood that the magnitude of the probability of a video segment's semantic category can represent the diversity and accuracy of the semantic components in the segment. Therefore, in this embodiment of the present application, the probability of the semantic category is used as an indicator of the importance of a video segment, so as to screen out, from the at least one video segment, the segments used to generate the short video. Specifically, the short video is generated in different ways in the different scenarios mentioned above.

In the first implementation scenario described above, the short video can be generated in two ways.

In the first way of the first possible implementation scenario, the short-video generation apparatus may determine at least one summary video segment from the at least one video segment in sequence, according to the order of magnitude of the probabilities of the semantic categories and the start-stop times, then obtain the at least one summary video segment and synthesize the short video corresponding to the target video.
It can be understood that a short video is characterized by its short duration, and there are certain requirements on its length, so the at least one video segment needs to be screened in combination with the short-video duration. In the first way, the short-video generation apparatus can sort the at least one video segment by the magnitude of the probability of its semantic category and then, combining the start-stop times of each segment with the short-video duration, select at least one summary video segment in sequence, such that the sum of the durations of the selected summary segments is not greater than the preset short-video duration. For example, suppose the video semantic analysis model splits out three video segments, sorted by probability as segment C, 135%; segment B, 120%; and segment A, 90%, where segment A is 10 s long, segment B is 5 s, and segment C is 2.5 s. If the preset short-video duration is 10 s, segment C is selected first, then segment B; when segment A is considered last, the sum of the durations would exceed 10 s, so segment A is not selected. Only segments C and B are selected, and the short video is generated from them. Further, transition effects and the like can be added between multiple summary video segments to fill the remaining time of the short-video duration.
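By way of a non-limiting illustration, the selection logic of the example above can be sketched as follows; the (start, end, probability) tuples and the greedy skipping of segments that no longer fit are assumptions consistent with the example.

```python
def pick_summary_segments(segments, max_duration):
    """Walk the segments in descending semantic-category probability and
    keep each one whose duration still fits the remaining budget.

    segments: list of (start, end, prob); a segment's duration is end - start.
    """
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        duration = seg[1] - seg[0]
        if used + duration <= max_duration:
            chosen.append(seg)
            used += duration
    return chosen

# Segments A (10 s), B (5 s), C (2.5 s) with probabilities 0.90 / 1.20 / 1.35:
clips = [(0.0, 10.0, 0.90), (20.0, 25.0, 1.20), (30.0, 32.5, 1.35)]
print(pick_summary_segments(clips, max_duration=10.0))  # keeps C and B, skips A
```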
On the other hand, if the difference between the sum of the durations of the at least one summary video segment and the preset short-video duration does not exceed a preset threshold, the summary video segments may also be trimmed to meet the duration requirement. For example, the short-video generation apparatus may trim the last summary segment in the order, or trim part of every summary segment, finally generating a short video that satisfies the short-video duration. For instance, if segment A in the above example were 3 s long, the last 0.5 s of segment A could be cut, or 0.2 s could be cut from each of the three segments, among other options, to generate a short video within 10 s. Similarly, if segment C in the above example were 11 s long, segment C would also need to be trimmed to meet the short-video duration.

Further, when the short video is generated, the short-video generation apparatus may, according to the start-stop times of the at least one summary video segment, cut the corresponding summary segments out of the target video and then splice them into the short video. Specifically, the at least one summary video segment can be spliced in descending order of the probability of its semantic category, so that the important summary segments are presented in the front part of the short video, highlighting the key points and attracting the user's interest. The segments can also be spliced in their chronological order in the target video, so that the short video follows the real timeline of the target video and restores its original temporal thread.
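By way of a non-limiting illustration, the two splicing orders can be expressed as a single sort, as sketched below with illustrative (start, end, probability) tuples.

```python
def order_for_splicing(summary_segments, by="probability"):
    """Arrange summary segments either in descending probability
    (important clips first) or by start time (the video's real timeline)."""
    if by == "probability":
        return sorted(summary_segments, key=lambda s: s[2], reverse=True)
    return sorted(summary_segments, key=lambda s: s[0])

segs = [(20.0, 25.0, 1.20), (30.0, 32.5, 1.35)]
print(order_for_splicing(segs))                 # by importance
print(order_for_splicing(segs, by="timeline"))  # by original timeline
```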
Besides the above ways, there are other ways of trimming and splicing the summary video segments and adding special effects; the audio and the images of the target video can also be synthesized separately, and subtitle information can be screened according to the start-stop times of the summary video segments and added to the corresponding segments, among other options. Since a variety of prior techniques already exist for these video-editing methods, they are not described in detail in this application.

Based on the above description, it can be seen that the probability of a video segment's semantic category can indicate its importance. Therefore, screening the at least one video segment based on this probability makes it possible to present the more important video segments, as far as possible, within the preset short-video duration.

In the second way of the first implementation scenario, the short-video generation apparatus may cut the video segments out of the target video according to the start-stop times of each segment and display them sorted by the magnitude of the probabilities of their semantic categories. When a selection instruction for any one or more video segments is received, the selected segments are determined to be summary video segments, and the short video corresponding to the target video is synthesized from the at least one summary video segment.

In the second way, the short-video generation apparatus first cuts the video segments out of the target video according to their start-stop times and then presents them to the user sorted by the magnitude of the probabilities of their semantic categories. The user can thus view and select these segments according to his or her interests or preferences and, through selection instructions such as touches or clicks, choose one or more of them as summary video segments, from which the short video is then generated. The method of generating the short video from the summary segments is similar to the first way and is not repeated here. It can be seen that the second way interacts with the user, presenting the split video segments in order of importance; after the user makes a selection based on his or her own interests or needs, the corresponding short video is generated, so that the short video better meets the user's needs.

Optionally, when generating the short video corresponding to the target video from the at least one video segment, the short-video generation apparatus may first obtain topic keywords entered by the user or taken from history records, match the semantic category of the at least one video segment against the topic keywords, determine the video segments whose degree of matching satisfies a threshold as topic video segments, and then generate the short video corresponding to the target video from the at least one topic video segment.

Further optionally, when generating the short video corresponding to the target video from the at least one video segment, the short-video generation apparatus may first perform temporal segmentation on the target video to obtain the start-stop times of at least one split segment, then determine at least one overlapping segment between the video segments and the split segments according to the start-stop times of the at least one video segment and of the at least one split segment, and then generate the short video corresponding to the target video from the at least one overlapping segment.
Specifically, kernel temporal segmentation (KTS) can be performed on the target video. KTS is a kernel-based change-point detection algorithm that detects jump points in a signal by focusing on the consistency of one-dimensional signal features, and it can distinguish whether a signal jump is caused by noise or by a change of content. In this embodiment of the present application, KTS can statistically analyze the feature data of each frame of the input target video and detect the jump points of the signal, thereby dividing video segments of different content and splitting the target video into several non-overlapping split segments, so that the start-stop times of at least one split segment are obtained. Then, combined with the start-stop times of the at least one video segment, at least one overlapping segment between the video segments and the split segments is determined. For example, if the start-stop times of a split segment are t1-t2 and those of a video segment are t1-t3, the overlapping segment is the segment corresponding to t1-t2. Finally, with reference to the two ways of the first possible implementation scenario above, a summary video segment can be determined from the at least one overlapping segment to generate the short video corresponding to the target video.
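By way of a non-limiting illustration, the intersection of the semantically recognized segments with the KTS split segments can be computed as sketched below; the interval representation is an illustrative assumption.

```python
def overlapping_segments(semantic_segs, kts_segs):
    """Intersect each semantic segment with each KTS split segment and
    keep the non-empty overlaps, e.g. (t1, t3) and (t1, t2) overlap on (t1, t2)."""
    overlaps = []
    for s_start, s_end in semantic_segs:
        for k_start, k_end in kts_segs:
            start, end = max(s_start, k_start), min(s_end, k_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

print(overlapping_segments([(1.0, 8.0)], [(0.0, 5.0), (5.0, 12.0)]))
# -> [(1.0, 5.0), (5.0, 8.0)]
```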
It can be seen that the split segments obtained by KTS have high internal content consistency, whereas the video segments identified by the video semantic analysis model are segments with semantic categories, which indicates their importance. The overlapping segments obtained by combining the two segmentation methods are therefore high in both content consistency and importance; at the same time, the results of the video semantic analysis model can be corrected, so the generated short video is more coherent and better meets the user's needs.

In the second possible implementation scenario above, the probability of the semantic category includes the probability of the behavior category and the probability of the scene category. Since the probability of the behavior category applies to a whole video segment, whereas the probability of the scene category applies to each frame of video image within a segment, the two probabilities can first be integrated before the summary video segments are selected. That is, the average category probability of the at least one video segment can first be determined according to the start-stop times and behavior-category probability of each video segment and the scene-category probability of each frame of video image in each segment, and the short video corresponding to the target video can then be generated from the at least one video segment according to the average category probabilities.
Specifically, for each video segment, the short-video generation apparatus can determine the multiple frames of video image corresponding to the segment and their number according to the start-stop times of the segment, and take the behavior-category probability of the segment as the behavior-category probability of each of those frames; that is, the behavior-category probability of each frame of the segment is the same as that of the whole segment. Then, the scene-category probability of each of those frames output by the video semantic analysis model is obtained, and the sum, over the frames corresponding to the segment, of each frame's behavior-category probability and scene-category probability is divided by the number of frames to obtain the average category probability of the segment. In this way, the average category probability of the at least one video segment is finally determined.
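By way of a non-limiting illustration, the averaging described above can be sketched as follows, with illustrative probability values.

```python
def average_category_prob(seg_action_prob, frame_scene_probs):
    """Average category probability of one segment: every frame inherits
    the segment's behavior-category probability, each frame contributes
    its own scene-category probability, and the per-frame sums are
    averaged over the number of frames."""
    n = len(frame_scene_probs)
    total = sum(seg_action_prob + p for p in frame_scene_probs)
    return total / n

# A 4-frame segment with behavior probability 0.9 and per-frame scene probabilities:
print(average_category_prob(0.9, [0.6, 0.7, 0.8, 0.5]))  # -> 1.55
```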
When generating the short video corresponding to the target video from the at least one video segment according to the average category probabilities, the short-video generation apparatus can sort by the magnitude of the average category probability and automatically determine the summary video segments, or the user can designate the summary video segments, after which the short video is synthesized from them. The specific details are similar to the two ways of the first scenario and can be found in the description above, so they are not repeated here. Likewise, in this implementation scenario, subsequent operations can also be performed on the basis of the overlapping segments obtained after the KTS segmentation described above, which is also not repeated here.

Based on the above technical solutions, it can be seen that this embodiment of the present application uses the video semantic analysis model to identify the video segments with one or more semantic categories in the target video, so as to directly extract the continuous video segments that best reflect the content of the target video for synthesizing the short video. This not only takes into account the coherence of the content between frames of the target video, improving the presentation of the short video and making its content better satisfy the user's actual needs, but also improves the efficiency of short-video generation.

Further, in some business scenarios to which the embodiments of this application are applicable (for example, the short-video sharing scenario of social software), the short video can also be generated in combination with the user's interests, so that it better fits the user's preferences. Referring to FIG. 11, FIG. 11 is a schematic flowchart of another short-video generation method provided by an embodiment of this application. The method includes, but is not limited to, the following steps:

S201: Obtain a target video.

S202: Obtain, through semantic analysis, the start-stop times, the semantic category, and the probability of the semantic category of at least one video segment in the target video.

For the specific implementation of S201-S202, refer to the description of S101-S102. The difference is that S102 may output only the probability of the semantic category, whereas S202 outputs both the semantic category itself and its probability; this is not repeated here.

S203: Determine the interest-category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category.
In this embodiment of the present application, each semantic category has a corresponding category weight, which can be used to characterize the user's degree of interest in that category. For example, the more frequently a semantic category appears among the images or videos in the local database, the more images or videos of that category the user has stored, that is, the more interested the user is, and the higher the category weight can be set. As another example, the more often images or videos of a semantic category are viewed in the historical operation records, the more attention the user pays to that category, and a higher category weight can likewise be set. Specifically, the corresponding category weights can be determined in advance for the various semantic categories, and the category weight corresponding to the semantic category of each video segment can then be invoked directly.

In a possible implementation of this embodiment of the present application, the category weights corresponding to the various semantic categories can be determined through the following steps:

Step 1: Obtain the media data information in the local database and the historical operation records.

In this embodiment of the present application, the local database may be a storage space for storing or processing various types of data, or a dedicated database for storing media data (pictures, videos, and the like), such as a gallery. Historical operation records are records generated by the user's operations on data (browsing, moving, editing, and the like), such as local log files. Media data information refers to various kinds of information about data of types such as images and videos; it may include the images and videos themselves, feature information of the images and videos, operation information of the images and videos, and various statistics of the images and videos, among others.

Step 2: Determine, according to the media data information, the category weights corresponding to the various semantic categories of the media data.

In a possible implementation, the short-video generation apparatus can first determine the semantic categories of the videos and images in the local database and count the number of occurrences of each semantic category. It then determines the semantic categories of the videos and images the user has operated on in the local log files and counts the operation duration and operation frequency of each category. Specifically, semantic analysis can be performed on the videos and images included in the local database and on those operated on by the user in the local log files, finally obtaining the semantic category of each image and each video. In the implementation process, the video semantic analysis model mentioned in step S102 can be used to analyze videos to obtain their semantic categories, and an image-recognition model commonly used in the prior art can be used to analyze images to obtain theirs. The number of occurrences, operation duration, and operation frequency of each semantic category are then counted. For example, suppose there are 6 pictures and 4 videos in the gallery: the ball-playing category appears 5 times, the eating category once, and the smiling category twice. It should be noted that the operations here may include browsing, editing, sharing, and other operations; when counting operation duration and operation frequency, statistics can be kept separately for each operation or as totals over all operations. For example, for the ball-playing category, the browsing frequency may be counted as 2 times/day, the editing frequency as 1 time/day, the sharing frequency as 0.5 times/day, the browsing duration as 20 hours, and the editing duration as 40 hours; alternatively, the overall operation frequency of the ball-playing category may be counted as 3.5 times/day and the operation duration as 60 hours. Finally, the category weight corresponding to each semantic category is calculated from its number of occurrences, operation duration, and operation frequency. Specifically, the weight can be calculated with a preset weight formula combining these statistics, where the formula reflects that the larger the number of occurrences, the operation duration, and the operation frequency, the higher the category weight of the semantic category.
Optionally, the following formula can be used to calculate the category weight w_i of any semantic category i:

w_i = count_freq_i / Σ_{j=1..h} count_freq_j + view_freq_i / Σ_{j=1..h} view_freq_j + view_time_i / Σ_{j=1..h} view_time_j + share_freq_i / Σ_{j=1..h} share_freq_j + edit_freq_i / Σ_{j=1..h} edit_freq_j

where count_freq_i, view_freq_i, view_time_i, share_freq_i, and edit_freq_i are, respectively, the number of occurrences, browsing frequency, browsing time, sharing frequency, and editing frequency of semantic category i in the local database and the historical operation records, and each sum runs over all h semantic categories identified in the local database and the historical operation records.

Finally, the category weights W = (w_1, w_2, ..., w_h) of the h semantic categories can be obtained.
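By way of a non-limiting illustration, a weight of this form can be evaluated as sketched below; since the formula is reconstructed here from the surrounding definitions (the published text renders it as an image), the per-statistic-ratio sum, the function name, and all statistic values are illustrative assumptions.

```python
def category_weight(stats_i, totals):
    """Assumed reading of the weight formula: each statistic of category i
    is normalized by its total over all h categories, and the normalized
    terms are summed. stats_i and totals share the same statistic keys."""
    return sum(stats_i[k] / totals[k] for k in stats_i if totals[k] > 0)

stats_play = {"count": 5, "view_freq": 2.0, "view_time": 20.0,
              "share_freq": 0.5, "edit_freq": 1.0}
totals = {"count": 8, "view_freq": 4.0, "view_time": 40.0,
          "share_freq": 1.0, "edit_freq": 2.0}
print(category_weight(stats_play, totals))  # -> 2.625
```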
Specifically, each video segment may have one or more semantic categories. When there is only one (for example, the handshake category), the category weight of that category can be determined, and the product of the category weight and the probability of the category is calculated as the interest-category probability of the segment. When there are multiple semantic categories (for example, the handshake category and the smiling category), the category weight of each can be determined separately, and the products of each category's weight and probability are then summed to obtain the interest-category probability of the segment. For example, suppose the semantic categories of video segment A include category 1 and category 2, the probability of category 1 is P_1, the probability of category 2 is P_2, and the category weights corresponding to category 1 and category 2 are w_1 and w_2, respectively; then the interest-category probability of segment A is P_w = P_1*w_1 + P_2*w_2.
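By way of a non-limiting illustration, the weighted sum P_w = P_1*w_1 + P_2*w_2 from the example of video segment A can be computed as sketched below; the category names and weight values are illustrative.

```python
def interest_probability(category_probs, category_weights):
    """Interest-category probability of a segment: the sum over its
    semantic categories of probability times category weight."""
    return sum(p * category_weights[c] for c, p in category_probs.items())

print(interest_probability({"handshake": 0.9, "smile": 0.8},
                           {"handshake": 0.7, "smile": 0.3}))  # -> 0.87
```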
Further, since there can be many semantic categories and, as mentioned above, they can be grouped into several broad classes, weights can also be set for the broad classes. For example, the smiling, crying, and angry categories can all be regarded as expression or face categories, while the swimming, running, and ball-playing categories can all be regarded as behavior categories; different broad-class weights can then be set specifically for the face class and the behavior class. The specific setting method can be adjusted by the user, or the broad-class weights can be further determined from the local database and historical operation records described above; since the principle is similar, it is not repeated here.
It should be noted that, in the second possible implementation scenario above, the short-video generation apparatus can first determine the category weights corresponding to the scene-category probability and the behavior-category probability of each frame of video image in each video segment, sum the products of the corresponding probabilities and category weights in the manner described above to determine the weighted probability of each frame, and then divide the sum of the weighted probabilities of the frames by the number of frames to obtain the interest-category probability of the segment.
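By way of a non-limiting illustration, the frame-level weighting for the second scenario can be sketched as follows; the per-frame dictionaries mixing behavior and scene categories are an illustrative assumption.

```python
def frame_weighted_interest(frame_probs, weights):
    """Weight each frame's category probabilities, sum them per frame,
    then average the per-frame sums over the segment's frames."""
    per_frame = [sum(p * weights[c] for c, p in frame.items()) for frame in frame_probs]
    return sum(per_frame) / len(per_frame)

frames = [{"kick": 0.9, "lawn": 0.6}, {"kick": 0.9, "lawn": 0.7}]
print(frame_weighted_interest(frames, {"kick": 0.5, "lawn": 0.4}))  # -> 0.71
```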
S204: Determine, according to the start-stop times and interest-category probabilities of the at least one video segment, the short video corresponding to the target video from the at least one video segment.

The specific implementation of S204 is similar to the two ways of the first possible implementation scenario in S103; the difference is that S103 sorts by the probability of the semantic category, whereas S204 sorts by the interest-category probability. For the specific implementation, refer to S103; it is not repeated here. Likewise, in this implementation scenario, subsequent operations can also be performed on the basis of the overlapping segments obtained after the KTS segmentation described above, which is also not repeated here.

Compared with the two ways of S103, the interest-category probability in S204 jointly reflects two dimensions of a video segment, its importance and its degree of interest to the user; therefore, further selecting summary video segments after sorting makes it possible to present, as far as possible, video segments that are both more important and better matched to the user's interests.

Based on the above technical solutions, it can be seen that, while ensuring the coherence of the short-video content and the efficiency of short-video generation, this embodiment of the present application further analyzes the user's preferences from the local database and the historical operation records, so that the video segments selected for synthesizing the short video are more targeted and better match the user's interests, yielding personalized short videos that vary from user to user.
FIG. 12 is a schematic structural diagram of a terminal device 100 serving as the short-video generation apparatus.

It should be understood that the terminal device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal-processing and/or application-specific integrated circuits.

The terminal device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others. The different processing units may be independent devices or may be integrated in one or more processors.

The controller may be the nerve center and command center of the terminal device 100. The controller can generate operation control signals according to instruction operation codes and timing signals, completing the control of fetching and executing instructions.

A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory can hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instruction or data again, it can be called directly from this memory. Repeated accesses are avoided and the waiting time of the processor 110 is reduced, thereby improving the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.

It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present application are only schematic and do not constitute a structural limitation on the terminal device 100. In other embodiments of this application, the terminal device 100 may also adopt interface connection manners different from those in the above embodiment, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger.

The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and so on.

The wireless communication function of the terminal device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.

The terminal device 100 implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform the mathematical and geometric calculations needed for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The terminal device 100 can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.

The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter opens, light is transmitted through the lens to the photosensitive element of the camera, the light signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP for processing, converting it into an image visible to the naked eye. The ISP can also perform algorithmic optimization on the noise, brightness, and skin tone of the image, and can optimize parameters such as the exposure and the color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.

The camera 193 is used to capture still images or videos. An optical image of an object is generated through the lens and projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal and then passes the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In this embodiment of the present invention, the camera 193 includes a camera that collects the images required for face recognition, such as an infrared camera or another camera. The camera that collects the images required for face recognition is generally located on the front of the terminal device, for example above the touchscreen, but may also be located elsewhere, which is not limited in this embodiment of the present invention. In some embodiments, the terminal device 100 may include other cameras. The terminal device may also include a dot-matrix emitter (not shown in the figure) for emitting light. The camera collects the light reflected by a face to obtain a face image, and the processor processes and analyzes the face image and compares it with stored face image information for verification.

The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform and the like on the frequency-point energy.

Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs, so that it can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer pattern between neurons of the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the terminal device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example saving files such as music and videos in the external memory card.

The internal memory 121 can be used to store computer-executable program code, where the executable program code includes instructions. By running the instructions stored in the internal memory 121, the processor 110 executes the various functional applications and data processing of the terminal device 100. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and the applications required by at least one function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, and the like). The data storage area can store data created during use of the terminal device 100 (such as face information template data, fingerprint information templates, and the like). In addition, the internal memory 121 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).

The terminal device 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.

The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal.

The speaker 170A, also called a "horn", is used to convert audio electrical signals into sound signals.

The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.

The microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.

The headset jack 170D is used to connect wired headsets. The headset jack 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 180A is used to sense pressure signals and can convert pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. There are many types of pressure sensors 180A, such as resistive, inductive, and capacitive pressure sensors.

The gyroscope sensor 180B can be used to determine the motion posture of the terminal device 100. In some embodiments, the angular velocities of the terminal device 100 around three axes (namely the x, y, and z axes) can be determined through the gyroscope sensor 180B.

The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared light-emitting diode.

The ambient light sensor 180L is used to sense the brightness of the ambient light. The terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking photos.

The fingerprint sensor 180H is used to collect fingerprints. The terminal device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photographing, fingerprint call answering, and so on. The fingerprint sensor 180H may be arranged under the touchscreen; the terminal device 100 can receive the user's touch operation on the area of the touchscreen corresponding to the fingerprint sensor and, in response to the touch operation, collect the fingerprint information of the user's finger, so as to implement the behaviors involved in the embodiments of this application: opening a hidden album after the fingerprint is verified, opening a hidden application after the fingerprint is verified, logging in to an account after the fingerprint is verified, completing a payment after the fingerprint is verified, and so on.

The temperature sensor 180J is used to detect temperature. In some embodiments, the terminal device 100 executes a temperature processing strategy using the temperature detected by the temperature sensor 180J.

The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be provided on the display screen 194; the touch sensor 180K and the display screen 194 form a touchscreen, also called a "touch screen". The touch sensor 180K is used to detect touch operations acting on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be arranged on the surface of the terminal device 100 at a position different from that of the display screen 194.

The button 190 includes a power button, volume buttons, and so on. The button 190 may be a mechanical button or a touch button. The terminal device 100 can receive button input and generate button signal input related to the user settings and function control of the terminal device 100.

The indicator 192 may be an indicator light and can be used to indicate the charging state and changes in battery level, as well as messages, missed calls, notifications, and so on.

The SIM card interface 195 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the terminal device 100 by being inserted into or pulled out of the SIM card interface 195. In some embodiments, the terminal device 100 uses an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the terminal device 100 and cannot be separated from it.
The software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present invention take an Android system with a layered architecture as an example to describe the software structure of the terminal device 100.
FIG. 13 is a block diagram of the software structure of the terminal device 100 according to an embodiment of this application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 13, the application packages may include applications (also called apps) such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Video, and Messaging.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 13, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and the like.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may be composed of one or more views. For example, a display interface that includes a message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the terminal device 100, for example, management of the call status (including connecting, hanging up, and so on).
The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages and can disappear automatically after a short stay, without user interaction. For example, the notification manager is used to notify of a completed download, a message reminder, and so on. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, such as notifications from applications running in the background, or notifications that appear on the screen in the form of a dialog interface, for example, text prompted in the status bar, a prompt tone, vibration of the terminal device, or a blinking indicator light.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the function library that the Java language needs to call, and the other is the Android core libraries.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine performs functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides the blending of 2D and 3D layers for multiple applications.
The media libraries support playback and recording in a variety of commonly used audio and video formats, as well as still-image files. The media libraries can support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
FIG. 14 is a schematic structural diagram in which the apparatus for generating a short video is a server 200.
It should be understood that the server 200 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.
The server 200 may include a processor 210 and a memory 220, and the processor 210 may be connected to the memory 220 through a bus.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the server 200. The controller can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and execution.
A memory may also be provided in the processor 210 for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache. This memory can store instructions or data that the processor 210 has just used or uses cyclically. If the processor 210 needs to use the instructions or data again, it can call them directly from this memory. Repeated accesses are thereby avoided and the waiting time of the processor 210 is reduced, which improves the efficiency of the system.
In some embodiments, the processor 210 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, and/or a universal serial bus (USB) interface, etc.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of this application are merely schematic and do not constitute a structural limitation on the server 200. In other embodiments of this application, the server 200 may also adopt interface connection modes different from those in the foregoing embodiments, or a combination of multiple interface connection modes.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the server 200 performs frequency selection, the digital signal processor is used to perform a Fourier transform on the frequency-point energy.
The video codec is used to compress or decompress digital video. The server 200 may support one or more video codecs. In this way, the server 200 can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example, the transfer mode between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the server 200, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The memory 220 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 210 executes the various functional applications and data processing of the server 200 by running the instructions stored in the memory 220. The memory 220 may include a program storage area and a data storage area. The program storage area can store an operating system and the applications required by at least one function (such as a face recognition function, a fingerprint recognition function, or a mobile payment function). The data storage area can store data created during the use of the server 200 (such as face information template data and fingerprint information templates). In addition, the memory 220 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
Further, the server 200 may also be a virtualized server, that is, multiple logical servers are virtualized on the server 200, and each logical server can rely on the software, hardware, and other components of the server 200 to implement the same data storage and processing functions.
FIG. 15 is a schematic structural diagram of an apparatus 300 for generating a short video in an embodiment of this application. The apparatus 300 may be applied to the aforementioned terminal device 100 or server 200. The apparatus 300 may include:
a video acquisition module 310, configured to acquire a target video;
a video analysis module 320, configured to obtain, through semantic analysis, the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs, where each video segment belongs to one or more semantic categories; and
a short video generation module 330, configured to generate, from the at least one video segment, a short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs.
In a possible implementation scenario, the target video includes m frames of video images, where m is a positive integer, and the video analysis module 320 is specifically configured to:
extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, where n is a positive integer;
convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map; and
determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs (see the illustrative sketch below).
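For illustration only, the flow of the video analysis module 320 might be sketched as follows in Python with NumPy. Everything here is a hypothetical stand-in: the pooling scheme for the multi-layer feature map, the window geometry for the candidate boxes, and the random scorer that replaces the trained classifier the embodiment assumes. The sketch only shows the shape of the computation, from an m*n feature matrix to scored segments with start/end frames and semantic-category probabilities; each candidate box spans all n feature dimensions, so only its temporal extent varies, matching the fixed-width candidate boxes described below.
    import numpy as np

    def build_feature_pyramid(feats, levels=3):
        # Hypothetical multi-layer feature map: temporal average pooling with
        # stride 2 at each level over the m*n per-frame feature matrix.
        pyramid = [feats]
        for _ in range(levels - 1):
            f = pyramid[-1]
            m = f.shape[0] - f.shape[0] % 2          # drop a trailing odd frame
            pyramid.append(f[:m].reshape(-1, 2, f.shape[1]).mean(axis=1))
        return pyramid

    def candidate_windows(pyramid, base_len=8):
        # One candidate segment [start, end) per feature point of each level;
        # coarser levels produce temporally longer candidates.
        windows = []
        for level, f in enumerate(pyramid):
            stride = 2 ** level
            for i in range(f.shape[0]):
                windows.append((i * stride, i * stride + base_len * stride))
        return windows

    def detect_segments(feats, num_classes=5, top_k=3, seed=0):
        # Score every candidate window and keep the top_k as (start, end, probs).
        rng = np.random.default_rng(seed)
        scored = []
        for start, end in candidate_windows(build_feature_pyramid(feats)):
            end = min(end, feats.shape[0])
            logits = rng.normal(size=num_classes)            # stand-in classifier
            probs = np.exp(logits) / np.exp(logits).sum()    # softmax over categories
            scored.append((float(probs.max()), start, end, probs))
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(s, e, p) for _, s, e, p in scored[:top_k]]

    m, n = 128, 64                                   # 128 frames, 64-dim features
    for start, end, probs in detect_segments(np.random.rand(m, n)):
        print(f"frames [{start}, {end}) -> category probs {np.round(probs, 2)}")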
In a possible implementation scenario, the probability of the semantic category includes a probability of a behavior category and a probability of a scene category; the target video includes m frames of video images, where m is a positive integer; and the video analysis module 320 is specifically configured to:
extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, where n is a positive integer;
convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map;
determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs; and
identify and output the probability of the scene category of each frame of video image in the target video according to the n-dimensional feature data of each frame.
In a possible implementation scenario, the width of the at least one candidate box remains unchanged.
In a possible implementation scenario, the short video generation module 330 is specifically configured to:
determine the average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment; and
generate, from the at least one video segment, the short video corresponding to the target video according to the average category probability of the at least one video segment.
In a possible implementation, the short video generation module 330 is specifically configured to:
for each video segment, determine, according to the start and end times of the video segment, the multiple frames of video images corresponding to the video segment and their number;
determine the behavior-category probability of the video segment as the behavior-category probability of each frame of video image in the video segment;
acquire the scene-category probability of each frame of video image among the multiple frames; and
divide the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the number of frames, to obtain the average category probability of the video segment (a sketch of this computation follows).
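A minimal sketch of the averaging just described, assuming only that the segment-level behavior probability is shared by every frame, as stated above:
    import numpy as np

    def average_category_probability(behavior_prob, scene_probs):
        # Each frame inherits the segment's behavior-category probability, so
        # summing (behavior + scene) over the frames and dividing by the frame
        # count equals behavior_prob + mean(scene_probs).
        scene_probs = np.asarray(scene_probs, dtype=float)
        per_frame = behavior_prob + scene_probs
        return per_frame.sum() / scene_probs.shape[0]

    print(average_category_probability(0.8, [0.6, 0.7, 0.5]))   # -> 1.4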
In a possible implementation scenario, the video analysis module 320 is specifically configured to:
obtain, through semantic analysis, the start and end times, the semantic category, and the probability of the semantic category of at least one video segment in the target video; and
the short video generation module 330 is specifically configured to:
determine the interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category; and
generate, from the at least one video segment, the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment (see the sketch below).
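The text does not fix how probability and weight are combined; one hedged reading is a weighted sum over the clip's semantic categories, as in this hypothetical sketch (the clip probabilities and weights are invented for illustration):
    def interest_probability(category_probs, category_weights):
        # Weighted combination of the clip's semantic-category probabilities
        # with the user-specific category weights (default weight 1.0).
        return sum(p * category_weights.get(c, 1.0)
                   for c, p in category_probs.items())

    clip = {"birthday": 0.7, "indoor": 0.2}       # hypothetical clip probabilities
    weights = {"birthday": 1.5, "indoor": 0.8}    # hypothetical user weights
    print(interest_probability(clip, weights))    # 0.7*1.5 + 0.2*0.8 = 1.21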
In a possible implementation scenario, the apparatus 300 further includes:
an information acquisition module 340, configured to acquire media data information from a local database and historical operation records; and
a category weight determination module 350, configured to determine, according to the media data information, the category weights corresponding to the various semantic categories of the media data.
In a possible implementation, the category weight determination module 350 is specifically configured to:
determine the semantic categories of the videos and images in the local database, and count the number of occurrences of each semantic category;
determine the semantic categories of the videos and images the user has operated on in the historical operation records, and count the operation duration and operation frequency of each semantic category; and
calculate the category weight corresponding to each semantic category according to its number of occurrences, operation duration, and operation frequency (one possible weighting is sketched below).
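The embodiment names the three statistics but not the formula; one plausible choice, shown below purely as an assumption, normalizes each statistic across categories and blends them with fixed coefficients:
    def category_weights(counts, durations, frequencies,
                         alpha=0.4, beta=0.3, gamma=0.3):
        # alpha/beta/gamma are assumed blending coefficients that sum to 1.
        def normalize(stats):
            total = sum(stats.values()) or 1.0
            return {c: v / total for c, v in stats.items()}
        n_cnt, n_dur, n_frq = map(normalize, (counts, durations, frequencies))
        return {c: alpha * n_cnt[c] + beta * n_dur[c] + gamma * n_frq[c]
                for c in counts}

    print(category_weights(
        counts={"travel": 30, "pets": 10},            # occurrences in the gallery
        durations={"travel": 1200.0, "pets": 300.0},  # seconds viewed, from history
        frequencies={"travel": 15, "pets": 5},        # number of user operations
    ))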
In a possible implementation, the short video generation module 330 is specifically configured to:
determine at least one summary video segment from the at least one video segment in turn, according to the descending order of the interest category probabilities of the at least one video segment and their start and end times; and
acquire the at least one summary video segment and synthesize the short video corresponding to the target video.
Optionally, the sum of the segment durations of the at least one summary video segment is not greater than a preset short video duration (a greedy selection satisfying this constraint is sketched below).
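A greedy selection under the optional duration cap might look as follows; this is a sketch, and the segment times and probabilities are invented:
    def pick_summary_segments(segments, max_duration=15.0):
        # segments: list of (start, end, interest_prob). Keep the highest-interest
        # segments whose total length fits max_duration, then restore timeline
        # order so the summary splices chronologically.
        chosen, used = [], 0.0
        for start, end, prob in sorted(segments, key=lambda s: s[2], reverse=True):
            if used + (end - start) <= max_duration:
                chosen.append((start, end, prob))
                used += end - start
        return sorted(chosen)

    clips = [(0.0, 6.0, 0.9), (10.0, 18.0, 0.7), (20.0, 24.0, 0.8), (30.0, 42.0, 0.6)]
    print(pick_summary_segments(clips))   # keeps the 6 s and 4 s segments, in time order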
In a possible implementation, the short video generation module 330 is specifically configured to:
cut out each video segment from the target video according to its start and end times;
sort and display the video segments according to the descending order of the interest category probabilities of the at least one video segment;
when a selection instruction for any one or more of the video segments is received, determine the selected video segment(s) as summary video segment(s); and
synthesize the short video corresponding to the target video from the at least one summary video segment.
In a possible implementation scenario, the short video generation module 330 is specifically configured to:
perform temporal segmentation on the target video to obtain the start and end times of at least one split segment;
determine at least one overlapping segment between each video segment and each split segment according to the start and end times of the at least one video segment and the start and end times of the at least one split segment; and
generate the short video corresponding to the target video from the at least one overlapping segment (see the sketch below).
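The overlap computation is a plain interval intersection; a minimal sketch, assuming segments are given as (start, end) pairs in seconds:
    def overlapping_segments(semantic_segs, split_segs):
        # Intersect each semantically detected segment with each split segment
        # from temporal segmentation; only the overlaps are kept, so every
        # summary clip stays within a single split segment.
        overlaps = []
        for s_start, s_end in semantic_segs:
            for t_start, t_end in split_segs:
                start, end = max(s_start, t_start), min(s_end, t_end)
                if start < end:
                    overlaps.append((start, end))
        return overlaps

    print(overlapping_segments([(2.0, 9.0)], [(0.0, 5.0), (5.0, 12.0)]))
    # -> [(2.0, 5.0), (5.0, 9.0)]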
A person of ordinary skill in the art can understand that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only the preferred embodiments of the present invention, which of course cannot be used to limit the scope of rights of the present invention. Therefore, equivalent changes made according to the claims of the present invention still fall within the scope of the present invention.

Claims (25)

  1. A method for generating a short video, comprising:
    acquiring a target video;
    obtaining, through semantic analysis, start and end times of at least one video segment in the target video and a probability of a semantic category to which it belongs, wherein each video segment belongs to one or more semantic categories; and
    generating, from the at least one video segment, a short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs.
  2. The method according to claim 1, wherein the target video comprises m frames of video images, m being a positive integer, and the obtaining, through semantic analysis, of the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs comprises:
    extracting n-dimensional feature data of each frame of video image in the target video, and generating an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    converting the video feature matrix into a multi-layer feature map, and generating at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map; and
    determining at least one continuous semantic feature sequence according to the candidate boxes, and determining the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs.
  3. The method according to claim 1, wherein the probability of the semantic category comprises a probability of a behavior category and a probability of a scene category, the target video comprises m frames of video images, m being a positive integer, and the obtaining, through semantic analysis, of the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs comprises:
    extracting n-dimensional feature data of each frame of video image in the target video, and generating an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    converting the video feature matrix into a multi-layer feature map, and generating at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map;
    determining at least one continuous semantic feature sequence according to the candidate boxes, and determining the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs; and
    identifying and outputting the probability of the scene category of each frame of video image in the target video according to the n-dimensional feature data of each frame.
  4. The method according to claim 3, wherein the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs comprises:
    determining an average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment; and
    generating, from the at least one video segment, the short video corresponding to the target video according to the average category probability of the at least one video segment.
  5. The method according to claim 4, wherein the determining of the average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment comprises:
    for each video segment, determining, according to the start and end times of the video segment, the multiple frames of video images corresponding to the video segment and their number;
    determining the behavior-category probability of the video segment as the behavior-category probability of each frame of video image in the video segment;
    acquiring the scene-category probability of each frame of video image among the multiple frames; and
    dividing the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the number of frames, to obtain the average category probability of the video segment.
  6. The method according to any one of claims 1-5, wherein the obtaining, through semantic analysis, of the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs comprises:
    obtaining, through semantic analysis, the start and end times, the semantic category, and the probability of the semantic category of at least one video segment in the target video; and
    the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs comprises:
    determining an interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category; and
    generating, from the at least one video segment, the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment.
  7. The method according to claim 6, wherein before the determining of the interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category, the method further comprises:
    acquiring media data information from a local database and historical operation records; and
    determining, according to the media data information, category weights corresponding to the various semantic categories of the media data.
  8. The method according to claim 7, wherein the determining, according to the media data information, of the category weights corresponding to the various semantic categories of the media data comprises:
    determining the semantic categories of the videos and images in the local database, and counting the number of occurrences of each semantic category;
    determining the semantic categories of the videos and images the user has operated on in the historical operation records, and counting the operation duration and operation frequency of each semantic category; and
    calculating the category weight corresponding to each semantic category according to its number of occurrences, operation duration, and operation frequency.
  9. The method according to any one of claims 6-8, wherein the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment comprises:
    determining at least one summary video segment from the at least one video segment in turn, according to the descending order of the interest category probabilities of the at least one video segment and their start and end times; and
    acquiring the at least one summary video segment and synthesizing the short video corresponding to the target video.
  10. The method according to any one of claims 6-8, wherein the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment comprises:
    cutting out each video segment from the target video according to its start and end times;
    sorting and displaying the video segments according to the descending order of the interest category probabilities of the at least one video segment;
    when a selection instruction for any one or more of the video segments is received, determining the selected video segment(s) as summary video segment(s); and
    synthesizing the short video corresponding to the target video from the at least one summary video segment.
  11. The method according to any one of claims 1-10, wherein the generating, from the at least one video segment, of the short video corresponding to the target video comprises:
    performing temporal segmentation on the target video to obtain start and end times of at least one split segment;
    determining at least one overlapping segment between each video segment and each split segment according to the start and end times of the at least one video segment and the start and end times of the at least one split segment; and
    generating the short video corresponding to the target video from the at least one overlapping segment.
  12. An apparatus for generating a short video, comprising:
    a video acquisition module, configured to acquire a target video;
    a video analysis module, configured to obtain, through semantic analysis, start and end times of at least one video segment in the target video and a probability of a semantic category to which it belongs, wherein each video segment belongs to one or more semantic categories; and
    a short video generation module, configured to generate, from the at least one video segment, a short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs.
  13. The apparatus according to claim 12, wherein the target video comprises m frames of video images, m being a positive integer, and the video analysis module is specifically configured to:
    extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map; and
    determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs.
  14. The apparatus according to claim 12, wherein the probability of the semantic category comprises a probability of a behavior category and a probability of a scene category, the target video comprises m frames of video images, m being a positive integer, and the video analysis module is specifically configured to:
    extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map;
    determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs; and
    identify and output the probability of the scene category of each frame of video image in the target video according to the n-dimensional feature data of each frame.
  15. The apparatus according to claim 14, wherein the short video generation module is specifically configured to:
    determine an average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment; and
    generate, from the at least one video segment, the short video corresponding to the target video according to the average category probability of the at least one video segment.
  16. The apparatus according to claim 15, wherein the short video generation module is specifically configured to:
    for each video segment, determine, according to the start and end times of the video segment, the multiple frames of video images corresponding to the video segment and their number;
    determine the behavior-category probability of the video segment as the behavior-category probability of each frame of video image in the video segment;
    acquire the scene-category probability of each frame of video image among the multiple frames; and
    divide the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the number of frames, to obtain the average category probability of the video segment.
  17. The apparatus according to any one of claims 12-16, wherein the video analysis module is specifically configured to:
    obtain, through semantic analysis, the start and end times, the semantic category, and the probability of the semantic category of at least one video segment in the target video; and
    the short video generation module is specifically configured to:
    determine an interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category; and
    generate, from the at least one video segment, the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment.
  18. The apparatus according to claim 17, further comprising:
    an information acquisition module, configured to acquire media data information from a local database and historical operation records; and
    a category weight determination module, configured to determine, according to the media data information, category weights corresponding to the various semantic categories of the media data.
  19. The apparatus according to claim 18, wherein the category weight determination module is specifically configured to:
    determine the semantic categories of the videos and images in the local database, and count the number of occurrences of each semantic category;
    determine the semantic categories of the videos and images the user has operated on in the historical operation records, and count the operation duration and operation frequency of each semantic category; and
    calculate the category weight corresponding to each semantic category according to its number of occurrences, operation duration, and operation frequency.
  20. The apparatus according to any one of claims 17-19, wherein the short video generation module is specifically configured to:
    determine at least one summary video segment from the at least one video segment in turn, according to the descending order of the interest category probabilities of the at least one video segment and their start and end times; and
    acquire the at least one summary video segment and synthesize the short video corresponding to the target video.
  21. The apparatus according to any one of claims 17-19, wherein the short video generation module is specifically configured to:
    cut out each video segment from the target video according to its start and end times;
    sort and display the video segments according to the descending order of the interest category probabilities of the at least one video segment;
    when a selection instruction for any one or more of the video segments is received, determine the selected video segment(s) as summary video segment(s); and
    synthesize the short video corresponding to the target video from the at least one summary video segment.
  22. The apparatus according to any one of claims 12-21, wherein the short video generation module is specifically configured to:
    perform temporal segmentation on the target video to obtain start and end times of at least one split segment;
    determine at least one overlapping segment between each video segment and each split segment according to the start and end times of the at least one video segment and the start and end times of the at least one split segment; and
    generate the short video corresponding to the target video from the at least one overlapping segment.
  23. A terminal device, comprising a memory and a processor, wherein the memory is configured to store computer-readable instructions, and the processor is configured to read the computer-readable instructions and implement the method according to any one of claims 1-11.
  24. A server, comprising a memory and a processor, wherein the memory is configured to store computer-readable instructions, and the processor is configured to read the computer-readable instructions and implement the method according to any one of claims 1-11.
  25. A computer storage medium, storing computer-readable instructions which, when executed by a processor, implement the method according to any one of claims 1-11.
PCT/CN2021/070391 2020-03-26 2021-01-06 Method and apparatus for generating short video, and related device and medium WO2021190078A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010223607.1A CN113453040B (en) 2020-03-26 2020-03-26 Short video generation method and device, related equipment and medium
CN202010223607.1 2020-03-26

Publications (1)

Publication Number Publication Date
WO2021190078A1 true WO2021190078A1 (en) 2021-09-30

Family

ID=77807575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070391 WO2021190078A1 (en) 2020-03-26 2021-01-06 Method and apparatus for generating short video, and related device and medium

Country Status (2)

Country Link
CN (1) CN113453040B (en)
WO (1) WO2021190078A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886957A (en) * 2023-09-05 2023-10-13 深圳市蓝鲸智联科技股份有限公司 Method and system for generating vehicle-mounted short video vlog by one key

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050159956A1 (en) * 1999-09-13 2005-07-21 Microsoft Corporation Annotating programs for automatic summary generation
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Football item detecting system with four-layer structure in sports video and realization method thereof
US20140161351A1 (en) * 2006-04-12 2014-06-12 Google Inc. Method and apparatus for automatically summarizing video
CN105138953A (en) * 2015-07-09 2015-12-09 浙江大学 Method for identifying actions in video based on continuous multi-instance learning
CN106572387A (en) * 2016-11-09 2017-04-19 广州视源电子科技股份有限公司 Video sequence alignment method and video sequence alignment system
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 A kind of video actions detection method based on convolutional neural networks
CN108140032A (en) * 2015-10-28 2018-06-08 英特尔公司 Automatic video frequency is summarized
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN110851621A (en) * 2019-10-31 2020-02-28 中国科学院自动化研究所 Method, device and storage medium for predicting video wonderful level based on knowledge graph

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3161791A4 (en) * 2014-06-24 2018-01-03 Sportlogiq Inc. System and method for visual event description and event analysis
US10311913B1 (en) * 2018-02-22 2019-06-04 Adobe Inc. Summarizing video content based on memorability of the video content
CN110798752B (en) * 2018-08-03 2021-10-15 北京京东尚科信息技术有限公司 Method and system for generating video summary
CN110287374B (en) * 2019-06-14 2023-01-03 天津大学 Self-attention video abstraction method based on distribution consistency

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050159956A1 (en) * 1999-09-13 2005-07-21 Microsoft Corporation Annotating programs for automatic summary generation
US20140161351A1 (en) * 2006-04-12 2014-06-12 Google Inc. Method and apparatus for automatically summarizing video
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Four-layer football event detection system for sports video and implementation method thereof
CN105138953A (en) * 2015-07-09 2015-12-09 浙江大学 Method for identifying actions in video based on continuous multi-instance learning
CN108140032A (en) * 2015-10-28 2018-06-08 英特尔公司 Automatic video summarization
CN106572387A (en) * 2016-11-09 2017-04-19 广州视源电子科技股份有限公司 Video sequence alignment method and system
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 Video action detection method based on convolutional neural networks
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 Video summarization method and system for home-made videos
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 Behavior recognition method, apparatus and storage medium
CN110851621A (en) * 2019-10-31 2020-02-28 中国科学院自动化研究所 Method, device and storage medium for predicting video highlight level based on knowledge graph

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114390365A (en) * 2022-01-04 2022-04-22 京东科技信息技术有限公司 Method and apparatus for generating video information
CN114390365B (en) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 Method and apparatus for generating video information
CN115119050A (en) * 2022-06-30 2022-09-27 北京奇艺世纪科技有限公司 Video editing method and device, electronic equipment and storage medium
CN115119050B (en) * 2022-06-30 2023-12-15 北京奇艺世纪科技有限公司 Video editing method and device, electronic equipment and storage medium
CN116074642A (en) * 2023-03-28 2023-05-05 石家庄铁道大学 Surveillance video condensation method based on a multi-target processing unit
CN116708945A (en) * 2023-04-12 2023-09-05 北京优贝卡科技有限公司 Media editing method, device, equipment and storage medium
CN116708945B (en) * 2023-04-12 2024-04-16 半月谈新媒体科技有限公司 Media editing method, device, equipment and storage medium
CN116843643A (en) * 2023-07-03 2023-10-03 北京语言大学 Video aesthetic quality evaluation data set construction method
CN116843643B (en) * 2023-07-03 2024-01-16 北京语言大学 Video aesthetic quality evaluation data set construction method

Also Published As

Publication number Publication date
CN113453040B (en) 2023-03-10
CN113453040A (en) 2021-09-28

Similar Documents

Publication Title
WO2021190078A1 (en) Method and apparatus for generating short video, and related device and medium
CN111465918B (en) Method for displaying service information in preview interface and electronic equipment
WO2021052414A1 (en) Slow-motion video filming method and electronic device
CN110377204B (en) Method for generating a user avatar and electronic equipment
WO2022100221A1 (en) Retrieval processing method and apparatus, and storage medium
US20220343648A1 (en) Image selection method and electronic device
US20220116497A1 (en) Image Classification Method and Electronic Device
CN113536866A (en) Character tracking display method and electronic equipment
WO2020073317A1 (en) File management method and electronic device
WO2024055797A9 (en) Method for capturing images in video, and electronic device
WO2024055797A1 (en) Method for capturing images in video, and electronic device
WO2023160170A1 (en) Photographing method and electronic device
CN115661941B (en) Gesture recognition method and electronic equipment
CN115661912A (en) Image processing method, model training method, electronic device and readable storage medium
CN115086710B (en) Video playing method, terminal equipment, device, system and storage medium
WO2024067442A1 (en) Data management method and related apparatus
CN116828099B (en) Shooting method, medium and electronic equipment
WO2023246666A1 (en) Search method and electronic device
CN114513575B (en) Method for collection processing and related device
WO2024067129A1 (en) System, song list generation method, and electronic device
WO2024082914A1 (en) Video question answering method and electronic device
CN114697525B (en) Method for determining tracking target and electronic equipment
US20240107092A1 (en) Video playing method and apparatus
WO2024087202A1 (en) Search method and apparatus, model training method and apparatus, and storage medium
CN117112087A (en) Method for sorting desktop cards, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 21774416
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: PCT application non-entry in European phase
Ref document number: 21774416
Country of ref document: EP
Kind code of ref document: A1