CN117319765A - Video processing method, device, computing equipment and computer storage medium - Google Patents

Video processing method, device, computing equipment and computer storage medium

Info

Publication number
CN117319765A
Authority
CN
China
Prior art keywords
video
time point
content
subtitle
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311386741.3A
Other languages
Chinese (zh)
Inventor
汤然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202311386741.3A priority Critical patent/CN117319765A/en
Publication of CN117319765A publication Critical patent/CN117319765A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8549 Creating video summaries, e.g. movie trailer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video processing method, apparatus, computing device and computer storage medium, wherein the method comprises the following steps: acquiring subtitle content of a first video; inputting the subtitle content of the first video into a language understanding model to obtain key subtitle content output by the language understanding model; querying at least one first video segment corresponding to the key subtitle content, and synthesizing at least one second video from the at least one first video segment; detecting a play request for the first video triggered by a user at a target time point during playback of any second video; and determining a continuous playing time point of the first video according to the target time point of the second video, and playing the first video from the continuous playing time point. Because the video clipping relies on text content, the processing cost is greatly reduced; and because a language understanding model is used to understand and segment the video content, the understanding and recognition capability is stronger and the accuracy of the video clipping is higher.

Description

Video processing method, device, computing equipment and computer storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a video processing method, a video processing device, a computing device, and a computer storage medium.
Background
At present there are many short video platforms, and the vertical-screen video stream mode makes it easy for users to become immersed in them. This is also related to the recommendation algorithms of video websites: some of these algorithms keep recommending relatively short videos (roughly 15 seconds), usually climactic or highlight clips, which users readily enjoy. As a result, some short videos reach very high play counts, whereas long videos, whose content unfolds gradually, are often left playing in the background or abandoned partway through by the user, so their play counts are not high. Some video websites therefore clip short videos out of long videos in order to drive traffic to the long videos.
At present, long videos are usually clipped into short videos in the following ways: first, manual editing, whose cost is too high and which is difficult to carry out at scale; second, AI content understanding, in which the pictures of the entire long video are understood and the video is clipped automatically, but this requires substantial computing power, the machine cost is very high, and it relies heavily on the AI's ability to recognize and classify pictures; third, clipping based on bullet-screen comments or audience interaction frequency, which is low in cost but poor in accuracy.
Disclosure of Invention
In view of the foregoing, the present application is directed to a video processing method, apparatus, computing device, and computer storage medium that overcome, or at least partially solve, the foregoing problems.
According to one aspect of the present application, there is provided a video processing method including:
acquiring subtitle content of a first video;
inputting the subtitle content of the first video to a language understanding model to obtain key subtitle content output by the language understanding model;
inquiring at least one first video segment corresponding to the key subtitle content, and synthesizing at least one second video according to the at least one first video segment;
detecting a playing request of a first video triggered by a user at a target time point in the playing process of any second video;
and determining a continuous playing time point of the first video according to the target time point of the second video, and starting to play the first video from the continuous playing time point.
Optionally, the querying at least one first video segment corresponding to the key subtitle content further includes:
inquiring a starting time point and an ending time point corresponding to the key subtitle content according to a subtitle file protocol of the first video;
And determining at least one first video segment according to the starting time point and the ending time point corresponding to the key subtitle content.
Optionally, after the synthesizing results in at least one second video, the method further comprises:
and recording the corresponding relation between the time point of each video frame in any second video and the time point of the same video frame in the first video.
Optionally, the determining the continuous playing time point of the first video according to the target time point of the second video further includes:
and determining the continuous playing time point of the first video according to the corresponding relation between the time point of each video frame in the second video and the time point of the same video frame in the first video, wherein the continuous playing time point of the first video corresponds to the same video frame with the target time point of the second video.
Optionally, the acquiring the subtitle content of the first video further includes:
performing voice recognition processing on a first video to obtain caption content of the first video;
or performing optical character recognition processing on each video frame of the first video to obtain caption content of the first video;
or extracting the subtitle content of the first video from the externally hung subtitle file of the first video.
Optionally, the synthesizing at least one second video according to the at least one first video segment further includes:
screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information;
and splicing the at least one target first video segment to obtain a second video.
Optionally, the composite configuration information includes: the configuration information of the number of fragments; the screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information further comprises:
and screening the corresponding number of target first video clips from the at least one first video clip according to the clip number configuration information.
Optionally, the composite configuration information includes: duration configuration information; the screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information further comprises:
and screening a plurality of target first video clips with the sum of time lengths not exceeding the time length corresponding to the time length configuration information from the at least one first video clip.
According to another aspect of the present application, there is provided a video processing apparatus including:
The acquisition module is used for acquiring the caption content of the first video;
the input module is used for inputting the caption content of the first video to the language understanding model to obtain the key caption content output by the language understanding model;
the synthesizing module is used for inquiring at least one first video segment corresponding to the key subtitle content and synthesizing at least one second video according to the at least one first video segment;
the detection module is used for detecting a play request of the first video triggered by a user at a target time point in the play process of any second video;
and the playing module is used for determining the continuous playing time point of the first video according to the target time point of the second video, and playing the first video from the continuous playing time point.
Optionally, the synthesis module is further configured to: inquiring a starting time point and an ending time point corresponding to the key subtitle content according to a subtitle file protocol of the first video;
and determining at least one first video segment according to the starting time point and the ending time point corresponding to the key subtitle content.
Optionally, the synthesis module is further configured to: and after the synthesis is carried out to obtain at least one second video, recording the corresponding relation between the time point of each video frame in any second video and the time point of the same video frame in the first video.
Optionally, the playing module is further configured to: and determining the continuous playing time point of the first video according to the corresponding relation between the time point of each video frame in the second video and the time point of the same video frame in the first video, wherein the continuous playing time point of the first video corresponds to the same video frame with the target time point of the second video.
Optionally, the acquiring module is further configured to: performing voice recognition processing on a first video to obtain caption content of the first video;
or performing optical character recognition processing on each video frame of the first video to obtain caption content of the first video;
or extracting the subtitle content of the first video from the externally hung subtitle file of the first video.
Optionally, the synthesis module is further configured to: screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information;
and splicing the at least one target first video segment to obtain a second video.
Optionally, the composite configuration information includes: the fragment number configuration information, the composition module is further configured to: and screening the corresponding number of target first video clips from the at least one first video clip according to the clip number configuration information.
Optionally, the composite configuration information includes: duration configuration information, the composition module is further configured to: and screening a plurality of target first video clips with the sum of time lengths not exceeding the time length corresponding to the time length configuration information from the at least one first video clip.
According to yet another aspect of the present application, there is provided a computing device comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the video processing method.
According to still another aspect of the present application, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the video processing method described above.
According to the video processing method, apparatus, computing device and computer storage medium provided by the embodiments of the present application, subtitle content of a first video is acquired; the subtitle content of the first video is input into a language understanding model to obtain key subtitle content output by the language understanding model; at least one first video segment corresponding to the key subtitle content is queried, and at least one second video is synthesized from the at least one first video segment; a play request for the first video, triggered by a user at a target time point during playback of any second video, is detected; and a continuous playing time point of the first video is determined according to the target time point of the second video, and the first video is played from the continuous playing time point. In this way, the subtitle content of the first video is acquired, the acquired subtitles are understood by the language understanding model to obtain the key subtitles, the video content is understood and segmented by means of the key subtitle content to obtain high-energy highlight segments, and the video segments corresponding to the key subtitles are then synthesized into short videos that meet the configuration requirements. Compared with processing methods that demand high computing power and high cost, the video clipping relies on text content, so the processing cost is greatly reduced; and because a language understanding model is used to understand and segment the video content, the understanding and recognition capability is stronger, the accuracy of video clipping is higher, and the clipping efficiency is greatly improved. In addition, if a user becomes highly interested in the content while watching a second video, the user can jump to the first video, which is longer and richer in content, and continue playing from the corresponding point; the short-video form thus supplies more high-quality content and drives traffic to the long video.
The foregoing is merely an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objects, features and advantages of the present application more apparent, detailed embodiments of the present application are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a video processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a video processing method according to another embodiment of the present application;
fig. 3 is a schematic functional structure diagram of a video processing apparatus according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First, terms related to one or more embodiments of the present application will be explained.
Language understanding model: a natural language processing tool driven by artificial intelligence technology. Based on the patterns and statistical regularities seen during pre-training, it can understand and learn human language, and thus helps people acquire information, knowledge and inspiration efficiently and conveniently. Existing language understanding models of high practical value include ChatGPT and the like.
Short video vertical-screen mode: a new form of video content in which content is distributed in units of short, concise stories, each generally no longer than 3 minutes. The videos are presented in a vertical-screen mode supported by the information feed, and users switch between videos by swiping the screen.
External subtitle: a subtitle file stored separately from the video file (the subtitle file and the video file are independent of each other). Ordinary players generally do not load external subtitles, but some software such as the VLC player can play them.
OCR recognition: OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. That is, characters in a paper document are converted optically into a black-and-white dot-matrix image file, and the characters in the image are converted into a text format by recognition software for further editing and processing by word processing software. How to debug or use auxiliary information to improve recognition accuracy is the most important issue in OCR, and the term ICR (Intelligent Character Recognition) arose from it.
ASR: ASR (Automatic Speech Recognition) refers to the technology of converting human speech into text. Its purpose is to enable a computer to understand and process human speech in order to perform various tasks such as speech recognition, speech translation and speech synthesis.
Fig. 1 shows a flowchart of a video processing method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
step S110, acquiring subtitle content of the first video.
With the rapid development of streaming media, video subtitles are becoming more and more important. Video subtitles typically appear at the bottom of the video picture (for example, in uploader videos) and, by function, are generally classified into title captions, annotation captions, explanatory captions (e.g., closed captions such as "laughter" versus open captions), transition captions, dialogue subtitles, and so on. Through subtitles, the rich information contained in a video can be better understood and obtained.
According to how the subtitles are combined with the video, they are generally classified into hard subtitles, soft subtitles and external subtitles. For a video with hard subtitles, the subtitles are always present no matter what player is used, and there is no subtitle drift (i.e., subtitles out of sync with the video), because hard subtitles are burned into the video frames: the subtitle characters are no longer text but part of the image (like a watermark on the video). Soft subtitles, also known as closed captions or subtitle streams, have their subtitle data embedded in the video file as part of the stream. External subtitles are a separate subtitle file; when playing the video, the subtitle file is placed in the same directory as the video and selected in the player, after which the subtitles are shown in the video.
In the prior art, relatively high-energy highlight moments or segments in a (long) video (referred to herein as the first video) are typically extracted based on factors such as the video pictures, bullet-screen comments posted during playback, the viewer interaction rate, or user viewing records (e.g., viewing duration, dragging and fast-forwarding). In the embodiments of the present application, the subtitle content of the video/long video (referred to herein as the first video) is used instead: because subtitle content carries rich information, understanding it makes it possible to locate the key segments and thereby extract short videos of multiple highlight moments. For the various types of subtitles, the subtitle content of the video to be processed therefore first needs to be obtained with a corresponding algorithm or tool. Compared with existing methods for extracting highlights, extracting the subtitle content as in this embodiment does not require huge computing power (the machine cost is relatively low) and is more accurate; and because no user interaction information is needed, the subtitle content of massive videos can be extracted in advance, in batches and offline, in a highly standardized way.
Step S120, inputting the caption content of the first video to the language understanding model to obtain the key caption content output by the language understanding model.
The subtitle content of the first video is input into a language understanding model, which understands the subtitle content and outputs the key subtitle content, from which the highlight-moment segments corresponding to the key subtitle content are obtained. Through generative self-supervised learning, the language understanding model can learn from training data the linguistic patterns hidden in the subtitles and produce results consistent with how users perceive the video and interact with it.
Step S130, at least one first video segment corresponding to the key subtitle content is queried, and at least one second video is synthesized according to the at least one first video segment.
Because the subtitles are highly and uniformly aligned with the pictures and audio of the video (i.e., the video stream, the audio stream and the subtitle stream share one timeline), the corresponding video time points can be determined by querying the key subtitle content. At least one first video segment is then queried according to those time points, and at least one short video (referred to herein as a second video) is synthesized from the at least one first video segment according to the configuration requirements, wherein the correspondence between the time point of each video frame in any second video and the time point of the same video frame in the first video is recorded, which facilitates subsequent source tracing and continued playback.
Step S140, detecting a play request of the first video triggered by the user at a target time point in a play process of any second video.
While watching a second video, if the user is highly interested in the content, the user can jump to the first video, which is longer and richer in content, and continue watching. Specifically, when playback reaches the target time point of the second video, the key frame picture of the first video corresponding to that target time point may be presented to the user as a prompt, guiding the user to choose to jump to playing the first video.
Step S150, determining a continuous playing time point of the first video according to the target time point of the second video, and playing the first video from the continuous playing time point.
To improve the user's viewing experience, after the user chooses to jump to the first video at the target time point, the continuous playing time point of the first video is determined according to the target time point of the second video, and the first video is played from that continuous playing time point, where the continuous playing time point of the first video and the target time point of the second video correspond to the same video frame, so that playback continues seamlessly.
According to the video processing method provided by this embodiment of the present application, subtitle content of a first video is acquired; the subtitle content of the first video is input into a language understanding model to obtain key subtitle content output by the language understanding model; at least one first video segment corresponding to the key subtitle content is queried, and at least one second video is synthesized from the at least one first video segment; a play request for the first video, triggered by a user at a target time point during playback of any second video, is detected; and a continuous playing time point of the first video is determined according to the target time point of the second video, and the first video is played from the continuous playing time point. In this method, the subtitle content of the first video is acquired, the acquired subtitles are understood by the language understanding model to obtain the key subtitles, the video content is understood and segmented by means of the key subtitle content to obtain high-energy highlight segments, and the video segments corresponding to the key subtitles are then synthesized into short videos that meet the configuration requirements. Compared with processing methods that demand high computing power and high cost, the video clipping relies on text content, so the processing cost is greatly reduced; and because a language understanding model is used to understand and segment the video content, the understanding and recognition capability is stronger, the accuracy of video clipping is higher, and the clipping efficiency is greatly improved. In addition, if a user becomes highly interested in the content while watching a second video, the user can jump to the first video, which is longer and richer in content, and continue playing from the corresponding point; the short-video form thus supplies more high-quality content and drives traffic to the long video.
Fig. 2 shows a flowchart of a video processing method according to another embodiment of the present application, as shown in fig. 2, the method includes the following steps:
In step S210, the first video is subjected to speech recognition processing or optical character recognition processing to obtain the subtitle content of the first video, or the subtitle content of the first video is extracted from the external subtitle file of the first video.
Specifically, for hard subtitles, the subtitle content of the first video is obtained through speech recognition processing (such as an ASR speech-to-text service) or optical character recognition processing (such as OCR); for soft subtitles, the subtitle stream of the first video is read directly to obtain the subtitle content; and for external subtitles, the subtitle content of the first video is extracted from the external subtitle file of the first video (such as subtitle files in srt or ass format). The three approaches consume different amounts of computing power, with external subtitles being the cheapest. The extraction method can be selected according to the subtitle form of the original video, and external subtitles, the ASR speech-to-text service, and OCR recognition of embedded subtitles can be chosen progressively according to the computing-power budget, as sketched below.
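The following is a minimal, illustrative Python sketch of that progressive selection. The helpers run_asr() and run_ocr_on_frames() are hypothetical placeholders (the embodiment does not name a specific ASR or OCR implementation), and the .srt parsing is only one possible way of reading an external subtitle file.

```python
import os
import re
from typing import List, Optional, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)

_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def _to_seconds(ts: str) -> float:
    h, m, s, frac = _TIME.match(ts).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(frac) / 1000.0

def parse_srt(path: str) -> List[Cue]:
    """Tiny .srt reader: blocks separated by blank lines, second line holds the times."""
    cues = []
    with open(path, encoding="utf-8") as f:
        for block in f.read().split("\n\n"):
            lines = [ln for ln in block.splitlines() if ln.strip()]
            if len(lines) >= 3 and "-->" in lines[1]:
                start, end = (p.strip() for p in lines[1].split("-->"))
                cues.append((_to_seconds(start), _to_seconds(end), " ".join(lines[2:])))
    return cues

def run_asr(video_path: str) -> List[Cue]:
    # Placeholder for a speech-to-text (ASR) service call; not specified here.
    raise NotImplementedError

def run_ocr_on_frames(video_path: str) -> List[Cue]:
    # Placeholder for sampling frames and running OCR on burned-in (hard) subtitles.
    raise NotImplementedError

def extract_subtitles(video_path: str, external_subtitle_path: Optional[str] = None) -> List[Cue]:
    """Pick the cheapest available extraction path, as described above."""
    if external_subtitle_path and os.path.exists(external_subtitle_path):
        return parse_srt(external_subtitle_path)   # cheapest: external subtitle file
    cues = run_asr(video_path)                     # next: ASR over the audio track
    return cues or run_ocr_on_frames(video_path)   # most expensive: OCR of hard subtitles
```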
Step S220, inputting the subtitle content of the first video to the language understanding model to obtain the key subtitle content output by the language understanding model.
In this embodiment of the present application, currently popular AIGC-based tools are used to recognize the video's embedded or external subtitles, so that the AIGC identifies and classifies the video content through the subtitle content and summarizes the higher-energy highlight moments. Specifically, a currently popular large language understanding model is combined with the video field; the language understanding model may be ChatGPT, or another language understanding model capable of understanding semantics, and so on.
Taking ChatGPT as an example, the acquired subtitle content is passed to the ChatGPT large language understanding model, which understands the subtitle content and then classifies and summarizes the key subtitle content. The key subtitle content is passed to a segmentation service, where adaptive classification-and-understanding algorithms can be used to improve the segmentation accuracy, and the video is then clipped according to the corresponding time points to assemble the short-video content. For example, by integrating a ChatGPT-like system (an AI audio-video assistant), the key subtitle content of a video can be summarized with one click, the summarized key subtitles being a subset of the full subtitles. A minimal sketch of this step follows.
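In the sketch below, ask_llm is an assumed thin wrapper around whatever model is used (ChatGPT or another semantic model), and the model is assumed to answer with a JSON array of line indices; the prompt wording is illustrative, not taken from the embodiment.

```python
import json
from typing import Callable, List, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)

def summarize_key_subtitles(cues: List[Cue], ask_llm: Callable[[str], str]) -> List[Cue]:
    """Ask the language understanding model which subtitle lines are key, and keep only those."""
    numbered = "\n".join(f"{i}: {text}" for i, (_start, _end, text) in enumerate(cues))
    prompt = (
        "Below are the numbered subtitle lines of a video. Reply with a JSON array "
        "containing the indices of the lines that form the key, highlight-worthy content.\n"
        + numbered
    )
    indices = json.loads(ask_llm(prompt))
    # The key subtitles are a subset of the full subtitle list, as noted above.
    return [cues[i] for i in indices if isinstance(i, int) and 0 <= i < len(cues)]
```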
Step S230, inquiring a starting time point and an ending time point corresponding to the key subtitle content according to the subtitle file protocol of the first video; determining at least one first video clip according to a start time point and an end time point corresponding to the key caption content; screening at least one target first video clip from the at least one first video clip according to the synthesis configuration information; and splicing the at least one target first video segment to obtain a second video.
According to the subtitle file protocol, the start time point and end time point corresponding to the video's key subtitles can be obtained. For example, a subtitle in STL format is similar to CSV and uses a three-column structure: start time, end time and subtitle text, with the time code expressed as hours, minutes, seconds and hundredths of a second (hh:mm:ss:ff). At least one first video segment is determined according to the start time point and end time point corresponding to the key subtitle content, for example as in the sketch below.
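A small illustrative sketch of that mapping step: the key subtitle cues (start, end, text) obtained from the subtitle file are turned into candidate first-video segments, with cues whose time ranges sit close together merged so that one highlight is not split into many tiny clips. The 1.5-second merge gap is an assumed tuning value, not part of the embodiment.

```python
from typing import List, Tuple

def cues_to_segments(key_cues: List[Tuple[float, float, str]],
                     merge_gap: float = 1.5) -> List[Tuple[float, float]]:
    """Convert key subtitle cues into (start, end) segments of the first video."""
    segments: List[Tuple[float, float]] = []
    for start, end, _text in sorted(key_cues):
        if segments and start - segments[-1][1] <= merge_gap:
            # Close enough to the previous segment: extend it instead of opening a new one.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments
```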
After the at least one first video segment is obtained, the first video segments can be screened according to preset synthesis configuration information: target first video segments that meet the requirements are selected, and a short video meeting the configuration requirements is synthesized from these segments. Optionally, the synthesis configuration information may include the number of segments and the duration of the short video to be composed; at least one target first video segment is selected from the at least one first video segment, and the at least one target first video segment is spliced to obtain the second video.
In an alternative embodiment, the synthesis configuration information includes segment-count configuration information, and a corresponding number of target first video segments are selected from the at least one first video segment according to the segment-count configuration information. For example, the number of segments may be configured according to the duration, category, popularity and so on of the first video: a first video with a longer duration may be configured with a higher number of segments; a first video with higher popularity may be configured with a higher number of segments; and a plot-driven first video may be configured with a higher number of segments.
In an alternative embodiment, the synthesis configuration information includes duration configuration information, and several target first video segments whose total duration does not exceed the duration specified by the duration configuration information are selected from the at least one first video segment.
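A hedged sketch of this screening step: segments are kept while they satisfy an optional segment-count limit and an optional total-duration limit. The parameter names are assumptions; the embodiment only states that the synthesis configuration may constrain the number of segments and the overall duration.

```python
from typing import List, Optional, Tuple

def screen_segments(segments: List[Tuple[float, float]],
                    max_count: Optional[int] = None,
                    max_total_seconds: Optional[float] = None) -> List[Tuple[float, float]]:
    """Select target first video segments according to the synthesis configuration."""
    selected: List[Tuple[float, float]] = []
    total = 0.0
    for start, end in segments:
        if max_count is not None and len(selected) >= max_count:
            break
        duration = end - start
        if max_total_seconds is not None and total + duration > max_total_seconds:
            continue  # this clip would push the short video past the duration limit
        selected.append((start, end))
        total += duration
    return selected
```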
The method further includes: after the at least one second video is synthesized, recording the correspondence between the time point of each video frame in any second video and the time point of the same video frame in the first video. During or after playback of the short video, playback can then continue directly by jumping, through this correspondence, to the time point of the same content or video frame in the corresponding long video.
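One possible way to record that correspondence, shown here as an assumed bookkeeping structure: while the selected clips are concatenated into the second (short) video, each entry stores where the clip starts and ends on the short-video timeline and where the same frames start in the first (long) video.

```python
from typing import List, Tuple

def build_time_mapping(segments: List[Tuple[float, float]]) -> List[Tuple[float, float, float]]:
    """Return (short_start, short_end, long_start) entries, one per spliced clip."""
    mapping: List[Tuple[float, float, float]] = []
    cursor = 0.0  # running position on the short-video timeline
    for long_start, long_end in segments:
        length = long_end - long_start
        mapping.append((cursor, cursor + length, long_start))
        cursor += length
    return mapping
```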
Step S240 detects a play request of the first video triggered by the user at a target time point in a play process of any second video.
While watching a second video, if the user is highly interested in the content, the user can jump to the first video, which is longer and richer in content, and continue watching. Specifically, when playback reaches the target time point of the second video, the key frame picture of the first video corresponding to that target time point may be presented to the user as a prompt, guiding the user to choose to jump to playing the first video.
Step S250, determining a continuous playing time point of the first video according to the corresponding relationship between the time point of each video frame in the second video and the time point of the same video frame in the first video, wherein the continuous playing time point of the first video corresponds to the same video frame with the target time point of the second video.
To improve the user's viewing experience, after the user chooses to jump to the first video at the target time point, the continuous playing time point of the first video is determined according to the target time point of the second video, and the first video is played from that continuous playing time point, where the continuous playing time point of the first video and the target time point of the second video correspond to the same video frame, so that playback continues seamlessly.
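A minimal lookup sketch, assuming the mapping recorded during synthesis above: given the target time point at which the user asked to jump, it returns the continuous playing time point in the first video so that both point at the same frame.

```python
from typing import List, Tuple

def continue_play_point(mapping: List[Tuple[float, float, float]],
                        target_short_time: float) -> float:
    """Map a target time point in the second video to the continuous playing time point in the first video."""
    for short_start, short_end, long_start in mapping:
        if short_start <= target_short_time < short_end:
            return long_start + (target_short_time - short_start)
    raise ValueError("target time point is not covered by any clip of the second video")
```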
According to the video processing method provided by this embodiment of the present application, a large language understanding model is combined with the video field: the acquired video subtitles are understood to obtain key subtitles, and the video corresponding to the key subtitles is clipped automatically to obtain high-energy highlight moments. That is, by integrating the large language understanding model, the key subtitle content of a video can be summarized, and the topics in the video can be extracted and clipped into highlight segments automatically. Compared with processing methods that demand high computing power and high cost, the video clipping relies on text content, so the processing cost is greatly reduced; and because a language understanding model is used to understand and segment the video content, the understanding and recognition capability is stronger, the accuracy of video clipping is higher, and the clipping efficiency is greatly improved. In addition, if a user becomes highly interested in the content while watching a second video, the user can jump to the first video, which is longer and richer in content, and continue playing from the corresponding point; the short-video form thus supplies more high-quality content and drives traffic to the long video. By configuring the number and duration of the segments contained in the short video, the short-video form further supplies more high-quality content and drives traffic to the long video.
Fig. 3 is a schematic functional structure diagram of a video processing apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus includes an acquisition module 31, an input module 32, a synthesis module 33, a detection module 34, and a playing module 35:
the acquiring module 31 is configured to acquire subtitle content of the first video;
the input module 32 is configured to input the subtitle content of the first video to a language understanding model, so as to obtain the key subtitle content output by the language understanding model;
the synthesizing module 33 is configured to query at least one first video segment corresponding to the key subtitle content, and synthesize at least one second video according to the at least one first video segment;
the detection module 34 is configured to detect a play request of the first video triggered by a user at a target time point in a play process of any second video;
the playing module 35 is configured to determine a continuous playing time point of the first video according to a target time point of the second video, and start playing the first video from the continuous playing time point.
In an alternative way, the synthesis module 33 is further configured to: inquiring a starting time point and an ending time point corresponding to the key subtitle content according to a subtitle file protocol of the first video;
And determining at least one first video segment according to the starting time point and the ending time point corresponding to the key subtitle content.
In an alternative way, the synthesis module 33 is further configured to: and after the synthesis is carried out to obtain at least one second video, recording the corresponding relation between the time point of each video frame in any second video and the time point of the same video frame in the first video.
In an alternative manner, the playing module 35 is further configured to: and determining the continuous playing time point of the first video according to the corresponding relation between the time point of each video frame in the second video and the time point of the same video frame in the first video, wherein the continuous playing time point of the first video corresponds to the same video frame with the target time point of the second video.
In an alternative manner, the obtaining module 31 is further configured to: performing voice recognition processing on a first video to obtain caption content of the first video;
or performing optical character recognition processing on each video frame of the first video to obtain caption content of the first video;
or extracting the subtitle content of the first video from the externally hung subtitle file of the first video.
In an alternative way, the synthesis module 33 is further configured to: screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information;
and splicing the at least one target first video segment to obtain a second video.
In an alternative manner, the composition configuration information includes the configuration information of the number of segments, and the composition module 33 is further configured to: and screening the corresponding number of target first video clips from the at least one first video clip according to the clip number configuration information.
In an alternative manner, the synthesis configuration information includes duration configuration information, and the synthesis module 33 is further configured to: and screening a plurality of target first video clips with the sum of time lengths not exceeding the time length corresponding to the time length configuration information from the at least one first video clip.
In summary, according to the video processing apparatus provided by this embodiment of the present application, subtitle content of a first video is acquired; the subtitle content of the first video is input into a language understanding model to obtain key subtitle content output by the language understanding model; at least one first video segment corresponding to the key subtitle content is queried, and at least one second video is synthesized from the at least one first video segment; a play request for the first video, triggered by a user at a target time point during playback of any second video, is detected; and a continuous playing time point of the first video is determined according to the target time point of the second video, and the first video is played from the continuous playing time point. The apparatus acquires the subtitle content of the first video, has the language understanding model understand the acquired subtitles to obtain the key subtitles, understands and segments the video content by means of the key subtitle content to obtain high-energy highlight segments, and then synthesizes the video segments corresponding to the key subtitles into short videos that meet the configuration requirements. Compared with processing methods that demand high computing power and high cost, the video clipping relies on text content, so the processing cost is greatly reduced; and because a language understanding model is used to understand and segment the video content, the understanding and recognition capability is stronger, the accuracy of video clipping is higher, and the clipping efficiency is greatly improved. In addition, if a user becomes highly interested in the content while watching a second video, the user can jump to the first video, which is longer and richer in content, and continue playing from the corresponding point; the short-video form thus supplies more high-quality content and drives traffic to the long video.
Embodiments of the present application provide a non-volatile computer storage medium storing at least one executable instruction that may perform the video processing method of any of the above-described method embodiments.
FIG. 4 shows a schematic structural diagram of a computing device according to an embodiment of the present application; the specific embodiments of the present application do not limit the specific implementation of the computing device.
As shown in fig. 4, the computing device may include: a processor 402, a communication interface (Communications Interface) 404, a memory 406, and a communication bus 408.
Wherein: processor 402, communication interface 404, and memory 406 communicate with each other via communication bus 408. A communication interface 404 for communicating with network elements of other devices, such as clients or other servers. Processor 402 is configured to execute program 410 and may specifically perform the relevant steps described above in the video processing method embodiment for a computing device.
In particular, program 410 may include program code including computer-operating instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the computing device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
Memory 406 for storing programs 410. Memory 406 may comprise high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Program 410 may be specifically operable to cause processor 402 to:
acquiring subtitle content of a first video;
inputting the subtitle content of the first video to a language understanding model to obtain key subtitle content output by the language understanding model;
inquiring at least one first video segment corresponding to the key subtitle content, and synthesizing at least one second video according to the at least one first video segment;
detecting a playing request of a first video triggered by a user at a target time point in the playing process of any second video;
and determining a continuous playing time point of the first video according to the target time point of the second video, and starting to play the first video from the continuous playing time point.
Optionally, the program 410 causes the processor to:
inquiring a starting time point and an ending time point corresponding to the key subtitle content according to a subtitle file protocol of the first video;
and determining at least one first video segment according to the starting time point and the ending time point corresponding to the key subtitle content.
Optionally, the program 410 causes the processor to:
and after the synthesis is carried out to obtain at least one second video, recording the corresponding relation between the time point of each video frame in any second video and the time point of the same video frame in the first video.
Optionally, the program 410 causes the processor to:
and determining the continuous playing time point of the first video according to the corresponding relation between the time point of each video frame in the second video and the time point of the same video frame in the first video, wherein the continuous playing time point of the first video corresponds to the same video frame with the target time point of the second video.
Optionally, the program 410 causes the processor to:
performing voice recognition processing on a first video to obtain caption content of the first video;
or performing optical character recognition processing on each video frame of the first video to obtain caption content of the first video;
or extracting the subtitle content of the first video from the externally hung subtitle file of the first video.
Optionally, the program 410 causes the processor to:
screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information;
And splicing the at least one target first video segment to obtain a second video.
Optionally, the program 410 causes the processor to:
the synthesis configuration information comprises fragment number configuration information, and a corresponding number of target first video fragments are screened out from the at least one first video fragment according to the fragment number configuration information.
Optionally, the program 410 causes the processor to:
and the synthesis configuration information comprises duration configuration information, and a plurality of target first video clips with the duration sum not exceeding the duration corresponding to the duration configuration information are screened out from the at least one first video clip.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present application are not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and the above description of specific languages is provided for disclosure of preferred embodiments of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of embodiments of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application may also be embodied as an apparatus or device program (e.g., computer program and computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (11)

1. A video processing method, comprising:
acquiring subtitle content of a first video;
inputting the subtitle content of the first video to a language understanding model to obtain key subtitle content output by the language understanding model;
querying at least one first video segment corresponding to the key subtitle content, and synthesizing at least one second video according to the at least one first video segment;
detecting a play request for the first video triggered by a user at a target time point during playback of any second video;
and determining a resume playback time point of the first video according to the target time point of the second video, and starting to play the first video from the resume playback time point.
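The method of claim 1 can be read as a short pipeline. The sketch below is a non-authoritative illustration only: the language understanding model, the segment lookup, and the synthesis step are passed in as assumed callables, since the claim does not fix any particular implementation of them.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    start: float  # start time in the first video, in seconds
    end: float    # end time in the first video, in seconds

def build_summary_video(
    subtitle_text: str,
    extract_key_subtitles: Callable[[str], List[str]],    # language understanding model (assumed)
    query_segments: Callable[[List[str]], List[Segment]],  # segment lookup, cf. claim 2 (assumed)
    synthesize: Callable[[List[Segment]], str],            # splicing step, cf. claims 6-8 (assumed)
) -> str:
    # Subtitle content -> key subtitle content via the language understanding model.
    key_lines = extract_key_subtitles(subtitle_text)
    # Key subtitle content -> first video segments -> synthesized second video.
    segments = query_segments(key_lines)
    return synthesize(segments)
```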
2. The method of claim 1, wherein the querying at least one first video segment corresponding to the key subtitle content further comprises:
querying a start time point and an end time point corresponding to the key subtitle content according to a subtitle file protocol of the first video;
and determining the at least one first video segment according to the start time point and the end time point corresponding to the key subtitle content.
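Claim 2 leaves the subtitle file protocol open. A minimal sketch that assumes SubRip (SRT) as that protocol could look like the following; it returns the start and end time points of every cue whose text contains a key subtitle line.

```python
import re
from typing import List, Tuple

_TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def _to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, _TS.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(srt_text: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) triples for every cue in an SRT document."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            start, end = (part.strip() for part in lines[1].split("-->"))
            cues.append((_to_seconds(start), _to_seconds(end), " ".join(lines[2:])))
    return cues

def segments_for_key_subtitles(srt_text: str, key_lines: List[str]) -> List[Tuple[float, float]]:
    """Start/end time points of the cues that contain a key subtitle line."""
    return [(start, end) for start, end, text in parse_srt(srt_text)
            if any(key in text for key in key_lines)]
```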
3. The method according to claim 1 or 2, wherein after synthesizing the at least one second video, the method further comprises:
recording a correspondence between the time point of each video frame in any second video and the time point of the same video frame in the first video.
4. The method of claim 3, wherein the determining the resume playback time point of the first video according to the target time point of the second video further comprises:
and determining the resume playback time point of the first video according to the correspondence between the time point of each video frame in the second video and the time point of the same video frame in the first video, wherein the resume playback time point of the first video corresponds to the same video frame as the target time point of the second video.
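Claims 3 and 4 amount to keeping a lookup table while splicing and consulting it when the user leaves the second video. The following is a hedged sketch with illustrative names only, not the application's implementation: it records, per spliced segment, where that segment starts in the second video and in the first video, and uses that to map a target time point back to a resume playback time point.

```python
from bisect import bisect_right
from typing import List, Tuple

def build_time_map(segments: List[Tuple[float, float]]) -> List[Tuple[float, float, float]]:
    """For each spliced segment, record (start_in_second, end_in_second, start_in_first)."""
    mapping, cursor = [], 0.0
    for start, end in segments:          # (start, end) are time points in the first video
        duration = end - start
        mapping.append((cursor, cursor + duration, start))
        cursor += duration
    return mapping

def resume_time_point(mapping: List[Tuple[float, float, float]], target_in_second: float) -> float:
    """Map the target time point in the second video to the corresponding time point in the first video."""
    starts = [entry[0] for entry in mapping]
    i = max(bisect_right(starts, target_in_second) - 1, 0)
    seg_start_second, _, seg_start_first = mapping[i]
    return seg_start_first + (target_in_second - seg_start_second)
```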
5. The method of any of claims 1-4, wherein the acquiring the subtitle content of the first video further comprises:
performing speech recognition processing on the first video to obtain the subtitle content of the first video;
or performing optical character recognition processing on each video frame of the first video to obtain the subtitle content of the first video;
or extracting the subtitle content of the first video from an external subtitle file of the first video.
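Of the three alternatives in claim 5, only the external-subtitle-file path can be sketched without an external speech recognition or OCR engine. The snippet below assumes the external subtitle file is stored as a sidecar file next to the first video, which the claim itself does not require.

```python
from pathlib import Path
from typing import Optional

def load_external_subtitles(video_path: str) -> Optional[str]:
    """Return the text of a sidecar subtitle file next to the video, if one exists."""
    video = Path(video_path)
    for suffix in (".srt", ".ass", ".vtt"):   # common sidecar subtitle formats (assumed)
        candidate = video.with_suffix(suffix)
        if candidate.exists():
            return candidate.read_text(encoding="utf-8", errors="replace")
    return None
```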
6. The method of any of claims 1-5, wherein synthesizing at least one second video from the at least one first video segment further comprises:
screening at least one target first video segment from the at least one first video segment according to synthesis configuration information;
and splicing the at least one target first video segment to obtain a second video.
7. The method of claim 6, wherein the synthesis configuration information comprises segment count configuration information; the screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information further comprises:
and screening a corresponding number of target first video segments from the at least one first video segment according to the segment count configuration information.
8. The method of claim 6, wherein the synthesis configuration information comprises duration configuration information; the screening at least one target first video segment from the at least one first video segment according to the synthesis configuration information further comprises:
and screening, from the at least one first video segment, a plurality of target first video segments whose total duration does not exceed the duration corresponding to the duration configuration information.
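Claims 6-8 describe screening the candidate segments against synthesis configuration information before splicing. One possible greedy, order-preserving reading is sketched below; the actual concatenation of the selected segments is left out, since the claims do not fix a splicing mechanism.

```python
from typing import List, Optional, Tuple

def select_target_segments(
    segments: List[Tuple[float, float]],
    max_count: Optional[int] = None,             # segment count configuration (claim 7)
    max_total_duration: Optional[float] = None,  # duration configuration (claim 8)
) -> List[Tuple[float, float]]:
    """Screen target first video segments against the synthesis configuration."""
    selected, total = [], 0.0
    for start, end in segments:
        if max_count is not None and len(selected) >= max_count:
            break
        duration = end - start
        if max_total_duration is not None and total + duration > max_total_duration:
            continue
        selected.append((start, end))
        total += duration
    return selected
```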
9. A video processing apparatus comprising:
an acquisition module configured to acquire subtitle content of a first video;
an input module configured to input the subtitle content of the first video into a language understanding model to obtain key subtitle content output by the language understanding model;
a synthesis module configured to query at least one first video segment corresponding to the key subtitle content and to synthesize at least one second video according to the at least one first video segment;
a detection module configured to detect a play request for the first video triggered by a user at a target time point during playback of any second video;
and a playback module configured to determine a resume playback time point of the first video according to the target time point of the second video, and to start playing the first video from the resume playback time point.
10. A computing device, comprising a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other through the communication bus;
the memory is configured to store at least one executable instruction, wherein the executable instruction causes the processor to perform operations corresponding to the video processing method according to any one of claims 1 to 8.
11. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the video processing method of any one of claims 1-8.
CN202311386741.3A 2023-10-24 2023-10-24 Video processing method, device, computing equipment and computer storage medium Pending CN117319765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311386741.3A CN117319765A (en) 2023-10-24 2023-10-24 Video processing method, device, computing equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311386741.3A CN117319765A (en) 2023-10-24 2023-10-24 Video processing method, device, computing equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN117319765A true CN117319765A (en) 2023-12-29

Family

ID=89281086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311386741.3A Pending CN117319765A (en) 2023-10-24 2023-10-24 Video processing method, device, computing equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN117319765A (en)

Similar Documents

Publication Publication Date Title
WO2019157977A1 (en) Method for labeling performance segment, video playing method and device, and terminal
CA2924065C (en) Content based video content segmentation
CN107920256B (en) Live broadcast data playing method and device and storage medium
US8938393B2 (en) Extended videolens media engine for audio recognition
US7519618B2 (en) Metadata generating apparatus
US8307403B2 (en) Triggerless interactive television
JP5135024B2 (en) Apparatus, method, and program for notifying content scene appearance
CN110740387A (en) bullet screen editing method, intelligent terminal and storage medium
CN108683924B (en) Video processing method and device
CN106021496A (en) Video search method and video search device
EP1648172A1 (en) System and method for embedding multimedia editing information in a multimedia bitstream
JP2008124574A (en) Preference extracting apparatus, preference extracting method and preference extracting program
CN1581951A (en) Information processing apparatus and method
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
CN101553814A (en) Method and apparatus for generating a summary of a video data stream
US20190174174A1 (en) Automatic generation of network pages from extracted media content
TW201225669A (en) System and method for synchronizing with multimedia broadcast program and computer program product thereof
CN107688792B (en) Video translation method and system
CN114845149A (en) Editing method of video clip, video recommendation method, device, equipment and medium
CN112328834A (en) Video association method and device, electronic equipment and storage medium
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
JP2014130536A (en) Information management device, server, and control method
CN117319765A (en) Video processing method, device, computing equipment and computer storage medium
JP2004328478A (en) Abstract generating device and its program
US20150179228A1 (en) Synchronized movie summary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination