CN113992944A - Video cataloging method, device, equipment, system and medium - Google Patents

Video cataloging method, device, equipment, system and medium

Info

Publication number
CN113992944A
CN113992944A (application CN202111265047.7A)
Authority
CN
China
Prior art keywords
video
text
target
cataloging
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111265047.7A
Other languages
Chinese (zh)
Inventor
马先钦
刘宏宇
张佳旭
王璋盛
罗引
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Wenge Technology Co ltd
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd filed Critical Beijing Zhongke Wenge Technology Co ltd
Priority to CN202111265047.7A priority Critical patent/CN113992944A/en
Publication of CN113992944A publication Critical patent/CN113992944A/en
Pending legal-status Critical Current

Classifications

    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/233 Processing of audio elementary streams
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/234336 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • H04N21/439 Processing of audio elementary streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/440236 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H04N21/8455 Structuring of content, e.g. decomposing content into time segments, involving pointers to the content, e.g. pointers to the I-frames of the video stream
    • H04N21/8549 Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a video cataloging method, apparatus, device, system and medium. The video cataloging method comprises the following steps: acquiring video characteristics of a target video; segmenting a target video based on video characteristics of the target video to obtain a plurality of video segments; for each video clip, determining a video tag corresponding to the video clip based on a video text corresponding to the video clip, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag; and cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video. According to the embodiment of the disclosure, an automatic video cataloging method can be provided, and the efficiency of video cataloging is improved.

Description

Video cataloging method, device, equipment, system and medium
Technical Field
The present disclosure relates to the field of intelligent media technologies, and in particular, to a video cataloging method, apparatus, device, system, and medium.
Background
With the rapid development of media convergence technology, the value of video content is increasingly highlighted. To convert video content into digital assets that remain usable over the long term, and to let users quickly and conveniently retrieve the content they need, the video content must be cataloged.
However, the traditional video cataloging method relies on professionals to catalog video content; faced with the massive videos of the all-media era, this not only greatly increases labor cost but also makes timeliness difficult to guarantee.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a video cataloging method, apparatus, device, system, and medium.
In a first aspect, the present disclosure provides a video cataloging method, comprising:
acquiring video characteristics of a target video;
segmenting a target video based on video characteristics of the target video to obtain a plurality of video segments;
for each video clip, determining a video tag corresponding to the video clip based on a video text corresponding to the video clip, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
and cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video.
In a second aspect, the present disclosure provides a video cataloging apparatus, the apparatus comprising:
the video characteristic acquisition module is used for acquiring the video characteristics of the target video;
the target video segmentation module is used for segmenting the target video based on the video characteristics of the target video to obtain a plurality of video segments;
the video tag extraction module is used for determining a video tag corresponding to each video clip based on a video text corresponding to the video clip, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
and the video cataloging module is used for cataloging the target video by utilizing the plurality of video segments and the video label corresponding to each video segment to obtain the cataloging result corresponding to the target video.
In a third aspect, the present disclosure provides a video cataloging apparatus comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the video cataloging method according to the first aspect.
In a fourth aspect, the present disclosure provides a video cataloging system, the system comprising: video cataloging equipment and display equipment;
the video cataloguing equipment is used for acquiring video characteristics of a target video;
segmenting the target video based on the video characteristics of the target video to obtain a plurality of video segments;
for each video segment, determining a video tag corresponding to the video segment based on a video text corresponding to the video segment, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video;
the display equipment is used for receiving target video display operation, and the target video display operation carries a target cataloguing label;
and responding to the video clip target video display operation, and screening out a target video corresponding to the target catalogue label from a plurality of videos.
In a fifth aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the video cataloging method of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the video cataloging method, the video cataloging device, the video cataloging equipment, the video cataloging system and the video cataloging medium, after the video characteristics of the target video are obtained, the target video is segmented based on the video characteristics of the target video to obtain the video fragments, then the video labels corresponding to the video fragments are extracted based on the video texts corresponding to the video fragments aiming at each video fragment, the video texts comprise the first audio texts and the first subtitle texts, so that multi-mode video information can be obtained, the video labels at least comprise semantic labels, and the target video is further cataloged by the video fragments and the video labels corresponding to the video fragments to obtain the cataloging result corresponding to the target video. Therefore, an automatic video cataloging method can be provided, the video cataloging efficiency is improved, if massive videos in the all-media era are faced, a professional is not needed to catalog multimedia contents, therefore, a large amount of labor cost is reduced, the timeliness of the video cataloging can be guaranteed, and the target videos can be accurately cataloged based on the video tags corresponding to the multi-mode video information, so that the accuracy of the video cataloging is improved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a video cataloging method according to an embodiment of the present disclosure;
fig. 2 is a logic diagram of a segmentation method for a target video according to an embodiment of the present disclosure;
fig. 3 is a schematic overall flow chart of a video cataloging method according to an embodiment of the present disclosure;
FIG. 4 is a schematic overall logic diagram of another video cataloging method according to the embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a video cataloging apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a video cataloging apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a video cataloging system according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
With the rapid development of media convergence technology, the value of video content is increasingly highlighted. To convert video content into digital assets that remain usable over the long term, and to let users quickly and conveniently retrieve the content they need, the video content must be cataloged.
Video cataloging can be understood as cataloging a target video on the basis of information indexing. Information indexing is the process of analyzing, selecting and recording the formal and content characteristics of information resources and assigning them retrieval identifiers; organizing the resulting description information in an orderly manner according to certain rules is video cataloging. Because video cataloging is the basis of retrieval, the completeness of a media-asset cataloging system directly determines how accurately and conveniently programs can be queried.
In the traditional cataloging method, a target video is segmented manually, and the video segments obtained by segmentation are labeled and classified. Manual cataloging places high demands on catalogers: they must be trained for the work, the quality of the cataloging result depends on their experience, and the accuracy and efficiency of manual cataloging are difficult to guarantee. The traditional cataloging method therefore relies on professionals to catalog video content; faced with the massive videos of the all-media era, this not only greatly increases labor cost but also makes timeliness difficult to guarantee. An automated video cataloging method is therefore urgently needed.
Current automated video cataloging methods generally catalog a target video using single-modality information. For example, shot edge detection is performed on the target video by using a video shot segmentation technique from computer vision, and the target video is segmented into a plurality of video segments according to the detection result; however, this segments the target video too finely, which increases the difficulty of splicing the many video segments later. A method that catalogs the target video with single-modality information therefore hardly covers the complete workflow that cataloging requires; it only assists manual cataloging to a certain extent and cannot realize a fully automated cataloging process.
In order to solve the above problem, embodiments of the present disclosure provide an automated video cataloging method, apparatus, device, system, and storage medium.
First, a video cataloging method provided by the embodiment of the present disclosure is described with reference to fig. 1 to 4.
Fig. 1 shows a schematic flow chart of a video cataloging method provided by the embodiment of the present disclosure.
In some embodiments of the present disclosure, the video cataloging method illustrated in fig. 1 may be performed by a video cataloging apparatus. The video cataloging device may be an electronic device or a server. The electronic device may include, but is not limited to, a mobile terminal such as a smart phone, a notebook computer, a Personal Digital Assistant (PDA), a PAD, a Portable Multimedia Player (PMP), a vehicle mounted terminal (e.g., a car navigation terminal), a wearable device, etc., and a stationary terminal such as a digital TV, a desktop computer, a smart home device, etc. The server may be a cloud server or a server cluster or other devices with storage and computing functions.
As shown in fig. 1, the video cataloging method may include the following steps.
And S110, acquiring video characteristics of the target video.
In the embodiment of the disclosure, after the video cataloging device acquires the target video, the video cataloging device may perform feature identification on the target video to obtain the video features of the target video.
In the disclosed embodiment, the target video may be any video that needs to be cataloged.
Optionally, the target video may be, for example, a news video, a variety-show video, or another type of video, which is not limited herein.
Alternatively, the video features may include video subtitle text and video audio text.
Specifically, after the video cataloging device acquires the target video, it can extract visual modality information from the target video through a shot edge detection technique and use that information as shot data, which serves as the minimum unit of video segmentation. At the same time, video frames are extracted from the target video to obtain the video data of the target video. In each group of image frames formed by the shot data and the video data corresponding to the shot data, the regions where no subtitles appear are masked, so as to remove information that is irrelevant to the video scene in each group of image frames, and the video subtitle text in each group of image frames is then recognized based on a character recognition technique.
The shot data may include video features of all shot scenes of the target video, among other things.
The video data may be image data corresponding to each video frame extracted from the target video.
The video subtitle text may include subtitles for groups of image frames of the target video.
Optionally, by using a shot edge detection technology, visual modality information is extracted from a target video, and the method may be implemented as follows:
s_v = Detect_s(vid)
where s_v is the shot data, vid is the target video, and Detect_s(·) is the shot edge detection process.
Optionally, the removing of the information of the regions in each group of image frames that are not related to the video scene may be implemented by:
img_ocr = mask(img)
where img is a group of image frames, img_ocr is the processed group of image frames (which contains the subtitle data), and mask(·) is the process of masking the non-subtitle regions in each group of image frames.
Alternatively, the Character Recognition technology may be an Optical Character Recognition technology (OCR), which is not limited herein.
Optionally, identifying the video subtitle text in each group of image frames through the OCR technology may be implemented as follows:
text_ocr = M_OCR(img_ocr)
where text_ocr is the video subtitle text and M_OCR(·) is the OCR-based subtitle recognition process.
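For illustration only, the masking and subtitle recognition steps above might be sketched as follows in Python, using OpenCV and pytesseract as stand-ins for mask(·) and M_OCR(·); the fixed subtitle region, the language setting and the helper names are assumptions made for this sketch and are not specified by the patent.

```python
import cv2
import numpy as np
import pytesseract


def mask_non_subtitle(frame: np.ndarray, subtitle_box: tuple) -> np.ndarray:
    """Black out everything outside an assumed subtitle region (the mask(*) step)."""
    x, y, w, h = subtitle_box
    masked = np.zeros_like(frame)
    masked[y:y + h, x:x + w] = frame[y:y + h, x:x + w]
    return masked


def recognize_subtitles(frames: list, subtitle_box: tuple) -> list:
    """Apply masking and then OCR (the M_OCR(*) step) to each frame of a group."""
    texts = []
    for frame in frames:
        masked = mask_non_subtitle(frame, subtitle_box)
        gray = cv2.cvtColor(masked, cv2.COLOR_BGR2GRAY)
        text = pytesseract.image_to_string(gray, lang="chi_sim").strip()
        if text:
            texts.append(text)
    return texts
```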
Specifically, after the video cataloging device acquires the target video, it can convert the target video into audio data through an audio/video conversion technique and input the audio data into a speech recognition model, thereby obtaining the video audio text.
The audio data may include, among other things, audio information for all audio frames of the target video.
Optionally, converting the target video into audio data by an audio/video conversion technique may be implemented as follows:
aud = Trans(vid)
where aud is the audio data and Trans(·) is the audio/video conversion process.
Optionally, converting the audio data into a video audio text based on the speech recognition model may be implemented as follows:
text_asr = M_ASR(aud)
where text_asr is the video audio text, aud is the audio data, and M_ASR(·) is the speech recognition model conversion process.
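A minimal sketch of the audio/video conversion Trans(·) and the speech recognition M_ASR(·), under stated assumptions: ffmpeg is available on the system path, and OpenAI's whisper package is used as one possible speech recognition model. Neither tool is named by the patent.

```python
import subprocess

import whisper  # pip install openai-whisper; one possible ASR backend, an assumption


def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    """Trans(*): strip the audio track and resample to 16 kHz mono with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    return audio_path


def transcribe(audio_path: str) -> str:
    """M_ASR(*): convert the audio data into the video audio text."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"]
```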
Therefore, in the embodiment of the present disclosure, the video subtitle text and the video audio text of the target video may be acquired, so that the multi-modal video features are acquired.
And S120, segmenting the target video based on the video characteristics of the target video to obtain a plurality of video segments.
In the embodiment of the present disclosure, after the video cataloging device obtains the video features, the target video may be segmented according to the video subtitle text and the video audio text in the video features, so as to obtain a plurality of video segments.
In the embodiment of the present disclosure, the video segment may be one video frame or a plurality of consecutive video frames obtained by segmenting the target video based on the video audio text and the video subtitle text.
Specifically, for the video audio text, the video cataloging device may recognize the audio content it contains to identify target audio content, and segment the video audio text corresponding to the target video according to that target audio content to obtain a plurality of first audio texts. For the video subtitle text, the device may cluster identical subtitle sub-data in time order, so that sub-data belonging to the same subtitle is merged and sub-data belonging to different subtitles is separated, yielding a plurality of first subtitle texts corresponding to the target video. The target video is then segmented based on the split-frame positions corresponding to the plurality of first subtitle texts and the split-frame positions corresponding to the first audio texts, obtaining a plurality of video segments.
Therefore, in the embodiment of the disclosure, the target video can be segmented based on the video audio text and the video subtitle text, and the segmentation of the target video based on the multi-modal video features can be realized, so as to further improve the segmentation accuracy of the target video.
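The patent does not prescribe how the audio-derived and subtitle-derived split-frame positions are combined; the sketch below merely illustrates one plausible fusion, in which cut points from both modalities are merged and nearby cuts (within an assumed tolerance) are treated as a single boundary.

```python
def merge_cut_points(audio_cuts: list, subtitle_cuts: list, tolerance: int = 10) -> list:
    """Merge frame indices proposed by the audio text and the subtitle text.

    Cut points closer than `tolerance` frames are treated as one boundary
    (the tolerance value is illustrative, not taken from the patent).
    """
    merged = []
    for cut in sorted(set(audio_cuts) | set(subtitle_cuts)):
        if not merged or cut - merged[-1] > tolerance:
            merged.append(cut)
    return merged


def split_video(total_frames: int, cut_points: list) -> list:
    """Turn cut points into (start_frame, end_frame) video segments."""
    bounds = [0] + [c for c in cut_points if 0 < c < total_frames] + [total_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


# Example: cuts from the audio text at frames 120 and 240, from the subtitle text at 125 and 360.
segments = split_video(500, merge_cut_points([120, 240], [125, 360]))
print(segments)  # [(0, 120), (120, 240), (240, 360), (360, 500)]
```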
S130, extracting a video label corresponding to each video clip based on a video text corresponding to the video clip, wherein the video text comprises a first audio text and a first subtitle text, and the video label at least comprises a semantic label.
In the embodiment of the disclosure, after the video cataloging apparatus obtains the plurality of video segments, each video segment may be converted into a video text comprising a first audio text and a first subtitle text. Keywords are then extracted from the first audio text and the first subtitle text and used as the semantic tags of the video segment, thereby obtaining the video tag.
In the embodiment of the present disclosure, the video text may be a text obtained by performing video content recognition on a video segment. In particular, the video text may include text data in each video frame image.
In this disclosure, the first audio text may be a text obtained by performing speech recognition on audio data in the video segment. In particular, the first audio text may comprise audio data in each audio frame.
In this disclosure, the first subtitle text may be a text obtained by performing character recognition on video content in the video segment. Specifically, the first subtitle text may include subtitle data in each video frame image.
In the disclosed embodiment, the keywords may be used to characterize semantic information of the video segments.
Specifically, the video cataloguing device may perform keyword analysis on the first audio text and the first subtitle text to extract keywords from the first audio text and the first subtitle text, and the keywords are used as semantic tags of each video segment to obtain video tags.
Therefore, in the embodiment of the disclosure, the semantic tag of each video segment can be extracted based on the first audio text and the first subtitle text in the video text, so that the video tag can be accurately extracted based on multi-modal video information.
S140, cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video.
In the embodiment of the present disclosure, after the video cataloging device determines the video tag corresponding to each video segment, the video segments of the same video tag may be merged to obtain the cataloging result of the target video.
In the embodiment of the present disclosure, the cataloging result may include all video segments of the target video and video tags corresponding to the video segments.
In the embodiment of the present disclosure, optionally, S140 may include:
merging the video segments with the same semantic label to obtain a plurality of merged video segments;
determining a cataloguing result of the target video according to the merged video segments, wherein the cataloguing result at least comprises the following steps: and the video tags corresponding to the merged video clips and the start-stop frames corresponding to the merged video clips.
Specifically, the video cataloging equipment can merge video segments with the same semantic tags based on the semantic tags in the video tags, and keep video segments with different semantic tags separate, obtaining a plurality of merged video segments. The cataloging result of the target video is then determined according to the merged video segments and the semantic tags corresponding to the video segments, so that the cataloging result includes the video tags corresponding to the merged video segments and the start-stop frames corresponding to the merged video segments. This completes the automated video cataloging process, realizes secondary merging of the video segments based on the video tags, and prevents the target video from being split excessively.
The start-stop frame corresponding to the merged video segment may be a start-stop timestamp of the merged video segment.
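A hedged sketch of the merging step described above: adjacent video segments whose tag sets are identical are combined into one merged segment that keeps the tags and the start-stop frames. The data structure and field names are illustrative, not part of the patent.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_frame: int
    end_frame: int
    tags: frozenset  # semantic (and, optionally, classification) tags


def merge_by_tags(segments: list) -> list:
    """Merge adjacent segments that carry identical tag sets."""
    merged = []
    for seg in segments:
        if merged and merged[-1].tags == seg.tags:
            merged[-1] = Segment(merged[-1].start_frame, seg.end_frame, seg.tags)
        else:
            merged.append(seg)
    return merged


catalog = merge_by_tags([
    Segment(0, 120, frozenset({"flood", "rescue"})),
    Segment(120, 240, frozenset({"flood", "rescue"})),
    Segment(240, 360, frozenset({"sports"})),
])
# -> two entries: (0, 240) tagged {flood, rescue} and (240, 360) tagged {sports}
```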
In the embodiment of the disclosure, after the video features of the target video are obtained, the target video is segmented based on those video features to obtain a plurality of video segments. For each video segment, the video tag corresponding to the segment is then extracted based on the video text corresponding to the segment; the video text includes the first audio text and the first subtitle text, so that multi-modal video information can be obtained, and the video tag includes at least a semantic tag. The target video is further cataloged by using the plurality of video segments and the video tag corresponding to each segment, yielding the cataloging result corresponding to the target video. In this way, an automated video cataloging method is provided and the efficiency of video cataloging is improved: even for the massive videos of the all-media era, no professional is needed to catalog the multimedia content, which greatly reduces labor cost and guarantees the timeliness of video cataloging. Moreover, because the target video is cataloged based on video tags derived from multi-modal video information, the accuracy of video cataloging is improved.
In other embodiments of the present disclosure, the high-frequency keywords and the abstract keywords may be extracted from the video text, respectively, and semantic tags in the video tags may be determined based on the high-frequency keywords and the abstract keywords.
In the embodiment of the present disclosure, optionally, S130 may include:
s1301, extracting high-frequency keywords and abstract keywords corresponding to the video clips from the video texts;
s1302, fusing the high-frequency keywords and the abstract keywords to obtain video keywords corresponding to the video clips;
and S1303, taking the video keywords meeting the preset keyword extraction conditions as semantic tags corresponding to the video clips.
In the embodiment of the disclosure, after the video cataloging device acquires the video segments, the high-frequency keywords and the abstract keywords corresponding to the video segments can be extracted from the video texts corresponding to the video segments, the high-frequency keywords and the abstract keywords are fused to obtain the video keywords corresponding to the video segments, and then the video keywords meeting the preset keyword extraction conditions are used as the semantic tags corresponding to the video segments.
The high-frequency keywords may be keywords having a higher frequency of appearance in the first audio text and the first subtitle text.
The abstract keywords may be keywords that constitute the abstract (summary) of the first audio text and the first subtitle text.
The specific step of S1301 extracting the high-frequency keyword may include:
respectively inputting the first audio text and the first caption text into a pre-trained keyword extraction model to obtain a probability value of occurrence of each keyword in the first audio text and the first caption text;
and taking the keywords with the probability values larger than the first probability threshold as the high-frequency keywords of the first audio text and the first subtitle text.
Specifically, the video cataloging device may input the first audio text and the first subtitle text into a pre-trained keyword extraction model to obtain the occurrence probability value corresponding to each keyword in the first audio text and the first subtitle text. If the occurrence probability value of a keyword is greater than the first probability threshold, the keyword is used as a high-frequency keyword of the first audio text and the first subtitle text, and its probability value is the high-frequency occurrence probability value; otherwise, the keyword is discarded.
The first probability threshold may be preset as needed to determine whether the keyword corresponding to the occurrence probability value is used as the high-frequency keyword.
Optionally, the keyword extraction model may be a Term Frequency-Inverse Document Frequency (TF-IDF) model.
Optionally, extracting the high-frequency keyword based on the keyword extraction model may be implemented in the following manner:
K_TF-IDF = M_TF-IDF(T_asr + T_ocr)
where T_asr is the first audio text, T_ocr is the first subtitle text, K_TF-IDF denotes the high-frequency keywords, and M_TF-IDF(·) is the processing of the keyword extraction model.
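As one possible realization of the keyword extraction model M_TF-IDF(·), the sketch below scores words over T_asr + T_ocr with scikit-learn's TfidfVectorizer. Treating each sentence as a document, pre-tokenizing with jieba, and the 0.2 threshold (standing in for the first probability threshold) are assumptions made for illustration.

```python
import jieba  # Chinese word segmentation; an assumed preprocessing choice
from sklearn.feature_extraction.text import TfidfVectorizer


def high_frequency_keywords(audio_text: str, subtitle_text: str, threshold: float = 0.2) -> dict:
    """M_TF-IDF(*): score words over T_asr + T_ocr and keep those above the threshold."""
    sentences = [s for s in (audio_text + "。" + subtitle_text).split("。") if s.strip()]
    # Pre-tokenize each sentence so TfidfVectorizer can treat Chinese words as terms.
    docs = [" ".join(jieba.lcut(s)) for s in sentences]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)
    scores = matrix.max(axis=0).toarray().ravel()  # highest TF-IDF score of each term
    vocab = vectorizer.get_feature_names_out()
    return {word: float(score) for word, score in zip(vocab, scores) if score > threshold}
```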
The specific step of S1301 of extracting the abstract keywords may include:
respectively inputting the first audio text and the first caption text into a pre-trained abstract keyword extraction model to obtain an abstract composition probability value corresponding to each keyword in the first audio text and the first caption text;
and taking the keywords whose abstract composition probability values are greater than a second probability threshold as the abstract keywords of the first audio text and the first subtitle text.
Specifically, the video cataloging device may input the first audio text and the first subtitle text into a pre-trained abstract keyword extraction model to obtain the abstract composition probability value corresponding to each keyword in the first audio text and the first subtitle text. If the abstract composition probability value of a keyword is greater than the second probability threshold, the keyword is used as an abstract keyword of the first audio text and the first subtitle text; otherwise, the keyword is discarded.
The second probability threshold may be preset as required to determine whether to use the keyword corresponding to the digest formation probability value as the digest keyword.
Alternatively, the abstract keyword extraction model may be a text ranking (TextRank) model.
Optionally, extracting the abstract keywords based on the abstract keyword extraction model may be implemented as follows:
K_TR = M_TextRank(T_asr + T_ocr)
where K_TR denotes the abstract keywords and M_TextRank(·) is the processing of the abstract keyword extraction model.
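jieba's built-in TextRank implementation is one readily available stand-in for M_TextRank(·); the weights it returns play the role of the abstract composition probability values here. The library choice, the topK value, and the 0.1 threshold are assumptions, since the patent names no particular implementation.

```python
import jieba.analyse


def abstract_keywords(audio_text: str, subtitle_text: str, threshold: float = 0.1) -> dict:
    """M_TextRank(*): TextRank keywords of T_asr + T_ocr above the second threshold."""
    pairs = jieba.analyse.textrank(audio_text + subtitle_text, topK=50, withWeight=True)
    return {word: weight for word, weight in pairs if weight > threshold}
```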
Wherein, the specific steps of S1302 may include:
S1, determining the high-frequency occurrence probability value corresponding to each high-frequency keyword in the video text and the abstract composition probability value corresponding to each abstract keyword in the video text;
S2, performing a weighted average of the high-frequency occurrence probability value and the abstract composition probability value to obtain a probability weighted value;
and S3, taking the high-frequency keyword and the abstract keyword corresponding to each probability weighted value as the video keywords corresponding to the video clip.
Specifically, after the video cataloging equipment acquires the high-frequency keywords and the abstract keywords, it determines the high-frequency occurrence probability value corresponding to each high-frequency keyword and the abstract composition probability value corresponding to each abstract keyword, multiplies the high-frequency occurrence probability value by a first weight to obtain a first product, multiplies the abstract composition probability value by a second weight to obtain a second product, and adds the first product to the second product to obtain the probability weighted value; the high-frequency keyword and the abstract keyword corresponding to each probability weighted value are then taken as the video keywords corresponding to the video clip.
Optionally, the weighted average of the high-frequency occurrence probability value and the abstract composition probability value may be implemented as follows:
K_word = α·K_TF-IDF + β·K_TR
where α is the first weight, β is the second weight, and K_word is the probability weighted value.
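The weighted fusion K_word = α·K_TF-IDF + β·K_TR and the subsequent selection of semantic tags might be combined as follows; the weight values, and keeping the five highest-weighted words (matching the example preset number used later), are illustrative choices rather than values fixed by the patent.

```python
def fuse_keywords(tfidf_scores: dict, textrank_scores: dict,
                  alpha: float = 0.5, beta: float = 0.5, top_k: int = 5) -> list:
    """Compute K_word = alpha * K_TF-IDF + beta * K_TR and keep the top_k words as semantic tags."""
    words = set(tfidf_scores) | set(textrank_scores)
    weighted = {
        w: alpha * tfidf_scores.get(w, 0.0) + beta * textrank_scores.get(w, 0.0)
        for w in words
    }
    return sorted(weighted, key=weighted.get, reverse=True)[:top_k]
```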
In the embodiment of the present disclosure, the preset keyword extraction condition may be used to determine whether to use the high-frequency keyword and the abstract keyword as semantic tags.
In some embodiments, the preset keyword extraction conditions may include a probability weighting value and a probability weighting value threshold.
Wherein, S1303 may specifically include: if the probability weighted value is greater than the probability weighted value threshold, determining that the video keyword meets the preset keyword extraction condition, and taking the video keyword meeting the preset keyword extraction condition as a semantic tag corresponding to the video clip, that is, taking the high-frequency keyword and the abstract keyword whose high-frequency occurrence probability value and abstract composition probability value form the probability weighted value as the semantic tags.
Specifically, after the video cataloging device calculates the probability weighted value, the probability weighted value is compared with the probability weighted value threshold. If the probability weighted value is greater than the threshold, the video keyword meeting the preset keyword extraction condition is used as a semantic tag corresponding to the video clip, namely, the high-frequency keyword and the abstract keyword respectively corresponding to the high-frequency occurrence probability value and the abstract composition probability value that form the probability weighted value are used as semantic tags; otherwise, those keywords are discarded.
In other embodiments, the predetermined keyword extraction condition may include a probability weighting value and a predetermined number of probability weighting values having the largest value.
Wherein, S3 may specifically include: sorting the probability weighted values in descending order, determining the high-frequency occurrence target probability values and the abstract composition target probability values corresponding to the preset number of probability weighted values with the largest values, determining that the high-frequency keywords corresponding to the high-frequency occurrence target probability values and the abstract keywords corresponding to the abstract composition target probability values meet the preset keyword extraction condition, and taking those high-frequency keywords and abstract keywords as semantic tags.
The preset number may be set as needed. Optionally, the preset number may be 5, 6, 7, and so on, which is not limited herein.
Specifically, taking a preset number of 5 as an example, after the video cataloging device calculates the probability weighted values, it sorts them in descending order, determines the high-frequency occurrence target probability values and the abstract composition target probability values corresponding to the first 5 probability weighted values, then determines that the high-frequency keywords corresponding to the high-frequency occurrence target probability values and the abstract keywords corresponding to the abstract composition target probability values meet the preset keyword extraction condition, and takes those high-frequency keywords and abstract keywords as semantic tags.
Therefore, in the embodiment of the disclosure, the high-frequency keywords and the abstract keywords can be extracted from the video text, the high-frequency keywords and the abstract keywords are fused to obtain the video keywords corresponding to the video segments, and then the video keywords meeting the preset keyword extraction conditions are used as the semantic tags corresponding to the video segments, so that the semantic tags can be determined according to the multi-modal audio information and subtitle information, and the accuracy of determining the semantic tags is improved.
In still other embodiments of the present disclosure, the video tags further comprise classification tags, the classification tags comprising at least one of a regional classification tag and a domain classification tag.
In the embodiment of the disclosure, the video cataloging device may further extract at least one of the region classification label and the field classification label of each video clip based on the video text corresponding to each video clip, so as to catalog the target video based on the region classification label and/or the field classification label and the semantic label, so that the cataloging result of the target video carries more cataloging information.
In some embodiments of the present disclosure, the video tag further comprises: a classification tag, and the classification tag may include a region classification tag.
In the embodiment of the present disclosure, the region classification label may be used to characterize region information in the target video.
Optionally, the regional information may include international information, domestic information, provincial information, city information, street information, organization structure, and historical sites, which are not limited herein.
In the embodiment of the present disclosure, optionally, S130 may include:
s1301, performing sentence segmentation on the video text to obtain a plurality of text clauses;
s1302, based on a pre-established region entity knowledge base, identifying region information in each text clause to obtain region classification probability corresponding to each text clause;
s1303, fusing the region classification probabilities corresponding to the text clauses to obtain the region classification probability of the video clip;
and S1304, determining a region classification label corresponding to the video clip according to the region classification probability of the video clip.
Specifically, the video cataloging device can identify the punctuation marks at the ends of sentences in the video text and segment the video text based on the positions of those punctuation marks to obtain a plurality of text clauses. The region information in each text clause is then identified based on the pre-established region entity knowledge base to obtain the region classification probability corresponding to each text clause. The region classification probabilities corresponding to the text clauses are fused to obtain the region classification probability of the video segment, and finally the region classification label corresponding to the video segment is determined according to the region classification probability of the video segment.
The pre-established geographic entity repository may be a pre-generated entity repository containing geographic information across the country.
Optionally, the sentence segmentation is performed on the video text to obtain a plurality of text clauses, and the method can be implemented as follows:
sent = TextSent(T_asr) + TextSent(T_ocr) = {S_1, S_2, …, S_j, …, S_r}
where r is the number of text clauses, S_j is the j-th text clause, sent denotes the set of text clauses, and TextSent(·) is the sentence segmentation process.
Optionally, the region classification probability corresponding to each text clause may include:
P = (p_I, p_D, p_C) = rec(S_j), j = 1, 2, …, r
where P is the region classification probability of text clause S_j, expressed as (p_I, p_D, p_C), in which p_I is the international classification probability value, p_D is the domestic classification probability value, and p_C is the local-city classification probability value.
Wherein, the dimension of the region classification label may be preset as needed. For example, if the region classification label includes country, province and city, the dimension of the region classification label is 3 and the preset number is 3.
The specific step of S1303 may include: and according to the dimensionality of the preset region classification label, fusing the region classification probabilities corresponding to the text clauses to obtain the region classification probability of the video clip.
Wherein, the specific step of S1304 may include: and aiming at each fused region classification result, taking the fused region classification result with the largest region classification probability as a region classification label corresponding to the video clip.
Optionally, obtaining the fused region classification result with the largest region classification probability may be expressed by a formula that is rendered as an image in the original publication, in which s_F denotes the region classification label and the operation selects the fused region classification result with the maximum region classification probability.
Therefore, in the embodiment of the disclosure, the video text can be sentence-segmented to obtain a plurality of text clauses, and the region classification label of each video segment is determined according to the text clauses to obtain more label information.
Furthermore, after the region classification labels are obtained, the target video can be cataloged based on the region classification labels and the semantic labels, so that the cataloging result of the target video carries more cataloging information.
In the embodiment of the present disclosure, optionally, S140 may include:
merging the video segments with the same semantic label and region classification label to obtain a plurality of merged video segments;
determining a cataloguing result of the target video according to the merged video segments, wherein the cataloguing result at least comprises the following steps: and the video tags corresponding to the merged video clips and the start-stop frames corresponding to the merged video clips.
The principle of S140 is similar to that of the previous embodiment, and is not described herein.
Therefore, in the embodiment of the disclosure, after the geographical classification tag is obtained, the target video can be cataloged by using the plurality of video segments, the semantic tag and the geographical classification tag to obtain the cataloging result corresponding to the target video, so that the obtained cataloging result carries more tag information, and the accuracy of video cataloging is improved.
In some embodiments of the present disclosure, the video tag further comprises: a classification tag comprising a domain classification tag.
In embodiments of the present disclosure, domain classification tags may be used to characterize domain information in a target video.
Alternatively, the domain information may include information such as, without limitation, political, economic, scientific, sports, entertainment, military, natural, disaster, law, real estate, industry, construction, transportation, education, history, agriculture, medical health, social civil, fun, and the like.
In the embodiment of the present disclosure, optionally, S130 may include:
s1305, extracting an audio text abstract of the first audio text and a subtitle text abstract of the first subtitle text;
s1306, fusing the audio text abstract and the subtitle text abstract to obtain a video abstract;
and S1307, inputting the video abstract into a pre-trained domain classification model to obtain a domain classification label of each video segment.
Specifically, after obtaining the first audio text and the first subtitle text, the video cataloging apparatus may extract the audio text abstract of the first audio text and the subtitle text abstract of the first subtitle text based on a text abstract extraction model, then fuse the audio text abstract and the subtitle text abstract to obtain the video abstract, and further input the video abstract into a pre-trained domain classification model to obtain the domain classification label of each video segment.
Optionally, the audio text abstract and the subtitle text abstract are fused to obtain the video text abstract, and the method can be implemented as follows:
T_a = Summ(T_asr) + Summ(T_ocr)
where Summ(T_asr) is the audio text abstract, Summ(T_ocr) is the subtitle text abstract, and T_a is the video text abstract.
Alternatively, the domain classification model may be a BERT model.
Optionally, inputting the video text abstract into the pre-trained domain classification model to obtain the domain classification label of each video segment may be implemented as follows:
C_F = Bert(T_a)
where C_F is the domain classification label and Bert(·) is the processing of the domain classification model.
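With the Hugging Face transformers library, a fine-tuned BERT classifier can play the role of Bert(·). The checkpoint name below is only a placeholder (in practice a model fine-tuned on the domain taxonomy above would be used), and the 512-character truncation is an illustrative simplification.

```python
from transformers import pipeline

# Placeholder checkpoint: in practice this would be a BERT model fine-tuned on the
# domain taxonomy (politics, economy, sports, ...) described above.
classifier = pipeline("text-classification", model="bert-base-chinese")


def domain_label(audio_summary: str, subtitle_summary: str) -> str:
    """C_F = Bert(T_a): classify the fused video text abstract into a domain label."""
    video_summary = audio_summary + subtitle_summary
    return classifier(video_summary[:512])[0]["label"]
```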
Therefore, in the embodiment of the disclosure, the audio text abstract of the first audio text and the subtitle text abstract of the first subtitle text can be respectively extracted, and the domain classification label corresponding to each video segment can be determined according to the audio text abstract and the subtitle text abstract, so as to obtain more label information.
Furthermore, after the domain classification labels are obtained, the target video can be cataloged based on the domain classification labels and the semantic labels, so that the cataloging result of the target video carries more cataloging information.
In the embodiment of the present disclosure, optionally, S140 may include:
merging the video segments with the same semantic label and domain classification label to obtain a plurality of merged video segments;
determining a cataloguing result of the target video according to the merged video segments, wherein the cataloguing result at least comprises the following steps: and the video tags corresponding to the merged video clips and the start-stop frames corresponding to the merged video clips.
In the embodiment of the present disclosure, optionally, S140 may include:
merging the video segments with the same semantic label, region classification label and field classification label to obtain a plurality of merged video segments;
determining a cataloguing result of the target video according to the merged video segments, wherein the cataloguing result at least comprises the following steps: and the video tags corresponding to the merged video clips and the start-stop frames corresponding to the merged video clips.
The principle of S140 is similar to that of the previous embodiment, and is not described herein.
Therefore, in the embodiment of the disclosure, after the domain classification tag is obtained, the target video can be cataloged by using the plurality of video segments, the semantic tag, the domain classification tag and/or the region classification tag, and the cataloging result corresponding to the target video is obtained, so that the obtained cataloging result carries more tag information, and the accuracy of video cataloging is improved.
In order to improve the accuracy of segmentation of the target video, error correction processing may be performed on the video subtitle text and the video audio text of the target video, so that the target video is segmented by using the error-corrected video subtitle text and video audio text to obtain a plurality of video segments.
Optionally, the error correction processing on the video/audio text may be implemented by:
T_asr = TextCorr(t_asr)
where t_asr is the video audio text, T_asr is the video audio text after error correction, and TextCorr(·) is the error correction process.
Optionally, the error correction processing on the video subtitle text may be implemented in the following manner:
T_ocr = TextCorr(t_ocr)
wherein t_ocr is the video subtitle text, T_ocr is the video subtitle text after error correction processing, and TextCorr(·) is the error correction process.
Therefore, in the embodiment of the present disclosure, after the video subtitle text and the video voice text are obtained, error correction processing may be further performed on the video subtitle text and the video voice text, so that the target video is accurately segmented by using the video subtitle text and the video voice text after the error correction processing, and a plurality of video segments are obtained.
In still other embodiments of the present disclosure, the video features include video caption text.
In the embodiment of the present disclosure, optionally, S110 may include:
S1101, carrying out shot edge detection on the target video to obtain a plurality of transition image frames;
S1102, extracting a group of image frames corresponding to each transition image frame from the target video;
and S1103, performing subtitle recognition on each group of image frames to obtain second subtitle texts corresponding to each group of image frames, wherein the plurality of second subtitle texts form the video subtitle text.
In the embodiment of the present disclosure, after the video cataloging device obtains the target video, shot edge detection may be performed on the target video to obtain a plurality of transition image frames, a group of image frames corresponding to each transition image frame is extracted from the target video, and then subtitle recognition is performed on each group of image frames to obtain second subtitle text segments corresponding to each group of image frames, so that the plurality of second subtitle text segments form a video subtitle text.
Wherein the transition image frame may include video images of all shot scenes of the target video.
Wherein a set of image frames may be generated based on each transition image frame and the video frame to which each transition image frame corresponds.
The specific steps of S1102 may include:
acquiring each transition image frame;
and extracting a group of image frames which meet preset video frame extraction conditions from the target video based on the transition image frames for each transition image frame.
Wherein the preset video frame extraction condition may be a preset extraction condition for generating a group of image frames.
Optionally, the preset video frame extraction condition may be 5 frames of images before and after the transition image frame.
Optionally, based on the transition image frame, a group of image frames meeting a preset video frame extraction condition is extracted from the target video, and the following steps may be performed:
[formula image BDA0003326551700000201 in the original publication]
wherein img is the group of image frames, s_v(±5) denotes the 5 frames of images before and after the transition image frame, and s_m is a video frame.
Therefore, in the embodiment of the disclosure, a group of image frames meeting the preset video frame extraction condition may be generated based on each transition image frame in the target video, and then a video subtitle text may be accurately generated based on the group of image frames, and the obtained video subtitle text may be used as a video feature.
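A minimal sketch of S1101-S1102, assuming OpenCV is available and that the shot edge detector has already produced the transition frame indices; the 5-frame window on each side follows the optional extraction condition described above.

import cv2

def extract_frame_groups(video_path, transition_indices, window=5):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    groups = []
    for idx in transition_indices:
        group = []
        # Take the transition image frame plus the frames within the window on each side.
        for frame_no in range(max(0, idx - window), min(total, idx + window + 1)):
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
            ok, frame = cap.read()
            if ok:
                group.append(frame)
        groups.append(group)
    cap.release()
    return groups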
In still other embodiments of the present disclosure, the video features include video subtitle text and video audio text, and the target video may be segmented based on the video subtitle text and the video audio text.
In the embodiment of the present disclosure, optionally, S120 may include:
S1201, segmenting the video audio text to obtain a plurality of first segmentation frame positions;
S1202, segmenting the video subtitle text to obtain a plurality of second segmentation frame positions;
and S1203, segmenting the target video based on the first segmentation frame positions and the second segmentation frame positions to obtain a plurality of video segments.
In the embodiment of the present disclosure, after the video cataloging device obtains the video audio text and the video subtitle text, the video audio text may be segmented to obtain a plurality of first segmentation positions, the video subtitle text may be segmented to obtain a plurality of second segmentation positions, and then the target video may be segmented based on the first segmentation positions and the second segmentation positions to obtain a plurality of video segments.
The specific steps of S1201 may include:
and inputting the video and audio text into a pre-trained transition statement recognition model to obtain a plurality of first segmentation frame positions output by the transition statement recognition model.
Specifically, the video cataloging device may directly input the video audio text into a pre-trained transition statement recognition model, identify the transition statements in the video audio text and the first segmentation frame positions corresponding to the transition statements, and obtain a plurality of first segmentation frame positions output by the transition statement recognition model.
Wherein the transition sentence recognition model may be a model for recognizing transition sentences. Specifically, the transition sentence recognition model may be obtained by training an initial model based on sample transition sentences, sample non-transition sentences, and sample video audio texts.
Optionally, the transition statement recognition model may be a classifier trained based on a convolutional neural network, which is not limited herein.
The transition sentence may be a sentence in the video audio text that is used for connecting different speech contents in the target video.
Optionally, obtaining the transition statements output by the transition statement recognition model may be implemented as follows:
R_a = textCNN(text_asr)
wherein text_asr is the video audio text, R_a is the transition statement, and textCNN(·) is the process of extracting transition statements using the transition statement recognition model.
Therefore, in the embodiment of the disclosure, the video and audio texts can be accurately segmented based on the transition sentences.
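Sketched below under the assumption that the speech recognition output is a list of (sentence, frame position) pairs and that is_transition wraps the trained transition statement recognition model; neither name comes from the disclosure.

def first_split_positions(asr_sentences, is_transition):
    # R_a = textCNN(text_asr): collect the frame positions of sentences judged to be transitions.
    positions = []
    for sentence, frame_no in asr_sentences:
        if is_transition(sentence):
            positions.append(frame_no)
    return sorted(set(positions))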
The specific steps of S1202 may include:
and analyzing the video subtitle text by using a time sequence text clustering method to obtain a plurality of second segmentation frame positions.
In the embodiment of the present disclosure, after the video cataloging device obtains the video subtitle text, a time sequence text clustering method may be used to cluster a plurality of sub-data in the video subtitle text, so that the same sub-data are merged, and different sub-data are separated, so as to obtain a plurality of second segmentation frame positions.
Optionally, the video subtitle text is analyzed by using a time sequence text clustering method to obtain a plurality of second segmentation frame positions, and the method can be implemented as follows:
R_o = SP(text_ocr)
wherein R_o is the second segmentation frame position, text_ocr is the video subtitle text, and SP(·) is the time sequence text clustering method.
Therefore, in the embodiment of the disclosure, the video subtitle text can be accurately segmented by using a time sequence text clustering method.
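One simplified form of such time sequence text clustering is sketched below; the Jaccard similarity measure and the 0.5 threshold are illustrative choices rather than values from the disclosure.

def second_split_positions(ocr_entries, threshold=0.5):
    # ocr_entries: time-ordered (subtitle text, frame position) pairs.
    positions = []
    prev_tokens = set()
    for text, frame_no in ocr_entries:
        tokens = set(text.split())
        union = prev_tokens | tokens
        similarity = len(prev_tokens & tokens) / len(union) if union else 1.0
        # A new cluster starts where the subtitle content changes, marking a second segmentation frame position.
        if prev_tokens and similarity < threshold:
            positions.append(frame_no)
        prev_tokens = tokens
    return positions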
The specific steps of S1203 may include:
combining the plurality of first segmentation frame positions and the plurality of second segmentation frame positions to obtain a plurality of video segmentation frame positions of the target video;
and segmenting the target video based on the positions of the plurality of video segmentation frames to obtain a plurality of video segments.
In the embodiment of the present disclosure, after the video cataloging device obtains the first and second split frame positions, the first and second split frame positions may be merged to obtain a plurality of video split frame positions of the target video, and the target video is split based on the plurality of video split frame positions to obtain a plurality of video segments.
Merging the plurality of first segmentation frame positions and the plurality of second segmentation frame positions to obtain the plurality of video segmentation frame positions of the target video may include:
dividing the plurality of first segmentation frame positions and the plurality of second segmentation frame positions into a group in pairs according to the time sequence;
for each group of the first segmentation frame position and the second segmentation frame position, taking the first segmentation frame position as a video segmentation frame position under the condition that the time difference between the first segmentation frame position and the second segmentation frame position is less than or equal to a preset time difference threshold value;
and aiming at each group of the first segmentation frame position and the second segmentation frame position, respectively taking the first segmentation frame position and the second segmentation frame position as video segmentation frame positions under the condition that the time difference between the first segmentation frame position and the second segmentation frame position is greater than a preset time difference threshold value.
Specifically, after the video cataloging device obtains the first segmentation frame positions and the second segmentation frame positions, the first segmentation frame positions and the second segmentation frame positions are pairwise divided into groups according to the time sequence, and the time difference between the first segmentation frame position and the second segmentation frame position is calculated for each group. If the time difference is smaller than or equal to the preset time difference threshold, the two segmentation timestamps are considered to be the same, so the second segmentation frame position is merged into the first segmentation frame position, that is, the first segmentation frame position is used as the video segmentation frame position; otherwise, the first segmentation frame position and the second segmentation frame position are respectively used as video segmentation frame positions.
The preset time difference threshold may be a time length preset as required to determine whether to use the first segmentation frame position as the video segmentation frame position.
Optionally, taking the first segmentation frame position as the video segmentation frame position may be implemented as follows:
R_aj = (||R_oi - R_aj|| < 200)
wherein R_aj is the first segmentation frame position, R_oi is the second segmentation frame position, and 200 is the preset time difference threshold.
Further, the target video is segmented based on the video segmentation frame position of the target video to obtain a plurality of video segments, which can be expressed as:
R = [[F_1s, F_1e], [F_2s, F_2e], ..., [F_Ns, F_Ne]]
wherein the start frame of the i-th video segment is F_is, the end frame is F_ie, and N is the number of segments of the target video.
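The merging rule and the resulting segment list can be sketched as follows; pairing by zip assumes the two position lists have already been grouped pairwise in time order as described above, and the 200-frame threshold follows the formula.

def merge_split_positions(first, second, threshold=200):
    merged = []
    for r_a, r_o in zip(sorted(first), sorted(second)):
        if abs(r_o - r_a) <= threshold:
            merged.append(r_a)            # positions considered identical: keep the first segmentation frame position
        else:
            merged.extend([r_a, r_o])     # otherwise both remain video segmentation frame positions
    return sorted(set(merged))

def to_segments(cut_points, last_frame):
    # R = [[F_1s, F_1e], ..., [F_Ns, F_Ne]]: turn cut points into start/stop frame pairs.
    bounds = [0] + sorted(cut_points) + [last_frame]
    return [[bounds[i], bounds[i + 1]] for i in range(len(bounds) - 1)]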
Fig. 2 illustrates a logic diagram of a segmentation method of a target video according to an embodiment of the present disclosure.
As shown in fig. 2, first, a video subtitle text and a video voice text are obtained; secondly, identifying transition sentences in the video voice text, segmenting the video voice text into first audio texts based on the transition sentences, obtaining a plurality of first segmentation frame positions, performing time sequence clustering on a plurality of subdata in the video subtitle text to obtain a plurality of first subtitle texts, and obtaining a plurality of second segmentation frame positions; and then, segmenting the target video based on the first segmentation frame position and the second segmentation frame position to obtain a plurality of video segments.
Therefore, in the embodiment of the disclosure, the target video can be segmented based on the first segmentation frame position and the second segmentation frame position, so that the segmentation frame position determined based on the video features of two dimensions can be used, and the segmentation accuracy of the target video is improved.
In still another embodiment of the present disclosure, the overall flow of the video cataloging method is specifically explained.
Fig. 3 is a schematic overall flow chart of a video cataloging method provided by the embodiment of the present disclosure.
As shown in fig. 3, the video cataloging method may include the following steps.
And S310, acquiring the video characteristics of the target video.
Fig. 4 is an overall logic diagram of another video cataloging method provided by the embodiment of the present disclosure.
As shown in fig. 4, first, a shot edge detection technology is used to extract transition image frames from a target video, and based on the transition image frames and video frames corresponding to each transition image frame, a set of image frames corresponding to each transition image frame is obtained, and at the same time, the target video is converted into audio data, and the target video is subjected to voice recognition, so as to obtain a video voice text of voice data; and then, performing subtitle recognition on each group of image frames to obtain a video subtitle text of the target video.
S320, segmenting the target video based on the video subtitle text and the video voice text in the video characteristics to obtain a plurality of video segments.
Continuing to refer to fig. 4, the video audio text is segmented to obtain a plurality of first audio texts and a plurality of first segmentation frame positions, the video subtitle text is segmented to obtain a plurality of first subtitle texts and a plurality of second segmentation frame positions, and then the target video is segmented based on the first segmentation frame positions and the second segmentation frame positions to obtain a plurality of video segments.
S330, extracting semantic labels, region classification labels and domain classification labels corresponding to the video clips based on the video texts corresponding to the video clips, for each video clip.
Continuing to refer to fig. 4, extracting high-frequency keywords and abstract keywords from a first audio text and a first subtitle text corresponding to the video segment, and extracting keywords meeting preset keyword extraction conditions from the video keywords as semantic tags corresponding to the video segment; meanwhile, performing sentence segmentation on the video text to obtain a plurality of text clauses, identifying region information in each text clause based on a pre-constructed region entity knowledge base to obtain region classification probabilities corresponding to each text clause, fusing the region classification probabilities corresponding to each text clause to obtain region classification probabilities of video segments, and determining a region classification label corresponding to each video segment according to the region classification probabilities of the video segments; and extracting an audio text abstract of the first audio text and a subtitle text abstract of the first subtitle text, fusing the audio text abstract and the subtitle text abstract to obtain a video abstract, inputting the video abstract into a pre-trained domain classification model, and obtaining a domain classification label corresponding to each video segment.
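The region classification part of this step can be sketched as follows; the flat structure of the region entity knowledge base and the accumulation-based fusion rule are assumptions made for the example.

from collections import defaultdict

def region_label(clauses, region_kb):
    # region_kb: mapping from region name to the region entity words belonging to that region.
    scores = defaultdict(float)
    for clause in clauses:
        for region, entities in region_kb.items():
            hits = sum(1 for entity in entities if entity in clause)
            if hits:
                # Per-clause region classification probability, fused by accumulation over clauses.
                scores[region] += hits / len(entities)
    return max(scores, key=scores.get) if scores else "unknown"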
S340, cataloging the target video by utilizing the plurality of video segments and the video labels to obtain a cataloging result corresponding to the target video.
Continuing to refer to fig. 4, the video segments with the same semantic label are merged to obtain a plurality of merged video segments, and the cataloging result of the target video is determined according to the merged video segments, wherein the cataloging result at least comprises the video tags corresponding to the merged video segments and the start-stop frames corresponding to the merged video segments; finally, the cataloging result is stored.
The embodiment of the present disclosure further provides a video cataloging apparatus for implementing the above-mentioned video cataloging method, which is described below with reference to fig. 5. In the disclosed embodiment, the video cataloging apparatus may be a video cataloging device. The video cataloging device may be an electronic device or a server. The electronic device may include a mobile terminal, a tablet computer, a vehicle-mounted terminal, a wearable device, a Virtual Reality (VR) all-in-one machine, an intelligent home device, and other devices having a communication function. The server may be a cloud server or a server cluster or other devices with storage and computing functions.
Fig. 5 shows a schematic structural diagram of a video cataloging apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the video cataloging apparatus 500 may include: a video feature acquisition module 510, a target video segmentation module 520, a video tag extraction module 530, and a video cataloging module 540.
The video feature obtaining module 510 is configured to obtain video features of a target video;
the target video segmentation module 520 is configured to segment a target video based on video characteristics of the target video to obtain a plurality of video segments;
a video tag extraction module 530, configured to determine, for each video segment, a video tag corresponding to the video segment based on a video text corresponding to the video segment, where the video text includes a first audio text and a first subtitle text, and the video tag includes at least a semantic tag;
the video cataloging module 540 is configured to catalog the target video by using the plurality of video segments and the video tag corresponding to each video segment to obtain a cataloging result corresponding to the target video.
In the embodiment of the disclosure, after the video features of the target video are obtained, the target video is segmented based on the video features of the target video to obtain a plurality of video segments; then, for each video segment, the video tags corresponding to the video segment are extracted based on the video text corresponding to the video segment, the video text includes the first audio text and the first subtitle text, so that multi-modal video information can be obtained, and the video tags at least include semantic tags; the target video is then cataloged by using the plurality of video segments and the video tags corresponding to each video segment to obtain the cataloging result corresponding to the target video. Therefore, an automatic video cataloging method can be provided, which improves video cataloging efficiency: even when facing the massive videos of the all-media era, professionals are no longer needed to catalog the multimedia content, so a large amount of labor cost is saved and the timeliness of video cataloging can be guaranteed; moreover, the target video can be accurately cataloged based on the video tags corresponding to the multi-modal video information, thereby improving the accuracy of video cataloging.
Optionally, the video tag extraction module 530 is further configured to extract high-frequency keywords and abstract keywords corresponding to the video segments from the video text;
fusing the high-frequency keywords and the abstract keywords to obtain video keywords corresponding to the video clips;
and taking the video keywords meeting the preset keyword extraction conditions as semantic tags corresponding to the video segments.
Optionally, the video tag further includes a classification tag, and the classification tag includes at least one of a region classification tag and a domain classification tag.
Optionally, the video tag further comprises: the classification labels comprise region classification labels;
correspondingly, the video tag extraction module 530 is further configured to perform sentence segmentation on the video text to obtain a plurality of text clauses;
identifying the region information in each text clause based on a pre-constructed region entity knowledge base to obtain the region classification probability corresponding to each text clause;
fusing the region classification probabilities corresponding to the text clauses to obtain the region classification probability of the video clip;
and determining the region classification labels corresponding to the video clips according to the region classification probability of the video clips.
Optionally, the video tag further comprises: a classification tag comprising a domain classification tag;
correspondingly, the video tag extraction module 530 is further configured to extract an audio text abstract of the first audio text and a subtitle text abstract of the first subtitle text;
fusing the audio text abstract and the subtitle text abstract to obtain a video abstract;
and inputting the video abstract into a pre-trained domain classification model to obtain a domain classification label corresponding to each video clip.
Optionally, the video feature data includes: video subtitle text;
correspondingly, the video feature obtaining module 510 is further configured to perform shot edge detection on the target video to obtain a plurality of transition image frames;
extracting a group of image frames corresponding to each transition image frame from the target video;
and performing subtitle recognition on each group of image frames to obtain second subtitle texts corresponding to each group of image frames, wherein the plurality of second subtitle texts form video subtitle texts.
Optionally, the video feature obtaining module 510 is further configured to obtain each transition image frame;
and extracting a group of image frames which meet preset video frame extraction conditions from the target video based on the transition image frames for each transition image frame.
Optionally, the video feature data includes video audio text and video subtitle text;
correspondingly, the target video segmentation module 520 is further configured to segment the video audio text to obtain a plurality of first segmentation frame positions;
segmenting the video subtitle text to obtain a plurality of second segmentation frame positions;
and segmenting the target video based on the first segmentation frame position and the second segmentation frame position to obtain a plurality of video segments.
Optionally, the target video segmentation module 520 is further configured to input the video/audio text into a pre-trained transition sentence recognition model, so as to obtain a plurality of first segmentation frame positions output by the transition sentence recognition model.
Optionally, the target video segmentation module 520 is further configured to analyze the video subtitle text by using a time sequence text clustering method to obtain a plurality of second segmentation frame positions.
Optionally, the target video segmentation module 520 is further configured to combine the plurality of first segmentation frame positions with the plurality of second segmentation frame positions to obtain a plurality of video segmentation frame positions of the target video;
and segmenting the target video based on the positions of the plurality of video segmentation frames to obtain a plurality of video segments.
Optionally, the target video segmentation module 520 is further configured to segment the plurality of first segmentation frame positions and the plurality of second segmentation frame positions into a group in pairs according to a time sequence;
for each group of the first segmentation frame position and the second segmentation frame position, taking the first segmentation frame position as a video segmentation frame position under the condition that the time difference between the first segmentation frame position and the second segmentation frame position is less than or equal to a preset time difference threshold value;
and aiming at each group of the first segmentation frame position and the second segmentation frame position, respectively taking the first segmentation frame position and the second segmentation frame position as video segmentation frame positions under the condition that the time difference between the first segmentation frame position and the second segmentation frame position is greater than a preset time difference threshold value.
Optionally, the video cataloging module 540 is further configured to merge the video segments with the same semantic tag to obtain a plurality of merged video segments;
determining a cataloguing result of the target video according to the merged video segments, wherein the cataloguing result at least comprises the following steps: and the video tags corresponding to the merged video clips and the start-stop frames corresponding to the merged video clips.
It should be noted that the video cataloging apparatus 500 shown in fig. 5 may perform each step in the method embodiment shown in fig. 1 to 4, and implement each process and effect in the method embodiment shown in fig. 1 to 4, which are not described herein again.
Fig. 6 shows a schematic diagram of a hardware circuit structure of a video cataloging apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the video cataloging apparatus may include a controller 601 and a memory 602 having stored thereon computer program instructions.
Specifically, the controller 601 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 602 may include a mass storage for information or instructions. By way of example, and not limitation, memory 602 may include a Hard Disk Drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 602 may include removable or non-removable (or fixed) media, where appropriate. Memory 602 may be internal or external to the integrated gateway device, where appropriate. In a particular embodiment, the memory 602 is a non-volatile solid-state memory. In a particular embodiment, the memory 602 includes Read-Only Memory (ROM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate.
The controller 601 performs the steps of the video cataloging method provided by the embodiments of the present disclosure by reading and executing computer program instructions stored in the memory 602.
In one example, the video cataloging device may also include a transceiver 603 and a bus 604. As shown in fig. 6, the controller 601, the memory 602, and the transceiver 603 are connected via a bus 604 and communicate with each other.
Bus 604 includes hardware, software, or both. By way of example, and not limitation, a BUS may include an Accelerated Graphics Port (AGP) or other Graphics BUS, an Enhanced Industry Standard Architecture (EISA) BUS, a Front-Side BUS (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) BUS, an InfiniBand interconnect, a Low Pin Count (LPC) BUS, a memory Bus, a Micro Channel Architecture (MCA) Bus, a Peripheral Component Interconnect (PCI) Bus, a PCI-Express (PCI-X) Bus, a Serial Advanced Technology Attachment (SATA) Bus, a Video Electronics Standards Association Local Bus (VLB) Bus, or other suitable Bus, or a combination of two or more of these. Bus 604 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The following is an embodiment of a video cataloging system provided in the embodiments of the present disclosure, the video cataloging system and the video cataloging method of the above embodiments belong to the same inventive concept, and details that are not described in detail in the embodiments of the video cataloging system may refer to the embodiments of the above video cataloging method.
Fig. 7 shows a schematic structural diagram of a video cataloging system provided by the embodiment of the present disclosure.
As shown in fig. 7, the system includes: a video cataloging device 710 and a display device 720;
the video cataloging equipment 710 is used for acquiring video characteristics of a target video;
segmenting a target video based on video characteristics of the target video to obtain a plurality of video segments;
for each video clip, determining a video tag corresponding to the video clip based on a video text corresponding to the video clip, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video;
a display device 720, configured to receive a target video display operation, where the target video display operation carries a target cataloging tag;
and screening out the target video corresponding to the target cataloging label from the plurality of videos in response to the target video display operation.
The following is an embodiment of a computer-readable storage medium provided in an embodiment of the present disclosure, the computer-readable storage medium and the video cataloging method in the foregoing embodiments belong to the same inventive concept, and details that are not described in detail in the embodiment of the computer-readable storage medium may refer to the embodiment of the video cataloging method.
The present embodiments provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method of video cataloging, the method comprising:
acquiring video characteristics of a target video;
segmenting a target video based on video characteristics of the target video to obtain a plurality of video segments;
for each video clip, determining a video tag corresponding to the video clip based on a video text corresponding to the video clip, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
and cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video.
Of course, the storage medium provided by the embodiments of the present disclosure contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the video cataloging method provided by any embodiments of the present disclosure.
From the above description of the embodiments, it is obvious for a person skilled in the art that the present disclosure can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions to enable a computer cloud platform (which may be a personal computer, a server, or a network cloud platform, etc.) to perform the video cataloging method provided in the embodiments of the present disclosure.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the term "comprises/comprising" is intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (17)

1. A method of video cataloging, comprising:
acquiring video characteristics of a target video;
segmenting the target video based on the video characteristics of the target video to obtain a plurality of video segments;
for each video segment, determining a video tag corresponding to the video segment based on a video text corresponding to the video segment, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
and cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video.
2. The method of claim 1, wherein the determining the video tag corresponding to the video clip based on the video text corresponding to the video clip comprises:
extracting high-frequency keywords and abstract keywords corresponding to the video clips from the video texts;
fusing the high-frequency keywords and the abstract keywords to obtain video keywords corresponding to the video clips;
and taking the video keywords meeting the preset keyword extraction conditions as semantic tags corresponding to the video clips.
3. The method of claim 1, wherein the video tags further comprise classification tags, the classification tags comprising at least one of a regional classification tag and a domain classification tag.
4. The method of claim 1, wherein the video tags further comprise classification tags, the classification tags comprising region classification tags;
wherein, the determining the video label corresponding to the video clip based on the video text corresponding to the video clip further comprises:
performing sentence segmentation on the video text to obtain a plurality of text clauses;
identifying the region information in each text clause based on a pre-constructed region entity knowledge base to obtain the region classification probability corresponding to each text clause;
fusing the region classification probabilities corresponding to the text clauses to obtain the region classification probability of the video clip;
and determining the region classification label corresponding to the video clip according to the region classification probability of the video clip.
5. The method of claim 1, wherein the video tags further comprise classification tags, the classification tags comprising domain classification tags;
determining a video label corresponding to the video clip based on the video text corresponding to the video clip comprises:
extracting an audio text abstract of the first audio text and a subtitle text abstract of the first subtitle text;
fusing the audio text abstract and the subtitle text abstract to obtain a video abstract;
and inputting the video abstract into a pre-trained domain classification model to obtain a domain classification label corresponding to each video segment.
6. The method of claim 1, wherein the video features comprise video subtitle text;
the acquiring of the video characteristics of the target video includes:
carrying out shot edge detection on the target video to obtain a plurality of transition image frames;
extracting a group of image frames corresponding to each transition image frame from the target video;
and performing caption identification on each group of image frames to obtain second caption texts corresponding to each group of image frames, wherein the plurality of second caption texts form the video caption texts.
7. The method of claim 6, wherein said extracting a set of image frames corresponding to each of said transition image frames from said target video comprises:
acquiring each transition image frame;
and extracting a group of image frames meeting preset video frame extraction conditions from the target video based on the transition image frames for each transition image frame.
8. The method of claim 1, wherein the video features comprise video audio text and video subtitle text;
the segmenting of the target video based on the video characteristics of the target video to obtain a plurality of video segments includes:
segmenting the video and audio text to obtain a plurality of first segmentation frame positions;
segmenting the video subtitle text to obtain a plurality of second segmentation frame positions;
and segmenting the target video based on the first segmentation frame position and the second segmentation frame position to obtain the plurality of video segments.
9. The method of claim 8, wherein said segmenting said video audio text into a plurality of first segmentation frame positions comprises:
and inputting the video and audio text into a pre-trained transition statement recognition model to obtain a plurality of first segmentation frame positions output by the transition statement recognition model.
10. The method of claim 8, wherein the segmenting the video subtitle text into a plurality of second segmented frame positions comprises:
and analyzing the video subtitle text by using a time sequence text clustering method to obtain a plurality of second segmentation frame positions.
11. The method of claim 8, wherein the slicing the target video based on the first and second slicing frame positions to obtain the plurality of video segments comprises:
merging the plurality of first segmentation frame positions and the plurality of second segmentation frame positions to obtain a plurality of video segmentation frame positions of the target video;
and segmenting the target video based on the positions of the plurality of video segmentation frames to obtain the plurality of video segments.
12. The method of claim 11, wherein merging the first plurality of sliced frame positions with the second plurality of sliced frame positions to obtain the video sliced frame position of the target video comprises:
dividing the plurality of first segmentation frame positions and the plurality of second segmentation frame positions into a group in pairs according to a time sequence;
for each group of a first segmentation frame position and a second segmentation frame position, taking the first segmentation frame position as the video segmentation frame position when the time difference between the first segmentation frame position and the second segmentation frame position is less than or equal to a preset time difference threshold value;
and aiming at each group of first segmentation frame position and second segmentation frame position, respectively taking the first segmentation frame position and the second segmentation frame position as the video segmentation frame positions under the condition that the time difference between the first segmentation frame position and the second segmentation frame position is greater than the preset time difference threshold value.
13. The method according to claim 1, wherein the cataloging the target video by using the plurality of video segments and the video tag corresponding to each of the video segments to obtain the cataloging result corresponding to the target video comprises:
merging the video segments with the same semantic label to obtain a plurality of merged video segments;
determining a cataloguing result of the target video according to the merged video segment, wherein the cataloguing result at least comprises: and the video label corresponding to the merged video clip and the start-stop frame corresponding to the merged video clip.
14. A video cataloging apparatus, comprising:
the video characteristic acquisition module is used for acquiring the video characteristics of the target video;
the target video segmentation module is used for segmenting the target video based on the video characteristics of the target video to obtain a plurality of video segments;
a video tag extraction module, configured to determine, for each video segment, a video tag corresponding to the video segment based on a video text corresponding to the video segment, where the video text includes a first audio text and a first subtitle text, and the video tag includes at least a semantic tag;
and the video cataloging module is used for cataloging the target video by utilizing the plurality of video segments and the video label corresponding to each video segment to obtain the cataloging result corresponding to the target video.
15. A video cataloging apparatus, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the video cataloging method of any of the preceding claims 1-13.
16. A video cataloging system, comprising: video cataloging equipment and display equipment;
the video cataloguing equipment is used for acquiring video characteristics of a target video;
segmenting the target video based on the video characteristics of the target video to obtain a plurality of video segments;
for each video segment, determining a video tag corresponding to the video segment based on a video text corresponding to the video segment, wherein the video text comprises a first audio text and a first subtitle text, and the video tag at least comprises a semantic tag;
cataloging the target video by using the plurality of video segments and the video label corresponding to each video segment to obtain a cataloging result corresponding to the target video;
the display equipment is used for receiving target video display operation, and the target video display operation carries a target cataloguing label;
and responding to the target video display operation, and screening out the target video corresponding to the target cataloging label from a plurality of videos.
17. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, causes the processor to carry out a video cataloging method according to any one of the preceding claims 1-13.
CN202111265047.7A 2021-10-28 2021-10-28 Video cataloging method, device, equipment, system and medium Pending CN113992944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265047.7A CN113992944A (en) 2021-10-28 2021-10-28 Video cataloging method, device, equipment, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111265047.7A CN113992944A (en) 2021-10-28 2021-10-28 Video cataloging method, device, equipment, system and medium

Publications (1)

Publication Number Publication Date
CN113992944A true CN113992944A (en) 2022-01-28

Family

ID=79743682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265047.7A Pending CN113992944A (en) 2021-10-28 2021-10-28 Video cataloging method, device, equipment, system and medium

Country Status (1)

Country Link
CN (1) CN113992944A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241471A (en) * 2022-02-23 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium
CN114697762A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Processing method, processing device, terminal equipment and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101616264A (en) * 2008-06-27 2009-12-30 中国科学院自动化研究所 News video categorization and system
US20150082349A1 (en) * 2013-09-13 2015-03-19 Arris Enterprises, Inc. Content Based Video Content Segmentation
CN110121033A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 Video categorization and device
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
US20210073551A1 (en) * 2019-09-10 2021-03-11 Ruiwen Li Method and system for video segmentation
CN112633241A (en) * 2020-12-31 2021-04-09 中山大学 News story segmentation method based on multi-feature fusion and random forest model
CN112699687A (en) * 2021-01-07 2021-04-23 北京声智科技有限公司 Content cataloging method and device and electronic equipment
CN112733660A (en) * 2020-12-31 2021-04-30 支付宝(杭州)信息技术有限公司 Method and device for splitting video strip
CN113051966A (en) * 2019-12-26 2021-06-29 中国移动通信集团重庆有限公司 Video keyword processing method and device
CN113539304A (en) * 2020-04-21 2021-10-22 华为技术有限公司 Video strip splitting method and device

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101616264A (en) * 2008-06-27 2009-12-30 中国科学院自动化研究所 News video categorization and system
US20150082349A1 (en) * 2013-09-13 2015-03-19 Arris Enterprises, Inc. Content Based Video Content Segmentation
CN110121033A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 Video categorization and device
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
US20210073551A1 (en) * 2019-09-10 2021-03-11 Ruiwen Li Method and system for video segmentation
CN113051966A (en) * 2019-12-26 2021-06-29 中国移动通信集团重庆有限公司 Video keyword processing method and device
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN113539304A (en) * 2020-04-21 2021-10-22 华为技术有限公司 Video strip splitting method and device
CN112633241A (en) * 2020-12-31 2021-04-09 中山大学 News story segmentation method based on multi-feature fusion and random forest model
CN112733660A (en) * 2020-12-31 2021-04-30 支付宝(杭州)信息技术有限公司 Method and device for splitting video strip
CN112699687A (en) * 2021-01-07 2021-04-23 北京声智科技有限公司 Content cataloging method and device and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241471A (en) * 2022-02-23 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium
CN114241471B (en) * 2022-02-23 2022-06-21 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium
CN114697762A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Processing method, processing device, terminal equipment and medium
US11706505B1 (en) 2022-04-07 2023-07-18 Lemon Inc. Processing method, terminal device, and medium
CN114697762B (en) * 2022-04-07 2023-11-28 脸萌有限公司 Processing method, processing device, terminal equipment and medium
WO2023195914A3 (en) * 2022-04-07 2023-11-30 脸萌有限公司 Processing method and apparatus, terminal device and medium

Similar Documents

Publication Publication Date Title
CN111259215B (en) Multi-mode-based topic classification method, device, equipment and storage medium
CN108833973B (en) Video feature extraction method and device and computer equipment
CN108009293B (en) Video tag generation method and device, computer equipment and storage medium
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN109325148A (en) The method and apparatus for generating information
CN103299324A (en) Learning tags for video annotation using latent subtags
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
CN110287375B (en) Method and device for determining video tag and server
CN113469298B (en) Model training method and resource recommendation method
CN113992944A (en) Video cataloging method, device, equipment, system and medium
CN104463177A (en) Similar face image obtaining method and device
CN113159010A (en) Video classification method, device, equipment and storage medium
CN111639228A (en) Video retrieval method, device, equipment and storage medium
CN114298007A (en) Text similarity determination method, device, equipment and medium
CN113407775B (en) Video searching method and device and electronic equipment
CN111538903A (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN107577667B (en) Entity word processing method and device
CN113987264A (en) Video abstract generation method, device, equipment, system and medium
CN110888896A (en) Data searching method and data searching system thereof
CN114051154A (en) News video strip splitting method and system
US11132393B2 (en) Identifying expressions for target concept with images
CN113343069A (en) User information processing method, device, medium and electronic equipment
CN108182191B (en) Hotspot data processing method and device
CN114157882B (en) Video cataloging method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination