WO2020215696A1 - Method, apparatus, computer device and storage medium for extracting video subtitles - Google Patents

Method, apparatus, computer device and storage medium for extracting video subtitles

Info

Publication number
WO2020215696A1
Authority
WO
WIPO (PCT)
Prior art keywords
subtitle
video
area
preset
pixel area
Prior art date
Application number
PCT/CN2019/118411
Other languages
English (en)
French (fr)
Inventor
肖玉宾
喻红
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020215696A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/439 Processing of audio elementary streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • This application relates to the computer field, in particular to methods, devices, computer equipment and storage media for extracting video subtitles.
  • Training automatic speech recognition systems requires a large amount of labeled data, which is currently obtained mostly in one of two ways.
  • One way is to invest considerable manpower in recording speech and then labeling it manually; the other is to have existing recordings manually transcribed and annotated.
  • Either way, the cost of annotating the data is very high, and the annotation quality is often poor.
  • If the audio data in existing videos were turned into annotated data, costs could be greatly reduced; however, this requires a one-to-one correspondence between the audio and its text content, and most video subtitles on the market are synthesized with the video as a whole, so the video and subtitles are not separated.
  • When existing text positioning methods are used to locate text in video pictures and extract subtitles, the recognition process is complicated and the recognition efficiency is low.
  • The main purpose of this application is to provide a method for extracting video subtitles, which aims to solve the existing technical problems of complicated processing and low recognition efficiency when obtaining the subtitle information corresponding to the audio directly from a video.
  • This application proposes a method for extracting video subtitles, including:
  • acquiring, through a Gaussian mixture model algorithm, the changed pixel area of a second frame picture of the video compared to a first frame picture, wherein the first frame picture and the second frame picture are any two adjacent frame pictures in the video, and there is at least one changed pixel area;
  • determining whether a first changed pixel area exists within a preset area range of the video display interface, wherein the first changed pixel area is included in the changed pixel area;
  • if the first changed pixel area exists within the preset area range of the video display interface, determining whether the first changed pixel area meets the characteristics of the preset subtitle area;
  • if the first changed pixel area meets the characteristics of the preset subtitle area, determining that the first changed pixel area is the subtitle area;
  • extracting the subtitle text from the subtitle area.
  • This application also provides a device for extracting video subtitles, including:
  • the first acquisition module is configured to acquire, through a Gaussian mixture model algorithm, the changed pixel area of a second frame picture of the video compared to a first frame picture, wherein the first frame picture and the second frame picture are any two adjacent frame pictures in the video, and there is at least one changed pixel area;
  • a first determining module configured to determine whether there is a first changed pixel area within a preset area of the video display interface, wherein the first changed pixel area is included in the changed pixel area;
  • the second determining module is configured to determine whether the first changed pixel area meets the characteristics of the preset subtitle area if the first changed pixel area exists in the preset area range of the video display interface;
  • a determining module configured to determine that the first changed pixel area is the subtitle area if the first changed pixel area meets the characteristics of a preset subtitle area
  • the extraction module is used to extract subtitle text from the subtitle area.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method when the computer program is executed.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above method are realized.
  • This application uses the first frame picture, corresponding to the earlier time in the sequence, as the background of the second frame picture, corresponding to the later time, so that the changed pixel area of the second frame picture relative to the first frame picture can be determined through the Gaussian mixture model algorithm;
  • the subtitle area is then determined from the changed pixel area, and the subtitle text is extracted from it, separating the subtitle text from the video display interface and improving the accuracy of subtitle extraction.
  • The distinctive aspect ratio of the subtitle area is used as the characteristic of the preset subtitle area.
  • The aforementioned preset threshold is the minimum aspect ratio of the acquired subtitles, and the value of this minimum aspect ratio r is set to be greater than or equal to one third of the video width.
  • The preset area range in the present application refers to the part of the video display interface close to its bottom edge, occupying one quarter of the video height, and horizontally limited to the middle third of the video width. Pre-selecting this preset area range greatly reduces the amount of data to be processed, which helps to locate the subtitle area quickly and accurately.
  • This application uses an existing audio separation tool to extract and save the audio in the video, and completes the audio annotation through a one-to-one correspondence between the subtitle text and the cut audio files.
  • The above annotation data can be used as sample data for training automatic speech recognition technology, reducing the cost of existing manual annotation and improving annotation quality.
  • FIG. 1 is a schematic flowchart of a method for extracting video subtitles according to an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of an apparatus for extracting video subtitles according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the internal structure of a computer device in an embodiment of the present application.
  • the method for extracting video subtitles in this embodiment includes:
  • S1 Obtain, through the Gaussian mixture model algorithm, the changed pixel area of the second frame picture of the video compared to the first frame picture, where the first frame picture and the second frame picture are any two adjacent frame pictures of the video, and there is at least one changed pixel area.
  • The Gaussian mixture model algorithm of this embodiment combines multiple single Gaussian models, which makes the modeled data distribution more reasonable.
  • The adjacent first frame picture and second frame picture in the image sequence of the video are input into the Gaussian mixture model algorithm; if the pixel values of the first frame picture and the second frame picture at a coordinate (x, y) differ beyond what the background model allows, the pixel at (x, y) is treated as a changed pixel.
  • The first frame picture, which is adjacent to and earlier than the second frame picture, is used as the background of the second frame picture to determine the changed pixel area of the second frame picture compared to the first frame picture; the changed pixel area is the area containing those differing pixels.
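As a minimal illustration of how the earlier frame serves as the background for the later frame, the per-pixel test can be sketched as follows. A real Gaussian mixture model keeps several weighted Gaussians per pixel and updates them over time (OpenCV's `BackgroundSubtractorMOG2` is a ready-made implementation); this pure-Python sketch keeps one fixed-variance Gaussian per pixel. The function name and the constants are illustrative assumptions, not the patent's exact algorithm.

```python
import math

def changed_pixels(frame1, frame2, k=2.5, variance=20.0):
    """Detect the changed pixel area of frame2 against frame1 (step S1).

    Single-Gaussian simplification of the mixture model: the earlier
    frame supplies the per-pixel background mean, and a pixel counts as
    changed when it lies more than k standard deviations from it."""
    sigma = math.sqrt(variance)
    changed = set()
    for y, row in enumerate(frame1):
        for x, background in enumerate(row):
            if abs(frame2[y][x] - background) > k * sigma:
                changed.add((x, y))
    return changed
```

The resulting coordinate set would then be grouped into connected regions before the preset-area and aspect-ratio checks of the later steps.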
  • S2 Determine whether there is a first changed pixel area within a preset area of the video display interface, wherein the first changed pixel area is included in the changed pixel area.
  • The preset area range of this embodiment covers the part of the video display area where subtitles are customarily placed.
  • Specifically, the preset area range covers the middle portion of the bottom area of the video display interface.
  • Coordinate data in the video display interface can be used to position the preset area range, improving the accuracy of locating the subtitle area and reducing the amount of computation during data processing.
  • This embodiment preliminarily determines that a subtitle area may exist by identifying a first changed pixel area within the preset area range.
  • The features of the first changed pixel area are compared with the characteristics of the preset subtitle area to determine whether the first changed pixel area is a subtitle area, thereby improving the accuracy of identifying the subtitle area. If the features of the first changed pixel area match the characteristics of the preset subtitle area, or fall within a preset tolerance, the first changed pixel area is considered to meet those characteristics and is determined to be the subtitle area; otherwise, it is not the subtitle area.
  • the aforementioned features of the preset subtitle area include the height value range of the subtitle area, the aspect ratio of the subtitle area, and so on.
  • The changed pixel area of the second frame picture compared to the first frame picture includes changes in the subtitle area, changes in the video image, and so on; for example, different frame pictures correspond to different subtitle content.
  • the preset rule in this embodiment is set according to the setting characteristics of the subtitle area in the existing video.
  • the existing subtitle area is mostly set in the middle of the bottom area of the video display interface, and it often exists in the form of a wide strip.
  • This embodiment first obtains the changed pixel area corresponding to each frame picture through the Gaussian mixture model algorithm, then determines the subtitle area from the changed pixel area, and finally extracts the subtitle text in that area; in this way the subtitle text corresponding to each picture can be extracted from the video file quickly and accurately, enabling secondary processing of the subtitle text such as annotating audio, optimizing the display, or even producing text training samples.
  • the above-mentioned subtitle area is an image mapping area of the subtitle text, and the subtitle area of different subtitle texts can be distinguished according to different mapping pixels corresponding to different texts.
  • the subtitle text is extracted from the subtitle area by using the text recognition technology in the picture to realize the separation of the subtitle text and the video display interface.
  • Optimizing the subtitle text display includes, for example, setting a 3D display state, changing the display color of the subtitle text, and improving its animation effects, thereby expanding the range of uses of the subtitle text.
  • step S3 of judging whether the first changed pixel area meets the characteristics of the preset subtitle area includes:
  • S31 Calculate the aspect ratio of the first changed pixel area, where the extent of the first changed pixel area along the video playback direction is the width, the extent perpendicular to the width is the height, and the aspect ratio is the width divided by the height.
  • S32 Determine whether the aspect ratio is greater than a preset threshold.
  • the unique aspect ratio feature of the subtitle area is used as the feature of the preset subtitle area.
  • The aforementioned preset threshold is the minimum aspect ratio of the acquired subtitles, and the value of this minimum aspect ratio r is set to be greater than or equal to one third of the video width. If r is set too large, too few subtitle areas in a frame will meet the condition and candidates are easily missed; if r is set too small, the extracted subtitle position becomes inaccurate, the amount of computation increases, and the error in locating the subtitle area grows.
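The aspect-ratio test of S31-S32 might be sketched as below. Since the text ties the threshold r to the video width, this sketch combines a wide-strip ratio test with a minimum-width test; the constant `min_ratio=5.0`, the function name, and the exact combination of the two checks are illustrative assumptions.

```python
def is_subtitle_candidate(region_w, region_h, video_w, min_ratio=5.0):
    """Aspect-ratio check for a changed pixel area (steps S31-S32).

    The candidate must be a wide horizontal strip (width/height above
    min_ratio) and span at least one third of the video width, one
    reading of the r-versus-video-width relation described above."""
    if region_h <= 0:
        return False
    aspect = region_w / region_h
    return aspect > min_ratio and region_w >= video_w / 3
```

A region passing this check would then be treated as the subtitle area in step S4.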
  • the method includes:
  • S20a Obtain the video width and video height of the video, where the extent of the video display interface along the video playback direction is the video width, and the direction perpendicular to the video width is the video height.
  • S20b Set an area that is close to the bottom edge of the video display interface, spans the first preset value in width, and occupies the second preset value in height as the preset area range.
  • The preset area range in this embodiment refers to the part of the video display interface close to its bottom edge, occupying one quarter of the video height, and horizontally limited to the middle third of the video width; that is, the first preset value is one third of the video width, and the second preset value is one quarter of the video height.
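With those two preset values, the preset area range reduces to a simple rectangle. The helper below is a sketch of that computation; the coordinate convention (origin at the top-left corner, as in most image libraries) and the function name are assumptions.

```python
def preset_subtitle_region(video_w, video_h):
    """Compute the preset area range of steps S20a-S20b: a band hugging
    the bottom edge of the display, one quarter of the video height tall,
    horizontally restricted to the middle third of the video width.
    Returns (x0, y0, x1, y1) in top-left-origin image coordinates."""
    y0 = video_h - video_h // 4   # top of the bottom-quarter band
    y1 = video_h                  # bottom edge of the display
    x0 = video_w // 3             # left edge of the middle third
    x1 = video_w - video_w // 3   # right edge of the middle third
    return (x0, y0, x1, y1)
```

Only changed pixel areas intersecting this rectangle would be considered in step S2, which is what cuts the data-processing load.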
  • step S5 of extracting subtitle text from the subtitle area includes:
  • The subtitle area is cut out of the second frame picture and stored separately, so that the subtitle area can be processed accurately.
  • Text recognition is performed by sequentially inputting the subtitle areas of the frame pictures, obtained in video order, into an OCR (optical character recognition) model.
  • OCR text recognition originally refers to the process in which electronic equipment (such as a scanner or digital camera) examines characters printed on paper and translates the character shapes into computer text using character recognition methods; here it refers to analyzing and processing the image file corresponding to the subtitle area to obtain its text and layout information.
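The cutting-and-separating step before OCR can be sketched as a plain crop. The cropped region would then be handed to an OCR engine (for example Tesseract via `pytesseract.image_to_string`; the engine choice is an assumption, since the patent names no specific tool). The frame representation as a row-major 2-D grid is a simplification.

```python
def crop_subtitle_area(frame, box):
    """Cut the subtitle area (step S51-style separation) out of a frame
    so it can be stored separately and fed to an OCR engine.

    frame: row-major 2-D pixel grid (list of rows).
    box:   (x0, y0, x1, y1) rectangle, exclusive upper bounds."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]
```

Storing the crop separately, as the text notes, keeps the OCR step independent of the full video frame.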
  • the preset format in this embodiment includes the video file name of the video, the frame index of the second frame of picture, the text content of the subtitle, the total number of video frames, and the width and height of the video.
  • The aforementioned preset text is the subtitle text content stored in sequence according to the order of the frame pictures in which the subtitles appear.
  • the preset format includes at least the video file name of the video and the frame index corresponding to the second frame picture.
  • The video file name in this embodiment is the file name of the current video, such as AVI.123; the frame index refers to the position of a frame picture among all frames, for example, the picture located at the third frame in time order.
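One annotation record in the preset format listed above (video file name, frame index, subtitle text content, total number of video frames, video width and height) might be serialized as follows. JSON and the field names are illustrative choices; the patent does not fix a serialization.

```python
import json

def make_annotation(video_name, frame_index, subtitle_text,
                    total_frames, width, height):
    """Build one subtitle annotation record in the preset format:
    video file name, frame index, subtitle text, total frame count,
    and video width/height, serialized as one JSON line."""
    record = {
        "video": video_name,
        "frame_index": frame_index,
        "text": subtitle_text,
        "total_frames": total_frames,
        "width": width,
        "height": height,
    }
    return json.dumps(record, ensure_ascii=False)
```

Appending one such line per recognized subtitle to the preset file keeps the frame-index-to-text correspondence needed by the later completeness check.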
  • The annotation content, which includes the video file name of the video and the frame index of the second frame picture, is used to preliminarily determine whether there is repeated subtitle text. If the annotation content differs, the text content of the first subtitle text and the second subtitle text is different; if the annotation content is the same, it is then determined whether the specific text content is the same, and if not, the text content of the two subtitle texts differs.
  • This step-by-step judgment preliminarily decides whether subtitle texts are the same from the annotation information alone, avoiding repeated calls to the character recognition step that translates character shapes into computer text, which shortens the process and speeds up the response.
  • By identifying the changed pixel area, this embodiment avoids extracting the same subtitle repeatedly when multiple frame pictures share it, and through the above step-by-step judgment it eliminates repeated subtitle text caused by video background interference, purifying the subtitle text in the preset file.
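The purification of repeated subtitle text described above can be sketched as a pass over consecutive annotation records: the cheap annotation field (the video file name) is compared first, and the recognized text is only compared when the annotations match. The `{'video', 'frame_index', 'text'}` record shape is a hypothetical simplification of the preset file.

```python
def dedupe_subtitles(records):
    """Drop consecutive repeats of the same subtitle carried across
    frames, using the step-by-step judgment described above: compare
    the annotation content (video file name) before the text itself."""
    kept = []
    for rec in records:
        prev = kept[-1] if kept else None
        if (prev is not None
                and prev["video"] == rec["video"]      # cheap check first
                and prev["text"] == rec["text"]):      # then the text
            continue  # same subtitle spanning several frames: skip it
        kept.append(rec)
    return kept
```

The first frame on which each distinct subtitle appears is kept, which preserves the frame-index ordering of the preset file.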
  • After the step S5 of extracting the subtitle text from the subtitle area, the method includes:
  • S7 Determine the corresponding first audio file within the range of the start time and the end time.
  • S8 Cut and separate the first audio file from the audio file corresponding to the video by using an audio cutting tool.
  • The time position of the previous buffer and the time position of the current new buffer are taken as the time interval of the subtitle area of the frame corresponding to the earlier time sequence, and this time interval is saved in association with the subtitle text of that subtitle area.
  • An existing audio separation tool is used to extract and save the audio in the video, and the subtitle text is matched one-to-one with the cut audio files to complete the audio annotation.
  • The above annotation data can be used as sample data for training automatic speech recognition technology, reducing the cost of existing manual annotation and improving annotation quality.
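The audio cutting of steps S7-S8 could be driven by ffmpeg, one common choice of "audio cutting tool" (the patent names no specific tool, so this is an assumption). The helper below only builds the command line; `-ss`/`-to` select the subtitle's time interval, `-vn` drops the video stream, and `pcm_s16le` writes plain 16-bit PCM suitable for speech-recognition training data.

```python
def ffmpeg_cut_cmd(video_path, start_s, end_s, out_path):
    """Build an ffmpeg command that cuts the audio between a subtitle's
    start and end times (in seconds) into a standalone WAV file."""
    return ["ffmpeg", "-i", video_path,
            "-ss", f"{start_s:.3f}", "-to", f"{end_s:.3f}",
            "-vn", "-acodec", "pcm_s16le", out_path]
```

The command can then be run with `subprocess.run(cmd, check=True)`, and the output path stored alongside the subtitle text to form one annotated sample.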
  • the method further includes:
  • S541 According to the video file name of the video and the frame index of the second frame picture, determine whether the preset file contains an empty subtitle file corresponding to a first frame index, where the first frame index is included in the indexes of all frames in the preset file;
  • S543 If so, extract the designated subtitle text of the frame picture corresponding to the first frame index through the text positioning model.
  • The correspondence between frame indexes and subtitle text is used to find subtitle text that was missed during extraction, so as to ensure the integrity of the subtitle text for the entire video file.
  • If an empty subtitle file corresponding to the first frame index is found, meaning no subtitle text corresponds to it, it is determined that an extraction was missed, and the frame picture corresponding to the first frame index is input into the text positioning model for subtitle text positioning and extraction.
  • The above text localization model is CTPN, which combines a CNN with an LSTM deep network.
  • CTPN is an improvement on Faster R-CNN and can effectively detect horizontally distributed text in complex scenes, such as text in video pictures. Although its recognition process is complex and its recognition efficiency is low, its recognition accuracy is high, so it can effectively supplement the subtitle areas missed by the Gaussian mixture model algorithm and improve the integrity of the subtitle text for the entire video file.
  • The first frame picture, corresponding to the earlier time in the sequence, is used as the background of the second frame picture, corresponding to the later time, so that the changed pixel area of the second frame picture compared to the first frame picture is determined through the Gaussian mixture model algorithm;
  • the subtitle area is then determined from the changed pixel area, and the subtitle text is extracted from it, separating the subtitle text from the video display interface and improving the accuracy of subtitle extraction.
  • The distinctive aspect ratio of the subtitle area is used as the characteristic of the preset subtitle area.
  • The aforementioned preset threshold is the minimum aspect ratio of the acquired subtitles, and the value of this minimum aspect ratio r is set to be greater than or equal to one third of the video width.
  • The preset area range in this embodiment refers to the part of the video display interface close to its bottom edge, occupying one quarter of the video height, and horizontally limited to the middle third of the video width.
  • Selecting the boundaries of the preset area range in advance greatly reduces the amount of data to be processed, which helps to locate the subtitle area quickly and accurately.
  • An existing audio separation tool is used to extract and save the audio in the video, and the subtitle text is matched one-to-one with the cut audio files to complete the audio annotation.
  • The above annotation data can be used as sample data for training automatic speech recognition technology, reducing the cost of existing manual annotation and improving annotation quality.
  • the device for extracting video subtitles in this embodiment includes:
  • The first acquisition module 1 is used to acquire, through the Gaussian mixture model algorithm, the changed pixel area of the second frame picture of the video compared to the first frame picture, where the first frame picture and the second frame picture are any two adjacent frame pictures in the video, and there is at least one changed pixel area.
  • The Gaussian mixture model algorithm of this embodiment combines multiple single Gaussian models, which makes the modeled data distribution more reasonable.
  • The adjacent first frame picture and second frame picture in the image sequence of the video are input into the Gaussian mixture model algorithm; if the pixel values of the first frame picture and the second frame picture at a coordinate (x, y) differ beyond what the background model allows, the pixel at (x, y) is treated as a changed pixel.
  • The first frame picture, which is adjacent to and earlier than the second frame picture, is used as the background of the second frame picture to determine the changed pixel area of the second frame picture compared to the first frame picture; the changed pixel area is the area containing those differing pixels.
  • the first determining module 2 is configured to determine whether there is a first changed pixel area within a preset area of the video display interface, wherein the first changed pixel area is included in the changed pixel area.
  • The preset area range of this embodiment covers the part of the video display area where subtitles are customarily placed.
  • Specifically, the preset area range covers the middle portion of the bottom area of the video display interface.
  • Coordinate data in the video display interface can be used to position the preset area range, improving the accuracy of locating the subtitle area and reducing the amount of computation during data processing.
  • This embodiment preliminarily determines that a subtitle area may exist by identifying a first changed pixel area within the preset area range.
  • the second determining module 3 is configured to determine whether the first changed pixel area meets the characteristics of the preset subtitle area if the first changed pixel area exists in the preset area range of the video display interface.
  • The features of the first changed pixel area are compared with the characteristics of the preset subtitle area to determine whether the first changed pixel area is a subtitle area, thereby improving the accuracy of identifying the subtitle area. If the features of the first changed pixel area match the characteristics of the preset subtitle area, or fall within a preset tolerance, the first changed pixel area is considered to meet those characteristics and is determined to be the subtitle area; otherwise, it is not the subtitle area.
  • the aforementioned features of the preset subtitle area include the height value range of the subtitle area, the aspect ratio of the subtitle area, and so on.
  • the determining module 4 is configured to determine that the first changed pixel area is the subtitle area if the first changed pixel area meets the characteristics of the preset subtitle area.
  • The changed pixel area of the second frame picture compared to the first frame picture includes changes in the subtitle area, changes in the video image, and so on; for example, different frame pictures correspond to different subtitle content.
  • the preset rule in this embodiment is set according to the setting characteristics of the subtitle area in the existing video.
  • the existing subtitle area is mostly set in the middle of the bottom area of the video display interface, and it often exists in the form of a wide strip.
  • This embodiment first obtains the changed pixel area corresponding to each frame picture through the Gaussian mixture model algorithm, then determines the subtitle area from the changed pixel area, and finally extracts the subtitle text in that area; in this way the subtitle text corresponding to each picture can be extracted from the video file quickly and accurately, enabling secondary processing of the subtitle text such as annotating audio, optimizing the display, or even producing text training samples.
  • the above-mentioned subtitle area is the image mapping area of the subtitle text, and the subtitle area of different subtitle texts can be distinguished according to different mapping pixels corresponding to different texts.
  • the extraction module 5 is used for extracting subtitle text from the subtitle area.
  • the subtitle text is extracted from the subtitle area by using the text recognition technology in the picture to realize the separation of the subtitle text and the video display interface.
  • Optimizing the subtitle text display includes, for example, setting a 3D display state, changing the display color of the subtitle text, and improving its animation effects, thereby expanding the range of uses of the subtitle text.
  • the second judgment module includes:
  • a calculation unit configured to calculate the aspect ratio of the first changed pixel area, where the extent of the first changed pixel area along the video playback direction is the width, the extent perpendicular to the width is the height, and the aspect ratio is the width divided by the height;
  • a judging unit configured to judge whether the aspect ratio is greater than a preset threshold;
  • a first determining unit configured to determine that the first changed pixel area meets the characteristics of the preset subtitle area if the aspect ratio is greater than the preset threshold;
  • a second determining unit configured to determine that the first changed pixel area does not meet the characteristics of the preset subtitle area if the aspect ratio is not greater than the preset threshold.
  • the unique aspect ratio feature of the subtitle area is used as the feature of the preset subtitle area.
  • The aforementioned preset threshold is the minimum aspect ratio of the acquired subtitles, and the value of this minimum aspect ratio r is set to be greater than or equal to one third of the video width. If r is set too large, too few subtitle areas in a frame will meet the condition and candidates are easily missed; if r is set too small, the extracted subtitle position becomes inaccurate, the amount of computation increases, and the error in locating the subtitle area grows.
  • the device for extracting video subtitles includes:
  • the second acquisition module is configured to acquire the video width and video height of the video, where the extent of the video display interface along the video playback direction is the video width, and the direction perpendicular to the video width is the video height.
  • the setting module is configured to set the preset value equal to the first preset value, and to set the region near the bottom edge of the video display interface that occupies the second preset value as the preset area range.
  • the preset area range in this embodiment is the region of the video display interface near its bottom edge that occupies one quarter of the video height, intersected with the middle third of the video width; that is, the first preset value is one third of the video width and the second preset value is one quarter of the video height.
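Assuming conventional image coordinates (origin at the top-left corner, y growing downward — an assumption, since the patent does not fix a coordinate system), the preset region described above could be computed as:

```python
def preset_subtitle_region(video_w: int, video_h: int):
    """Bottom quarter of the height intersected with the middle third
    of the width, returned as (x0, y0, x1, y1)."""
    x0 = video_w // 3            # left edge of the middle third
    x1 = 2 * video_w // 3        # right edge of the middle third
    y0 = video_h - video_h // 4  # top edge of the bottom quarter
    return x0, y0, x1, video_h
```

For a 1920x1080 video this yields the rectangle (640, 810, 1280, 1080).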
  • the extraction module 5 includes:
  • the separating unit is configured to cut and separate the subtitle area from the second frame picture.
  • the recognition unit is configured to recognize the subtitle text through the image text recognition algorithm on the separated subtitle area.
  • the assignment unit is used to copy the subtitle text to the preset file.
  • the marking unit is used to mark and store the subtitle text in a preset format.
  • the subtitle area is cut and separated from the second frame of picture and stored separately, so as to accurately process the subtitle area.
  • text recognition is performed by feeding the subtitle regions of the frames, obtained in video order, one by one into an OCR (optical character recognition) model.
  • OCR text recognition is the process by which an electronic device (such as a scanner or digital camera) examines printed characters and translates the character shapes into computer text by a character recognition method; here, the image file corresponding to the subtitle area is scanned, analyzed, and processed to obtain its text and layout information.
  • the preset format in this embodiment includes the video file name of the video, the frame index of the second frame of picture, the text content of the subtitle, the total number of video frames, and the width and height of the video.
  • the foregoing preset text is the subtitle text content sequentially stored according to the sequence of the frame picture where the subtitle is located.
  • the preset format includes at least the video file name of the video and the frame index corresponding to the second frame picture.
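A sketch of one possible record layout for this preset format (the dictionary keys are assumptions chosen for illustration, not named by the patent):

```python
def make_subtitle_record(video_name, frame_index, text,
                         total_frames, video_w, video_h):
    """Bundle one recognized subtitle with the annotation fields the
    preset format lists: video file name, frame index, text content,
    total frame count, and video dimensions."""
    return {
        "video": video_name,
        "frame_index": frame_index,
        "text": text,
        "total_frames": total_frames,
        "width": video_w,
        "height": video_h,
    }
```

Records built this way can then be appended to the preset file in frame order.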
  • the extraction module 5 includes:
  • the second determining unit is configured to determine whether there is a second subtitle text with the same annotation information as the first subtitle text in the preset file according to the video file name of the video and the frame index corresponding to the second frame picture;
  • the first subtitle text and the second subtitle text are respectively included in all the subtitle texts in the preset file.
  • the third determining unit is configured to determine, if the preset file contains a second subtitle text with the same annotation information as the first subtitle text, whether the text content of the two subtitle texts is the same.
  • the deleting unit is configured to delete the first subtitle text or the second subtitle text if the text content of the first subtitle text and the second subtitle text are the same.
  • the video file name in this embodiment is the file name of the current video, such as AVI.123; the frame index is the position of a frame picture among all frames, for example the picture that is third in time order.
  • the annotation content, which includes the video file name of the video and the frame index of the second frame picture, is used to make a preliminary judgment on whether duplicate subtitle text exists: if the annotation content differs, the text content of the first and second subtitle texts is not the same; if the annotation content is the same, the specific text content is then compared, and if it differs, the text content of the two subtitle texts is not the same.
  • this step-by-step judgment uses the annotation information to decide preliminarily whether two subtitle texts are the same, which avoids repeatedly invoking the character recognition step that translates character shapes into computer text, shortening the procedure and speeding up the response.
  • by identifying changed pixel areas, this embodiment avoids extracting subtitle text repeatedly when multiple consecutive frames carry the same subtitle, and the step-by-step judgment above removes subtitle text extracted repeatedly because of video background interference, cleaning up the subtitle text in the preset file.
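The two-stage duplicate check can be sketched as follows (the record layout is assumed to be a simple dictionary; this is an illustration, not the patent's implementation):

```python
def dedupe_subtitles(records):
    """Two-stage duplicate removal: first compare the cheap annotation
    key (video file name, frame index); only on a key match compare
    the recognized text, and drop the later record when both agree."""
    kept, seen = [], {}
    for rec in records:
        key = (rec["video"], rec["frame_index"])
        if key in seen and seen[key] == rec["text"]:
            continue  # same annotation and same content: duplicate
        seen[key] = rec["text"]
        kept.append(rec)
    return kept
```

Records whose annotation keys differ are never compared on text, which is what saves the extra recognition work.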
  • the device for extracting video subtitles includes:
  • the third acquiring module is used to acquire the start time and end time of the second caption text.
  • the determining module is used to determine the corresponding first audio file within the range of the start time and the end time.
  • the interception module is used to intercept and separate the first audio file from the audio file corresponding to the video by using an audio interception tool.
  • the marking module is configured to perform audio marking on the second subtitle text and the first audio file in a one-to-one correspondence.
  • when the pixel difference from the previously buffered subtitle area is below a preset threshold, the previously buffered time position and the current buffered time position are taken as the time interval of the subtitle area of the frame in the previous time sequence, and that interval is stored in association with the subtitle text of that subtitle area.
  • an existing audio separation tool is used to extract and save the audio in the video, and the subtitle text is matched one-to-one with the cut audio files to complete the audio annotation.
  • the resulting annotated data can serve as sample data for training automatic speech recognition systems, reducing the cost of existing manual annotation and improving annotation quality.
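The frame-index-to-time conversion used here (frame index multiplied by 1/frame rate) can be sketched as:

```python
def frame_time(frame_index: int, fps: float) -> float:
    """Time position of a frame in seconds: frame_index * (1 / fps)."""
    return frame_index / fps

def subtitle_interval(prev_index: int, curr_index: int, fps: float):
    """Start and end time of the subtitle that stayed on screen from
    the previously buffered frame up to the current frame."""
    return frame_time(prev_index, fps), frame_time(curr_index, fps)
```

At 25 fps, a subtitle buffered at frame 75 and replaced at frame 150 spans seconds 3.0 to 6.0, the interval that would be cut from the audio track.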
  • the extraction module 5 further includes:
  • the fourth determining unit is configured to determine whether there is an empty subtitle file corresponding to the first frame index in the preset file according to the video file name of the video and the frame index of the second frame picture, wherein the first The frame index is included in all frame indexes in the preset file;
  • the input unit is configured to input the frame picture corresponding to the first frame index into the text positioning model when there is an empty subtitle file corresponding to the first frame index in the preset file;
  • the fifth determining unit is configured to determine whether the designated subtitle text of the frame picture corresponding to the first frame index is extracted according to the text positioning model;
  • a supplementary unit configured to supplement the designated subtitle text at the position corresponding to the first frame index in the preset file, if that text is extracted by the text positioning model;
  • a marking unit configured to mark the position corresponding to the first frame index in the preset file as an empty caption, if the designated subtitle text of the corresponding frame picture is not extracted by the text positioning model.
  • the corresponding relationship between the frame index and the subtitle text is used to find the missing and extracted subtitle text, so as to ensure the integrity of the subtitle text in the entire video file.
  • when an empty subtitle file corresponding to the first frame index is found, i.e. no subtitle text corresponds to that index, it is determined that an extraction was missed, and the frame picture corresponding to the first frame index is input into the text positioning model so that subtitle text can be located and extracted according to that model.
  • the above text localization model is CTPN, which combines a CNN with an LSTM deep network.
  • CTPN is an improvement on Faster R-CNN and can effectively detect horizontally distributed text in complex scenes, such as text in video frames; although its recognition process is complex and its efficiency is low, its accuracy is high, so it can effectively supplement subtitle areas missed by the Gaussian mixture model algorithm and improve the completeness of the subtitle text across the whole video file.
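The bookkeeping that selects frames for the text localization model might look like the following (a sketch; the mapping layout is an assumption, and the CTPN call itself is out of scope here):

```python
def frames_needing_relocation(subtitles):
    """Frame indexes whose preset-file entry is empty; these are the
    frames to re-examine with a text localization model such as CTPN.

    `subtitles` maps frame index -> recognized text ('' or None when
    the entry is an empty subtitle file)."""
    return [idx for idx, text in sorted(subtitles.items()) if not text]
```

Each returned index would have its frame fed to the localization model; on success the text is written back, otherwise the position is marked as an empty caption.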
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus; the processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions, and a database.
  • the above-mentioned readable storage medium includes non-volatile readable storage medium and volatile readable storage medium.
  • the internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium.
  • the database of the computer device is used to store data such as extracted video subtitles.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • when executed, the computer-readable instructions carry out the processes of the foregoing method embodiments.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium on which computer-readable instructions are stored.
  • the processes of the foregoing method embodiments are executed.
  • the above-mentioned readable storage medium includes non-volatile readable storage medium and volatile readable storage medium.

Abstract

This application discloses a method, apparatus, computer device, and storage medium for extracting video subtitles. The method includes: obtaining, through a Gaussian mixture model algorithm, the changed pixel area of a second frame of a video relative to a first frame; determining whether a first changed pixel area exists within a preset area range of the video display interface; if the first changed pixel area exists within the preset area range, determining whether it satisfies the preset subtitle-area feature; if it does, determining that the first changed pixel area is the subtitle area; and extracting subtitle text from the subtitle area. By using the Gaussian mixture model algorithm to determine the pixel area that changes between the first and second frames, and then locating the subtitle area from that changed area, the accuracy of subtitle extraction is improved.

Description

提取视频字幕的方法、装置、计算机设备及存储介质
本申请要求于2019年04月22日提交中国专利局、申请号为2019103249786,发明名称为“提取视频字幕的方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及到计算机领域,特别是涉及到提取视频字幕的方法、装置、计算机设备及存储介质。
背景技术
目前自动语音识别技术的训练需要大量的标注数据,但是想要获取标注数据目前大多采用两类方法,一类方法是投入大量人力去录音,然后人工标注;另一类是将已有的录音进行人工的听写标注,标注数据的成本非常高,且标注的质量并不高。若将现有视频中的音频数据制成标注数据可大大节省成本,但音频数据制成标注数据是需要一一对应的文字内容对应,而目前市场上绝大多数的视频字幕都是与视频合成为一体,未对视频和字幕进行分离,现有通过文本定位的方式从视频图片中进行文本定位并提取字幕,识别过程复杂、识别效率较低。
技术问题
本申请的主要目的为提供提取视频字幕的方法,旨在解决现有从视频中直接获取到音频对应的字幕信息时过程复杂行且识别效率低的技术问题。
技术解决方案
本申请提出一种提取视频字幕的方法,包括:
通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和所述第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个;
判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域;
若所述视频显示界面的预设区域范围内存在所述第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征;
若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域;
从所述字幕区域中提取字幕文字。
本申请还提供了一种提取视频字幕的装置,包括:
第一获取模块,用于通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和所述第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个;
第一判断模块,用于判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域;
第二判断模块,用于若所述视频显示界面的预设区域范围内存在第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征;
判定模块,用于若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域;
提取模块,用于从所述字幕区域中提取字幕文字。
本申请还提供了一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现上述方法的步骤。
本申请还提供了一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现上述的方法的步骤。
有益效果
本申请通过将前一时间序列对应的第一帧图片,作为后一时间序列对应的第二帧图片的背景,以便通过混合高斯模型算法确定第二帧图片相比于第一帧图片的变化像素区域,进而根据变化像素区域确定字幕区域,以便从字幕区域中提取字幕文字,实现字幕文字与视频显示界面的分离,提高字幕提取的精准度。通过字幕区域特有的宽高比特征作为预设字幕区特征。上述预设阈值为获取到字幕的最小宽高比,上述最小宽高比r的设定值范围为r大于等于视频宽的三分之一。以防r设置得太大会造成视频的一个帧图片中,满足条件的字幕区域太少,容易漏选;r设置得太小会造成提取的字幕位置不准确,计算量增大,且使定位字幕区域的误差增大。本申请的预设区域范围内指视频显示界面中靠近所述视频显示界面的底部边缘,占比所述视频高的四分之一区域,与位于中部区域的视频宽的三分之一区域的交界区域,通过预先设定选择的预设区域范围可极大地降低数据处理量,有利于快速且准确的定位到字幕区域。本申请采用现有的音频分离工具将视频中的音频提取出来并保存,并将字幕文字与切割后的音频文件一一对应完成音频标注,上述标注数据可用于自动语音识别技术的训练时的样本数据,以降低现有人工标注数据的成本,且提高标注数据的质 量。
附图说明
图1本申请一实施例的提取视频字幕的方法流程示意图;
图2本申请一实施例的提取视频字幕的装置结构示意图;
图3本申请一实施例的计算机设备内部结构示意图。
本发明的最佳实施方式
参照图1,本实施例的提取视频字幕的方法,包括:
S1:通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个。
本实施例的混合高斯模型算法是多个单模型的组合,提高数据分配的合理性。本实施例中视频的每帧图片中的每个像素由多个单模型描述:P(p)={[w i(x,y,t),u i(x,y,t),σ i(x,y,t) 2]},i=1,2,......,k,k的值为3到5,表示混合高斯模型中单模型的个数,w i(x,y,t)表示每个单模型的权重,满足
∑_{i=1}^{k} w_i(x,y,t) = 1
u i(x,y,t)表示每个单模型的均值,σ i(x,y,t) 2表示每个单模型对应的方差,上述权重、均值和方差共同确定一个单模型。本实施例中通过将视频的图像序列中的相邻的第一帧图片和第二帧图片输入到混合高斯模型算法中,若第一帧图片和第二帧图片在(x,y)处的像素值对于i=1,2,......,k满足I(x,y,t)-u i(x,y,t)≤λ*σ i(x,y,t),则像素值与该单模型匹配,则判定该像素值为背景,若不存在与该像素值匹配的单模型,则为前景,即视频内容。本实施例通过将相邻且时间早于第二帧图片的第一帧图片,作为第二帧图片的背景,以便确定第二帧图片相比于第一帧图片的变化像素区域,上述变化像素区域为包括差异像素点的区域。
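The per-pixel matching rule above, |I(x,y,t) − u_i(x,y,t)| ≤ λ·σ_i(x,y,t) for some component i, can be illustrated for a single pixel as a pure-Python sketch (λ = 2.5 is a common choice for this kind of background model, not a value fixed by the text):

```python
def pixel_is_background(value, models, lam=2.5):
    """A pixel matches the mixture (i.e. is background) when its value
    lies within lam standard deviations of any component's mean.

    `models` is a list of (weight, mean, variance) single Gaussians,
    mirroring the w_i, u_i, sigma_i^2 triples in the text."""
    return any(abs(value - mean) <= lam * var ** 0.5
               for _w, mean, var in models)
```

A pixel value near one component's mean is classified as background; a value far from every component is foreground, i.e. changed video content such as new subtitle pixels.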
S2:判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域。
本实施例的预设区域范围包括现有字幕常设置的视频显示区域,比如,预设区域范围包括视频显示界面的底部区域的中间位置范围,可通过视频显示界面中的坐标数据,实现定位预设区域范围,以便提高获取字幕区域的精准性,降低数据处理过程中的计算量。本实施通过识别预设区域范围内存在的第一变化像素区域,初步确定可能存在字幕区域。
S3:若所述视频显示界面的预设区域范围内存在第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征。
本实施例通过将第一变化像素区域的特征与预设字幕区特征进行比较,以便通过预设字幕区特征确定第一变化像素区域是否为字幕区域,提高确定字幕区域的精准度。第一变化像素区域的特征与预设字幕区特征一致,或处于预设差异范围之内,则均认为第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域,否则第一变化像素区域不是所述字幕区域。上述预设字幕区特征包括字幕区的高度值范围、字幕区的宽高比等。
S4:若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域。
本实施例的视频中第二帧图片相比于第一帧图片的变化像素区域中,包括字幕区域的变化、视频图像变化等,比如不同帧图像对应不同的字幕内容。本实施例的预设规则遵循现有视频中字幕区域的设置特点进行设定。比如现有字幕区域多设置于视频显示界面的底部区域中间位置,且常以宽条状形态存在。本实施例首先通过混合高斯模型算法获取各帧图片对应的变化像素区域,然后再从变化像素区域中确定字幕区域,进而实现对字幕区域的字幕文字的提取,可快速从视频文件中准确提取相对应的字幕文字,以便将字幕文字进行二次处理,比如标注音频、优化显示过程甚至制作文本训练样本等。上述字幕区域为字幕文字的图像映射区域,根据不同的文字对应的映射像素不同,进而区别不同字幕文字的字幕区域。
S5:从所述字幕区域中提取字幕文字。
本实施例通过图片中文字识别技术,从所述字幕区域中提取字幕文字,实现字幕文字与视频显示界面的分离。以便对字幕文字实现进一步的优化处理。包括优化字幕文字的显示方式,比如设置为3D显示状态、改变字幕文字的显示颜色,优化字幕文字的动画显示效果等,扩大字幕文字的使用范围。
进一步地,所述判断所述第一变化像素区域是否满足预设字幕区特征的步骤S3,包括:
S31:计算所述第一变化像素区域的宽高比,其中所述第一变化像素区域中沿所述视频时序播放方向为所述宽,垂直于所述宽的方向为所述高,所述宽高比为所述宽除以所述高。
S32:判断所述宽高比是否大于预设阈值。
S33:若所述宽高比大于预设阈值,则判定所述第一变化像素区域满足所述预设字幕区特征。
S34:若所述宽高比不大于预设阈值,则判定所述第一变化像素区域不满足所述预设字幕区特征。
本实施例通过字幕区域特有的宽高比特征作为预设字幕区特征。上述预设阈值为获取到字幕的最小宽高比,上述最小宽高比r的设定值范围为r大于等于视频宽的三分之一。以防r设置得太大会造成视频的一个帧图片中,满足条件的字幕区域太少,容易漏选;r设置得太小会造成提取的字幕位置不准确,计算量增大,且使定位字幕区域的误差增大。
进一步地,所述判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域的步骤S2之前,包括:
S20a:获取所述视频的视频宽和视频高,其中所述视频显示界面中沿所述视频时序播放方向为所述视频宽,垂直于所述视频宽的方向为所述视频高。
S20b:设定所述预设值等于第一预设值,设定靠近所述视频显示界面的底部边缘,占比第二预设值的区域范围为所述预设区域范围。
本实施例的预设区域范围内指视频显示界面中靠近所述视频显示界面的底部边缘,占比所述视频高的四分之一区域,与位于中部区域的视频宽的三分之一区域的交界区域,即上述第一预设值为所述视频宽的三分之一,第二预设值为所述视频高的四分之一。通过预先设定选择的预设区域范围可极大地降低数据处理量,有利于快速且准确的定位到字幕区域。
进一步地,所述从所述字幕区域中提取字幕文字的步骤S5,包括:
S51:将所述字幕区域从所述第二帧图片中切割分离。
S52:将分离后的所述字幕区域通过图像文字识别算法识别出所述字幕文字。
S53:将所述字幕文字复制到预设文件中。
S54:通过预设格式标注所述字幕文字并存储。
本实施例通过将字幕区域从第二帧图片中切割分离,进行单独存储,以便精准地处理字幕区域。通过将按照视频时序依次获得的各帧图片中的字幕区域按照顺序依次输入到OCR(optical character recognition)文字识别模型中进行文字识别。OCR文字识别是指电子设备(例如扫描仪或数码相机)检查纸上打印的字符,然后用字符识别方法将字符形状翻译成计算机文字的过程;通 过对字幕区域对应的文本资料进行扫描,然后对字幕区域对应的图像文件进行分析处理,获取文字及版面信息的过程。本实施例的预设格式包括视频的视频文件名、第二帧图片的帧索引、字幕的文字内容、视频总帧数及视频宽高尺寸等。上述预设文本为按照字幕所在帧图片的时序,依次存储的字幕文字内容。
进一步地,所述预设格式至少包括所述视频的视频文件名和第二帧图片对应的帧索引,所述通过预设格式标注所述字幕文字并存储的步骤S54之后,包括:
S55:根据所述视频的视频文件名和所述第二帧图片对应的帧索引,判断所述预设文件中是否存在与第一字幕文字具有相同标注信息的第二字幕文字,其中所述第一字幕文字和所述第二字幕文字,分别包含于所述预设文件中所有所述字幕文字。
S56:若所述预设文件中存在与第一字幕文字具有相同标注信息的第二字幕文字,则判断所述第一字幕文字和所述第二字幕文字的文字内容是否相同。
S57:若所述第一字幕文字和所述第二字幕文字的文字内容相同,则删除所述第一字幕文字或所述第二字幕文字。
本实施例的视频文件名为当前视频的文件名,比如AVI.123等等;上述帧索引指帧图片处于所有帧中的排序,比如按照时序位于第三帧的图片。本实施例通过标注格式中标注内容,包括视频的视频文件名和所述第二帧图片的帧索引,初步判断是否出现重复字幕文字,若标注内容不同,则第一字幕文字和第二字幕文字的文字内容不相同;若标注内容相同,再判断具体的文字内容是否相同,若不相同,则第一字幕文字和第二字幕文字的文字内容不相同。通过逐步判断的方式,以便通过标注信息初步判断字幕文字是否相同,以便节省重复调用字符识别方法将字符形状翻译成计算机文字的过程,以节省流程,加快响应效率。本实施例通过识别变化像素区域,避免了连续多帧图片具有相同字幕时重复提取字幕文字的情况,且通过上述的逐步判断方式剔除由于视频背景干扰,而导致重复提取的字幕文字,以净化预设文件中的字幕文字。
进一步地,从所述字幕区域中提取字幕文字的步骤S5之后,包括:
S6:获取所述第二字幕文字的起始时间和终止时间。
S7:确定所述起始时间和终止时间范围内对应的第一音频文件。
S8:将所述第一音频文件通过音频截取工具从所述视频对应的音频文件中截取分离。
S9:将所述第二字幕文字与所述第一音频文件一一对应进行音频标注。
本实施例通过遍历视频中所有的变化像素区域a1,a2,a3,…an,计算各变化像素区域的宽高比(w/h)是否大于设定r,如果大于设定r,则从当前帧的图片中切割对应的字幕区域,并将当前帧的帧索引换算成对应的时间[帧索引*(1/视频帧率)就得到当前帧在视频中的时间位置],并缓存该时间点的字幕区域,将本次缓存的字幕区域与上一时序缓存的字幕区域进行像素对比,差异小于预设阈值时,则上一次缓存的时间位置与当前新的缓存时间位置作为上一时序对应帧的字幕区域的时间间隔,并将时间间隔与上一时序对应帧的字幕区域的字幕文字关联保存。本实施例采用现有的音频分离工具将视频中的音频提取出来并保存,并将字幕文字与切割后的音频文件一一对应完成音频标注,上述标注数据可用于自动语音识别技术的训练时的样本数据,以降低现有人工标注数据的成本,且提高标注数据的质量。
进一步地,所述通过预设格式标注所述字幕文字并存储的步骤S54之后,还包括:
S541:根据所述视频的视频文件名和所述第二帧图片的帧索引,判断所述预设文件中是否存在与第一帧索引对应的空字幕文件,其中所述第一帧索引包含于所述预设文件中所有帧索引中;
S542:若存在,则将所述第一帧索引对应的帧图片,输入文本定位模型;
S543:判断根据所述文本定位模型是否提取到所述第一帧索引对应的帧图片的指定字幕文字;
S544:若提取到,则将所述指定字幕文字补充到所述预设文件中所述第一帧索引对应位置;
S545:若未提取到,则在所述预设文件中所述第一帧索引对应位置标记为空字幕。
本实施例通过帧索引与字幕文字的对应关系,查找遗漏提取的字幕文字,以保证整个视频文件中的字幕文字的完整性。当查找到第一帧索引对应空字幕文件,即无对应的字幕文字与其相对应,则判定存在遗漏提取,则将第一帧索引对应的帧图片,输入文本定位模型,以根据文本定位模型进行字幕文字定位与提取。上述文本定位模型为CTPN,CTPN结合CNN与LSTM深度网络,CTPN是从Faster R-CNN改进而来,能有效的检测出复杂场景的横向分布的文字,比如识别视频图片中的文字,虽然识别过程复杂、识别效率较低,但识别精度高,可 有效补充通过混合高斯模型算法遗漏的字幕区域,提高整个视频文件中的字幕文字的完整性。
本实施例通过将前一时间序列对应的第一帧图片,作为后一时间序列对应的第二帧图片的背景,以便通过混合高斯模型算法确定第二帧图片相比于第一帧图片的变化像素区域,进而根据变化像素区域确定字幕区域,以便从字幕区域中提取字幕文字,实现字幕文字与视频显示界面的分离,提高字幕提取的精准度。通过字幕区域特有的宽高比特征作为预设字幕区特征。上述预设阈值为获取到字幕的最小宽高比,上述最小宽高比r的设定值范围为r大于等于视频宽的三分之一。以防r设置得太大会造成视频的一个帧图片中,满足条件的字幕区域太少,容易漏选;r设置得太小会造成提取的字幕位置不准确,计算量增大,且使定位字幕区域的误差增大。本实施例的预设区域范围内指视频显示界面中靠近所述视频显示界面的底部边缘,占比所述视频高的四分之一区域,与位于中部区域的视频宽的三分之一区域的交界区域,通过预先设定选择的预设区域范围可极大地降低数据处理量,有利于快速且准确的定位到字幕区域。本实施例采用现有的音频分离工具将视频中的音频提取出来并保存,并将字幕文字与切割后的音频文件一一对应完成音频标注,上述标注数据可用于自动语音识别技术的训练时的样本数据,以降低现有人工标注数据的成本,且提高标注数据的质量。
参照图2,本实施例的提取视频字幕的装置,包括:
第一获取模块1,用于通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个。
本实施例的混合高斯模型算法是多个单模型的组合,提高数据分配的合理性。本实施例中视频的每帧图片中的每个像素由多个单模型描述:P(p)={[w i(x,y,t),u i(x,y,t),σ i(x,y,t) 2]},i=1,2,......,k,k的值为3到5,表示混合高斯模型中单模型的个数,w i(x,y,t)表示每个单模型的权重,满足
∑_{i=1}^{k} w_i(x,y,t) = 1
u i(x,y,t)表示每个单模型的均值,σ i(x,y,t) 2表示每个单模型对应的方差,上述权重、均值和方差共同确定一个单模型。本实施例中通过将视频的图像序列中的相邻的第一帧图片和第二帧图片输入到混合高斯模型算法中,若第一帧图片和第二帧图片在(x,y)处的像素值对于i=1,2,......,k满足 I(x,y,t)-u i(x,y,t)≤λ*σ i(x,y,t),则像素值与该单模型匹配,则判定该像素值为背景,若不存在与该像素值匹配的单模型,则为前景,即视频内容。本实施例通过将相邻且时间早于第二帧图片的第一帧图片,作为第二帧图片的背景,以便确定第二帧图片相比于第一帧图片的变化像素区域,上述变化像素区域为包括差异像素点的区域。
第一判断模块2,用于判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域。
本实施例的预设区域范围包括现有字幕常设置的视频显示区域,比如,预设区域范围包括视频显示界面的底部区域的中间位置范围,可通过视频显示界面中的坐标数据,实现定位预设区域范围,以便提高获取字幕区域的精准性,降低数据处理过程中的计算量。本实施通过识别预设区域范围内存在的第一变化像素区域,初步确定可能存在字幕区域。
第二判断模块3,用于若所述视频显示界面的预设区域范围内存在第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征。
本实施例通过将第一变化像素区域的特征与预设字幕区特征进行比较,以便通过预设字幕区特征确定第一变化像素区域是否为字幕区域,提高确定字幕区域的精准度。第一变化像素区域的特征与预设字幕区特征一致,或处于预设差异范围之内,则均认为第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域,否则第一变化像素区域不是所述字幕区域。上述预设字幕区特征包括字幕区的高度值范围、字幕区的宽高比等。
判定模块4,用于若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域。
本实施例的视频中第二帧图片相比于第一帧图片的变化像素区域中,包括字幕区域的变化、视频图像变化等,比如不同帧图像对应不同的字幕内容。本实施例的预设规则遵循现有视频中字幕区域的设置特点进行设定。比如现有字幕区域多设置于视频显示界面的底部区域中间位置,且常以宽条状形态存在。本实施例首先通过混合高斯模型算法获取各帧图片对应的变化像素区域,然后再从变化像素区域中确定字幕区域,进而实现对字幕区域的字幕文字的提取,可快速从视频文件中准确提取相对应的字幕文字,以便将字幕文字进行二次处理,比如标注音频、优化显示过程甚至制作文本训练样本等。上述字幕区域为字幕文字的图像映射区域,根据不同的文字对应的映射像素不同,进而区别不 同字幕文字的字幕区域。
提取模块5,用于从所述字幕区域中提取字幕文字。
本实施例通过图片中文字识别技术,从所述字幕区域中提取字幕文字,实现字幕文字与视频显示界面的分离。以便对字幕文字实现进一步的优化处理。包括优化字幕文字的显示方式,比如设置为3D显示状态、改变字幕文字的显示颜色,优化字幕文字的动画显示效果等,扩大字幕文字的使用范围。
进一步地,所述第二判断模块,包括:
计算单元,用于计算所述第一变化像素区域的宽高比,其中所述第一变化像素区域中沿所述视频时序播放方向为所述宽,垂直于所述宽的方向为所述高,所述宽高比为所述宽除以所述高。
第一判断单元,用于判断所述宽高比是否大于预设阈值。
第一判定单元,用于若所述宽高比大于预设阈值,则判定所述第一变化像素区域满足所述预设字幕区特征。
第二判定单元,用于若所述宽高比不大于预设阈值,则判定所述第一变化像素区域不满足所述预设字幕区特征。
本实施例通过字幕区域特有的宽高比特征作为预设字幕区特征。上述预设阈值为获取到字幕的最小宽高比,上述最小宽高比r的设定值范围为r大于等于视频宽的三分之一。以防r设置得太大会造成视频的一个帧图片中,满足条件的字幕区域太少,容易漏选;r设置得太小会造成提取的字幕位置不准确,计算量增大,且使定位字幕区域的误差增大。
进一步地,提取视频字幕的装置,包括:
第二获取模块,用于获取所述视频的视频宽和视频高,其中所述视频显示界面中沿所述视频时序播放方向为所述视频宽,垂直于所述视频宽的方向为所述视频高。
设定模块,用于设定所述预设值等于第一预设值,设定靠近所述视频显示界面的底部边缘,占比第二预设值的区域范围为所述预设区域范围。
本实施例的预设区域范围内指视频显示界面中靠近所述视频显示界面的底部边缘,占比所述视频高的四分之一区域,与位于中部区域的视频宽的三分之一区域的交界区域,即上述第一预设值为所述视频宽的三分之一,第二预设值为所述视频高的四分之一。通过预先设定选择的预设区域范围可极大地降低数据处理量,有利于快速且准确的定位到字幕区域。
进一步地,所述提取模块5,包括:
分离单元,用于将所述字幕区域从所述第二帧图片中切割分离。
识别单元,用于将分离后的所述字幕区域通过图像文字识别算法识别出所述字幕文字。
赋值单元,用于将所述字幕文字复制到预设文件中。
标注单元,用于通过预设格式标注所述字幕文字并存储。
本实施例通过将字幕区域从第二帧图片中切割分离,进行单独存储,以便精准地处理字幕区域。通过将按照视频时序依次获得的各帧图片中的字幕区域按照顺序依次输入到OCR(optical character recognition)文字识别模型中进行文字识别。OCR文字识别是指电子设备(例如扫描仪或数码相机)检查纸上打印的字符,然后用字符识别方法将字符形状翻译成计算机文字的过程;通过对字幕区域对应的文本资料进行扫描,然后对字幕区域对应的图像文件进行分析处理,获取文字及版面信息的过程。本实施例的预设格式包括视频的视频文件名、第二帧图片的帧索引、字幕的文字内容、视频总帧数及视频宽高尺寸等。上述预设文本为按照字幕所在帧图片的时序,依次存储的字幕文字内容。
进一步地,所述预设格式至少包括所述视频的视频文件名和第二帧图片对应的帧索引,所述提取模块5,包括:
第二判断单元,用于根据所述视频的视频文件名和所述第二帧图片对应的帧索引,判断所述预设文件中是否存在与第一字幕文字具有相同标注信息的第二字幕文字,其中所述第一字幕文字和所述第二字幕文字,分别包含于所述预设文件中所有所述字幕文字。
第三判断单元,用于若所述预设文件中存在与第一字幕文字具有相同标注信息的第二字幕文字,则判断所述第一字幕文字和所述第二字幕文字的文字内容是否相同。
删除单元,用于若所述第一字幕文字和所述第二字幕文字的文字内容相同,则删除所述第一字幕文字或所述第二字幕文字。
本实施例的视频文件名为当前视频的文件名,比如AVI.123等等;上述帧索引指帧图片处于所有帧中的排序,比如按照时序位于第三帧的图片。本实施例通过标注格式中标注内容,包括视频的视频文件名和所述第二帧图片的帧索引,初步判断是否出现重复字幕文字,若标注内容不同,则第一字幕文字和第二字幕文字的文字内容不相同;若标注内容相同,再判断具体的文字内容是否 相同,若不相同,则第一字幕文字和第二字幕文字的文字内容不相同。通过逐步判断的方式,以便通过标注信息初步判断字幕文字是否相同,以便节省重复调用字符识别方法将字符形状翻译成计算机文字的过程,以节省流程,加快响应效率。本实施例通过识别变化像素区域,避免了连续多帧图片具有相同字幕时重复提取字幕文字的情况,且通过上述的逐步判断方式剔除由于视频背景干扰,而导致重复提取的字幕文字,以净化预设文件中的字幕文字。
进一步地,本申请另一实施例的提取视频字幕的装置,包括:
第三获取模块,用于获取所述第二字幕文字的起始时间和终止时间。
确定模块,用于确定所述起始时间和终止时间范围内对应的第一音频文件。
截取模块,用于将所述第一音频文件通过音频截取工具从所述视频对应的音频文件中截取分离。
标注模块,用于将所述第二字幕文字与所述第一音频文件一一对应进行音频标注。
本实施例通过遍历视频中所有的变化像素区域a1,a2,a3,…an,计算各变化像素区域的宽高比(w/h)是否大于设定r,如果大于设定r,则从当前帧的图片中切割对应的字幕区域,并将当前帧的帧索引换算成对应的时间[帧索引*(1/视频帧率)就得到当前帧在视频中的时间位置],并缓存该时间点的字幕区域,将本次缓存的字幕区域与上一时序缓存的字幕区域进行像素对比,差异小于预设阈值时,则上一次缓存的时间位置与当前新的缓存时间位置作为上一时序对应帧的字幕区域的时间间隔,并将时间间隔与上一时序对应帧的字幕区域的字幕文字关联保存。本实施例采用现有的音频分离工具将视频中的音频提取出来并保存,并将字幕文字与切割后的音频文件一一对应完成音频标注,上述标注数据可用于自动语音识别技术的训练时的样本数据,以降低现有人工标注数据的成本,且提高标注数据的质量。
进一步地,所述提取模块5,还包括:
第四判断单元,用于根据所述视频的视频文件名和所述第二帧图片的帧索引,判断所述预设文件中是否存在与第一帧索引对应的空字幕文件,其中所述第一帧索引包含于所述预设文件中所有帧索引中;
输入单元,用于所述预设文件中存在与第一帧索引对应的空字幕文件,则将所述第一帧索引对应的帧图片,输入文本定位模型;
第五判断单元,用于判断根据所述文本定位模型是否提取到所述第一帧索 引对应的帧图片的指定字幕文字;
补充单元,用于若根据所述文本定位模型提取到所述第一帧索引对应的帧图片的指定字幕文字,则将所述指定字幕文字补充到所述预设文件中所述第一帧索引对应位置;
标记单元,用于若根据所述文本定位模型未提取到所述第一帧索引对应的帧图片的指定字幕文字,则在所述预设文件中所述第一帧索引对应位置标记为空字幕。
本实施例通过帧索引与字幕文字的对应关系,查找遗漏提取的字幕文字,以保证整个视频文件中的字幕文字的完整性。当查找到第一帧索引对应空字幕文件,即无对应的字幕文字与其相对应,则判定存在遗漏提取,则将第一帧索引对应的帧图片,输入文本定位模型,以根据文本定位模型进行字幕文字定位与提取。上述文本定位模型为CTPN,CTPN结合CNN与LSTM深度网络,CTPN是从Faster R-CNN改进而来,能有效的检测出复杂场景的横向分布的文字,比如识别视频图片中的文字,虽然识别过程复杂、识别效率较低,但识别精度高,可有效补充通过混合高斯模型算法遗漏的字幕区域,提高整个视频文件中的字幕文字的完整性。
参照图3,本申请实施例中还提供一种计算机设备,该计算机设备可以是服务器,其内部结构可以如图3所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设计的处理器用于提供计算和控制能力。该计算机设备的存储器包括可读存储介质、内存储器。该可读存储介质存储有操作系统、计算机可读指令和数据库。上述可读存储介质包括非易失性可读存储介质和易失性可读存储介质。该内存器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储提取视频字幕等数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令在执行时,执行如上述各方法的实施例的流程。本领域技术人员可以理解,图3中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定。
本申请一实施例还提供一种计算机可读存储介质,其上存储有计算机可读指令,该计算机可读指令在执行时,执行如上述各方法的实施例的流程。上述可读存储介质包括非易失性可读存储介质和易失性可读存储介质。以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说 明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种提取视频字幕的方法,其特征在于,包括:
    通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和所述第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个;
    判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域;
    若所述视频显示界面的预设区域范围内存在所述第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征;
    若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域;
    从所述字幕区域中提取字幕文字。
  2. 根据权利要求1所述的提取视频字幕的方法,其特征在于,所述判断所述第一变化像素区域是否满足预设字幕区特征的步骤,包括:
    计算所述第一变化像素区域的宽高比,其中所述第一变化像素区域中沿所述视频时序播放方向为所述宽,垂直于所述宽的方向为所述高,所述宽高比为所述宽除以所述高;
    判断所述宽高比是否大于预设阈值;
    若所述宽高比大于预设阈值,则判定所述第一变化像素区域满足所述预设字幕区特征;
    若所述宽高比不大于预设阈值,则判定所述第一变化像素区域不满足所述预设字幕区特征。
  3. 根据权利要求2所述的提取视频字幕的方法,其特征在于,所述判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域的步骤之前,包括:
    获取所述视频的视频宽和视频高,其中所述视频显示界面中沿所述视频时序播放方向为所述视频宽,垂直于所述视频宽的方向为所述视频高;
    设定所述预设值等于第一预设值,设定靠近所述视频显示界面的底部边缘,占比第二预设值的区域范围为所述预设区域范围。
  4. 根据权利要求1所述的提取视频字幕的方法,其特征在于,所述从所述字幕区域中提取字幕文字的步骤,包括:
    将所述字幕区域从所述第二帧图片中切割分离;
    将分离后的所述字幕区域通过图像文字识别算法识别出所述字幕文字;
    将所述字幕文字复制到预设文件中;
    通过预设格式标注所述字幕文字并存储。
  5. 根据权利要求4所述的提取视频字幕的方法,其特征在于,所述预设格式至少包括所述视频的视频文件名和所述第二帧图片对应的帧索引,所述通过预设格式标注所述字幕文字并存储的步骤之后,包括:
    根据所述视频的视频文件名和所述第二帧图片对应的帧索引,判断所述预设文件中是否存在与第一字幕文字具有相同标注信息的第二字幕文字,其中所述第一字幕文字和所述第二字幕文字,分别包含于所述预设文件中所有所述字幕文字中;
    若所述预设文件中存在与第一字幕文字具有相同标注信息的第二字幕文字,则判断所述第一字幕文字和所述第二字幕文字的文字内容是否相同;
    若所述第一字幕文字和所述第二字幕文字的文字内容相同,则删除所述第一字幕文字或所述第二字幕文字。
  6. 根据权利要求4所述的提取视频字幕的方法,其特征在于,所述通过预设格式标注所述字幕文字并存储的步骤之后,还包括:
    根据所述视频的视频文件名和所述第二帧图片的帧索引,判断所述预设文件中是否存在与第一帧索引对应的空字幕文件,其中所述第一帧索引包含于所述预设文件中所有帧索引中;
    若存在,则将所述第一帧索引对应的帧图片,输入文本定位模型;
    判断根据所述文本定位模型是否提取到所述第一帧索引对应的帧图片的指定字幕文字;
    若提取到,则将所述指定字幕文字补充到所述预设文件中所述第一帧索引对应位置;
    若未提取到,则在所述预设文件中所述第一帧索引对应位置标记为空字幕。
  7. 根据权利要求1所述的提取视频字幕的方法,其特征在于,所述从所述字幕区域中提取字幕文字的步骤之后,包括:
    获取所述第二字幕文字的起始时间和终止时间;
    确定所述起始时间和终止时间范围内对应的第一音频文件;
    将所述第一音频文件通过音频截取工具从所述视频对应的音频文件中截 取分离;
    将所述第二字幕文字与所述第一音频文件一一对应进行音频标注。
  8. 一种提取视频字幕的装置,其特征在于,包括:
    第一获取模块,用于通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和所述第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个;
    第一判断模块,用于判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域;
    第二判断模块,用于若所述视频显示界面的预设区域范围内存在第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征;
    判定模块,用于若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域;
    提取模块,用于从所述字幕区域中提取字幕文字。
  9. 根据权利要求8所述的提取视频字幕的装置,其特征在于,所述第二判断模块,包括:
    计算单元,用于计算所述第一变化像素区域的宽高比,其中所述第一变化像素区域中沿所述视频时序播放方向为所述宽,垂直于所述宽的方向为所述高,所述宽高比为所述宽除以所述高;
    第一判断单元,用于判断所述宽高比是否大于预设阈值;
    第一判定单元,用于若所述宽高比大于预设阈值,则判定所述第一变化像素区域满足所述预设字幕区特征;
    第二判定单元,用于若所述宽高比不大于预设阈值,则判定所述第一变化像素区域不满足所述预设字幕区特征。
  10. 根据权利要求9所述的提取视频字幕的装置,其特征在于,包括:
    第二获取模块,用于获取所述视频的视频宽和视频高,其中所述视频显示界面中沿所述视频时序播放方向为所述视频宽,垂直于所述视频宽的方向为所述视频高;
    设定模块,用于设定所述预设值等于第一预设值,设定靠近所述视频显示界面的底部边缘,占比第二预设值的区域范围为所述预设区域范围。
  11. 根据权利要求8所述的提取视频字幕的装置,其特征在于,所述提取模块,包括:
    分离单元,用于将所述字幕区域从所述第二帧图片中切割分离;
    识别单元,用于将分离后的所述字幕区域通过图像文字识别算法识别出所述字幕文字;
    赋值单元,用于将所述字幕文字复制到预设文件中;
    标注单元,用于通过预设格式标注所述字幕文字并存储。
  12. 根据权利要求11所述的提取视频字幕的装置,其特征在于,所述预设格式至少包括所述视频的视频文件名和第二帧图片对应的帧索引,所述提取模块,包括:
    第二判断单元,用于根据所述视频的视频文件名和所述第二帧图片对应的帧索引,判断所述预设文件中是否存在与第一字幕文字具有相同标注信息的第二字幕文字,其中所述第一字幕文字和所述第二字幕文字,分别包含于所述预设文件中所有所述字幕文字中;
    第三判断单元,用于若所述预设文件中存在与第一字幕文字具有相同标注信息的第二字幕文字,则判断所述第一字幕文字和所述第二字幕文字的文字内容是否相同;
    删除单元,用于若所述第一字幕文字和所述第二字幕文字的文字内容相同,则删除所述第一字幕文字或所述第二字幕文字。
  13. 根据权利要求11所述的提取视频字幕的装置,其特征在于,所述提取模块,包括:
    第四判断单元,用于根据所述视频的视频文件名和所述第二帧图片的帧索引,判断所述预设文件中是否存在与第一帧索引对应的空字幕文件,其中所述第一帧索引包含于所述预设文件中所有帧索引中;
    输入单元,用于所述预设文件中存在与第一帧索引对应的空字幕文件,则将所述第一帧索引对应的帧图片,输入文本定位模型;
    第五判断单元,用于判断根据所述文本定位模型是否提取到所述第一帧索引对应的帧图片的指定字幕文字;
    补充单元,用于若根据所述文本定位模型提取到所述第一帧索引对应的帧图片的指定字幕文字,则将所述指定字幕文字补充到所述预设文件中所述第一帧索引对应位置;
    标记单元,用于若根据所述文本定位模型未提取到所述第一帧索引对应的帧图片的指定字幕文字,则在所述预设文件中所述第一帧索引对应位置标记为 空字幕。
  14. 根据权利要求8所述的提取视频字幕的装置,其特征在于,包括:
    第三获取模块,用于获取所述第二字幕文字的起始时间和终止时间;
    确定模块,用于确定所述起始时间和终止时间范围内对应的第一音频文件;
    截取模块,用于将所述第一音频文件通过音频截取工具从所述视频对应的音频文件中截取分离;
    标注模块,用于将所述第二字幕文字与所述第一音频文件一一对应进行音频标注。
  15. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,其特征在于,所述处理器执行所述计算机程序时实现提取视频字幕的方法,提取视频字幕的方法,包括:
    通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和所述第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个;
    判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域;
    若所述视频显示界面的预设区域范围内存在所述第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征;
    若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域;
    从所述字幕区域中提取字幕文字。
  16. 根据权利要求15所述的计算机设备,其特征在于,所述判断所述第一变化像素区域是否满足预设字幕区特征的步骤,包括:
    计算所述第一变化像素区域的宽高比,其中所述第一变化像素区域中沿所述视频时序播放方向为所述宽,垂直于所述宽的方向为所述高,所述宽高比为所述宽除以所述高;
    判断所述宽高比是否大于预设阈值;
    若所述宽高比大于预设阈值,则判定所述第一变化像素区域满足所述预设字幕区特征;
    若所述宽高比不大于预设阈值,则判定所述第一变化像素区域不满足所述预设字幕区特征。
  17. 根据权利要求16所述的计算机设备,其特征在于,所述判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域的步骤之前,包括:
    获取所述视频的视频宽和视频高,其中所述视频显示界面中沿所述视频时序播放方向为所述视频宽,垂直于所述视频宽的方向为所述视频高;
    设定所述预设值等于第一预设值,设定靠近所述视频显示界面的底部边缘,占比第二预设值的区域范围为所述预设区域范围。
  18. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现提取视频字幕的方法,提取视频字幕的方法包括:
    通过混合高斯模型算法获取视频的第二帧图片相比于第一帧图片的变化像素区域,其中所述第一帧图片和所述第二帧图片是所述视频中相邻的任意两帧图片,所述变化像素区域至少包括一个;
    判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域,其中所述第一变化像素区域包含于所述变化像素区域;
    若所述视频显示界面的预设区域范围内存在所述第一变化像素区域,则判断所述第一变化像素区域是否满足预设字幕区特征;
    若所述第一变化像素区域满足预设字幕区特征,则判定所述第一变化像素区域为所述字幕区域;
    从所述字幕区域中提取字幕文字。
  19. 根据权利要求18所述的计算机可读存储介质,其特征在于,所述判断所述第一变化像素区域是否满足预设字幕区特征的步骤,包括:
    计算所述第一变化像素区域的宽高比,其中所述第一变化像素区域中沿所述视频时序播放方向为所述宽,垂直于所述宽的方向为所述高,所述宽高比为所述宽除以所述高;
    判断所述宽高比是否大于预设阈值;
    若所述宽高比大于预设阈值,则判定所述第一变化像素区域满足所述预设字幕区特征;
    若所述宽高比不大于预设阈值,则判定所述第一变化像素区域不满足所述预设字幕区特征。
  20. 根据权利要求19所述的计算机可读存储介质,其特征在于,所述判断所述视频显示界面的预设区域范围内是否存在第一变化像素区域的步骤之 前,包括:
    获取所述视频的视频宽和视频高,其中所述视频显示界面中沿所述视频时序播放方向为所述视频宽,垂直于所述视频宽的方向为所述视频高;
    设定所述预设值等于第一预设值,设定靠近所述视频显示界面的底部边缘,占比第二预设值的区域范围为所述预设区域范围。
PCT/CN2019/118411 2019-04-22 2019-11-14 提取视频字幕的方法、装置、计算机设备及存储介质 WO2020215696A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910324978.6A CN110197177B (zh) 2019-04-22 2019-04-22 提取视频字幕的方法、装置、计算机设备及存储介质
CN201910324978.6 2019-04-22

Publications (1)

Publication Number Publication Date
WO2020215696A1 true WO2020215696A1 (zh) 2020-10-29

Family

ID=67752135

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118411 WO2020215696A1 (zh) 2019-04-22 2019-11-14 提取视频字幕的方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110197177B (zh)
WO (1) WO2020215696A1 (zh)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197177B (zh) * 2019-04-22 2024-03-19 平安科技(深圳)有限公司 提取视频字幕的方法、装置、计算机设备及存储介质
CN113014834B (zh) * 2019-12-19 2024-02-27 合肥杰发科技有限公司 图片字幕显示方法、装置及相关装置
CN114391260A (zh) * 2019-12-30 2022-04-22 深圳市欢太科技有限公司 文字识别方法、装置、存储介质及电子设备
CN111405359B (zh) * 2020-03-25 2022-05-10 北京奇艺世纪科技有限公司 处理视频数据的方法、装置、计算机设备和存储介质
CN112232260A (zh) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 字幕区域识别方法、装置、设备及存储介质
CN112464935A (zh) * 2020-12-09 2021-03-09 深圳康佳电子科技有限公司 一种lrc歌词显示控制方法、智能终端及存储介质
CN112735476A (zh) * 2020-12-29 2021-04-30 北京声智科技有限公司 一种音频数据标注方法及装置
CN116208802A (zh) * 2023-05-05 2023-06-02 广州信安数据有限公司 视频数据多模态合规检测方法、存储介质和合规检测设备

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102208023A (zh) * 2011-01-23 2011-10-05 浙江大学 基于边缘信息和分布熵的视频字幕识别设计方法
CN102802074A (zh) * 2012-08-14 2012-11-28 海信集团有限公司 从电视信号中提取文字信息并显示的方法及电视机
US20160360123A1 (en) * 2003-12-08 2016-12-08 Sonic Ip, Inc. Multimedia Distribution System for Multimedia Files with Interleaved Media Chunks of Varying Types
CN110197177A (zh) * 2019-04-22 2019-09-03 平安科技(深圳)有限公司 提取视频字幕的方法、装置、计算机设备及存储介质

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
WO2003051031A2 (en) * 2001-12-06 2003-06-19 The Trustees Of Columbia University In The City Of New York Method and apparatus for planarization of a material by growing and removing a sacrificial film
KR100540735B1 (ko) * 2003-07-25 2006-01-11 엘지전자 주식회사 자막문자기반의 영상 인덱싱 방법
CN104735521B (zh) * 2015-03-30 2018-04-13 北京奇艺世纪科技有限公司 一种滚动字幕检测方法及装置
CN108769776B (zh) * 2018-05-31 2021-03-19 北京奇艺世纪科技有限公司 标题字幕检测方法、装置及电子设备
CN109271988A (zh) * 2018-08-30 2019-01-25 中国传媒大学 一种基于图像分割及动态阈值的字幕提取方法


Cited By (8)

Publication number Priority date Publication date Assignee Title
CN112347990A (zh) * 2020-11-30 2021-02-09 重庆空间视创科技有限公司 Multimodality-based intelligent manuscript review system and method
CN112347990B (zh) * 2020-11-30 2024-02-02 重庆空间视创科技有限公司 Multimodality-based intelligent manuscript review system and method
CN112925905A (zh) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Method and device for extracting video subtitles, electronic equipment, and storage medium
CN112925905B (zh) * 2021-01-28 2024-02-27 北京达佳互联信息技术有限公司 Method and device for extracting video subtitles, electronic equipment, and storage medium
CN114615520A (zh) * 2022-03-08 2022-06-10 北京达佳互联信息技术有限公司 Subtitle positioning method and device, computer equipment, and medium
CN114615520B (zh) * 2022-03-08 2024-01-02 北京达佳互联信息技术有限公司 Subtitle positioning method and device, computer equipment, and medium
CN114666649A (zh) * 2022-03-31 2022-06-24 北京奇艺世纪科技有限公司 Method and device for identifying videos with cropped subtitles, electronic equipment, and storage medium
CN114666649B (zh) * 2022-03-31 2024-03-01 北京奇艺世纪科技有限公司 Method and device for identifying videos with cropped subtitles, electronic equipment, and storage medium

Also Published As

Publication number Publication date
CN110197177B (zh) 2024-03-19
CN110197177A (zh) 2019-09-03

Similar Documents

Publication Publication Date Title
WO2020215696A1 (zh) Method and apparatus for extracting video subtitles, computer device, and storage medium
US6546185B1 (en) System for searching a particular character in a motion picture
US8280158B2 (en) Systems and methods for indexing presentation videos
US20090310863A1 (en) Finding image capture date of hardcopy medium
EP2544099A1 (en) Method for creating an enrichment file associated with a page of an electronic document
CN111931775A (zh) Method, system, computer equipment, and storage medium for automatically obtaining news headlines
US20230237825A1 (en) Wine product positioning method, wine product information management method and apparatus, device, and storage medium
CN111626145A (zh) Simple and effective method for recognizing incomplete tables and stitching them across pages
CN113435438B (zh) Video title-board extraction and video segmentation method fusing images and subtitles
CN109726369A (zh) Implementation method of intelligent templated bibliographic records based on standard documents
CN114386504A (zh) Text recognition method for engineering drawings
CN114005121A (zh) Text recognition method and device for mobile terminals
CN110933520B (zh) Surveillance video display method based on spiral summaries, and storage medium
CN111046770B (zh) Automatic person labeling method for photo archives
CN110503087A (zh) Search method, apparatus, terminal, and storage medium for photographed questions
CN101335811B (zh) Printing method and printing apparatus
CN113065559B (zh) Image comparison method and device, electronic equipment, and storage medium
CN111507991B (zh) Remote sensing image segmentation method and device for characteristic regions
CN113807173A (zh) Construction and labeling method for lane line datasets, and application system
CN114792425A (zh) AI-algorithm-based method for automatically organizing incorrectly answered questions from photos of examinees' test papers, and related algorithms
JPH0149998B2 (zh)
CN114302170A (zh) Method, system, and computer storage medium for displaying lyric timing word by word
JP3831180B2 (ja) Video information printing apparatus, video information summarization method, and computer-readable recording medium storing a program for causing a computer to execute the method
CN113763389B (zh) Image recognition method based on multi-subject detection and segmentation
CN116912867B (zh) Textbook structure extraction method and device combining automatic labeling and recall completion

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19925722

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: PCT application non-entry in European phase

Ref document number: 19925722

Country of ref document: EP

Kind code of ref document: A1