WO2022089170A1 - Subtitle region recognition method, apparatus, device and storage medium - Google Patents

Subtitle region recognition method, apparatus, device and storage medium (字幕区域识别方法、装置、设备及存储介质)

Info

Publication number: WO2022089170A1
Authority: WO (WIPO (PCT))
Prior art keywords: subtitle, text, candidate, region, text content
Application number: PCT/CN2021/122697
Other languages: English (en), French (fr)
Inventors: 黄杰 (Huang Jie), 王书培 (Wang Shupei)
Original Assignee: 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022089170A1
Priority to US17/960,004 (published as US20230027412A1)

Classifications

    • G06V30/10 Character recognition
    • G06V30/147 Determination of region of interest
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/23 Image preprocessing based on positionally close patterns or neighbourhood relationships
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L25/57 Speech or voice analysis techniques specially adapted for processing of video signals

Definitions

  • the present application relates to the field of computer vision technology in artificial intelligence, and in particular, to a subtitle region recognition method, apparatus, device, and storage medium.
  • subtitle extraction technology needs to be applied to videos in various scenarios. For example, in the training process of a speech-to-text model, subtitles in videos need to be used as training samples.
  • the text information in a short video is not necessarily subtitle text; it may also include brand watermark text, video title text, and so on. Therefore, to extract subtitles from a short video, the related art manually marks the subtitle area and then uses OCR (Optical Character Recognition) technology to perform text recognition at the marked position to obtain the subtitles. For example, a screenshot of the video is taken manually, the screenshot is opened in image viewing software, and the mouse is moved to the upper-left and lower-right corners of the subtitle to obtain the coordinates of the two positions and, from them, the position of the subtitle.
  • the method in the related art requires a lot of manpower to extract subtitles.
  • the embodiments of the present application provide a method, apparatus, device and storage medium for subtitle area identification, which can automatically extract subtitles and save human resources.
  • the technical solution is as follows.
  • a subtitle region recognition method, the method being executed by a computer device, the method including:
  • identifying a video to obtain n candidate subtitle regions, where a candidate subtitle region is a region in which text content is displayed in the video, and n is a positive integer;
  • obtaining the subtitle region by screening the n candidate subtitle regions according to a subtitle region screening strategy, where the subtitle region screening strategy is used to determine, as the subtitle region, the candidate subtitle region whose text content repetition rate is lower than a repetition rate threshold and whose total display duration is the longest.
  • a subtitle recognition device comprising:
  • the identification module is used to identify the video to obtain n candidate subtitle regions, the candidate subtitle region is the region displayed by the text content in the video, and n is a positive integer;
  • the screening module is configured to obtain the subtitle region by screening the n candidate subtitle regions according to a subtitle region screening strategy, the subtitle region screening strategy being used to determine, as the subtitle region, the candidate subtitle region whose text content repetition rate is lower than the repetition rate threshold and whose total display duration is the longest.
  • a computer device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the subtitle region recognition method described above.
  • a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the subtitle region recognition method described in the above aspect.
  • a computer program product or computer program including computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the subtitle region identification method provided in the foregoing optional implementation manner.
  • the beneficial effects brought by the technical solutions provided in the embodiments of the present application include at least the following.
  • the subtitle regions are obtained by filtering the candidate subtitle regions identified from the video by using the subtitle region screening strategy.
  • the subtitle region is selected from the candidate subtitle regions according to the characteristics that subtitles have a fixed display position, diverse text content, and a long display duration, so that the subtitles of the video can be extracted according to the subtitle region. Compared with manually labeling the subtitle area, this method saves the human resources required for subtitle recognition and increases the speed and efficiency of subtitle recognition.
  • FIG. 1 is a block diagram of a computer system provided by an exemplary embodiment of the present application.
  • FIG. 2 is a method flowchart of a method for identifying a subtitle region provided by another exemplary embodiment of the present application.
  • FIG. 3 is a method flowchart of a method for identifying a subtitle region provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of a video frame image of a method for identifying a subtitle region provided by another exemplary embodiment of the present application.
  • FIG. 5 is a schematic diagram of a video frame image of a method for identifying a subtitle region provided by another exemplary embodiment of the present application.
  • FIG. 6 is a method flowchart of a method for identifying a subtitle region provided by another exemplary embodiment of the present application.
  • FIG. 7 is a schematic diagram of a video frame image of a method for identifying a subtitle region provided by another exemplary embodiment of the present application.
  • FIG. 8 is a schematic diagram of a text area of a method for recognizing a subtitle area provided by another exemplary embodiment of the present application.
  • FIG. 9 is a method flowchart of a method for identifying a subtitle area provided by another exemplary embodiment of the present application.
  • FIG. 10 is a method flowchart of a method for identifying a subtitle area provided by another exemplary embodiment of the present application.
  • FIG. 11 is a method flowchart of a method for identifying a subtitle area provided by another exemplary embodiment of the present application.
  • FIG. 12 is a block diagram of a subtitle recognition device provided by another exemplary embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a server provided by another exemplary embodiment of the present application.
  • FIG. 14 is a block diagram of a terminal provided by another exemplary embodiment of the present application.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
  • artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer Vision (CV) is a science that studies how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and further performing graphics processing so that the processed images are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies, trying to build artificial intelligence systems that can obtain information from images or multidimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (Three-Dimensional) technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
  • OCR is the abbreviation of Optical Character Recognition, which can also simply be called text recognition, and is a method of automatic text input. Text and image information on paper is acquired through optical input methods such as scanning and photography, converting bills, newspapers, books, manuscripts, and other printed matter into image information; pattern recognition algorithms then analyze the morphological features of the text, and text recognition technology converts the image information into text that a computer can use.
  • FIG. 1 shows a schematic structural diagram of a computer system provided by an exemplary embodiment of the present application, where the computer system includes a terminal 120 and a server 140 .
  • the terminal 120 and the server 140 are connected to each other through a wired or wireless network.
  • the terminal 120 includes at least one of a smart phone, a notebook computer, a desktop computer, a tablet computer, a smart speaker, and a smart robot.
  • the terminal uploads the video for which subtitle recognition needs to be performed to the server, and the server performs subtitle recognition on the video uploaded by the terminal.
  • the server may also perform subtitle recognition on the locally stored video.
  • the terminal may also perform subtitle recognition on the locally stored video.
  • the terminal may also download the video through the network, and perform subtitle recognition on the downloaded video.
  • the terminal 120 further includes a display; the display is used for displaying a picture of a video.
  • Terminal 120 includes a first memory and a first processor.
  • a first program is stored in the first memory; the above-mentioned first program is called and executed by the first processor to realize the subtitle region identification method provided by the present application.
  • the first memory may include but is not limited to the following: random access memory (Random Access Memory, RAM), read only memory (Read Only Memory, ROM), programmable read only memory (Programmable Read-Only Memory, PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
  • the first processor may be composed of one or more integrated circuit chips.
  • the first processor may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP).
  • the first processor may implement the subtitle region identification method provided by the present application by invoking a subtitle identification algorithm.
  • Server 140 includes a second memory and a second processor.
  • a second program is stored in the second memory, and the second program is called by the second processor to implement the subtitle region identification method provided by the present application.
  • the subtitle recognition algorithm is stored in the second memory.
  • the server receives the video sent by the terminal, and uses a subtitle recognition algorithm to perform subtitle recognition.
  • the second memory may include but not limited to the following: RAM, ROM, PROM, EPROM, and EEPROM.
  • the second processor may be a general-purpose processor, such as a CPU or NP.
  • the server 140 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
  • Terminals can be smartphones, tablets, laptops, desktop computers, smart speakers, smart watches, and so on, but are not limited to these.
  • the terminal and the server can be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • the subtitle region recognition method provided by the present application can be applied to scenarios such as video subtitle extraction, acquisition of training samples of a speech-to-text model, and the like.
  • Taking the use of the subtitle region recognition method provided by this application to obtain training samples for a speech-to-text model as an example: after the subtitle region of the video is obtained, the text regions belonging to the subtitle region and the text data corresponding to those text regions are obtained, and the text content in the text data is the text part of a training sample.
  • According to the display duration in the text data, the audio of the corresponding time period is extracted from the video.
  • the audio is the speech part of the training sample, and the text part and the speech part are stored correspondingly as a training sample.
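  • As a minimal sketch of this pairing, under the assumptions that the pipeline yields a list of (text, start_seconds, end_seconds) subtitle tuples and that the ffmpeg command-line tool is available, the training samples might be exported as follows (all names here are illustrative, not the patent's):

```python
# Sketch: pair each recognized subtitle with the audio of its time span.
import subprocess

def export_training_samples(video_path, subtitles, out_dir):
    samples = []
    for i, (text, start, end) in enumerate(subtitles):
        wav_path = f"{out_dir}/sample_{i:05d}.wav"
        # Cut the corresponding audio span; -vn drops the video stream.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-ss", str(start), "-to", str(end),
             "-vn", "-acodec", "pcm_s16le", "-ar", "16000", wav_path],
            check=True,
        )
        samples.append({"audio": wav_path, "text": text})  # one training pair
    return samples
```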
  • FIG. 2 shows a flowchart of a method for identifying a subtitle area provided by an exemplary embodiment of the present application.
  • the method can be performed by a computer device, for example, a terminal or a server as shown in FIG. 1 .
  • the method includes the following steps.
  • Step 101 Identify the video to obtain n candidate subtitle regions, where the candidate subtitle regions are regions displayed by text content in the video, and n is a positive integer.
  • the video may be any type of video file, for example, a short video, a TV series, a movie, a variety show, and the like.
  • subtitles are included in the video.
  • the text in a short video picture not only contains subtitles but may also contain other text information, such as the watermark text of the short video application, the user nickname of the short video publisher, and the video name of the short video. Therefore, the subtitles of a short video cannot be accurately obtained by performing text recognition through OCR technology alone, while the method of manually labeling the subtitle area and then performing text recognition at the marked position requires a lot of manpower. Therefore, this application provides a subtitle recognition method that can accurately identify subtitles among multiple pieces of text information in a video, eliminates the step of manually labeling subtitle regions, and improves the efficiency of subtitle extraction.
  • the manner of acquiring the video may be arbitrary: the video may be a video file stored locally by the computer device, or a video file acquired from other computer devices.
  • For example, when the computer device is a server, the server can receive video files uploaded by the terminal; when the computer device is a terminal, the terminal can also download video files stored on the server through the network.
  • Exemplarily, a client with a subtitle extraction function can be installed on the terminal; the user can select a locally stored video file on the user interface of the client and click the upload control to upload the video file to the server, after which the video file undergoes subsequent subtitle region identification processing.
  • the candidate subtitle area refers to the area in the video where text content is displayed.
  • the candidate subtitle area includes an area in which text content is displayed in each frame of the video in the video.
  • a candidate subtitle region is a type of region location with a definite region range and position coordinates.
  • Exemplarily, the text regions in which text content appears at similar positions in the video are clustered into one candidate subtitle region.
  • the subtitle region is obtained by screening the n candidate subtitle regions according to the subtitle region screening strategy, and the subtitle region screening strategy is used to determine, as the subtitle region, the candidate subtitle region whose text content repetition rate is lower than the repetition rate threshold and whose total display duration is the longest.
  • That is, from the multiple candidate subtitle regions, the candidate subtitle region whose text content repetition rate is lower than the repetition rate threshold and whose text content is displayed for a long time is determined as the subtitle region.
  • the repetition rate of text content is used to describe the diversity of text content displayed in the subtitle candidate area.
  • a high repetition rate of the text content means that only one or a few kinds of text content are repeatedly displayed in the candidate subtitle region, while a low repetition rate means that a variety of text contents are displayed in the candidate subtitle region.
  • the total display duration refers to the total duration of text content displayed in the candidate subtitle area. Since the subtitles are usually displayed for a long time in the video, the candidate subtitle area with the text content displayed for a long time is selected as the subtitle area.
  • the subtitle region is obtained by screening the candidate subtitle regions identified from the video by using the subtitle region screening strategy.
  • the subtitle region is selected from the candidate subtitle regions according to the characteristics that subtitles have a fixed display position, diverse text content, and a long display duration, so that the subtitles of the video can be extracted according to the subtitle region.
  • Compared with manually labeling the subtitle area, this method saves the human resources required for subtitle recognition and increases the speed and efficiency of subtitle recognition.
  • FIG. 3 shows a flowchart of a method for identifying a subtitle area provided by an exemplary embodiment of the present application.
  • the method can be performed by a computer device, for example, a terminal or a server as shown in FIG. 1 .
  • the method includes the following steps.
  • Step 201 Identify the text content in the video and the text area where the text content is located.
  • the text content in the video, the text area where the text content is located, and the display duration of the text content are identified. There is a corresponding relationship between text content, text area, and display duration.
  • the text in the video is recognized to obtain a text list
  • the text list includes at least one piece of text data
  • the text data includes text content, a text area and a display duration
  • the text content includes at least one text on the text area.
  • the computer device performs text recognition on the video to obtain a text list.
  • the text list may be a data table, in which each row represents a piece of text data, and each column contains specific content of the text data: text content, text area, and display duration.
  • different areas on the image may contain different text content.
  • the same area on the image may also display different text content at different times. Therefore, by extracting from the video multiple text contents with different text regions and different display times, multiple pieces of text data can be obtained to form a text list.
  • text contents that are not displayed continuously in the same text area belong to two pieces of text data; that is, only if the same text area on consecutive video frame images displays the same text content does the text content belong to one piece of text data, and the duration of the consecutive video frame images is the display duration of the text content in that text data.
  • For example, if the first area on the video frame images of seconds 1-3 displays the first text content, the first area on the video frame images of seconds 3-4 displays no text, and the first area on the video frame images of seconds 4-5 again displays the first text content, then the two occurrences of the first text content correspond to two pieces of text data, whose display durations are 2 s and 1 s respectively.
  • the recognized text content, the position coordinates of the text content on the screen, and the time information of the frame are obtained.
  • the above information obtained by performing text recognition on the multiple frames is sorted and integrated to obtain a text list. For example, text content 1 and text content 2 are identified on the first frame of the video: text content 1 is located at position 1 on the first frame, text content 2 is located at position 2 on the first frame, and the time of the first frame in the video is 00:01. Text content 1 and text content 3 are identified on the second frame of the video: text content 1 is located at position 1 on the second frame, text content 3 is located at position 3 on the second frame, and the time of the second frame in the video is 00:05. By integrating the information identified in the two frames, a text list consisting of three pieces of text data is obtained: the first piece of text data is text content 1, position 1, 00:01 to 00:05, a total of 4 s; the second piece of text data is text content 2, position 2, 00:01; the third piece of text data is text content 3, position 3, 00:05.
  • the text list may also be a data set, database, document file, etc. composed of multiple text data.
  • the text area includes the position of a text box for framing the text.
  • the text box is a rectangular box, and its position can be expressed by the positions of its four lines (top line, bottom line, left line, and right line), by the coordinates of its four vertices, or by the coordinates of two diagonally opposite vertices of the text box.
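  • As an illustration, one piece of text data described above might be represented as follows; the field names are assumptions for this sketch (the box is described by its left, right, top, and bottom lines, matching the x1/x2/y1/y2 convention used for the text list later):

```python
# Sketch: one piece of text data = text content + text box + display span.
from dataclasses import dataclass

@dataclass
class TextData:
    content: str       # recognized text content
    x1: int            # left line of the text box (pixels)
    x2: int            # right line
    y1: int            # top line
    y2: int            # bottom line
    start_time: float  # moment the content first appears (seconds)
    end_time: float    # moment the content last appears (seconds)

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time

# A text list is then a list of such records:
text_list: list[TextData] = []
```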
  • Step 202 according to the positional relationship of the text regions, cluster the text regions whose position deviation is smaller than the deviation threshold into the same candidate subtitle region, and obtain n candidate subtitle regions in total.
  • the text regions are grouped into n candidate subtitle regions, where the positional deviation between a text region belonging to the i-th candidate subtitle region and the i-th candidate subtitle region is less than the deviation threshold, n is a positive integer, and i is a positive integer less than or equal to n.
  • clustering/normalizing refers to classifying the text regions according to the position distribution of the text regions, and classifying a plurality of text regions with a positional deviation smaller than a deviation threshold into the same type of text regions, that is, the same candidate subtitle region.
  • the text list includes multiple text regions. Since the subtitles of a video are usually displayed at the same region position, these text regions are normalized to obtain multiple candidate subtitle regions; even so, the displayed areas may differ slightly from frame to frame.
  • For example, (1) and (2) in FIG. 4 are two video frame images of the video. The images have a first text content located in the first text area 501 and a second text content located in the second text area 502. Both text contents are subtitles, but because the number of characters and the number of lines differ, the two text areas are slightly different, even though both are subtitle areas. Therefore, a deviation threshold needs to be set when sorting the candidate subtitle areas: if the positional deviation of two text areas is less than the deviation threshold, the two text areas are considered to belong to the same candidate subtitle region. In this way, the multiple text regions in the text list can be sorted, and several candidate subtitle regions are finally obtained.
  • the first text area includes a first top line, a first bottom line, a first left line, and a first right line; the second text area includes a second top line, a second bottom line, a second left line, and a second right line.
  • the position deviation includes at least one of: the deviation between the first top line and the second top line, the deviation between the first bottom line and the second bottom line, the deviation between the first left line and the second left line, and the deviation between the first right line and the second right line.
  • For example, the position deviation may include the deviation of the two top lines and the deviation of the two bottom lines of the two text regions, that is, text regions with similar vertical positions are classified into the same candidate subtitle region.
  • Alternatively, the position deviation may include the deviation of the two left lines and the deviation of the two right lines of the two text regions, that is, text regions with similar horizontal positions are classified into the same candidate subtitle region.
  • the specific value of the deviation threshold may be arbitrary.
  • Exemplarily, the deviation threshold is preferably 30 to 50 pixels.
  • For example, if the deviation threshold is set to 40 pixels, two text areas whose top-line deviation is less than 40 pixels and whose bottom-line deviation is also less than 40 pixels are classified into the same candidate subtitle region.
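  • Under the 40-pixel example above and the TextData sketch given earlier, the vertical-deviation check for horizontal subtitles might look like this:

```python
# Sketch: two text regions belong to the same candidate subtitle region when
# both their top lines and their bottom lines deviate by less than the threshold.
def same_candidate_region(a, b, threshold=40):
    return abs(a.y1 - b.y1) < threshold and abs(a.y2 - b.y2) < threshold
```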
  • a candidate subtitle region has a region position, that is, where the candidate subtitle region is located. Exemplarily, the region position of a candidate subtitle region is that of the largest text region belonging to it.
  • Exemplarily, the region position of the candidate subtitle region is the text region with the largest height belonging to it (for horizontally displayed subtitles), or the text region with the largest width belonging to it (for vertically displayed subtitles).
  • During implementation, a column of candidate subtitle region data can be added to the text list, and each piece of text data is extended with the candidate subtitle region to which it belongs; then each text content corresponds to a text area, a display duration, and a candidate subtitle region.
  • Step 203 Screen the subtitle region from the n candidate subtitle regions according to the subtitle region screening strategy: the candidate subtitle region whose text content repetition rate is lower than the repetition rate threshold and whose total display duration is the longest is determined as the subtitle region.
  • the total display duration is the sum of the display durations of all the text contents belonging to the candidate subtitle region.
  • the computer device may call the algorithm of the subtitle area screening strategy to identify the subtitle area of the video from the candidate subtitle area.
  • the video may contain some interfering text (non-subtitle text).
  • some of this interfering text has the characteristics of a long display time and a single displayed text, so the subtitle region can be filtered out from the text data according to these characteristics of the interfering text.
  • the subtitle area screening strategy is set according to the display characteristics of the distracting text and the display characteristics of the subtitles.
  • Subtitles have the characteristics of long display time, fixed location, and diverse text content.
  • Interfering text has other characteristics: for example, watermarks have a long display time, a fixed position, and single text content, while video titles have a short display time, a fixed position, and single text content. Based on the different characteristics of subtitles and interfering text, the subtitle region where the subtitles are located can be filtered out from the candidate subtitle regions.
  • For example, if the same text content is always displayed in a candidate subtitle region, the candidate subtitle region is not a subtitle region. Then, among the remaining candidate subtitle regions, the candidate subtitle region with the longest display duration is selected as the subtitle region, because some interfering text, for example the title text of a TV series, is displayed only during the first few seconds of the video and not afterwards. For example, as shown in FIG. 5, a video title 401 and a subtitle 402 are displayed on the video frame image; the video title 401 disappears after being displayed for a while and no further text is displayed at that position, while text is displayed at the position of the subtitle 402 for a long time. Therefore, from the remaining candidate subtitle regions, the candidate subtitle region with the longest display duration is selected as the subtitle region.
  • In summary, the method provided in this embodiment clusters the text regions in the text list identified from the video to obtain candidate subtitle regions, and then selects the subtitle region from the candidate subtitle regions according to the features of diverse text content and long display duration, so that the subtitles of the video can be extracted according to the subtitle region.
  • Compared with manually labeling the subtitle area, this method saves the human resources required for subtitle recognition and speeds up subtitle recognition.
  • FIG. 6 shows a flowchart of a method for identifying a subtitle area provided by an exemplary embodiment of the present application.
  • the method can be performed by a computer device, for example, a terminal or a server as shown in FIG. 1 .
  • Based on the embodiment of FIG. 3, step 201 further includes steps 2011 to 2012, step 202 further includes steps 2021 to 2025, and step 203 further includes steps 2031 to 2034.
  • Step 2011 Periodically capture video frame images of the video.
  • During implementation, frame-cutting processing first needs to be performed on the video; frame-cutting processing periodically captures video frame images from the video and stores them sequentially.
  • the time interval (period) for capturing video frame images from the video may be arbitrary; for example, 2 video frame images are captured every second.
  • each frame of the video may also be captured as a video frame image.
  • a video may be captured into multiple frames of video frame images.
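  • A sketch of the periodic frame capture with OpenCV, assuming the 2-frames-per-second example above:

```python
import cv2

def capture_frames(video_path, per_second=2):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / per_second)), 1)  # sample every `step` frames
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / fps, frame))  # (display moment in s, image)
        index += 1
    cap.release()
    return frames
```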
  • Step 2012 Identify the text content in the video frame image, the text area where the text content is located, and the display duration of the text content.
  • a text list is obtained by recognizing the text in the video frame image.
  • the computer device performs text recognition on each frame of video frame image to obtain a text list.
  • the optical character recognition (OCR) model is invoked to recognize the video frame images, the candidate text contents and the text areas of the candidate text contents in each video frame image are obtained, and the display moment of each candidate text content is obtained according to the display moment of its video frame image;
  • the candidate text contents are deduplicated to obtain the text contents; deduplication includes determining, among multiple candidate text contents with consecutive display moments, the same text area, and the same content, the candidate text content with the earliest display moment as the text content, and calculating the display duration of the text content according to the display moments of the multiple candidate text contents;
  • the text list is generated according to the text content, the text area of the text content, and the display duration.
  • the OCR model is invoked to recognize the text in the video frame image, and the OCR model outputs the candidate text content in the video frame image and the text area of the candidate text content.
  • a data table including: candidate text content, text area, and display time can be obtained.
  • the display moment of the video frame image refers to the moment when the video frame image is displayed in the video.
  • the display time of the candidate text content extracted from the video frame image is the same as the display time of the video frame image.
  • the OCR model is used to perform text recognition on the video frame image, identify the text in the video frame image, and output the text and text area.
  • the OCR model is a neural network model, and any known OCR model can be used.
  • the video frame images correspond to display moments in the video.
  • During implementation, the video frame images are stored in chronological order, together with the display moment of each video frame image in the video; for example, if a video frame image is displayed at the 1 s mark of the video, the video frame image is stored in association with 1 s.
  • the candidate text content identified from each video frame image may also correspond to the display moment of the video frame image in the video.
  • For a candidate text content, the subsequent video frame images can be searched sequentially for candidate text content that is the same and has the same text area; if found, these candidate text contents are determined to be the same text content.
  • the display duration of the text content can then be obtained from the display moment of the video frame image in which the candidate text content first appears and the display moment of the video frame image in which it last appears.
  • this search is continuous: when the candidate text content is not found in the next video frame image, the search stops. That is, multiple candidate text contents that are consecutive in time, have the same text area, and have the same content are combined into one text content.
  • the text list includes at least one piece of text data of at least one text content, and one text content corresponds to one text area and one display duration.
  • the display duration in the text list also needs to include the start moment and end moment of the display; that is, the start moment and end moment are stored in place of the display duration, and the display duration can be calculated from the start moment and end moment.
  • the computer device After obtaining the video, the computer device generates a video link from the video, and then recognizes the text in the video to obtain the text list shown in Table 3.
  • the text area is described by the left line x1, right line x2, upper line y1, and lower line y2 of the rectangle, and the display duration is described by the start time "startTime” and the end time "endTime”.
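  • Reusing the TextData and same_candidate_region sketches from earlier, the deduplication might be implemented as follows; the per-frame OCR output shape is an assumption, and the patent's "same text area" test is approximated here by the deviation check:

```python
# Sketch: merge candidate text that appears in consecutive frames, in the same
# area and with the same content, into one TextData record.
def build_text_list(frame_results):
    # frame_results: list of (timestamp, detections); each detection is an
    # assumed (content, x1, x2, y1, y2) tuple from the OCR model.
    text_list, active = [], []
    for timestamp, detections in frame_results:
        still_active = []
        for content, x1, x2, y1, y2 in detections:
            candidate = TextData(content, x1, x2, y1, y2, timestamp, timestamp)
            for record in active:
                if record.content == content and same_candidate_region(record, candidate):
                    record.end_time = timestamp  # extend the ongoing record
                    still_active.append(record)
                    break
            else:
                text_list.append(candidate)      # a new text content starts
                still_active.append(candidate)
        active = still_active  # contents absent from this frame are closed
    return text_list
```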
  • Step 2021 Extract a text region from the m text regions corresponding to the m text contents as the first text region, determine the first text region as the first candidate subtitle region, and add the first candidate subtitle region to the candidate subtitle region list.
  • Step 2022 Cyclically execute steps 2022 to 2023 until the number of remaining text regions is 0: extract a text region from the remaining m-k+1 text regions as the k-th text region.
  • Step 2023 Determine whether the positional deviation between the k-th text region and each candidate subtitle region is greater than the deviation threshold; if it is greater, go to step 2025, and if it is smaller (or equal), go to step 2024.
  • Step 2024 In response to the first position deviation between the k-th text region and the w-th candidate subtitle region in the candidate subtitle region list being smaller than the deviation threshold, classify the k-th text region into the w-th candidate subtitle region.
  • the first height of the k-th text region is the difference between the top line and the bottom line of the k-th text region, and is used when updating the region position of the candidate subtitle region (see below).
  • Step 2025 In response to the second position deviations between the k-th text region and all candidate subtitle regions in the candidate subtitle region list being greater than the deviation threshold, determine the k-th text region as the y-th candidate subtitle region, and add the y-th candidate subtitle region to the candidate subtitle region list.
  • the first position deviation includes the difference between the two top lines and the difference between the two bottom lines; the second position deviation includes the difference between the two top lines or the difference between the two bottom lines; y is a positive integer less than or equal to n, k is a positive integer less than or equal to m, w is a positive integer less than or equal to n, and m and n are positive integers.
  • steps 2021 to 2025 describe the method of sorting the text regions to obtain the candidate subtitle regions, taking as an example that the text list includes m pieces of text data and that a text region is described by the positions of the top line and bottom line of a rectangle.
  • During implementation, the text regions can be read in sequence: the first text region is directly used as a candidate subtitle region and placed in the candidate subtitle region list; then, starting from the second text region, each text region is compared with the existing candidate subtitle regions in the list to see whether it matches one of them (the deviation of the two regions' top lines must be less than the deviation threshold, and the deviation of the bottom lines must also be less than the deviation threshold). If there is a matching candidate subtitle region, the text region is attributed to that candidate subtitle region; if there is no matching candidate subtitle region, the text region is stored in the candidate subtitle region list as a new candidate subtitle region. In this way, each text region in the text list is traversed, and the candidate subtitle regions stored in the candidate subtitle region list are obtained.
  • a candidate subtitle region may contain multiple text regions, but it has only one region position (including a top line and a bottom line): the top and bottom lines of the tallest text region belonging to the candidate subtitle region.
  • During implementation, when a newly added text region is taller than the current region position of the candidate subtitle region, the region position is updated to that of the newly added text region; if the height of the newly added text region is smaller than that of the current region position, the current region position of the candidate subtitle region is kept unchanged.
  • Exemplarily, the height of each text region (the difference between its top and bottom lines) is first calculated, and the text regions are sorted by height from small to large to obtain a text region order list; the text regions are then read in the order of this list, starting from the first text region, to determine the candidate subtitle regions.
  • In this way, the problem of inaccurately determined subtitle regions can be avoided.
  • For example, as shown in FIG. 8, suppose the first text area 701 is smaller than the third text area 703, which is smaller than the second text area 702; the positional deviation between the first text area 701 and the second text area 702 is greater than the deviation threshold, the positional deviation between the second text area 702 and the third text area 703 is less than the deviation threshold, and the positional deviation between the first text area 701 and the third text area 703 is less than the deviation threshold. If the text areas are extracted in the order of the first text area 701, the second text area 702, and the third text area 703, then when the second text area 702 is extracted, it will be treated as a new candidate subtitle region because its positional deviation from the first text area 701 is greater than the deviation threshold, which leads to an inaccurate recognition result for the candidate subtitle region. If instead the text areas are sorted by height, the third text area 703 is extracted right after the first text area 701; since the positional deviation between the third text area 703 and the first text area 701 is less than the deviation threshold, and the height of the third text area 703 is greater than that of the first text area 701, the region position of the candidate subtitle region is updated to that of the third text area 703. Then, when the second text area 702 is extracted, it is also classified into this candidate subtitle region because its positional deviation from the third text area 703 is less than the deviation threshold, and the region position of the candidate subtitle region is updated to that of the second text area 702.
  • Steps 2021 to 2025 take horizontal subtitles as an example and use the top and bottom lines to describe the text area; for vertical subtitles, the above top and bottom lines are replaced by the left and right lines, that is, the text area is described by its left and right lines.
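  • A sketch of steps 2021 to 2025 for horizontal subtitles, processing regions in ascending order of height as described above (records are the TextData sketch from earlier; the threshold is the 40-pixel example):

```python
# Sketch: cluster text regions into candidate subtitle regions; each cluster's
# region position settles on its tallest member.
def cluster_candidate_regions(text_list, threshold=40):
    candidates = []  # each: {"top": ..., "bottom": ..., "texts": [TextData]}
    for t in sorted(text_list, key=lambda r: r.y2 - r.y1):  # height ascending
        for region in candidates:
            if abs(region["top"] - t.y1) < threshold and \
               abs(region["bottom"] - t.y2) < threshold:
                region["texts"].append(t)
                if t.y2 - t.y1 > region["bottom"] - region["top"]:
                    region["top"], region["bottom"] = t.y1, t.y2  # taller wins
                break
        else:
            candidates.append({"top": t.y1, "bottom": t.y2, "texts": [t]})
    return candidates
```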
  • Step 2031 Calculate the repetition rate of each candidate subtitle region in the n candidate subtitle regions, where the repetition rate is used to describe the repetition probability of the text content appearing in the candidate subtitle region.
  • the repetition rate is the ratio of the cumulative duration to the total video duration of the video, where the cumulative duration is the sum of the display durations of the same text content.
  • Exemplarily, a method for calculating the repetition rate is given: obtain the j-th group of text contents corresponding to the j-th candidate subtitle region, where the j-th group includes at least one text content corresponding to the j-th candidate subtitle region, j is a positive integer less than or equal to n, and n is a positive integer; classify identical text contents in the j-th group into one text content set, obtaining x text content sets in total; calculate the sum of the display durations of the text contents in each text content set to obtain a cumulative duration, obtaining x cumulative durations in total, where x is a positive integer; calculate the ratio of the maximum cumulative duration (the maximum of the x cumulative durations) to the total video duration to obtain the repetition rate; and repeat the above steps to calculate the repetition rate of each candidate subtitle region.
  • That is, the repetition rate is the ratio of the cumulative display duration of the same text content displayed in the candidate subtitle region to the total video duration. If the same text content is always displayed at one position, that position is likely to contain interfering text (a video title, watermark, etc.).
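  • A sketch of this computation on the clusters produced above:

```python
# Sketch: group identical text contents within one candidate region, sum each
# group's display durations, and divide the largest sum by the video duration.
from collections import defaultdict

def repetition_rate(region, video_duration):
    cumulative = defaultdict(float)
    for t in region["texts"]:
        cumulative[t.content] += t.duration
    return max(cumulative.values(), default=0.0) / video_duration
```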
  • Step 2032 Determine the candidate subtitle regions whose text content repetition rate is lower than the repetition rate threshold as preliminarily screened subtitle regions.
  • the repetition rate threshold can be arbitrarily set.
  • the repetition rate threshold may be 10%.
  • a candidate subtitle region whose repetition rate is higher than the repetition rate threshold may be the text region where a watermark is located, the text region where the video title is located, or another region in the video where text content with fixed wording (few changes) is displayed.
  • Step 2033 Calculate the total display duration of each preliminarily screened subtitle region.
  • Exemplarily, a method for calculating the total display duration is given: calculate the sum of the display durations of the text contents corresponding to the preliminarily screened subtitle region to obtain the total display duration of that region.
  • the total display duration of each preliminarily screened subtitle region is calculated; the total display duration is the total duration for which text content is displayed in the preliminarily screened subtitle region.
  • In the video, text may be displayed briefly at some positions; for example, at the beginning of a TV series episode, the current episode number is displayed in the middle of the screen, or some pictures containing text are briefly captured in the video.
  • Such an area is not a subtitle region: text content is displayed in a subtitle region for a long time. Therefore, the preliminarily screened subtitle region with the longest total display duration is used as the subtitle region.
  • For example, if the first text content in a preliminarily screened subtitle region is displayed for 1 s, the second text content for 2 s, and the third text content for 6 s, the total display duration of that region is 1 s + 2 s + 6 s = 9 s.
  • Step 2034 Among the preliminarily screened subtitle regions, determine the one with the longest total display duration as the subtitle region.
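  • Putting steps 2032 to 2034 together, with the 10% example threshold mentioned above:

```python
# Sketch: keep regions whose repetition rate is below the threshold, then pick
# the one whose text contents are displayed for the longest total time.
def select_subtitle_region(candidates, video_duration, rate_threshold=0.10):
    prescreened = [r for r in candidates
                   if repetition_rate(r, video_duration) < rate_threshold]
    def total_display(region):
        return sum(t.duration for t in region["texts"])
    return max(prescreened, key=total_display, default=None)
```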
  • In addition to the above, other subtitle region screening strategies can also be used to filter the subtitle regions.
  • For example, a text region whose top or bottom line is inclined at an angle greater than an angle threshold can be directly removed from the candidate subtitle regions: since subtitles are usually oriented in a regular direction (horizontal or vertical), text data with irregular directions can be removed directly.
  • For another example, if the display color of the subtitles is known, the text data corresponding to text content displayed in other colors can be deleted from the text list, and the subtitle region is then identified on the trimmed text list using the method provided by this application.
  • the computer device may identify the subtitle of the video according to the text content belonging to the subtitle area.
  • the text content in the text data corresponding to the subtitle area is trimmed and used as the subtitle of the video.
  • the color of the subtitles can also be changed. Since the OCR model can identify the pixels where the text content is located in the image frames when the text list is obtained, after the subtitles are obtained according to the subtitle region, the color of the pixels where the subtitles are located can be changed, realizing automatic subtitle recognition and quick subtitle editing.
  • the method provided by this embodiment can quickly modify the color of the subtitles, so that the subtitles are distinguished from the overall color of the video and the legibility of the subtitles is improved.
  • the computer device receives a color editing instruction, and the color editing instruction is used to indicate the target color; the text content belonging to the subtitle area is modified to the target color, and the target video is generated, and the subtitles in the target video are displayed in the target color.
  • the computer device modifies the pixel points corresponding to the text content in the subtitle area in the image frame of the video to the target color.
  • After identifying the text content in the video, the method identifies the part of the text content belonging to the subtitles, and edits and processes the subtitles separately, realizing quick editing of the subtitles without affecting other text content in the video.
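  • A sketch of the recoloring, under the assumption that the OCR step also yields a boolean pixel mask of the subtitle strokes for each frame:

```python
import numpy as np

def recolor_subtitle(frame, text_mask, target_bgr=(0, 255, 255)):
    # frame: HxWx3 image; text_mask: HxW boolean mask of subtitle pixels.
    out = frame.copy()
    out[text_mask] = target_bgr  # paint subtitle pixels with the target color
    return out
```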
  • In summary, in the method provided by this embodiment, the video frame images of the video are obtained first, the OCR model then performs text recognition on the video frame images, and the candidate text contents obtained by text recognition are deduplicated to obtain the text list containing the text contents, so that the text data in the video is extracted and it is convenient to discriminate the subtitle region according to the text data.
  • In the method provided by this embodiment, candidate subtitle regions are first obtained from the text regions: the multiple text regions obtained through text recognition are normalized into several approximate regions of the subtitle region, which facilitates the subsequent identification of the subtitle region according to the subtitle region screening strategy.
  • In the method provided by this embodiment, by calculating the repetition rate of each candidate subtitle region, the candidate subtitle regions that display watermarks, video titles, and other content with a long display time and a single display content are identified, and these candidate subtitle regions are removed to obtain the preliminarily screened subtitle regions.
  • the method provided in this embodiment removes from the preliminarily screened subtitle regions the regions that display text content only briefly, by calculating the total display duration of each preliminarily screened subtitle region. Since the subtitle region usually displays text content for a long time, the preliminarily screened subtitle region with the longest total display duration can be determined as the subtitle region according to this feature.
  • FIG. 9 shows a flowchart of a method for identifying a subtitle area provided by an exemplary embodiment of the present application.
  • the method can be performed by a computer device, for example, a terminal or a server as shown in FIG. 1 .
  • the method includes the following steps.
  • Step 101 Identify the video to obtain n candidate subtitle regions, where the candidate subtitle regions are regions displayed by text content in the video, and n is a positive integer.
  • Step 801 performing speech recognition on the video to obtain a speech recognition result.
  • a speech recognition result is obtained by performing speech recognition on the audio in the video, and the speech recognition result includes at least one recognized text content.
  • Step 802 Among the n candidate subtitle regions, determine a candidate subtitle region whose similarity between its text content and the speech recognition result is higher than a threshold as the reference subtitle region.
  • the speech recognition result is compared with the text content corresponding to each candidate subtitle region, and the similarity is calculated.
  • Exemplarily, the similarity is equal to the ratio of the number of identical text contents to the total number of text contents corresponding to the candidate subtitle region.
  • the identical text contents are those text contents corresponding to the candidate subtitle region that are the same as text contents in the speech recognition result.
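  • A sketch of this similarity on the clusters used above, with the speech recognition result given as a list of recognized sentences:

```python
# Sketch: the share of a region's text contents that also occur in the
# speech recognition result.
def region_speech_similarity(region, recognized_texts):
    texts = [t.content for t in region["texts"]]
    if not texts:
        return 0.0
    recognized = set(recognized_texts)
    return sum(1 for c in texts if c in recognized) / len(texts)
```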
  • Step 1021 According to the subtitle region screening strategy and the reference subtitle region, obtain the subtitle region by screening the n candidate subtitle regions.
  • Exemplarily, the candidate subtitle regions are sorted by total display duration from high to low to obtain a sorting result.
  • the default sorting weight of each candidate subtitle region is 1, and the sorting weight of the reference subtitle region is set to 2; the total display duration is weighted by the sorting weight to obtain the weighted total display duration, and the sorting result is revised accordingly.
  • the candidate subtitle region with the longest weighted total display duration in the revised sorting result is determined as the subtitle region.
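  • A sketch of step 1021 with the example weights above (2 for the reference region, 1 otherwise), multiplying each region's total display duration by its weight before ranking:

```python
def select_with_reference(candidates, reference_region):
    def weighted_total(region):
        weight = 2 if region is reference_region else 1
        return weight * sum(t.duration for t in region["texts"])
    return max(candidates, key=weighted_total, default=None)
```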
  • In summary, in the method provided by this embodiment, the subtitle region is recognized in combination with the speech recognition result. Since subtitles usually transcribe the speech of characters in the video, the text content displayed in the subtitle region usually fits the speech recognition result, and determining the subtitle region with reference to the speech recognition result can improve the recognition accuracy of the subtitle region.
  • FIG. 10 shows a flowchart of a method for identifying a subtitle area provided by an exemplary embodiment of the present application.
  • the method can be performed by a computer device, for example, a terminal or a server as shown in FIG. 1 .
  • the method includes the following steps.
  • Step 601: The computer device performs data acquisition.
  • for example, videos of popular user accounts in a video application are first obtained, where popular user accounts are user accounts with a large number of followers, a large number of video views, or a high ranking on a leaderboard.
  • all videos under these popular accounts are obtained as the videos whose subtitle regions are to be identified.
  • Step 602 the computer device performs a subtitle extraction service.
  • the subtitle region identification method provided in this application is used to identify the subtitle region in the video.
  • for example, as shown in FIG. 11, video OCR frame-cutting processing 802 is first performed on the UGC (User Generated Content): video frame images are captured, text recognition is performed on the video frame images to obtain recognition results, and the candidate text content in the recognition results is deduplicated to obtain a text list. This yields the text content, the display duration 803 of the text content, and the text region 804 of the text content.
  • the text regions 804 are then normalized to obtain multiple candidate subtitle regions, the repetition rate of each candidate subtitle region is calculated, and repeated-text judgment 805 selects the preliminarily screened subtitle regions whose repetition rate is lower than the repetition rate threshold; the total display duration of each preliminarily screened subtitle region is then calculated, and duration judgment 806 selects the preliminarily screened subtitle region with the longest total display duration as the subtitle region 807. A sketch of the frame-cutting step appears below.
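  • The frame-cutting step 802 might look like the following sketch; OpenCV is an assumed tool (the disclosure does not name a library), and the 2 frames-per-second rate is the example sampling rate given in the description:

    import cv2  # OpenCV; assumed available

    def cut_frames(video_path, frames_per_second=2):
        # Periodically capture video frame images, keeping each frame's timestamp
        # so that recognized text can later be assigned a display time.
        capture = cv2.VideoCapture(video_path)
        native_fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
        step = max(int(round(native_fps / frames_per_second)), 1)
        frames, index = [], 0
        while True:
            ok, image = capture.read()
            if not ok:
                break
            if index % step == 0:
                frames.append((index / native_fps, image))  # (seconds, frame image)
            index += 1
        capture.release()
        return frames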
  • Step 603 the computer device performs post-processing on the text content in the subtitle area.
  • the post-processing includes at least one of short sentence merging, special symbol stripping, text density stripping, text word count stripping, duplicate recognition merging, and single letter and number culling.
  • short-sentence merging is used to merge ultra-short sentences (e.g., "ah", "ok") in the text content.
  • special-symbol stripping is used to remove non-text data (e.g., emoticons) from the text content.
  • text-density stripping is used to remove over-long sentences from the text content.
  • text word-count stripping is used to split the text content according to a stripping length, for example, every 2-14 characters.
  • duplicate-recognition merging is used to merge data with repeated text content.
  • single letter and number culling is used to remove, from the text content, single letters or digits that do not belong to the target language (e.g., when the target language is Chinese). A sketch of these rules follows.
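  • A sketch of these post-processing rules (the thresholds and the order of the rules are assumptions; the disclosure lists the rules without fixing parameters):

    import re

    def post_process(lines, max_len=14):
        processed = []
        for line in lines:
            line = re.sub(r"[^\w]", "", line)        # special-symbol stripping
            if not line:
                continue
            if re.fullmatch(r"[A-Za-z0-9]", line):   # single letter / digit culling
                continue
            if processed and processed[-1] == line:  # duplicate-recognition merging
                continue
            if len(line) <= 2 and processed:         # short-sentence merging: fold
                processed[-1] += line                # ultra-short lines into the last
                continue
            if len(line) > max_len:
                # density / word-count stripping: split over-long lines into chunks
                processed.extend(line[i:i + max_len]
                                 for i in range(0, len(line), max_len))
            else:
                processed.append(line)
        return processed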
  • Step 604: The computer device verifies delivery quality.
  • the computer device verifies the automatically recognized subtitles by using the manual annotation result of the video subtitles.
  • for example, the obtained subtitle recognition results are sampled and tested: recognition results are randomly selected to construct a test set, and confidence verification is performed. If the confidence is within the interval of 95±3%, the recognition results are determined to be accurate and the data is delivered 605; the text content in the recognition results and the audio of the corresponding time period in the video are used as training samples for the speech-to-text model.
  • the confidence is equal to the ratio of the number of correctly recognized characters in the subtitle recognition results to the total number of characters in the subtitle recognition results. A sketch of this check follows.
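  • A sketch of the sampling check; character-by-character positional comparison against the manual annotation is an assumption, as is the sample size:

    import random

    def confidence_check(annotated_pairs, sample_size=100, low=0.92, high=0.98):
        # annotated_pairs: (recognized_text, manually_annotated_text) tuples.
        test_set = random.sample(annotated_pairs, min(sample_size, len(annotated_pairs)))
        correct = sum(sum(a == b for a, b in zip(recognized, reference))
                      for recognized, reference in test_set)
        total = sum(len(recognized) for recognized, _ in test_set)
        confidence = correct / total if total else 0.0
        return low <= confidence <= high, confidence  # within 95±3% -> deliver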
  • in summary, by using the subtitle region identification method provided in this application to perform subtitle recognition, the method provided in this embodiment can accurately identify the subtitle content in the video; then, from the identified subtitle content and the audio of the corresponding time period in the video, the training samples of the speech-to-text model can be obtained.
  • training the speech-to-text model on the subtitle content and the audio can save human resources in the sample acquisition process and improve sample acquisition efficiency. A sketch of pairing subtitle text with audio segments follows.
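  • Pairing the recognized subtitle text with the audio of its time span could be sketched as follows; ffmpeg is an assumed tool, and the output format is illustrative (the disclosure does not prescribe either):

    import os
    import subprocess

    def export_training_samples(video_path, subtitle_records, out_dir="samples"):
        # subtitle_records: (text, start_seconds, end_seconds) tuples taken from
        # the text data belonging to the identified subtitle region.
        os.makedirs(out_dir, exist_ok=True)
        pairs = []
        for i, (text, start, end) in enumerate(subtitle_records):
            wav_path = os.path.join(out_dir, f"sample_{i:05d}.wav")
            subprocess.run(
                ["ffmpeg", "-y", "-i", video_path, "-ss", str(start), "-to", str(end),
                 "-vn", "-acodec", "pcm_s16le", "-ar", "16000", wav_path],
                check=True)
            pairs.append({"audio": wav_path, "text": text})  # one ASR training pair
        return pairs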
  • FIG. 12 shows a schematic structural diagram of an apparatus for subtitle recognition provided by an exemplary embodiment of the present application.
  • the apparatus can be implemented by software, hardware, or a combination of the two as all or part of a computer device, and the apparatus includes the following modules.
  • the recognition module 901 is configured to recognize the video to obtain n candidate subtitle regions, where a candidate subtitle region is a region in which text content in the video is displayed, and n is a positive integer;
  • the screening module 903 is configured to screen the n candidate subtitle regions according to a subtitle region screening strategy to obtain the subtitle region, where the subtitle region screening strategy is used to determine, as the subtitle region, the candidate subtitle region whose text content repetition rate is lower than the repetition rate threshold and whose total display duration is the longest.
  • the apparatus further includes:
  • a calculation module 904, configured to calculate the repetition rate of each candidate subtitle region in the n candidate subtitle regions, where the repetition rate is used to describe the repetition probability of the text content appearing in the candidate subtitle region;
  • the screening module 903 is further configured to determine a candidate subtitle region whose text content repetition rate is lower than the repetition rate threshold as a preliminarily screened subtitle region;
  • the calculation module 904 is further configured to calculate the total display duration of each preliminarily screened subtitle region;
  • the screening module 903 is further configured to determine, among the preliminarily screened subtitle regions, the preliminarily screened subtitle region with the longest total display duration as the subtitle region.
  • the calculation module 904 is further configured to acquire the jth group of text content corresponding to the jth candidate subtitle region, where the jth group of text content includes at least one piece of text content corresponding to the jth candidate subtitle region, j is a positive integer less than or equal to n, and n is a positive integer;
  • the calculation module 904 is further configured to classify identical text content in the jth group of text content into one text content set, obtaining x text content sets in total;
  • the calculation module 904 is further configured to calculate the sum of the display durations of the text content in each text content set to obtain a cumulative duration, obtaining x cumulative durations in total, where x is a positive integer;
  • the calculation module 904 is further configured to calculate the ratio of the maximum cumulative duration to the total video duration of the video to obtain the repetition rate, where the maximum cumulative duration is the maximum value of the at least one cumulative duration;
  • the calculation module 904 is further configured to repeat the above four steps to calculate the repetition rate of each candidate subtitle region;
  • the calculation module 904 is further configured to calculate the sum of the display durations of the text content corresponding to the preliminarily screened subtitle region to obtain the total display duration of the preliminarily screened subtitle region.
  • the apparatus further includes:
  • An identification module 901, configured to identify the text content in the video and the text area where the text content is located;
  • the candidate module 902 is configured to, according to the positional relationship of the text regions, cluster the text regions whose positional deviation is less than a deviation threshold into the same candidate subtitle region, and obtain the n candidate subtitle regions in total.
  • the text list includes m pieces of text data, the text region includes the upper edge line and the lower edge line of a rectangle, and m is a positive integer;
  • the candidate module 902 is further configured to extract one text region from the m text regions corresponding to the m pieces of text content as the 1st text region, determine the 1st text region as the 1st candidate subtitle region, and add the 1st candidate subtitle region to a candidate subtitle region list;
  • the candidate module 902 is further configured to perform the following steps cyclically until the remaining number of the m text regions is 0: extract one text region from the remaining m-k+1 text regions as the kth text region; in response to a first position deviation between the kth text region and the wth candidate subtitle region in the candidate subtitle region list being less than the deviation threshold, classify the kth text region into the wth candidate subtitle region;
  • in response to second position deviations between the kth text region and all candidate subtitle regions in the candidate subtitle region list being greater than the deviation threshold, determine the kth text region as the yth candidate subtitle region and add the yth candidate subtitle region to the candidate subtitle region list;
  • the first position deviation includes the difference between the two upper edge lines and the difference between the two lower edge lines, and the second position deviation includes the difference between the two upper edge lines or the difference between the two lower edge lines;
  • y is a positive integer less than or equal to n, k is a positive integer less than or equal to m, w is a positive integer less than or equal to n, and m and n are positive integers.
  • the candidate module 902 is further configured to: calculate a first height of the kth text region, where the first height is the difference between the upper edge line and the lower edge line of the kth text region; calculate a second height of the wth candidate subtitle region, where the second height is the difference between the upper edge line and the lower edge line of the wth candidate subtitle region; and, in response to the first height being greater than the second height, determine the kth text region as the wth candidate subtitle region;
  • where k is a positive integer less than or equal to m, w is a positive integer less than or equal to n, and n and m are positive integers. A sketch of this clustering procedure follows.
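  • A sketch of this clustering; image coordinates with y increasing downward are assumed, the 40-pixel deviation threshold is within the 30-50 pixel range suggested in the description, and regions are processed from shortest to tallest so that a taller member updates the region position, as in the FIG. 8 discussion:

    def cluster_text_regions(text_regions, deviation_threshold=40):
        # text_regions: (top_y, bottom_y) edge lines of the recognized text boxes.
        candidates = []  # each candidate keeps [top_y, bottom_y] as its region position
        for top, bottom in sorted(text_regions, key=lambda r: r[1] - r[0]):
            for candidate in candidates:
                if (abs(top - candidate[0]) < deviation_threshold
                        and abs(bottom - candidate[1]) < deviation_threshold):
                    if bottom - top > candidate[1] - candidate[0]:
                        candidate[0], candidate[1] = top, bottom  # taller: update position
                    break
            else:
                candidates.append([top, bottom])  # no match: new candidate region
        return candidates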
  • the identifying module 901 is further configured to identify the text content in the video, the text area where the text content is located, and the display duration of the text content.
  • the apparatus further includes:
  • an acquisition module 905, configured to periodically intercept the video frame images of the video
  • the identifying module 901 is further configured to identify the text content in the video frame image, the text area where the text content is located, and the display duration of the text content.
  • the recognition module 901 is further configured to call an optical character recognition (OCR) model to recognize the video frame image, obtain the candidate text content in the video frame image and the text region of the candidate text content, and obtain the display time of the candidate text content according to the display time of the video frame image;
  • the recognition module 901 is further configured to deduplicate the candidate text content to obtain the text content; the deduplication includes determining, among multiple pieces of candidate text content having consecutive display times, the same text region, and the same content, the candidate text content with the earliest display time as the text content, and calculating the display duration of the text content according to the display times of the multiple pieces of candidate text content. A sketch of this deduplication follows.
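  • A sketch of the deduplication; records are assumed to be grouped by text region and sorted by display time, and giving a single-frame text the sampling interval as its duration follows the example in the description:

    def deduplicate(frame_records, frame_interval=0.5):
        # frame_records: (display_time, region, text) tuples, grouped by region
        # and sorted by time. Identical text in the same region across consecutive
        # frames is merged into one record.
        merged = []
        for display_time, region, text in frame_records:
            last = merged[-1] if merged else None
            if (last and last["region"] == region and last["text"] == text
                    and display_time - last["end"] <= frame_interval):
                last["end"] = display_time  # same text still on screen
            else:
                merged.append({"text": text, "region": region,
                               "start": display_time, "end": display_time})
        for record in merged:
            # text seen in only one frame gets the sampling interval as its duration
            record["duration"] = (record["end"] - record["start"]) or frame_interval
        return merged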
  • the apparatus further includes:
  • the subtitle module 906 is configured to identify the subtitle of the video according to the text content belonging to the subtitle area.
  • the apparatus further includes: a subtitle module 906, configured to receive a color editing instruction, where the color editing instruction is used to indicate a target color;
  • the subtitle module 906 is configured to modify the text content belonging to the subtitle area to the target color to generate a target video, and the subtitles in the target video are displayed in the target color.
  • the apparatus further includes:
  • a receiving module, configured to receive a color editing instruction, where the color editing instruction is used to indicate a target color;
  • the editing module is configured to modify the text content belonging to the subtitle area to the target color, and generate a target video, and the subtitles in the target video are displayed in the target color.
  • the apparatus further includes:
  • a speech recognition module for performing speech recognition on the video to obtain a speech recognition result
  • a reference module configured to determine a candidate subtitle region in which the similarity between the text content and the speech recognition result is higher than a threshold in the n candidate subtitle regions as a reference subtitle region;
  • the screening module 903 is further configured to select the subtitle region from the n candidate subtitle regions according to the subtitle region screening strategy and the reference subtitle region.
  • the screening module 903 is further configured to sort the n candidate subtitle regions according to a subtitle region screening strategy to obtain a sorting result
  • the screening module 903 is further configured to increase the sorting weight of the reference subtitle region and correct the sorting result based on the sorting weights of the n candidate subtitle regions;
  • the screening module 903 is further configured to select the subtitle region from the n candidate subtitle regions based on the revised sorting result.
  • FIG. 13 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1000 includes a central processing unit (CPU) 1001, a system memory 1004 including a random access memory (RAM) 1002 and a read-only memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001.
  • Server 1000 also includes a basic input/output system (I/O system) 1006 that facilitates the transfer of information between various components within the computer, and a mass storage device 1007 for storing operating system 1013, application programs 1014, and other program modules 1015 .
  • the basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse or keyboard, for user input of information, where both the display 1008 and the input device 1009 are connected to the central processing unit 1001 through an input/output controller 1010 connected to the system bus 1005.
  • the basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from various other devices such as a keyboard, mouse, or electronic stylus. Similarly, input/output controller 1010 also provides output to a display screen, printer, or other type of output device.
  • Mass storage device 1007 is connected to central processing unit 1001 through a mass storage controller (not shown) connected to system bus 1005 .
  • Mass storage device 1007 and its associated computer-readable media provide non-volatile storage for server 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
  • Computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, tape cassettes, magnetic tape, disk storage, or other magnetic storage devices.
  • the system memory 1004 and the mass storage device 1007 described above may be collectively referred to as memory.
  • the server 1000 may also run by connecting, through a network such as the Internet, to a remote computer on the network. That is, the server 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to other types of networks or remote computer systems (not shown).
  • the present application also provides a terminal, the terminal includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the subtitle region identification method provided by the above method embodiments. It should be noted that the terminal may be the terminal provided in FIG. 14 below.
  • FIG. 14 shows a structural block diagram of a terminal 1100 provided by an exemplary embodiment of the present application.
  • the terminal 1100 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Terminal 1100 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 1100 includes: a processor 1101 and a memory 1102 .
  • the processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1101 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 1101 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 1101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 1101 may further include an AI (Artificial Intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1102 is used to store at least one instruction, and the at least one instruction is executed by the processor 1101 to implement the subtitle region identification method provided by the method embodiments of the present application.
  • the display screen 1105 is used for displaying UI (User Interface, user interface).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • when the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to acquire touch signals on or above its surface.
  • the touch signal can be input to the processor 1101 as a control signal for processing.
  • the display screen 1105 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards.
  • in some embodiments, there may be one display screen 1105, disposed on the front panel of the terminal 1100; in other embodiments, there may be at least two display screens 1105, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in still other embodiments, the display screen 1105 may be a flexible display screen disposed on a curved or folded surface of the terminal 1100. The display screen 1105 may even be set to a non-rectangular irregular shape, that is, a special-shaped screen.
  • the display screen 1105 can be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
  • those skilled in the art can understand that the structure shown in FIG. 14 does not constitute a limitation on the terminal 1100, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • the memory further includes one or more programs, the one or more programs are stored in the memory, and the one or more programs include instructions for performing the subtitle region identification method provided by the embodiments of the present application.
  • the present application also provides a computer device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the subtitle region identification method provided by the above method embodiments.
  • the present application also provides a computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the subtitle region identification method provided by the above method embodiments.
  • the present application also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the subtitle region identification method provided in the foregoing optional implementation manner.
  • references herein to "a plurality of" mean two or more.
  • "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone.
  • the character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

Abstract

A subtitle region identification method, apparatus, device, and storage medium, relating to the field of computer vision technology in artificial intelligence. The method includes: recognizing a video to obtain n candidate subtitle regions, where a candidate subtitle region is a region in which text content in the video is displayed, and n is a positive integer (101); and screening the n candidate subtitle regions according to a subtitle region screening strategy to obtain the subtitle region, where the subtitle region screening strategy is used to determine, as the subtitle region, the candidate subtitle region whose text content repetition rate is lower than a repetition rate threshold and whose total display duration is the longest (102). The above method, apparatus, device, and system can save the human resources required for subtitle region identification.

Description

字幕区域识别方法、装置、设备及存储介质
本申请要求于2020年10月27日提交的申请号为202011165751.0、发明名称为“字幕区域识别方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能的计算机视觉技术领域,特别涉及一种字幕区域识别方法、装置、设备及存储介质。
背景技术
随着短视频的普及,在多种场景下都需要应用到视频中的字幕提取技术,例如,在语音转文字模型的训练过程中,需要使用视频中的字幕作为训练样本。
相关技术中,由于短视频中的文字信息不一定都是字幕的文字,还可能包括品牌水印文字、视频标题文字等等。因此,对于短视频中字幕的提取,是通过人工进行字幕区域标注,然后使用OCR(Optical Character Recognition,光学字符识别)技术对标注位置进行文字识别得到字幕。例如,人工对视频进行截图,然后用图像查看软件打开截图,将鼠标移动至字幕的左上角以及右下角位置,可以得到两个位置的坐标,进而得到字幕的位置。
相关技术中的方法,需要耗费大量人力进行字幕的提取。
发明内容
本申请实施例提供了一种字幕区域识别方法、装置、设备及存储介质,可以自动进行字幕提取,节省人力资源。所述技术方案如下。
根据本申请的一个方面,提供了一种字幕区域识别方法,所述方法由计算机设备执行,所述方法包括:
识别视频得到n个候选字幕区域,候选字幕区域为所述视频中的文字内容所显示的区域,n为正整数;
根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,所述字幕区域筛选策略用于将文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为所述字幕区域。
根据本申请的另一方面,提供了一种字幕识别装置,所述装置包括:
识别模块,用于识别视频得到n个候选字幕区域,候选字幕区域为所述视频中的文字内容所显示的区域,n为正整数;
筛选模块,用于根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,所述字幕区域筛选策略用于将文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为所述字幕区域。
根据本申请的另一方面,提供了一种计算机设备,所述计算机设备包括:处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上方面所述的字幕区域识别方法。
根据本申请的另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如上方面所述的字幕区域识别方法。
根据本公开实施例的另一个方面,提供一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述可选实现方式中提供的字幕区域识别方法。
本申请实施例提供的技术方案带来的有益效果至少包括如下的有益效果。
通过使用字幕区域筛选策略,对从视频中识别出的候选字幕区域进行筛选得到字幕区域。根据字幕显示位置固定、文本内容多样、显示时长较长的特征从候选字幕区域中选出字幕区域,从而可以根据字幕区域提取到视频的字幕,相比于使用人工对字幕区域进行标注的方法,该方法节省了字幕识别所需要的人力资源,加快字幕识别速度和效率。
附图说明
图1是本申请一个示例性实施例提供的计算机系统的框图;
图2是本申请另一个示例性实施例提供的字幕区域识别方法的方法流程图;
图3是本申请一个示例性实施例提供的字幕区域识别方法的方法流程图;
图4是本申请另一个示例性实施例提供的字幕区域识别方法的视频帧图像示意图;
图5是本申请另一个示例性实施例提供的字幕区域识别方法的视频帧图像示意图;
图6是本申请另一个示例性实施例提供的字幕区域识别方法的方法流程图;
图7是本申请另一个示例性实施例提供的字幕区域识别方法的视频帧图像示意图;
图8是本申请另一个示例性实施例提供的字幕区域识别方法的文字区域的示意图;
图9是本申请另一个示例性实施例提供的字幕区域识别方法的方法流程图;
图10是本申请另一个示例性实施例提供的字幕区域识别方法的方法流程图;
图11是本申请另一个示例性实施例提供的字幕区域识别方法的方法流程图;
图12是本申请另一个示例性实施例提供的字幕识别装置的框图;
图13是本申请另一个示例性实施例提供的服务器的结构示意图;
图14是本申请另一个示例性实施例提供的终端的框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
首先对本申请实施例涉及的若干个名词进行简介。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
计算机视觉技术(Computer Vision,CV)是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR(Optical Character Recognition,光学字符识别)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D(Three Dimensional,三维)技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。
OCR是英文Optical Character Recognition的缩写,意思是光学字符识别,也可简单地称为文字识别,是文字自动输入的一种方法。它通过扫描和摄像等光学输入方式获取纸张上的文字图像信息,利用各种模式识别算法分析文字形态特征可以将票据、报刊、书籍、文稿及其它印刷品转化为图像信息,再利用文字识别技术将图像信息转化为可以使用的计算机输入 技术。
图1示出了本申请一个示例性实施例提供的计算机系统的结构示意图,该计算机系统包括终端120和服务器140。
终端120与服务器140之间通过有线或者无线网络相互连接。
终端120包括智能手机、笔记本电脑、台式电脑、平板电脑、智能音箱、智能机器人中的至少一种。在一种可选的实现方式中,由终端将需要进行字幕识别的视频上传到服务器,服务器对终端上传的视频进行字幕识别。在另一种可选的方式中,服务器也可以对本地存储的视频进行字幕识别。在另一种可选的方式中,终端也可以对本地存储的视频进行字幕识别。在另一种可选的方式中,终端也可以通过网络下载视频,对下载的视频进行字幕识别。
示例性的,终端120还包括显示器;显示器用于显示视频的画面。
终端120包括第一存储器和第一处理器。第一存储器中存储有第一程序;上述第一程序被第一处理器调用执行以实现本申请提供的字幕区域识别方法。第一存储器可以包括但不限于以下几种:随机存取存储器(Random Access Memory,RAM)、只读存储器(Read Only Memory,ROM)、可编程只读存储器(Programmable Read-Only Memory,PROM)、可擦除只读存储器(Erasable Programmable Read-Only Memory,EPROM)、以及电可擦除只读存储器(Electric Erasable Programmable Read-Only Memory,EEPROM)。
第一处理器可以是一个或者多个集成电路芯片组成。可选地,第一处理器可以是通用处理器,比如,中央处理器(Central Processing Unit,CPU)或者网络处理器(Network Processor,NP)。可选地,第一处理器可以通过调用字幕识别算法来实现本申请提供的字幕区域识别方法。
服务器140包括第二存储器和第二处理器。第二存储器中存储有第二程序,上述第二程序被第二处理器调用来实现本申请提供的字幕区域识别方法。示例性的,第二存储器中存储有字幕识别算法。在一种可选的实现方式中,服务器接收终端发送的视频,使用字幕识别算法来进行字幕识别。可选地,第二存储器可以包括但不限于以下几种:RAM、ROM、PROM、EPROM、EEPROM。可选地,第二处理器可以是通用处理器,比如,CPU或者NP。
服务器140可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络、以及大数据和人工智能平台等基础云计算服务的云服务器。终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
示意性的,本申请提供的字幕区域识别方法可以应用于视频字幕提取、语音转文本模型的训练样本的获取等场景中。以使用本申请提供的字幕区域识别方法获取语音转文本模型的训练样本为例,在得到视频的字幕区域后,获取属于字幕区域的文字区域,以及文字区域对应的文本数据,文本数据中的文字内容即为训练样本的文字部分,根据文本数据中的显示时长(起始时刻和终止时刻)从视频中截取对应时间的音频,该音频为训练样本的语音部分,将文字部分和语音部分对应存储为训练样本。
图2示出了本申请一个示例性实施例提供的字幕区域识别方法的流程图。该方法可以由计算机设备来执行,例如,如图1所示的终端或服务器来执行。所述方法包括如下步骤。
步骤101,识别视频得到n个候选字幕区域,候选字幕区域为视频中的文字内容所显示的区域,n为正整数。
示例性的,视频可以是任意类型的视频文件,例如,短视频、电视剧、电影、综艺节目等。示例性的,视频中包括字幕。以短视频为例,在短视频画面中的文字,不仅包含字幕,还可能包含其他文字信息,例如,短视频应用程序的水印文字、短视频发布者的用户昵称、短视频的视频名称等等。因此,仅仅通过OCR技术进行文字识别是无法准确获得短视频的字 幕的,而人工对字幕区域进行标注,再对标注位置进行文字识别得到字幕的方式又需要耗费大量人力,因此,本申请提供了一种字幕识别方式,可以从视频中多个文字信息中准确识别出字幕,节省了人工标注字幕区域的步骤提高了字幕提取的效率。
示例性的,视频的获取方式可以是任意的,视频可以是计算机设备本地存储的视频文件,也可以是通过其他计算机设备获取的视频文件。例如,当计算机设备是服务器时,服务器可以接收由终端上传的视频文件;当计算机设备是终端时,终端也可以通过网络下载服务器上存储的视频文件。以计算机设备是服务器为例,在终端上可以安装有具有字模提取功能的客户端,用户可以在客户端的用户界面上选择本地存储的视频文件,并点击上传控件将视频文件上传至服务器,服务器对视频文件进行后续的字幕区域识别处理。
候选字幕区域是指视频中显示有文字内容的区域。示例性的,候选字幕区域包括视频中每一帧视频画面显示有文字内容的区域。候选字幕区域是一种区域位置,具有明确的区域范围、位置坐标。示例性的,将视频中位置相近的文字内容所在的文字区域聚类为一个候选字幕区域。
步骤102,根据字幕区域筛选策略从n个候选字幕区域中筛选得到字幕区域,字幕区域筛选策略用于将文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为字幕区域。
示例性的,基于字幕区域中所显示的文字内容多样、字幕区域长时间显示有文字内容的特征,从多个候选字幕区域中筛选出文字内容重复率低于重复率阈值,并且,长时间显示有文字内容的候选字幕区域,确定为字幕区域。
文字内容的重复率用于描述在该候选字幕区域中所显示的文字内容的多样性。文字内容的重复率高,即,在该候选字幕区域中会显示多种文字内容,文字内容的重复率低,即,在该候选字幕区域中只显示一种或几种文字内容。
显示总时长是指该候选字幕区域中显示有文字内容的总时长。由于字幕通常在视频中长时间显示,因此,选择长时间显示有文字内容的候选字幕区域作为字幕区域。
综上所述,本实施例提供的方法,通过使用字幕区域筛选策略,对从视频中识别出的候选字幕区域进行筛选得到字幕区域。根据字幕显示位置固定、文本内容多样、显示时长较长的特征从候选字幕区域中选出字幕区域,从而可以根据字幕区域提取到视频的字幕,相比于使用人工对字幕区域进行标注的方法,该方法节省了字幕识别所需要的人力资源,加快字幕识别速度和效率。
图3示出了本申请一个示例性实施例提供的字幕区域识别方法的流程图。该方法可以由计算机设备来执行,例如,如图1所示的终端或服务器来执行。所述方法包括如下步骤。
步骤201,识别视频中的文字内容、文字内容所在的文字区域。
示例性的,识别视频中的文字内容、文字内容所在的文字区域、文字内容的显示时长。文字内容、文字区域、显示时长之间具有对应关系。
示例性的,识别视频中的文字得到文本列表,文本列表包括至少一条文本数据,文本数据包括文字内容、文字区域和显示时长,文字内容包括位于文字区域上的至少一个文字。
示例性的,计算机设备对视频进行文字识别,得到文本列表。示例性的,文本列表可以是一个数据表格,其中的每一行代表一条文本数据,每一列为文本数据的具体内容:文字内容、文字区域以及显示时长。对于视频的一帧视频帧图像,图像上的不同区域可能包含不同的文字内容,对于视频的多帧视频帧图像,图像上的相同区域也可能在不同时间显示不同的文字内容,因此,将视频中文字区域不同、显示时间不同的多个文字内容提取出来,可以得到多条文本数据,组成文本列表。示例性的,如果视频中在相同文字区域的不同时间段内显示了相同的文字内容,则这两个文字内容分别属于两个文本数据,即,如果在连续的视频帧图像上的相同文字区域显示有相同的文字内容,则该文字内容属于一条文本数据,该连续地视频帧图像持续的时长即为该文本数据中的显示时长(文字内容的显示时长)。例如,在第 1-3s(秒)的视频帧图像上的第一区域显示了第一文字内容,在第3-4s的视频帧图像上的第一区域没有显示文字,在第4-5s的视频帧图像上的第一区域又显示了第一文字内容,则这两个第一文字内容分别对应两条文本数据,两条文本数据中的显示时长分别为2s和1s。
示例性的,通过对视频的每一帧画面进行文字识别,得到识别出的文字内容,以及文字内容在画面上的位置坐标,以及该帧画面的时间信息。对多帧画面进行文字识别得到的上述信息进行整理整合,得到文本列表。例如,在视频的第一帧画面上识别得到文字内容1和文字内容2,文字内容1在第一帧画面上位于位置1,文字内容2在第一帧画面上位于位置2,第一帧画面在视频中的时间为00:01;在视频的第二帧画面上识别得到文字内容1和文字内容3,文字内容1在第二帧画面上位于位置1,文字内容3在第二帧画面上位于位置3,第二帧画面在视频中的时间为00:05。因此,对两帧画面识别出的信息进行整合,可以得到由三条文本数据组成的文本列表。第一条文本数据:文字内容1、位置1、00:01至00:05共4分钟;第二条文本数据:文字内容2、位置2、00:01;第三条文本数据:文字内容3、位置3、00:05。
示例性的,文本列表还可以是由多个文本数据组成的数据集、数据库、文档文件等。
示例性的,文字区域包括用于框出文字的文字框的位置。示例性的,文字框是矩形框,文字框的位置可以用四条线(上边线、下边线、左边线和右边线)的位置来表达、也可以用文字框四个顶点的坐标来表达、也可以用文字框斜对角的两个顶点的坐标来表达。
步骤202,根据文字区域的位置关系,将位置偏差小于偏差阈值的文字区域聚类至同一个候选字幕区域,共得到n个候选字幕区域。
示例性的,将文字区域归整为n个候选字幕区域,属于第i个候选字幕区域的文字区域与第i个候选字幕区域的位置偏差小于偏差阈值,n为正整数,i为小于或等于n的正整数。
示例性的,聚类/归整是指按照文字区域的位置分布对文字区域进行归类,将位置偏差小于偏差阈值的多个文字区域归为同一类文字区域,即,同一个候选字幕区域。
示例性的,在得到文本列表后,文本列表中包括了多个文字区域,由于视频的字幕通常都显示在同一个区域位置,因此,将这些文字区域进行归整得到多个候选字幕区域。示例性的,由于不同字幕文字内容不同,其显示的区域范围可能也有些许差异,例如,如图4中的(1)和(2)分别为视频的两个视频帧图像,在两个视频帧图像上分别有位于第一文字区域501的第一文字内容和位于第二文字区域502的第二文字内容,这两个为文字内容都是字幕,但由于文字内容的字数以及行数不同,这两个文字内容的文字区域有些许差异,但这两个文字区域都为字幕区域,因此,在归整候选字幕区域时需要设定一个偏差阈值,若两个文字区域的位置偏差小于偏差阈值,则应该认为这两个文字区域属于同一个候选字幕区域,如此,便可以对文本列表中的多个文字区域进行归整,最终得到几个候选字幕区域。
示例性的,以计算第一文字区域和第二文字区域的位置偏差为例,第一文字区域包括第一上边线、第一下边线、第一左边线、第一右边线,第二文字区域包括第二上边线、第二下边线、第二左边线、第二右边线,位置偏差包括:第一上边线与第二上边线的偏差、第一下边线与第二下边线的偏差、第一左边线与第二左边线的偏差和第一右边线与第二右边线的偏差中的至少一种。示例性的,由于字幕通常为横向显示的字幕,则由于文字内容字数多少的不同,文字区域在左右方向上的位置差异较大,在上下方向上的位置差异较小,则位置偏差可以包括两个文字区域的两个上边线的偏差和两个下边线的偏差,即,将纵向位置相差不多的文字区域归为同一个候选字幕区域。示例性的,由于部分字幕是纵向显示的字幕,则位置偏差也可以包括两个文字区域的两个左边线的偏差和两个右边线的偏差,即,将横向位置相差不多的文字区域归为同一个候选字幕区域。
示例性的,偏差阈值的具体数值可以是任意的。示例性的,在经过反复试验后得出偏差阈值取30像素-50像素较佳,例如,偏差阈值设定为40像素,则将两个文字区域的两个上边线的偏差小于40像素,且两个下边线的偏差也小于40像素的两个文字区域归为同一个候选字幕区域。
示例性的,候选字幕区域具有一个区域位置,即,该候选字幕区域位于哪里,示例性的,候选字幕区域的区域位置为属于该候选字幕区域的最大文字区域。示例性的,候选字幕区域的区域位置为属于该候选字幕区域的高度最大的文字区域(对应横向显示的字幕),或,候选字幕区域的区域位置为属于该候选字幕区域的宽度最大的文字区域(对应纵向显示的字幕)。
示例性的,将文字区域归整为多个候选字幕区域后,可以在文本列表中增加一列候选字幕区域的数据,则每条文本数据中增加了一个所属候选字幕区域的数据,则,每个文字内容对应一个文字区域对应一个显示时长还对应一个候选字幕区域。
步骤203,根据字幕区域筛选策略从n个候选字幕区域中筛选得到字幕区域;字幕区域筛选策略用于将n个候选字幕区域中文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为字幕区域,显示总时长为属于候选字幕区域的全部文字内容的显示时长之和。
示例性的,显示总时长为属于候选字幕区域的全部文字内容的显示时长之和。
示例性的,在得到候选字幕区域,计算机设备可以调用字幕区域筛选策略的算法从候选字幕区域中识别出该视频的字幕区域。示例性的,由于视频中可能出现的部分干扰文字(非字幕文字)包括视频标题、应用程序水印、用户昵称等,而这些干扰文字具有显示时间长,且显示的文字单一不变的特点,因此,可以根据干扰文字的这些特征从文本数据中筛选出字幕区域。
示例性的,字幕区域筛选策略是根据干扰文字的显示特征和字幕的显示特征设定的。字幕具有显示时间长、位置固定、文字内容多样等特征。而干扰文字具有其他特征,例如,水印具有显示时间长、位置固定、文字内容单一等特征;视频标题具有显示时间短、位置固定、文字内容单一等特征;基于字幕与干扰文字的不同特征,可以将字幕所在的字幕区域从候选字幕区域中筛选出来。
本申请提供的字幕区域筛选策略,首先,分别判断每个候选字幕区域上是否显示单一的文字内容,若是单一的文字内容,则该候选字幕区域不是字幕区域。然后在剩下的候选字幕区域中选出显示总时长最长的候选字幕区域作为字幕区域。由于部分干扰文字,例如,电视剧标题文字,只会在视频开始的前几秒有显示,之后就不会再显示。例如,如图5所示,在视频帧图像上显示有视频标题401和字幕402,视频标题401在显示一会儿之后就会消失,该位置上不会再显示文字,而字幕402的位置会长时间地显示有文字。所以,从剩下的候选字幕区域中选出显示总时长最长的候选字幕区域作为字幕区域。
综上所述,本实施例提供的方法,通过使用字幕区域筛选策略,对从视频中识别出的文本列表中的文字区域进行筛选得到候选字幕区域,根据字幕显示位置固定、文本内容多样、显示时长较长的特征从候选字幕区域中选出字幕区域,从而可以根据字幕区域提取到视频的字幕,相比于使用人工对字幕区域进行标注的方法,该方法节省了字幕识别所需要的人力资源,加快字幕识别速度和效率。
示例性的,给出一种根据字幕区域筛选策略进行字幕区域筛选的示例性实施例。
图6示出了本申请一个示例性实施例提供的字幕区域识别方法的流程图。该方法可以由计算机设备来执行,例如,如图1所示的终端或服务器来执行。在图3所示的示例性实施例的基础上,步骤201还包括步骤2011至步骤2012,步骤202还包括步骤2021至步骤2025,步骤203还包括步骤2031至步骤2034。
步骤2011,周期性截取视频的视频帧图像。
示例性的,首先需要对视频进行截帧处理,截帧处理即为周期性地从视频中截取视频帧图像,将其顺序地存储。示例性的,从视频中截取视频帧图像的时间间隔(周期)可以是任意的,例如,每秒钟截取2张视频帧图像。示例性的,也可以将视频的每一帧画面都截取为视频帧图像。示例性的,一个视频可以截取到多帧视频帧图像。
步骤2012,识别视频帧图像中的文字内容、文字内容所在的文字区域、文字内容的显示 时长。
示例性的,识别视频帧图像中的文字得到文本列表。
示例性的,计算机设备对每一帧视频帧图像进行文字识别得到文本列表。
示例性的,调用光学字符识别OCR模型识别视频帧图像,得到视频帧图像中的候选文字内容和候选文字内容的文字区域,根据视频帧图像的显示时刻得到候选文字内容的显示时刻;对候选文字内容进行去重得到文字内容;去重包括将显示时刻连续、文字区域相同、候选文字内容相同的多个候选文字内容中显示时刻最早的候选文字内容确定为文字内容,根据多个候选文字内容的显示时刻计算文字内容的显示时长;根据文字内容、文字内容的文字区域和显示时长生成文本列表。
示例性的,调用OCR模型来识别视频帧图像中的文字,OCR模型输出视频帧图像中的候选文字内容以及候选文字内容的文字区域。如此,可以得到一个包含:候选文字内容、文字区域、显示时刻的数据表。
其中,视频帧图像的显示时刻是指该视频帧图像在视频中显示的时刻。从视频帧图像上提取出的候选文字内容的显示时刻与该视频帧图像的显示时刻相同。
OCR模型用于对视频帧图像进行文字识别,识别出视频帧图像中的文字,输出文字以及文字区域。示例性的,OCR模型为神经网络模型,可以采用任意一种已知的OCR模型。
例如,如图7所示,在视频的一帧视频帧图像中,显示有三条文字:第一文字301、第二文字302、第三文字303,OCR模型识别这三条文字输出:第一文字301的候选文字内容:“《三十**》妈妈能为孩子拼尽全力”,文字区域:第一文字框304左边界位置x1=2、右边界位置x2=8、上边界位置y1=10、下边界位置y2=8;第二文字302的候选文字内容:“怎怎么喝酒了”,文字区域:第二文字框305左边界位置x3=3、右边界位置x4=7、上边界位置y3=6、下边界位置y4=5;第三文字303的候选文字内容:“WS电视剧”,文字区域:第三文字框306左边界位置x5=4、右边界位置x6=6、上边界位置y5=3、下边界位置y6=2。
示例性的,视频帧图像对应有在视频中的显示时刻。截取视频帧图像时,会将视频帧图像按照时间顺序进行存储,并存储有该视频帧图像在视频中对应的显示时刻,例如,截取视频中第1s的视频帧得到第1s的视频帧图像,将该视频帧图像与第1s对应地进行存储。
因此,从每个视频帧图像中识别出的候选文字内容也可以对应该视频帧图像在视频中的显示时刻。对于一个候选文字内容,可以顺序地在后续视频帧图像中寻找是否存在与该候选文字内容相同且文字区域相同的候选文字内容,若存在,则确定这些候选文字内容为同一个文字内容,根据该候选文字内容第一次出现时的视频帧图像对应的显示时刻和最后一次出现时的视频帧图像对应的显示时刻即可得到该文字内容的显示时长。示例性的,这种寻找是连续性的,当在下一帧视频帧图像中未寻找到该候选文字内容,则停止寻找。即,将时间连续、文字区域相同、候选文字内容相同的多个候选文字内容合并为一个文字内容。
例如,如表一所示,经过OCR模型的文字识别后,从1s至7s共7个视频帧图像中识别得到了7个候选文字内容。其中,第一个“你好”从第1s至第4s都出现在(1,1),(2,2)文字区域,则确定这四个候选文字内容“你好”为同一文字内容,根据其出现的第一个时刻1s和最后一个时刻4s可以求出该文字内容的显示时长为3s;同理可以得到第二个“你好”的显示时长为1s,对于只有一帧视频帧图像上显示的候选文字内容,直接将其作为文字内容,其显示时长可以设置为视频帧图像截取的时间间隔,例如:1s,因此,合并候选文字内容后可以得到如表二所示的文字内容。
表一
候选文字内容 文字区域 时刻
你好 (1,1),(2,2) 1s
你好 (1,1),(2,2) 2s
你好 (1,1),(2,2) 3s
你好 (1,1),(2,2) 4s
hi (1,1),(2,2) 5s
你好 (1,1),(2,2) 6s
你好 (1,1),(2,2) 7s
表二
文字内容 文字区域 显示时长
你好 (1,1),(2,2) 3s
hi (1,1),(2,2) 1s
你好 (1,1),(2,2) 1s
示例性的,文本列表包括至少一个文字内容的至少一条文本数据,一个文字内容对应一个文字区域对应一个显示时长。
示例性的,文本列表中的显示时长还需要包括显示的起始时刻和终止时刻,即,将起始时刻和终止时刻作为显示时长进行存储,显示时长可以根据起始时刻和终止时刻计算得到。例如,如计算机设备在得到视频后,将视频生成一个视频链接,然后识别视频中的文字得到如表三所示的文本列表。其中,文字区域是以矩形的左边线x1、右边线x2、上边线y1、下边线y2来描述的,显示时长是以起始时刻“startTime”和终止时刻“endTime”来描述的。
表三
Figure PCTCN2021122697-appb-000001
步骤2021,从m个文字内容对应的m个文字区域中抽出一个文字区域作为第1个文字区域,将第1个文字区域确定为第1个候选字幕区域,将第1个候选字幕区域加入候选字幕区域列表。
步骤2022,循环执行步骤2022至步骤2023,直至m个文字区域的剩余数量为0:从剩 下的m-k+1个文字区域中抽出一个文字区域作为第k个文字区域。
步骤2023,判断第k个文字区域与候选字幕区域的位置偏差是否大于偏差阈值,若大于(或等于)则进行步骤2025,若小于(或等于)则进行步骤2024。
步骤2024,响应于第k个文字区域与候选字幕区域列表中的第w个候选字幕区域的第一位置偏差小于偏差阈值,将第k个文字区域归为第w个候选字幕区域。
示例性的,在将第k个文字区域归为第w个候选字幕区域之后,计算第k个文字区域的第一高度,第一高度为第k个文字区域的上边线与下边线之差;
计算第w个候选字幕区域的第二高度,第二高度为第w个候选字幕区域的上边线与下边线之差;响应于第一高度大于第二高度,将第k个文字区域确定为第w个候选字幕区域;其中,k为小于等于m的正整数,w为小于等于n的正整数,n、m为正整数。
步骤2025,响应于第k个文字区域与候选字幕区域列表中的全部候选字幕区域的第二位置偏差都大于偏差阈值,将第k个文字区域确定为第y个候选字幕区域,将第y个候选字幕区域加入候选字幕区域列表。
其中,第一位置偏差包括两个上边线之差和两个下边线之差,第二位置偏差包括两个上边线之差或两个下边线之差,y为小于或等于n的正整数,k为小于等于m的正整数,w为小于等于n的正整数,m、n为正整数。
示例性的,步骤2021至步骤2025是对文字区域进行归整得到候选字幕区域的方法步骤,以文本列表中包括m个文本数据,文字区域是以矩形的上边线和下边线位置进行描述的为例。
示例性的,可以根据文本列表中文本数据的排列顺序(可以是任意排序方式)从第一个文字区域依次开始读取,将第一个文字区域直接作为候选字幕区域放入候选字幕区域列表中,然后从第二个文字区域开始先与候选字幕区域列表中现有的候选字幕区域作比较,是否能与现有的候选字幕区域相匹配(两个区域上边线之差要小于偏差阈值并且下边线的偏差也要小于偏差阈值),若存在相匹配的候选字幕区域,则将该文字区域归属到这个候选字幕区域中;若不存在相匹配的候选字幕区域,则将该文字区域作为新的候选字幕区域存入候选字幕区域列表中;如此遍历文本列表中的每一个文字区域,得到存放在候选字幕区域列表中的候选字幕区域。
示例性的,一个候选字幕区域可能包含多个文字区域,但候选字幕区域的区域位置(包括上边线和下边线)只有一个,候选字幕区域的区域位置是归属该候选字幕区域的文字区域中高度最高的那个文字区域(上边线和下边线)。
因此,在将一个文字区域归属到一个候选字幕区域中后,需要判断新加入的文字区域的高度是否大于候选字幕区域目前的区域位置的高度,若新加入的文字区域的高度更大,则将新加入的文字区域更新为候选字幕区域的区域位置。若新加入的文字区域的高度差小于候选字幕区域目前的区域位置,则保持候选字幕区域目前的区域位置不变。
示例性的,在另一种可选的实现方式中,首先计算一下每个文字区域的高度差,然后将文字区域按照高度差从小到大排序得到文字区域顺序列表,根据文字区域顺序列表的顺序来从第一个文字区域开始读取和确定候选字幕区域。这种方式可以解决候确定的选字幕区域不准确的问题。例如,如图8所示,以第一文字区域701、第二文字区域702、第三文字区域703为例,其中,第一文字区域701小于第三文字区域703小于第二文字区域702,并且第一文字区域701与第二文字区域702的位置偏差大于偏差阈值,第二文字区域702与第三文字区域703的位置偏差小于偏差阈值,第一文字区域701与第三文字区域703的位置偏差小于偏差阈值,若按照第一文字区域701、第二文字区域702、第三文字区域703的顺序对文字区域进行抽取,则在抽取到第二文字区域702时,由于第二文字区域702与第一文字区域701的位置偏差大于偏差阈值,则会将第二文字区域702作为新的候选字幕区域,会导致候选字幕区域的识别结果不准确;但若按照高度差对文字区域进行排序后,则会在抽取第一文字区域701之后先抽取第三文字区域703,第三文字区域703与第一文字区域701的位置偏差小 于偏差阈值,且第三文字区域703的高度差大于第一文字区域701,则该候选字幕区域的区域位置会被更新为第三位子区域703,然后再抽取第二文字区域702时,由于第二文字区域702与第三文字区域703的位置偏差小于偏差阈值,第二文字区域702也会被归到该候选字幕区域中,并将第二文字区域702更新为该候选字幕区域的区域位置。
示例性的,由于惯有的阅读顺序,字幕大部分都是横向字幕,步骤2021至步骤2025就是以横向的字幕为例,将上边线与下边线作为文字区域;同理,若要识别纵向的字幕,则将上述的上边线与下边线变更为左边线与右边线,即,文字区域为左边线与右边线。
步骤2031,计算n个候选字幕区域中每个候选字幕区域的重复率,重复率用于描述候选字幕区域中出现的文字内容的重复概率。
示例性的,重复率为累计时长与视频的视频总时长之比,累计时长为相同的文字内容的显示时长之和。
示例性的,给出一种计算重复率的方法:获取对应第j个候选字幕区域的第j组文字内容,第j组文字内容包括至少一个对应第j个候选字幕区域的文字内容,j为小于等于n的正整数,n为正整数;将第j组文字内容中相同的文字内容归为一个文字内容集合,共得到x个文字内容集合;计算每个文字内容集合中文字内容的显示时长之和得到累计时长,共得到x个累计时长,x为正整数;计算最大累计时长与视频的视频总时长之比得到重复率,最大累计时长为至少一个累计时长中的最大值;重复上述四个步骤计算得到每个候选字幕区域的重复率。
即,将获取属于该候选字幕区域的全部文本数据,然后将其中文字内容相同的文本数据进行合并:文字内容保留一个,显示时长进行累加得到累计时长,这里不需要用到文字位置所以可以去掉;合并后的文本数据没有重复的文字内容,取合并后的文本数据中最大的累计时长与视频的视频总时长相除即可得到重复率。
重复率是在候选字幕区域上显示出同一种文字内容的显示累计时长占视频总时长的比例,若在一个位置上总是显示相同的文字内容,则该位置很有可能是干扰文字(视频标题、水印等)。
步骤2032,将文字内容的重复率低于重复率阈值的候选字幕区域确定为初筛字幕区域。
示例性的,重复率阈值可以任意设置。示例性的,重复率阈值可以取10%。
示例性的,重复率高于重复率阈值的候选字幕区域可能为水印所在的文字区域、视频标题所在的文字区域或其他视频中文字固定不变(变换很少)的文字内容所在的字幕区域。
步骤2033,计算初筛字幕区域的显示总时长。
示例性的,给出一种计算显示总时长的方法:计算对应初筛字幕区域的文字内容的显示时长之和,得到初筛字幕区域的显示总时长。
示例性的,在对候选字幕区域进行初筛得到初筛字幕区域后,计算每个初筛字幕区域的显示总时长,显示总时长即为在该初筛字幕区域上显示文字内容的总时长,由于在视频中,某些位置可能会短暂显示文字,例如,电视剧开头会在画面中间位置显示当前是第几集,或,在视频中可能会短暂拍摄到一些带有文字的画面,这些文字所在的区域都不是字幕区域,字幕区域上会长期显示有文字内容,因此,将初筛字幕区域中显示总时长最长的初筛字幕区域作为字幕区域。
例如,在第一初筛字幕区域,第一文字内容显示了1s、第二文字内容显示了2s、第三文字内容显示了6s,则第一初筛字幕区域的显示总时长为1+2+6=9s。
步骤2034,将初筛字幕区域中,显示总时长最长的初筛字幕区域确定为字幕区域。
示例性的,当然还可以采用一些其他字幕区域筛选策略来筛选字幕区域。
例如,在根据文字区域确定候选字幕区域时,可以将文字区域的上边线或下边线的倾斜角度大于角度阈值的文字区域直接去除不作为候选字幕区域,由于字幕通常为规整方向的(横向或纵向),则可以将不规整方向的文本数据直接去除。
再如,由于字幕通常为白色或黑色字体,则在识别得到文本列表后,可以将显示为其他 颜色的文字内容对应的文本数据从文本列表中删除,用删除后的文本列表采用本申请提供的方法来识别字幕区域。
示例性的,在得到视频的字幕区域后,计算机设备可以根据属于字幕区域中的文字内容识别视频的字幕。
例如,将字幕区域对应的文本数据中的文字内容进行修整,将其作为视频的字幕。
示例性的,在得到字幕后,还可以更改字幕的颜色。由于在得到文本列表时OCR模型可以识别出文本内容在图像帧中所在的像素点,则在根据字幕区域得到字幕后,可以更改字幕所在像素点的颜色,实现字幕自动化识别以及对字幕的快捷编辑。在字幕与视频本身的颜色相近,导致字幕不清楚的情况下,可以采用本实施例提供的方法,快捷修改字幕颜色,使字幕与视频整体颜色相区分,提高字幕清晰度。
例如,计算机设备接收颜色编辑指令,颜色编辑指令用于指示目标颜色;将属于字幕区域中的文字内容修改为目标颜色,生成目标视频,目标视频中的字幕显示为目标颜色。
计算机设备将属于字幕区域中的文本内容在视频的图像帧中所对应的像素点修改为目标颜色。
该方法在对视频中的文字内容进行识别后,从文字内容中识别出属于字幕的这部分文字内容,单独编辑处理字幕,实现对字幕的快捷编辑处理,并且不影响视频中的其他文字内容。
综上所述,本实施例提供的方法,通过先获取视频的视频帧图像,然后对视频帧图像采用OCR模型进行文字识别,对文字识别得到的候选文字内容进行去重后得到包含文字内容的文本列表,从而提取到视频中的文本数据,便于根据文本数据来判别字幕区域。
本实施例提供的方法,首先根据文字区域来规整得到候选字幕区域,将经过文字识别得到的多个文字区域进行规则,得到字幕区域的几个大概区域,便于之后根据字幕区域识别策略进行字幕区域的识别。
本实施例提供的方法,通过计算每个候选字幕区域上显示的文字内容的重复率,来判别该候选字幕区域是否是用来显示水印、视频标题等显示时间长且显示内容单一的区域,并将这些候选字幕区域去除,得到初筛字幕区域。
本实施例提供的方法,通过计算每个初筛字幕区域的显示总时长,来从初筛字幕区域中去除只短时间显示文字内容的区域,由于字幕区域通常长时间显示文字内容,则根据这一特征可以将初筛字幕区域中显示总时长最长的初筛字幕区域确定为字幕区域。
示例性的,给出一种结合语音识别结果确定字幕区域的示例性实施例。
图9示出了本申请一个示例性实施例提供的字幕区域识别方法的流程图。该方法可以由计算机设备来执行,例如,如图1所示的终端或服务器来执行。该方法包括以下步骤。
步骤101,识别视频得到n个候选字幕区域,候选字幕区域为视频中的文字内容所显示的区域,n为正整数。
步骤801,对视频进行语音识别得到语音识别结果。
示例性的,对视频中的音频进行语音识别得到语音识别结果,语音识别结果包括识别出的至少一个文字内容。
步骤802,将n个候选字幕区域中,文字内容与语音识别结果的相似度高于阈值的候选字幕区域,确定为参照字幕区域。
示例性的,将语音识别结果与每个候选字幕区域对应的文字内容进行对比,计算相似度。例如,相似度等于:相同文字内容的数量,与,候选字幕区域对应的文字内容的总数,之比。相同文字内容是候选字幕区域对应的文字内容中与语音识别结果中的文字内容相同的文字内容。
步骤1021,根据字幕区域筛选策略和参照字幕区域从n个候选字幕区域中筛选得到字幕区域。
根据字幕区域筛选策略对n个候选字幕区域进行排序,得到排序结果;提高参照字幕区 域的排序权重,基于n个候选字幕区域的排序权重修正排序结果;基于修正后的排序结果从n个候选字幕区域中筛选得到字幕区域。
例如,按照图6所示的示例性实施例,根据字幕区域筛选策略按照显示总时长由高到低进行排序,得到排序结果。然后每个候选字幕区域的默认排序权重为1,将参照字幕区域的排序权重设置为2,对显示总时长进行加权,得到加权后的显示总时长,按照加权后的显示总时长进行排序得到修正后的排序结果。将修正后的排序结果中显示总时长最长的候选字幕区域确定为字幕区域。
综上所述,本实施例提供的方法,通过结合语音识别结果进行字幕区域识别。由于字幕通常是对视频中人物言语内容的标注,则字幕区域所显示的文字内容通常贴合语音识别结果,基于语音识别结果确定字幕区域,可以提高对字幕区域的识别准确率。
示例性的,给出一种采用本申请提供的方法获取语音转文字模型的训练样本的示例性实施例。
图10示出了本申请一个示例性实施例提供的字幕区域识别方法的流程图。该方法可以由计算机设备来执行,例如,如图1所示的终端或服务器来执行。该方法包括以下步骤。
步骤601,计算机设备进行数据获取。
示例性的,首先获取视频应用程序中热门用户帐号的视频,热门用户帐号是粉丝量较多或视频点击量较多或排行榜上前几位的用户账号。示例性的,获取这些热门帐号下的全部视频作为待识别字幕区域的视频。
步骤602,计算机设备进行字幕提取服务。
示例性的,采用本申请提供的字幕区域识别方法,来识别视频中的字幕区域。例如,如图11所示,首先对UGC(User Generated Content,用户生成内容)进行视频OCR截帧处理802(截取视频帧图像,对视频帧图像进行文字识别得到识别结果,对识别结果进行候选文字内容去重得到文本列表)得到文字内容、文字内容的显示时长803以及文字内容的文字区域804,然后对文字区域804进行归整得到多个候选字幕区域,计算每个候选字幕区域的重复率,进行重复文字判断805选出重复率低于重复率阈值的初筛字幕区域,然后计算初筛字幕区域的显示总时长,进行持续时间判断806:选出显示总时长(持续时间)最长的初筛字幕区域作为字幕区域807。
步骤603,计算机设备对字幕区域中的文字内容进行后处理。
例如,后处理包括短句合并、特殊符号剥离、文字密度剥离、文字字数剥离、重复识别合并、单个字母和数字剔除中的至少一种。示例性的,短句合并用于将文字内容中的超短句(例如:啊、好的)进行合并。特殊符号剥离用于剔除文字内容用的非文字数据(例如:表情)。文字密度剥离用于从文字内容中剔除超长语句。文字字数剥离用于根据剥离字数对文字内容进行剥离,例如,每隔2-14个文字进行剥离。重复识别合并用于合并重复文字内容的数据。单个字母和数字剔除用于从文字内容中剔除其他非目标语言(例如,汉语)的单个字母或者数字。
步骤604,计算机设备验证交付质量。
示例性的,计算机设备使用人工对视频字幕的标注结果来对自动识别得到的字幕进行验证。示例性的,对得到的字幕识别结果进行抽样检测,随机抽取识别结果构建测试集,进行置信度验证,若置信度在95±3%的区间内,则确定识别结果准确,将识别结果进行数据交付605。将识别结果中的文字内容与视频中对应时间段的音频作为语音转文字模型的训练样本。示例性的,置信度等于:字幕识别结果中正确识别的字数与字幕识别结果总字数之比。
综上所述,本实施例提供的方法,通过使用本申请提供的字幕区域识别方法,来进行字幕的识别,可以准确识别到视频中的字幕内容,然后根据识别到的字幕内容与视频中对应时段的音频,就可以得到语音转文字模型的训练样本,根据字幕内容与音频训练语音转文字模型,可以节省样本获取过程中的人力资源,提高样本获取效率。
以下为本申请的装置实施例,对于装置实施例中未详细描述的细节,可以结合参考上述方法实施例中相应的记载,本文不再赘述。
图12示出了本申请的一个示例性实施例提供的字幕识别装置的结构示意图。该装置可以通过软件、硬件或者两者的结合实现成为计算机设备的全部或一部分,该装置包括如下装置。
识别模块901,用于识别视频得到n个候选字幕区域,候选字幕区域为所述视频中的文字内容所显示的区域,n为正整数;
筛选模块903,用于根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,所述字幕区域筛选策略用于将文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为所述字幕区域。
在一个可选的实施例中,所述装置还包括:
计算模块904,用于计算所述n个候选字幕区域中每个候选字幕区域的重复率,所述重复率用于描述所述候选字幕区域中出现的文字内容的重复概率;
所述筛选模块903,还用于将所述文字内容的所述重复率低于所述重复率阈值的所述候选字幕区域确定为初筛字幕区域;
所述计算模块904,还用于计算所述初筛字幕区域的所述显示总时长;
所述筛选模块903,还用于将所述初筛字幕区域中,所述显示总时长最长的所述初筛字幕区域确定为所述字幕区域。
在一个可选的实施例中,所述计算模块904,还用于获取对应第j个候选字幕区域的第j组文字内容,所述第j组文字内容包括至少一个对应所述第j个候选字幕区域的文字内容,j为小于等于n的正整数,n为正整数;
所述计算模块904,还用于将所述第j组文字内容中相同的文字内容归为一个文字内容集合,共得到x个文字内容集合;
所述计算模块904,还用于计算每个所述文字内容集合中所述文字内容的显示时长之和得到累计时长,共得到x个所述累计时长,x为正整数;
所述计算模块904,还用于计算最大累计时长与所述视频的所述视频总时长之比得到所述重复率,所述最大累计时长为所述至少一个累计时长中的最大值;
所述计算模块904,还用于重复上述四个步骤计算得到每个所述候选字幕区域的所述重复率
在一个可选的实施例中,所述计算模块904,还用于计算对应所述初筛字幕区域的所述文字内容的所述显示时长之和,得到所述初筛字幕区域的所述显示总时长。
在一个可选的实施例中,装置还包括:
识别模块901,用于识别所述视频中的所述文字内容、所述文字内容所在的文字区域;
候选模块902,用于根据所述文字区域的位置关系,将位置偏差小于偏差阈值的所述文字区域聚类至同一个候选字幕区域,共得到所述n个候选字幕区域。
在一个可选的实施例中,所述文本列表包括m个文本数据,所述文字区域包括矩形的上边线和下边线,m为正整数;
所述候选模块902,还用于从所述m个文字内容对应的m个文字区域中抽出一个文字区域作为第1个文字区域,将所述第1个文字区域确定为第1个候选字幕区域,将所述第1个候选字幕区域加入候选字幕区域列表;
所述候选模块902,还用于循环执行以下步骤,直至所述m个文字区域的剩余数量为0:从剩下的m-k+1个文字区域中抽出一个文字区域作为第k个文字区域,响应于所述第k个文字区域与所述候选字幕区域列表中的第w个候选字幕区域的第一位置偏差小于所述偏差阈值,将所述第k个文字区域归为所述第w个候选字幕区域;
响应于所述第k个文字区域与所述候选字幕区域列表中的全部候选字幕区域的第二位置偏差都大于所述偏差阈值,将所述第k个文字区域确定为第y个候选字幕区域,将所述第y 个候选字幕区域加入所述候选字幕区域列表;
其中,所述第一位置偏差包括两个所述上边线之差和两个所述下边线之差,所述第二位置偏差包括两个所述上边线之差或两个所述下边线之差,y为小于或等于n的正整数,k为小于等于m的正整数,w为小于等于n的正整数,n为正整数。
在一个可选的实施例中,所述候选模块902,还用于计算所述第k个文字区域的第一高度,所述第一高度为所述第k个文字区域的所述上边线与所述下边线之差;计算所述第w个候选字幕区域的第二高度,所述第二高度为所述第w个候选字幕区域的所述上边线与所述下边线之差;响应于所述第一高度大于所述第二高度,将所述第k个文字区域确定为所述第w个候选字幕区域;
其中,k为小于等于m的正整数,w为小于等于n的正整数,n、m为正整数。
在一个可选的实施例中,所述识别模块901,还用于识别所述视频中的所述文字内容、所述文字内容所在的文字区域、所述文字内容的显示时长。
在一个可选的实施例中,所述装置还包括:
获取模块905,用于周期性截取所述视频的视频帧图像;
所述识别模块901,还用于识别所述视频帧图像中的所述文字内容、所述文字内容所在的文字区域、所述文字内容的显示时长。
在一个可选的实施例中,所述识别模块901,还用于调用光学字符识别OCR模型识别所述视频帧图像,得到所述视频帧图像中的候选文字内容和所述候选文字内容的所述文字区域,根据所述视频帧图像的显示时刻得到所述候选文字内容的显示时刻;
所述识别模块901,还用于对所述候选文字内容进行去重得到所述文字内容;所述去重包括将所述显示时刻连续、所述文字区域相同、所述候选文字内容相同的多个候选文字内容中所述显示时刻最早的所述候选文字内容确定为所述文字内容,根据所述多个候选文字内容的所述显示时刻计算所述文字内容的所述显示时长。
在一个可选的实施例中,所述装置还包括:
字幕模块906,用于根据属于所述字幕区域中的所述文字内容识别所述视频的字幕。
在一个可选的实施例中,所述装置还包括:字幕模块906,用于接收颜色编辑指令,所述颜色编辑指令用于指示目标颜色;
字幕模块906,用于将属于所述字幕区域中的所述文字内容修改为所述目标颜色,生成目标视频,所述目标视频中的字幕显示为所述目标颜色。
在一个可选的实施例中,所述装置还包括:
接收模块,用于接收颜色编辑指令,所述颜色编辑指令用于指示目标颜色;
编辑模块,用于将属于所述字幕区域中的所述文字内容修改为所述目标颜色,生成目标视频,所述目标视频中的字幕显示为所述目标颜色。
在一个可选的实施例中,所述装置还包括:
语音识别模块,用于对所述视频进行语音识别得到语音识别结果;
参照模块,用于将所述n个候选字幕区域中,所述文字内容与所述语音识别结果的相似度高于阈值的候选字幕区域,确定为参照字幕区域;
所述筛选模块903,还用于根据字幕区域筛选策略和所述参照字幕区域从所述n个候选字幕区域中筛选得到所述字幕区域。
在一个可选的实施例中,所述筛选模块903,还用于根据字幕区域筛选策略对所述n个候选字幕区域进行排序,得到排序结果;
所述筛选模块903,还用于提高所述参照字幕区域的排序权重,基于所述n个候选字幕区域的排序权重修正所述排序结果;
所述筛选模块903,还用于基于修正后的排序结果从所述n个候选字幕区域中筛选得到所述字幕区域。
图13是本申请一个实施例提供的服务器的结构示意图。具体来讲:服务器1000包括中央处理单元(英文:Central Processing Unit,简称:CPU)1001、包括随机存取存储器(英文:Random Access Memory,简称:RAM)1002和只读存储器(英文:Read-Only Memory,简称:ROM)1003的系统存储器1004,以及连接系统存储器1004和中央处理单元1001的系统总线1005。服务器1000还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)1006,和用于存储操作系统1013、应用程序1014和其他程序模块1015的大容量存储设备1007。
基本输入/输出系统1006包括有用于显示信息的显示器1008和用于用户输入信息的诸如鼠标、键盘之类的输入设备1009。其中显示器1008和输入设备1009都通过连接到系统总线1005的输入/输出控制器1010连接到中央处理单元1001。基本输入/输出系统1006还可以包括输入/输出控制器1010以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入/输出控制器1010还提供输出到显示屏、打印机或其他类型的输出设备。
大容量存储设备1007通过连接到系统总线1005的大容量存储控制器(未示出)连接到中央处理单元1001。大容量存储设备1007及其相关联的计算机可读介质为服务器1000提供非易失性存储。也就是说,大容量存储设备1007可以包括诸如硬盘或者只读光盘(英文:Compact Disc Read-Only Memory,简称:CD-ROM)驱动器之类的计算机可读介质(未示出)。
不失一般性,计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、可擦除可编程只读存储器(英文:Erasable Programmable Read-Only Memory,简称:EPROM)、电可擦除可编程只读存储器(英文:Electrically Erasable Programmable Read-Only Memory,简称:EEPROM)、闪存或其他固态存储其技术,CD-ROM、数字通用光盘(英文:Digital Versatile Disc,简称:DVD)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知计算机存储介质不局限于上述几种。上述的系统存储器1004和大容量存储设备1007可以统称为存储器。
根据本申请的各种实施例,服务器1000还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即服务器1000可以通过连接在系统总线1005上的网络接口单元1011连接到网络1012,或者说,也可以使用网络接口单元1011来连接到其他类型的网络或远程计算机系统(未示出)。
本申请还提供了一种终端,该终端包括处理器和存储器,存储器中存储有至少一条指令,至少一条指令由处理器加载并执行以实现上述各个方法实施例提供的字幕区域识别方法。需要说明的是,该终端可以是如下图14所提供的终端。
图14示出了本申请一个示例性实施例提供的终端1100的结构框图。该终端1100可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端1100还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1100包括有:处理器1101和存储器1102。
处理器1101可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1101可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1101也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器 1101可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1101还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1102可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1102还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1102中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1101所执行以实现本申请中方法实施例提供的字幕区域识别方法。
显示屏1105用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1105是触摸显示屏时,显示屏1105还具有采集在显示屏1105的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1101进行处理。此时,显示屏1105还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏1105可以为一个,设置终端1100的前面板;在另一些实施例中,显示屏1105可以为至少两个,分别设置在终端1100的不同表面或呈折叠设计;在再一些实施例中,显示屏1105可以是柔性显示屏,设置在终端1100的弯曲表面上或折叠面上。甚至,显示屏1105还可以设置成非矩形的不规则图形,也即异形屏。显示屏1105可以采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
本领域技术人员可以理解,图14中示出的结构并不构成对终端1100的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
所述存储器还包括一个或者一个以上的程序,所述一个或者一个以上程序存储于存储器中,所述一个或者一个以上程序包含用于进行本申请实施例提供的字幕区域识别方法。
本申请还提供一种计算机设备,该计算机设备包括:处理器和存储器,该存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,该至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现上述各方法实施例提供的字幕区域识别方法。
本申请还提供一种计算机可读存储介质,该存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,该至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现上述各方法实施例提供的字幕区域识别方法。
本申请还提供一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述可选实现方式中提供的字幕区域识别方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种字幕区域识别方法,其中,所述方法由计算机设备执行,所述方法包括:
    识别视频得到n个候选字幕区域,候选字幕区域为所述视频中的文字内容所显示的区域,n为正整数;
    根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,所述字幕区域筛选策略用于将文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为所述字幕区域。
  2. 根据权利要求1所述的方法,其中,所述根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,包括:
    计算所述n个候选字幕区域中每个候选字幕区域的重复率,所述重复率用于描述所述候选字幕区域中出现的文字内容的重复概率;
    将所述文字内容的所述重复率低于所述重复率阈值的所述候选字幕区域确定为初筛字幕区域;
    计算所述初筛字幕区域的所述显示总时长;
    将所述初筛字幕区域中,所述显示总时长最长的所述初筛字幕区域确定为所述字幕区域。
  3. 根据权利要求2所述的方法,其中,所述计算所述n个候选字幕区域中每个候选字幕区域的重复率,包括:
    获取对应第j个候选字幕区域的第j组文字内容,所述第j组文字内容包括至少一个对应所述第j个候选字幕区域的文字内容,j为小于等于n的正整数,n为正整数;
    将所述第j组文字内容中相同的文字内容归为一个文字内容集合,共得到x个文字内容集合;计算每个所述文字内容集合中所述文字内容的显示时长之和得到累计时长,共得到x个所述累计时长,x为正整数;
    计算最大累计时长与所述视频的所述视频总时长之比得到所述重复率,所述最大累计时长为所述至少一个累计时长中的最大值;
    重复上述四个步骤计算得到每个所述候选字幕区域的所述重复率。
  4. 根据权利要求2所述的方法,其中,所述计算所述初筛字幕区域的所述显示总时长,包括:
    计算对应所述初筛字幕区域的所述文字内容的所述显示时长之和,得到所述初筛字幕区域的所述显示总时长。
  5. 根据权利要求1至4任一所述的方法,其中,所述识别视频得到n个候选字幕区域,包括:
    识别所述视频中的所述文字内容、所述文字内容所在的文字区域;
    根据所述文字区域的位置关系,将位置偏差小于偏差阈值的所述文字区域聚类至同一个候选字幕区域,共得到所述n个候选字幕区域。
  6. 根据权利要求5所述的方法,其中,所述文字内容的数量为m个,所述文字区域包括矩形的上边线和下边线,m为大于n的整数;
    所述根据所述文字区域的位置关系,将位置偏差小于偏差阈值的所述文字区域聚类至同一个候选字幕区域,共得到所述n个候选字幕区域,包括:
    从所述m个文字内容对应的m个文字区域中抽出一个文字区域作为第1个文字区域,将所述第1个文字区域确定为第1个候选字幕区域,将所述第1个候选字幕区域加入候选字幕区域列表;
    循环执行以下步骤,直至所述m个文字区域的剩余数量为0:从剩下的m-k+1个文字区域中抽出一个文字区域作为第k个文字区域,响应于所述第k个文字区域与所述候选字幕区 域列表中的第w个候选字幕区域的第一位置偏差小于所述偏差阈值,将所述第k个文字区域归为所述第w个候选字幕区域;
    响应于所述第k个文字区域与所述候选字幕区域列表中的全部候选字幕区域的第二位置偏差都大于所述偏差阈值,将所述第k个文字区域确定为第y个候选字幕区域,将所述第y个候选字幕区域加入所述候选字幕区域列表;
    其中,所述第一位置偏差包括两个所述上边线之差和两个所述下边线之差,所述第二位置偏差包括两个所述上边线之差或两个所述下边线之差,y为小于或等于n的正整数,k为小于等于m的正整数,w为小于等于n的正整数,n为正整数。
  7. 根据权利要求6所述的方法,其中,所述响应于所述第k个文字区域与所述候选字幕区域列表中的第w个候选字幕区域的第一位置偏差小于偏差阈值,将所述第k个文字区域归为所述第w个候选字幕区域之后,还包括:
    计算所述第k个文字区域的第一高度,所述第一高度为所述第k个文字区域的所述上边线与所述下边线之差;
    计算所述第w个候选字幕区域的第二高度,所述第二高度为所述第w个候选字幕区域的所述上边线与所述下边线之差;
    响应于所述第一高度大于所述第二高度,将所述第k个文字区域确定为所述第w个候选字幕区域;
    其中,k为小于等于m的正整数,w为小于等于n的正整数,n、m为正整数。
  8. 根据权利要求5所述的方法,其中,所述识别所述视频中的所述文字内容、所述文字内容所在的文字区域,包括:
    识别所述视频中的所述文字内容、所述文字内容所在的文字区域、所述文字内容的显示时长。
  9. 根据权利要求8所述的方法,其中,所述识别所述视频中的所述文字内容、所述文字内容所在的文字区域、所述文字内容的显示时长,包括:
    周期性截取所述视频的视频帧图像;
    识别所述视频帧图像中的所述文字内容、所述文字内容所在的文字区域、所述文字内容的显示时长。
  10. 根据权利要求9所述的方法,其中,所述识别所述视频帧图像中的所述文字内容、所述文字内容所在的文字区域、所述文字内容的显示时长,包括:
    调用光学字符识别OCR模型识别所述视频帧图像,得到所述视频帧图像中的候选文字内容和所述候选文字内容的所述文字区域,根据所述视频帧图像的显示时刻得到所述候选文字内容的显示时刻;
    对所述候选文字内容进行去重得到所述文字内容;所述去重包括将所述显示时刻连续、所述文字区域相同、所述候选文字内容相同的多个候选文字内容中所述显示时刻最早的所述候选文字内容确定为所述文字内容,根据所述多个候选文字内容的所述显示时刻计算所述文字内容的所述显示时长。
  11. 根据权利要求1至4任一所述的方法,其中,所述方法还包括:
    根据属于所述字幕区域中的所述文字内容识别所述视频的字幕。
  12. 根据权利要求11所述的方法,其中,所述方法还包括:
    接收颜色编辑指令,所述颜色编辑指令用于指示目标颜色;
    将属于所述字幕区域中的所述文字内容修改为所述目标颜色,生成目标视频,所述目标视频中的字幕显示为所述目标颜色。
  13. 根据权利要求1至4任一所述的方法,其中,所述方法还包括:
    对所述视频进行语音识别得到语音识别结果;
    将所述n个候选字幕区域中,所述文字内容与所述语音识别结果的相似度高于阈值的候 选字幕区域,确定为参照字幕区域;
    所述根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,包括:
    根据字幕区域筛选策略和所述参照字幕区域从所述n个候选字幕区域中筛选得到所述字幕区域。
  14. 根据权利要求13所述的方法,其中,所述根据字幕区域筛选策略和所述参照字幕区域从所述n个候选字幕区域中筛选得到所述字幕区域,包括:
    根据字幕区域筛选策略对所述n个候选字幕区域进行排序,得到排序结果;
    提高所述参照字幕区域的排序权重,基于所述n个候选字幕区域的排序权重修正所述排序结果;
    基于修正后的排序结果从所述n个候选字幕区域中筛选得到所述字幕区域。
  15. 一种字幕区域识别装置,其中,所述装置包括:
    识别模块,用于识别视频得到n个候选字幕区域,候选字幕区域为所述视频中的文字内容所显示的区域,n为正整数;
    筛选模块,用于根据字幕区域筛选策略从所述n个候选字幕区域中筛选得到所述字幕区域,所述字幕区域筛选策略用于将文字内容的重复率低于重复率阈值且显示总时长最长的候选字幕区域确定为所述字幕区域。
  16. 根据权利要求15所述的装置,其中,所述装置还包括:
    计算模块,用于计算所述n个候选字幕区域中每个候选字幕区域的重复率,所述重复率用于描述所述候选字幕区域中出现的文字内容的重复概率;
    所述筛选模块,还用于将所述文字内容的所述重复率低于所述重复率阈值的所述候选字幕区域确定为初筛字幕区域;
    所述计算模块,还用于计算所述初筛字幕区域的所述显示总时长;
    所述筛选模块,还用于将所述初筛字幕区域中,所述显示总时长最长的所述初筛字幕区域确定为所述字幕区域。
  17. 根据权利要求16所述的装置,其中,所述计算模块,还用于获取对应第j个候选字幕区域的第j组文字内容,所述第j组文字内容包括至少一个对应所述第j个候选字幕区域的文字内容,j为小于等于n的正整数,n为正整数;
    所述计算模块,还用于将所述第j组文字内容中相同的文字内容归为一个文字内容集合,共得到x个文字内容集合;
    所述计算模块,还用于计算每个所述文字内容集合中所述文字内容的显示时长之和得到累计时长,共得到x个所述累计时长,x为正整数;
    所述计算模块,还用于计算最大累计时长与所述视频的所述视频总时长之比得到所述重复率,所述最大累计时长为所述至少一个累计时长中的最大值;
    所述计算模块,还用于重复上述四个步骤计算得到每个所述候选字幕区域的所述重复率。
  18. 一种计算机设备,所述计算机设备包括:处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行,以实现如权利要求1至14任一项所述的字幕区域识别方法。
  19. 一种计算机可读存储介质,其中,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行,以实现如权利要求1至14任一项所述的字幕区域识别方法。
  20. 一种计算机程序产品或计算机程序,其中,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在计算机可读存储介质中;计算机设备的处理器从所述计算机可读存储介质读取所述计算机指令,所述处理器执行所述计算机指令,以实现如权利要求1至14任一项所述的字幕区域识别方法。
PCT/CN2021/122697 2020-10-27 2021-10-08 字幕区域识别方法、装置、设备及存储介质 WO2022089170A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/960,004 US20230027412A1 (en) 2020-10-27 2022-10-04 Method and apparatus for recognizing subtitle region, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011165751.0A CN112232260A (zh) 2020-10-27 2020-10-27 字幕区域识别方法、装置、设备及存储介质
CN202011165751.0 2020-10-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/960,004 Continuation US20230027412A1 (en) 2020-10-27 2022-10-04 Method and apparatus for recognizing subtitle region, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022089170A1 true WO2022089170A1 (zh) 2022-05-05

Family

ID=74110646

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122697 WO2022089170A1 (zh) 2020-10-27 2021-10-08 字幕区域识别方法、装置、设备及存储介质

Country Status (3)

Country Link
US (1) US20230027412A1 (zh)
CN (1) CN112232260A (zh)
WO (1) WO2022089170A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232260A (zh) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 字幕区域识别方法、装置、设备及存储介质
CN112925905B (zh) * 2021-01-28 2024-02-27 北京达佳互联信息技术有限公司 提取视频字幕的方法、装置、电子设备和存储介质
CN113138824A (zh) * 2021-04-26 2021-07-20 北京沃东天骏信息技术有限公司 一种弹窗显示方法和装置
CN113920507B (zh) * 2021-12-13 2022-04-12 成都索贝数码科技股份有限公司 一种针对新闻场景的滚动字幕提取方法
CN115396690A (zh) * 2022-08-30 2022-11-25 京东方科技集团股份有限公司 音频与文本组合方法、装置、电子设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080143880A1 (en) * 2006-12-14 2008-06-19 Samsung Electronics Co., Ltd. Method and apparatus for detecting caption of video
CN101887439A (zh) * 2009-05-13 2010-11-17 富士通株式会社 生成视频摘要的方法、装置、包含该装置的图像处理系统
CN103546667A (zh) * 2013-10-24 2014-01-29 中国科学院自动化研究所 一种面向海量广播电视监管的自动新闻拆条方法
CN110197177A (zh) * 2019-04-22 2019-09-03 平安科技(深圳)有限公司 提取视频字幕的方法、装置、计算机设备及存储介质
CN112232260A (zh) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 字幕区域识别方法、装置、设备及存储介质

Also Published As

Publication number Publication date
US20230027412A1 (en) 2023-01-26
CN112232260A (zh) 2021-01-15

Similar Documents

Publication Publication Date Title
WO2022089170A1 (zh) 字幕区域识别方法、装置、设备及存储介质
US9436883B2 (en) Collaborative text detection and recognition
US8750573B2 (en) Hand gesture detection
US8280158B2 (en) Systems and methods for indexing presentation videos
CN110446063B (zh) 视频封面的生成方法、装置及电子设备
US11704357B2 (en) Shape-based graphics search
WO2021213067A1 (zh) 物品显示方法、装置、设备及存储介质
CN111209897B (zh) 视频处理的方法、装置和存储介质
US20150143236A1 (en) Generating photo albums from unsorted collections of images
CN109726712A (zh) 文字识别方法、装置及存储介质、服务器
US11681409B2 (en) Systems and methods for augmented or mixed reality writing
US9542756B2 (en) Note recognition and management using multi-color channel non-marker detection
WO2017197593A1 (en) Apparatus, method and computer program product for recovering editable slide
CN112381104A (zh) 一种图像识别方法、装置、计算机设备及存储介质
CN113205047A (zh) 药名识别方法、装置、计算机设备和存储介质
CN113436222A (zh) 图像处理方法、图像处理装置、电子设备及存储介质
KR20210008075A (ko) 시각 검색 방법, 장치, 컴퓨터 기기 및 저장 매체 (video search method and apparatus, computer device, and storage medium)
CN113591433A (zh) 一种文本排版方法、装置、存储介质及计算机设备
CN111274447A (zh) 基于视频的目标表情生成方法、装置、介质、电子设备
CN113486171B (zh) 一种图像处理方法及装置、电子设备
CN111062377B (zh) 一种题号检测方法、系统、存储介质及电子设备
CN114399645A (zh) 多模态数据扩充方法、系统、介质、计算机设备及终端
CN111160265B (zh) 文件转换方法、装置、存储介质及电子设备
CN111787389A (zh) 转置视频识别方法、装置、设备及存储介质
US20230144394A1 (en) Systems and methods for managing digital notes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21884899

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.09.2023)