CN115690762A - Caption character-by-character display method and device, electronic equipment and readable storage medium - Google Patents

Caption character-by-character display method and device, electronic equipment and readable storage medium

Info

Publication number
CN115690762A
CN115690762A
Authority
CN
China
Prior art keywords
character
character string
word
subtitle
caption
Prior art date
Legal status
Pending
Application number
CN202211268591.1A
Other languages
Chinese (zh)
Inventor
马哲
刘剑
Current Assignee
Beijing Yunshuike Technology Co ltd
Original Assignee
Beijing Yunshuike Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yunshuike Technology Co ltd filed Critical Beijing Yunshuike Technology Co ltd
Priority to CN202211268591.1A
Publication of CN115690762A

Landscapes

  • Character Discrimination (AREA)

Abstract

The application discloses a caption character-by-character display method and device, an electronic device, and a readable storage medium. The caption character-by-character display method comprises the following steps: extracting each image frame from a target video, and detecting the subtitle area in each image frame through a character detection model to obtain the coordinates of each subtitle area, wherein the character detection model is trained on pictures in which the subtitle areas have been annotated; extracting the characters in each image frame according to the subtitle area coordinates to obtain a character string queue formed by timestamped character strings; and comparing each character string in the character string queue with the corresponding previous character string in sequence, determining the caption character-by-character recognition result, and displaying the caption character by character according to the recognition result and the corresponding timestamps. The method and device solve the technical problem of low accuracy in character-by-character caption display text.

Description

Caption character-by-character display method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of display technologies, and in particular, to a method and an apparatus for displaying subtitles word by word, an electronic device, and a readable storage medium.
Background
As people's entertainment options continue to grow, more and more people choose to sing for relaxation during their leisure time, and the lyric display in an MV (Music Video) is an important way for the singer to control pacing and rhythm. At present, word-by-word display text for lyrics in an MV video is mainly obtained by performing OCR (Optical Character Recognition) on the lyric subtitle colors in the target MV video. This color-based recognition method retains only the color features of the original lyric subtitles and simplifies the features of the characters themselves, so fewer character features are extracted from the lyric subtitles in the target MV video and the characters are not matched with their corresponding times, which results in low accuracy of the generated word-by-word subtitle display text.
Disclosure of Invention
The present application mainly aims to provide a subtitle word-by-word display method and apparatus, an electronic device, and a computer-readable storage medium, so as to solve the technical problem of low accuracy of word-by-word subtitle display text.
In order to achieve the above object, the present application provides a subtitle word-by-word display method, including:
extracting each image frame from a target video, and detecting a caption area in each image frame through a character detection model to obtain the coordinate of each caption area, wherein the character detection model is obtained by training according to a picture which is marked with the caption area;
extracting characters in each image frame according to each subtitle region coordinate to obtain a character string queue formed by each character string containing a time stamp;
and comparing each character string in the character string queue with a corresponding previous character string in sequence, determining a caption character-by-character recognition result, and displaying the caption character-by-character according to the caption character-by-character recognition result and a corresponding timestamp.
Optionally, the step of comparing each of the character strings in the character string queue with a corresponding previous character string in sequence, determining a caption character-by-character recognition result, and displaying the caption character-by-character according to the caption character-by-character recognition result and a corresponding timestamp includes:
comparing each character string in the character string queue with the corresponding previous character string in sequence, and if there is a difference, adding the character string and its corresponding timestamp to an identification queue;
comparing each character string in the identification queue with the corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
Optionally, the step of comparing each of the character strings in the character string queue with a corresponding previous character string in sequence, and if there is a difference, adding the character string and a corresponding timestamp into an identification queue includes:
arranging all character strings in the character string queue according to the corresponding time stamp sequence;
comparing each character string with a corresponding previous character string in sequence, and if the character string is different from the previous character string, adding the character string into the identification queue;
and if the character string is the same as the previous character string, discarding the character string.
Optionally, the character strings in the identification queue are arranged according to the corresponding timestamp sequence, and the step of comparing each character string in the identification queue with the corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp includes:
judging whether the previous character string is a subset of the character string;
if so, judging that the character string and the previous character string belong to the same subtitle, and taking the difference value of the character string and the previous character string as a subtitle character-by-character identification result corresponding to the character string;
if not, judging that the character string and the previous character string do not belong to the same subtitle, and taking the character string as a subtitle character-by-character recognition result corresponding to the character string;
and displaying the subtitle character-by-character recognition results according to the time stamps corresponding to the character strings and the subtitle character-by-character recognition results.
Optionally, the step of extracting each image frame from the target video, and detecting a subtitle region in each image frame through a text detection model to obtain coordinates of each subtitle region includes:
extracting a frame of picture from the target video every other preset time period to obtain each image frame;
and performing character detection on the preset area of each image frame through the character detection model to obtain the coordinates of each subtitle area.
The present application also provides a caption character-by-character display apparatus, which is applied to a caption character-by-character display device, the apparatus including:
the region detection module is used for extracting each image frame from a target video, detecting a subtitle region in each image frame through a character detection model and obtaining coordinates of each subtitle region, wherein the character detection model is obtained through training according to a picture with the subtitle region being labeled;
the character extraction module is used for extracting characters in each image frame according to the coordinates of each subtitle area to obtain a character string queue formed by each character string containing a timestamp;
and the identification determining module is used for comparing each character string in the character string queue with the corresponding previous character string in sequence, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
Optionally, the identification determination module is further configured to:
comparing each character string in the character string queue with a corresponding previous character string in sequence;
if the character string is longer than the previous character string, taking the difference value between the character string and the previous character string as a character-by-character recognition result of the caption corresponding to the character string;
if the character string is shorter than the former character string, taking the character string as a character-by-character recognition result of the caption corresponding to the character string;
if the character string is the same as the previous character string, discarding the character string;
and displaying the character-by-character recognition result of each subtitle according to the time stamp corresponding to each character string and the character-by-character recognition result of each subtitle.
Optionally, the identification determination module is further configured to:
comparing each character string in the character string queue with the corresponding previous character string in sequence, and if there is a difference, adding the character string and its corresponding timestamp to an identification queue;
comparing each character string in the identification queue with the corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
Optionally, the identification determination module is further configured to:
arranging all character strings in the character string queue according to the corresponding time stamp sequence;
comparing each character string with a corresponding previous character string in sequence, and if the character string is different from the previous character string, adding the character string into the identification queue;
and if the character string is the same as the previous character string, discarding the character string.
Optionally, the identification determination module is further configured to:
judging whether the previous character string is a subset of the character string;
if so, judging that the character string and the previous character string belong to the same subtitle, and taking the difference value of the character string and the previous character string as a subtitle character-by-character identification result corresponding to the character string;
if not, judging that the character string and the previous character string do not belong to the same subtitle, and taking the character string as a subtitle character-by-character recognition result corresponding to the character string;
and displaying the subtitle character-by-character recognition results according to the time stamps corresponding to the character strings and the subtitle character-by-character recognition results.
Optionally, the area detection module is further configured to:
extracting a frame of picture from the target video every other preset time period to obtain each image frame;
and performing character detection on the preset area of each image frame through the character detection model to obtain the coordinates of each subtitle area.
Optionally, the text extraction module is further configured to:
performing OCR character recognition on each image frame according to the subtitle region coordinates corresponding to each image frame to obtain corresponding character strings;
and arranging the character strings according to the time stamps contained in the image frames corresponding to the character strings to obtain the character string queue.
The present application further provides an electronic device. The electronic device is a physical device and includes: a memory, a processor, and a program of the subtitle word-by-word display method that is stored in the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the subtitle word-by-word display method described above.
The present application also provides a computer-readable storage medium having stored thereon a program for implementing a subtitle word-by-word display method, the program implementing the steps of the subtitle word-by-word display method as described above when executed by a processor.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the subtitle word-by-word display method as described above.
The method first extracts image frames from a target video and detects the subtitle area in each image frame through a character detection model to obtain the coordinates of each subtitle area, the character detection model being trained on pictures with annotated subtitle areas. It then extracts the characters in each image frame according to the subtitle area coordinates to obtain a character string queue formed by timestamped character strings. Finally, it compares each character string in the character string queue with the corresponding previous character string in sequence to determine the caption character-by-character recognition result, and displays the caption character by character according to the recognition result and the corresponding timestamps.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive labor.
Fig. 1 is a schematic flowchart of a first embodiment of a subtitle word-by-word display method of the present application;
FIG. 2 is a schematic diagram of lyrics caption calibration coordinates in a first embodiment of the caption line-by-line display method of the present application;
fig. 3 is a schematic flowchart illustrating steps S31 to S35 in a first embodiment of a subtitle word-by-word display method according to the present application;
FIG. 4 is a schematic diagram of a subtitle display apparatus according to the present application;
fig. 5 is a schematic device structure diagram of a hardware operating environment related to a subtitle word-by-word display method in an embodiment of the present application.
The implementation of the objectives, functional features, and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Example one
At present, word-by-word display text for lyrics in an MV video is mainly obtained by performing OCR recognition on the lyric subtitle colors in the target MV video. This color-based recognition method retains only the color features of the original lyric subtitles and greatly simplifies the features of the characters themselves, so fewer character features are extracted from the lyric subtitles in the target MV video and the characters are not matched with their corresponding times, which results in low accuracy of the generated word-by-word subtitle display text.
In a first embodiment of the subtitle word-by-word display method according to the present application, referring to fig. 1, the subtitle word-by-word display method includes:
step S10, extracting each image frame from a target video, detecting a caption area in each image frame through a character detection model to obtain the coordinate of each caption area, wherein the character detection model is obtained according to the training of the picture which is marked with the caption area;
step S20, extracting characters in each image frame according to each subtitle region coordinate to obtain a character string queue formed by each character string containing a time stamp;
and S30, comparing each character string in the character string queue with the corresponding previous character string in sequence, determining a caption character-by-character recognition result, and displaying the caption character-by-character according to the caption character-by-character recognition result and the corresponding timestamp.
In the embodiment of the present application, it should be noted that the character detection model may be a YOLO model (You Only Look Once, a target detection model), which achieves end-to-end target detection through a series of convolution operations. The YOLO model divides the picture into an S×S grid; each grid cell is responsible for detecting the objects falling within it, and the model finally outputs the bounding box of each contained object, its position information, and the confidence for all categories. Referring to fig. 2, the bounding-box coordinate information of the target caption (the frame in fig. 2) is detected from the MV caption image frame. The character string is the string formed by the lyric characters, and the method used to extract the characters in each image frame may be OCR recognition.
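Once the detector has produced a caption bounding box, cropping that region out of the frame before OCR is a simple array slice. A minimal sketch (NumPy only; the `(x1, y1, x2, y2)` box format and the function name are illustrative assumptions, not details fixed by the application):

```python
import numpy as np

def crop_caption_region(frame: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Crop the detected caption region from an H x W x C image frame.

    box is assumed to be (x1, y1, x2, y2) in pixel coordinates, as a
    detector such as YOLO would report after scaling to the image size.
    """
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]
```

The cropped array can then be handed directly to an OCR engine, which avoids recognizing non-caption text elsewhere in the frame.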
As an example, steps S10 to S30 include: extracting image frames and their corresponding timestamps from the target video at a preset interval; detecting the subtitle area in each image frame with the trained target detection model to obtain the coordinates of each subtitle area, where the character detection model is trained on pictures whose subtitle area coordinates have been manually annotated; performing OCR character recognition on each image frame according to the subtitle area coordinates, extracting the character string corresponding to each image frame, and adding each character string and its timestamp to a character string queue; comparing the character strings in the character string queue in timestamp order and, if a character string is inconsistent with the corresponding previous character string, adding the character string and its timestamp to an identification queue; comparing the entries of the identification queue in timestamp order and, if a character string is longer than the corresponding previous character string, taking the comparison result of the two as the caption character-by-character recognition result for that character string; if the character string is shorter than the previous character string, taking the character string itself as the caption character-by-character recognition result; and playing and displaying each caption character-by-character recognition result according to its corresponding timestamp.
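The frame-sampling and queue-assembly parts of steps S10 to S30 can be sketched as follows. This is an illustrative Python sketch only: the detection and OCR stages are assumed to have already produced `(timestamp_ms, text)` pairs, and all names and data shapes are assumptions rather than the application's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class TimedString:
    text: str       # OCR result for one sampled frame
    ts_ms: int      # timestamp of that frame in milliseconds

def frame_indices(total_frames: int, fps: float, interval_s: float) -> list[int]:
    """Indices of the frames to sample: one frame per preset interval."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

def build_string_queue(ocr_results: list[tuple[int, str]]) -> list[TimedString]:
    """Assemble the timestamp-ordered character string queue from
    (timestamp_ms, ocr_text) pairs produced by the detection + OCR stage."""
    return sorted((TimedString(text, ts) for ts, text in ocr_results),
                  key=lambda s: s.ts_ms)
```

For a 25 fps video sampled every 0.2 s, `frame_indices` grabs every fifth frame; the resulting queue is what the later comparison steps consume.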
Referring to fig. 3, the step of comparing each of the character strings in the character string queue with the corresponding previous character string in sequence, determining a caption character-by-character recognition result, and displaying the caption character-by-character according to the caption character-by-character recognition result and the corresponding timestamp includes:
step S31, comparing each character string in the character string queue with the corresponding previous character string in sequence;
step S32, if the character string is longer than the previous character string, taking the difference value between the character string and the previous character string as a character-by-character recognition result of the caption corresponding to the character string;
step S33, if the character string is shorter than the former character string, taking the character string as a caption character-by-character recognition result corresponding to the character string;
step S34, if the character string is the same as the previous character string, the character string is abandoned;
and step S35, displaying each caption character-by-character recognition result according to the time stamp corresponding to each character string and each caption character-by-character recognition result.
In the embodiment of the present application, it should be noted that the caption character-by-character recognition result includes the character information and the corresponding timestamp and is used for displaying and playing the caption character by character. The difference between a character string and its previous character string is the content newly added in the character string; for example, the difference between the character string 'Beijing welcome you' and the previous character string 'Beijing welcome' is 'you', so 'you' is taken as the caption character-by-character recognition result.
As an example, steps S31 to S35 include: comparing each character string with the corresponding previous character string in sequence, according to the timestamp of each character string. If the character string is longer than the previous character string, it is determined that the two belong to the same subtitle and the character string contains updated content, and the difference between the character string and the previous character string is taken as the caption character-by-character recognition result corresponding to the character string. If the character string is shorter than the previous character string, it is determined that the two do not belong to the same subtitle and the character string is a new subtitle, and the character string itself is taken as the caption character-by-character recognition result. If the character string is the same as the previous character string, it is determined that the two belong to the same subtitle with no updated content, and the character string is discarded. Each caption character-by-character recognition result is then displayed character by character according to the timestamp corresponding to each character string.
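The longer / shorter / same rule of steps S31 to S35 can be sketched as follows (illustrative Python; taking the 'difference' as the new suffix assumes a caption only grows by appending characters, a karaoke-style assumption the application does not state explicitly):

```python
def word_by_word_results(queue: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Apply the longer / shorter / same rule of steps S31-S35.

    queue holds (text, timestamp_ms) pairs in timestamp order.  The
    'difference' of a longer string is taken here as its new suffix,
    which assumes the caption grows by appending characters.
    """
    results: list[tuple[str, int]] = []
    prev = ""
    for text, ts in queue:
        if text == prev:
            continue                                # same caption, nothing new: discard
        if len(text) > len(prev):
            results.append((text[len(prev):], ts))  # same subtitle, newly shown chars
        else:
            results.append((text, ts))              # shorter: a new subtitle line
        prev = text
    return results
```

Applied to the frames "Bei", "Beijing", "Beijing", "wel", this yields "Bei", then "jing", then (skipping the repeat) the new line "wel", each with its timestamp.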
As another example, the step of comparing each character string in the character string queue with a corresponding previous character string in sequence, determining a caption character-by-character recognition result, and displaying the caption character-by-character according to the caption character-by-character recognition result and a corresponding timestamp includes:
step A10, comparing each character string in the character string queue with a corresponding previous character string in sequence, and if the character strings and the corresponding time stamps are different, adding the character strings and the corresponding time stamps into an identification queue;
step A20, comparing each character string in the identification queue with the corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
As an example, steps a10 to a20 include: comparing each character string in the character string queue with a previous character string corresponding to the character string in sequence according to the corresponding timestamp; when the difference between the character string and the previous character string is detected, adding a timestamp corresponding to the character string and the character string into an identification queue; if the character string is detected to be not different from the previous character string, discarding the character string; sequentially comparing each character string with a corresponding previous character string in the identification queue, and judging the length relationship between the character string and the previous character string; generating a caption character-by-character recognition result corresponding to the character string according to the length relation; and displaying the caption word by word according to the caption word by word recognition result and the time stamp corresponding to the character string.
Wherein, the step of comparing each character string in the character string queue with the corresponding previous character string in sequence, and if there is a difference, adding the character string and the corresponding timestamp into the identification queue comprises:
step A11, arranging each character string in the character string queue according to a corresponding time stamp sequence;
step A12, comparing each character string with a corresponding previous character string in sequence, and if the character string is different from the previous character string, adding the character string into the identification queue;
and step A13, if the character string is the same as the previous character string, discarding the character string.
As an example, steps a11 to a13 include: sequencing each character string in the character string queue according to the time stamp sequence corresponding to each character string; comparing each character string with a corresponding previous character string in the character string queue according to the time stamp sequence; if the character string is not completely the same as the previous character string, judging that the character string has updated content compared with the previous character string, and adding the character string into the identification queue; if the character string is the same as the previous character string, judging that the character string has no updated content compared with the previous character string, and abandoning the character string.
As an example, in the word-by-word lyric display of an MV video, let S be the character string queue and i the recognition result of each frame. Compare i with i-1; if the recognition results of i and i-1 are inconsistent, the singing is considered to have entered the next character, and the timestamp of the current i is recorded together with the recognized characters. For example, if the character string of frame i is 'Beijing welcome you' and that of frame i-1 is 'Beijing welcome', then 'Beijing welcome you' and its timestamp are saved and added to the recognition queue T.
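The deduplication of the character string queue into the recognition queue described above can be sketched as follows (illustrative Python; names and data shapes are assumptions):

```python
def dedupe_into_recognition_queue(string_queue: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only the frames whose OCR text differs from the previous
    frame; identical consecutive strings are discarded because the
    caption has not advanced.  Entries are (text, timestamp_ms) pairs
    already sorted by timestamp."""
    recognition_queue: list[tuple[str, int]] = []
    prev = None
    for text, ts in string_queue:
        if text != prev:                       # changed since the previous frame
            recognition_queue.append((text, ts))
            prev = text
    return recognition_queue
```

Because frames are sampled far more often than the caption changes, most entries are discarded here, which keeps the later subset comparison cheap.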
The steps of comparing each character string in the identification queue with a corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp comprise:
step A21, judging whether the previous character string is a subset of the character string;
step A22, if yes, judging that the character string and the previous character string belong to the same subtitle, and taking the difference value of the character string and the previous character string as a subtitle character-by-character identification result corresponding to the character string;
step A23, if not, determining that the character string and the previous character string do not belong to the same subtitle, and taking the character string as a subtitle character-by-character recognition result corresponding to the character string;
and A24, displaying each subtitle character-by-character recognition result according to the time stamp corresponding to each character string and each subtitle character-by-character recognition result.
As an example, steps a21 to a24 include: judging whether the former string of two adjacent character strings in the identification queue is a subset of the latter string; if the former string is a subset of the latter string, judging that the latter string and the former string belong to the same subtitle, and taking the difference value between the latter string and the former string as the subtitle character-by-character identification result corresponding to the latter string, wherein the difference value expresses the updated content of the latter string compared with the former string; if the former string is not a subset of the latter string, judging that the latter string and the former string do not belong to the same subtitle, and taking the latter string as the subtitle character-by-character identification result corresponding to the latter string; and displaying each subtitle character-by-character recognition result according to the time stamp corresponding to each character string.
As an example, in the word-by-word display of MV video lyrics, the recognition queue T is traversed, each recognition result t is taken out and compared with t-1. If the recognition result of t is longer than t-1, t-(t-1) is computed to obtain the recognition difference value as the word-by-word lyric result, and the time stamp of t is recorded. For example, the difference between "Beijing welcome you" and "Beijing welcome" is "you", so "you" is taken as the caption character-by-character recognition result. If the recognition result of t is shorter than t-1, it is judged that a new line of lyrics has appeared in the video playing, and the recognition result of t is taken as the word-by-word lyric result: for example, with "Beijing welcome you" as t-1 and "yes" as t, "yes" is taken as the caption word-by-word recognition result. The time stamps of all difference results are recorded and stored as the caption word-by-word recognition results.
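The traversal above can be sketched as a small function. This is an illustrative reading of steps A21 to A24 under one assumption: "the previous character string is a subset of the character string" is interpreted here as a prefix relation, since a subtitle line grows from left to right as it is sung.

```python
def verbatim_results(recognition_queue):
    """Map (text, timestamp) pairs to (new_characters, timestamp) pairs."""
    results = []
    prev = ""
    for text, ts in recognition_queue:
        if prev and text.startswith(prev):   # same subtitle: emit the difference
            diff = text[len(prev):]
        else:                                # new subtitle line: emit the whole string
            diff = text
        if diff:
            results.append((diff, ts))
        prev = text
    return results
```

Each emitted pair is one character group to display at its recorded timestamp.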
The step of extracting each image frame from the target video, detecting the caption area in each image frame through a character detection model and obtaining the coordinate of each caption area comprises the following steps:
step S11, extracting a frame of picture from the target video every other preset time period to obtain each image frame;
and S12, performing character detection on the preset area of each image frame through the character detection model to obtain the coordinates of each subtitle area.
In this embodiment, it should be noted that the target video may be an MV lyric video, and the text detection model may be a YOLO target detection model, which is used to detect the subtitle region in each image frame and is trained on pictures extracted from MV videos whose subtitle regions have been manually annotated.
Preferably, the preset time period is 0.5s, and the preset area is the lower third of the video frame, i.e., the subtitle area: in MV videos no subtitles are displayed in the upper two-thirds of the frame, so to avoid redundant recognition the RGB values in the upper two-thirds of each frame are set to 0.
As an example, steps S11 to S12 include: sequentially extracting image frames from the target video file according to the preset time period to obtain each image frame corresponding to the target video; and performing caption area detection on the preset area of each image frame through a YOLO target detection model obtained by extracting picture training according to each MV video subjected to caption area labeling to obtain corresponding coordinates of each caption area.
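The bookkeeping in steps S11 to S12 can be sketched as below. Decoding frames (e.g. with OpenCV) and running the YOLO detector are omitted; only the sampling arithmetic and the preset detection area are shown, and the function names are illustrative assumptions.

```python
def sample_timestamps(duration_s, period_s=0.5):
    """Timestamps, one per preset time period, at which frames are extracted."""
    count = int(duration_s / period_s) + 1
    return [round(i * period_s, 3) for i in range(count)]

def subtitle_band(frame_height):
    """Row range of the preset area (the lower third of the frame).
    Rows above this band have their RGB values set to 0 before detection."""
    return (2 * frame_height // 3, frame_height)
```

For a 1080-pixel-tall frame, only rows 720 to 1080 are passed to the detector; everything above is blanked to suppress spurious text hits.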
The step of extracting the characters in each image frame according to the coordinates of each subtitle area to obtain a character string queue formed by each character string containing a timestamp comprises the following steps:
step S21, according to the subtitle region coordinates corresponding to each image frame, performing OCR character recognition on each image frame to obtain corresponding character strings;
and step S22, arranging each character string according to the time stamp contained in the image frame corresponding to each character string to obtain the character string queue.
As one example, steps S21 to S22 include: dividing each subtitle region containing a target subtitle according to the subtitle region coordinate corresponding to each image frame, and performing OCR character recognition on each subtitle region to obtain corresponding character strings, wherein each character string contains characters in the subtitle; confirming the time stamp corresponding to each character string according to the image frame corresponding to each character string and the time stamp of each image frame; and adding each character string into the character string queue according to the sequence of the time stamp corresponding to each character string to obtain the character string queue.
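Steps S21 to S22 can be sketched as follows. The `ocr` callable stands in for a real OCR engine and is a hypothetical parameter; images are plain nested lists here so the region crop can be shown without third-party dependencies.

```python
def build_string_queue(frames, regions, ocr):
    """Crop each frame to its subtitle region, recognize it, and sort by timestamp.

    frames  -- list of (timestamp, image) pairs
    regions -- matching list of (x1, y1, x2, y2) subtitle-region coordinates
    ocr     -- callable mapping a cropped image to a character string
    """
    queue = []
    for (ts, image), (x1, y1, x2, y2) in zip(frames, regions):
        crop = [row[x1:x2] for row in image[y1:y2]]  # cut out the subtitle region
        queue.append((ts, ocr(crop)))
    queue.sort(key=lambda item: item[0])             # arrange by timestamp
    return queue
```

The final sort realizes "adding each character string into the character string queue according to the sequence of the time stamp".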
The method comprises the steps of firstly extracting each image frame from a target video, detecting a subtitle area in each image frame through a character detection model to obtain coordinates of each subtitle area, extracting characters in each image frame according to the coordinates of each subtitle area to obtain a character string queue formed by each character string containing a time stamp, and finally comparing each character string in the character string queue with the corresponding previous character string in sequence to determine a subtitle character-by-character recognition result and displaying the subtitle word by word according to the subtitle word-by-word recognition result and the corresponding time stamp.
Example two
An embodiment of the present application further provides a subtitle word-by-word display apparatus, where the subtitle word-by-word display apparatus is applied to a subtitle word-by-word display device, and with reference to fig. 4, the subtitle word-by-word display apparatus includes:
the region detection module is used for extracting each image frame from a target video, detecting a subtitle region in each image frame through a character detection model and obtaining coordinates of each subtitle region, wherein the character detection model is obtained through training according to a picture which is marked with the subtitle region;
the character extraction module is used for extracting characters in each image frame according to the coordinates of each subtitle area to obtain a character string queue formed by each character string containing a timestamp;
and the identification determining module is used for comparing each character string in the character string queue with the corresponding previous character string in sequence, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
Optionally, the identification determination module is further configured to:
comparing each character string in the character string queue with a corresponding previous character string in sequence;
if the character string is longer than the previous character string, taking the difference value between the character string and the previous character string as a character-by-character recognition result of the caption corresponding to the character string;
if the character string is shorter than the former character string, taking the character string as a character-by-character recognition result of the caption corresponding to the character string;
if the character string is the same as the previous character string, discarding the character string;
and displaying the subtitle character-by-character recognition results according to the time stamps corresponding to the character strings and the subtitle character-by-character recognition results.
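The module's three-way comparison above can be sketched as one function, assuming the longer string extends the shorter one so the difference value is the trailing characters; this is an illustration, not the module's actual code.

```python
def compare(current, previous):
    """Return the caption character-by-character result, or None to discard."""
    if current == previous:
        return None                      # identical: discard the string
    if len(current) > len(previous):
        return current[len(previous):]   # longer: emit the difference value
    return current                       # shorter: a new subtitle line, emit it whole
```

The `None` return corresponds to the "discard the character string" branch; callers skip those frames when building the display timeline.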
Optionally, the identification determination module is further configured to:
comparing each character string in the character string queue with a corresponding previous character string in sequence, and if the character strings and the corresponding time stamps are different, adding the character strings and the corresponding time stamps into an identification queue;
comparing each character string in the identification queue with the corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
Optionally, the identification determination module is further configured to:
arranging all character strings in the character string queue according to the corresponding time stamp sequence;
comparing each character string with a corresponding previous character string in sequence, and if the character string is different from the previous character string, adding the character string into the identification queue;
and if the character string is the same as the previous character string, discarding the character string.
Optionally, the identification determination module is further configured to:
judging whether the previous character string is a subset of the character string;
if so, judging that the character string and the previous character string belong to the same subtitle, and taking the difference value of the character string and the previous character string as a subtitle character-by-character identification result corresponding to the character string;
if not, judging that the character string and the previous character string do not belong to the same subtitle, and taking the character string as a subtitle character-by-character recognition result corresponding to the character string;
and displaying the subtitle character-by-character recognition results according to the time stamps corresponding to the character strings and the subtitle character-by-character recognition results.
Optionally, the area detection module is further configured to:
extracting a frame of picture from the target video every other preset time period to obtain each image frame;
and performing character detection on the preset area of each image frame through the character detection model to obtain the coordinates of each subtitle area.
Optionally, the text extraction module is further configured to:
performing OCR character recognition on each image frame according to the subtitle region coordinate corresponding to each image frame to obtain each corresponding character string;
and arranging the character strings according to the time stamps contained in the image frames corresponding to the character strings to obtain the character string queue.
By adopting the subtitle word-by-word display method in the embodiment, the subtitle word-by-word display device solves the technical problem of low accuracy of displaying texts word-by-word in subtitles. Compared with the prior art, the beneficial effect of the subtitle word-by-word display device provided by the embodiment of the present application is the same as that of the subtitle word-by-word display method provided by the above embodiment, and other technical features in the subtitle word-by-word display device are the same as those disclosed in the method of the previous embodiment, which are not described herein again.
EXAMPLE III
An embodiment of the present application provides an electronic device, which includes: at least one processor; and, a memory communicatively linked with the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the subtitle word-by-word display method according to the first embodiment.
Referring now to FIG. 5, shown is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device may include a processing means (e.g., a central processing unit, a graphic processor, etc.) that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage means into a Random Access Memory (RAM). In the RAM, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device, the ROM, and the RAM are connected to each other by a bus. An input/output (I/O) interface is also linked to the bus.
In general, the following systems may be linked to the I/O interface: input devices including, for example, touch screens, touch pads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, and the like; output devices including, for example, liquid crystal displays (LCDs), speakers, vibrators, and the like; storage devices including, for example, magnetic tape, hard disk, and the like; and a communication device. The communication means may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While the figures illustrate an electronic device with various systems, it is to be understood that not all illustrated systems are required to be implemented or provided. More or fewer systems may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means, or installed from a storage means, or installed from a ROM. The computer program, when executed by a processing device, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
The electronic device provided by the application adopts the subtitle word-by-word display method in the embodiment, so that the technical problem of low accuracy of displaying a text word-by-word by a subtitle is solved. Compared with the prior art, the beneficial effect of the electronic device provided by the embodiment of the present application is the same as that of the subtitle word-by-word display method provided by the first embodiment, and other technical features in the electronic device are the same as those disclosed in the method of the previous embodiment, which are not described herein again.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the foregoing description of embodiments, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Example four
The present embodiment provides a computer-readable storage medium having computer-readable program instructions stored thereon for performing the method for displaying subtitles in a word-by-word manner in the first embodiment.
The computer readable storage medium provided by the embodiments of the present application may be, for example, a USB flash disk, but is not limited thereto, and may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the above. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present embodiment, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer-readable storage medium may be embodied in an electronic device; or may be separate and not incorporated into the electronic device.
The computer-readable storage medium carries one or more programs which, when executed by an electronic device, cause the electronic device to: extract each image frame from a target video, and detect the caption area in each image frame through a character detection model to obtain the coordinates of each caption area, wherein the character detection model is obtained by training according to pictures in which the caption area has been marked; extract the characters in each image frame according to each subtitle region coordinate to obtain a character string queue formed by each character string containing a time stamp; and compare each character string in the character string queue with the corresponding previous character string in sequence, determine a caption character-by-character recognition result, and display the caption character by character according to the caption character-by-character recognition result and the corresponding timestamp.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be linked to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the link may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the names of the modules do not in some cases constitute a limitation of the unit itself.
The computer readable storage medium provided by the application stores computer readable program instructions for executing the subtitle word-by-word display method, and solves the technical problem of low accuracy of subtitle word-by-word display text. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the embodiment of the present application are the same as the beneficial effects of the subtitle word-by-word display method provided by the above embodiment, and are not described herein again.
EXAMPLE five
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the subtitle word-by-word display method as described above.
The computer program product provided by the application solves the technical problem of low accuracy of displaying the text word by the subtitle. Compared with the prior art, the beneficial effects of the computer program product provided by the embodiment of the present application are the same as the beneficial effects of the subtitle word-by-word display method provided by the above embodiment, and are not described herein again.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all equivalent structures or equivalent processes, which are directly or indirectly applied to other related technical fields, and which are not limited by the present application, are also included in the scope of the present application.

Claims (10)

1. A subtitle word-by-word display method is characterized by comprising the following steps:
extracting each image frame from a target video, and detecting a subtitle area in each image frame through a character detection model to obtain coordinates of each subtitle area, wherein the character detection model is obtained through training according to a picture with the subtitle area being marked;
extracting characters in each image frame according to the coordinates of each subtitle area to obtain a character string queue formed by each character string containing a timestamp;
and comparing each character string in the character string queue with a corresponding previous character string in sequence, determining a caption character-by-character recognition result, and displaying the caption character-by-character according to the caption character-by-character recognition result and a corresponding timestamp.
2. The method for displaying the caption word by word according to claim 1, wherein the step of comparing each character string in the character string queue with the corresponding previous character string in sequence to determine a caption word by word recognition result, and displaying the caption word by word according to the caption word by word recognition result and the corresponding timestamp comprises:
comparing each character string in the character string queue with a corresponding previous character string in sequence;
if the character string is longer than the previous character string, taking the difference value between the character string and the previous character string as a character-by-character recognition result of the caption corresponding to the character string;
if the character string is shorter than the previous character string, taking the character string as a character-by-character recognition result of the caption corresponding to the character string;
if the character string is the same as the previous character string, discarding the character string;
and displaying the subtitle character-by-character recognition results according to the time stamps corresponding to the character strings and the subtitle character-by-character recognition results.
3. The method for displaying the caption word by word according to claim 1, wherein the step of comparing each character string in the character string queue with the corresponding previous character string in sequence to determine a caption word by word recognition result, and displaying the caption word by word according to the caption word by word recognition result and the corresponding timestamp comprises:
comparing each character string in the character string queue with a corresponding previous character string in sequence, and if the character strings and the corresponding time stamps are different, adding the character strings and the corresponding time stamps into an identification queue;
comparing each character string in the identification queue with the corresponding previous character string, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
4. The method for displaying caption words according to claim 3, wherein the step of comparing each character string in the character string queue with the corresponding previous character string in sequence, and if there is a difference, adding the character string and the corresponding timestamp into the identification queue comprises:
arranging all character strings in the character string queue according to the corresponding time stamp sequence;
comparing each character string with a corresponding previous character string in sequence, and if the character string is different from the previous character string, adding the character string into the identification queue;
and if the character string is the same as the previous character string, discarding the character string.
5. The method for displaying the caption word by word according to claim 3, wherein the steps of arranging the character strings in the identification queue according to the corresponding time stamp sequence, comparing the character strings in the identification queue with the corresponding previous character string, determining the caption word by word identification result, and displaying the caption word by word according to the caption word by word identification result and the corresponding time stamp comprise:
judging whether the previous character string is a subset of the character string;
if yes, judging that the character string and the previous character string belong to the same subtitle, and taking the difference value of the character string and the previous character string as a subtitle character-by-character identification result corresponding to the character string;
if not, judging that the character string and the previous character string do not belong to the same subtitle, and taking the character string as a subtitle character-by-character recognition result corresponding to the character string;
and displaying the character-by-character recognition result of each subtitle according to the time stamp corresponding to each character string and the character-by-character recognition result of each subtitle.
6. The method for displaying caption characters by character according to claim 1, wherein the step of extracting each image frame from the target video, detecting the caption area in each image frame through a character detection model, and obtaining the coordinates of each caption area comprises:
extracting a frame of picture from the target video every other preset time period to obtain each image frame;
and performing character detection on the preset area of each image frame through the character detection model to obtain the coordinates of each subtitle area.
7. The subtitle word-by-word display method according to claim 1, wherein the step of extracting the text in each image frame according to the coordinates of each subtitle region to obtain a string queue including strings of time stamps comprises:
performing OCR character recognition on each image frame according to the subtitle region coordinate corresponding to each image frame to obtain each corresponding character string;
and arranging the character strings according to the time stamps contained in the image frames corresponding to the character strings to obtain the character string queue.
8. A subtitle word-by-word display apparatus, comprising:
the region detection module is used for extracting each image frame from a target video, detecting a subtitle region in each image frame through a character detection model and obtaining coordinates of each subtitle region, wherein the character detection model is obtained through training according to a picture which is marked with the subtitle region;
the character extraction module is used for extracting characters in each image frame according to the coordinates of each subtitle area to obtain a character string queue formed by each character string containing a timestamp;
and the identification determining module is used for comparing each character string in the character string queue with the corresponding previous character string in sequence, determining a caption character-by-character identification result, and displaying the caption character-by-character according to the caption character-by-character identification result and the corresponding timestamp.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and,
a memory communicatively linked with the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the subtitle verbatim display method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a program for implementing a subtitle word-by-word display method, the program being executable by a processor to implement the steps of the subtitle word-by-word display method according to any one of claims 1 to 7.
CN202211268591.1A 2022-10-17 2022-10-17 Caption character-by-character display method and device, electronic equipment and readable storage medium Pending CN115690762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211268591.1A CN115690762A (en) 2022-10-17 2022-10-17 Caption character-by-character display method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115690762A true CN115690762A (en) 2023-02-03

Family

ID=85065699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211268591.1A Pending CN115690762A (en) 2022-10-17 2022-10-17 Caption character-by-character display method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115690762A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117319738A (en) * 2023-12-01 2023-12-29 飞狐信息技术(天津)有限公司 Subtitle delay method and device, electronic equipment and storage medium
CN117319738B (en) * 2023-12-01 2024-03-08 飞狐信息技术(天津)有限公司 Subtitle delay method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
CN109543064B (en) Lyric display processing method and device, electronic equipment and computer storage medium
CN109698920B (en) Follow teaching system based on internet teaching platform
CN109803180B (en) Video preview generation method and device, computer equipment and storage medium
CN111970577B (en) Subtitle editing method and device and electronic equipment
CN110177295B (en) Subtitle out-of-range processing method and device and electronic equipment
CN113365134B (en) Audio sharing method, device, equipment and medium
CN104081784A (en) Information processing device, information processing method, and program
CN108109636B (en) Text-based voice playing method and device, computer equipment and storage medium
CN109348277B (en) Motion pixel video special effect adding method and device, terminal equipment and storage medium
CN111401228B (en) Video target labeling method and device and electronic equipment
US11595591B2 (en) Method and apparatus for triggering special image effects and hardware device
CN110930220A (en) Display method, display device, terminal equipment and medium
CN115690762A (en) Caption character-by-character display method and device, electronic equipment and readable storage medium
US11004350B2 (en) Computerized training video system
CN113688341A (en) Dynamic picture decomposition method and device, electronic equipment and readable storage medium
CN116226453B (en) Method, device and terminal equipment for identifying dancing teaching video clips
CN112598961A (en) Piano performance learning method, electronic device and computer readable storage medium
CN115460353B (en) Teaching tracking camera equipment configuration method and device, electronic equipment and medium
CN115619897A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115167966A (en) Information prompting method, device, equipment, medium and product based on lyrics
CN112423120B (en) Audio time delay detection method and system
CN113657381A (en) Subtitle generating method, device, computer equipment and storage medium
US20210142188A1 (en) Detecting scenes in instructional video
CN113891026B (en) Recording and broadcasting video marking method and device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination