CN115797921A - Subtitle recognition method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN115797921A
CN115797921A (application CN202310053894.XA; granted as CN115797921B)
Authority
CN
China
Prior art keywords
video data
subtitle
text box
built
subtitles
Prior art date
Legal status
Granted
Application number
CN202310053894.XA
Other languages
Chinese (zh)
Other versions
CN115797921B (en)
Inventor
刘艳鑫
Current Assignee
Beijing Intengine Technology Co Ltd
Original Assignee
Beijing Intengine Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Intengine Technology Co Ltd filed Critical Beijing Intengine Technology Co Ltd
Priority to CN202310053894.XA priority Critical patent/CN115797921B/en
Publication of CN115797921A publication Critical patent/CN115797921A/en
Application granted granted Critical
Publication of CN115797921B publication Critical patent/CN115797921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a subtitle recognition method and apparatus, an electronic device, and a readable storage medium, wherein the subtitle recognition method includes the following steps: acquiring video data; detecting a subtitle file corresponding to the video data; when the subtitle file corresponding to the video data is not detected, detecting whether the video data has built-in subtitle information; when it is detected that the video data has built-in subtitle information, traversing all text boxes of the video data; and identifying subtitles of the video data based on the text boxes and a preset motion detection algorithm. The subtitle recognition scheme provided by the application improves the accuracy of subtitle recognition.

Description

Subtitle recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of communications, and in particular, to a method and an apparatus for recognizing subtitles, an electronic device, and a readable storage medium.
Background
With the rapid development of multimedia technology and network technology, modern computer technology, especially mass data storage and transmission, has matured, and video, as a primary media type, has increasingly become an indispensable information carrier in people's daily life, education, entertainment, and other fields. When playing video files, especially when watching foreign films, subtitles become a very important part of the viewing experience.
Current mainstream players all provide functions for matching a played video with online subtitles, but the success rate of such matching varies widely, which directly degrades the subtitle playback experience. The main cause of this problem is that the playing client does not have a sufficiently rich set of correspondences between subtitle files and video files, resulting in a low hit rate when matching during playback.
Disclosure of Invention
In view of the foregoing technical problems, the present application provides a method and an apparatus for recognizing subtitles, an electronic device, and a readable storage medium, which can improve the accuracy of recognizing subtitles.
In order to solve the above technical problem, the present application provides a subtitle recognition method, including:
acquiring video data;
detecting a subtitle file corresponding to the video data;
when the subtitle file corresponding to the video data is not detected, detecting whether the video data has built-in subtitle information or not;
when detecting that the video data has the built-in subtitle information, traversing all text boxes of the video data;
and identifying the subtitles of the video data based on the text box and a preset motion detection algorithm.
Optionally, in some embodiments of the present application, the identifying subtitles of the video data based on the text box and a preset motion detection algorithm includes:
acquiring the resolution of the video data;
constructing an initial matrix based on the resolution;
and identifying subtitles of the video data according to the text box, the initial matrix and a preset motion detection algorithm.
Optionally, in some embodiments of the present application, the identifying subtitles of the video data according to the text box, the initial matrix and a preset motion detection algorithm includes:
calculating a motion vector of the text box based on a preset motion detection algorithm;
determining the currently processed text box as a currently processed object;
calculating the intersection ratio between the current processing object and the text box;
and identifying subtitles of the video data based on the intersection ratio, the initial matrix and the motion vector.
Optionally, in some embodiments of the present application, the identifying subtitles of the video data based on the intersection ratio, the initial matrix, and the motion vector includes:
updating the text box based on the intersection ratio;
filtering the updated text box according to the initial matrix and the motion vector;
and determining the subtitle of the processed text box as the subtitle of the video data.
Optionally, in some embodiments of the present application, the detecting whether the video data has built-in subtitle information when the subtitle file corresponding to the video data is not detected includes:
determining the frame number of characters contained in the video data;
detecting whether the frame number is greater than a preset value;
when the frame number is detected to be larger than a preset value, traversing a text box in the video data;
and detecting whether the video data has built-in subtitle information or not based on the text box.
Optionally, in some embodiments of the present application, the acquiring video data includes:
acquiring a video link;
and downloading the video data according to the video link, and converting the video data into video data in a preset format.
Optionally, in some embodiments of the present application, after the recognizing the subtitles of the video data based on the text box and a preset motion detection algorithm, the method further includes:
and carrying out voice alignment on the video data according to the built-in subtitle information.
Correspondingly, the present application also provides a subtitle recognition apparatus, including:
the acquisition module is used for acquiring video data;
the first detection module is used for detecting a subtitle file corresponding to the video data;
the second detection module is used for detecting whether the video data has built-in subtitle information or not when the subtitle file corresponding to the video data is not detected;
the traversal module is used for traversing all text boxes of the video data when the video data is detected to have the built-in subtitle information;
and the identification module is used for identifying the subtitles of the video data based on the text box and a preset motion detection algorithm.
The present application further provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
The present application also provides a computer storage medium having a computer program stored thereon, which, when being executed by a processor, carries out the steps of the method as described above.
As described above, the present application provides a subtitle recognition method, an apparatus, an electronic device, and a readable storage medium. After video data is acquired, a subtitle file corresponding to the video data is detected; when no corresponding subtitle file is detected, whether the video data has built-in subtitle information is detected; when the video data has built-in subtitle information, all text boxes of the video data are traversed; and finally, subtitles of the video data are recognized based on the text boxes and a preset motion detection algorithm. In the subtitle recognition scheme provided by the application, whether video data has a corresponding subtitle file can be detected; when it does not, whether the video data has built-in subtitle information is detected; and when built-in subtitle information is detected, subtitles of the video data are recognized from the text boxes of the video data and a preset motion detection algorithm. Subtitle recognition is thus achieved without relying on a subtitle file for the video data, which avoids the problem that, when a subtitle file is lacking, video subtitles cannot be recognized or are recognized with poor accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description serve to explain the principles of the application. In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below; it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of a subtitle recognition system according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a subtitle recognition method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a subtitle recognition apparatus according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an intelligent terminal provided in an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings. Specific embodiments of the present application have been shown by way of example in the drawings and will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus in which the element is incorporated. Further, similarly named components, features, and elements in different embodiments of the application may have the same meaning or may have different meanings; the specific meaning should be determined by its interpretation in the specific embodiment or by further combination with the context of that embodiment.
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for convenience of description of the present application and have no specific meaning in themselves. Thus, "module", "component", and "unit" may be used interchangeably.
The following embodiments related to the present application are specifically described, and it should be noted that the order of description of the embodiments in the present application is not limited to the order of priority of the embodiments.
The embodiment of the application provides a subtitle recognition method, a subtitle recognition device, a storage medium and electronic equipment. Specifically, the subtitle recognition method according to the embodiment of the present application may be executed by an electronic device or a server, where the electronic device may be a terminal. The terminal may be an electronic device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game console, a Personal Computer (PC), a Personal Digital Assistant (PDA), or the like, and may further include a client, which may be a media playing client or an instant messaging client, or the like.
For example, when the subtitle recognition method is applied to an electronic device, the electronic device may acquire video data and detect a subtitle file corresponding to the video data, detect whether the video data has built-in subtitle information when the subtitle file corresponding to the video data is not detected, traverse all text boxes of the video data when the video data has the built-in subtitle information, and finally recognize a subtitle of the video data based on the text boxes and a preset motion detection algorithm. Wherein the electronic device may interact with the user through a graphical user interface. The manner in which the electronic device provides the graphical user interface to the user may include a variety of ways, for example, the graphical user interface may be rendered for display on a display screen of the electronic device, or presented by holographic projection. For example, the electronic device may include a touch display screen for presenting a graphical user interface and receiving user operation instructions generated by a user acting on the graphical user interface, and a processor.
Referring to fig. 1, fig. 1 is a system schematic diagram of a subtitle recognition apparatus according to an embodiment of the present disclosure. The system may include at least one electronic device 1000, at least one server or personal computer 2000. The electronic device 1000 held by the user can be connected to different servers or personal computers through a network. The electronic device 1000 may be an electronic device having computing hardware capable of supporting and executing software products corresponding to multimedia. Additionally, the electronic device 1000 may also have one or more multi-touch sensitive screens for sensing and obtaining input by a user through touch or slide operations performed at multiple points of the one or more touch sensitive display screens. In addition, the electronic apparatus 1000 may be interconnected with a server or a personal computer 2000 through a network. The network may be a wireless network or a wired network, such as a Wireless Local Area Network (WLAN), a Local Area Network (LAN), a cellular network, a 2G network, a 3G network, a 4G network, a 5G network, etc. In addition, different electronic devices 1000 may also be connected to other embedded platforms or to servers, personal computers, and the like using their own bluetooth networks or hotspot networks. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform.
The embodiment of the application provides a subtitle recognition method which can be executed by an electronic device or a server. The embodiment of the present application is described by taking an example in which the subtitle recognition method is executed by an electronic device. The electronic equipment comprises a touch display screen and a processor, wherein the touch display screen is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface. When a user operates the graphical user interface through the touch display screen, the graphical user interface can control the local content of the electronic equipment through responding to the received operation instruction, and can also control the content of the server end through responding to the received operation instruction. For example, the operation instructions generated by the user acting on the graphical user interface include instructions for processing the initial audio data, and the processor is configured to launch the corresponding application program after receiving the instructions provided by the user. Further, the processor is configured to render and draw a graphical user interface associated with the application on the touch-sensitive display screen. A touch display screen is a multi-touch sensitive screen capable of sensing a touch or slide operation performed simultaneously at a plurality of points on the screen. The user uses a finger to execute touch operation on the graphical user interface, and the graphical user interface controls the corresponding operation displayed in the graphical user interface of the application when detecting the touch operation.
The present application provides a subtitle recognition scheme that can detect whether video data has a corresponding subtitle file; when the video data does not have a subtitle file, detect whether the video data has built-in subtitle information; and, when built-in subtitle information is detected, recognize the subtitles of the video data according to the text boxes of the video data and a preset motion detection algorithm. Subtitle recognition is thereby achieved without relying on a subtitle file for the video data, avoiding the situation in which video subtitles cannot be recognized, or are recognized with poor accuracy, when a subtitle file is lacking. It can be seen that this scheme can improve the accuracy of subtitle recognition.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
A method of subtitle recognition, comprising: acquiring video data; detecting a subtitle file corresponding to the video data; when the subtitle file corresponding to the video data is not detected, detecting whether the video data has built-in subtitle information or not; when detecting that the video data has the built-in subtitle information, traversing all text boxes of the video data; and identifying subtitles of the video data based on the text box and a preset motion detection algorithm.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a subtitle recognition method according to an embodiment of the present disclosure. The specific flow of the subtitle recognition method may be as follows:
101. video data is acquired.
Video data refers to a continuous image sequence, which essentially consists of a group of consecutive images; apart from its position in the sequence, an individual image carries no structural information. The video data may be obtained over a network, for example through a Uniform Resource Locator (URL) link. That is, optionally, in some embodiments, the step of "obtaining video data" may specifically include:
(11) Acquiring a video link;
(12) And downloading the video data according to the video link, and converting the video data into the video data with a preset format.
For example, specifically, the source code of the current webpage is obtained through an incoming URL link, the URLs of the required videos are extracted using regular expressions, and the identifier and URL of each video are saved. The videos are then downloaded through the saved URLs and all converted into the same format, such as mp4, avi, or rmvb. Optionally, the video format may be adjusted according to actual requirements, which is not limited in this application.
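As an illustrative sketch of the URL-extraction step, the following Python snippet pulls candidate video URLs out of page source with a regular expression. The pattern and function name are assumptions for illustration; a real page would likely need a site-specific expression, and the downloaded files could then be converted to one container format, for example by invoking a tool such as ffmpeg.

```python
import re

def extract_video_urls(page_source):
    # Illustrative pattern: absolute URLs ending in a common video extension.
    # Real webpages may need a site-specific regular expression.
    pattern = re.compile(r'https?://[^\s"\']+\.(?:mp4|avi|rmvb)')
    return pattern.findall(page_source)
```

Each extracted URL would then be downloaded and normalised; the regular-expression step shown here is only the first half of steps (11) and (12).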
102. And detecting a subtitle file corresponding to the video data.
For example, specifically, whether the video data has a corresponding subtitle file may be determined by detecting a header of the video data, and when the video data has a corresponding subtitle file, performing subtitle identification on the video data based on the subtitle file; when it is not detected that the video data has the corresponding subtitle file, step 103 is performed.
103. And when the subtitle file corresponding to the video data is not detected, detecting whether the video data has the built-in subtitle information or not.
When the subtitle file corresponding to the video data is not detected, the built-in subtitle information of the video data is acquired. It should be noted that not every piece of video data has built-in subtitle information; therefore, whether the video data has built-in subtitle information can be detected first.
For example, it may be detected whether the number of frames containing text is greater than a preset value, and based on this, it is determined whether the video has the built-in subtitle information, that is, optionally, in some embodiments, the step "detecting whether the video data has the built-in subtitle information when the subtitle file corresponding to the video data is not detected" may specifically include:
(21) Determining the frame number of characters contained in the video data;
(22) Detecting whether the frame number is greater than a preset value;
(23) When the frame number is detected to be larger than a preset value, traversing a text box in the video data;
(24) Based on the text box, it is detected whether the video data has built-in subtitle information.
For example, specifically, f is defined as the number of frames containing characters and is initialized to zero; each time a text box is identified, f is incremented by one. If f is greater than the preset value, it is determined that the video data has built-in subtitle information, and step 104 is performed.
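A minimal sketch of this frame-counting check, assuming a text detector has already produced a per-frame flag indicating whether any text box was found (the function and argument names are illustrative):

```python
def has_built_in_subtitles(frame_has_text, preset_value):
    # frame_has_text: one boolean per frame from a text-box detector.
    # f counts the frames containing characters, as in steps (21)-(23).
    f = 0
    for flag in frame_has_text:
        if flag:
            f += 1
    return f > preset_value
```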
104. And traversing all text boxes of the video data when detecting that the video data is provided with the built-in subtitle information.
105. And identifying subtitles of the video data based on the text box and a preset motion detection algorithm.
For example, specifically, an all-zero matrix N having the same dimensions as the video resolution is initialized, and a data saving list tb is initialized, where each text box record saved in the list tb contains the following information: the text box recognition result s, the start time t1, the end time t2, the text box coordinates rect, and the motion vector mv. Then, based on the text boxes, the initial matrix, and a preset motion detection algorithm, the subtitles of the video data are recognized. That is, optionally, in some embodiments, the step of "recognizing the subtitles of the video data based on the text box and the preset motion detection algorithm" may specifically include:
(31) Acquiring the resolution of video data;
(32) Constructing an initial matrix based on the resolution;
(33) And identifying subtitles of the video data according to the text box, the initial matrix and a preset motion detection algorithm.
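Steps (31) and (32) can be sketched as follows, assuming plain Python lists of lists for the matrix N and (x1, y1, x2, y2) pixel coordinates for a text box (all names are illustrative). The second helper increments N over a text box region, so that N(x, y) ends up counting how many frames contained text at point (x, y).

```python
def build_initial_matrix(width, height):
    # All-zero matrix N with the same dimensions as the video resolution.
    return [[0] * width for _ in range(height)]

def accumulate_text_region(n_matrix, rect):
    # rect = (x1, y1, x2, y2): increment N(x, y) for every point inside
    # the text box record's area.
    x1, y1, x2, y2 = rect
    for y in range(y1, y2):
        for x in range(x1, x2):
            n_matrix[y][x] += 1
```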
For example, specifically, the motion vector of each text box may be calculated through a motion detection algorithm; meanwhile, the intersection ratio between the currently processed text box and each saved text box is calculated; and finally, the subtitles of the video data are identified according to the intersection ratio, the initial matrix, and the motion vectors. That is, optionally, in some embodiments, the step of "identifying subtitles of the video data according to the text box, the initial matrix and a preset motion detection algorithm" may specifically include:
(41) Calculating a motion vector of the text box based on a preset motion detection algorithm;
(42) Determining the currently processed text box as a currently processed object;
(43) Calculating the intersection ratio between the current processing object and the text box;
(44) Subtitles of the video data are identified based on the cross-over ratio, the initial matrix, and the motion vector.
For example, specifically, the currently processed text box is determined as the currently processed object, the intersection ratio between the currently processed object and each text box is calculated, and the text box is updated based on the intersection ratio and the content difference between the currently processed object and the text box. Then, using the initial matrix and the motion vectors, the updated text box is filtered, that is, useless text information in the text box is filtered out, so as to identify the subtitles of the video data. That is, optionally, in some embodiments, the step of "identifying a subtitle of the video data based on the intersection ratio, the initial matrix, and the motion vector" may specifically include:
(51) Updating the text boxes based on the intersection ratio;
(52) Filtering the updated text box according to the initial matrix and the motion vector;
(53) And determining the subtitles of the processed text box as the subtitles of the video data.
For example, if a point (x, y) in the matrix N is located in the area of the text box record Ri, N (x, y) is incremented by one.
A data saving list tb is initialized; each text box record saved in tb contains the following information: the text box recognition result s, the start time t1, the end time t2, the text box coordinates rect, and the motion vector mv. Each frame of the video is processed frame by frame by a text area detection tool; if a text box is identified, f is incremented by one, and the motion vector mv of each text box record Ri is calculated by a motion detection algorithm (such as the frame difference method or the optical flow method). Then the Intersection-over-Union (IoU) between the area of the text box record Ri and the area of each text box record tbi in the list tb is calculated. IoU is a concept used in object detection; it is the ratio of the intersection of a candidate box and the original marked box to their union. In this application, the candidate box is the currently processed text box and the original marked box is a text box in the data saving list. The IoU is calculated by the following formula (1), and the text box record tbmax with the largest IoU, together with that maximum value IoUmax, is recorded:
IoU(Ri, tbi) = area(Ri ∩ tbi) / area(Ri ∪ tbi)        (1)
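A minimal Python sketch of formula (1), assuming text boxes are given as (x1, y1, x2, y2) tuples (the representation is an assumption for illustration):

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```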
When IoUmax is larger than the threshold and the characters are the same, the ending time t2 of the text box record tbmax is updated to the current frame number, and the motion vector mv of tbmax is updated by the maximum method or the average method.
When IoUmax is larger than the threshold value, if the characters are different, the text box record Ri is saved in the list tb, and the starting time t1 and the ending time t2 are initialized as the current frame number.
When IoUmax is smaller than the threshold but the list tb contains a record tbi whose character recognition result is the same as that of Ri, the ending time t2 of tbi is updated to the current frame number, and the area of tbi is updated to the area of Ri.
When IoUmax is less than the threshold value and there is no record tbi in the list which is the same as Ri word recognition result, the text box record Ri is saved in the list tb, and the starting time t1 and the ending time t2 are initialized as the current frame number.
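The four update cases above can be sketched as follows, assuming each record is a dict with keys s (text), rect, t1, and t2, and that the caller has already computed tbmax and IoUmax via formula (1); the function and key names are illustrative.

```python
def update_record_list(tb, ri, tb_max, iou_max, threshold, frame_no):
    # tb: data saving list; ri: current text box record; tb_max/iou_max:
    # best match in tb and its IoU (tb_max may be None when tb is empty).
    if iou_max > threshold:
        if tb_max["s"] == ri["s"]:
            tb_max["t2"] = frame_no            # same text, same place: extend
        else:
            ri["t1"] = ri["t2"] = frame_no     # new text at the same place
            tb.append(ri)
    else:
        same = next((r for r in tb if r["s"] == ri["s"]), None)
        if same is not None:
            same["t2"] = frame_no              # same text, box moved
            same["rect"] = ri["rect"]
        else:
            ri["t1"] = ri["t2"] = frame_no     # genuinely new record
            tb.append(ri)
```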
After all frames of the current video have been processed, all text box records in the list tb are traversed, and multiple text box records that occur within approximately the same time period are merged into one text box record. For example, two lines of text may appear in a video subtitle at the same time; the two records are combined into one.
Then, filtering the invalid subtitles, specifically as follows:
Discard text that only flashes briefly: using the matrix N obtained above, divide all elements in N by the total frame number of the video, set the regions of N whose elements are greater than a threshold γ (which can be set as required) to one, and set the regions whose elements are smaller than γ to zero. Then calculate the ratio R of the sum of the values of N within the area of the text box to the area of the text box; if R is smaller than a threshold (which can be set as required), discard the text box record.
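A sketch of this flash-text filter, using the same list-of-lists representation for N; the function returns True when the record should be discarded (names and thresholds are illustrative):

```python
def is_flash_text(n_matrix, total_frames, rect, gamma, ratio_threshold):
    # Binarise N by the gamma threshold, then measure how much of the
    # text box lies in the "stable" (frequently textual) region.
    x1, y1, x2, y2 = rect
    box_area = (x2 - x1) * (y2 - y1)
    stable = sum(
        1
        for y in range(y1, y2)
        for x in range(x1, x2)
        if n_matrix[y][x] / total_frames > gamma
    )
    return stable / box_area < ratio_threshold  # True: discard the record
```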
Discard moving text: if the modulus of the motion vector mv of a text box record is greater than a threshold ι (which can be set as required), the text box record is considered moving text, for example a moving or scrolling bullet comment, or a moving license plate in the video. If the words are considered unrelated to the audio, the text box record is discarded.
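This motion check reduces to comparing the modulus of mv with the threshold ι; a minimal sketch, with the tuple representation of mv as an assumption:

```python
import math

def is_moving_text(mv, iota):
    # mv = (dx, dy): motion vector of the text box record; a modulus
    # above iota marks scrolling captions, moving bullet comments, etc.
    return math.hypot(mv[0], mv[1]) > iota
```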
Discard keywords: define a keyword ignore list composed of text information commonly seen in videos, such as CCTV, Hunan Satellite TV, Zhejiang Satellite TV, etc. If the characters of a text box record appear in the list, and the ratio of the duration of their appearance (t2 - t1) to the total number of frames is greater than a threshold β (which can be adjusted as required), the text box record is discarded.
Discard specific labels: if the ratio of a text box record's occurrence duration (t2 - t1) to the total frame number is greater than a threshold λ (which can be adjusted as required, with λ greater than β), for example if the record appears continuously for 80% of the total video time, the characters in the record are considered a specific label; if these characters do not occur in the speech, the text box record is discarded.
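The keyword and specific-label rules can be sketched together as one duration-based filter. The audio check (whether the characters actually occur in the speech) is omitted here, since it would require a speech recognition pass; all names are illustrative.

```python
def discard_by_duration(record, total_frames, ignore_list, beta, lam):
    # beta < lam, as stated in the text; record has keys "s", "t1", "t2".
    ratio = (record["t2"] - record["t1"]) / total_frames
    if record["s"] in ignore_list and ratio > beta:
        return True   # common on-screen keyword (e.g. a station name)
    if ratio > lam:
        return True   # persistent label shown for most of the video
    return False
```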
Discard words based on a classifier: collect features such as the switching interval between two successive subtitle texts in a region, the length of the subtitle text, and the corresponding display duration, and train a classifier on them; the classification method may combine one or more algorithms such as decision trees, logistic regression, naive Bayes, and neural networks to determine whether a text box record should be discarded.
Optionally, in some embodiments, to make the video easier to watch, after the subtitles of the video data have been identified, the video data may be voice-aligned according to the built-in subtitle information.
For example, the built-in subtitle information is input into a pre-constructed word-level alignment model, and a word-level alignment result corresponding to the original video data is output. The word-level alignment model may be a pre-constructed model, for example a pre-constructed end-to-end neural network model. On this basis, a phoneme-level alignment result corresponding to the video data can further be obtained through a phoneme-level alignment model, achieving two-stage alignment at both the word level and the phoneme level.
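As an illustration only — not the pre-constructed neural model the text refers to — a naive stand-in that spreads a subtitle line's words uniformly over its display interval shows the shape of a word-level alignment result:

```python
def naive_word_alignment(text, start, end):
    """Stand-in for the word-level alignment model: spread the words of a
    subtitle line uniformly over its display interval [start, end] seconds.
    A real aligner would use an acoustic model to place each word."""
    words = text.split()
    step = (end - start) / len(words)
    return [(w, round(start + i * step, 3), round(start + (i + 1) * step, 3))
            for i, w in enumerate(words)]
```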
The subtitle recognition process of the present application is completed above.
According to the subtitle recognition method described above, after video data is obtained, a subtitle file corresponding to the video data is detected. When no corresponding subtitle file is detected, whether the video data has built-in subtitle information is detected; when built-in subtitle information is detected, all text boxes of the video data are traversed; and finally, the subtitles of the video data are identified based on the text boxes and a preset motion detection algorithm.
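The overall flow summarized above can be sketched as follows (the four callables are placeholders for the detection steps described earlier, not actual APIs from the patent):

```python
def recognize_subtitles(video,
                        find_subtitle_file,
                        has_builtin_subtitles,
                        collect_text_boxes,
                        motion_filter):
    """Top-level flow of the claimed method. Each callable stands in for
    one step: locate an external subtitle file, detect built-in subtitles,
    traverse text boxes, and filter boxes with the motion-based rules."""
    subtitle_file = find_subtitle_file(video)
    if subtitle_file is not None:
        return subtitle_file                 # external subtitles found: done
    if not has_builtin_subtitles(video):
        return None                          # nothing to recognize
    boxes = collect_text_boxes(video)        # traverse all text boxes
    return [b for b in boxes if motion_filter(b)]
```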
To better implement the subtitle recognition method, the present application also provides a subtitle recognition apparatus based on that method. Terms used here have the same meanings as in the subtitle recognition method; for implementation details, refer to the description in the method embodiments.
Referring to fig. 3, fig. 3 is a schematic structural diagram of the subtitle recognition apparatus provided in the present application. The apparatus may include an obtaining module 201, a first detection module 202, a second detection module 203, a traversing module 204, and an identification module 205, which may specifically be as follows:
an obtaining module 201, configured to obtain video data.
The video data may be obtained over a network, for example through a Uniform Resource Locator (URL) link. That is, optionally, in some embodiments, the obtaining module 201 may be specifically configured to: acquire a video link; download the video data according to the video link; and convert the video data into video data in a preset format.
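One plausible way to realize this download-and-convert step is to hand the URL to ffmpeg. The flags below are a common minimal choice, not mandated by the patent, and the command is only constructed here, not executed:

```python
from pathlib import Path

def build_convert_command(url, out_dir, fmt="mp4"):
    """Construct an ffmpeg invocation that fetches a video by URL and
    rewrites it into the preset container format. '-c copy' remuxes the
    streams without re-encoding; output filename is illustrative."""
    out_path = Path(out_dir) / ("video." + fmt)
    return ["ffmpeg", "-i", url, "-c", "copy", str(out_path)]
```

The returned list can be passed to `subprocess.run` when ffmpeg is available.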
The first detecting module 202 is configured to detect a subtitle file corresponding to video data.
The second detecting module 203 is configured to detect whether the video data has the built-in subtitle information when the subtitle file corresponding to the video data is not detected.
For example, it may be detected whether the number of frames containing text is greater than a preset value, and on that basis it is determined whether the video has built-in subtitle information. That is, optionally, in some embodiments, the second detection module 203 may be specifically configured to: determine the number of frames of the video data that contain text; detect whether that frame count is greater than a preset value; when it is, traverse the text boxes in the video data; and, based on the text boxes, detect whether the video data has built-in subtitle information.
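This frame-counting decision can be sketched as follows (the text detector is a stub standing in for an OCR/text-detection model, and the preset value is illustrative):

```python
def has_builtin_subtitles(frames, detect_text, min_text_frames=30):
    """Decide whether a video carries burned-in subtitles: count the frames
    in which the text detector fires, then compare against a preset value.
    `detect_text` stands in for a real OCR/text-detection model."""
    text_frames = sum(1 for f in frames if detect_text(f))
    return text_frames > min_text_frames
```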
And the traversing module 204 is configured to traverse all text boxes of the video data when it is detected that the video data has the built-in subtitle information.
The identifying module 205 is configured to identify a subtitle of the video data based on the text box and a preset motion detection algorithm.
As can be seen from the above, the present application provides a subtitle recognition apparatus: after the obtaining module 201 obtains video data, the first detection module 202 detects a subtitle file corresponding to the video data; when no corresponding subtitle file is detected, the second detection module 203 detects whether the video data has built-in subtitle information; when it does, the traversing module 204 traverses all text boxes of the video data; and finally, the identification module 205 identifies the subtitles of the video data based on the text boxes and a preset motion detection algorithm.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
An embodiment of the present invention further provides an electronic device 500. As shown in fig. 4, the electronic device 500 may integrate the subtitle recognition apparatus and may further include a Radio Frequency (RF) circuit 501, a memory 502 including one or more computer-readable storage media, an input unit 503, a display unit 504, a sensor 505, an audio circuit 506, a Wireless Fidelity (WiFi) module 507, a processor 508 including one or more processing cores, a power supply 509, and other components. Those skilled in the art will appreciate that the configuration of the electronic device 500 shown in FIG. 4 does not constitute a limitation of the electronic device 500; it may include more or fewer components than shown, combine some components, or arrange components differently. Wherein:
the RF circuit 501 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, it receives downlink information from a base station and passes it to the one or more processors 508 for processing, and it transmits uplink data to the base station. In general, the RF circuit 501 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 501 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 502 may be used to store software programs and modules; the processor 508 executes various functional applications and information processing by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or a target data playing function), and the like; the data storage area may store data created according to the use of the electronic device 500 (such as audio data or a phonebook), and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 508 and the input unit 503 with access to the memory 502.
The input unit 503 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, the input unit 503 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 508, and can receive and execute commands sent by the processor 508. In addition, the touch sensitive surface can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 503 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 504 may be used to display information input by or provided to the user, as well as the various graphical user interfaces of the electronic device 500, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 504 may include a display panel, which may optionally be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch-sensitive surface may overlay the display panel: when a touch operation is detected on or near the touch-sensitive surface, it is transmitted to the processor 508 to determine the type of touch event, and the processor 508 then provides a corresponding visual output on the display panel according to that type. Although in FIG. 4 the touch-sensitive surface and the display panel are shown as two separate components implementing input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement both.
The electronic device 500 may also include at least one sensor 505, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the electronic device 500 is moved to the ear. As one of the motion sensors, the gravity acceleration sensor may detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when the mobile phone is stationary, and may be used for applications of recognizing gestures of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and tapping), and other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor that may be configured to the electronic device 500, which are not described herein again.
The audio circuit 506, a speaker, and a microphone may provide an audio interface between a user and the electronic device 500. The audio circuit 506 may transmit an electrical signal, converted from received audio data, to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which the audio circuit 506 receives and converts into audio data. The audio data is then output to the processor 508 for processing and sent, for example, to another electronic device 500 via the RF circuit 501, or output to the memory 502 for further processing. The audio circuit 506 may also include an earbud jack to allow communication between peripheral headphones and the electronic device 500.
WiFi is a short-range wireless transmission technology. Through the WiFi module 507, the electronic device 500 can help the user send and receive e-mails, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although fig. 4 shows the WiFi module 507, it is understood that it is not an essential component of the electronic device 500 and may be omitted entirely as needed within a scope that does not change the essence of the invention.
The processor 508 is the control center of the electronic device 500. It connects the various parts of the entire device using various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or executing the software programs and/or modules stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the device as a whole. Optionally, the processor 508 may include one or more processing cores; preferably, the processor 508 may integrate an application processor, which primarily handles the operating system, user interfaces, and application programs, and a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 508.
The electronic device 500 further includes a power supply 509 (e.g., a battery) for powering the various components, which may be logically coupled to the processor 508 via a power management system to manage charging, discharging, and power consumption management functions via the power management system. The power supply 509 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power data indicators, and the like.
Although not shown, the electronic device 500 may further include a camera, a bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 508 in the electronic device 500 loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 508 runs the application programs stored in the memory 502, so as to implement various functions:
acquiring video data; detecting a subtitle file corresponding to the video data; when the subtitle file corresponding to the video data is not detected, detecting whether the video data has built-in subtitle information or not; when detecting that the video data has the built-in subtitle information, traversing all text boxes of the video data; and identifying subtitles of the video data based on the text box and a preset motion detection algorithm.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the subtitle recognition method, which is not described herein again.
As can be seen from the above, the electronic device 500 according to the embodiment of the present invention may detect whether video data has a corresponding subtitle file; when it does not, detect whether the video data has built-in subtitle information; and when it does, identify the subtitles of the video data according to the text boxes of the video data and a preset motion detection algorithm. Subtitle identification thus does not depend on a subtitle file accompanying the video data, avoiding the situation in which, absent such a file, a video's subtitles cannot be identified or are identified inaccurately.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application further provides a storage medium, on which a plurality of instructions are stored, where the instructions are suitable for being loaded by a processor to execute the steps in the subtitle recognition method.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), random Access Memory (RAM), magnetic or optical disk, and the like.
Since the instructions stored in the storage medium may execute the steps in any subtitle recognition method provided by the embodiment of the present invention, beneficial effects that can be achieved by any subtitle recognition method provided by the embodiment of the present invention may be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The foregoing describes in detail a subtitle recognition method, apparatus, system, and storage medium provided by the embodiments of the present invention. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A subtitle recognition method, comprising:
acquiring video data;
detecting a subtitle file corresponding to the video data;
when the subtitle file corresponding to the video data is not detected, detecting whether the video data has built-in subtitle information or not;
when detecting that the video data has the built-in subtitle information, traversing all text boxes of the video data;
and identifying the subtitles of the video data based on the text box and a preset motion detection algorithm.
2. The method of claim 1, wherein the identifying subtitles for the video data based on the text box and a predetermined motion detection algorithm comprises:
acquiring the resolution of the video data;
constructing an initial matrix based on the resolution;
and identifying the subtitles of the video data according to the text box, the initial matrix and a preset motion detection algorithm.
3. The method of claim 2, wherein the identifying subtitles for the video data according to the text box, the initial matrix and a preset motion detection algorithm comprises:
calculating a motion vector of the text box based on a preset motion detection algorithm;
determining the currently processed text box as a currently processed object;
calculating the intersection ratio between the current processing object and the text box;
and identifying subtitles of the video data based on the intersection ratio, the initial matrix and the motion vector.
4. The method of claim 3, wherein identifying subtitles of the video data based on the intersection ratio, the initial matrix, and the motion vector comprises:
updating the text box based on the intersection ratio;
filtering the updated text box according to the initial matrix and the motion vector;
and determining the subtitle of the processed text box as the subtitle of the video data.
5. The method of claim 2, wherein when the subtitle file corresponding to the video data is not detected, detecting whether the video data has built-in subtitle information comprises:
determining the frame number of characters contained in the video data;
detecting whether the frame number is greater than a preset value;
when the frame number is detected to be larger than a preset value, traversing a text box in the video data;
and detecting whether the video data has built-in subtitle information or not based on the text box.
6. The method according to any one of claims 1 to 5, wherein the acquiring video data comprises:
acquiring a video link;
and downloading the video data according to the video link, and converting the video data into video data in a preset format.
7. The method according to any one of claims 1 to 5, wherein after identifying the subtitles of the video data based on the text box and a preset motion detection algorithm, the method further comprises:
and carrying out voice alignment on the video data according to the built-in subtitle information.
8. A subtitle recognition apparatus, comprising:
the acquisition module is used for acquiring video data;
the first detection module is used for detecting a subtitle file corresponding to the video data;
the second detection module is used for detecting whether the video data has built-in subtitle information or not when the subtitle file corresponding to the video data is not detected;
the traversal module is used for traversing all text boxes of the video data when the video data is detected to have the built-in subtitle information;
and the identification module is used for identifying the subtitles of the video data based on the text box and a preset motion detection algorithm.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the subtitle recognition method according to any one of claims 1 to 7.
10. A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the subtitle recognition method according to any one of claims 1 to 7.
CN202310053894.XA 2023-02-03 2023-02-03 Subtitle identification method and device, electronic equipment and readable storage medium Active CN115797921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310053894.XA CN115797921B (en) 2023-02-03 2023-02-03 Subtitle identification method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310053894.XA CN115797921B (en) 2023-02-03 2023-02-03 Subtitle identification method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115797921A true CN115797921A (en) 2023-03-14
CN115797921B CN115797921B (en) 2023-05-09

Family

ID=85429623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310053894.XA Active CN115797921B (en) 2023-02-03 2023-02-03 Subtitle identification method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115797921B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105282475A (en) * 2014-06-27 2016-01-27 澜起科技(上海)有限公司 Mobile subtitle detection and compensation method and system
CN110598622A (en) * 2019-09-06 2019-12-20 广州华多网络科技有限公司 Video subtitle positioning method, electronic device, and computer storage medium
US20200412979A1 (en) * 2018-03-12 2020-12-31 Jvckenwood Corporation Subtitle generation apparatus, subtitle generation method, and non-transitory storage medium
CN114495128A (en) * 2022-04-06 2022-05-13 腾讯科技(深圳)有限公司 Subtitle information detection method, device, equipment and storage medium
CN114581900A (en) * 2022-03-09 2022-06-03 北京明略昭辉科技有限公司 Method and device for identifying video subtitles, electronic equipment and storage medium
CN115497082A (en) * 2022-08-31 2022-12-20 腾讯音乐娱乐科技(深圳)有限公司 Method, apparatus and storage medium for determining subtitles in video


Also Published As

Publication number Publication date
CN115797921B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN108334539B (en) Object recommendation method, mobile terminal and computer-readable storage medium
CN108572764B (en) Character input control method and device and computer readable storage medium
CN107846352B (en) Information display method and mobile terminal
CN110166828A (en) A kind of method for processing video frequency and device
CN107908765B (en) Game resource processing method, mobile terminal and server
CN108334196B (en) File processing method and mobile terminal
CN111177180A (en) Data query method and device and electronic equipment
WO2019076377A1 (en) Image viewing method and mobile terminal
CN111698550B (en) Information display method, device, electronic equipment and medium
CN110908751B (en) Information display and collection method and device, electronic equipment and medium
CN110471895B (en) Sharing method and terminal device
CN109918348B (en) Cleaning method, terminal and computer readable storage medium for application browsing record
CN109670105B (en) Searching method and mobile terminal
CN108804615B (en) Sharing method and server
CN110888572A (en) Message display method and terminal equipment
CN111050223A (en) Bullet screen information processing method and electronic equipment
CN107766544B (en) Information management method, terminal and computer readable storage medium
CN115797921B (en) Subtitle identification method and device, electronic equipment and readable storage medium
CN115379113A (en) Shooting processing method, device, equipment and storage medium
CN114416254A (en) Business card display method, intelligent terminal and storage medium
CN109656658B (en) Editing object processing method and device and computer readable storage medium
CN110764668B (en) Comment information acquisition method and electronic equipment
CN113901245A (en) Picture searching method, intelligent terminal and storage medium
CN109033297B (en) Image display method and mobile terminal
CN108897774B (en) Method, device and storage medium for acquiring news hotspots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Subtitle recognition methods, devices, electronic devices, and readable storage media

Granted publication date: 20230509

Pledgee: Jiang Wei

Pledgor: BEIJING INTENGINE TECHNOLOGY Co.,Ltd.

Registration number: Y2024980019734