CN109729420B - Picture processing method and device, mobile terminal and computer readable storage medium

Info

Publication number: CN109729420B
Application number: CN201711027700.XA
Authority: CN (China)
Legal status: Active (granted)
Other versions: CN109729420A
Other languages: Chinese (zh)
Inventor: 李大龙
Assignee (original and current): Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; published as CN109729420A; granted as CN109729420B.

Abstract

The invention discloses a picture processing method and apparatus, a mobile terminal, and a computer-readable storage medium. The method comprises the following steps: responding to a picture processing instruction, and acquiring multiple frames of video image frames and target subtitle information within a specified time range according to the picture processing instruction and the multimedia file of a target film; extracting at least one frame of target video image frame from the multiple frames of video image frames according to the target subtitle information and the video image frames within the specified time range; and generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame. Compared with the prior art, under a single picture processing instruction from the user, a picture containing subtitles is obtained based on the multiple frames of video image frames and the target subtitle information within the specified time range; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.

Description

Picture processing method and device, mobile terminal and computer readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a picture processing method and apparatus, a mobile terminal, and a computer-readable storage medium.
Background
At present, smartphones and the mobile internet have become mainstream, and people can make full use of fragmented time to enjoy video and audio entertainment anytime and anywhere. Given the product characteristics of mobile applications, almost all video applications on the market include friend-sharing services. A typical scenario is as follows: while watching a film, the user captures a highlight moment using the application's built-in screenshot function or the smartphone's screenshot function, saves it as a picture, and shares the screenshots with friends on a social network.
Character dialogue, voice-over narration, and similar elements are indispensable components of modern films: they often carry important plot information, and the emotions or moods they express can strongly resonate with viewers. In digital production, modern films typically present this information in text form, i.e., as subtitles, at the edge of the film image (most commonly below a rectangular frame). On the one hand, subtitles are unaffected by the film's sound effects or by ambient noise around the viewer; on the other hand, they address scenarios in which sound alone cannot convey the semantics, such as foreign-language films. Naturally, film screenshots containing subtitles are the most common type of shared screenshot, given their unmatched role as information carriers.
However, to share screenshots of the current movie, the user must capture them manually one by one; these tedious, repeated screenshot operations seriously degrade the user experience and the user's willingness to share.
Disclosure of Invention
The invention mainly aims to provide a picture processing method and apparatus, a mobile terminal, and a computer-readable storage medium, so as to solve the technical problem in the prior art that a user who wants to share pictures of a film must perform tedious manual screenshot operations multiple times, which seriously degrades the user experience and reduces the user's willingness to share.
To achieve the above object, a first aspect of the present invention provides a picture processing method, including:
responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
extracting at least one frame of target video image frame from the multi-frame video image frames according to the multi-frame video image frames and the target subtitle information within the specified time range;
and generating a picture containing the subtitles according to the target subtitle information and the at least one frame of target video image frame.
To achieve the above object, a second aspect of the present invention provides a picture processing apparatus comprising:
the response acquisition module is used for responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
the frame extraction module is used for extracting at least one frame of target video image frame from the multi-frame video image frames according to the multi-frame video image frames and the target subtitle information within the specified time range;
and the generating module is used for generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame.
To achieve the above object, a third aspect of the present invention provides a mobile terminal, comprising: a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the picture processing method according to the first aspect of the embodiments of the invention.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the steps of the picture processing method according to the first aspect of the present invention.
The invention provides a picture processing method comprising the following steps: responding to a picture processing instruction, and acquiring multiple frames of video image frames and target subtitle information within a specified time range according to the picture processing instruction and the multimedia file of a target film; extracting at least one frame of target video image frame from the multiple frames of video image frames according to the target subtitle information and the video image frames within the specified time range; and generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame. Compared with the prior art, under a picture processing instruction from the user, the target subtitle information and at least one frame of target video image frame are acquired and used to generate a picture containing subtitles. The picture is thus obtained from the multiple frames of video image frames and the target subtitle information within the specified time range without the user manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; those skilled in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a block diagram of a mobile terminal;
FIG. 2 is a flowchart illustrating a method for processing pictures according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating a picture processing method according to an embodiment of the present invention;
FIG. 4A is a flow chart illustrating a refinement step of step 302 in the embodiment of FIG. 3;
FIG. 4B is another flow chart illustrating the step of refining step 302 in the embodiment of FIG. 3;
FIG. 4C is a schematic diagram of the extraction in the hard subtitle mode based on OCR technology in the embodiment of FIG. 4B;
FIG. 4D is another flow chart illustrating the step of refining step 302 in the embodiment of FIG. 3;
FIG. 5 is a schematic flow chart illustrating a picture processing method according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart illustrating a picture processing method according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart illustrating a method for processing pictures according to an embodiment of the present invention;
FIG. 8 is a schematic flow chart illustrating a method for processing pictures according to an embodiment of the present invention;
FIG. 9a is a diagram illustrating a setup interface for specifying a time range in an embodiment of the invention;
FIG. 9b is a diagram illustrating a picture including subtitles generated in an embodiment of the present invention;
FIG. 9c is a schematic diagram of a picture containing subtitles generated in an embodiment of the present invention;
FIG. 10 is a diagram illustrating program modules of the image processing apparatus according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating program modules of the image processing apparatus according to an embodiment of the present invention;
FIG. 12A is a schematic diagram of a refinement function module of the data processing module in the embodiment shown in FIG. 11;
FIG. 12B is another diagram of a refinement function module of the data processing module in the embodiment shown in FIG. 11;
FIG. 12C is another diagram of a refinement function module of the data processing module in the embodiment shown in FIG. 11;
FIG. 13 is a diagram illustrating program modules of the image processing apparatus according to an embodiment of the present invention;
FIG. 14 is a diagram illustrating program modules of the image processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a mobile terminal 100. The picture processing method provided by the embodiment of the present invention may be applied to the mobile terminal 100 shown in fig. 1. The mobile terminal 100 may include, but is not limited to, smartphones, laptops, tablets, and other devices that rely on batteries to maintain normal operation and support networking and download functions.
As shown in fig. 1, the mobile terminal 100 includes a memory 102, a memory controller 104, one or more processors 106 (only one of which is shown), a peripheral interface 108, a radio unit 110, a key unit 112, an audio unit 114, and a display unit 116. These components communicate with each other via one or more communication buses/signal lines 122.
It is to be understood that the structure shown in fig. 1 is only an illustration and does not limit the structure of the mobile terminal 100. For example, mobile terminal 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The memory 102 may be used to store a computer program, such as the program instructions or modules corresponding to the picture processing method and apparatus in the embodiments of the present invention; when executing the computer program stored in the memory 102, the processor 106 implements each step of the picture processing method in any of the following embodiments of fig. 2 to 8.
The memory 102, a computer-readable storage medium, may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 102 may further include memory located remotely from the processor 106, which may be connected to the mobile terminal 100 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. Access to the memory 102 by the processor 106, and possibly other components, may be under the control of the memory controller 104.
The peripherals interface 108 couples various input/output devices to the processor 106 and to the memory 102. The processor 106 executes various software and instructions stored in the memory 102 to perform the various functions of the mobile terminal 100 and to process data.
In some examples, the peripheral interface 108, the processor 106, and the memory controller 104 may be implemented on a single chip. In other examples, they may each be implemented on separate chips.
The radio frequency unit 110 is used for receiving and transmitting electromagnetic waves and for converting between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices. The radio frequency unit 110 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so forth. The radio frequency unit 110 may communicate with various networks such as the internet or an intranet, or with other devices, through a wireless network of a preset type. The preset types of wireless networks may include cellular telephone networks, wireless local area networks, or metropolitan area networks.
The key unit 112 provides an interface for a user to input to the mobile terminal 100, and the user may press different keys to cause the mobile terminal 100 to perform different functions, or the key unit 112 may be a touch screen so that the user may control the mobile terminal to perform different functions through a touch operation.
Audio unit 114 provides an audio interface to a user that may include one or more microphones, one or more speakers, and audio circuitry. The audio circuitry receives audio data from the peripheral interface 108, converts the audio data to electrical information, and transmits the electrical information to the speaker. The speaker converts the electrical information into sound waves that the human ear can hear. The audio circuitry also receives electrical information from the microphone, converts the electrical information to voice data, and transmits the voice data to the peripheral interface 108 for further processing. The audio data may be retrieved from the memory 102 or through the radio unit 110. In addition, the audio data may also be stored in the memory 102 or transmitted through the radio frequency unit 110. In some examples, audio unit 114 may also include a headphone jack for providing an audio interface to headphones or other devices.
The display unit 116 provides an output interface between the mobile terminal 100 and the user. In particular, the display unit 116 displays video output to the user, the content of which may include text, graphics, video, and any combination thereof; some of this output corresponds to particular user interface objects. The display unit 116 also provides an input interface between the mobile terminal 100 and the user for receiving user input, such as clicks, swipes, and other gesture operations, so that the user interface objects respond to that input. The technique for detecting user input may be based on resistive, capacitive, or any other possible touch detection technology.
In the prior art, when a user wants to share pictures of a film, the user must perform tedious, repeated screenshot operations, which seriously degrades the user experience and reduces the user's willingness to share.
To solve the above problem, the present invention provides a picture processing method in which, under a picture processing instruction from the user, target subtitle information and at least one frame of target video image frame are acquired and used to generate a picture containing subtitles. A picture containing subtitles is thus obtained from the multiple frames of video image frames and the target subtitle information within the specified time range without the user manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Please refer to fig. 2, which is a flowchart illustrating a method for processing pictures according to an embodiment of the present invention, the method including:
step 201, responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
in an embodiment of the present invention, the image processing method is implemented by an image processing apparatus, and the image processing apparatus is composed of program modules and can be stored in a computer-readable storage medium of the mobile terminal.
If the user needs to generate a picture while watching a target film, the user can perform a preset operation to trigger generation of a picture processing instruction; the mobile terminal responds to the picture processing instruction and acquires multiple frames of video image frames and target subtitle information within a specified time range according to the instruction and the multimedia file of the target film. The preset operation may be clicking an icon of the picture processing function displayed on the display interface. It can be understood that the user may also trigger the picture processing method of the embodiment of the present invention by importing the target movie into an application program that has the picture processing function.
The picture processing instruction may include a specified time range, so that a picture containing subtitles can be generated using the target subtitle information and the multiple frames of video image frames within that range. For example, the specified time range may run from 5 minutes to 5 minutes 30 seconds of the target film. After the user performs the preset operation, the mobile terminal displays a time selection window in which the user may input the specified time period. Alternatively, the mobile terminal may pause playback of the target film; the user places a cursor on the film's timeline, the video image frame corresponding to the cursor's time point is displayed, and the user selects a starting video image frame and an ending video image frame by clicking. The time points corresponding to the starting and ending video image frames constitute the specified time range, and after the user confirms, a picture processing instruction containing the specified time range is generated.
Alternatively, the picture processing instruction may include a subtitle selection flag, indicating that the user may select subtitles for generating a picture; a specified time range is then determined from the subtitle timestamps of the selected subtitles, and the video image frames within that range are obtained to generate the picture containing the subtitles. It can be understood that when the picture processing instruction carries different content, the manner of generating the picture differs accordingly; this is described in detail in the following embodiments and not repeated here.
The target subtitle information is the subtitle information used for generating the picture. It includes text strings and the subtitle timestamp of each text string; a subtitle timestamp indicates when the corresponding text string appears in the movie, usually as a start time point and an end time point.
The multiple frames of video image frames within the specified time range are used for extracting the target video image frames from which the picture is generated. Each video image frame within the specified time range carries a corresponding video timestamp, which represents the point in time at which that frame is displayed in the movie.
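For illustration only (the patent does not prescribe any particular data layout), the two structures described above can be sketched as follows; all type and field names here are hypothetical, and later sketches in this document reuse them:

```python
from dataclasses import dataclass

@dataclass
class SubtitleLine:
    text: str      # one line of the text string
    start_ms: int  # subtitle timestamp: when the line appears in the film
    end_ms: int    # subtitle timestamp: when the line disappears

@dataclass
class VideoFrame:
    timestamp_ms: int  # video timestamp: when this frame is displayed in the film
    image: bytes       # decoded image data of the frame
```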
Step 202, extracting at least one frame of target video image frame from the multi-frame video image frames according to the multi-frame video image frames and the target subtitle information within the specified time range;
step 203, generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame.
In the embodiment of the invention, the picture processing apparatus extracts at least one frame of target video image frame from the multiple frames of video image frames according to the acquired target subtitle information and video image frames within the specified time range, and generates a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame. A single video image frame can yield one picture, and multiple video image frames can also be combined into one picture; thus, if one frame of target video image frame is extracted, one picture is generated, and if multiple frames of target video image frames are extracted, one picture can be generated from them.
In the embodiment of the invention, under a picture processing instruction from the user, the target subtitle information and at least one frame of target video image frame are acquired and used to generate a picture containing subtitles. The picture is thus obtained from the multiple frames of video image frames and the target subtitle information within the specified time range without the user manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Referring to fig. 3, another flow chart of the method for processing a picture according to the embodiment of the present invention is shown, and with respect to the embodiment shown in fig. 2, this embodiment will focus on a manner of acquiring original video image data and original subtitle information in different subtitle modes, and specifically, the method includes:
step 301, responding to a picture processing instruction, and determining a subtitle mode of the target film;
the target movie may refer to a movie currently being watched by the user, or may be a movie that has been imported into an application having a picture processing function.
The subtitle mode of the target film falls into one of three categories: the soft subtitle mode, the hard subtitle mode, and the subtitle-free mode, described as follows:
(1) Soft subtitle mode
The soft subtitle mode means that the subtitle information of a film exists independently. In this mode, the common carriers of the subtitle information are external subtitle files and subtitle streams within video files.
An external subtitle file is a digital text file that stores the subtitle information independently of the video file; for example, SRT, ASS, and SUB are text subtitle formats. During playback, both the multimedia file of the movie and the external subtitle file need to be acquired.
A subtitle stream in a video file is organized inside the multimedia file as a data track alongside the audio track and the video track, separate from the video stream and the audio stream within the container format; a typical organization of subtitle streams is defined by the MKV file standard.
It is understood that, in both forms, the subtitle information includes the text strings and the subtitle timestamp of each text string.
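To make the soft subtitle carrier concrete, the following is a minimal sketch of parsing an external SRT file into text strings and subtitle timestamps, reusing the hypothetical SubtitleLine type from the earlier sketch and assuming well-formed input:

```python
import re
from pathlib import Path

TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+)")

def to_ms(stamp: str) -> int:
    # "00:05:00,000" -> 300000
    h, m, s, ms = map(int, TIME.match(stamp).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def parse_srt(path: str):
    # Each SRT block: an index line, a "start --> end" line, then text lines.
    for block in Path(path).read_text(encoding="utf-8").strip().split("\n\n"):
        lines = block.splitlines()
        start, end = lines[1].split(" --> ")
        yield SubtitleLine("\n".join(lines[2:]), to_ms(start), to_ms(end))
```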
(2) Hard subtitle mode
In the hard subtitle mode, the subtitle information is superimposed onto the corresponding video image frames of the video file when the film is produced; as part of each video image frame, it is compressed and muxed with the audio into the multimedia file.
(3) Subtitle-free mode
If the displayed image carries no subtitles when the film is played, the film is in the subtitle-free mode. The subtitle-free mode, the soft subtitle mode, and the hard subtitle mode are three mutually independent subtitle modes.
It is understood that, in the embodiment of the present invention, the multimedia file of the target movie may include a mode flag that identifies the specific subtitle mode of the target movie, so that when the target movie is played, the mode flag reveals whether the subtitle mode is the soft subtitle mode, the hard subtitle mode, or the subtitle-free mode. Alternatively, the subtitle mode of the target movie may be determined as follows:
Decapsulate the multimedia file of the target film from its container format and check whether a subtitle stream can be obtained. If a subtitle stream is obtained after decapsulation, the subtitle mode of the target film is determined to be the soft subtitle mode. If no subtitle stream is obtained, check whether an external subtitle file exists for the multimedia file; if it exists, the subtitle mode of the target film is likewise determined to be the soft subtitle mode.
If no external subtitle file exists for the multimedia file, decode the video compressed code stream obtained by decapsulating the multimedia file to obtain the original video image data, and attempt to recognize subtitles in the original video image data using a subtitle recognition technique. If subtitles are recognized, the subtitle mode of the target film is determined to be the hard subtitle mode; if not, it is determined to be the subtitle-free mode. The subtitle recognition technique may be Optical Character Recognition (OCR) technology, which is described later and not repeated here.
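The decision logic above can be sketched as follows. This is one possible realization rather than the patented method itself: it assumes the ffprobe tool is available for container inspection, and the OCR-based hard subtitle check is left as a stub:

```python
import subprocess
from pathlib import Path

def has_subtitle_stream(media: str) -> bool:
    # List the subtitle streams inside the container via ffprobe.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "s",
         "-show_entries", "stream=index", "-of", "csv=p=0", media],
        capture_output=True, text=True, check=True).stdout
    return bool(out.strip())

def has_external_subtitle(media: str) -> bool:
    # Look for a sidecar subtitle file sharing the video's base name.
    base = Path(media).with_suffix("")
    return any(base.with_suffix(ext).exists() for ext in (".srt", ".ass", ".sub"))

def detect_hard_subtitles(media: str) -> bool:
    # Stub: decode sample frames and run OCR on the subtitle band
    # (see the OCR sketch in the hard subtitle embodiment below).
    raise NotImplementedError

def subtitle_mode(media: str) -> str:
    if has_subtitle_stream(media) or has_external_subtitle(media):
        return "soft"
    return "hard" if detect_hard_subtitles(media) else "none"
```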
Step 302, according to the subtitle mode of the target film, performing data processing on the multimedia file of the target film to obtain original video image data and original subtitle information of the target film;
In the embodiment of the present invention, the original video image data and original subtitle information of the target film are acquired in different ways depending on the subtitle mode of the target film.
In an embodiment, if the subtitle mode is the soft subtitle mode, please refer to fig. 4A, which is a flowchart illustrating the refinement of step 302 in the embodiment shown in fig. 3, including:
step 4011, if the subtitle mode of the target film is the soft subtitle mode, decapsulating the multimedia file of the target film in a container format to obtain a video compressed code stream and a subtitle stream, or decapsulating the multimedia file of the target film in the container format to obtain a video compressed code stream, and decapsulating the multimedia file of the target film in the container format based on the obtained external subtitle file of the target film to obtain the subtitle stream;
step 4012, decoding the video compression code stream to obtain the original video image data, and decoding the subtitle stream to obtain the original subtitle information.
When the subtitle mode of the target film is the soft subtitle mode, if the subtitles are specifically carried as a subtitle stream within the video file, the multimedia file of the target film is decapsulated from its container format to obtain the video compressed code stream and the subtitle stream. If the subtitles are carried in an external subtitle file, the multimedia file of the target film is decapsulated from its container format to obtain the video compressed code stream, and the external subtitle file is parsed to obtain the subtitle stream. It is understood that the external subtitle file has a mapping relationship with the multimedia file; it may be manually imported into the mobile terminal by the user or retrieved from the network by the mobile terminal.
The multimedia file may be in a package format such as MP4, MKV, AVI, etc., and an audio compression code stream may be obtained during decapsulation. It can be understood that how to decapsulate the multimedia file belongs to the prior art, and details are not described here.
In the embodiment of the invention, after the decapsulation is completed, the video compressed code stream obtained by the decapsulation is decoded to obtain original video image data, and the subtitle stream is decoded to obtain original subtitle information.
The original video image data comprises the video image frames and the video timestamp of each video image frame; the original subtitle information comprises the text strings and the subtitle timestamp of each line of text.
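As a concrete illustration of the soft subtitle path, the sketch below drives the ffmpeg command line to demux and decode a multimedia file: the first call extracts the embedded subtitle stream as SRT text, and the second decodes the video stream into image frames within a time window. ffmpeg is an assumption of this sketch; the patent does not prescribe any particular demuxer or decoder:

```python
import subprocess

def extract_soft_subtitles(media: str, srt_out: str) -> None:
    # Demux the first subtitle stream and decode it into an SRT text file.
    subprocess.run(["ffmpeg", "-y", "-i", media, "-map", "0:s:0", srt_out],
                   check=True)

def decode_frames(media: str, start: str, end: str, pattern: str) -> None:
    # Decode the video compressed code stream into image frames within
    # [start, end], e.g.:
    #   decode_frames("film.mkv", "00:05:00", "00:05:30", "frame_%04d.png")
    subprocess.run(["ffmpeg", "-y", "-i", media, "-ss", start, "-to", end,
                    pattern], check=True)
```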
In an embodiment, if the subtitle mode of the target movie is the hard subtitle mode, please refer to fig. 4B, which is another flowchart illustrating the refinement of step 302 in the embodiment shown in fig. 3, including:
step 4021, if the subtitle mode of the target film is the hard subtitle mode, decapsulating the multimedia file of the target film in a container format to obtain a video compression code stream;
step 4022, decoding the video compression code stream to obtain the original video image data;
step 4023, extracting subtitles from the original video image data by using OCR technology to obtain the original subtitle information.
When the subtitle mode is the hard subtitle mode, decapsulating the multimedia file of the target film from its container format yields a video compressed code stream (and also an audio compressed code stream); it can be understood that decapsulation in the hard subtitle mode will not yield a subtitle stream.
The video compressed code stream obtained by decapsulation can then be decoded to obtain the original video image data, and subtitles are extracted from the original video image data using OCR technology to obtain the original subtitle information.
It is understood that, in the embodiment of the present invention, OCR technology is mainly used to extract the text strings from each video image frame in the hard subtitle mode. Two points deserve attention: on the one hand, OCR extracts text strings and does not involve semantic understanding; on the other hand, a film, as a photographic record of real life, inevitably contains a large amount of incidental text, for example store signs or brand lettering on clothing. Therefore, compared with OCR applications in other scenarios, extracting text in the hard subtitle mode requires specifically distinguishing the region and position of the text strings that belong to subtitles within a video image frame; that is, the subtitle text region must be located first. The localization can exploit the characteristics of film subtitles: the color and font of subtitles are regular and contrast clearly with the background; the strokes in the subtitle region are rich, with obvious corner and edge features; the spacing between subtitle characters is fixed, with the layout running horizontally or vertically; and the position of subtitles within the same video is fixed, with a given line of text typically staying on screen for several seconds. Based on these features, OCR technology can be used to extract the text strings. In practical applications there are many ways to do so; one of them is as follows: segment single-character regions within the line region according to the gray-level histogram projection, then apply gray-scale image normalization, gradient feature extraction, multi-template matching, and minimum-classification-error classification to the single-character regions to obtain the text string of one video image frame. Please refer to fig. 4C, which is a schematic diagram of extraction in the hard subtitle mode based on OCR technology in the embodiment shown in fig. 4B. It can be understood that a more detailed description of how to extract original subtitle information from original video image data using OCR technology belongs to the prior art and is not repeated here.
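The sketch below is a simple stand-in for the pipeline above, assuming OpenCV and the Tesseract engine (via pytesseract) are available: a fixed bottom-band crop replaces the feature-based subtitle localization, and Tesseract replaces the template-matching recognizer. It illustrates the hard subtitle extraction step, not the specific classifier described in the patent:

```python
import cv2
import pytesseract

def extract_subtitle_text(frame_path: str) -> str:
    # Crop the band where film subtitles usually sit, binarize it so the
    # high-contrast subtitle strokes stand out, then OCR the band.
    img = cv2.imread(frame_path)
    h, w = img.shape[:2]
    band = img[int(h * 0.85):, :]
    gray = cv2.cvtColor(band, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang="chi_sim+eng").strip()
```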
In an embodiment, if the subtitle mode of the target movie is the subtitle-free mode, please refer to fig. 4D, which is another flowchart illustrating the refinement of step 302 in the embodiment shown in fig. 3, including:
step 4031, if the subtitle mode of the target film is a subtitle-free mode, decapsulating a multimedia file of the target film in a container format to obtain a video compression code stream and an audio compression code stream;
step 4032, decoding the video compressed code stream to obtain the original video image data, and decoding the audio compressed code stream to obtain audio data;
step 4033, extracting subtitles from the audio data by using ASR technology to obtain the original subtitle information.
In the embodiment of the present invention, when the subtitle mode of the target movie is the subtitle-free mode, the multimedia file of the target movie is decapsulated from its container format to obtain the video compressed code stream and the audio compressed code stream; it can be understood that no subtitle stream can be obtained in the subtitle-free mode.
Further, decoding the video compression code stream to obtain original video image data, decoding the audio compression code stream to obtain audio data, and performing subtitle extraction on the audio data by using an Automatic Speech Recognition (ASR) technology to obtain original subtitle information.
ASR technology is mainly used to convert speech data into text strings carrying understandable semantics. Applying it to picture processing for films in the subtitle-free mode means that a picture containing subtitles can be generated even if the film carries no subtitle information at all, providing a better experience for users.
It should be noted that how to extract a text string from speech data by using an ASR technique belongs to the prior art, and details are not described here.
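As an illustration of this path, the sketch below decodes the audio with ffmpeg and feeds it to an off-the-shelf recognizer from the speech_recognition package; both tools are assumptions of this sketch. A real implementation would also need per-segment timestamps to build the subtitle timestamps, which this minimal version omits:

```python
import subprocess
import speech_recognition as sr

def extract_subtitles_via_asr(media: str) -> str:
    # Decode the audio compressed code stream to a mono 16 kHz WAV file.
    subprocess.run(["ffmpeg", "-y", "-i", media, "-vn", "-ac", "1",
                    "-ar", "16000", "audio.wav"], check=True)
    recognizer = sr.Recognizer()
    with sr.AudioFile("audio.wav") as source:
        audio = recognizer.record(source)
    # Any ASR backend could be substituted here.
    return recognizer.recognize_google(audio, language="zh-CN")
```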
Step 303, obtaining a plurality of frames of video image frames and target caption information within a specified time range according to the picture processing instruction, the original video image data and the original caption information;
step 304, extracting at least one frame of target video image frame from the multi-frame video image frames according to the multi-frame video image frames and the target subtitle information within the specified time range;
step 305, generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame.
In the embodiment of the invention, after the original video image data and the original subtitle information are obtained, the target subtitle information and the multiple frames of video image frames within the specified time range are obtained according to the picture processing instruction, the original video image data, and the original subtitle information.
Step 304 and step 305 are similar to the content described in step 202 and step 203 in the embodiment shown in fig. 2, and are not repeated here.
In the embodiment of the present invention, when responding to a picture processing instruction, the subtitle mode of the target film is determined, and the multimedia file of the target film is processed according to that subtitle mode to obtain the original video image data and original subtitle information of the target film. For example, in the soft subtitle mode, the multimedia file can be directly decapsulated and decoded to obtain the original video image data and original subtitle information; in the hard subtitle mode, OCR technology can be used to extract subtitles from the original video image data; and in the subtitle-free mode, ASR technology can be used to extract subtitles from the audio data. In this way, the original video image data and original subtitle information can be obtained effectively, so that the target video image frames and target subtitle information can be further obtained and a picture containing subtitles generated. The user does not need to capture screenshots manually one by one; instead, the picture containing subtitles is obtained from the multiple frames of video image frames and the target subtitle information within the specified time range. The operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Referring to fig. 5, another flow chart of the method for processing a picture according to the embodiment of the present invention is shown, and with respect to the embodiment shown in fig. 3, the present embodiment will focus on a method for processing a picture under the condition that a picture processing instruction includes a specified time range, and specifically, the method includes:
step 501, responding to a picture processing instruction, and determining a subtitle mode of the target film;
step 502, according to the subtitle mode of the target film, performing data processing on a multimedia file of the target film to obtain original video image data and original subtitle information of the target film;
it is understood that step 501 and step 502 are similar to the content described in step 301 and step 302 in the embodiment shown in fig. 3, and please refer to the embodiment shown in fig. 3 specifically, which is not repeated herein.
Step 503, if the picture processing instruction includes a specified time range, extracting a plurality of frames of video image frames within the specified time range from the original video image data, and extracting subtitle information within the specified time range from the original subtitle information as the target subtitle information;
In the embodiment of the present invention, the picture processing instruction may include a specified time range, which may be set by the user. Specifically, while watching the target movie, if the user wants to generate a picture of the movie containing subtitles, the user may click a picture processing function button on the display interface; the mobile terminal responds to the click operation and displays a setting interface. The setting interface may be a time input interface, in which the user inputs a start time point and an end time point and clicks to confirm; a picture processing instruction is then generated containing the specified time range formed by the start and end time points. For example, if the input start time point is 50 minutes 0 seconds and the input end time point is 50 minutes 30 seconds, the specified time range is 50 minutes 0 seconds to 50 minutes 30 seconds. Alternatively, the user may input only the start time point or only the end time point: if only the start time point is input, the time point obtained by adding a preset duration to it is taken as the end time point, and if only the end time point is input, the time point obtained by subtracting the preset duration from it is taken as the start time point, thereby obtaining the specified time range. The specified time range may also be composed of several smaller time ranges. Alternatively, after entering the setting interface, a time selection interface may be displayed, which is similar to a movie playing interface and includes a progress bar. The user may select a start time point and an end time point on the progress bar: whenever the user moves a location identifier (such as the mouse arrow on the display interface) to a time point on the progress bar, the video image frame corresponding to that time point is displayed for reference, and the user can confirm a time start point and a time end point by clicking, or cancel a selected point by clicking again, so as to obtain a specified time range that meets the user's needs. Please refer to fig. 9a, which is a schematic diagram of a setting interface for specifying a time range according to an embodiment of the present invention.
In the embodiment of the present invention, if the picture processing instruction includes a specified time range, the video image frames within that range are extracted from the original video image data, and the subtitle information within that range is extracted from the original subtitle information as the target subtitle information. The original video image data comprises video image frames, each carrying its video timestamp, so the start time and end time of the specified time range determine which frames to extract. That is, in the original video image data, the video image frame whose video timestamp equals the start time, or differs from it by less than a preset interval (such as 40 ms), is found and taken as the starting video image frame; the video image frame whose video timestamp equals the end time, or differs from it by less than the preset interval, is found and taken as the ending video image frame; and all video image frames between the starting and ending video image frames are extracted from the original video image data to obtain the multiple frames of video image frames within the specified time range. For example, if the specified time range is 1 minute 0 seconds to 1 minute 30 seconds, all video image frames whose video timestamps lie between 1 minute 0 seconds and 1 minute 30 seconds are extracted from the original video image data.
Since the original subtitle information includes the text strings and the subtitle timestamp of each line of text, the target subtitle information can be extracted from the original subtitle information in a manner similar to extracting the target video image frames from the original video image data; the target subtitle information likewise includes text strings and the subtitle timestamp of each line of text.
It is understood that the specified time range may be one time range or a plurality of time ranges.
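A minimal sketch of this range extraction, reusing the hypothetical SubtitleLine and VideoFrame types from the earlier sketch and the 40 ms tolerance mentioned above:

```python
TOLERANCE_MS = 40  # frames within 40 ms of a boundary count as matches

def frames_in_range(frames, start_ms, end_ms):
    # frames: list of VideoFrame from the original video image data.
    return [f for f in frames
            if start_ms - TOLERANCE_MS <= f.timestamp_ms <= end_ms + TOLERANCE_MS]

def subtitles_in_range(subs, start_ms, end_ms):
    # subs: list of SubtitleLine; keep lines overlapping the specified range.
    return [s for s in subs if s.start_ms <= end_ms and s.end_ms >= start_ms]
```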
Step 504, extracting at least one frame of target video image frame from the multi-frame video image frames according to the multi-frame video image frames and the target subtitle information within the specified time range;
and step 505, generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame.
It is understood that step 504 and step 505 are similar to those described in step 202 and step 203, respectively, in the embodiment shown in fig. 2, and are not repeated herein.
In the embodiment of the invention, since the specified time range is included in the picture processing instruction, the target video image frames and the target subtitle information can be acquired using the specified time range, and a picture containing subtitles generated from them. Based on this embodiment, the user can obtain a picture containing subtitles simply by setting the specified time range, without manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Referring to fig. 6, which is a flowchart illustrating a picture processing method according to an embodiment of the present invention, in contrast to the embodiment shown in fig. 3, this embodiment focuses on a picture processing method under the condition that a caption selection flag is included in a picture processing instruction, and specifically, the method includes:
step 601, responding to a picture processing instruction, and determining a subtitle mode of the target film;
step 602, according to the subtitle mode of the target film, performing data processing on the multimedia file of the target film to obtain original video image data and original subtitle information of the target film;
it is understood that step 601 and step 602 are similar to the content described in step 301 and step 302 in the embodiment shown in fig. 3, and please refer to the embodiment shown in fig. 3 specifically, which is not repeated herein.
Step 603, responding to a picture processing instruction, and if the picture processing instruction contains a subtitle selection mark, displaying a text character string contained in the original subtitle information;
step 604, if a text selection operation is detected, determining a selected text string, and acquiring a plurality of frames of video image frames within a specified time range from the original video image data based on the specified time range formed by the caption time stamps of the selected text string;
in the embodiment of the present invention, the picture processing instruction may include a subtitle selection flag, where the subtitle selection flag is used to identify that the user may generate a picture by selecting a subtitle.
If the caption selection mark is detected to be included in the picture processing instruction, all text character strings included in the original caption information are displayed, and when the text character strings are displayed, one line of text character strings is displayed by taking the caption time stamp as a unit.
After displaying the text strings, the user may perform operations on the display interface to determine which text strings to select, for example, the user may distinguish the selected text strings from the unselected text strings in color by performing a click operation on the text strings to be selected, it being understood that the user may select a plurality of text strings in succession, or may select text strings in discontinuity, for example, the user may select the text strings in lines 5 to 10, and the text strings in lines 20 to 25. Wherein the user may perform a confirmation operation after completing the selection of the text string, for example, clicking a confirmation button on the display interface. It is to be understood that, in the embodiment of the present invention, a text string having the same subtitle timestamp is generally regarded as a line of text strings.
After clicking the confirmation button, the user indicates that the selection of the text character string is completed, the selected text character string is used as target subtitle information, and further based on a specified time range formed by subtitle time stamps of the selected text character string, a plurality of frames of video image frames in the specified time range are obtained from original video image data. For example, after determining that the text strings of the 5 th to 10 th lines and the text strings of the 20 th to 25 th lines have been selected, it is determined that the start time of the caption time stamp of the text string of the 5 th line is 1 minute 0 second, the end time of the caption time stamp of the text string of the 10 th line is 1 minute 8 seconds, the start time of the caption time stamp of the text string of the 20 th line is 1 minute 30 seconds, and the end time of the caption time stamp of the text string of the 25 th line is 1 minute 40 seconds, it is possible to determine that the time range formed by the caption time stamps is: 1 minute 0 second to 1 minute 8 second, and 1 minute 30 second to 1 minute 40 second. And acquiring video image frames with video time stamps within 1 min 0 sec to 1 min 8 sec and acquiring video image frames with video time stamps within 1 min 30 sec to 1 min 40 sec from the original video image data so as to obtain a plurality of frames of video image frames within a specified time range.
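The mapping from selected text strings to time ranges can be sketched as follows, again reusing the hypothetical SubtitleLine type; the one-second threshold for merging neighbouring lines into one range is an assumption of this sketch:

```python
def ranges_from_selection(selected_lines):
    # selected_lines: SubtitleLine objects sorted by start_ms.
    # Neighbouring or overlapping lines are merged into one (start, end) range.
    ranges = []
    for line in selected_lines:
        if ranges and line.start_ms <= ranges[-1][1] + 1000:
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], line.end_ms))
        else:
            ranges.append((line.start_ms, line.end_ms))
    return ranges
```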
step 605, extracting at least one frame of target video image frame from the multi-frame video image frames according to the multi-frame video image frames and the target subtitle information within the specified time range;
step 606, generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame.
It is understood that steps 605 and 606 are similar to those described in step 202 and step 203, respectively, in the embodiment shown in fig. 2, and are not repeated herein.
In the embodiment of the invention, by including a subtitle selection flag in the picture processing instruction, the user can select text strings; multiple frames of video image frames within the specified time range are obtained based on the text strings selected by the user, at least one frame of target video image frame is extracted from them, and a picture containing subtitles is generated based on the at least one frame of target video image frame and the selected text strings. Based on this embodiment, the user can obtain a picture containing subtitles simply by selecting text strings, without manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Referring to fig. 7, another flowchart of the picture processing method according to the embodiment of the present invention is shown. Relative to the embodiment shown in fig. 2, this embodiment focuses on the picture processing method in the case where a single frame of target video image frame is extracted, and the method includes:
step 701, responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
it is understood that step 701 is similar to that described in step 201 in the embodiment shown in fig. 2, and step 301 to step 303 in the embodiment shown in fig. 3 are schematic flow charts of the step 701, so that step 201, and step 301 to step 303 may be referred to specifically, and are not repeated herein.
Step 702, comparing the video time stamp of the multi-frame video image frame with the subtitle time stamp of the target subtitle information, and extracting a frame of target video image frame from the multi-frame video image frame;
step 703, embedding the text character string contained in the target subtitle information into the target video image frame, and generating a picture containing subtitles.
In the embodiment of the invention, the target subtitle information comprises the text strings and the subtitle timestamp of each line of text, and each of the multiple frames of video image frames within the specified time range carries its video timestamp; therefore, the video timestamps can be compared against the subtitle timestamps to extract one frame of target video image frame from the multiple frames of video image frames within the specified time range. Specifically, a frame of target video image frame that meets a preset first extraction rule may be extracted.
When the subtitle mode of the target film is the soft subtitle mode or the subtitle-free mode, the preset first extraction rule may be to extract any one video image frame, the first video image frame, or an intermediate video image frame. When the subtitle mode of the target film is the hard subtitle mode, the preset first extraction rule may be to extract any one video image frame that does not contain subtitles. In practical applications, the preset first extraction rule may be set according to specific needs and is not limited here.
When only one frame of target video image frame is extracted, the text string contained in the target subtitle information is embedded into that frame so that the target subtitle information is displayed over the image, and a single picture containing the subtitles is generated.
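A minimal sketch of the embedding step using the Pillow library; the font file, its size, and the outline styling are assumptions of this sketch rather than requirements of the patent:

```python
from PIL import Image, ImageDraw, ImageFont

def embed_subtitle(frame_path: str, text: str, out_path: str) -> None:
    # Draw one subtitle line, centered near the bottom edge of the frame.
    img = Image.open(frame_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", size=img.height // 20)
    x = int((img.width - draw.textlength(text, font=font)) / 2)
    y = int(img.height * 0.9)
    # White text with a thin black outline stays readable on any background.
    draw.text((x, y), text, font=font, fill="white",
              stroke_width=2, stroke_fill="black")
    img.save(out_path)
```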
In the embodiment of the invention, one frame of target video image frame is extracted from the multi-frame video image frames within the specified time range, and the target subtitle information is embedded into the target video image frame to generate the picture containing the subtitle, which can effectively reduce the size of the picture and reduce the traffic and time consumed when the picture is shared and uploaded. In this way, the picture containing the subtitle is obtained without the user manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Referring to fig. 8, a flowchart of a picture processing method according to an embodiment of the present invention is shown, and with respect to the embodiment shown in fig. 2, the present embodiment focuses on a picture processing method under a condition of extracting multiple frames of video image frames, where the method includes:
step 801, responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
it is understood that step 801 is similar to step 201 in the embodiment shown in fig. 2, and steps 301 to 303 in the embodiment shown in fig. 3 are a refined flow of step 801; reference may be made to step 201 and steps 301 to 303 for details, which are not repeated herein.
Step 802, comparing the caption time stamp of each line of text character string in the target caption information with the video time stamp of the multi-frame video image frame in sequence, and determining a set of video image frames corresponding to each line of text character string;
step 803, sequentially extracting one frame of target video image frame corresponding to each line of text character string from the set of video image frames corresponding to each line of text character string to obtain a plurality of frames of target video image frames, and executing step 804, or executing step 805, or executing step 806;
step 804, if the subtitle mode of the target film is a hard subtitle mode, splicing the multiple frames of target video image frames according to the sequence of video timestamps to generate the picture containing the subtitles; or intercepting the subtitle areas of the other target video image frames except the first frame of target video image frame, and splicing the first frame of target video image frame and the subtitle areas of the other target video image frames according to the sequence of the video timestamps to generate the picture containing the subtitles;
step 805, if the subtitle mode of the target film is a soft subtitle mode or a subtitle-free mode, embedding each line of text character string in the target subtitle information into a corresponding target video image frame, and splicing the multiple frames of target video image frames according to the sequence of video timestamps to generate a picture containing subtitles;
step 806, if the subtitle mode of the target film is the soft subtitle mode or the subtitle-free mode, embedding each line of text character strings in the target subtitle information into the corresponding target video image frame, intercepting the subtitle area where the subtitles are embedded in other target video image frames except the first frame of target video image frame, and splicing the first frame of target video image frame in which the subtitles are embedded and the subtitle areas of the other target video image frames according to the sequence of the video timestamps to generate the picture containing the subtitles.
In the embodiment of the present invention, after the target subtitle information and the multi-frame video image frames within the specified time range are obtained, if multiple frames of target video image frames are required, the subtitle timestamp of each line of text character string in the target subtitle information may be compared with the video timestamps of the multi-frame video image frames to determine the set of video image frames corresponding to each line of text character string. For example, if the subtitle timestamp of the first line of text character string is 1 minute 0 seconds to 1 minute 3 seconds, and the video image frames whose video timestamps fall within 1 minute 0 seconds to 1 minute 3 seconds are video image frames A, B, C and D, then the set of video image frames corresponding to the first line of text character string includes video image frames A to D. In this way, the set of video image frames corresponding to each line of text character string in the target subtitle information can be obtained.
One frame of target video image frame corresponding to each line of text character string is then extracted in sequence from the set of video image frames corresponding to that line. Specifically, the extraction may be based on a preset second extraction rule, which may be to extract the first frame, an intermediate frame or the last frame of the set, and the like. Taking extraction of the first frame as an example, if the set corresponding to the first line of text character string includes video image frames A to D, video image frame A is used as the one frame of target video image frame corresponding to the first line of text character string. It can be understood that this way of extracting target video image frames prevents different video image frames from corresponding to the same text character string, and avoids the situation where a text character string appears repeatedly in the generated picture.
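A minimal sketch of steps 802 and 803, assuming subtitle lines are (start, end, text) triples and frames are (timestamp, image) pairs sorted by time (these data shapes are assumptions of the sketch, not the actual implementation):

    def frames_for_lines(subtitle_lines, frames):
        """Group frames by subtitle line via timestamp overlap, then take one per line."""
        result = []
        for start, end, text in subtitle_lines:
            # The set of video image frames whose timestamp falls in this line's span.
            candidates = [(ts, img) for ts, img in frames if start <= ts <= end]
            if candidates:
                ts, img = candidates[0]          # extract the first frame of the set
                result.append((text, ts, img))   # one target frame per text line
        return result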
If the subtitle mode of the target film is the hard subtitle mode, the selected target video image frames already contain subtitles. In this case, the multiple frames of target video image frames are spliced according to the sequence of the video timestamps to generate a picture containing the subtitles; or the subtitle areas of the other target video image frames except the first frame of target video image frame are first intercepted, so that the subtitles and subtitle backgrounds of the other target video image frames are obtained, and the first frame of target video image frame and the subtitle areas of the other target video image frames are spliced according to the sequence of the video timestamps to generate the picture containing the subtitles. That is, the target video image frames are either spliced frame by frame in the vertical direction, or only the first frame of target video image frame is kept complete while, for all the other target video image frames, only the subtitle areas carrying the text character strings are spliced for display. It will be appreciated that, in general, the subtitle region is located at the lower side of the video image frame.
If the subtitle mode of the target film is a soft subtitle mode or a subtitle-free mode, the obtained target video image frames do not contain subtitles; each line of text character string can then be embedded into its corresponding target video image frame, and the multiple frames of target video image frames with text character strings embedded are spliced according to the sequence of the video timestamps to generate a picture containing subtitles. That is, the N frames of target video image frames with subtitles embedded are spliced to generate the picture containing the subtitles. The splicing may be complete splicing, that is, one frame of target video image frame is spliced under another in the vertical direction. For example, please refer to fig. 9b, which is a schematic diagram of a picture containing subtitles generated in an embodiment of the present invention; fig. 9b is a picture generated by completely splicing two target video image frames together, where "yesterday is history, tomorrow is a mystery, and only today is a gift" is the target subtitle information.
Or, under the condition that the subtitle mode of the target film is the soft subtitle mode or the subtitle-free mode, each line of text character string in the target subtitle information can be embedded into its corresponding target video image frame, the subtitle areas where subtitles are embedded in the other target video image frames except the first frame of target video image frame are intercepted, and the first frame of target video image frame with subtitles embedded and the subtitle areas of the other target video image frames are spliced according to the sequence of the video timestamps to generate a picture containing the subtitles. This splicing may be partial splicing, that is, the subtitle region carrying the embedded text character string of the second frame is spliced under the first frame of target video image frame in the vertical direction, and the subtitle regions carrying the embedded text character strings of the third frame, the fourth frame and so on are spliced in sequence. Please refer to fig. 9c, which is a schematic diagram of a picture containing subtitles generated in an embodiment of the present invention; fig. 9c is a picture generated by partially splicing two frames of target video image frames, where "yesterday is history, tomorrow is a mystery" is the text character string corresponding to the first frame of target video image frame, "only today is a gift" is the text character string corresponding to the second frame of target video image frame, and during partial splicing only the subtitle region of the second frame of target video image frame is spliced.
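For illustration, the partial splicing described above might look as follows in Python with the Pillow library; the fixed subtitle-band height ratio is an assumption of this sketch, since the embodiment only notes that the subtitle region usually lies at the lower side of the frame:

    from PIL import Image

    def partial_splice(frames, band_ratio=0.15):
        """frames: list of PIL.Image, all the same width, subtitles already embedded."""
        first = frames[0]
        w, h = first.size
        band_h = int(h * band_ratio)
        # Crop only the subtitle band from every frame after the first.
        bands = [f.crop((0, f.size[1] - band_h, w, f.size[1])) for f in frames[1:]]
        out = Image.new("RGB", (w, h + band_h * len(bands)))
        out.paste(first, (0, 0))
        y = h
        for band in bands:
            out.paste(band, (0, y))  # stack subtitle bands vertically under frame 1
            y += band_h
        return out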
It can be understood that, if the number of target video image frames is greater than a preset number, the number of target video image frames can be further reduced, and the text character strings can be distributed evenly among the remaining target video image frames.
(1) Embedding of the text character string
In the soft subtitle mode or the subtitle-free mode, the text character string needs to be embedded into the target video image frame. The embedding can be realized through an Application Programming Interface (API), for example through platform code on the iOS and Android systems; in practical application, that code can be further modified, or other code for embedding text character strings can be developed, so the above does not limit the technical scheme of the present invention.
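Since the platform-specific iOS and Android snippets are not reproduced in this text, the following stand-in sketch embeds a text character string with the Pillow library; the font file, font size and drawing position are assumptions of the sketch:

    from PIL import Image, ImageDraw, ImageFont

    def embed_subtitle(frame, text):
        """frame: PIL.Image; returns a copy with the text drawn near the bottom."""
        out = frame.copy()
        draw = ImageDraw.Draw(out)
        font = ImageFont.truetype("DejaVuSans.ttf", size=28)  # assumed font file
        w, h = out.size
        # Draw near the bottom centre, where film subtitles conventionally sit.
        tw = draw.textlength(text, font=font)
        draw.text(((w - tw) / 2, h - 60), text, font=font, fill="white")
        return out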
(2) Splicing of target video image frames
Taking the complete splicing of two target video image frames into one picture as an example: since a picture is a digital image, and a digital image is generally represented by a matrix in a computer program, the two target video image frames A and B to be spliced can be respectively defined as follows:
A = [a(i, j)] (i = 1, ..., hA; j = 1, ..., w), B = [b(i, j)] (i = 1, ..., hB; j = 1, ..., w), where a(i, j) and b(i, j) are the pixel values at row i and column j, hA and hB are the respective image heights, and w is the common image width.
As can be seen from the above, the target video image frame A and the target video image frame B have the same image width and different image heights; therefore, the new picture generated after image A and image B are spliced is:
C = [A; B] = [c(i, j)] (i = 1, ..., hA + hB; j = 1, ..., w), with c(i, j) = a(i, j) for i ≤ hA and c(i, j) = b(i − hA, j) for i > hA; that is, the new picture C keeps the common width w and has height hA + hB, with A occupying the upper part and B the lower part.
it should be understood that, in the embodiment of the present invention, the above is only one possible implementation of the splicing; in practical applications, the splicing manner may also be modified based on the resolution of the target video image frames and/or the required resolution of the picture containing subtitles to be generated, and therefore the splicing manner described above does not limit the technical solution of the present invention.
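Viewing the frames as matrices as above, the complete splicing reduces to stacking along the height axis; a minimal NumPy sketch, assuming frames of shape (height, width, 3) with equal widths:

    import numpy as np

    def splice_complete(frame_a, frame_b):
        """frame_a: (hA, w, 3) array; frame_b: (hB, w, 3) array with the same w."""
        assert frame_a.shape[1] == frame_b.shape[1], "widths must match"
        return np.vstack([frame_a, frame_b])  # result has shape (hA + hB, w, 3)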
In the embodiment of the invention, by comparing the video timestamps with the subtitle timestamps, the target video image frames can be extracted based on the text character strings contained in the target subtitle information and a picture containing the subtitles can be generated, which can effectively reduce the size of the generated picture, reduce the traffic and time consumed by uploading or sharing the picture, improve the use experience of the user, and increase the user's willingness to share.
It should be noted that, in the embodiment of the present invention, the specific forms of the subtitle appearance, such as the font, the font size, the color, the frame, and the like, do not affect the implementation of the technical solution in the embodiment of the present invention.
Referring to fig. 10, which is a schematic diagram illustrating program modules of a picture processing apparatus according to an embodiment of the present invention, the apparatus includes:
a response obtaining module 1001, configured to, in response to a picture processing instruction, obtain, according to the picture processing instruction and a multimedia file of a target movie, a multi-frame video image frame and target subtitle information within a specified time range;
a frame extracting module 1002, configured to extract at least one frame of target video image frame from the multiple frames of video image frames according to the target subtitle information and the multiple frames of video image frames;
a generating module 1003, configured to generate a picture including subtitles according to the target subtitle information and the at least one frame of target video image frame.
Wherein the picture processing instruction may include a specified time range, so that a picture containing subtitles can be generated using the target subtitle information within the specified time range and the multi-frame video image frames within the specified time range; for example, the specified time range may be from 5 minutes to 5 minutes 30 seconds of the target movie. Alternatively, the picture processing instruction may include a subtitle selection flag indicating that a subtitle can be selected by the user for generating a picture; the specified time range is then determined based on the selected subtitle, and the multi-frame video image frames within that specified time range are further determined to generate the picture containing the subtitle. It can be understood that, when the picture processing instruction includes different content, the manner of generating the picture containing the subtitle will differ, which will be described in detail in the following embodiments and is not repeated herein.
The target caption information is caption information used for generating a picture, and the target caption information includes text character strings and caption time stamps of the text character strings, and the caption time stamps are used for indicating the time when the corresponding text character strings appear in the movie, and usually have a start time point and an end time point.
The multi-frame video image frames in the specified time range are used for extracting target video image frames used for generating pictures, and the multi-frame video image frames in the specified time range contain video time stamps which are used for representing time points when the video image frames are displayed in the movie.
In the embodiment of the invention, under the picture processing instruction of a user, the target subtitle information and at least one frame of target video image frame are acquired, and the picture containing subtitles is generated from them, so that the picture containing subtitles is obtained based on the multi-frame video image frames and the target subtitle information within the specified time range without the user manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Please refer to fig. 11, which is a schematic diagram of program modules of a picture processing apparatus according to an embodiment of the present invention; it includes the response obtaining module 1001, the frame extracting module 1002 and the generating module 1003 of the embodiment shown in fig. 10, whose content is similar to that described in the embodiment shown in fig. 10 and is not repeated herein.
In this embodiment of the present invention, the response obtaining module 1001 includes:
a response determining module 1101, configured to determine a subtitle mode of the target movie in response to a picture processing instruction;
a data processing module 1102, configured to perform data processing on a multimedia file of the target movie according to the subtitle mode of the target movie, so as to obtain original video image data and original subtitle information of the target movie;
the first determining module 1103 is configured to obtain, according to the picture processing instruction, the original video image data, and the original subtitle information, a multi-frame video image frame and target subtitle information within a specified time range.
The target movie may refer to a movie currently being watched by the user, or may be a movie that has been imported into an application having a picture processing function.
The subtitle modes of the target film are divided into three modes, namely a soft subtitle mode, a hard subtitle mode and a subtitle-free mode, which are respectively described as follows:
(1) soft caption mode
The soft subtitle mode refers to that subtitle information of a film exists independently, and in the soft subtitle mode, carriers of the subtitle information are commonly as follows: external subtitle files and subtitle streams within video files.
The external subtitle file refers to a digital text file in which subtitle information is stored independently from a video file, for example, SRT, ASS, and SUB are files in a text subtitle format, and during playing, a multimedia file of a movie and the external subtitle file need to be acquired.
The subtitle stream in the video file is organized inside the multimedia file together with the audio track and the video track in a data track manner, and is organized separately from the video stream and the audio stream in a container format, and the typical organization form of the subtitle stream is defined by the MKV file standard.
It is understood that, for both forms, the subtitle information includes text strings and subtitle timestamps of the text strings.
(2) Hard caption mode
In the hard subtitle mode, the subtitle information is superimposed onto the corresponding video image frames of the video file when the film is produced and becomes a part of those video image frames; the multimedia file is then generated after compression and audio-video multiplexing.
(3) Caption free mode
When the film is played, the displayed image has no caption, which is the non-caption mode, and the non-caption mode, the soft caption mode and the hard caption mode are three independent caption modes.
It is understood that, in the embodiment of the present invention, a mode flag may be included in the multimedia file of the target movie, and the mode flag is used to identify a specific subtitle mode of the target movie, so that when the target movie is played, whether the subtitle mode of the target movie is a soft subtitle mode, a hard subtitle mode, or a subtitle-free mode may be identified through the mode flag. Alternatively, the subtitle mode of the target movie may be determined in the following manner, specifically:
the container format of the multimedia file of the target film is decapsulated, and it is determined whether a subtitle stream can be obtained. If a subtitle stream is obtained after decapsulation, it is determined that the subtitle mode of the target film is the soft subtitle mode. If no subtitle stream is obtained after decapsulation, it is determined whether an external subtitle file of the multimedia file exists; if the external subtitle file exists, the subtitle mode of the target film is likewise determined to be the soft subtitle mode.
If the external subtitle file of the multimedia file does not exist, the video compression code stream obtained after decapsulating the multimedia file is decoded to obtain original video image data, and subtitles are identified from the original video image data by means of a subtitle recognition technology. If subtitles are identified, the subtitle mode of the target film is determined to be the hard subtitle mode; if no subtitles are identified, the subtitle mode of the target film is determined to be the subtitle-free mode. The subtitle recognition technology may be an Optical Character Recognition (OCR) technology, which will be described later and is not repeated herein.
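The mode-determination flow just described might be sketched as follows; demux, has_subtitle_stream, find_external_subtitle, decode_video and ocr_finds_subtitles are hypothetical helpers injected as parameters, standing in for the demuxer, decoder and OCR stack:

    def determine_subtitle_mode(media_path, demux, has_subtitle_stream,
                                find_external_subtitle, decode_video,
                                ocr_finds_subtitles):
        streams = demux(media_path)              # decapsulate the container format
        if has_subtitle_stream(streams):
            return "soft"                        # subtitle stream inside the multimedia file
        if find_external_subtitle(media_path):
            return "soft"                        # external subtitle file (SRT/ASS/SUB)
        frames = decode_video(streams)           # decode the video compression code stream
        if ocr_finds_subtitles(frames):
            return "hard"                        # subtitles burnt into the video image frames
        return "none"                            # subtitle-free mode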
Referring to fig. 12A, which is a schematic diagram of refinement program modules of the data processing module 1102 in the embodiment shown in fig. 11, in an embodiment the data processing module 1102 includes:
a first decapsulation module 1201, configured to, if the subtitle mode of the target film is a soft subtitle mode, decapsulate a multimedia file of the target film in a container format to obtain a video compressed code stream and a subtitle stream, or decapsulate the multimedia file of the target film in the container format to obtain the video compressed code stream, and decapsulate the multimedia file of the target film in the container format based on the obtained external subtitle file of the target film to obtain the subtitle stream;
the first decoding module 1202 is configured to decode the video compressed code stream to obtain the original video image data, and decode the subtitle stream to obtain the original subtitle information.
The multimedia file may be in a package format such as MP4, MKV, AVI, etc., and an audio compression code stream may be obtained during decapsulation. It can be understood that how to decapsulate the multimedia file belongs to the prior art, and details are not described here.
In an embodiment, please refer to fig. 12B, which is a schematic diagram of a refinement program module of the data processing module 1102 in the embodiment shown in fig. 11, including:
a second decapsulation module 1203, configured to decapsulate, in a container format, the multimedia file of the target movie if the subtitle mode of the target movie is the hard subtitle mode, to obtain a video compression code stream;
a second decoding module 1204, configured to decode the video compressed code stream to obtain the original video image data;
the first extraction module 1205 is configured to perform subtitle extraction on the original video image data by using an optical character recognition OCR technology to obtain the original subtitle information.
It is understood that, in the embodiment of the present invention, OCR technology is mainly used to extract text character strings from each video image frame in the hard subtitle mode. On the one hand, OCR only extracts text character strings and does not involve semantic understanding; on the other hand, a target film, which records real-life scenes, inevitably contains a large amount of incidental text, such as shop signs or brand lettering on characters and clothing. Therefore, compared with OCR applications in other scenes, extracting text in the hard subtitle mode requires, as part of the embodiment of the present invention, specifically distinguishing the region and position of the text character strings that belong to subtitles in a video image frame, that is, first locating the region of the subtitle text in the video image frame. The locating may be performed in combination with the features of subtitles in films, which include: the color and font of the subtitles are regular and contrast clearly with the background; the strokes in the subtitle region are rich, with obvious corner and edge features; the spacing between characters in the subtitles is fixed, and the text is laid out horizontally or vertically; the position of the subtitles is fixed within the same video, and a text character string of the same line usually stays on screen for several seconds. Based on these features, text character strings can be extracted by OCR. In practical applications there are many different OCR extraction approaches, one of which is as follows: segment single-character regions according to the gray-level histogram projection within the line region, and then perform gray-level image normalization, gradient feature extraction, multi-template matching and minimum-classification-error classification on the single-character regions to obtain the text character string in one frame of video image frame. Please refer to fig. 4C, which is a schematic diagram of extraction in the hard subtitle mode based on OCR technology in the embodiment shown in fig. 4B. It can be understood that a more detailed description of how to extract original subtitle information from original video image data by OCR belongs to the prior art and is not repeated herein.
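As an illustration of the hard-subtitle path, the following sketch crops the bottom band of a frame, where the subtitle region usually sits, and runs OCR on it; pytesseract and the installed language packs are assumptions of the sketch, since the embodiment does not prescribe a particular OCR engine:

    from PIL import Image
    import pytesseract

    def ocr_subtitle(frame, band_ratio=0.15):
        """frame: PIL.Image; returns the text recognized in the bottom band."""
        w, h = frame.size
        band = frame.crop((0, int(h * (1 - band_ratio)), w, h))
        gray = band.convert("L")  # gray-level image, as in the normalization step above
        return pytesseract.image_to_string(gray, lang="chi_sim+eng").strip()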
In an embodiment, please refer to fig. 12C, which is a schematic diagram of a refinement program module of the data processing module 1102 in the embodiment shown in fig. 11, including:
a third decapsulation module 1206, configured to decapsulate, in a container format, the multimedia file of the target movie to obtain a video compression code stream and an audio compression code stream if the subtitle mode of the target movie is a subtitle-free mode;
a third decoding module 1207, configured to decode the video compressed code stream to obtain the original video image data, and decode the audio compressed code stream to obtain audio data;
and a second extraction module 1208, configured to perform subtitle extraction on the audio data by using an automatic speech recognition ASR technique to obtain original subtitle information.
The ASR technology is mainly used to convert voice data into text character strings with understandable semantics. Applied to the scene of picture processing for films in the subtitle-free mode, it enables a picture containing subtitles to be generated even if the film does not contain any subtitle information, providing a better experience for users.
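As an illustration of the subtitle-free path, the following sketch feeds a decoded audio track to a speech recognizer; the SpeechRecognition package and its Google recognizer backend are assumptions of the sketch, since the embodiment does not prescribe a particular ASR engine:

    import speech_recognition as sr

    def asr_subtitles(wav_path, language="zh-CN"):
        """wav_path: path to the audio data decoded from the audio compression code stream."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)    # read the whole decoded track
        return recognizer.recognize_google(audio, language=language)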
In the embodiment of the present invention, in response to a picture processing instruction, the subtitle mode of the target film is determined, and the multimedia file of the target film is processed according to its subtitle mode to obtain the original video image data and original subtitle information of the target film. For example, in the soft subtitle mode, the multimedia file of the target film can be directly decapsulated and decoded to obtain the original video image data and original subtitle information; in the hard subtitle mode, OCR technology can be used to extract subtitles from the original video image data to obtain the original subtitle information; in the subtitle-free mode, ASR technology can be used to extract subtitles from the audio data to obtain the original subtitle information. In this way, the original video image data and original subtitle information can be obtained effectively, so that the target video image frames and target subtitle information can be further obtained and a picture containing subtitles can be generated, without the user manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved.
Referring to fig. 13, a schematic diagram of program modules of a picture processing apparatus according to an embodiment of the present invention: the apparatus includes the response obtaining module 1001, the frame extracting module 1002 and the generating module 1003 of the embodiment shown in fig. 11, where the response obtaining module 1001 includes the response determining module 1101, the data processing module 1102 and the first determining module 1103; their content is similar to that described in the embodiments shown in fig. 11 and fig. 12 and is not repeated herein.
In an embodiment of the present invention, the first determining module 1103 includes:
a third extracting module 1301, configured to, if the picture processing instruction includes a specified time range, extract the multiframe video image frames within the specified time range from the original video image data, and extract the subtitle information within the specified time range from the original subtitle information as the target subtitle information.
And the first determining module 1103 further comprises:
a display module 1302, configured to display the text character strings contained in the original subtitle information if the picture processing instruction includes a subtitle selection flag;
a data obtaining module 1303, configured to determine a selected text string if a text selection operation is detected, and obtain, based on the specified time range formed by the subtitle timestamps of the selected text string, a multi-frame video image frame within the specified time range from the original video image data, where the selected text string is the target subtitle information.
In the embodiment of the present invention, the picture processing instruction may include a specified time range, and the specified time range may be set by the user. Specifically, in the process of watching a target film, if the user needs to generate a subtitle-containing picture of the film, the user may click a picture processing function button on the display interface, and the mobile terminal responds to the click operation and displays a setting interface. The setting interface may be a time input interface, in which the user inputs a start time point and an end time point and clicks a confirmation operation to generate a picture processing instruction; the instruction then contains the specified time range formed by the start time point and the end time point input by the user. For example, if the input start time point is 50 minutes 0 seconds and the input end time point is 50 minutes 30 seconds, the specified time range is 50 minutes 0 seconds to 50 minutes 30 seconds. Alternatively, the user may input only the start time point or only the end time point: if only the start time point is input, the time point obtained by adding a preset duration to the start time point is taken as the end time point; if only the end time point is input, the time point obtained by subtracting the preset duration from the end time point is taken as the start time point, thereby obtaining the specified time range. The specified time range may also be composed of a plurality of smaller time ranges. Or, after entering the setting interface, a time selection interface may be displayed, which is similar to a film playing interface and contains a progress bar; the user may select the start time point and the end time point on the progress bar, and whenever the user moves a position identifier (such as the mouse arrow on the display interface) to a certain time point on the progress bar, the video image frame corresponding to that time point is displayed for the user's reference. The user may confirm a time start point and a time end point by clicking, or cancel a selected point by clicking again after selection, so as to obtain a specified time range that meets the user's needs.
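A minimal sketch of completing a specified time range from a single endpoint using the preset duration mentioned above (the 30-second default is an assumption of the sketch):

    def specified_range(start_s=None, end_s=None, preset_s=30):
        """Return (start, end) in seconds, filling the missing endpoint if needed."""
        if start_s is not None and end_s is not None:
            return start_s, end_s
        if start_s is not None:
            return start_s, start_s + preset_s        # only a start point was entered
        if end_s is not None:
            return max(0, end_s - preset_s), end_s    # only an end point was entered
        raise ValueError("at least one of start_s / end_s is required")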
In the embodiment of the invention, by including the specified time range in the picture processing instruction, the target video image frames and the target subtitle information can be acquired using the specified time range, and a picture containing subtitles is generated from them. Based on this embodiment, the user can obtain the picture containing subtitles through a simple setting of the specified time range, without manually capturing screenshots one by one; the operation is simple, the user experience is good, and the user's willingness to share is effectively improved. Alternatively, by including a subtitle selection flag in the picture processing instruction, the user can select a text character string, the target video image frames are acquired based on the selected text character string, and a picture containing subtitles is generated based on the target video image frames and the selected text character string. Based on this embodiment, the user can obtain the picture containing subtitles through a simple selection of a text character string, likewise without manually capturing screenshots one by one, with simple operation, good user experience and an effective improvement in the user's willingness to share.
Referring to fig. 14, a schematic diagram of program modules of a picture processing apparatus according to an embodiment of the present invention: the apparatus includes the response obtaining module 1001, the frame extracting module 1002 and the generating module 1003 of the embodiment shown in fig. 11, which are similar to the content described in the embodiments shown in fig. 11 and fig. 12 and are not repeated herein.
In this embodiment of the present invention, the frame extracting module 1002 includes:
a comparison and extraction module 1401, configured to compare the video timestamp of the multiple frames of video image frames with the subtitle timestamp of the target subtitle information, and extract a frame of target video image frame from the multiple frames of video image frames;
and the generating module 1003 includes:
an embedding generation module 1402, configured to embed the text character string included in the target subtitle information into the extracted one frame of target video image frame, so as to generate a picture including subtitles.
In this embodiment of the present invention, the frame extracting module 1002 further includes:
a comparison determination module 1403, configured to compare the caption time stamp of each line of text character strings in the target caption information with the video time stamp of the multi-frame video image frame in sequence, and determine a set of video image frames corresponding to each line of text character strings;
a fourth extracting module 1404, configured to sequentially extract one frame of target video image frame corresponding to each line of text character string from the set of video image frames corresponding to each line of text character string, respectively, to obtain multiple frames of target video image frames.
In this embodiment of the present invention, the generating module 1003 further includes:
a first generating module 1405, configured to splice the multiple frames of target video image frames according to a sequence of video timestamps to generate the picture including the subtitles if the subtitle mode of the target movie is the hard subtitle mode; or intercepting the subtitle areas of other target video image frames except the first frame of target video image frame, and splicing the subtitle areas of the first frame of target video image frame and the other target video image frames according to the sequence of the video timestamps to generate the picture containing the subtitles;
a second generating module 1406, configured to, if the subtitle mode of the target movie is the soft subtitle mode or the subtitle-free mode, embed each line of text character string in the target subtitle information into a corresponding target video image frame, and splice the multiple frames of target video image frames according to the order of the video timestamps to generate a picture including subtitles; or embedding each line of text character string in the target caption information into the corresponding target video image frame, intercepting caption areas where other target video image frames except the first frame of target video image frame are embedded with captions, splicing the first frame of target video image frame embedded with captions and the caption areas of the other target video images according to the sequence of the video timestamps, and generating the picture containing the captions.
In the embodiment of the invention, one frame of target video image frame is extracted from the multi-frame video image frames within the specified time range, and the target subtitle information is embedded into the target video image frame to generate the picture containing the subtitle, which can effectively reduce the size of the generated picture and reduce the traffic and time consumed when the picture is shared and uploaded; in this way the picture containing the subtitle is obtained without the user manually capturing screenshots one by one, the operation is simple, the user experience is good, and the user's willingness to share is effectively improved. Alternatively, by comparing the video timestamps with the subtitle timestamps, the target video image frames can be extracted based on the text character strings contained in the target subtitle information and multiple frames of target video image frames can be synthesized into one picture, which can effectively reduce the size of the generated picture containing subtitles, reduce the traffic and time consumed by uploading or sharing the picture, improve the use experience of the user, and increase the user's willingness to share.
It is to be understood that, in the embodiments of the present invention, the embodiments of the image processing apparatus may be combined based on specific requirements to obtain more feasible embodiments, which are not described herein again.
The invention further provides a mobile terminal, which comprises a memory, a processor and a computer program stored on the memory and running on the processor, and is characterized in that when the processor executes the computer program, each step in the picture processing method in any one of the embodiments of fig. 2 to 8 is realized.
The present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the picture processing method in any one of fig. 2 to 8.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no acts or modules are necessarily required of the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The picture processing method, apparatus, mobile terminal and computer-readable storage medium provided by the present invention are described above; those skilled in the art will recognize that, according to the teachings of the present invention, there may be variations in the specific embodiments and the scope of application of the method and apparatus.

Claims (13)

1. A picture processing method, characterized in that the method comprises:
responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
extracting at least one frame of target video image frame from the multi-frame video image frames according to the target subtitle information and the multi-frame video image frames; the method comprises the following steps: comparing the video time stamp of the multi-frame video image frame with the subtitle time stamp of the target subtitle information, and extracting a frame of target video image frame from the multi-frame video image frame;
alternatively,
comparing the caption time stamp of each line of text character string in the target caption information with the video time stamp of the multi-frame video image frame in sequence to determine a set of video image frames corresponding to each line of text character string;
sequentially extracting a frame of target video image frame corresponding to each line of text character string from the set of video image frames corresponding to each line of text character string to obtain a plurality of frames of target video image frames;
and generating a picture containing the subtitles according to the target subtitle information and the at least one frame of target video image frame.
2. The method according to claim 1, wherein the obtaining multiple frames of video image frames and target caption information within a specified time range according to the picture processing instruction and a multimedia file of a target movie comprises:
determining a subtitle mode of the target film;
according to the subtitle mode of the target film, performing data processing on a multimedia file of the target film to obtain original video image data and original subtitle information of the target film;
and obtaining multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction, the original video image data and the original subtitle information.
3. The method according to claim 2, wherein the performing data processing on the multimedia file of the target movie according to the subtitle mode of the target movie to obtain original video image data and original subtitle information of the target movie comprises:
if the subtitle mode of the target film is a soft subtitle mode, decapsulating a multimedia file of the target film in a container format to obtain a video compressed code stream and a subtitle stream, or decapsulating the multimedia file of the target film in the container format to obtain the video compressed code stream, and decapsulating the multimedia file of the target film in the container format based on the obtained external subtitle file of the target film to obtain the subtitle stream;
and decoding the video compressed code stream to obtain the original video image data, and decoding the subtitle stream to obtain the original subtitle information.
4. The method according to claim 2, wherein the performing data processing on the multimedia file of the target movie according to the subtitle mode of the target movie to obtain original video image data and original subtitle information of the target movie comprises:
if the subtitle mode of the target film is a hard subtitle mode, decapsulating a multimedia file of the target film in a container format to obtain a video compression code stream;
decoding the video compressed code stream to obtain the original video image data;
and performing subtitle extraction on the original video image data by using an Optical Character Recognition (OCR) technology to obtain the original subtitle information.
5. The method according to claim 2, wherein the performing data processing on the multimedia file of the target movie according to the subtitle mode of the target movie to obtain original video image data and original subtitle information of the target movie comprises:
if the subtitle mode of the target film is a subtitle-free mode, decapsulating a multimedia file of the target film in a container format to obtain a video compression code stream and an audio compression code stream;
decoding the video compressed code stream to obtain the original video image data, and decoding the audio compressed code stream to obtain audio data;
and extracting subtitles from the audio data by utilizing an Automatic Speech Recognition (ASR) technology to obtain original subtitle information.
6. The method of claim 2, wherein obtaining multiple frames of video image frames and target caption information within a specified time range according to the picture processing instruction, the original video image data, and original caption information comprises:
if the picture processing instruction contains a specified time range, extracting the multiframe video image frames in the specified time range from the original video image data, and extracting the subtitle information in the specified time range from the original subtitle information to serve as the target subtitle information.
7. The method of claim 2, wherein the obtaining the target caption information and the plurality of frames of video image frames within the specified time range according to the picture processing instruction, the original video image data and the original caption information comprises:
if the image processing instruction contains a subtitle selection mark, displaying a text character string contained in the original subtitle information;
if the text selection operation is detected, determining the selected text character string, and acquiring a multi-frame video image frame in the specified time range from the original video image data based on the specified time range formed by the caption time stamps of the selected text character string, wherein the selected text character string is the target caption information.
8. The method according to any one of claims 1 to 7, wherein the generating a picture containing subtitles according to the target subtitle information and the target video image frame comprises:
and embedding the text character strings contained in the target subtitle information into the extracted target video image frame to generate a picture containing subtitles.
9. The method according to any one of claims 1 to 7, wherein the generating a picture containing subtitles according to the target subtitle information and the target video image frame comprises:
if the subtitle mode of the target film is a hard subtitle mode, splicing the multiple frames of target video image frames according to the sequence of video timestamps to generate the image containing the subtitles; or intercepting the subtitle areas of other target video image frames except the first frame of target video image frame, and splicing the subtitle areas of the first frame of target video image frame and the other target video image frames according to the sequence of the video timestamps to generate the picture containing the subtitles;
if the subtitle mode of the target film is a soft subtitle mode or a subtitle-free mode, embedding each line of text character string in the target subtitle information into a corresponding target video image frame, and splicing the multiple frames of target video image frames according to the sequence of video time stamps to generate a picture containing subtitles; or embedding each line of text character string in the target caption information into the corresponding target video image frame, intercepting caption areas where other target video image frames except the first frame of target video image frame are embedded with captions, splicing the first frame of target video image frame embedded with captions and the caption areas of the other target video images according to the sequence of the video timestamps, and generating the picture containing the captions.
10. A picture processing apparatus, characterized in that the apparatus comprises:
the response acquisition module is used for responding to a picture processing instruction, and acquiring multi-frame video image frames and target subtitle information within a specified time range according to the picture processing instruction and a multimedia file of a target film;
the frame extraction module is used for extracting at least one frame of target video image frame from the multi-frame video image frames according to the target subtitle information and the multi-frame video image frames; the method comprises the following steps: comparing the video time stamp of the multi-frame video image frame with the subtitle time stamp of the target subtitle information, and extracting a frame of target video image frame from the multi-frame video image frame;
alternatively,
comparing the caption time stamp of each line of text character string in the target caption information with the video time stamp of the multi-frame video image frame in sequence to determine a set of video image frames corresponding to each line of text character string;
sequentially extracting a frame of target video image frame corresponding to each line of text character string from the set of video image frames corresponding to each line of text character string to obtain a plurality of frames of target video image frames;
and the generating module is used for generating a picture containing subtitles according to the target subtitle information and the at least one frame of target video image frame.
11. The apparatus of claim 10, wherein the response obtaining module comprises:
the response determination module is used for responding to a picture processing instruction and determining a subtitle mode of the target film;
the data processing module is used for carrying out data processing on the multimedia file of the target film according to the subtitle mode of the target film to obtain original video image data and original subtitle information of the target film;
and the first determining module is used for obtaining the multi-frame video image frame and the target caption information within the appointed time range according to the picture processing instruction, the original video image data and the original caption information.
12. A mobile terminal comprising a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor implements the steps of the picture processing method according to any one of claims 1 to 9 when executing the computer program.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the picture processing method according to any one of claims 1 to 9.
CN201711027700.XA 2017-10-27 2017-10-27 Picture processing method and device, mobile terminal and computer readable storage medium Active CN109729420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711027700.XA CN109729420B (en) 2017-10-27 2017-10-27 Picture processing method and device, mobile terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711027700.XA CN109729420B (en) 2017-10-27 2017-10-27 Picture processing method and device, mobile terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109729420A CN109729420A (en) 2019-05-07
CN109729420B true CN109729420B (en) 2021-04-20

Family

ID=66292452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711027700.XA Active CN109729420B (en) 2017-10-27 2017-10-27 Picture processing method and device, mobile terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109729420B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457522B (en) * 2019-08-15 2022-09-06 北京字节跳动网络技术有限公司 Information sharing method and device, terminal and storage medium
CN110888812B (en) * 2019-11-26 2023-11-07 国核自仪系统工程有限公司 Test system and test method for response time of terminal page
CN111010610B (en) * 2019-12-18 2022-01-28 维沃移动通信有限公司 Video screenshot method and electronic equipment
WO2021146875A1 (en) * 2020-01-20 2021-07-29 深圳市大疆创新科技有限公司 Device control method, image display method, device, and storage medium
CN111539427B (en) * 2020-04-29 2023-07-21 深圳市优优品牌传播有限公司 Video subtitle extraction method and system
CN111859213A (en) * 2020-06-08 2020-10-30 微民保险代理有限公司 Dynamic graph generation method and device, computer equipment and storage medium
CN112261453A (en) * 2020-10-22 2021-01-22 北京小米移动软件有限公司 Method, device and storage medium for transmitting subtitle splicing map
CN112580446B (en) * 2020-12-04 2022-06-24 北京中科凡语科技有限公司 Video subtitle translation method, system, electronic device and readable storage medium
CN112579826A (en) * 2020-12-07 2021-03-30 北京字节跳动网络技术有限公司 Video display and processing method, device, system, equipment and medium
CN112561798A (en) * 2020-12-09 2021-03-26 深圳传音控股股份有限公司 Picture processing method, mobile terminal and storage medium
CN112770146B (en) * 2020-12-30 2023-10-03 广州酷狗计算机科技有限公司 Method, device, equipment and readable storage medium for setting content data
CN112887781A (en) * 2021-01-27 2021-06-01 维沃移动通信有限公司 Subtitle processing method and device
CN114979788A (en) * 2021-02-24 2022-08-30 上海哔哩哔哩科技有限公司 Bullet screen display method and device
CN113315931B (en) * 2021-07-06 2022-03-11 伟乐视讯科技股份有限公司 HLS stream-based data processing method and electronic equipment
CN113722513B (en) * 2021-09-06 2022-12-20 抖音视界有限公司 Multimedia data processing method and equipment
CN113806570A (en) * 2021-09-22 2021-12-17 维沃移动通信有限公司 Image generation method and generation device, electronic device and storage medium
CN114501098B (en) * 2022-01-06 2023-09-26 北京达佳互联信息技术有限公司 Subtitle information editing method, device and storage medium
CN116303291B (en) * 2023-05-17 2023-07-21 天津微众信科科技有限公司 Digital archive processing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105100893A (en) * 2014-04-21 2015-11-25 联想(北京)有限公司 Video sharing method and device
CN105163178A (en) * 2015-08-28 2015-12-16 北京奇艺世纪科技有限公司 Method and device for locating video playing position
CN106899875A (en) * 2017-02-06 2017-06-27 合网络技术(北京)有限公司 The display control method and device of plug-in captions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120063886A (en) * 2010-12-08 2012-06-18 삼성전자주식회사 Image processing apparatus, user terminal apparatus and image processing method, control method thereof

Similar Documents

Publication Publication Date Title
CN109729420B (en) Picture processing method and device, mobile terminal and computer readable storage medium
CN109819313B (en) Video processing method, device and storage medium
US10971188B2 (en) Apparatus and method for editing content
CN110557678B (en) Video processing method, device and equipment
US9330267B2 (en) Filtering confidential information in voice and image data
CN104869305B (en) Method and apparatus for processing image data
CN107155138A (en) Video playback jump method, equipment and computer-readable recording medium
CN105302315A (en) Image processing method and device
CN106791535B (en) Video recording method and device
CN105828101A (en) Method and device for generation of subtitles files
EP3131302B1 (en) Method and device for generating video content
CN106506335A (en) The method and device of sharing video frequency file
CN110113677A (en) The generation method and device of video subject
JP6202815B2 (en) Character recognition device, character recognition method, and character recognition program
US20210405767A1 (en) Input Method Candidate Content Recommendation Method and Electronic Device
CN111368127A (en) Image processing method, image processing device, computer equipment and storage medium
US10936878B2 (en) Method and device for determining inter-cut time range in media item
CN113992972A (en) Subtitle display method and device, electronic equipment and readable storage medium
US10616595B2 (en) Display apparatus and control method therefor
US11600300B2 (en) Method and device for generating dynamic image
CN113220258B (en) Voice message previewing method and electronic equipment
CN113259754B (en) Video generation method, device, electronic equipment and storage medium
CN104049758A (en) Information processing method and electronic device
CN110392313B (en) Method, system, medium and electronic device for displaying specific voice comments
CN114979764B (en) Video generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant