CN113365145A - Video processing method, video playing method, video processing device, video playing device, computer equipment and storage medium


Info

Publication number
CN113365145A
Authority
CN
China
Prior art keywords
video
target video
subtitle
target
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110619228.9A
Other languages
Chinese (zh)
Other versions
CN113365145B (en)
Inventor
李琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Beijing Volcano Engine Technology Co Ltd
Douyin Vision Beijing Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202110619228.9A
Publication of CN113365145A
Application granted
Publication of CN113365145B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Abstract

The present disclosure provides a video processing method, a video playing method, a video processing device, a video playing device, a computer device and a storage medium. The video processing method includes: receiving a video acquisition request, sent by a user side, for a target video, the request carrying a target display proportion of the user side; when the target display proportion is detected to be a first display proportion, obtaining predetermined cutting information of the target video corresponding to the target display proportion, and obtaining subtitle information determined based on the color value change condition of the subtitle display area of the target video, the first display proportion being a display proportion at which the subtitle display area cannot be completely displayed; and processing the target video based on the subtitle information and the cutting information, and sending the processed target video to the user side.

Description

Video processing method, video playing method, video processing device, video playing device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for video processing and video playing, a computer device, and a storage medium.
Background
With the rapid development of mobile-device hardware, the variety of mobile devices keeps growing, and the screen aspect ratios of different devices often differ. Different users therefore require different display proportions for the same video, so the display proportion of the original video frequently has to be changed when the video is played.
In the related art, the requested video is generally cropped to meet the display requirements of different users. To avoid cropping away key content, the cropping usually sacrifices the subtitles instead, so subtitles are lost during playback and the user's viewing experience suffers.
Disclosure of Invention
The embodiment of the disclosure at least provides a video processing method, a video playing method, a video processing device, a video playing device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a video processing method, including:
receiving a video acquisition request, sent by a user side, for a target video; the video acquisition request carries a target display proportion of the user side;
under the condition that the target display proportion is detected to be a first display proportion, obtaining preset cutting information of the target video corresponding to the target display proportion, and obtaining subtitle information determined based on the color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display area cannot be completely displayed;
and processing the target video based on the subtitle information and the cutting information, and sending the processed target video to the user side.
In a possible implementation, the method further includes determining a subtitle display area in the target video according to the following method:
sampling the target video, and determining a plurality of sampling video frames of the target video;
identifying text presentation regions in the plurality of sampled video frames;
and determining a subtitle display area in the target video based on the character display areas in the plurality of sampling video frames.
In a possible implementation, the method further includes determining a subtitle display area in the target video according to the following method:
and inputting the target video to a pre-trained first neural network, and outputting a subtitle display area of the target video by the first neural network.
In a possible implementation manner, the cropping information includes a cropping coordinate corresponding to each video frame in the target video;
for any first display scale, the method further comprises determining cropping information of the target video at the any first display scale according to the following method:
and inputting the target video and any one first display proportion into a pre-trained second neural network, wherein the second neural network outputs the cutting information of the target video at any one first display proportion.
In one possible implementation, the subtitle information of the target video is determined according to the following method:
determining changed video frames in the target video, namely frames in which the characters displayed in the subtitle display area differ from those of adjacent video frames;
and identifying subtitle information displayed in the subtitle display area of the changed video frames.
In a possible implementation manner, the obtaining of the subtitle information determined based on the color value change condition of the subtitle display area of the target video includes:
acquiring continuous pixel points with the same color value in a subtitle display area of the target video;
determining continuous pixel points to be screened based on the color difference value between the continuous pixel points and other pixel points and the change condition of the same pixel position of the continuous pixel points in preset time;
and aggregating the continuous pixel points to be screened, matching the aggregation result with characters stored in a character library, and determining the subtitle information of the target video based on the matching result.
In a possible implementation manner, the step of obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video is executed by a third neural network;
the third neural network is obtained by training according to the following steps:
acquiring a sample video frame with subtitle annotation information;
inputting the sample video frame into a third neural network to be trained to obtain predicted caption information corresponding to the sample video;
and training the third neural network to be trained based on the predicted caption information and the caption marking information.
In a possible implementation, the processing the target video based on the subtitle information and the cropping information includes:
intercepting caption images corresponding to the matched continuous pixel points from the target video;
cutting the target video according to the cutting information; and overlaying the subtitle image to the cut target video.
In a possible implementation, the processing the target video based on the subtitle information and the cropping information includes:
after the target video is cropped based on the cropping information, if the cropped target video still includes part of the subtitle area, blurring the character information in that partial subtitle area of the target video, and displaying the subtitle information in an overlaid manner on the blurred target video.
In a second aspect, an embodiment of the present disclosure provides a video playing method, including:
responding to the playing operation of a target video, and sending a video acquisition request, wherein the video acquisition request carries a target display proportion of a user side;
and receiving and playing the processed target video, wherein the processed target video is determined according to the cutting information corresponding to the target display proportion and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
In a third aspect, an embodiment of the present disclosure further provides a video processing apparatus, including:
the receiving module is used for receiving a video acquisition request, sent by a user side, for a target video; the video acquisition request carries a target display proportion of the user side;
the acquisition module is used for acquiring preset cutting information of the target video corresponding to the target display proportion and acquiring subtitle information determined based on the color value change condition of a subtitle display area of the target video under the condition that the target display proportion is detected to be a first display proportion; wherein the first display proportion is a display proportion at which the subtitle display area cannot be completely displayed;
and the processing module is used for processing the target video based on the subtitle information and the cutting information and sending the processed target video to the user side.
In a possible implementation manner, the processing module is further configured to determine a subtitle display area in the target video according to the following method:
sampling the target video, and determining a plurality of sampling video frames of the target video;
identifying text presentation regions in the plurality of sampled video frames;
and determining a subtitle display area in the target video based on the character display areas in the plurality of sampling video frames.
In a possible implementation manner, the processing module is further configured to determine a subtitle display area in the target video according to the following method:
and inputting the target video to a pre-trained first neural network, and outputting a subtitle display area of the target video by the first neural network.
In a possible implementation manner, the cropping information includes a cropping coordinate corresponding to each video frame in the target video;
for any first display scale, the processing module is further configured to determine cropping information of the target video at any first display scale according to the following method:
and inputting the target video and any one first display proportion into a pre-trained second neural network, wherein the second neural network outputs the cutting information of the target video at any one first display proportion.
In a possible implementation manner, the obtaining module is further configured to determine subtitle information of the target video according to the following method:
determining changed video frames in the target video, namely frames in which the characters displayed in the subtitle display area differ from those of adjacent video frames;
and identifying subtitle information displayed in the subtitle display area of the changed video frames.
In a possible implementation manner, the obtaining module, when obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video, is configured to:
acquiring continuous pixel points with the same color value in a subtitle display area of the target video;
determining continuous pixel points to be screened based on the color difference value between the continuous pixel points and other pixel points and the change condition of the same pixel position of the continuous pixel points in preset time;
and aggregating the continuous pixel points to be screened, matching the aggregation result with characters stored in a character library, and determining the subtitle information of the target video based on the matching result.
In a possible implementation manner, the step of obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video is executed by a third neural network;
the obtaining module is further configured to train the third neural network according to the following steps:
acquiring a sample video frame with subtitle annotation information;
inputting the sample video frame into a third neural network to be trained to obtain predicted caption information corresponding to the sample video;
and training the third neural network to be trained based on the predicted caption information and the caption marking information.
In one possible implementation, when processing the target video based on the subtitle information and the cropping information, the processing module is configured to:
intercepting caption images corresponding to the matched continuous pixel points from the target video;
cutting the target video according to the cutting information; and overlaying the subtitle image to the cut target video.
In one possible implementation, when processing the target video based on the subtitle information and the cropping information, the processing module is configured to:
after the target video is cropped based on the cropping information, if the cropped target video still includes part of the subtitle area, blurring the character information in that partial subtitle area of the target video, and displaying the subtitle information in an overlaid manner on the blurred target video.
In a fourth aspect, an embodiment of the present disclosure provides a video playing apparatus, including:
the sending module is used for responding to the playing operation of the target video and sending a video acquisition request, wherein the video acquisition request carries the target display proportion of the user side;
and the playing module is used for receiving and playing the processed target video, and the processed target video is determined according to the cutting information corresponding to the target display proportion and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
In a fifth aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of any one of the possible implementations of the first or second aspect.
In a sixth aspect, the disclosed embodiments also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program is executed by a processor to perform the steps in any one of the possible implementation manners of the first aspect or the second aspect.
According to the video processing and playing methods and devices, the computer device and the storage medium provided by the embodiments of the present disclosure, the cutting information and subtitle information corresponding to each video at each first display proportion can be determined in advance. After a video acquisition request sent by a user side is received, and when the target display proportion of the user side is detected to be a first display proportion, the target video can be processed based on the predetermined subtitle information and cutting information corresponding to the target display proportion, and the processed target video is then sent to the user side. In this way, the subtitle information is displayed completely while the display proportions of different user sides are satisfied, improving the user's viewing experience.
In addition, because the subtitle information of the target video is determined based on the color value change condition of the subtitle display area of the target video, the determined subtitle information is more accurate.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive further related drawings from them without inventive effort.
Fig. 1 shows a flow chart of a video processing method provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating adjustment of a display scale of video content in a video processing method provided by an embodiment of the present disclosure;
fig. 3 is a flowchart illustrating a specific method for determining a subtitle display area in the target video in a video processing method provided by an embodiment of the present disclosure;
fig. 4a is a schematic diagram illustrating a text display area in the video frame identified in the video processing method provided by the embodiment of the disclosure;
fig. 4b is a schematic diagram illustrating that, in the video processing method provided by the embodiment of the present disclosure, a subtitle display area in the target video is determined;
fig. 5 is a flowchart illustrating a specific method for determining subtitle information of the target video in a video processing method provided by an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a specific method for training a third neural network in the video processing method provided by the embodiment of the present disclosure;
fig. 7 shows a flowchart of a video playing method provided by an embodiment of the present disclosure;
fig. 8 is a schematic diagram illustrating an architecture of a video processing apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram illustrating an architecture of a video playing apparatus provided in an embodiment of the present disclosure;
fig. 10 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. The components of the embodiments, as generally described and illustrated in the figures, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description is not intended to limit the scope of the claimed disclosure but merely represents selected embodiments of it. All other embodiments derived by a person skilled in the art from the embodiments of the disclosure without creative effort shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research shows that, to meet the display requirements of different users, the requested video is generally cropped before playback. To avoid cropping away key content, the cropping usually sacrifices the subtitles instead, so subtitles are lost during playback and the user's viewing experience suffers.
Based on the above research, the present disclosure provides a video processing and playing method and device, a computer device, and a storage medium. For each video, the cropping information and subtitle information corresponding to each first display proportion can be determined in advance; after a video acquisition request sent by a user side is received, and when the target display proportion of the user side is detected to be a first display proportion, the target video can be processed based on the predetermined subtitle information and cropping information corresponding to the target display proportion, and the processed target video is then sent to the user side.
To facilitate understanding of the present embodiment, a detailed description is first given of a video processing method disclosed in the embodiments of the present disclosure, and an execution subject of the video processing method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device is generally a server.
Referring to fig. 1, a flowchart of a video processing method provided in the embodiment of the present disclosure is shown, where the method includes steps S101 to S103, where:
S101: receiving a video acquisition request, sent by a user side, for a target video; the video acquisition request carries a target display proportion of the user side.
S102: under the condition that the target display proportion is detected to be a first display proportion, obtaining preset cutting information of the target video corresponding to the target display proportion, and obtaining subtitle information determined based on the color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display region cannot be completely displayed.
S103: and processing the target video based on the subtitle information and the cutting information, and sending the processed target video to the user side.
Each step and the corresponding implementation method in the embodiments of the present disclosure will be described in detail below.
For S101, the target display proportion of the user side may be the proportion at which the target video needs to be displayed on the terminal device corresponding to the user side. The proportion to be displayed may be the aspect ratio for full-screen display; for example, if the aspect ratio of the terminal device corresponding to the user side is 21:9, the display proportion at which the target video is shown full-screen on that device is also 21:9. Alternatively, the proportion to be displayed may be a viewing ratio selected by the user for the target video; for example, before the user triggers playback of the target video, a display-proportion input box may be provided, and the user may enter a ratio within a preset range, for example from 16:9 to 22:9, such as 18.5:9 or 21:9.
In a specific implementation, the video acquisition request for the target video may be generated after the user triggers a play button corresponding to the target video; for example, after the user triggers the play button of any target video in a user-side application, the video acquisition request for that video is generated. Alternatively, after the target video has started playing, the user may enter the desired viewing ratio through the display-proportion input box at the preset position, which generates the video acquisition request for the corresponding target video.
S102: under the condition that the target display proportion is detected to be a first display proportion, obtaining preset cutting information of the target video corresponding to the target display proportion, and obtaining subtitle information determined based on the color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display region cannot be completely displayed.
For any video, a video producer usually uploads it to the server at a certain display proportion, generally determined by the device that shot the video. The target display proportions of the user sides that play the video, however, can vary; common terminal-device display proportions are 16:9, 18:9, 19:9 and 21:9. After receiving the video, the server therefore adapts it to the different display proportions, generally by cropping part of the video content so that the key information of the video can still be displayed at each proportion.
Fig. 2 is a schematic diagram of adjusting the display proportion of video content. In fig. 2, the initial display proportion of the video is 16:9 (the solid-line portion) and the target display proportion is 21:9; the adjustment cuts out of the 16:9 picture (solid line) a region that can be completely displayed at 21:9 (the dashed line).
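To make the fig. 2 example concrete, the following sketch works through the crop arithmetic under an assumed 1920x1080 source frame; the resolution is illustrative, as the text only fixes the 16:9 and 21:9 proportions:

```python
# Worked numbers for the fig. 2 adjustment, assuming a 1920x1080 (16:9) source.
width, height = 1920, 1080          # 16:9 source frame (assumed resolution)
target_w, target_h = 21, 9          # target display proportion 21:9
crop_height = round(width * target_h / target_w)  # rows that survive the crop
removed = height - crop_height                    # rows trimmed top/bottom
print(crop_height, removed)                       # -> 823 257
```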
In the process of adjusting the video, the cropping information of the video at each second display proportion needs to be determined, where the second display proportions may include the display proportions of all known devices (i.e., the aspect ratios of the known devices), and the subtitle display area of the video needs to be determined as well. When it is detected that the subtitle display area cannot be completely displayed at some second display proportion, the video is processed based on the cropping information and the subtitle information corresponding to that proportion.
Here, the second display proportions include the first display proportions. For example, the second display proportions may include six proportions A, B, C, D, E and F; if, under the cropping information corresponding to A, B, C and D, the subtitle display area of the target video cannot be completely displayed, then A, B, C and D are first display proportions. For the two proportions E and F, when a video acquisition request carrying E or F is received from a user side, the subtitle display area can be completely displayed, so the target video is processed only according to the cropping information corresponding to the target display proportion carried in the request.
When determining the subtitle display area in the target video, the method can be implemented in any one of the following two ways:
in one possible implementation, as shown in fig. 3, the subtitle display area in the target video may be determined by:
S301: sampling the target video and determining a plurality of sampled video frames of the target video.
Here, when the target video is sampled, the target video may be sampled at a preset initial sampling frequency, for example, 5 frames per second.
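A minimal sampling sketch, assuming OpenCV is available; the 5-frames-per-second figure comes from the text above, while the function name and the fallback frame rate are illustrative:

```python
import cv2

def sample_frames(video_path: str, samples_per_second: float = 5.0):
    """Return frames sampled from the video at roughly the given rate."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unknown
    step = max(int(round(fps / samples_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                 # keep every step-th frame
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```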
S302: text presentation regions in the plurality of sampled video frames are identified.
Specifically, the text display areas, i.e., the areas in the video frames where text is displayed, may be identified through Optical Character Recognition (OCR) or other recognition technologies. Identifying the text display areas in the plurality of sampled video frames may mean identifying the area coordinate information corresponding to the text display area in each sampled video frame.
Illustratively, as shown in fig. 4a, it is a schematic diagram of the identified text display area in the video frame. The area coordinate information of the text display area, such as the pixel coordinates of its four vertices, is determined in fig. 4 a.
The text display area comprises a subtitle display area for displaying subtitles and a non-subtitle text display area in the video frame.
Specifically, when the target video is played, besides the text content shown in the subtitle display area, text displayed with special effects often appears at certain positions, such as around a face or around an object, for example "Wow!"; a video picture may therefore contain a subtitle display area and/or such non-subtitle text display areas.
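As an illustration of S302, the sketch below locates text display areas in one frame via the OCR route mentioned above; it assumes the pytesseract binding to Tesseract, and the confidence threshold is an arbitrary choice, not a value from the text:

```python
import cv2
import pytesseract

def text_regions(frame):
    """Return bounding boxes (x, y, w, h) of text detected in one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
    boxes = []
    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) > 60:  # keep confident hits
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes
```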
S303: and determining a subtitle display area in the target video based on the character display areas in the plurality of sampling video frames.
Here, since the display position of the subtitle in the target video is relatively fixed — the subtitle is usually stably located in the lower-middle part of the picture and horizontally centered — the subtitle display area within the text display areas may be determined according to these display characteristics of subtitles.
Specifically, the text display areas of the plurality of sampled video frames may be superimposed to find the overlapping area: the more times an area is superimposed, the more often characters appear in it, which matches the characteristics of a subtitle display area and not those of special-effect text such as "Wow!". An area whose number of superimpositions meets a preset condition, for example the area superimposed the largest number of times, may be determined as the superimposition position area where subtitles are displayed, and the subtitle display area within the text display areas may then be determined according to the relative position relationship between the text display areas and the superimposition position area.
For example, fig. 4b is a schematic diagram of determining the subtitle display area in the target video. In fig. 4b, four text display areas are superimposed; the width of the longest text display area is drawn with a solid line, and the widths of the three shorter ones with dotted lines for distinction. The shaded area in the middle is the superimposition position area. After the sampled video frames are superimposed, the text display area containing the superimposition position area may be determined to be the subtitle display area; alternatively, the text display area lying in the same line as the superimposition position area, or a text display area whose distance from the superimposition position area is less than a preset distance, may be determined to be the subtitle display area.
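A sketch of the superposition step just described: the per-frame boxes are accumulated into a vote map, and the most frequently covered horizontal band is returned as the candidate subtitle area. The 50% cut-off is an assumed stand-in for the "preset condition" above:

```python
import numpy as np

def subtitle_band(frame_shape, boxes_per_frame):
    """Superimpose text boxes from all sampled frames and return the row band
    that text covers most often -- a proxy for the subtitle display area."""
    h, w = frame_shape[:2]
    votes = np.zeros((h, w), dtype=np.int32)
    for boxes in boxes_per_frame:
        for x, y, bw, bh in boxes:
            votes[y:y + bh, x:x + bw] += 1    # each box votes for its pixels
    row_votes = votes.sum(axis=1)
    if row_votes.max() == 0:
        return None                           # no text found in any frame
    rows = np.where(row_votes >= row_votes.max() * 0.5)[0]
    return int(rows.min()), int(rows.max())   # top/bottom of the subtitle band
```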
In addition, the video content within a preset time period of the target video, for example the first minute or any one minute in the middle of the video, may be sampled at a higher frequency, e.g., 10 frames per second instead of the initial 5 frames per second, to obtain key frames of the target video. The superimposition of text display areas described above is applied to these key frames to obtain the distribution of the display areas where subtitles appear, from which a predicted display range of the subtitles in the target video can be determined. When the subtitle display area is identified later, only the predicted display range needs to be examined (for example, a text display area inside the predicted display range may directly be taken as the subtitle display area), which saves computing resources. Because high-frequency sampling and analysis are performed within the preset time period, the determined predicted display range is more accurate, so recognition is faster while the recognition accuracy of the subtitle display area is preserved.
Specifically, when the predicted display range of the subtitles is determined from the distribution of the display areas where subtitles appear, the display area of the video picture may be pre-divided into several candidate areas; when all the subtitle display areas fall inside one candidate area, that candidate area is taken as the predicted display range. For example, the picture may be divided into an upper half and a lower half, and when all detected subtitle display areas lie in the lower half, the lower half is determined to be the predicted display range of the subtitles. The number of candidate areas may be 2 or more according to actual needs, which is not limited in the embodiments of the present disclosure.
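A sketch of this candidate-area screening, with the picture split into horizontal bands (two by default, matching the upper-half/lower-half example):

```python
def predicted_display_range(subtitle_boxes, frame_height, parts=2):
    """Return the horizontal band that contains every detected subtitle box
    (the predicted display range), or the whole picture if none does."""
    band_h = frame_height // parts
    for k in range(parts):
        lo, hi = k * band_h, (k + 1) * band_h
        if all(lo <= y and y + h <= hi for (x, y, w, h) in subtitle_boxes):
            return lo, hi
    return 0, frame_height   # fall back: no single band holds all boxes
```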
Further, within the preset time period the key frames of the target video might happen to display only special-effect text such as "Wow!" rather than the subtitles of the video, which would cause the predicted display range of the subtitles to be recognized incorrectly; self-verification can therefore be performed while the subtitle display area is subsequently recognized.
For example, the frequency at which text appears inside the predicted display range may be counted; if no text appears in the predicted display range for one minute, the prediction is judged to be wrong, and the predicted display range is re-determined according to the above steps over a preset time length before or after the original preset time period. Alternatively, it may be checked whether the identified subtitle display areas overlap the superimposition position area; when none of them overlaps it over a continuous period of time, the predicted display range is judged to be wrong and is likewise re-determined. Taking a preset time period at the 45th minute of a 90-minute video and a preset time length of 1 minute as an example, the preset time length before the period is the 44th minute and the one after it is the 46th minute.
In another possible implementation manner, when determining the subtitle display area in the target video, the target video may be further input to a first neural network trained in advance, and the first neural network outputs the subtitle display area of the target video.
Here, when training the first neural network, a sample video with subtitles and corresponding annotation information may be used to train the network to be trained; a loss value for the current training round is then calculated based on the output of the network and the annotation information of the sample video, and when the loss value is smaller than a preset loss value, the training of the first neural network can be determined to be complete.
In a possible implementation manner, the cropping information includes a cropping coordinate corresponding to each video frame in the target video, and for any first display scale, the cropping information of the target video at any first display scale may be determined according to the following method:
and inputting the target video and any one first display proportion into a pre-trained second neural network, wherein the second neural network outputs the cutting information of the target video at any one first display proportion.
Here, the second neural network may be a neural network of the same type as the first neural network, that is, the second neural network may be trained by using a training method similar to that of the first neural network, and therefore, the training process of the second neural network is not described herein again.
Specifically, when determining the cropping information of the target video, the second neural network may identify key information, such as face images, in the sampled video frames of the target video, and may select, from a plurality of candidate pieces of cropping information generated for a given first display proportion, the one that retains the most key information as the target cropping information, thereby avoiding a poor viewing experience caused by cropping away key information. For example, if the initial display proportion of the video is 16:9 and the first display proportion is 21:9, a plurality of candidate cropping coordinates are randomly generated by cropping the picture horizontally (for example, from the top and/or bottom of the picture in fig. 2), and the candidate that retains the most key information is determined to be the cropping information of the target video.
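The second neural network itself is left unspecified here; as a rough, non-neural stand-in for its selection criterion, the sketch below scores candidate horizontal crops by how many detected faces (the assumed "key information") they retain, using OpenCV's bundled Haar face detector:

```python
import cv2

def best_vertical_crop(frame, target_aspect: float):
    """Among candidate horizontal crops to the target aspect ratio, keep the
    one retaining the most detected faces (a stand-in for key information)."""
    h, w = frame.shape[:2]
    crop_h = min(int(w / target_aspect), h)   # e.g. 21:9 -> target_aspect = 21/9
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, 1.1, 4)
    best_top, best_score = 0, -1
    for top in range(0, h - crop_h + 1, 8):   # candidate crops, 8-px stride
        score = sum(1 for (x, y, fw, fh) in faces
                    if y >= top and y + fh <= top + crop_h)
        if score > best_score:
            best_top, best_score = top, score
    return best_top, crop_h                   # crop = frame[best_top:best_top+crop_h]
```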
In one possible implementation, after determining the subtitle display area, as shown in fig. 5, the subtitle information determined based on the color value change condition of the subtitle display area of the target video may be obtained according to the following steps:
S501: acquiring continuous pixel points with the same color value in the subtitle display area of the target video.
S502: and determining the continuous pixel points to be screened based on the color difference value between the continuous pixel points and other pixel points and the change condition of the same pixel position of the continuous pixel points in preset time.
Here, the color difference value is the difference between the color values of two pixel points, where each color corresponds to one color value. For example, the hexadecimal web-format color value #ffffff represents white, while (0, 255, 0) gives the intensities of the Red, Green and Blue channels, 0, 255 and 0 respectively, and the resulting color is green. Since color values have multiple representations that can be converted into one another, the color difference value must be computed after converting both values to the same representation.
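A small sketch of this conversion point: hex web colors and RGB triples must be brought to one representation before a color difference can be computed. The per-channel absolute difference used here is an assumed, simplistic metric, not one prescribed by the text:

```python
def hex_to_rgb(hex_value: str):
    """Convert a web-style hex colour such as '#ffffff' to an (R, G, B) tuple."""
    hex_value = hex_value.lstrip("#")
    return tuple(int(hex_value[i:i + 2], 16) for i in (0, 2, 4))

def color_difference(rgb_a, rgb_b):
    """Sum of per-channel absolute differences between two RGB colours."""
    return sum(abs(a - b) for a, b in zip(rgb_a, rgb_b))

assert hex_to_rgb("#ffffff") == (255, 255, 255)  # white, as in the example
assert hex_to_rgb("#00ff00") == (0, 255, 0)      # green, as in the example
```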
For any character in a subtitle, there is usually no color difference between the pixel points within the character; for example, in a white subtitle the character color is always white, while the background colors in the subtitle display area vary. Moreover, even when the background in the subtitle display area is also white, the background color changes over time whereas the color of the subtitle pixel points does not. According to these characteristics of subtitles, the continuous pixel points with the same color value in the subtitle display area can be obtained, and the continuous pixel points to be screened can be determined from their color difference values to other pixel points and from how the same pixel positions change within a preset time.
Specifically, if the color difference between a group of continuous pixel points and other pixel points exceeds a preset color difference value, and the color values at the corresponding pixel positions do not change within the preset time, that group is taken as continuous pixel points to be screened.
S503: and aggregating the continuous pixel points to be screened, matching the aggregation result with characters stored in a character library, and determining the subtitle information of the target video based on the matching result.
In a possible implementation, the color value of the pixel points corresponding to the background of the subtitle display area may also remain unchanged within the preset time, for example when the background is a uniformly colored gray backboard; in that case the subtitle information obviously cannot be determined directly from the color difference values and the changes at the same pixel positions within the preset time.
Therefore, after the continuous pixel points to be screened are determined, they can be aggregated and the aggregation result matched against the characters stored in a character library, and only the continuous pixel points that match characters in the library are taken as the subtitle information of the target video. Since the background obviously cannot match any character in the library, this effectively prevents the background of the subtitle display area from being determined as subtitle information.
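The following sketch illustrates S501-S503 under stated assumptions: the input is a stack of same-sized subtitle-band crops over the preset time window, the 8-level tolerance for "unchanged" color and the contrast threshold of 60 are illustrative, and the final character-library matching is represented only by returning per-component boxes that a recognizer would then classify:

```python
import cv2
import numpy as np

def stable_text_mask(band_frames):
    """S501/S502 sketch: keep pixels whose colour stays constant across the
    time window (a subtitle trait) while standing out from their surroundings."""
    stack = np.stack(band_frames).astype(np.int16)            # (T, H, W, 3)
    temporal_range = (stack.max(axis=0) - stack.min(axis=0)).max(axis=-1)
    stable = temporal_range < 8                               # constant in time
    first = band_frames[0]
    local_mean = cv2.blur(first, (15, 15)).astype(np.int16)   # neighbourhood colour
    contrast = np.abs(first.astype(np.int16) - local_mean).max(axis=-1)
    return stable & (contrast > 60)                           # pixels to be screened

def aggregate_components(mask, min_area=20):
    """S503 sketch: aggregate screened pixels into connected components; each
    box would then be matched against the character library."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(
        mask.astype(np.uint8), connectivity=8)
    return [tuple(stats[i][:4]) for i in range(1, n)
            if stats[i][cv2.CC_STAT_AREA] >= min_area]        # drop speckle
```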
In a possible implementation manner, the step of obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video may be performed by a third neural network, as shown in fig. 6, and the third neural network may be trained by the following steps:
S601: acquiring sample video frames with subtitle annotation information.
S602: and inputting the sample video frame into a third neural network to be trained to obtain the predicted caption information corresponding to the sample video.
S603: and training the third neural network to be trained based on the predicted caption information and the caption marking information.
Here, the training process of the third neural network may be similar to that of the first and second neural networks: the loss value of the current training round is calculated based on the predicted caption information and the caption annotation information, and when the loss value is smaller than a preset loss value, the training of the third neural network can be determined to be complete.
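A schematic PyTorch training loop for the third neural network, assuming a model, a data loader yielding (sample frame, caption label) batches, and a cross-entropy loss; the architecture, loss choice and threshold are all assumptions, since the text fixes only the stopping rule:

```python
import torch
import torch.nn as nn

def train_third_network(model, loader, preset_loss=0.05, lr=1e-4):
    """Train until the average loss drops below the preset loss value (S603)."""
    criterion = nn.CrossEntropyLoss()   # predicted vs. annotated captions
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    while True:
        epoch_loss = 0.0
        for frames, labels in loader:
            optimizer.zero_grad()
            predictions = model(frames)            # predicted caption information
            loss = criterion(predictions, labels)  # compare with annotations
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < preset_loss: # S603 stopping condition
            return model
```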
In another possible implementation, when determining the subtitles displayed in the subtitle display area, the changed video frames in the target video, i.e., the frames whose characters in the subtitle display area differ from those of the adjacent video frame, may be determined first, and the subtitle information displayed in the subtitle display area of those changed video frames is then identified.
Here, the adjacent video frame may be the previous video frame. The changed video frames, i.e., the frames in which the subtitle changes, may be determined by recognizing the subtitle in each sampled video frame of the target video through a recognition technology such as OCR and comparing whether the subtitle text of each frame is the same; alternatively, a video frame whose corresponding subtitle display area changes in position and/or size may directly be taken as a changed video frame.
For example, taking 100 video frames obtained after sampling the target video, if frames 2 to 35, frames 36 to 70 and frames 71 to 100 each display the same subtitle and the same subtitle display area, then frames 2, 36 and 71 are determined to be the changed video frames.
Furthermore, the subtitle display areas determined in frames 2, 36 and 71 can be recognized to obtain the subtitle information they display; for example, the subtitle information may be determined through a recognition technology such as OCR. Here, the subtitle information includes the text content corresponding to the subtitle.
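A sketch of the changed-frame logic with the 100-frame example in mind; `recognise` stands in for any OCR pass over the subtitle display area and is an assumed callable, not an interface defined here:

```python
def changed_frames(sampled_frames, recognise):
    """Return indices of frames whose recognised subtitle text differs from
    the previous sampled frame's text (the changed video frames)."""
    changed, previous = [], None
    for i, frame in enumerate(sampled_frames):
        text = recognise(frame)        # e.g. OCR over the subtitle display area
        if text != previous:           # subtitle changed -> changed video frame
            changed.append(i)
            previous = text
    return changed
```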
S103: and processing the target video based on the subtitle information and the cutting information, and sending the processed target video to the user side.
Here, the processed target video may be sent to the user side as streaming data, so that the terminal device corresponding to the user side does not need to process the target video itself.
In a possible implementation, when the target video is processed based on the subtitle information and the cropping information, the subtitle image corresponding to the matched continuous pixel points can be captured from the target video; the target video is then cropped according to the cropping information, and the subtitle image is overlaid onto the cropped target video.
For example, when the subtitle image is superimposed onto the cropped target video, it may be placed at a preset position of the cropped video, for example centered three pixels above the cropped bottom edge.
In addition, when the target video is processed based on the subtitle information and the cropping information, a new subtitle may be generated from the recognized text content of the subtitle displayed in the subtitle display area, and the new subtitle is superimposed onto the cropped target video at a preset display position, thereby generating the processed target video. Alternatively, after the subtitle display area is identified, the whole subtitle display area may be cut out directly, i.e., the subtitle background and the subtitle are cut out together, generating a small video that carries both; this small video is then superimposed onto the cropped target video at a preset display position. The superimposition manner for displaying the subtitle information refers to the manner described above and is not repeated here.
In another possible implementation manner, after the target video is cut based on the cutting information, if the cut target video includes a partial subtitle region, the text information in the partial subtitle region in the target video may be blurred, and the subtitle information is displayed in the blurred target video in an overlapping manner.
Specifically, the blurring includes multiple image-blurring modes such as Gaussian (filter) blurring, mean (filter) blurring, median (filter) blurring and bilateral (filter) blurring, and superimposing the subtitle information on the blurred target video includes superimposing the subtitle image onto the cropped target video. The superimposition manner refers to the manner described above and is not repeated here.
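A minimal sketch of the blur-then-overlay option using the Gaussian mode from the list above; the kernel size is illustrative, and OpenCV's mean, median and bilateral filters would slot in the same way:

```python
import cv2

def blur_partial_subtitle(frame, region, subtitle_image, position):
    """Blur the residual text in a partially retained subtitle region, then
    overlay the previously captured subtitle image at a preset position."""
    x, y, w, h = region
    frame[y:y + h, x:x + w] = cv2.GaussianBlur(
        frame[y:y + h, x:x + w], (31, 31), 0)        # Gaussian blur of the region
    px, py = position
    sh, sw = subtitle_image.shape[:2]
    frame[py:py + sh, px:px + sw] = subtitle_image   # simple opaque overlay
    return frame
```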
According to the video processing method provided by the embodiments of the present disclosure, the cropping information and subtitle information corresponding to each video at each first display proportion can be determined in advance. After a video acquisition request sent by a user side is received, and when the target display proportion of the user side is detected to be a first display proportion, the target video can be processed based on the predetermined subtitle information and cropping information corresponding to the target display proportion and then sent to the user side, so that the subtitle information is displayed completely while the display proportions of different user sides are satisfied, improving the user's viewing experience.
Referring to fig. 7, which is a flowchart of a video playing method provided in the embodiment of the present disclosure, the method includes steps S701 to S702, where:
S701: responding to the playing operation of the target video and sending a video acquisition request, where the video acquisition request carries the target display proportion of the user side.
S702: and receiving and playing the processed target video, wherein the processed target video is determined according to the cutting information corresponding to the target display proportion and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
An execution subject of the video playing method provided by the embodiment of the present disclosure is generally a computer device with certain computing capability, and the computer device includes: the intelligent terminal device with the display function can be, for example, a smart phone, a tablet computer, an intelligent wearable device and the like.
For the processing procedure of the target video, reference may be made to related contents in the above video processing method, and details are not repeated here.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict execution order or any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, a video processing apparatus corresponding to the video processing method is also provided in the embodiments of the present disclosure, and since the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to the video processing method described above in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 8, there is shown a schematic architecture diagram of a video processing apparatus according to an embodiment of the present disclosure. The apparatus includes: a receiving module 801, an obtaining module 802, and a processing module 803; wherein:
a receiving module 801, configured to receive a video acquisition request, sent by a user side, for a target video; the video acquisition request carries a target display proportion of the user side;
an obtaining module 802, configured to, when it is detected that the target display ratio is the first display ratio, obtain predetermined clipping information of the target video corresponding to the target display ratio, and obtain subtitle information determined based on a color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display area cannot be completely displayed;
the processing module 803 is configured to process the target video based on the subtitle information and the cropping information, and send the processed target video to the user side.
In a possible implementation manner, the processing module 803 is further configured to determine a subtitle display area in the target video according to the following method:
sampling the target video, and determining a plurality of sampling video frames of the target video;
identifying text presentation regions in the plurality of sampled video frames;
and determining a subtitle display area in the target video based on the character display areas in the plurality of sampling video frames.
In a possible implementation manner, the processing module 803 is further configured to determine a subtitle display area in the target video according to the following method:
and inputting the target video to a pre-trained first neural network, and outputting a subtitle display area of the target video by the first neural network.
In a possible implementation manner, the cropping information includes a cropping coordinate corresponding to each video frame in the target video;
for any first display ratio, the processing module 803 is further configured to determine the cropping information of the target video at that first display ratio according to the following method:
inputting the target video and the first display ratio into a pre-trained second neural network, which outputs the cropping information of the target video at that first display ratio.
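The architecture of this second neural network is likewise unspecified; the sketch below assumes the display ratio is injected as an extra input channel and that per-frame crop coordinates are regressed in normalized form:

```python
import torch
import torch.nn as nn

class CropNet(nn.Module):
    """Toy regressor: (frame, target ratio) -> normalized crop (x1, y1, x2, y2)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32, 4), nn.Sigmoid())

    def forward(self, frame, ratio):
        # frame: (N, 3, H, W); ratio: (N,), e.g. 9/16 for a portrait screen.
        plane = ratio.view(-1, 1, 1, 1).expand(-1, 1, *frame.shape[2:])
        return self.head(self.backbone(torch.cat([frame, plane], dim=1)))
```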
In a possible implementation manner, the obtaining module 802 is further configured to determine the subtitle information of the target video according to the following method, sketched in code after the steps:
determining change video frames in the target video, i.e., video frames whose characters displayed in the subtitle display area differ from those of the adjacent video frames;
and identifying the subtitle information displayed in the subtitle display area of the change video frames.
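A sketch of locating change video frames by differencing the subtitle band between consecutive frames; the mean-absolute-difference test and its threshold are assumptions of this sketch:

```python
import cv2
import numpy as np

def find_change_frames(video_path, box, diff_thresh=12.0):
    x, y, w, h = box                          # subtitle display area
    cap = cv2.VideoCapture(video_path)
    prev, changes, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        band = cv2.cvtColor(frame[y:y + h, x:x + w],
                            cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is None or np.abs(band - prev).mean() > diff_thresh:
            changes.append(idx)               # subtitle text likely changed here
        prev = band
        idx += 1
    cap.release()
    return changes
```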
In a possible implementation manner, when obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video, the obtaining module 802 is configured to perform the following steps (a condensed code sketch follows):
acquiring continuous pixel points with the same color value in a subtitle display area of the target video;
screening the continuous pixel points based on the color difference between the continuous pixel points and other pixel points and on how the pixels at the same positions change within a preset time;
and aggregating the screened continuous pixel points, matching the aggregation result against characters stored in a character library, and determining the subtitle information of the target video based on the matching result.
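A condensed sketch of this pixel pipeline: pixels in the subtitle band that share one dominant color, contrast strongly with the local background, and stay stable over a short window are kept as candidate glyph pixels. All tolerances are assumptions, and the final match against a character library is left as a hypothetical match_glyphs() step:

```python
import cv2
import numpy as np

def subtitle_pixel_mask(frames, box, color_tol=18, contrast_min=60, stable_ratio=0.8):
    """frames: a short window of BGR frames; box: subtitle display area."""
    x, y, w, h = box
    bands = [f[y:y + h, x:x + w].astype(np.int16) for f in frames]
    ref = bands[len(bands) // 2]              # middle frame of the window
    # 1. Continuous pixels sharing one color: distance to the dominant
    #    bright color (subtitles are typically rendered in a single color).
    flat = ref.reshape(-1, 3)
    dominant = flat[flat.sum(axis=1).argmax()]
    same_color = np.abs(ref - dominant).max(axis=2) <= color_tol
    # 2. Color difference against the local background.
    gray = cv2.cvtColor(ref.astype(np.uint8), cv2.COLOR_BGR2GRAY)
    background = cv2.medianBlur(gray, 21).astype(np.int16)
    contrasty = np.abs(gray.astype(np.int16) - background) >= contrast_min
    # 3. Stability at the same pixel positions across the window.
    stable = np.mean(
        [np.abs(b - ref).max(axis=2) <= color_tol for b in bands], axis=0
    ) >= stable_ratio
    # Connected components of this mask would then be aggregated and
    # matched against a character library (hypothetical match_glyphs()).
    return same_color & contrasty & stable
```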
In a possible implementation manner, the step of obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video is executed by a third neural network;
the obtaining module is further configured to train the third neural network according to the following steps (a training-loop sketch follows):
acquiring a sample video frame with subtitle annotation information;
inputting the sample video frame into the third neural network to be trained to obtain predicted subtitle information corresponding to the sample video frame;
and training the third neural network to be trained based on the predicted subtitle information and the subtitle annotation information.
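A minimal PyTorch training loop over such annotated sample frames; the network, the data loader and the per-character cross-entropy loss are assumptions of this sketch (the disclosure does not name a loss):

```python
import torch
import torch.nn as nn

def train_subtitle_net(subtitle_net, loader, epochs=10, lr=1e-4):
    """Hypothetical trainer: loader yields (frames, per-character labels)."""
    opt = torch.optim.Adam(subtitle_net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for sample_frames, caption_labels in loader:
            logits = subtitle_net(sample_frames)       # (N, T, num_chars)
            loss = loss_fn(logits.flatten(0, 1), caption_labels.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return subtitle_net
```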
In a possible implementation manner, when processing the target video based on the subtitle information and the cropping information, the processing module 803 is configured to perform the following, sketched below:
extracting, from the target video, subtitle images corresponding to the matched continuous pixel points;
cropping the target video according to the cropping information; and overlaying the subtitle images onto the cropped target video.
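A sketch of the crop-then-overlay step on a single frame; pasting the subtitle image bottom-centred with a fixed margin is an assumption (the disclosure only requires re-overlaying):

```python
def crop_and_overlay(frame, crop_box, subtitle_img, margin=20):
    """frame/subtitle_img: numpy arrays (H, W, 3); crop_box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = crop_box
    out = frame[y1:y2, x1:x2].copy()
    oh, ow = out.shape[:2]
    sh, sw = subtitle_img.shape[:2]
    sx = max((ow - sw) // 2, 0)               # bottom-centred placement (assumed)
    sy = max(oh - sh - margin, 0)
    eh, ew = min(sh, oh - sy), min(sw, ow - sx)  # clip to the cropped frame
    out[sy:sy + eh, sx:sx + ew] = subtitle_img[:eh, :ew]
    return out
```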
In a possible implementation manner, when processing the target video based on the subtitle information and the cropping information, the processing module 803 is configured to perform the following (a sketch is given below):
after the target video is cropped based on the cropping information, if the cropped target video includes a partial subtitle area, blurring the character information in the partial subtitle area of the target video, and displaying the subtitle information overlaid on the blurred target video.
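A sketch of this fallback: the leftover subtitle strip in the cropped frame is blurred before the clean subtitle is drawn on top. The kernel size is an assumption, and cv2.putText is used only for illustration:

```python
import cv2

def blur_residual_subtitle(cropped, residual_box, subtitle_text):
    x, y, w, h = residual_box                 # leftover subtitle strip in the crop
    cropped[y:y + h, x:x + w] = cv2.GaussianBlur(
        cropped[y:y + h, x:x + w], (31, 31), 0)
    # Illustration only: cv2.putText handles ASCII; a production system
    # would composite the recovered subtitle image instead.
    cv2.putText(cropped, subtitle_text, (x, y + h - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    return cropped
```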
With the video processing apparatus provided by the embodiments of the present disclosure, the cropping information and the subtitle information corresponding to each first display ratio may be predetermined for each video. After a video acquisition request sent by a user side is received, when it is detected that the target display ratio of the user side is a first display ratio, the target video may be processed based on the predetermined subtitle information and the cropping information corresponding to the target display ratio, and the processed target video is then sent to the user side. In this way, while the display ratios of different user sides are satisfied, complete display of the subtitle information is ensured and the viewing experience of the user is improved.
Referring to fig. 9, a schematic architecture diagram of a video playing apparatus provided by an embodiment of the present disclosure is shown. The apparatus includes: a sending module 901 and a playing module 902; wherein:
a sending module 901, configured to send a video acquisition request in response to a play operation on a target video, where the video acquisition request carries a target display ratio of the user side;
and a playing module 902, configured to receive and play the processed target video, where the processed target video is determined according to the cropping information corresponding to the target display ratio and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
For the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the related description in the above method embodiments; details are not repeated here.
Based on the same technical concept, an embodiment of the present disclosure further provides a computer device. Referring to fig. 10, which is a schematic structural diagram of a computer device 1000 provided by an embodiment of the present disclosure, the computer device includes a processor 1001, a memory 1002, and a bus 1003. The memory 1002 is used for storing execution instructions and includes an internal memory 10021 and an external memory 10022. The internal memory 10021 temporarily stores operation data of the processor 1001 and data exchanged with the external memory 10022 such as a hard disk; the processor 1001 exchanges data with the external memory 10022 through the internal memory 10021. When the computer device 1000 runs, the processor 1001 and the memory 1002 communicate through the bus 1003, so that the processor 1001 executes the following instructions:
receiving a video acquisition request for a target video sent by a user side, where the video acquisition request carries a target display proportion of the user side;
when it is detected that the target display proportion is a first display proportion, obtaining predetermined cropping information of the target video corresponding to the target display proportion, and obtaining subtitle information determined based on the color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display area cannot be completely displayed;
and processing the target video based on the subtitle information and the cropping information, and sending the processed target video to the user side.
In a possible implementation manner, the instructions of the processor 1001 further include determining a subtitle display area in the target video according to the following method:
sampling the target video, and determining a plurality of sampled video frames of the target video;
identifying text display areas in the plurality of sampled video frames;
and determining a subtitle display area in the target video based on the text display areas in the plurality of sampled video frames.
In a possible implementation manner, the instructions of the processor 1001 further include determining a subtitle display area in the target video according to the following method:
inputting the target video into a pre-trained first neural network, which outputs the subtitle display area of the target video.
In a possible implementation manner, in the instructions of the processor 1001, the cropping information includes a cropping coordinate corresponding to each video frame in the target video;
for any first display proportion, determining the cropping information of the target video at that first display proportion according to the following method:
inputting the target video and the first display proportion into a pre-trained second neural network, which outputs the cropping information of the target video at that first display proportion.
In one possible implementation, in the instructions of the processor 1001, the subtitle information of the target video is determined according to the following method:
determining change video frames in the target video, i.e., video frames whose characters displayed in the subtitle display area differ from those of the adjacent video frames;
and identifying the subtitle information displayed in the subtitle display area of the change video frames.
In one possible implementation, in the instructions of the processor 1001, the obtaining of the subtitle information determined based on the color value change condition of the subtitle display area of the target video includes:
acquiring continuous pixel points with the same color value in a subtitle display area of the target video;
screening the continuous pixel points based on the color difference between the continuous pixel points and other pixel points and on how the pixels at the same positions change within a preset time;
and aggregating the screened continuous pixel points, matching the aggregation result against characters stored in a character library, and determining the subtitle information of the target video based on the matching result.
In a possible implementation manner, in the instructions of the processor 1001, the step of obtaining the subtitle information determined based on the color value change condition of the subtitle display area of the target video is performed by a third neural network;
the third neural network is obtained by training according to the following steps:
acquiring a sample video frame with subtitle annotation information;
inputting the sample video frame into the third neural network to be trained to obtain predicted subtitle information corresponding to the sample video frame;
and training the third neural network to be trained based on the predicted subtitle information and the subtitle annotation information.
In a possible implementation, in the instructions of the processor 1001, the processing of the target video based on the subtitle information and the cropping information includes:
extracting, from the target video, subtitle images corresponding to the matched continuous pixel points;
cropping the target video according to the cropping information; and overlaying the subtitle images onto the cropped target video.
In a possible implementation, in the instructions of the processor 1001, the processing of the target video based on the subtitle information and the cropping information includes:
after the target video is cropped based on the cropping information, if the cropped target video includes a partial subtitle area, blurring the character information in the partial subtitle area of the target video, and displaying the subtitle information overlaid on the blurred target video.
Alternatively, the processor 1001 is caused to execute the following instructions:
responding to the playing operation of a target video, and sending a video acquisition request, wherein the video acquisition request carries a target display proportion of a user side;
and receiving and playing the processed target video, where the processed target video is determined according to the cropping information corresponding to the target display proportion and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the video processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product carrying a program code, and the instructions included in the program code may be used to execute the steps of the video processing method described in the above method embodiments; reference may be made to the above method embodiments for details, which are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only one logical division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit the technical solutions of the present disclosure, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the technical field can still, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions of some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (14)

1. A video processing method, comprising:
receiving a video acquisition request for a target video sent by a user side, wherein the video acquisition request carries a target display proportion of the user side;
when it is detected that the target display proportion is a first display proportion, obtaining predetermined cropping information of the target video corresponding to the target display proportion, and obtaining subtitle information determined based on a color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display area cannot be completely displayed;
and processing the target video based on the subtitle information and the cropping information, and sending the processed target video to the user side.
2. The method of claim 1, further comprising determining a subtitle display area in the target video according to the following method:
sampling the target video, and determining a plurality of sampled video frames of the target video;
identifying text display areas in the plurality of sampled video frames;
and determining a subtitle display area in the target video based on the text display areas in the plurality of sampled video frames.
3. The method of claim 1, further comprising determining a subtitle display area in the target video according to the following method:
inputting the target video into a pre-trained first neural network, which outputs the subtitle display area of the target video.
4. The method of claim 1, wherein the cropping information comprises cropping coordinates corresponding to each video frame in the target video;
for any first display proportion, the method further comprises determining the cropping information of the target video at that first display proportion according to the following method:
inputting the target video and the first display proportion into a pre-trained second neural network, which outputs the cropping information of the target video at that first display proportion.
5. The method of claim 1, wherein the subtitle information for the target video is determined according to the following method:
determining change video frames in the target video, i.e., video frames whose characters displayed in the subtitle display area differ from those of the adjacent video frames;
and identifying the subtitle information displayed in the subtitle display area of the change video frames.
6. The method according to claim 1, wherein the obtaining of the subtitle information determined based on the color value change of the subtitle display area of the target video comprises:
acquiring continuous pixel points with the same color value in a subtitle display area of the target video;
screening the continuous pixel points based on the color difference between the continuous pixel points and other pixel points and on how the pixels at the same positions change within a preset time;
and aggregating the screened continuous pixel points, matching the aggregation result against characters stored in a character library, and determining the subtitle information of the target video based on the matching result.
7. The method according to claim 6, wherein the step of obtaining the subtitle information determined based on the color value variation of the subtitle display region of the target video is performed by a third neural network;
the third neural network is obtained by training according to the following steps:
acquiring a sample video frame with subtitle annotation information;
inputting the sample video frame into the third neural network to be trained to obtain predicted subtitle information corresponding to the sample video frame;
and training the third neural network to be trained based on the predicted subtitle information and the subtitle annotation information.
8. The method of claim 6, wherein the processing of the target video based on the subtitle information and the cropping information comprises:
extracting, from the target video, subtitle images corresponding to the matched continuous pixel points;
cropping the target video according to the cropping information; and overlaying the subtitle images onto the cropped target video.
9. The method according to any one of claims 1 to 8, wherein the processing of the target video based on the subtitle information and the cropping information comprises:
after the target video is cropped based on the cropping information, if the cropped target video includes a partial subtitle area, blurring the character information in the partial subtitle area of the target video, and displaying the subtitle information overlaid on the blurred target video.
10. A video playing method, comprising:
responding to the playing operation of a target video, and sending a video acquisition request, wherein the video acquisition request carries a target display proportion of a user side;
and receiving and playing the processed target video, wherein the processed target video is determined according to the cropping information corresponding to the target display proportion and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
11. A video processing apparatus, comprising:
the receiving module is configured to receive a video acquisition request for a target video sent by a user side, wherein the video acquisition request carries a target display proportion of the user side;
the obtaining module is configured to, when it is detected that the target display proportion is a first display proportion, obtain predetermined cropping information of the target video corresponding to the target display proportion, and obtain subtitle information determined based on the color value change condition of a subtitle display area of the target video; wherein the first display proportion is a display proportion at which the subtitle display area cannot be completely displayed;
and the processing module is configured to process the target video based on the subtitle information and the cropping information, and send the processed target video to the user side.
12. A video playback apparatus, comprising:
the sending module is configured to send a video acquisition request in response to a play operation on a target video, wherein the video acquisition request carries a target display proportion of the user side;
and the playing module is configured to receive and play the processed target video, wherein the processed target video is determined according to the cropping information corresponding to the target display proportion and the subtitle information determined based on the color value change condition of the subtitle display area of the target video.
13. A computer device, comprising: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, the processor and the memory communicate over the bus when the computer device is running, and the machine-readable instructions, when executed by the processor, perform the steps of the video processing method according to any one of claims 1 to 9 or the steps of the video playing method according to claim 10.
14. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the video processing method according to any one of claims 1 to 9 or the steps of the video playing method according to claim 10.
CN202110619228.9A 2021-06-03 2021-06-03 Video processing method, video playing method, video processing device, video playing device, computer equipment and storage medium Active CN113365145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110619228.9A CN113365145B (en) 2021-06-03 2021-06-03 Video processing method, video playing method, video processing device, video playing device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113365145A true CN113365145A (en) 2021-09-07
CN113365145B CN113365145B (en) 2022-11-08

Family

ID=77531779

Country Status (1)

Country Link
CN (1) CN113365145B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1956541A (en) * 2005-10-25 2007-05-02 影腾媒体科技股份有限公司 Image compression method
CN101360193A (en) * 2008-09-04 2009-02-04 北京中星微电子有限公司 Video subtitle processing apparatus and method
JP2011182008A (en) * 2010-02-26 2011-09-15 Oki Electric Industry Co Ltd Subtitle synthesizer
CN103503455A (en) * 2011-05-02 2014-01-08 华为技术有限公司 System and method for video caption re-overlaying for video adaptation and retargeting
CN102724458A (en) * 2012-06-18 2012-10-10 深圳Tcl新技术有限公司 Video picture full-screen display subtitle processing method and video terminal
CN103731615A (en) * 2012-10-11 2014-04-16 晨星软件研发(深圳)有限公司 Display method and display device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666649A (en) * 2022-03-31 2022-06-24 北京奇艺世纪科技有限公司 Subtitle cut video identification method and device, electronic equipment and storage medium
CN114666649B (en) * 2022-03-31 2024-03-01 北京奇艺世纪科技有限公司 Identification method and device of subtitle cut video, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113365145B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
US10368123B2 (en) Information pushing method, terminal and server
CN108496368B (en) Dynamic video overlay
KR100746641B1 (en) Image code based on moving picture, apparatus for generating/decoding image code based on moving picture and method therefor
WO2019018434A1 (en) Actor/person centric auto thumbnail
CN111131876B (en) Control method, device and terminal for live video and computer readable storage medium
CN110136166B (en) Automatic tracking method for multi-channel pictures
CN110446093A (en) A kind of video progress bar display methods, device and storage medium
CN111988672A (en) Video processing method and device, electronic equipment and storage medium
CN112633313B (en) Bad information identification method of network terminal and local area network terminal equipment
CN115396705B (en) Screen operation verification method, platform and system
US8611698B2 (en) Method for image reframing
CN111401238A (en) Method and device for detecting character close-up segments in video
CN113365145B (en) Video processing method, video playing method, video processing device, video playing device, computer equipment and storage medium
US20170161875A1 (en) Video resolution method and apparatus
CN109068150A (en) A kind of excellent picture extracting method, terminal and the computer-readable medium of video
CN112822539B (en) Information display method, device, server and storage medium
CN111626922B (en) Picture generation method and device, electronic equipment and computer readable storage medium
CN115063800B (en) Text recognition method and electronic equipment
CN110941728B (en) Electronic file processing method and device
CN114881889A (en) Video image noise evaluation method and device
CN110100445B (en) Information processing system, information processing apparatus, and computer readable medium
CN112752110A (en) Video presentation method and device, computing equipment and storage medium
Fearghail et al. Use of saliency estimation in cinematic vr post-production to assist viewer guidance
CN116962782A (en) Media information display method and device, storage medium and electronic equipment
US11706478B2 (en) Device and method for generating and rendering enriched multimedia content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder
  Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.
  Patentee after: Tiktok vision (Beijing) Co.,Ltd.
  Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.
  Patentee before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.
  Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.
  Patentee after: Douyin Vision Co.,Ltd.
  Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.
  Patentee before: Tiktok vision (Beijing) Co.,Ltd.
TR01 Transfer of patent right
  Effective date of registration: 20230712
  Address after: 100190 1309, 13th floor, building 4, Zijin Digital Park, Haidian District, Beijing
  Patentee after: Beijing volcano Engine Technology Co.,Ltd.
  Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.
  Patentee before: Douyin Vision Co.,Ltd.