CN115474088A - Video processing method, computer equipment and storage medium - Google Patents

Video processing method, computer equipment and storage medium

Info

Publication number
CN115474088A
CN115474088A (application CN202211094827.4A)
Authority
CN
China
Prior art keywords
information
text
video data
target
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211094827.4A
Other languages
Chinese (zh)
Inventor
张悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202211094827.4A priority Critical patent/CN115474088A/en
Publication of CN115474088A publication Critical patent/CN115474088A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Abstract

The application discloses a video processing method, computer equipment and a storage medium, wherein the method comprises the following steps: in response to a view angle conversion event for first video data, acquiring audio data of the first video data, processing the audio data, and determining target text information of the audio data; performing text recognition on the first video data to obtain text recognition information; if the similarity between the target text information and the text recognition information is smaller than a set threshold, determining target video data based on the target text information and second video data, wherein the second video data is obtained by performing visual conversion on the first video data; and outputting the target video data. By this method, subtitles can be added to a video that has no subtitles, enriching the video content.

Description

Video processing method, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method, a computer device, and a computer-readable storage medium.
Background
With the rapid development of information technology, video content has grown rapidly. Mobile terminals are more popular than non-mobile terminals, and people like to listen to songs and watch videos on them; compared with videos in a horizontal format, people therefore prefer videos in a vertical format, which can be operated and watched with one hand.
For music videos, how to enrich the displayed content when switching between the horizontal format and the vertical format is one of the current research hotspots.
Disclosure of Invention
The embodiment of the application provides a video processing method, computer equipment and a storage medium, which can enrich video contents in the process of converting the visual angle of video data.
In a first aspect, an embodiment of the present application discloses a video processing method, including:
acquiring audio data of first video data in response to a view angle conversion event of the first video data;
processing the audio data to obtain target text information of the audio data;
performing text recognition on the first video data to obtain text recognition information;
performing condition detection according to the target text information and the text recognition result;
if the similarity between the target text information and the text identification information is smaller than a set threshold value, determining target video data based on the target text information and second video data, wherein the second video data is obtained by performing visual conversion on the first video data;
and outputting the target video data.
In a second aspect, an embodiment of the present application discloses a video processing apparatus, including:
an acquisition unit configured to acquire audio data of first video data in response to a view angle conversion event for the first video data;
the processing unit is used for processing the audio data to obtain target text information of the audio data; performing text recognition on the first video data to obtain text recognition information;
a determining unit, configured to determine target video data based on the target text information and second video data if the similarity between the target text information and the text identification information is smaller than a set threshold, where the second video data is obtained by performing visual conversion on the first video data;
an output unit for outputting the target video data.
In a third aspect, an embodiment of the present application discloses a computer device, which includes a processor adapted to implement one or more computer programs; and a computer storage medium storing one or more computer programs adapted to be loaded by the processor and to execute the video processing method as described above.
In a fourth aspect, the present application discloses a computer readable storage medium storing one or more computer programs adapted to be loaded by a processor and to perform the video processing method described above.
In a fifth aspect, an embodiment of the present application discloses a computer program product, which includes a computer program, and the computer program is stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the video processing method described above.
In the embodiment of the application, the computer equipment responds to the visual angle conversion event of the first video data, acquires the audio data of the first video data, and processes the audio data to obtain the target text information of the audio data; then, performing text recognition on the first video data to obtain text recognition information; if the similarity between the target text information and the text identification information is smaller than a set threshold, determining target video data based on the target text information and second video data, wherein the second video data is obtained by performing visual conversion on the first video data; and finally outputting the target video data. According to the embodiment of the application, whether subtitles need to be added to the video is determined by utilizing the set threshold, and the processing efficiency is improved. Therefore, the video without subtitles can be added with subtitles under the condition that the video is subjected to visual conversion, so that the video content is enriched.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a network architecture diagram of a video processing system according to an embodiment of the present application;
FIG. 2a illustrates a flat video disclosed in an embodiment of the present application;
FIG. 2b illustrates a portrait video disclosed in an embodiment of the present application;
FIG. 2c illustrates a portrait video as disclosed in an embodiment of the present application;
fig. 3 is a schematic flow chart of a video processing method disclosed in the embodiment of the present application;
FIG. 4a is a diagram illustrating a network architecture of a text detection network disclosed in an embodiment of the present application;
FIG. 4b illustrates a network architecture diagram of a text recognition network as disclosed in an embodiment of the present application;
FIG. 5a illustrates a first manner of displaying textual information;
FIG. 5b illustrates a second manner of displaying textual information;
FIG. 5c shows a third way of displaying text information;
FIG. 5d illustrates a fourth manner of displaying textual information;
FIG. 5e illustrates a fifth manner of displaying textual information;
FIG. 5f illustrates a sixth manner of displaying textual information;
FIG. 5g illustrates a seventh manner of displaying textual information;
FIG. 5h illustrates an eighth manner of displaying textual information;
FIG. 6 is a schematic flow chart diagram of another video processing method disclosed in the embodiments of the present application;
FIG. 7 is a schematic structural diagram of a video rendering module according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to make the pictures after the video visual angle conversion richer, the embodiment of the application provides a video processing method, which can add subtitles to music videos without subtitles in the process of converting the video visual angle, wherein the subtitles can be lyrics, song information and the like. The video processing method provided by the embodiment of the application can be realized based on an AI (Artificial Intelligence) technology. AI refers to a theory, method, technique and application system that simulates, extends and expands human intelligence, senses the environment, acquires knowledge and uses knowledge to obtain the best results using a digital computer or a machine controlled by a digital computer. The AI technology is a comprehensive subject, and the related fields are wide; the video processing method provided by the embodiment of the present application mainly relates to a Machine Learning (ML) technique in the AI technique. Machine learning generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
In a possible embodiment, the video processing method provided in the embodiment of the present application may also be implemented based on cloud technology and/or blockchain technology. In particular, the method can relate to one or more of cloud storage, cloud databases and big data in cloud technology. For example, data (e.g., the first video data) required to execute the video processing method is acquired from a cloud database. For another example, the data required to perform the video processing method may be stored in the form of blocks on a blockchain; data (such as target text information, text recognition results, target video data and the like) generated by executing the video processing method can be stored on a blockchain in the form of blocks; in addition, the data processing apparatus performing the video processing method may be a node apparatus in a blockchain network.
Referring to fig. 1, fig. 1 is a network architecture diagram of a video processing system disclosed in an embodiment of the application, and as shown in fig. 1, the video processing system 100 may include at least a plurality of terminal devices 101 and a computer device 102, where the terminal devices 101 and the computer device 102 may implement a communication connection, and the connection manner may include a wired connection and a wireless connection, which is not limited herein. In a specific implementation process, the terminal device 101 is mainly configured to respond to a user operation and display a result of related data, in this application, for example, target video data is displayed on the terminal device 101 (the target video data may be displayed on the terminal device 101 in a file form, or the target video data may be played on the terminal device 101), or a view angle conversion event of the user on the first video data is responded; the computer device 102 is equivalent to a data processing device, and is mainly configured to acquire first video data, process the first video data to obtain corresponding target video data, and return the target video data to the terminal device 101.
In a possible implementation manner, the above-mentioned terminal device 101 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, an aircraft, and the like; the computer device 102 may be a server, which may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. Fig. 1 is a network architecture diagram for schematically representing a video processing system, and is not limited thereto. For example, the computer device 102 in fig. 1 may be deployed as a node in a blockchain network, or the computer device 102 may access the blockchain network, so that the computer device 102 may upload target video data to the blockchain network for storage, to prevent internal data from being tampered, thereby ensuring data security.
With reference to the video processing system, the video processing method according to the embodiment of the present application may generally include: the computer equipment responds to a visual angle conversion event of the first video data, acquires audio data of the first video data, processes the audio data and determines target text information of the audio data; then, performing text recognition on the first video data to obtain text recognition information; determining whether subtitles exist in the first video data or not according to the target text information and the text identification information, if the similarity between the target text information and the text identification information is smaller than a set threshold (namely, judging that the subtitles do not exist in the first video data), determining the target video data based on the target text information and the second video data, wherein the second video data is obtained by performing visual conversion on the first video data, namely, the subtitles are added to the second video by using the target text information, and the second video data is obtained by performing visual conversion on the first video data. Therefore, according to the embodiment of the application, whether the video has the subtitles or not is judged, and then the subtitles are added to the video without the subtitles under the condition that the video is subjected to visual conversion, so that the video content is enriched.
In the embodiments of the present application, the first video data, the audio data, the text identification information, and other related data are referred to, and all the data are authorized by the user. When the above embodiments of the present application are applied to specific products or technologies, the data involved in the use needs to be approved or approved by users, and the collection, use and processing of the relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
Before the application is explained, the horizontal version video and the vertical version video are introduced. They are two different video playing modes; the most intuitive difference is the way they occupy the screen of the device: the aspect ratio of a horizontal version video is generally 16:9, while that of a vertical version video is generally 9:16. The application scenario here is the conversion of a horizontal version video into a vertical version video, so that a user can have a better viewing experience without turning over the device. For example, fig. 2a is a horizontal version video, and fig. 2b and fig. 2c are vertical version videos; fig. 2b and fig. 2c show two ways currently in common use for converting a horizontal version video into a vertical version video. The vertical version video displayed in fig. 2b has large black areas at the top and bottom, so the viewing experience is poor; fig. 2c, although without blank areas, crops the content of the original video, so the original video is not displayed completely. Therefore, in order to enrich the vertical version video content without reducing the original video content, the application provides a video processing method that can achieve this effect. Referring to fig. 3, which is a schematic flow chart of a video processing method disclosed in the embodiment of the present application and which can be executed by a computer device, the video processing method includes, but is not limited to, steps S301 to S305:
s301: in response to a view angle conversion event for the first video data, audio data of the first video data is acquired.
In the present application, the visual conversion refers to converting the display view of the video content, for example from a first view angle to a second view angle, and generally from a landscape video to a portrait video. The first video data of the present application may be a horizontal version video. The first video data generally includes audio data and video data, and the audio data may contain various sounds, such as human voice, musical instrument sounds, and the like. Specifically, the first video data may include video data shot and edited by the user, video data downloaded by the user over the network, and video data selected by the user directly online over the network.
In one possible implementation, after the user triggers a view angle conversion event, the computer device acquires audio data of the first video data in response to the view angle conversion event. In processing music video, the audio data may refer to a song in the video, such as background music carried by the first video data itself, and may also include music added by the user for the first video data. Optionally, the audio data may be a complete song, or may be one or more segments of a corresponding complete song, or may be segments of a plurality of songs. In a possible embodiment, the view angle transition event may be triggered by the user by clicking a transition button in the terminal device.
S302: and processing the audio data to obtain target text information of the audio data.
In a possible implementation manner, firstly, audio data is converted into voice frequency spectrum information, then fingerprint information to be identified of the audio data is determined based on a peak point in the voice frequency spectrum information, then the fingerprint information to be identified is matched with a fingerprint database so as to determine target fingerprint information corresponding to the fingerprint information to be identified and target song attribute information corresponding to the target fingerprint information, and finally target text information of the audio data is determined based on the target song attribute information.
Alternatively, the voice spectrum information may be a speech spectrogram. The speech spectrum information includes two dimensions, namely a time dimension and a frequency dimension; that is, it records the correspondence between each time point of the audio data and the frequencies of the audio data. If the audio data is a song, a peak point in the speech spectrum information represents the most representative frequency value of the song at a given moment, and each peak point corresponds to a mark (f, t) consisting of a frequency and a time, which can be understood as a coordinate. The marks corresponding to a peak point and its n adjacent peak points may constitute a set of adjacent peak points, which may then be encoded into an audio fingerprint using a form of hash encoding. The set of adjacent peak points may be selected by taking any peak point in the speech spectrogram as the center of a circle and a preset distance threshold as the radius to determine a coverage range, and combining all peak points within the coverage range whose time points are later than that of the circle center into the set of adjacent peak points. The adjacent peak points thus only include peak points that are within a certain range and whose time points are later than that of the circle center, i.e. the peak points after the circle center, so that repeated sub-fingerprints can be avoided. Of course, the set of adjacent peak points may be selected according to other strategies, which are not limited herein.
For example, if a peak point serving as the center of a circle is represented as (f0, t0), and its n adjacent peak points are represented as (f1, t1), (f2, t2), …, (fn, tn), then (f0, t0) is combined with each adjacent peak point to obtain pairs of combination information, such as (f0, f1, t1-t0), (f0, f2, t2-t0), …, (f0, fn, tn-t0); each pair of combination information is then encoded into a sub-fingerprint in the form of hash coding, and all the sub-fingerprints are combined as the fingerprint information to be identified of the audio data.
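As an illustration only, the following Python sketch mirrors the peak-pairing and hash-encoding scheme described above; the function name, the SHA-1 hashing, and the fixed fan-out used instead of the radius-based neighbour selection are all assumptions made for this example.

```python
import hashlib

def build_fingerprint(peaks, fan_out=5):
    """peaks: list of (frequency, time) tuples sorted by time.
    Each peak (f0, t0) is paired with up to `fan_out` later peaks and each
    (f0, f1, t1 - t0) triple is hashed into a sub-fingerprint."""
    sub_fingerprints = []
    for i, (f0, t0) in enumerate(peaks):
        # only peaks occurring after the anchor are paired, which avoids
        # generating repeated sub-fingerprints
        for f1, t1 in peaks[i + 1:i + 1 + fan_out]:
            triple = f"{f0}|{f1}|{t1 - t0}".encode()
            sub_fingerprints.append(hashlib.sha1(triple).hexdigest()[:20])
    return sub_fingerprints  # together these form the fingerprint information to be identified
```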
In a possible embodiment, the target text information of the audio data is determined based on the target song attribute information, and the implementation process may include: acquiring the time length of the audio data and the first time of the audio data in the target song, wherein the first time is the starting time of the audio data; adding the time length and the first time to obtain a second time; the method comprises the steps of determining lyric information of a target song based on target song attribute information, wherein the target song attribute information can comprise a unique identifier of the target song, obtaining the lyric information corresponding to the target song through the unique identifier, analyzing characters from a first time to a second time in the lyric information, and determining target text information of audio data according to the analyzed characters. It can be understood that the lyric information of the target song corresponds to a dictionary with many characters, after the time length of the audio data and the start time (first time) and the end time (second time) of the audio data in the target song are determined, the characters in the corresponding time period are directly obtained from the dictionary, and the characters and the time information corresponding to each character are used as the target text information of the audio data.
The first time refers to the start time of the audio data in the corresponding target song. For example, if the time length of the audio data of the first video data is 2 minutes, the time length of the target song is 4 minutes, and the audio data starts from the 1st minute of the target song, then the first time of the audio data in the target song is the 1st minute (and the second time is the 3rd minute). The time length of the first video data and the length of the audio data may not be the same; for example, some videos have no audio at the beginning, and the audio only starts a certain time after playback begins.
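For illustration, a minimal sketch (under the assumption that the matched song's lyrics are available as word-level entries with millisecond timestamps) of taking the words that fall between the first time and the second time as the target text information:

```python
def extract_target_text(lyrics, first_time_ms, clip_length_ms):
    """lyrics: list of {'word': str, 'start': int, 'duration': int} entries in
    milliseconds, sorted by start time (the 'dictionary' of the target song)."""
    second_time_ms = first_time_ms + clip_length_ms        # end of the clip inside the song
    return [entry for entry in lyrics
            if first_time_ms <= entry['start'] < second_time_ms]
```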
S303: and performing text recognition on the first video data to obtain text recognition information.
In a possible implementation manner, the text recognition may be performed by a text detection recognition model, where the text detection recognition model includes a text detection network and a text recognition network, and in a specific implementation process, the text detection network is first invoked to perform detection processing on the first video data to obtain a text detection result; if the text detection result indicates that the first video data has text information, analyzing the text detection result to obtain a text region image; then, calling a text recognition network to perform text recognition on the text area image to obtain text recognition information; and if the text detection result indicates that the first video data does not have text information, in this case, characters do not need to be recognized, and the text detection result is directly used as the text recognition information of the first video data.
Further, the text detection network includes three modules, which are respectively a feature extraction module composed of Compact attachment blocks, a feature enhancement module composed of a plurality of adaptive RNN networks, and a Box generation and edge Refinement post-processing module, and fig. 4a shows a network architecture diagram of the text detection network, based on these three modules, first performing frame extraction processing on first video data to obtain a video frame image sequence, where the video frame image sequence may include each frame of video frame image of the first video data, or may be a part of video frame image obtained by sampling the first video data by sparse sampling, performing feature extraction on each video frame image by using the feature extraction module, performing enhancement processing on the extracted features by using the enhancement module to enhance the detection function of the characters, and then detecting the enhanced features by using the post-processing module to obtain a text detection result. The text detection result here includes one of presence text information and absence text information.
If the text detection result indicates that the first video data has text information, further, the text identification network is required to identify the detected text region image to obtain text identification information. The text Recognition network is added with a transverse asymmetric convolution and feature module integrating multiple scale receptive fields, so that the support of the network on multi-scale fonts is enhanced, and meanwhile, a Fine-grained convolution method in a Fine-grained image Recognition (Fine-grained Recognition) module is added, so that the image feature extraction under the conditions of similar characters, blurring and the like is effectively enhanced. Fig. 4b shows a network architecture diagram of a text recognition network, by which text information in a text region image can be recognized and output in the form of text.
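The detect-then-recognize flow described above can be outlined as follows; `TextDetector` and `TextRecognizer`-style objects are hypothetical stand-ins for the text detection network and the text recognition network, and the sparse sampling step and PIL-style frame objects are assumptions of this sketch.

```python
def recognize_video_text(frames, detector, recognizer, sample_step=30):
    """frames: decoded video frame images (PIL-style, supporting crop()).
    Returns a list of (frame_index, recognized_text); an empty list plays the
    role of a text detection result indicating that no text information exists."""
    results = []
    for idx in range(0, len(frames), sample_step):      # sparse frame sampling
        boxes = detector.detect(frames[idx])            # text detection -> region boxes
        for box in boxes:
            region = frames[idx].crop(box)              # cut out the text region image
            results.append((idx, recognizer.recognize(region)))
    return results
```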
Optionally, when performing text detection and recognition, the content of the text information in the first video data is complex, and includes not only the lyric subtitle but also other content, such as a video introduction, watermark information, and the like. The text message shown in fig. 5a is a text introduction displayed when the first video data starts to be played, or when the first video data reaches the end of the title; as shown in fig. 5b, the text information is a watermark text and lyrics caption; the textual information shown in FIG. 5c is merely a watermark; the text information shown in fig. 5d is a text in the background of the video, and the three words, i.e., "my home", are background texts. In addition, the text information in the first video data may appear at different positions, or may be presented in different forms, for example, the text information shown in fig. 5e is on the left side of the video; FIG. 5f shows the text information to the right of the video; FIG. 5g shows the text information in the middle of the video; the text information shown in fig. 5h is presented in bilingual (multiple lines) form.
S304: and if the similarity between the target text information and the text identification information is smaller than a set threshold value, determining target video data based on the target text information and second video data, wherein the second video data is obtained by performing visual conversion on the first video data.
It is first determined whether the first video data already carries subtitle information: if the first video data has no subtitles, subtitles are rendered; if it already has subtitles, the video is directly subjected to visual conversion and the existing subtitle information is displayed. Specifically, the target text information is compared with the text identification information: the text information contained in the text identification information is determined, similarity calculation is carried out between the target text information and that text information to obtain the similarity between the target text information and the text identification information, and the similarity is then compared with a set threshold. It should be noted that when determining the text information in the text identification information there are two cases. One is that text information exists in the text identification information (in this process, the text information existing in the first video data is identified from each video frame image, and the identified text information is sorted according to the video playing time, which facilitates the comparison between the two); the other is that the text information contained in the text identification information is empty (i.e. there is no text information). In the latter case, when the similarity calculation between the target text information and the text information in the text identification information is performed, the obtained similarity is 0. In the former case, similarity calculation is performed between the text information in the text identification information and the target text information, and the obtained similarity may also be 0; a similarity of 0 indicates that the target text information is completely different from the text information in the text identification information. For example, when the text information existing in the first video data is English while the target text information is Chinese, the similarity between the target text information and the text identification information is 0. Other situations are also possible; for example, the text information existing in the first video data may be an advertising slogan in the background of a certain video frame, in which case the obtained similarity may also be 0. The cases giving a similarity of 0 are not exhaustively listed here.
And if the similarity between the target text information and the text identification information is smaller than a set threshold value, the first video data is considered to have no subtitles. And if the similarity between the target text information and the text identification information is greater than or equal to a set threshold value, the first video data is considered to have subtitles.
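A compact sketch of this decision, assuming a string-similarity measure; the patent does not prescribe a specific metric or threshold value, so the use of difflib and the 0.6 default here are illustrative assumptions only.

```python
import difflib

def has_subtitles(target_text, recognized_text, threshold=0.6):
    """Judge whether the first video data already carries subtitles by comparing
    the lyric text (target text information) with the recognized on-screen text."""
    if not recognized_text:                 # no text information was recognized
        return False                        # similarity treated as 0
    similarity = difflib.SequenceMatcher(None, target_text, recognized_text).ratio()
    return similarity >= threshold          # below the threshold -> subtitles must be rendered
```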
In a possible implementation manner, if the similarity between the target text information and the text identification information is smaller than a set threshold, the target video data is determined based on the target text information and second video data, where the second video data is obtained by performing visual conversion on the first video data.
The target text information comprises one or more words of a target song corresponding to the audio data, and the start time and the duration of each word. Based on the above, each frame of image of the first video data is subjected to visual conversion to obtain each frame of image after conversion, each frame of image after conversion is determined as second video data, then a text image of each frame of image of the second video data is determined based on the target text information, that is, text information of corresponding time is obtained from the target text information according to the time of each frame of image, then the text information forms a text image, each frame of image in the second video data is operated in the same way to obtain the text image of each frame of image, and then the target video data is determined based on the text image of each frame of image and each frame of image of the second video data.
In a possible embodiment, the target video data may be determined using a rendering module. In the present application, the rendering module is mainly composed of three parts: a font configuration file, a spatio-temporal parser and a shader rendering module. The font configuration file is a json text; it may configure the font, size, color, inter-character spacing, stroke effect (stroke size and color), shadow effect (shadow radius, offset and color) and maximum length of a single line of characters used for the target text information. With the font configuration file, the target text information can be drawn line by line into one or more corresponding text images. The spatio-temporal parser is used to parse the spatio-temporal information of the text images, which specifically includes information such as the position of each word within the text image (convenient for word-by-word animation), the playing progress, the line number of the displayed text, and the like. Finally, each video frame image, each text image and the corresponding parsed spatio-temporal information are passed together to the shader rendering module so that the final target video data is synthesized. The view angle of the final target video data is changed compared with that of the first video data; in the application, a horizontal version video is converted into a vertical version video, and the vertical version video also contains subtitle information.
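A possible shape of such a json font configuration and of the frame-by-frame composition step is sketched below; all field names and the `render_frame` compositor are assumptions made for illustration, not the configuration actually used by the application.

```python
import json

# assumed layout of the json font configuration file
FONT_CONFIG = json.loads("""{
    "font": "NotoSansCJK", "size": 42, "color": "#FFFFFF",
    "letter_spacing": 2,
    "stroke": {"size": 2, "color": "#000000"},
    "shadow": {"radius": 4, "offset": [0, 2], "color": "#00000080"},
    "max_chars_per_line": 14
}""")

def compose_target_video(frames, lyric_images, frame_to_line, render_frame):
    """frames: vertically converted frames of the second video data;
    lyric_images: one text image per lyric line, drawn according to FONT_CONFIG;
    frame_to_line: maps a frame index to the lyric line active at that moment (or None);
    render_frame: shader-style compositor taking (frame, text_image)."""
    target = []
    for i, frame in enumerate(frames):
        line = frame_to_line.get(i)
        target.append(frame if line is None else render_frame(frame, lyric_images[line]))
    return target
```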
In one possible implementation, the format of the lyric information may be: "[ start time, duration ], ith lyric content", where the start time is the starting time position of the lyric in the target song and the duration is the time taken by the sentence when it is played. For example, { [0000, 0450] first sentence lyrics, [0450, 0500] second sentence lyrics, [0950, 0700] third sentence lyrics, [1650, 0500] fourth sentence lyrics }. Where "0000" in "[0000, 0450] first sentence lyric" means that "first sentence lyric" is started from 0 ms of the target song, and "0450" means that "first sentence lyric" lasts 450 ms. "0450" in "[0450, 0500] second sentence lyric" means that "second sentence lyric" starts from 450 milliseconds of the target song, and "0500" means that "second sentence lyric" lasts for 500 milliseconds. And by analogy, the meaning of the following two lyrics is the same as that of the contents in the [0000, 0450] first lyric "and the [0450, 0500] second lyric", and the description is omitted here.
In one possible implementation, the format of the lyric information may also be: "[start time, duration] first word (start time, duration) second word (start time, duration) …" for a sentence of lyrics, wherein the start time within "[ ]" represents the start time of the sentence of lyrics in the entire song, the duration within "[ ]" represents the time taken when that sentence is played, the start time within "( )" represents the start time of a word in that sentence, and the duration within "( )" represents the time taken when that word is played. For example, a song includes the lyric "but also remember your smile", which corresponds to the lyric format: [264,2686] but (264,188) also (453,268) remembers (721,289) to (1009,328) you (1337,207) for (1545,391) laugh (1936,245) and (2181,769). The 264 in the square brackets indicates that the starting time of this sentence of lyrics in the whole song is 264 ms, and 2686 indicates that the sentence takes 2686 ms when played. Taking the word "smile" as an example (the pair (1936,245) in the listing), its start time in the whole song is 1936 ms, and it takes 245 ms when the lyric "but also remember your smile" is played. The other words are interpreted in the same manner and will not be described in further detail here.
In one possible implementation, the format of the lyric information may also be: "(start time, duration) word". The start time in the parentheses represents the start time of a certain word in the target song, and the duration in the parentheses represents the time occupied by the word when it is played. For example, a song includes the lyric "but also remember your smile", which corresponds to the lyric format: (264,188) but (453,268) also (721,289) remembers (1009,328) to (1337,207) you (1545,391) for (1936,245) laugh (2181,769). The "264" in the first parenthesis indicates that the word "but" starts at 264 ms in the target song, and the "188" in the first parenthesis indicates that the word "but" takes 188 ms when played.
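As an illustration of this "(start time, duration) word" layout, the small parser below turns such a line into word-level timing tuples; the regular expression and the returned tuple layout are assumptions of this sketch, not a format mandated by the application.

```python
import re

WORD_PATTERN = re.compile(r"\((\d+),(\d+)\)\s*([^(]+)")

def parse_word_lyrics(line):
    """Parse '(264,188) but (453,268) also ...' into
    [(word, start_ms, duration_ms), ...]."""
    return [(word.strip(), int(start), int(duration))
            for start, duration, word in WORD_PATTERN.findall(line)]

# parse_word_lyrics("(264,188) but (453,268) also")
# -> [('but', 264, 188), ('also', 453, 268)]
```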
In another possible implementation manner, if the similarity between the target text information and the text identification information is greater than or equal to the set threshold, that is, it indicates that the first video data originally includes caption information, and in the case of a music video, that is, it indicates that the music video has lyric captions, in this case, the first video data is visually converted directly in response to a view angle conversion event for the first video data, so as to obtain second video data, where the first video data is displayed at a first view angle and the second video data is displayed at a second view angle; and taking the second video data as target video data, outputting the target video data, and synchronously displaying the subtitle information in the first video data.
S305: and outputting the target video data.
In one possible implementation, the manner of outputting the target video data includes one or both of the following: one method is to store target video data in a local end or a cloud end of a terminal device, and the other method is to play the target video data in the terminal device.
In a feasible embodiment, the attribute information of the target song may also be analyzed, the song name and the song performer of the target song are determined, a song information image is generated based on the song name and the song performer, the song information image is added to the target video data, the display position may be specifically set, and the display time may also be set, for example, the song information image and the first frame image of the target video may be spliced together, or the song information image may be displayed at a fixed position all the time in the process of playing the target video data. If the song information is too long, the song information can be divided into a plurality of lines to be displayed by utilizing the font configuration file.
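As an illustration only, a minimal sketch (assuming PIL-style frames and Pillow's default font) of overlaying the song name and performer on a frame of the converted video, wrapped according to a configured maximum line length:

```python
from PIL import ImageDraw
import textwrap

def draw_song_info(frame, title, artist, max_chars=14):
    """Overlay 'title - artist' near the top of a (vertical) video frame;
    position, colour and font here are illustrative choices only."""
    canvas = frame.copy()
    draw = ImageDraw.Draw(canvas)
    text = textwrap.fill(f"{title} - {artist}", width=max_chars)  # split long song info into lines
    draw.text((40, 60), text, fill="white")
    return canvas
```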
In the embodiment of the application, the computer equipment responds to a view angle conversion event of the first video data, acquires the audio data of the first video data, processes the audio data and determines the target text information of the audio data; then, text recognition is performed on the first video data to obtain a text recognition result; condition detection is performed according to the target text information and the text recognition result, namely it is determined whether subtitles exist in the first video data; if the condition detection is passed (namely, no subtitles exist in the first video data), the target video data is determined based on the target text information and the second video data, that is, subtitles are added to the second video data by using the target text information, the second video data being obtained by performing visual conversion on the first video data. The condition detection comprises: the text recognition result indicates that no text information exists in the first video data, or it is determined according to the target text information that the text information which the text recognition result indicates exists in the first video data meets a preset condition. Finally, the target video data is output. By the embodiment of the application, subtitles can be added to a video without subtitles when the video undergoes visual conversion, thereby enriching the video content.
Based on the above explanation, the video processing method can be summarized. Please refer to fig. 6, which is a schematic flow chart of another video processing method disclosed in the embodiment of the present application. Specifically, horizontal version videos are used as the first video data: a horizontal version video 601 with subtitles and a horizontal version video 602 without subtitles. Audio data 1 is extracted from the horizontal version video 601 and audio data 2 is extracted from the horizontal version video 602; fingerprint information 1 to be identified of the audio data 1 and fingerprint information 2 to be identified of the audio data 2 are determined; fingerprint matching is performed on the fingerprint information 1 to be identified and the fingerprint information 2 to be identified, a song id1 corresponding to the audio data 1 and a song id2 corresponding to the audio data 2 are determined, and the lyric information of the audio data 1 and the lyric information of the audio data 2 are determined based on the song ids. Key frames of the horizontal version video 601 and the horizontal version video 602 are extracted, and it is detected whether text information exists in the horizontal version video 601 and the horizontal version video 602; if so, the specific text information is determined, the existing text is compared with the lyric information of the audio data, and it is detected whether subtitle information exists. The video is first rendered vertically to obtain a vertical version video, and then lyric rendering is performed according to the specific subtitle situation. As can be seen from fig. 6, the horizontal version video 601 has text information, and the detected text information is the lyric information of the audio data 1, so lyric rendering is not performed and the lyric information is directly displayed on the vertical version video, as shown in 603; the horizontal version video 602 has no text, so lyric rendering is required, and the lyric information of the audio data 2 is added, by using the rendering module, to the vertical version video converted from the horizontal version video, as shown in 604.
Optionally, the rendering module may also be illustrated by a diagram. As shown in fig. 7, which is a schematic diagram of the rendering module according to this embodiment of the present application, it includes a font configuration file 701, a spatio-temporal parser 702, and a renderer 703. When the first video data is a music video and the song corresponding to the audio data in the first video data is determined to have no subtitles, the target text information (lyric information) is analyzed by the spatio-temporal parser 702 to obtain corresponding text images, where each text image comprises a sentence of lyrics, together with the spatial and temporal information of the words in that sentence; meanwhile, the font configuration file 701 is used to configure the font information for the parsed text images, the song information image (which does not need to be parsed) and the video frame sequence of the second video data (which has undergone the visual conversion); finally, the renderer 703 renders the text images, the song information image and the video frame sequence of the second video data frame by frame to obtain the target video data. It should be noted that the first video data may be a horizontal version video and the second video data may be a vertical version video; the font configuration file may also be applied directly to the video frame sequence of the first video data, with the visual conversion then performed by the renderer.
In a feasible embodiment, when the visual conversion is carried out, the application uses the blurred background to replace the black background, does not cut off the content of the original video, and can also supplement the corresponding content (such as song information, including the name of the song and the singer of the song) for the video and add real-time lyric information to the content without lyrics. Therefore, when the video visual angle is converted, the video content can be enriched and the audience can be attracted.
Based on the above method embodiment, the embodiment of the present application further provides a video processing apparatus. Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus 800 shown in fig. 8 may include an acquisition unit 801, a processing unit 802, a determination unit 803, and an output unit 804, specifically:
an acquisition unit 801 configured to acquire audio data of first video data in response to a view angle conversion event for the first video data;
the processing unit 802 is configured to process the audio data to obtain target text information of the audio data, and perform text recognition on the first video data to obtain text recognition information;
a determining unit 803, configured to determine, if the similarity between the target text information and the text identification information is smaller than a set threshold, target video data based on the target text information and second video data, where the second video data is obtained by performing visual conversion on the first video data;
an output unit 804, configured to output the target video data.
In a possible implementation manner, the determining unit 803 is further configured to determine text information in the text identification information;
the processing unit 802 is further configured to perform similarity calculation on the target text information and the text information in the text identification information to obtain similarity between the target text information and the text identification information; and comparing the similarity between the target text information and the text identification information with a set threshold value.
In a possible implementation manner, the processing unit 802 is further configured to, if the similarity between the target text information and the text identification information is greater than or equal to the set threshold, perform visual conversion on the first video data to obtain second video data, where the first video data is displayed at a first viewing angle, and the second video data is displayed at a second viewing angle;
the output unit 804 is further configured to take the second video data as target video data and output the target video data.
In a possible implementation manner, when the processing unit 802 processes the audio data to obtain target text information of the audio data, the processing unit is specifically configured to:
converting the audio data into voice spectrum information;
determining fingerprint information to be identified corresponding to the audio data based on a peak point in the voice frequency spectrum information;
matching the fingerprint information to be identified with a fingerprint database, and determining target fingerprint information corresponding to the fingerprint information to be identified and target song attribute information corresponding to the target fingerprint information;
and determining target text information of the audio data based on the target song attribute information.
In a possible implementation manner, when the determining unit 803 determines the target text information of the audio data based on the target song attribute information, it is specifically configured to:
acquiring the time length of the audio data and the first time of the audio data in the target song, wherein the first time is the starting time of the audio data;
adding the time length and the first time to obtain a second time;
and determining lyric information of the target song based on the attribute information of the target song, analyzing characters from the first time to the second time in the lyric information, and determining target text information of the audio data according to the analyzed characters.
In one possible implementation, the target text information includes one or more words of the target song, and further includes a start time and a duration of each sentence of the lyrics and/or a start time and a duration of each word of the target song; when the determining unit 803 determines the target video data based on the target text information and the second video data, it is specifically configured to:
performing visual conversion on each frame of image of the first video data to obtain each frame of converted image, and determining each frame of converted image as second video data;
determining a text image of each frame image of the second video data based on the target text information;
determining target video data based on the text image of each frame of image and each frame of image of the second video data.
In a possible implementation manner, the determining unit 803 determines the target video data based on the target text information and the second video data, and is specifically configured to:
processing the audio data to obtain target song attribute information corresponding to the audio data;
analyzing the target song attribute information, and determining the song name and the song singer of the target song;
generating a song information image based on the song title and the song singer;
and determining target video data based on the song information image, the target text information and the second video data.
In a possible implementation manner, the text recognition is performed by a text detection recognition model, where the text detection recognition model includes a text detection network and a text recognition network, and when the processing unit 802 performs text recognition on the first video data to obtain text recognition information, the processing unit is specifically configured to:
calling the text detection network to detect the first video data to obtain a text detection result;
if the text detection result indicates that the first video data has text information, analyzing the text detection result to obtain a text region image, and calling the text recognition network to perform text recognition on the text region image to obtain text recognition information;
and if the text detection result indicates that the first video data does not have text information, taking the text detection result as the text recognition information (a control-flow sketch of this two-stage process follows).
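The two-stage detect-then-recognize flow can be sketched as plain control flow; text_detector and text_recognizer below merely stand in for the text detection network and the text recognition network and are hypothetical placeholders rather than a specific model API, and the frame is assumed to be a NumPy-style image array.

def recognize_frame_text(frame, text_detector, text_recognizer):
    """Two-stage text recognition: detect text regions first, then recognize each crop.
    `text_detector(frame)` is assumed to return a list of (x0, y0, x1, y1) boxes and
    `text_recognizer(region)` a string; both are hypothetical placeholders."""
    boxes = text_detector(frame)
    if not boxes:
        # No text detected: the detection result itself is used as the recognition info.
        return {"has_text": False, "texts": []}
    texts = []
    for x0, y0, x1, y1 in boxes:
        region = frame[y0:y1, x0:x1]           # crop the text region image
        texts.append(text_recognizer(region))  # second stage on the cropped region
    return {"has_text": True, "texts": texts}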
It can be understood that the functions of each functional unit of the video processing apparatus provided in this embodiment of the present application may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description in the foregoing method embodiment, which is not described herein again.
In a possible embodiment, the video processing apparatus provided in this embodiment of the present application may be implemented in software. The video processing apparatus may be stored in a memory, for example in the form of a program or a plug-in, and includes a series of units: an acquisition unit, a processing unit, a determining unit, and an output unit, which are used to implement the video processing method provided in the embodiments of the present application.
In other possible embodiments, the video processing apparatus provided in this embodiment of the present application may also be implemented by a combination of hardware and software. By way of example, the video processing apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the video processing method provided in this embodiment of the present application; for example, such a processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
By adopting this embodiment of the present application, the device responds to a view angle conversion event of first video data by acquiring the audio data of the first video data, processing the audio data, and determining target text information of the audio data; it then performs text recognition on the first video data to obtain text recognition information. If the similarity between the target text information and the text recognition information is smaller than the set threshold, target video data is determined based on the target text information and second video data, where the second video data is obtained by performing visual conversion on the first video data; finally, the target video data is output. Because the set threshold is used to decide whether subtitles need to be added to the video, processing efficiency is improved; subtitles can thus be added to a video that lacks them when the video undergoes visual conversion, thereby enriching the video content.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device described in the embodiments of the present application includes: a processor 901, a communication interface 902, and a memory 903. The processor 901, the communication interface 902, and the memory 903 may be connected by a bus or in other manners; the embodiment of the present application takes the bus connection as an example.
The processor 901 (or Central Processing Unit (CPU)) is the computing core and control core of the computer device; it can parse various instructions in the computer device and process various data of the computer device. For example, the CPU can parse a power-on/power-off instruction sent by a user to the computer device and control the computer device to perform the power-on or power-off operation; as another example, the CPU can transfer various types of interactive data between the internal structures of the computer device, and so on. The communication interface 902 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and is controlled by the processor 901 to transmit and receive data. The memory 903 is a storage device in the computer device for storing programs and data. It can be understood that the memory 903 here may include both the internal memory of the computer device and extended memory supported by the computer device. The memory 903 provides storage space that stores the operating system of the computer device, which may include, but is not limited to: an Android system, an iOS system, a Windows Phone system, etc.; this is not limited in this application.
In the embodiment of the present application, the processor 901 executes the executable program code in the memory 903 to perform the following operations:
acquiring audio data of first video data in response to a view angle conversion event of the first video data;
processing the audio data to obtain target text information of the audio data;
performing text recognition on the first video data to obtain text recognition information;
if the similarity between the target text information and the text recognition information is smaller than a set threshold, determining target video data based on the target text information and second video data, wherein the second video data is obtained by performing visual conversion on the first video data;
and outputting the target video data.
In one possible implementation, the processor 901 is further configured to:
determining text information in the text recognition information;
and performing similarity calculation on the target text information and the text information in the text recognition information to obtain the similarity between the target text information and the text recognition information.
In a possible implementation manner, the processor 901 is further configured to compare the similarity between the target text information and the text recognition information with the set threshold, and is further configured to:
if the similarity between the target text information and the text recognition information is greater than or equal to the set threshold, performing visual conversion on the first video data to obtain second video data, wherein the first video data is displayed at a first visual angle, and the second video data is displayed at a second visual angle;
and taking the second video data as the target video data, and outputting the target video data (a sketch of one possible way to compute and compare this similarity follows).
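One plausible (but assumed) way to realize the similarity calculation and the threshold comparison is a character-level ratio from the Python standard library's difflib; the 0.8 value below merely illustrates the "set threshold" and is not taken from the application.

from difflib import SequenceMatcher

def text_similarity(target_text: str, recognized_text: str) -> float:
    """Character-level similarity in [0, 1] between lyric text and recognized frame text."""
    return SequenceMatcher(None, target_text, recognized_text).ratio()

def needs_subtitles(target_text: str, recognized_text: str, threshold: float = 0.8) -> bool:
    # Below the set threshold: the frames do not already carry matching lyrics,
    # so the target text information should be rendered into the converted video.
    return text_similarity(target_text, recognized_text) < threshold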
In a possible implementation manner, when the processor 901 processes the audio data to obtain target text information of the audio data, the processor is specifically configured to:
converting the audio data into voice spectrum information;
determining fingerprint information to be identified corresponding to the audio data based on a peak point in the voice spectrum information;
matching the fingerprint information to be identified with a fingerprint database, and determining target fingerprint information corresponding to the fingerprint information to be identified and target song attribute information corresponding to the target fingerprint information;
and determining target text information of the audio data based on the target song attribute information.
In a possible implementation manner, when the processor 901 determines the target text information of the audio data based on the target song attribute information, the processor is specifically configured to:
acquiring the time length of the audio data and the first time of the audio data in the target song, wherein the first time is the starting time of the audio data;
adding the time length and the first time to obtain a second time;
and determining lyric information of the target song based on the attribute information of the target song, parsing the characters between the first time and the second time in the lyric information, and determining target text information of the audio data according to the parsed characters.
In one possible implementation, the target text information includes one or more words of the target song, and the target text information further includes a start time and a duration of each of the words of the target song; when the processor 901 determines target video data based on the target text information and the second video data, the processor is specifically configured to:
performing visual conversion on each frame of image of the first video data to obtain each frame of converted image, and determining each frame of converted image as second video data;
determining a text image of each frame image of the second video data based on the target text information;
determining target video data based on the text image of each frame of image and each frame of image of the second video data.
In a possible implementation manner, when the processor 901 determines target video data based on the target text information and the second video data, the processor is specifically configured to:
processing the audio data to obtain target song attribute information corresponding to the audio data;
analyzing the target song attribute information, and determining the song title and the song singer of the target song;
generating a song information image based on the song title and the song singer;
and determining target video data based on the song information image, the target text information and the second video data.
In a possible implementation manner, the text recognition is performed by a text detection recognition model, where the text detection recognition model includes a text detection network and a text recognition network, and when the processor 901 performs text recognition on the first video data to obtain text recognition information, the processor is specifically configured to:
calling the text detection network to detect the first video data to obtain a text detection result;
if the text detection result indicates that the first video data has text information, analyzing the text detection result to obtain a text region image, and calling the text recognition network to perform text recognition on the text region image to obtain text recognition information;
and if the text detection result indicates that the first video data does not have text information, taking the text detection result as the text recognition information.
According to an aspect of the present application, an embodiment of the present application further provides a computer program product, which includes a computer program stored in a computer-readable storage medium. The processor 901 reads the computer program from the computer-readable storage medium and executes it, so that the computer device 900 performs the video processing method of fig. 3.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of video processing, the method comprising:
acquiring audio data of first video data in response to a view angle conversion event of the first video data;
processing the audio data to obtain target text information of the audio data;
performing text recognition on the first video data to obtain text recognition information;
if the similarity between the target text information and the text recognition information is smaller than a set threshold, determining target video data based on the target text information and second video data, wherein the second video data is obtained by performing visual conversion on the first video data;
and outputting the target video data.
2. The method of claim 1, further comprising:
determining text information in the text recognition information;
performing similarity calculation on the target text information and the text information in the text recognition information to obtain the similarity between the target text information and the text recognition information;
and comparing the similarity between the target text information and the text recognition information with the set threshold.
3. The method according to any one of claims 1-2, further comprising:
if the similarity between the target text information and the text recognition information is greater than or equal to the set threshold, performing visual conversion on the first video data to obtain second video data, wherein the first video data is displayed at a first visual angle, and the second video data is displayed at a second visual angle;
and taking the second video data as target video data, and outputting the target video data.
4. The method of claim 1, wherein the processing the audio data to obtain target text information of the audio data comprises:
converting the audio data into voice spectrum information;
determining fingerprint information to be identified corresponding to the audio data based on a peak point in the voice spectrum information;
matching the fingerprint information to be identified with a fingerprint database, and determining target fingerprint information corresponding to the fingerprint information to be identified and target song attribute information corresponding to the target fingerprint information;
and determining target text information of the audio data based on the target song attribute information.
5. The method of claim 4, wherein determining target text information for the audio data based on the target song attribute information comprises:
acquiring the time length of the audio data and the first time of the audio data in the target song, wherein the first time is the starting time of the audio data;
adding the time length and the first time to obtain a second time;
and determining lyric information of the target song based on the attribute information of the target song, parsing the characters between the first time and the second time in the lyric information, and determining target text information of the audio data according to the parsed characters.
6. The method of claim 5, wherein the target text information includes one or more words of the target song, and the target text information further includes a start time and a duration of each of the words; and the determining target video data based on the target text information and the second video data comprises:
performing visual conversion on each frame of image of the first video data to obtain each frame of converted image, and determining each frame of converted image as second video data;
determining a text image of each frame image of the second video data based on the target text information;
determining target video data based on the text image of each frame of image and each frame of image of the second video data.
7. The method of claim 1, wherein determining target video data based on the target text information and second video data comprises:
processing the audio data to obtain target song attribute information corresponding to the audio data;
analyzing the target song attribute information, and determining the song title and the song singer of the target song;
generating a song information image based on the song title and the song singer;
and determining target video data based on the song information image, the target text information and the second video data.
8. The method of claim 1, wherein the text recognition is performed by a text detection recognition model, the text detection recognition model comprises a text detection network and a text recognition network, and the performing text recognition on the first video data to obtain text recognition information comprises:
calling the text detection network to detect the first video data to obtain a text detection result;
if the text detection result indicates that the first video data has text information, analyzing the text detection result to obtain a text region image, and calling the text recognition network to perform text recognition on the text region image to obtain text recognition information;
and if the text detection result indicates that the first video data does not have text information, taking the text detection result as the text recognition information.
9. A computer device, characterized in that the computer device comprises:
a processor adapted to implement one or more computer programs; and
a computer storage medium storing one or more computer programs, the one or more computer programs being adapted to be loaded by the processor to perform the video processing method according to any one of claims 1-8.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more computer programs adapted to be loaded by a processor to perform the video processing method according to any one of claims 1-8.
CN202211094827.4A 2022-09-07 2022-09-07 Video processing method, computer equipment and storage medium Pending CN115474088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211094827.4A CN115474088A (en) 2022-09-07 2022-09-07 Video processing method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211094827.4A CN115474088A (en) 2022-09-07 2022-09-07 Video processing method, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115474088A true CN115474088A (en) 2022-12-13

Family

ID=84371049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211094827.4A Pending CN115474088A (en) 2022-09-07 2022-09-07 Video processing method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115474088A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101931736A (en) * 2009-06-18 2010-12-29 言炬 Processing method and system of video picture black borders
WO2016177296A1 (en) * 2015-05-04 2016-11-10 腾讯科技(深圳)有限公司 Video generation method and apparatus
CN110839174A (en) * 2019-12-02 2020-02-25 广州酷狗计算机科技有限公司 Image processing method and device, computer equipment and storage medium
CN114071184A (en) * 2021-11-11 2022-02-18 腾讯音乐娱乐科技(深圳)有限公司 Subtitle positioning method, electronic equipment and medium
CN114339081A (en) * 2021-12-22 2022-04-12 腾讯音乐娱乐科技(深圳)有限公司 Subtitle generating method, electronic equipment and computer readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233540A (en) * 2023-03-10 2023-06-06 北京富通亚讯网络信息技术有限公司 Parallel signal processing method and system based on video image recognition
CN116233540B (en) * 2023-03-10 2024-04-02 北京富通亚讯网络信息技术有限公司 Parallel signal processing method and system based on video image recognition

Similar Documents

Publication Publication Date Title
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
CN111582241B (en) Video subtitle recognition method, device, equipment and storage medium
CN110020437B (en) Emotion analysis and visualization method combining video and barrage
CN110446063B (en) Video cover generation method and device and electronic equipment
CN108429782B (en) Information pushing method and device, terminal and server
CN110968736A (en) Video generation method and device, electronic equipment and storage medium
CN110177295B (en) Subtitle out-of-range processing method and device and electronic equipment
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN114495128B (en) Subtitle information detection method, device, equipment and storage medium
CN113536172B (en) Encyclopedia information display method and device and computer storage medium
CN114827752B (en) Video generation method, video generation system, electronic device and storage medium
CN112399269A (en) Video segmentation method, device, equipment and storage medium
CN115474088A (en) Video processing method, computer equipment and storage medium
CN112989112B (en) Online classroom content acquisition method and device
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113705300A (en) Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
CN111954022B (en) Video playing method and device, electronic equipment and readable storage medium
CN114513706B (en) Video generation method and device, computer equipment and storage medium
CN115063800B (en) Text recognition method and electronic equipment
CN116524906A (en) Training data generation method and system for voice recognition and electronic equipment
CN114697741B (en) Multimedia information playing control method and related equipment
CN112333554B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN112312205A (en) Video processing method and device, electronic equipment and computer storage medium
CN113411517A (en) Video template generation method and device, electronic equipment and storage medium
CN112860941A (en) Cover recommendation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination