CN113301444B - Video processing method and device, electronic equipment and storage medium


Info

Publication number
CN113301444B
CN113301444B (application CN202110554116.XA)
Authority
CN
China
Prior art keywords
words
video
word
target object
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110554116.XA
Other languages
Chinese (zh)
Other versions
CN113301444A (en)
Inventor
何立伟
陈铁军
刘申亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110554116.XA
Publication of CN113301444A
Application granted
Publication of CN113301444B
Legal status: Active
Anticipated expiration


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47: End-user applications
    • H04N21/488: Data services, e.g. news ticker
    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456: Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Studio Circuits (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The disclosure relates to a video processing method and apparatus, an electronic device, and a storage medium, and belongs to the field of image processing. The video processing method includes the following steps: identifying audio information in a video to obtain text information corresponding to the audio information and the occurrence time of each character of the text information in the video; in response to a plurality of words in the text information being identical and consecutive, determining a target time period according to the occurrence times of the first character and the last character of the plurality of words in the video, where the target time period represents the time period during which the plurality of words occur in the video, and each word in the text information is composed of at least one character; and adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence. According to the scheme of the disclosure, dynamic subtitles can be added according to the audio of the persons in the video, and the content emphasized by a person is displayed through the dynamic subtitles, which improves the video processing effect.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of internet technology and electronic devices, watching videos has become a common form of entertainment in users' leisure time and is favored by a large number of users. To make video content more engaging, subtitles are usually added to videos manually; however, manual processing makes video processing inefficient.
Disclosure of Invention
The present disclosure provides a video processing method, apparatus, electronic device and storage medium, which reduces the labor cost and improves the video processing effect.
According to an aspect of the embodiments of the present disclosure, there is provided a video processing method, including:
identifying audio information in a video to obtain text information corresponding to the audio information and the occurrence time of each character in the text information in the video;
in response to a plurality of words in the text information being identical and consecutive, determining a target time period according to the occurrence times of the first character and the last character of the plurality of words in the video, the target time period being used for representing the time period during which the plurality of words occur in the video, each word in the text information being composed of at least one character;
and adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence.
In some embodiments, the adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence includes:
adding a dynamic effect in which the plurality of words appear in sequence in the target video segment in response to the target object being included in the target video segment.
In some embodiments, the identifying audio information in a video to obtain text information corresponding to the audio information and an occurrence time of each character in the text information in the video includes:
responding to the video including the target object, identifying the audio information in the video, and obtaining text information corresponding to the audio information and the occurrence time of each character in the text information in the video.
In some embodiments, the adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence includes:
determining, from the plurality of words, the words that need to be displayed in each video frame of the target video segment, wherein the number of words that need to be displayed in any video frame is not less than the number of words that need to be displayed in the preceding video frame;
respectively determining the display position corresponding to the word to be displayed in each video frame;
rendering the corresponding words at the determined display positions in each of the video frames.
In some embodiments, the determining, from the plurality of terms, the term that needs to be displayed for each video frame in the target video segment includes:
determining a starting display time for each of the plurality of terms;
determining a target video segment corresponding to the target time period from the video;
and for each video frame in the target video segment, determining the words with the starting display time earlier than or equal to the playing time corresponding to the video frame as the words required to be displayed by the video frame.
In some embodiments, the rendering the corresponding words at the determined display positions in each of the video frames includes:
determining the size of each character in the words according to the distance between the words to be displayed in any video frame and the target object, the number of the characters in the words and the size of the target object;
rendering each character in the term at the determined display position in the video frame according to the size of each character in the term.
In some embodiments, the size of each character in the word is calculated according to the following formula:
(The formula appears only as an image in the source publication.)
where size is the size of each character in the word; k₁ is a coefficient associated with the size of the target object and with the distance between the word and the target object, where k₁ is positively correlated with the size of the target object and negatively correlated with the distance; L is the length of the target object; W is the width of the target object; n is the number of characters in the word; and k₂ is a coefficient associated with the number of characters in the word, where k₂ is positively correlated with the number of characters. k₁ is any value greater than 0, and k₂ is any value greater than 0 and less than 1.
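The source renders the formula itself only as an image, so its exact form is not recoverable here. The sketch below is a hypothetical stand-in that merely respects the stated monotonicities; it is not the patent's actual formula, and the functional form is an assumption.

```python
# Hypothetical stand-in only: the patent's actual size formula is an image
# in the source. This form just respects the stated correlations: size
# rises with the object's size (L, W) and with k1 (which itself falls as
# the word-object distance grows), and falls as the character count n rises.
def character_size(k1: float, L: float, W: float, n: int, k2: float) -> float:
    assert k1 > 0 and 0 < k2 < 1
    return k1 * (L + W) / (1.0 + k2 * n)
```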
In some embodiments, the separately determining a display position corresponding to a word to be displayed in each video frame includes:
determining display positions of the plurality of words on a first circular curve centered on a target object and having a first distance as a radius, in response to the number of the plurality of words being less than or equal to a first number; or,
in response to a number of words of the plurality of words being greater than the first number, determining a display position of a preceding first number of words of the plurality of words on the first circular curve, determining a display position of remaining words of the plurality of words on a second circular curve centered on the target object and having a radius of a second distance, the second distance being greater than the first distance.
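As a concrete illustration of the two-circle rule above, here is a minimal Python sketch; the function name and the radius parameters are illustrative, not from the patent.

```python
# A minimal sketch of the two-circle layout described above: the first
# `first_n` words sit on a circle of radius r1 around the target object,
# and any remaining words sit on a larger circle of radius r2 (r2 > r1).
def assign_radii(word_count: int, first_n: int, r1: float, r2: float) -> list[float]:
    assert r2 > r1
    return [r1 if i < first_n else r2 for i in range(word_count)]
```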
In some embodiments, the separately determining a display position corresponding to a word to be displayed in each video frame includes:
determining a plurality of continuous sequence number intervals and distances corresponding to the sequence number intervals, wherein any sequence number interval represents a sequence number of a word which can be displayed on a circular curve by taking a target object as a center and taking the distance corresponding to the sequence number interval as a radius, the sequence number interval is larger than a previous sequence number interval of the sequence number interval, and the distance corresponding to the sequence number interval is larger than the distance corresponding to the previous sequence number interval;
determining the distance between each word and the target object as the distance corresponding to the sequence number interval to which the sequence number of each word belongs according to the sequence number of each word in the plurality of words;
and for each video frame, determining the display position of the word in the video frame according to the display position of the target object in the video frame and the distance between the word needing to be displayed in the video frame and the target object.
In some embodiments, the determining the display position of the word in the video frame according to the display position of the target object in the video frame and the distance between the word to be displayed in the video frame and the target object includes:
for any word in the video frame that needs to be displayed,
determining the relative position of the word and the center of the target object according to the distance between the word and the target object and the number of words to be displayed on a circular curve which takes the target object as the center and the distance as the radius in the target video segment;
acquiring the display position of the center of the target object in the video frame;
and determining the display position of the word in the video frame according to the relative position of the word and the center of the target object and the display position of the center of the target object in the video frame.
In some embodiments, the relative position of the word to the center of the target object includes an included angle between the line connecting the word with the center of the target object and a reference line, where the reference line is a ray that starts at the center of the target object and points in a reference direction. The included angle is calculated according to the following formula:
(The formula appears only as an image in the source publication.)
where n is the serial number of the word, α is the included angle corresponding to the n-th word, α_max is the maximum value of the included-angle range corresponding to the plurality of words to be displayed on a circular curve centered on the target object with the distance as its radius, and n₀ is the number of words to be displayed on that circular curve.
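Since the angle formula survives only as an image, a natural reconstruction from the definitions above is an even angular spacing, α = α_max · n / n₀. The sketch below uses that assumed form to place the n-th word on the circle; the names and the handling of the reference direction are illustrative.

```python
# Places the n-th word on a circle around the target object, assuming the
# reconstructed formula alpha = alpha_max * n / n0 (an assumption: the
# source shows the formula only as an image). `ref_angle` is the direction
# of the reference ray from the object's center; 0.0 is an arbitrary choice.
import math

def word_position(n: int, n0: int, alpha_max: float,
                  center: tuple[float, float], radius: float,
                  ref_angle: float = 0.0) -> tuple[float, float]:
    alpha = alpha_max * n / n0          # included angle for the n-th word
    theta = ref_angle + alpha
    cx, cy = center                     # display position of the object's center
    return (cx + radius * math.cos(theta), cy + radius * math.sin(theta))
```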
According to still another aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including:
the recognition unit is configured to recognize audio information in a video, and obtain text information corresponding to the audio information and occurrence time of each character in the text information in the video;
a determining unit configured to perform, in response to a plurality of words in the text information being identical and consecutive, determining a target time period according to appearance times of a first character and a last character in the plurality of words in the video, the target time period being used for representing the appearance time period of the plurality of words in the video, each word in the text information being composed of at least one character;
an adding unit configured to add a dynamic effect in which the plurality of words appear in sequence in a target video segment corresponding to the target time period in the video.
In some embodiments, the adding unit is configured to perform adding a dynamic effect in which the plurality of words appear in sequence in the target video segment in response to a target object being included in the target video segment.
In some embodiments, the identification unit is configured to perform identification on audio information in the video in response to a target object included in the video, and obtain text information corresponding to the audio information and an occurrence time of each character in the text information in the video.
In some embodiments, the adding unit includes:
a word determining subunit configured to determine, from the plurality of words, the words that need to be displayed in each video frame of the target video segment, wherein the number of words that need to be displayed in any video frame is not less than the number of words that need to be displayed in the preceding video frame;
the position determining subunit is configured to respectively determine display positions corresponding to the words to be displayed in each video frame;
a rendering subunit configured to render the corresponding words at the determined display positions in each of the video frames.
In some embodiments, the word determining subunit is configured to determine a starting display time of each of the plurality of words; determine a target video segment corresponding to the target time period from the video; and, for each video frame in the target video segment, determine the words whose starting display time is earlier than or equal to the playing time corresponding to the video frame as the words that need to be displayed by the video frame.
In some embodiments, the rendering subunit is configured to perform determining a size of each character in the word according to a distance between the word to be displayed in any video frame and the target object, the number of characters in the word, and a size of the target object; rendering each character in the term at the determined display position in the video frame according to the size of each character in the term.
In some embodiments, the size of each character in the word is calculated according to the following formula:
(The formula appears only as an image in the source publication.)
where size is the size of each character in the word; k₁ is a coefficient associated with the size of the target object and with the distance between the word and the target object, where k₁ is positively correlated with the size of the target object and negatively correlated with the distance; L is the length of the target object; W is the width of the target object; n is the number of characters in the word; and k₂ is a coefficient associated with the number of characters in the word, where k₂ is positively correlated with the number of characters. k₁ is any value greater than 0, and k₂ is any value greater than 0 and less than 1.
In some embodiments, the position determination subunit is configured to determine, in response to the number of the plurality of words being less than or equal to a first number, display positions of the plurality of words on a first circular curve centered on a target object and having a first distance as a radius; or,
the position determination subunit is configured to perform, in response to a number of words of the plurality of words being greater than the first number, determining, on the first circular curve, display positions of a preceding first number of words of the plurality of words, and determining, on a second circular curve centered on the target object and having a second distance as a radius, display positions of remaining words of the plurality of words, the second distance being greater than the first distance.
In some embodiments, the position determination subunit is configured to perform determining a plurality of consecutive sequence number intervals and distances corresponding to the plurality of sequence number intervals, where any sequence number interval represents a sequence number of a word displayable on a circular curve centered on a target object and having a radius equal to a distance corresponding to the any sequence number interval, the any sequence number interval is greater than a previous sequence number interval of the any sequence number interval, and the distance corresponding to the any sequence number interval is greater than the distance corresponding to the previous sequence number interval;
the position determining subunit is configured to perform determining, according to the sequence number of each of the plurality of words, that the distance between each of the words and the target object is a distance corresponding to a sequence number interval to which the sequence number of each of the words belongs;
the position determining subunit is configured to perform, for each video frame, determining a display position of a word in the video frame according to a display position of the target object in the video frame and a distance between the word to be displayed in the video frame and the target object.
In some embodiments, the position determination subunit is configured to perform the determination for any word in the video frame that needs to be displayed,
determining the relative position of the word and the center of the target object according to the distance between the word and the target object and the number of words to be displayed on a circular curve which takes the target object as the center and the distance as the radius in the target video segment;
acquiring the display position of the center of the target object in the video frame;
and determining the display position of the word in the video frame according to the relative position of the word and the center of the target object and the display position of the center of the target object in the video frame.
In some embodiments, the relative position of the word to the center of the target object includes an included angle between the line connecting the word with the center of the target object and a reference line, where the reference line is a ray that starts at the center of the target object and points in a reference direction, and the included angle is calculated according to the following formula:
(The formula appears only as an image in the source publication.)
where n is the serial number of the word, α is the included angle corresponding to the n-th word, α_max is the maximum value of the included-angle range corresponding to the plurality of words to be displayed on a circular curve centered on the target object with the distance as its radius, and n₀ is the number of words to be displayed on that circular curve.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
one or more processors;
volatile or non-volatile memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the video processing method of the above aspect.
According to yet another aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method of the above aspect.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program product, wherein instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method of the above aspect.
The video processing method, the video processing device, the electronic equipment and the storage medium provided by the embodiment of the application have at least the following beneficial effects:
the embodiment of the application provides a method for automatically adding dynamic subtitles to a video, which can add the dynamic subtitles according to the emphasized content of a person in the video when the person in the video emphasizes the content to be expressed in a mode of repeatedly speaking a certain word, and highlight the emphasized content through the dynamic subtitles, so that the labor cost is reduced, and the video processing effect is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram illustrating one implementation environment in accordance with an example embodiment.
Fig. 2 is a flow diagram illustrating a video processing method according to an example embodiment.
Fig. 3 is a flow diagram illustrating a video processing method according to an example embodiment.
FIG. 4 is a schematic diagram illustrating a relative positional relationship of a word to a target object, according to an example embodiment.
FIG. 5 is a diagram illustrating a dynamic effect according to an example embodiment.
Fig. 6 is a flow diagram illustrating a video processing method according to an example embodiment.
Fig. 7 is a block diagram illustrating a video processing apparatus according to an example embodiment.
Fig. 8 is a block diagram illustrating another video processing device according to an example embodiment.
Fig. 9 illustrates a block diagram of a terminal in accordance with an exemplary embodiment.
FIG. 10 is a block diagram illustrating a server in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the description of the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
As used in this disclosure, "at least one" includes one, two, or more than two; "a plurality" includes two or more than two; "each" refers to every one of the corresponding plurality; and "any" refers to any one of the plurality. For example, if the plurality of words includes 3 words, "each" refers to every one of the 3 words, and "any" refers to any one of the 3 words, which may be the first, the second, or the third.
The video processing method provided by the embodiments of the present disclosure is executed by an electronic device. In some embodiments, the electronic device is a terminal, which may be any of various types of terminals such as a mobile phone, a tablet computer, or a computer. In some embodiments, the electronic device is a server, which is a single server, a server cluster composed of several servers, or a cloud computing service center. In some embodiments, the electronic device includes both a terminal and a server.
FIG. 1 is a schematic diagram of an implementation environment provided according to an example embodiment. The implementation environment includes a terminal 101 and a server 102, which are connected through a wireless or wired network.
The terminal 101 has installed thereon a target application served by the server 102, through which the terminal 101 can implement functions such as data transmission, message interaction, and the like. In some embodiments, terminal 101 is a cell phone, tablet, computer, or other terminal. In some embodiments, the target application is a target application in the operating system of the terminal 101 or a target application provided by a third party. For example, the target application is a video processing application having a function of processing a video, but of course, the video processing application can also have other functions, such as a sharing function, a comment function, and the like. In some embodiments, the server 102 is a background server of the target application or a cloud server providing services such as cloud computing and cloud storage.
The terminal 101 is configured to transmit a video to the server 102 based on the target application, and the server 102 is configured to process the video, for example, add subtitles to the video, and then return the processed video to the terminal 101.
The method provided by the embodiment of the disclosure can be applied to any video processing scene.
For example, the method is applied to the scene of adding subtitles to live video:
when subtitles are added to a live video, if the video processing method provided by the embodiment of the disclosure is adopted, the content emphasized by the anchor in the video can be automatically identified, and dynamic subtitles are added to the video according to the content emphasized by the anchor, so that the labor cost is reduced, and the video processing effect is ensured.
It should be noted that the method provided by the embodiment of the present disclosure can be applied to any kind of video processing scene, for example, a scene in which subtitles are added to a video of a drama, a scene in which subtitles are added to a movie, and a scene in which a dynamic effect is added to any video, and the embodiment of the present disclosure does not limit this.
Fig. 2 is a flow chart illustrating a video processing method according to an exemplary embodiment, referring to fig. 2, comprising the steps of:
201. Identify the audio information in the video to obtain text information corresponding to the audio information and the occurrence time of each character of the text information in the video.
The video is any video, in some embodiments, the video is local to the terminal, and in some embodiments, the video is a video obtained from another terminal. In some embodiments, the video is obtained by shooting, in other embodiments, the video is obtained by other methods, and the embodiment of the present disclosure does not limit the video.
The video comprises a plurality of video frames and audio information, and when the video is played, the video frames and the audio information are played together, so that the audience can not only see the picture but also hear the sound.
After voice recognition is performed on the audio information, text information corresponding to the audio information can be obtained, the text information includes at least one character, and the text information can be regarded as subtitle information of a video because the audio information is audio information of the video.
The video comprises each video frame and playing time corresponding to each video frame, and the audio information comprises a plurality of audio frames and the playing time corresponding to each audio frame, so that when the video frames and the audio frames are played simultaneously, the video can be played according to the playing time of the video frames and the playing time of the audio frames, and the played pictures are matched with the played sound.
Because the audio information includes the playing time corresponding to each audio frame, the occurrence time of each character of the text information in the video can be obtained when speech recognition is performed on the audio information. It takes a certain time for a person in the video to speak a character, so the occurrence time of each character in the video may be a time period; alternatively, since the time required to speak a character is short, the occurrence time of each character in the video may be treated as a time point. When the occurrence time of each character is a time point, a fixed duration is taken as the duration of each character.
The fixed duration is a duration set by default in the system, a duration set by a user, a duration set by a clipping person, and the like, which is not limited in the embodiment of the present disclosure.
202. In response to a plurality of words in the text information being identical and consecutive, a target time period is determined based on times of occurrence of a first character and a last character in the plurality of words in the video.
For example, the text information includes "deduction 1, deduction 1, deduction 1"; the word "deduction 1" is repeated and consecutive, so these occurrences form a plurality of identical and consecutive words in the text information.
Wherein the target time period is used to represent a time period of occurrence of a plurality of words in the video. Wherein each word in the text information is composed of at least one character. The starting time of the target time period is the appearance time of the first character in the plurality of words in the video, and the ending time of the target time period is the appearance time of the last character in the plurality of words in the video.
203. Add, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence.
The time period corresponding to the target video segment is the same as the time period corresponding to the plurality of words, so the playing content of the target video segment corresponds to the plurality of words. For example, if the plurality of words is "deduction 1, deduction 1", the target video segment is the video segment in which the anchor is saying "deduction 1, deduction 1".
The person in the target video segment may speak the multiple words in sequence, and if the multiple words are identical and consecutive, the multiple words are words that the person in the video repeatedly speaks. Typically, one would repeat the content that one wants to emphasize, so this word is the content that the person in the video emphasizes. In order to improve the display effect of the subtitles, when the subtitles corresponding to the plurality of words are added to the target video segment, a dynamic effect that the plurality of words appear in sequence is added, so that the plurality of words are emphasized again through the dynamic effect.
The video processing method provided by the embodiment of the disclosure is a method for automatically adding dynamic subtitles to a video, when a person in the video emphasizes contents to be expressed by repeatedly speaking a certain word, the dynamic subtitles can be added according to the contents emphasized by the person in the video, and the emphasized contents are highlighted through the dynamic subtitles, so that the labor cost is reduced, and the video processing effect is improved.
It should be noted that the text information may include a plurality of groups of words, each group including a plurality of words that are identical and consecutive in the text information, and the video processing method provided by the embodiment of the present disclosure can add dynamic subtitles for any group of words. For example, the text information includes "to buy this clothing, deduction 1, deduction 1" and "if you like this clothing, buy it, buy it"; this text information includes two groups of words, one group being "deduction 1, deduction 1" and the other being "buy it, buy it".
Since the process of adding dynamic subtitles to each group of words is the same, the embodiment of the present disclosure is only exemplary of the process of adding dynamic subtitles to a group of words.
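Before the detailed flow of Fig. 3, steps 201 to 203 can be summarized in a short Python sketch. The data types and helper structure below are illustrative assumptions; speech recognition (step 201) and effect rendering (step 203) are left abstract.

```python
# A minimal sketch of the Fig. 2 flow (steps 201-203). The TimedChar type
# and the segmentation into words are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class TimedChar:
    char: str
    start: float  # occurrence time in the video, in seconds
    end: float

Word = list[TimedChar]  # each word is composed of at least one character

def find_repeated_run(words: list[Word]) -> list[Word] | None:
    """Step 202 precondition: find identical, consecutive words."""
    i = 0
    while i < len(words):
        text_i = "".join(c.char for c in words[i])
        j = i
        while j + 1 < len(words) and "".join(c.char for c in words[j + 1]) == text_i:
            j += 1
        if j > i:
            return words[i:j + 1]
        i += 1
    return None

def target_time_period(run: list[Word]) -> tuple[float, float]:
    # Step 202: start = occurrence time of the first character of the first
    # word; end = occurrence time of the last character of the last word.
    return run[0][0].start, run[-1][-1].end
```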
Fig. 3 is a flow chart illustrating a video processing method according to an exemplary embodiment, referring to fig. 3, comprising the steps of:
301. Identify the audio information of the video to obtain text information corresponding to the audio information and the occurrence time of each character of the text information in the video.
The video is any video, in some embodiments, the video is a local video, and in some embodiments, the video is a video obtained from another terminal. In some embodiments, the video is obtained by shooting, in other embodiments, the video is obtained by other methods, and the embodiments of the present disclosure do not limit the video.
The video comprises a plurality of video frames and audio information, and when the video is played, the plurality of video frames and the audio information are played together, so that the audience can not only see the picture but also hear the sound.
After the audio information is identified, text information corresponding to the audio information can be obtained, the text information includes at least one character, and the text information can be regarded as subtitle information of a video because the audio information is audio information of the video.
In some embodiments, identifying the audio information of the video to obtain the text information corresponding to the audio information and the occurrence time of each character of the text information in the video includes: inputting the audio information of the video into a speech recognition model, and calling the speech recognition model to process the audio information to obtain the text information and the occurrence time of each character in the video. The speech recognition model is trained with sample audio information of a sample video, sample text information corresponding to the sample audio information, and the occurrence time of each character of the sample text information in the sample video. The sample text information is the real text information of the sample audio information; for example, it is a transcript of the audio content produced by manually listening to the sample audio information.
It should be noted that any speech recognition method may be adopted in the embodiment of the present disclosure to perform speech recognition on the audio information in the video, and the embodiment of the present disclosure does not limit the speech recognition method adopted in step 301.
It should be noted that the embodiment of the present application describes obtaining the occurrence time of the characters of the text information in the video only by taking as an example identifying the audio information of the video to obtain the text information corresponding to the audio information and the occurrence time of each character of the text information in the video.
In another embodiment, the audio information of the video is identified through a speech recognition model to obtain text information corresponding to the audio information, and the occurrence time of characters in the text information in the video is manually marked.
In another embodiment, the text information and the occurrence time of the characters in the text information in the video are obtained by any method, and the text information and the occurrence time of the characters in the text information in the video are directly obtained. The embodiment of the application does not limit the manner of obtaining the text information and the occurrence time of the characters in the text information in the video.
302. In response to a plurality of words in the text information being identical and consecutive, determine a target time period according to the occurrence times of the first character and the last character of the plurality of words in the video, where the target time period represents the time period during which the plurality of words occur in the video.
For example, the text information includes "deduction 1, deduction 1, deduction 1"; the word "deduction 1" is repeated and consecutive, so these occurrences form a plurality of identical and consecutive words in the text information.
In some embodiments, before step 302 is performed, the text information is queried for whether it includes a plurality of identical and consecutive words; when such words are found, the step of determining the target time period according to the occurrence times of the first character and the last character of the plurality of words in the video is performed.
In some embodiments, querying whether the text information includes a plurality of identical and consecutive words includes: performing word segmentation on the text information to obtain a plurality of word segmentation results, comparing adjacent word segmentation results, and determining whether the adjacent word segmentation results are the same; if adjacent word segmentation results are the same, it is determined that the text information includes identical and consecutive words, and if adjacent word segmentation results are different, it is determined that the text information does not include identical and consecutive words.
When the query finds that the text information includes a plurality of identical and consecutive words, those words are also determined. In some embodiments, comparing adjacent word segmentation results to determine whether they are the same includes: taking the first word segmentation result as the reference result and determining whether the second result is the same as the reference; if it is, continuing to determine whether the third result is the same as the reference; and if the k-th result differs from the reference, taking the k-th result as the new reference and continuing to compare the subsequent results. If at least one result matches the reference result, that result together with the reference result forms a plurality of identical and consecutive words. Through this comparison, the identical and consecutive words in the text information can be found, as in the sketch below.
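A small sketch of the reference-comparison loop just described, assuming a hypothetical segmenter has already produced the word segmentation results:

```python
# A sketch of the reference-comparison loop described above. `tokens` is
# the list of word segmentation results (from a hypothetical segmenter).
def find_identical_consecutive_groups(tokens: list[str]) -> list[list[int]]:
    """Return index groups of identical, consecutive segmentation results."""
    groups: list[list[int]] = []
    ref = 0                                # current reference result
    for k in range(1, len(tokens)):
        if tokens[k] != tokens[ref]:
            if k - ref >= 2:               # at least two identical results
                groups.append(list(range(ref, k)))
            ref = k                        # k-th result becomes the reference
    if len(tokens) - ref >= 2:
        groups.append(list(range(ref, len(tokens))))
    return groups

# e.g. tokens = ["deduction 1", "deduction 1", "deduction 1", "buy it"]
# yields [[0, 1, 2]]: three identical, consecutive words.
```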
After the audio information in the video is identified, not only the text information corresponding to the audio information is obtained, but also the appearance time of each character in the text information in the video is obtained, so that the time periods corresponding to a plurality of words can be determined according to the appearance time of each character in the text information in the video. In some embodiments, determining the target time period based on the time of occurrence of the first character and the last character in the plurality of words in the video comprises: and determining the appearance time of the first character in the plurality of words in the video as the starting time of the target time period, and determining the appearance time of the last character in the plurality of words in the video as the ending time of the target time period.
In some embodiments, since it takes a certain time for a person in the video to say a character, the occurrence time of each character in the text message in the video is a time period, and the time period includes the starting occurrence time and the ending occurrence time of the character in the video, and in one possible implementation, determining the target time period includes: and taking the starting appearance time of the first character in the plurality of words in the video as the starting time of the target time period, and taking the ending appearance time of the last character in the plurality of words in the video as the ending time of the target time period.
303. Determine a target video segment corresponding to the target time period from the video.
The time period corresponding to the target video segment is the same as the time period corresponding to the plurality of words, so the playing content of the target video segment corresponds to the plurality of words. For example, if the plurality of words is "deduction 1, deduction 1", the target video segment is the video segment in which the anchor is saying "deduction 1, deduction 1".
It should be noted that the embodiment of the present disclosure is only exemplified by adding a dynamic effect in which a plurality of words appear in sequence in the target video segment when the text information includes a plurality of words that are the same and consecutive, and in some embodiments, the display positions of the plurality of words are determined by the target object, so that the step of adding the dynamic effect in which the plurality of words appear in sequence in the target video segment is performed when the target object is included in the target video segment. For example, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment includes: and adding a dynamic effect of a plurality of words appearing in sequence in the target video segment in response to the target object being included in the target video segment. In the embodiment of the present disclosure, the added dynamic effect is to emphasize content emphasized by a person in a video, and therefore, the added dynamic effect is related to content information included in the target video segment, before the dynamic effect is added, it is determined whether a target object is included in the target video segment, and when the target object is included in the target video segment, the dynamic effect is added, so that the dynamic effect is related to the content in the target video segment, thereby improving the video processing effect.
The condition that the target video segment includes the target object may mean that the first video frame of the target video segment includes the target object, or that each video frame in the target video segment includes the target object. In one possible implementation, in response to the target object being included in the target video segment, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment includes: adding a dynamic effect in which a plurality of words appear in sequence in the target video segment in response to the first video frame in the target video segment including the target object; or, in response to each video frame in the target video segment including the target object, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment; or, in response to the first video frame and the last video frame in the target video segment including the target object, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment.
It should be noted that, since the duration of the target video segment is generally short, if the first video frame and the last video frame in the target video segment include the target object, in most cases, each video frame in the target video segment includes the target object, in one possible implementation, it is determined whether the target video segment includes the target object by determining whether the first video frame and the last video frame in the target video segment include the target object.
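The first/last-frame check described above can be stated in a few lines; `detect` below is a placeholder for whatever object detector is used, which the patent does not specify.

```python
# A sketch of the first/last-frame check described above. `detect` is a
# placeholder for an unspecified object detector returning True when the
# target object appears in a frame.
from typing import Callable, Sequence

def segment_contains_target(frames: Sequence, detect: Callable[[object], bool]) -> bool:
    # The target video segment is usually short, so if both the first and
    # the last frames contain the target object, most frames likely do too.
    return len(frames) > 0 and detect(frames[0]) and detect(frames[-1])
```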
It should be noted that, if the first video frame in the target video segment includes the target object, and some of the remaining video frames include the target object and some do not include the target object, the display position of each word may be determined according to the position of the target object in the first video frame, so that the corresponding rendering is performed in each video frame according to the determined display position of each word.
It should be noted that the embodiment of the present disclosure describes the processing of the video only by taking as an example first identifying the audio information and then checking whether the target video segment includes the target object. In another embodiment, the video is first checked for the target object, and only then is it determined whether to identify the audio information.
In one possible implementation manner, identifying audio information in a video to obtain text information corresponding to the audio information and an occurrence time of each character in the text information in the video includes: and responding to the video including the target object, identifying the audio information in the video, and obtaining text information corresponding to the audio information and the occurrence time of each character in the text information in the video.
That is, if the video does not include the target object, the text information is obtained by not identifying the audio information in the video, and only if the video includes the target object, the text information is obtained by identifying the audio information in the video.
The disclosed embodiments allow for the addition of dynamic effects in a target video segment only if the target video segment includes a target object. Therefore, if the target video segment does not include the target object, the audio information is recognized, and the resulting text information is useless. Therefore, according to the embodiment of the disclosure, whether the target object is included in the video is determined first, and then the audio information is identified, so that useless work of the electronic device is reduced, the calculation amount of the electronic device is reduced, and the efficiency of video processing is improved.
The condition that the video includes the target object may mean that each video frame in the video includes the target object, or that some video frames randomly extracted from the video include the target object.
The target object may be any object, such as a human face, a microphone, a mobile phone, and the like. In some embodiments, the target object is a preset object, and in one possible implementation, in response to the target object being included in the target video segment, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment includes: and responding to the target object corresponding to the locally stored object identification included in the target video segment, and adding a dynamic effect in which a plurality of words appear in sequence in the target video segment.
In some embodiments, the target object may be an object introduced by a person in the video, for example, the textual information includes "this makeup remover is really good, clean on wiping, buy it", then the person in the video is introducing the makeup remover, the emphasized content is also related to the makeup remover, if the makeup remover is included in the target video segment, then dynamic captioning is added to the target video segment to make the audience focus more on the makeup remover, and if the makeup remover is not included in the target video segment, then dynamic captioning is no longer added to the target video segment. In one possible implementation, in response to a target object being included in a target video segment, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment includes: and extracting the object name of at least one statement before the plurality of words, and adding a dynamic effect of the plurality of words appearing in sequence in the target video segment in response to the target object corresponding to the extracted object name included in the target video segment.
Wherein, at least one sentence before the plurality of words can be a first sentence before the plurality of words, a first and a second sentence before the plurality of words, or a first to a third sentence before the plurality of words, etc. to ensure that the at least one sentence is associated with the content of the plurality of words.
In some embodiments, the target object may also be an object held by a person in the video. In some cases, the object held by the person is related to the content the person wants to emphasize; for example, the person in the video is a shopping anchor holding a commodity, and the anchor introduces the commodity to the audience so as to emphasize its advantages. In one possible implementation, in response to a target object being included in a target video segment, adding a dynamic effect in which a plurality of words appear in sequence in the target video segment includes: adding a dynamic effect in which a plurality of words appear in sequence in the target video segment in response to the target video segment including a target object held by a person.
It should be noted that this step 303 is an optional execution step, and in some embodiments, the target video segment is not determined from the video, but the starting display time and the display time length of each word are determined, and a video frame meeting the display condition in the video is rendered, where the video frame meeting the display condition is a video frame whose playing time is equal to or later than the starting display time of the word and is earlier than or equal to the ending display time of the word. The video is processed through the initial display time and the display duration of each word, and the dynamic effect that a plurality of words appear in sequence in the target video segment can be achieved.
In some embodiments, after the target time period is determined, a step of adding a dynamic effect in which a plurality of words appear in sequence in a target video segment corresponding to the target time period in the video is performed. In one possible implementation manner, adding a dynamic effect in which a plurality of words appear in sequence in a target video segment corresponding to a target time segment in a video includes: determining the initial display time and the display duration of each word in the plurality of words, wherein the initial display time of the first word in the plurality of words is the initial time of a target time period, the initial display times of the plurality of words are sequentially increased according to the word arrangement sequence, the ending display time of the plurality of words is the ending time of the target time period, and the ending display time of the words is the sum of the initial display time of the words and the display duration of the words; for each word, the word is rendered in a video frame having a play time equal to or later than the start display time of the word and earlier than or equal to the end display time of the word.
In the embodiment of the present disclosure, the initial display time of the plurality of words is sequentially increased according to the word arrangement order, that is, the later the word arrangement order is, the later the initial display time is; and the ending display time of the plurality of words is consistent, so that the effect that the plurality of words sequentially appear and disappear simultaneously can be realized after the words are rendered in the corresponding video frames according to the starting display time and the ending display time of the words.
It should be noted that, when a dynamic effect that a plurality of words sequentially appear and disappear is added to a video, for a video frame whose playing time is earlier than the starting display time of a first word, the device does not process the video frame, and for a video frame whose playing time is later than the ending display time of a word, the device does not process the video frame, so that the device can determine a video frame corresponding to the starting display time from the video according to the starting display time of the first word, sequentially process the video frame and other video frames following the video frame from the video frame, and stop processing when it is determined that the playing time of a certain video frame is later than the ending display time of a word.
In some embodiments, the determining the initial display time and the display duration for each of the plurality of words comprises: dividing the target time period according to the number of words of the words to obtain a plurality of sub-time periods with the same time length; respectively determining the starting time of each sub-time period as the starting display time of each word according to the arrangement sequence of the sub-time periods and the arrangement sequence of the words; and for each word, determining the display time length of the word according to the time length corresponding to the sub-time period and the number of the words behind the word in the plurality of words.
The time length corresponding to the sub-time period is the display time interval of adjacent words in the words, and the time lengths of the sub-time periods are the same, so that the words can be sequentially displayed according to the same time interval, and the words disappear simultaneously after appearing sequentially.
For example, determining the display duration of a word according to the number of words following it and the display time interval between adjacent words includes: adding 1 to the number of words following the word, multiplying the result by the display time interval, and taking the product as the display duration of the word.
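The timing computation above is compact enough to sketch directly. The following Python snippet is a minimal, illustrative implementation of it; the function and variable names are ours, not the patent's.

```python
def word_display_schedule(t_start, t_end, num_words):
    """Divide the target time period into equal sub-periods and derive each
    word's starting display time and display duration (both in seconds)."""
    interval = (t_end - t_start) / num_words      # length of one sub-period
    schedule = []
    for i in range(num_words):
        start = t_start + i * interval            # start of the i-th sub-period
        words_after = num_words - 1 - i           # words ranked after this one
        duration = (words_after + 1) * interval   # (count after + 1) x interval
        schedule.append((start, duration))        # start + duration == t_end
    return schedule

# Example: 5 identical words over a 5-second target period. Start times are
# 0, 1, 2, 3, 4 s; every word's ending display time is 5 s, so all words
# disappear together.
print(word_display_schedule(0.0, 5.0, 5))
```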
304. Determining the words required to be displayed in each video frame in the target video segment from the plurality of words, wherein the number of words required to be displayed in any video frame is not less than the number of words required to be displayed in the previous video frame.
In order to improve the display effect of the subtitles, when the subtitles corresponding to the plurality of words are added to the target video segment, a dynamic effect that the plurality of words appear in sequence is added, so that the plurality of words are emphasized again through the dynamic effect.
It should be noted that, in some embodiments, the electronic device performs word rendering on each video frame in the target video segment, adding a dynamic effect in which the plurality of words appear in sequence. Since a word does not disappear when the next word appears, the words accumulate in order, and the electronic device renders every one of the plurality of words in the last video frame of the target video segment. However, the electronic device does not perform word rendering on the first video frame after the target video segment, that is, that frame contains none of the plurality of words.
Since more words are displayed in the video frame that is further back in the target video segment, when the processed target video segment is played, the effect that a plurality of words appear in sequence can be realized in the target video segment. When the next video frame of the target video segment in the video is played, because no words exist in the next video frame, the effect that a plurality of words disappear after appearing in sequence is achieved.
Since the words appear in sequence, the words required to be displayed in each video frame of the target video segment are not identical. For example, the plurality of words is 5 words, the target video segment includes 50 video frames, wherein 1 to 10 video frames need to display 1 word, 11 to 20 video frames need to display 2 words, 21 to 30 video frames need to display 3 words, 31 to 40 video frames need to display 4 words, and 41 to 50 video frames need to display 5 words.
In the disclosed embodiment, a plurality of words appear in sequence, and therefore, each word has its own initial display time, and the initial display time of each word is different. By controlling the starting display time of each word, a plurality of words can be presented in sequence. In one possible implementation, determining a term that needs to be displayed for each video frame in the target video segment from the plurality of terms includes: determining a starting display time for each of the plurality of terms; and for each video frame in the target video segment, determining the words with the starting display time earlier than or equal to the playing time corresponding to the video frame as the words required to be displayed by the video frame.
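As a hedged sketch, the per-frame selection just described reduces to a simple filter; the names below are illustrative only.

```python
def words_for_frame(frame_time, words, start_times):
    """Return the words whose starting display time is earlier than or equal
    to the frame's playing time, preserving word order."""
    return [w for w, t in zip(words, start_times) if t <= frame_time]

# With start times 0..4 s, a frame played at 2.5 s shows the first three words.
print(words_for_frame(2.5, ["加油", "加油", "加油", "加油", "加油"], [0, 1, 2, 3, 4]))
```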
In some embodiments, each word appears sequentially at the same time interval, and determining the starting display time of each word in the plurality of words comprises: dividing the target time period according to the number of the words to obtain a plurality of sub-time periods with the same duration; and respectively determining the starting time of each sub-time period as the starting display time of each word according to the arrangement sequence of the sub-time periods and the arrangement sequence of the words. Wherein, determining the starting time of each sub-period as the starting display time of each word means: when the arrangement sequence of the words is the same as that of the sub-time periods, the starting time of the sub-time periods is determined as the starting display time of the words, so that each word can appear at the same time interval, and the dynamic disappearance effect is improved.
For example, suppose the plurality of words is 5 words and the target time period runs from 00:00 to 00:05. The target time period is divided into five 1-second sub-periods, and the starting display times of the 5 words are 00:00, 00:01, 00:02, 00:03 and 00:04 respectively.
When a person repeats a word, the interval between repetitions is usually very short. Therefore, in the embodiments of the present disclosure, displaying the plurality of words in sequence at equal time intervals not only reduces the amount of computation but also keeps each displayed word roughly aligned with the moment the person speaks it in the video, ensuring the accuracy of the dynamic subtitle display.
In other embodiments, the display time of each word is the time at which the person in the video utters that word. In one possible implementation, determining the starting display time of each of the plurality of words includes: determining the initial occurrence time of the first character in each word as the starting display time of that word, so that each word is displayed at the moment the person in the video speaks it, keeping the subtitle content consistent with the video content.
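A minimal sketch of this alternative follows, assuming the speech recognition step yields a per-character occurrence time; the data layout is an assumption for illustration.

```python
def start_times_from_audio(word_char_times):
    """word_char_times: one entry per repeated word, each a list of
    (character, occurrence_time) tuples from speech recognition. A word's
    starting display time is the occurrence time of its first character."""
    return [chars[0][1] for chars in word_char_times]

# Two repetitions of a two-character word, spoken at 1.2 s and 1.9 s.
reps = [[("加", 1.2), ("油", 1.4)], [("加", 1.9), ("油", 2.1)]]
print(start_times_from_audio(reps))  # [1.2, 1.9]
```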
305. And respectively determining the display position corresponding to the word to be displayed in each video frame.
In some embodiments, the display positions of the same word in different video frames are the same, while in another embodiment, the display positions of the same word in different video frames may be different, and the display positions need to be determined for the words according to the display content in each video frame; in some embodiments, the display position of each word is preset, and in another embodiment, the display position of each word is calculated in real time, and the display position of the word is not limited in the embodiments of the present disclosure.
In a possible implementation manner, the plurality of words are displayed at positions corresponding to a target object. The embodiments of the present disclosure illustrate the process of determining display positions by taking the determination of word display positions relative to a target object as an example.
In one possible implementation, the plurality of words are displayed outside of the target object and the plurality of words are displayed around the target object. In some embodiments, the number of the plurality of words is small, and the plurality of words is displayed in one circle outside the target object. In other embodiments, the number of the plurality of words is larger, and the plurality of words are respectively displayed in a plurality of circles outside the target object.
In some embodiments, determining the display position corresponding to the word to be displayed in each video frame separately includes: determining display positions of the plurality of words on a first circular curve centered on the target object and having a first distance as a radius in response to the number of words of the plurality of words being less than or equal to a first number; or, in response to the number of words of the plurality of words being greater than the first number, determining display positions of a first number of words of the plurality of words on a first circular curve, and determining display positions of remaining words of the plurality of words on a second circular curve centered on the target object and having a second distance as a radius, the second distance being greater than the first distance.
The first number is any number, for example 3 or 5; the embodiments of the present disclosure do not limit the first number. Since the second distance is greater than the first distance, the first circular curve is an inner circle of the target object and the second circular curve is an outer circle. That is, the electronic device preferentially arranges words on the inner circle, and once the inner circle is full, arranges the remaining words on the outer circle. In this way, the embodiments of the present disclosure can lay out the plurality of words reasonably according to their number, avoiding an overcrowded layout when there are many words and a sparse layout when there are few, so that the layout effect is better and the video processing effect is improved.
For example, when the number of the plurality of words is less than or equal to 5, the plurality of words is displayed in the inner circle of the target object, and when the number of the plurality of words is greater than 5, 5 words are displayed in the inner circle of the target object, and the remaining words are displayed in the outer circle. The inner circle of the target object is a boundary of a circular area with the target object as a center and the first length as a radius. The outer circle of the target object is a boundary of a circular region centered on the target object and having a second length as a radius. Wherein the first length is less than the second length.
It should be noted that the embodiments of the present disclosure are only exemplified by the case that the second distance is greater than the first distance, and in some embodiments, the second distance is smaller than the first distance. That is, the electronic device preferentially arranges the words in the outer ring, and when the outer ring is not arranged, the rest words are arranged in the inner ring.
Therefore, the distance between a word and the target object is influenced by the number of words, and thus the distance between a word and the target object may be determined according to the number of words. It should be noted that the embodiments of the present disclosure describe only an outer circle and an inner circle by way of example; in other embodiments the number of circles is not limited, and the electronic device may lay out the words on one circle, two circles, three circles, and so on.
In one possible implementation manner, the determining a display position corresponding to a word to be displayed in each video frame separately includes: determining a plurality of continuous sequence number intervals and distances corresponding to the plurality of sequence number intervals, wherein any sequence number interval represents the sequence number of words and expressions which can be displayed on a circular curve which takes a target object as the center and takes the distance corresponding to any sequence number interval as the radius, any sequence number interval is larger than the previous sequence number interval of any sequence number interval, and the distance corresponding to any sequence number interval is larger than the distance corresponding to the previous sequence number interval; determining the distance between each word and the target object as the distance corresponding to the sequence number interval to which the sequence number of each word belongs according to the sequence number of each word in the plurality of words; and for each video frame, determining the display position of the word in the video frame according to the display position of the target object in the video frame and the distance between the word to be displayed in the video frame and the target object.
The sequence number of a word indicates its display order among the plurality of words. Saying that any sequence number interval is greater than the previous sequence number interval means that the minimum value of that interval is greater than the maximum value of the previous interval. By dividing the sequence numbers into a plurality of intervals, the number of words displayed on a circular curve centered on the target object with a given distance as its radius is limited.
For example, the plurality of sequence number intervals are [1,5] and [6, 12], where the distance corresponding to the sequence number interval [1,5] is 3 centimeters, and the distance corresponding to the sequence number interval [6, 12] is 4 centimeters, then the distances between the 1 st to 5 th words in the plurality of words and the target object are 3 centimeters, and the distances between the 6 th to 12 th words in the plurality of words and the target object are 4 centimeters.
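A hedged sketch of this interval lookup follows; the interval boundaries and radii reuse the example values above and are not fixed by the method.

```python
# (sequence-number range, radius) pairs, reusing the example values above.
INTERVALS = [((1, 5), 3.0), ((6, 12), 4.0)]

def radius_for_word(seq_num, intervals=INTERVALS):
    """Map a word's 1-based sequence number to its distance from the target
    object: the radius attached to the interval the sequence number falls in."""
    for (lo, hi), radius in intervals:
        if lo <= seq_num <= hi:
            return radius
    raise ValueError("sequence number outside all configured intervals")

print(radius_for_word(3))  # 3.0 -> words 1-5 sit on the inner circle
print(radius_for_word(8))  # 4.0 -> words 6-12 sit on the outer circle
```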
It should be noted that, after determining the distance between the word and the target object, the word may be displayed at any position on a circular curve with the target object as the center and the distance as the radius, for example, a position is randomly determined from the circular curve as the display position of the word. As another example, the plurality of words are assigned display positions according to respective rules.
In some embodiments, the number of words displayed on the circular curve may be different each time the electronic device is performing video processing. In one possible implementation, the electronic device reasonably arranges the words to be displayed according to the number of the words to be displayed on the circular curve.
For example, determining the display position of a word in a video frame according to the display position of a target object in the video frame and the distance between the word to be displayed in the video frame and the target object includes: for any word in a video frame that needs to be displayed:
determining the relative position of the word and the center of the target object according to the distance between the word and the target object and the number of the words needing to be displayed on a circular curve which takes the target object as the center and the distance as the radius in the target video segment; acquiring the display position of the center of the target object in a video frame; and determining the display position of the word in the video frame according to the relative position of the word and the center of the target object and the display position of the center of the target object in the video frame.
Therefore, in the embodiment of the present disclosure, the electronic device can reasonably arrange the words to be displayed according to the number of the words to be displayed on the circular curve, so that the arrangement effect of a plurality of words on the circular curve is improved, and further, the video processing effect is improved.
In one possible implementation, the relative position of the word to the center of the target object includes: an included angle between a connecting line of the words and the center of the target object and a reference datum line, wherein the reference datum line is a ray which points to a reference direction by taking the center of the target object as a starting point, and the included angle is calculated according to the following formula:
α = α_max × (n_0 - n) / (n_0 - 1)

where n is the sequence number of the word (an integer greater than or equal to 1), α is the included angle corresponding to the n-th word, α_max is the maximum value of the included angle range corresponding to the plurality of words to be displayed on the circular curve centered on the target object with the distance as its radius in the target video segment, and n_0 is the number of words to be displayed on that circular curve.
It should be noted that, in the embodiments of the present disclosure, when words are displayed on a circular curve, there is a display range. The display range is an included angle range corresponding to a plurality of words. For example, the display range is [0 °,360 ° ], the display range of the word on the circular curve is the entire circular curve; as another example, the display range is [0 °,180 ° ], the display range of the word on the circular curve is the upper half of the circular curve.
In one possible implementation, the reference line is a polar axis of polar coordinates with the center of the target object as an origin. And an included angle between a connecting line of the words and the center of the target object and the reference datum line is a polar angle of the words in the polar coordinate system. In a polar coordinate system, any position is described by a polar diameter and a polar angle. In the embodiment of the present disclosure, when determining the display position of a word, the relative position of the word and a target object is determined through a polar coordinate system, and then coordinate conversion is performed to obtain the position of the word in a rectangular coordinate system, where the rectangular coordinate system is a rectangular coordinate system of a video frame, and therefore, the position of the word in the rectangular coordinate system is the display position of the word in the video frame. In one possible implementation, the relative position of the word to the center of the target object is represented by the position of the word in polar coordinates with the center of the target object as the origin.
In some embodiments, determining the relative position of a word to the center of the target object based on the distance between the word and the target object and the number of words to be displayed on a circular curve of the target video segment centered on the target object and having the distance as a radius comprises: determining the distance between the words and the target object as the polar diameter of the words in a polar coordinate system taking the center of the target object as an origin, and acquiring polar angle ranges corresponding to a plurality of words at the distance from the target object based on the distance between the words and the target object; and determining the polar angle of the word according to the polar angle range and the sequence number of the word.
Pluralities of words at different distances from the target object may contain different numbers of words. For example, if 5 words lie 3 centimeters from the target object and the corresponding polar angle range is 0 to 180 degrees, the polar angles of those 5 words are 5 values within 0 to 180 degrees. For another example, if 7 words lie 5 centimeters from the target object and the corresponding polar angle range is 0 to 270 degrees, the polar angles of those 7 words are 7 values within 0 to 270 degrees.
In some embodiments, the polar angle of a word is not only related to the polar angle range but also to the sequence number of the word, i.e. also to the display order of the word. For example, the polar angle of a word is positively correlated with the sequence number of the word, or the polar angle of a word is negatively correlated with the sequence number of the word. If the polar angle of the word is in positive correlation with the sequence number of the word, the dynamic effect added by the electronic equipment in the target video segment is as follows: a plurality of words appear in turn according to a counterclockwise sequence; if the polar angle of the word is in a negative correlation with the sequence number of the word, the dynamic effect added by the electronic device in the target video segment is as follows: the words appear in order clockwise.
In one possible implementation, the polar angle differences of adjacent words at the same distance from the target object are the same, that is, words at the same distance from the target object are displayed at a constant angular interval. In some embodiments, determining the polar angle of a word from the polar angle range and the word's sequence number comprises: determining the polar angle interval between adjacent words at that distance from the target object according to the number of such words and the polar angle range; and determining the polar angle of the word according to the polar angle interval, the polar angle range and the word's sequence number. Keeping the interval between adjacent words constant makes the words displayed in the video more orderly and improves the dynamic display effect.
For example, as shown in fig. 4, when the plurality of words is 2 words, the polar angle of the first word is 120 degrees, and the polar angle of the second word is 60 degrees; when the plurality of words are 3 words, the polar angle of the first word is 180 degrees, the polar angle of the second word is 90 degrees, and the polar angle of the third word is 0 degree; when the plurality of words are 4 words, the polar angle of the first word is 180 degrees, the polar angle of the second word is 120 degrees, the polar angle of the third word is 60 degrees, and the polar angle of the fourth word is 0 degree. When the plurality of words are 5 words, the polar angle of the first word is 180 degrees, the polar angle of the second word is 135 degrees, the polar angle of the third word is 90 degrees, the polar angle of the fourth word is 45 degrees, and the polar angle of the fifth word is 0 degree.
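The equal-interval assignment can be sketched as below. The formula follows the reconstruction given above, α = α_max × (n_0 - n) / (n_0 - 1), and reproduces the 3-, 4- and 5-word layouts of Fig. 4; the 2-word layout there (120 and 60 degrees) appears to inset the endpoints and is not covered by this sketch.

```python
def polar_angles(n0, alpha_max=180.0):
    """Return the polar angle (in degrees) of each of n0 words on one ring,
    equally spaced from alpha_max down to 0 in display order."""
    if n0 == 1:
        return [alpha_max]
    return [alpha_max * (n0 - n) / (n0 - 1) for n in range(1, n0 + 1)]

print(polar_angles(3))  # [180.0, 90.0, 0.0]
print(polar_angles(5))  # [180.0, 135.0, 90.0, 45.0, 0.0]
```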
For example, as shown in FIG. 5, a first word is displayed in a first video frame, a first word and a second word are displayed in a second video frame, and a first word, a second word, and a third word are displayed in a third video frame. Wherein the intervals between the first word, the second word and the third word are the same.
It should be noted that the embodiments of the present disclosure are merely exemplary illustrations based on the determination of the polar angle for the term according to the serial number of the term. In other embodiments, a polar angle may be determined randomly for a word when determining the polar angle of the word. For example, the polar angle of a word is only related to the polar angle range, in which a value is randomly chosen for the word as the polar angle of the word. The disclosed embodiments do not limit the process of determining the polar angle.
In some embodiments, determining the display position of the word in the video frame according to the relative position of the word to the center of the target object and the display position of the center of the target object in the video frame includes: performing coordinate conversion on the polar diameter and the polar angle of the words according to the display position of the target object in the video frame needing to display the words to obtain the coordinates of the words in a rectangular coordinate system of the video frame; the coordinates are determined as the display position of the words in the video frame.
The rectangular coordinate system of the video frame may be a coordinate system with an origin at any position of the video frame.
The coordinates of the words in the rectangular coordinate system of the video frame are calculated according to the following formula:
x = x_0 + r × cos α

y = y_0 + r × sin α

where x and y are the abscissa and ordinate of the word in the rectangular coordinate system of the video frame, x_0 and y_0 are the abscissa and ordinate of the center of the target object in that coordinate system, α is the polar angle of the word, and r is the polar radius of the word.
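A direct, hedged transcription of this conversion follows (angles taken in degrees; in an actual frame buffer the y-axis may point downward, which would flip the sine term).

```python
import math

def word_position(x0, y0, r, alpha_deg):
    """Convert a word's polar position (radius r, polar angle alpha, relative
    to the target object's center (x0, y0)) into the rectangular coordinates
    of the video frame."""
    a = math.radians(alpha_deg)
    return (x0 + r * math.cos(a), y0 + r * math.sin(a))

# A word 3 units from a face centered at (640, 360), at a 90-degree polar angle.
print(word_position(640, 360, 3, 90))  # approximately (640.0, 363.0)
```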
In some embodiments, the target object is an irregular object, and when the center of the target object is determined, the position of the center of the target object can be obtained through the recognition frame by recognizing the video frame. In one possible implementation, the target object is a human face, the recognition frame is a face frame, the center of the face frame is determined as the center of the human face, and the length and the width of the face frame are used as the length and the width of the human face.
306. Rendering the corresponding words at the determined display positions in each video frame.
In some embodiments, the display patterns of the words rendered in the plurality of video frames are consistent, and in other embodiments, the display patterns of the words rendered in the plurality of video frames are not the same.
In one possible implementation, the display style of the word includes the size of each character in the word. In some embodiments, the size of each character in a word is also determined when the corresponding word is displayed at the determined display position in each video frame. Displaying the corresponding words at the determined display positions in each video frame, including: for each video frame, determining the size of a word required to be displayed by the video frame; words are rendered according to their size, at the determined display position in the video frame.
It should be noted that in some embodiments, when rendering a corresponding word in each video frame, the size of each character in the word needs to be re-determined according to the display content in the video, and in other embodiments, the size of each character in the word may be preset in advance, or the size of each character in the word may be determined according to the first video frame and used in subsequent video frames, where the size of the character may be the font size of the character, and the like. The size of the character is not limited in the embodiments of the present disclosure.
In some embodiments, rendering the corresponding word at the determined display position in each video frame includes: determining the size of each character in the words according to the distance between the words to be displayed and the target object in any video frame, the number of the characters in the words and the size of the target object; each character in the word is rendered at the determined display position in the video frame according to the size of each character in the word.
The size of the characters in a word is negatively correlated with the number of characters in the word, positively correlated with the size of the target object, and negatively correlated with the distance between the word and the target object.
When a word contains many characters, a large font would make the word look crowded once rendered in the video frame. Therefore, when the number of characters in a word is large, the character size can be reduced, and when the number of characters is small, the character size can be increased.
In addition, in some embodiments, the plurality of words are displayed around the target object. If the target object occupies a large area in the video frame, the font size of the characters should correspondingly be relatively large for a good display effect; if the target object occupies a small area, the font size should correspondingly be relatively small.
It should be noted that determining the size of each character according to the distance between the word and the target object, the number of characters in the word, and the size of the target object is only an example. In another embodiment, the size of each character may be determined according to at least one of those three factors, or according to other content in the video; alternatively, the size of each character in the plurality of words may be determined according to the volume of the words in the audio information, a larger volume in the audio information yielding a larger character size.
Wherein the size of each character in the word is calculated according to the following formula:
size = k_1 × √(L × W) / n^(k_2)

where size is the size of each character in the word; k_1 is a coefficient associated with the size of the target object and the distance between the word and the target object, positively correlated with the size of the target object and negatively correlated with the distance; L is the length of the target object; W is the width of the target object; n is the number of characters in the word; and k_2 is a coefficient associated with the number of characters in the word, positively correlated with that number. k_1 is any value greater than 0, and k_2 is any value greater than 0 and less than 1.
Calculating the character sizes through this formula is more accurate, which improves the display effect of the plurality of words and therefore the video processing effect.
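Because the formula rendering above had to be reconstructed, the following snippet is only a speculative sketch of the size computation under that assumption; the concrete choices of k_1 and k_2 are illustrative, not values from the patent.

```python
import math

def character_size(L, W, n, distance, base=0.05):
    """Speculative sketch: size = k1 * sqrt(L * W) / n**k2, where k1 shrinks
    with the word's distance from the target object and k2 (in (0, 1)) grows
    with the character count."""
    k1 = base / (1.0 + distance)   # negatively correlated with the distance
    k2 = min(0.9, 0.3 + 0.05 * n)  # in (0, 1), growing with the character count
    return k1 * math.sqrt(L * W) / n ** k2

# A 4-character word, 3 units from a 200 x 240 face frame.
print(round(character_size(200, 240, 4, 3.0), 2))  # ~1.37
```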
The display style of the word may include other styles besides the size of each character in the word, for example, the display color of the word, the font of the word, and the like.
In some embodiments, the first word bank is stored locally and is obtained locally; alternatively, the first word bank is obtained from a server or another device. The first word bank includes a plurality of different words and a display style parameter for each word, where the display style parameter indicates the display style of the word, such as its color, font, and the like.
Therefore, when the words are rendered, the display style parameters of the words can be acquired from the first word bank, and the words are rendered according to the display style parameters. In one possible implementation, rendering the corresponding word at the determined display position in each video frame includes: acquiring display style parameters corresponding to the words from a first word bank, wherein the first word bank comprises a plurality of different words and the display style parameters of each word, and the display style parameters are used for indicating the display style of each word; and rendering the corresponding words at the determined display positions in each video frame according to the display style parameters of the words.
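A minimal sketch of such a lookup follows; the bank layout and the default style are assumptions for illustration.

```python
FIRST_WORD_BANK = {
    "加油": {"color": "#FF4D4F", "font": "bold"},
    "棒":   {"color": "#FFC53D", "font": "italic"},
}
DEFAULT_STYLE = {"color": "#FFFFFF", "font": "regular"}

def style_for_word(word, bank=FIRST_WORD_BANK):
    """Fetch the display style parameters of a word from the first word bank,
    falling back to a default style for words not in the bank."""
    return bank.get(word, DEFAULT_STYLE)

print(style_for_word("加油"))  # style configured in the bank
print(style_for_word("你好"))  # falls back to the default style
```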
The video processing method provided by the embodiment of the disclosure is a method for automatically adding dynamic subtitles to a video, when a person in the video emphasizes contents to be expressed by repeatedly speaking a certain word, the dynamic subtitles can be added according to the contents emphasized by the person in the video, and the emphasized contents are highlighted through the dynamic subtitles, so that the labor cost is reduced, and the video processing effect is improved.
According to the video processing method provided by the embodiment of the disclosure, when the display positions of the words are determined, the number of the words, the polar angle range corresponding to the words, the interval between adjacent words and the like are considered, so that when a plurality of words with different numbers appear in text information and are identical and continuous, the display styles of the words are kept consistent, and the video processing effect is improved.
As shown in fig. 6, in the embodiment of the present disclosure, the audio information in a video is identified to obtain text information, and the repeated word and its number of repetitions are determined from a plurality of identical and continuous words in the text information. When the target video segment corresponding to the words includes a face, each video frame in the target video segment is identified, and the position and size of the face frame in each video frame are determined. The starting display time of each word is determined according to the number of repetitions; the display position of each word is determined according to the number of repetitions and the position and size of the face frame; and the size of the characters in each word is determined according to the number of characters in the word, the number of repetitions, and the size of the face frame. After the starting display time, display position and character size of each word are determined, the corresponding words are rendered onto the corresponding video frames of the original video.
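Putting the pieces together, the following hedged sketch strings the earlier helper functions into the Fig. 6 pipeline; draw_text stands in for whatever rendering primitive is actually used, and all names remain illustrative.

```python
def draw_text(frame, text, xy, size):  # stand-in for the real renderer
    print(f"draw {text!r} at ({xy[0]:.1f}, {xy[1]:.1f}), size {size:.2f}")

def add_dynamic_subtitles(frames, fps, words, t_start, t_end,
                          face_center, face_size):
    """Render the repeated words onto the frames of the target video segment,
    reusing word_display_schedule, radius_for_word, polar_angles,
    word_position and character_size from the sketches above."""
    schedule = word_display_schedule(t_start, t_end, len(words))
    angles = polar_angles(len(words))
    L, W = face_size
    for idx, frame in enumerate(frames):
        t = t_start + idx / fps  # playing time; frames are the segment's frames
        for i, (word, (start, _)) in enumerate(zip(words, schedule)):
            if start <= t <= t_end:              # word visible in this frame
                r = radius_for_word(i + 1)
                x, y = word_position(*face_center, r, angles[i])
                size = character_size(L, W, len(word), r)
                draw_text(frame, word, (x, y), size)
    return frames
```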
Fig. 7 is a block diagram illustrating a video processing apparatus according to an example embodiment. Referring to fig. 7, the video processing apparatus includes:
the recognition unit 701 is configured to perform recognition on audio information in a video, and obtain text information corresponding to the audio information and occurrence time of each character in the text information in the video;
a determining unit 702 configured to perform, in response to a plurality of words in the text information being identical and consecutive, determining a target time period according to appearance times of a first character and a last character in the plurality of words in the video, the target time period being used to represent the appearance time period of the plurality of words in the video, each word in the text information being composed of at least one character;
the adding unit 703 is configured to add a dynamic effect in which a plurality of words appear in sequence in a target video segment corresponding to a target time period in the video.
As shown in fig. 8, in some embodiments, the adding unit 703 is configured to perform adding a dynamic effect in which a plurality of words appear in sequence in the target video segment in response to the target object being included in the target video segment.
In some embodiments, the identifying unit 701 is configured to perform identification on the audio information in the video in response to the target object included in the video, and obtain text information corresponding to the audio information and an occurrence time of each character in the text information in the video.
In some embodiments, the adding unit 703 includes:
a word determining subunit 7031 configured to perform determining, from the plurality of words, the words that need to be displayed in each video frame in the target video segment, where the number of words to be displayed in any video frame is not less than the number of words to be displayed in the previous video frame;
a position determining subunit 7032, configured to perform respective determination of display positions corresponding to the words to be displayed in each video frame;
a rendering subunit 7033 configured to perform the determined display position in each video frame to render the corresponding word.
In some embodiments, the word determining subunit 7031 is configured to perform determining a starting display time of each of the plurality of words; determining the target video segment corresponding to the target time period from the video; and, for each video frame in the target video segment, determining the words whose starting display time is earlier than or equal to the playing time corresponding to the video frame as the words required to be displayed by that video frame.
In some embodiments, the rendering subunit 7033 is configured to perform determining a size of each character in the word according to a distance between the word to be displayed in any one of the video frames and the target object, the number of characters in the word, and a size of the target object; each character in the word is rendered according to the determined display position in the video frame according to the size of each character in the word.
In some embodiments, the size of each character in the word is calculated according to the following formula:
size = k_1 × √(L × W) / n^(k_2)

where size is the size of each character in the word; k_1 is a coefficient associated with the size of the target object and the distance between the word and the target object, positively correlated with the size of the target object and negatively correlated with the distance; L is the length of the target object; W is the width of the target object; n is the number of characters in the word; and k_2 is a coefficient associated with the number of characters in the word, positively correlated with that number. k_1 is any value greater than 0, and k_2 is any value greater than 0 and less than 1.
In some embodiments, the position determining subunit 7032 is configured to perform determining, in response to the number of words of the plurality of words being less than or equal to a first number, display positions of the plurality of words on a first circular curve centered on the target object and having a radius of a first distance; or,
a position determining subunit 7032 configured to perform, in response to the number of words of the plurality of words being greater than the first number, determining display positions of a preceding first number of words of the plurality of words on a first circular curve, and determining display positions of remaining words of the plurality of words on a second circular curve centered on the target object and having a second distance as a radius, the second distance being greater than the first distance.
In some embodiments, the position determining subunit 7032 is configured to perform determining a plurality of consecutive sequence number intervals and distances corresponding to the plurality of sequence number intervals, where any sequence number interval represents a sequence number of a word displayable on a circular curve centered on the target object and having a radius equal to a distance corresponding to any sequence number interval, where any sequence number interval is greater than a previous sequence number interval of any sequence number interval, and a distance corresponding to any sequence number interval is greater than a distance corresponding to the previous sequence number interval;
a position determining subunit 7032 configured to perform determining, according to the sequence number of each of the plurality of words, that the distance between each word and the target object is a distance corresponding to a sequence number interval to which the sequence number of each word belongs;
a position determining subunit 7032, configured to perform, for each video frame, determining, according to the display position of the target object in the video frame and the distance between the word to be displayed in the video frame and the target object, the display position of the word in the video frame.
In some embodiments, the position determining subunit 7032 is configured to perform:
for any word that needs to be displayed in a video frame,
determining the relative positions of the words and the centers of the target objects according to the distances between the words and the target objects and the number of the words to be displayed on a circular curve which takes the target objects as the centers and the distances as the radiuses in the target video segment;
acquiring the display position of the center of a target object in a video frame;
and determining the display position of the word in the video frame according to the relative position of the word and the center of the target object and the display position of the center of the target object in the video frame.
In some embodiments, the relative position of the word to the center of the target object includes: an included angle between a connecting line of the words and the center of the target object and a reference datum line, wherein the reference datum line is a ray pointing to a reference direction by taking the center of the target object as a starting point, and the included angle is calculated according to the following formula:
α = α_max × (n_0 - n) / (n_0 - 1)

wherein n is the sequence number of the word, α is the included angle corresponding to the n-th word, α_max is the maximum value of the included angle range corresponding to the plurality of words to be displayed on a circular curve centered on the target object with the distance as its radius in the target video segment, and n_0 is the number of words to be displayed on that circular curve.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
In an exemplary embodiment, the present disclosure also provides an electronic device, including: one or more processors; volatile or non-volatile memory for storing one or more processor-executable instructions; wherein the one or more processors are configured to perform the steps performed by the electronic device in the video processing method described above.
In some embodiments, the electronic device is provided as a terminal. Fig. 9 is a block diagram illustrating a structure of a terminal 900 according to an example embodiment. The terminal 900 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
The terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used to store at least one program code for execution by the processor 901 to implement the video processing method provided by the method embodiments in the present disclosure.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral device interface 903 may be implemented on a separate chip or circuit board, which is not limited by the embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, disposed on the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display panel 905 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the terminal 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When the power source 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
A proximity sensor 916, also referred to as a distance sensor, is provided on the front panel of the terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display 905 to switch from the bright screen state to the dark screen state; when the proximity sensor 916 detects that the distance gradually increases, the processor 901 controls the display 905 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
In some embodiments, the electronic device is provided as a server. Fig. 10 is a schematic structural diagram of a server according to an exemplary embodiment, where the server 1000 may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the memory 1002 stores at least one program code, and the at least one program code is loaded and executed by the processors 1001 to implement the methods provided by the method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The server 1000 may be configured to perform the steps performed by the server in the video processing method described above.
In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium, which when program code in the storage medium is executed by a processor of a server, enables the server to perform the steps performed by the server in the above video processing method. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, in which instructions, when executed by a processor of a server, enable the server to perform the steps performed by the server in the above-described video processing method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (24)

1. A video processing method, characterized in that the video processing method comprises:
identifying audio information in a video to obtain text information corresponding to the audio information and the occurrence time of each character in the text information in the video;
in response to a plurality of words in the text information being identical and consecutive, determining a target time period according to the occurrence times, in the video, of the first character and the last character of the plurality of words, the target time period representing the time period during which the plurality of words appear in the video, each word in the text information being composed of at least one character;
and adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence.
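By way of illustration only, the following minimal Python sketch shows the logic of claim 1 after speech recognition has produced words with per-character timestamps; the Word representation and the transcription step it presumes are assumptions for illustration, not part of the claim:

from typing import List, Tuple

# Assumed representation: each word carries the appearance time (in seconds)
# of each of its characters, as produced by a recognizer with character timing.
Word = Tuple[str, List[float]]

def find_repeated_word_periods(words: List[Word]) -> List[Tuple[str, float, float]]:
    # Scan for runs of identical consecutive words; the target time period
    # spans the first character of the run's first word through the last
    # character of the run's last word.
    periods = []
    i = 0
    while i < len(words):
        j = i
        while j + 1 < len(words) and words[j + 1][0] == words[i][0]:
            j += 1
        if j > i:  # at least two identical consecutive words
            periods.append((words[i][0], words[i][1][0], words[j][1][-1]))
        i = j + 1
    return periods

For example, [("go", [1.0, 1.2]), ("go", [1.5, 1.7]), ("go", [2.0, 2.2])] yields [("go", 1.0, 2.2)], the segment in which the word-by-word dynamic effect would be added.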
2. The video processing method according to claim 1, wherein the adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence comprises:
adding a dynamic effect in which the plurality of words appear in sequence in the target video segment in response to the target object being included in the target video segment.
3. The video processing method according to claim 1, wherein the identifying audio information in a video to obtain text information corresponding to the audio information and an occurrence time of each character in the text information in the video comprises:
in response to the video including a target object, identifying the audio information in the video to obtain the text information corresponding to the audio information and the occurrence time of each character in the text information in the video.
4. The video processing method according to claim 1, wherein the adding, in a target video segment corresponding to the target time period in the video, a dynamic effect in which the plurality of words appear in sequence comprises:
determining, from the plurality of words, the words to be displayed in each video frame of the target video segment, wherein the number of words to be displayed in any video frame is not less than the number of words to be displayed in the previous video frame of the any video frame;
respectively determining the display positions corresponding to the words to be displayed in each video frame;
rendering the corresponding words at the determined display positions in each of the video frames.
5. The method of claim 4, wherein the determining, from the plurality of words, the words to be displayed in each video frame of the target video segment comprises:
determining a starting display time of each of the plurality of words;
determining a target video segment corresponding to the target time segment from the video;
and for each video frame in the target video segment, determining the words with the starting display time earlier than or equal to the playing time corresponding to the video frame as the words required to be displayed by the video frame.
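A minimal sketch of the per-frame selection in claim 5, assuming each word's starting display time has already been determined (for instance, as the appearance time of its first character):

from typing import List, Tuple

def words_for_frame(frame_time: float, word_starts: List[Tuple[str, float]]) -> List[str]:
    # A word is displayed once its starting display time has been reached,
    # so the displayed set can only grow from one frame to the next.
    return [word for word, start in word_starts if start <= frame_time]

Because the predicate start <= frame_time is monotone in frame_time, the number of displayed words is non-decreasing across frames, which produces the words-appearing-in-sequence effect.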
6. The method of claim 4, wherein said rendering the corresponding word at the determined display position in each of the video frames comprises:
determining the size of each character in a word according to the distance between the word to be displayed in any video frame and a target object, the number of characters in the word, and the size of the target object;
rendering each character in the term at the determined display position in the video frame according to the size of each character in the term.
7. The video processing method of claim 6, wherein the size of each character in the word is calculated according to the following formula:
(formula shown in image FDA0003895670450000021; not reproduced in this text)
wherein size is the size of each character in the word; k₁ represents a coefficient associated with the size of the target object and the distance between the word and the target object, k₁ being positively correlated with the size of the target object and negatively correlated with the distance; L represents the length of the target object; W represents the width of the target object; n represents the number of characters in the word; and k₂ represents a coefficient associated with the number of characters in the word, k₂ being positively correlated with the number of characters; wherein k₁ is any value greater than 0 and k₂ is any value greater than 0 and less than 1.
8. The method of claim 4, wherein the separately determining the display position corresponding to the word to be displayed in each video frame comprises:
determining display positions of the plurality of words on a first circular curve centered on a target object and having a first distance as a radius, in response to the number of words of the plurality of words being less than or equal to a first number; or,
in response to the number of words of the plurality of words being greater than the first number, determining display positions of the leading first number of words of the plurality of words on the first circular curve, and determining display positions of the remaining words of the plurality of words on a second circular curve centered on the target object and having a second distance as a radius, the second distance being greater than the first distance.
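A minimal sketch of the two-ring layout in claim 8; the even angular spacing is an assumption for illustration, since the claim only fixes which ring each word lies on:

import math
from typing import List, Tuple

def ring_positions(num_words: int, center: Tuple[float, float], r1: float,
                   r2: float, first_number: int) -> List[Tuple[float, float]]:
    # The first `first_number` words go on the inner circle of radius r1
    # around the target object's center; any remaining words go on the
    # outer circle of radius r2 (r2 > r1).
    cx, cy = center
    inner = min(num_words, first_number)
    outer = num_words - inner
    positions = []
    for k in range(inner):
        a = 2 * math.pi * k / max(inner, 1)
        positions.append((cx + r1 * math.cos(a), cy + r1 * math.sin(a)))
    for k in range(outer):
        a = 2 * math.pi * k / outer
        positions.append((cx + r2 * math.cos(a), cy + r2 * math.sin(a)))
    return positions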
9. The video processing method according to claim 4, wherein said separately determining a display position corresponding to a word to be displayed in each video frame comprises:
determining a plurality of consecutive sequence number intervals and the distance corresponding to each sequence number interval, wherein any sequence number interval represents the sequence numbers of the words that can be displayed on a circular curve centered on a target object and having the distance corresponding to that sequence number interval as a radius, any sequence number interval is greater than its preceding sequence number interval, and the distance corresponding to any sequence number interval is greater than the distance corresponding to its preceding sequence number interval;
determining, according to the sequence number of each word in the plurality of words, the distance between each word and the target object as the distance corresponding to the sequence number interval to which the sequence number of the word belongs;
and for each video frame, determining the display position of the word in the video frame according to the display position of the target object in the video frame and the distance between the word needing to be displayed in the video frame and the target object.
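A minimal sketch of the interval-to-distance mapping in claim 9; the concrete interval boundaries and radii are illustrative assumptions:

from bisect import bisect_right
from typing import List

def distance_for_word(seq: int, bounds: List[int], radii: List[float]) -> float:
    # bounds[i] is the first sequence number that no longer fits interval i,
    # so with bounds [8, 20] and radii [r1, r2, r3]: words 0..7 sit on the
    # circle of radius r1, words 8..19 on r2, and words 20+ on r3. Both
    # lists increase, matching the claim: later intervals, larger circles.
    return radii[bisect_right(bounds, seq)]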
10. The method according to claim 9, wherein said determining the display position of the word in the video frame according to the display position of the target object in the video frame and the distance between the word to be displayed in the video frame and the target object comprises:
for any word needing to be displayed in the video frame, determining the relative position of the word and the center of the target object according to the distance between the word and the target object and the number of the words needing to be displayed on a circular curve which takes the target object as the center and the distance between the word and the target object as the radius in the target video segment;
acquiring the display position of the center of the target object in the video frame;
and determining the display position of the word in the video frame according to the relative position of the word and the center of the target object and the display position of the center of the target object in the video frame.
11. The video processing method of claim 10, wherein the relative position of the word to the center of the target object comprises: an included angle between a line connecting the word to the center of the target object and a reference line, the reference line being a ray starting from the center of the target object and pointing in a reference direction, and the included angle being calculated according to the following formula:
(formula shown in image FDA0003895670450000031; not reproduced in this text)
wherein n is the sequence number of the word, α is the included angle corresponding to the n-th word, α_max is the maximum value of the included angle range corresponding to the plurality of words to be displayed, in the target video segment, on the circular curve centered on the target object and having the distance between the word and the target object as a radius, and n₀ is the number of words to be displayed, in the target video segment, on the circular curve centered on the target object and having the distance between the word and the target object as a radius.
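The claimed angle formula itself is only available as an image and is not reproduced above; purely as an assumed reading, the sketch below spreads the n₀ words evenly, i.e. α = α_max · n / n₀, and converts the angle and radius into an on-screen position relative to the target object's center:

import math
from typing import Tuple

def word_position(n: int, n0: int, alpha_max: float, center: Tuple[float, float],
                  radius: float, ref_angle: float = -math.pi / 2) -> Tuple[float, float]:
    # Hypothetical formula (even spacing assumption): the n-th word's line
    # to the center makes angle alpha with the reference ray from the center.
    alpha = alpha_max * n / n0
    theta = ref_angle + alpha
    cx, cy = center
    return (cx + radius * math.cos(theta), cy + radius * math.sin(theta))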
12. A video processing apparatus, characterized in that the video processing apparatus comprises:
the recognition unit is configured to recognize audio information in a video, and obtain text information corresponding to the audio information and occurrence time of each character in the text information in the video;
a determining unit configured to perform, in response to a plurality of words in the text information being identical and consecutive, determining a target time period according to appearance times of a first character and a last character in the plurality of words in the video, the target time period being used for representing the appearance time period of the plurality of words in the video, each word in the text information being composed of at least one character;
an adding unit configured to add a dynamic effect in which the plurality of words appear in sequence in a target video segment corresponding to the target time period in the video.
13. The video processing apparatus according to claim 12, wherein the adding unit is configured to perform adding a dynamic effect in which the plurality of words appear in sequence in the target video segment in response to a target object being included in the target video segment.
14. The apparatus according to claim 12, wherein the identifying unit is configured to perform identifying audio information in the video in response to a target object being included in the video, and obtain text information corresponding to the audio information and a time of occurrence of each character in the text information in the video.
15. The video processing apparatus according to claim 12, wherein said adding unit includes:
a word determining subunit configured to perform determining, from the plurality of words, the words to be displayed in each video frame of the target video segment, wherein the number of words to be displayed in any video frame is not less than the number of words to be displayed in the previous video frame of the any video frame;
the position determining subunit is configured to respectively determine display positions corresponding to the words to be displayed in each video frame;
a rendering subunit configured to render the corresponding words at the determined display positions in each video frame.
16. The video processing apparatus of claim 15, wherein the word determination subunit is configured to perform determining a starting display time of each of the plurality of words; determining a target video segment corresponding to the target time segment from the video; and for each video frame in the target video segment, determining the words with the starting display time earlier than or equal to the playing time corresponding to the video frame as the words required to be displayed by the video frame.
17. The video processing apparatus according to claim 15, wherein the rendering subunit is configured to perform determining a size of each character in the word according to a distance between a word to be displayed in any video frame and a target object, a number of characters in the word, and a size of the target object; rendering each character in the term at the determined display position in the video frame according to the size of each character in the term.
18. The video processing apparatus of claim 17, wherein the size of each character in the word is calculated according to the following formula:
(formula shown in image FDA0003895670450000051; not reproduced in this text)
wherein size is the size of each character in the word; k₁ represents a coefficient associated with the size of the target object and the distance between the word and the target object, k₁ being positively correlated with the size of the target object and negatively correlated with the distance; L represents the length of the target object; W represents the width of the target object; n represents the number of characters in the word; and k₂ represents a coefficient associated with the number of characters in the word, k₂ being positively correlated with the number of characters; wherein k₁ is any value greater than 0 and k₂ is any value greater than 0 and less than 1.
19. The video processing apparatus according to claim 15, wherein the position determining subunit is configured to perform, in response to the number of words of the plurality of words being less than or equal to a first number, determining display positions of the plurality of words on a first circular curve centered on a target object and having a first distance as a radius; or,
the position determination subunit is configured to perform, in response to a number of words of the plurality of words being greater than the first number, determining, on the first circular curve, display positions of a preceding first number of words of the plurality of words, and determining, on a second circular curve centered on the target object and having a second distance as a radius, display positions of remaining words of the plurality of words, the second distance being greater than the first distance.
20. The apparatus according to claim 15, wherein the position determining subunit is configured to perform determining a plurality of consecutive sequence number intervals and the distance corresponding to each sequence number interval, wherein any sequence number interval represents the sequence numbers of the words that can be displayed on a circular curve centered on the target object and having the distance corresponding to that sequence number interval as a radius, any sequence number interval is greater than its preceding sequence number interval, and the distance corresponding to any sequence number interval is greater than the distance corresponding to its preceding sequence number interval;
the position determining subunit is configured to perform determining, according to the sequence number of each of the plurality of words, that the distance between each of the words and the target object is a distance corresponding to a sequence number interval to which the sequence number of each of the words belongs;
the position determining subunit is configured to determine, for each video frame, a display position of a word in the video frame according to a display position of the target object in the video frame and a distance between the word to be displayed in the video frame and the target object.
21. The video processing apparatus of claim 20, wherein the position determining subunit is configured to perform:
for any word needing to be displayed in the video frame, determining the relative position of the word and the center of the target object according to the distance between the word and the target object and the number of words needing to be displayed on a circular curve which takes the target object as the center and the distance between the word and the target object as the radius in the target video segment;
acquiring the display position of the center of the target object in the video frame;
and determining the display position of the word in the video frame according to the relative position of the word and the center of the target object and the display position of the center of the target object in the video frame.
22. The video processing device of claim 21, wherein the relative position of the word to the center of the target object comprises: an included angle between a line connecting the word to the center of the target object and a reference line, the reference line being a ray starting from the center of the target object and pointing in a reference direction, and the included angle being calculated according to the following formula:
(formula shown in image FDA0003895670450000061; not reproduced in this text)
wherein n is the sequence number of the word, α is the included angle corresponding to the n-th word, α_max is the maximum value of the included angle range corresponding to the plurality of words to be displayed, in the target video segment, on the circular curve centered on the target object and having the distance between the word and the target object as a radius, and n₀ is the number of words to be displayed, in the target video segment, on the circular curve centered on the target object and having the distance between the word and the target object as a radius.
23. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a volatile or non-volatile memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the video processing method of any one of claims 1 to 11.
24. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method of any of claims 1 to 11.
CN202110554116.XA 2021-05-20 2021-05-20 Video processing method and device, electronic equipment and storage medium Active CN113301444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110554116.XA CN113301444B (en) 2021-05-20 2021-05-20 Video processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110554116.XA CN113301444B (en) 2021-05-20 2021-05-20 Video processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113301444A CN113301444A (en) 2021-08-24
CN113301444B true CN113301444B (en) 2023-02-17

Family

ID=77323415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110554116.XA Active CN113301444B (en) 2021-05-20 2021-05-20 Video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113301444B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113824899B (en) * 2021-09-18 2022-11-04 北京百度网讯科技有限公司 Video processing method, video processing device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007078985A (en) * 2005-09-13 2007-03-29 Canon Inc Data retrieving device and its control method
CN109788345A (en) * 2019-03-29 2019-05-21 广州虎牙信息科技有限公司 Live-broadcast control method, device, live streaming equipment and readable storage medium storing program for executing
CN110460872A (en) * 2019-09-05 2019-11-15 腾讯科技(深圳)有限公司 Information display method, device, equipment and the storage medium of net cast
CN111526431A (en) * 2020-04-20 2020-08-11 北京甲骨今声科技有限公司 Equipment for adding captions to video and audio programs in real time

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7013273B2 (en) * 2001-03-29 2006-03-14 Matsushita Electric Industrial Co., Ltd. Speech recognition based captioning system

Also Published As

Publication number Publication date
CN113301444A (en) 2021-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant