CN111800650B - Video dubbing method and device, electronic equipment and computer readable medium


Info

Publication number
CN111800650B
Authority
CN
China
Prior art keywords
target video
emotion
information
video
target
Prior art date
Legal status
Active
Application number
CN202010506355.3A
Other languages
Chinese (zh)
Other versions
CN111800650A (en)
Inventor
刘恩雨
李松南
尚焱
刘杉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010506355.3A
Publication of CN111800650A
Application granted
Publication of CN111800650B
Status: Active

Classifications

    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs, involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/233 Processing of audio elementary streams
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/234336 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements, by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text

Abstract

Embodiments of the disclosure provide a video dubbing method and apparatus, an electronic device, and a computer-readable medium, belonging to the field of computer technology. The method includes: acquiring a target video; extracting content from the target video to obtain a content description text of the target video; determining a target audio of the target video according to the content description text; and synthesizing the target audio with the target video. Because the content description text describes the target video across multiple dimensions, the technical solution can accurately locate the key information of the target video and thereby ensure a high degree of adaptation between the obtained target audio and the target video.

Description

Video dubbing method and device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video dubbing method and apparatus, an electronic device, and a computer-readable medium.
Background
Intelligent matching of video and audio, also known as cross-modal video-audio retrieval, refers to automatically finding music that matches a given video. In the related art, specific feature information (e.g., face information or background information) is extracted from the images of a video, the style of the video is determined from that feature information, and a matching audio track is then retrieved from an audio database according to the video style. However, such specific feature information often fails to fully describe the video and ignores its emphasis, so the music assigned to the video matches it poorly and the user experience suffers.
Therefore, a new video dubbing method, apparatus, electronic device and computer readable medium are needed.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
Embodiments of the disclosure provide a video dubbing method and apparatus, an electronic device, and a computer-readable medium, so that audio with a high degree of adaptation can be matched to a video, at least to some extent, thereby improving user experience.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
An embodiment of the disclosure provides a video dubbing method, including: acquiring a target video; extracting content from the target video to obtain a content description text of the target video; determining a target audio of the target video according to the content description text; and synthesizing the target audio with the target video.
An embodiment of the present disclosure provides a video dubbing apparatus, including: a video acquisition module configured to acquire a target video; a content extraction module configured to extract content from the target video to obtain a content description text of the target video; an audio matching module configured to determine a target audio of the target video according to the content description text; and an audio and video synthesis module configured to synthesize the target audio with the target video.
In an exemplary embodiment of the present disclosure, the audio matching module includes an emotion information unit and an audio matching unit. The emotion information unit is configured to determine the emotion information of the target video according to the content description text. The audio matching unit is configured to determine the target audio of the target video according to the emotion information and the content description text.
In an exemplary embodiment of the present disclosure, the emotion information unit includes a first model subunit and an emotion information subunit. The first model subunit is configured to process the content description text through a first deep learning model to obtain an emotion information vector of the target video. The emotion information subunit is configured to determine, as the emotion information of the target video, the labels whose scores in the emotion information vector are greater than a preset score threshold.
In an exemplary embodiment of the present disclosure, the audio matching unit includes an emotion category subunit, a first audio set subunit, and a first audio matching subunit. The emotion category subunit is configured to determine the emotion category of the emotion information, the emotion categories including a first emotion category and a second emotion category. The first audio set subunit is configured to obtain, if the emotion category is the first emotion category, a first music set whose melody tone labels match the emotion information. The first audio matching subunit is configured to determine the target audio of the target video in the first music set.
In an exemplary embodiment of the present disclosure, the audio matching unit further includes a subject information subunit, a subject category subunit, a second audio set subunit, and a second audio matching subunit. The subject information subunit is configured to obtain, if the emotion category is the second emotion category, the subject information of the target video according to the content description text. The subject category subunit is configured to determine the subject category of the subject information, the subject categories including a first subject category and a second subject category. The second audio set subunit is configured to obtain, if the subject information belongs to the first subject category, a second music set whose lyric tags match the subject information. The second audio matching subunit is configured to determine the target audio of the target video in the second music set.
In an exemplary embodiment of the present disclosure, the audio matching unit further includes a behavior information subunit, a third audio set subunit, and a third audio matching subunit. The behavior information subunit is configured to obtain, if the subject information belongs to the second subject category, the behavior information of the target video according to the content description text. The third audio set subunit is configured to obtain a third music set whose rhythm labels match the behavior information. The third audio matching subunit is configured to determine the target audio of the target video in the third music set.
In an exemplary embodiment of the disclosure, the content extraction module is configured to process the target video through a second deep learning model to obtain the content description text of the target video.
In an exemplary embodiment of the present disclosure, the audio and video synthesis module includes an audio duration unit and an audio and video synthesis unit. The audio duration unit is configured to intercept or splice the target audio according to the video duration of the target video. The audio and video synthesis unit is configured to synthesize the target video with the intercepted or spliced target audio.
An embodiment of the present disclosure provides an electronic device, including: at least one processor; and a storage device for storing at least one program which, when executed by the at least one processor, causes the at least one processor to implement the video dubbing method described in the above embodiments.
An embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing the video dubbing method described in the above embodiments.
In the technical solutions provided by some embodiments of the present disclosure, the generated content description text can comprehensively describe the video content of the target video from multiple dimensions. When the target audio of the target video is determined according to the content description text, the key information of the target video can therefore be accurately located from the multi-dimensional information in the content description text, which ensures a high degree of adaptation between the obtained target audio and the target video and improves user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
Fig. 1 illustrates a schematic diagram of an exemplary system architecture 100 to which a video dubbing method or apparatus of an embodiment of the present disclosure may be applied;
Fig. 2 schematically illustrates a flow diagram of a video dubbing method according to one embodiment of the present disclosure;
Fig. 3 is a flowchart of an exemplary embodiment based on step S230 of Fig. 2;
Fig. 4 is a flowchart of an exemplary embodiment based on step S231 of Fig. 3;
Fig. 5 is a flowchart of an exemplary embodiment based on step S232 of Fig. 3;
Fig. 6 is a flowchart of another exemplary embodiment based on step S232 of Fig. 3;
Fig. 7 is a flowchart of yet another exemplary embodiment based on step S232 of Fig. 3;
Fig. 8 is a flowchart of an exemplary embodiment based on step S240 of Fig. 2;
Fig. 9 schematically illustrates a flow diagram of a video dubbing method according to one embodiment of the present disclosure;
Fig. 10 is a flowchart of an exemplary embodiment based on step S940 of Fig. 9;
Fig. 11 schematically illustrates a screenshot of a target video according to the present disclosure;
Fig. 12 schematically illustrates a block diagram of a video dubbing apparatus according to an embodiment of the present disclosure;
Fig. 13 shows a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in at least one hardware module or integrated circuit, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see"; it uses cameras and computers in place of human eyes to recognize, track, and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of capturing information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics, so research in this field involves natural language, i.e., the language people use every day, and is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
In the related art for video dubbing, face information and background information are extracted from the video, emotion is inferred from the face information, the location is inferred from the background information, and music is then selected for the video according to the emotion and the location. However, this approach is too narrow and fails to cover some scenes. For example, when there is no clear face in the video to extract but a strong emotion is still expressed, such as a shot of the receding figure of a firefighter heading toward a fire, the above method does not work. Moreover, because the scheme considers only the face and the background, it may ignore information expressed in other dimensions of the video. For example, when the video shows a woman taking a selfie at the side of a swimming pool and her face is blocked so that no face information can be extracted, the extracted background is the swimming pool and the music is matched to the swimming pool; yet the video actually centers on the behavior of taking a selfie, so the above method causes the music not to match the video.
Among the existing related art for video dubbing, there are also methods that match music to a video according to other kinds of specific feature information. However, whenever specific feature information alone serves as the basis for the soundtrack, information in other dimensions of the video is ignored, so the soundtrack inevitably fails to match the video.
Therefore, a new video dubbing method, apparatus, electronic device and computer readable medium are needed.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which a video dubbing method or apparatus of an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, portable computers, desktop computers, wearable devices, virtual reality devices, smart homes, and so forth.
The server 105 may be a server that provides various services. For example, the terminal device 103 (or the terminal device 101 or 102) uploads a target video to the server 105. The server 105 may acquire the target video; extract content from the target video to obtain a content description text of the target video; determine the target audio of the target video according to the content description text; and synthesize the target audio with the target video. The server 105 then feeds the synthesized target video back to the terminal device 103, and the terminal device 103 can display it to the user so that the user can watch and operate the synthesized target video.
Fig. 2 schematically shows a flow diagram of a video dubbing method according to one embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be executed by any electronic device with computing capability, for example, the server 105 and/or the terminal devices 102 and 103 in the embodiment of fig. 1 described above. In the following embodiments, the server 105 is taken as the execution subject by way of example, but the present disclosure is not limited thereto.
As shown in fig. 2, a video dubbing method provided by the embodiment of the present disclosure may include the following steps.
In step S210, a target video is acquired.
In the embodiment of the present disclosure, the target video may be transmitted by the terminal device 101 (or 102, 103). The target video may be, for example, video material photographed by a user in real time, and may also be, for example, video material processed according to a user operation.
In step S220, content extraction is performed on the target video to obtain a content description text of the target video.
In the embodiment of the disclosure, the content description text is used to describe the video content of the target video and can describe that content comprehensively from multiple dimensions. The content description text is a human-readable text description. For example, the content description text may include at least one of: a subject, a predicate, an object, and a complement. A content description text that includes at least one of a subject, a predicate, an object, and a complement can describe the video content from multiple dimensions, so that no part of the video content is overlooked.
In an exemplary embodiment, the target video may be processed through the second deep learning model, and the content description text of the target video is obtained. Wherein the second deep learning model can be a trained deep learning model. The training samples of the second deep learning model may include video material and content descriptive text annotation information for the video material.
Fig. 11 schematically illustrates a screenshot of a target video according to the present disclosure. As shown in fig. 11, the content description text obtained after content extraction of the target video may be: "Two people walk in the forest." In this example, the subject is "two people", the predicate is "walk", and the object is "in the forest".
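As a minimal illustration only (not part of the claimed embodiments), the multi-dimensional structure of such a content description text could be represented as follows; the Python representation and the field names are assumptions made for clarity.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ContentDescription:
    """Multi-dimensional description of a video, e.g. 'Two people walk in the forest'."""
    subject: str                      # who or what acts, e.g. "two people"
    predicate: str                    # the behavior, e.g. "walk"
    object: Optional[str] = None      # where or on what the behavior acts, e.g. "in the forest"
    complement: Optional[str] = None  # additional descriptive detail, if any

caption = ContentDescription(subject="two people", predicate="walk", object="in the forest")
print(caption)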
In step S230, the target audio of the target video is determined from the content description text.
In the embodiment of the disclosure, because the content description text describes the video content of the target video from multiple dimensions, determining the target audio according to the content description text makes it possible to accurately locate the key information of the target video, which improves the degree of adaptation between the target audio and the target video and improves user experience.
In step S240, the target audio and the target video are synthesized.
According to the video dubbing method provided by the embodiment of the disclosure, the generated content description text can comprehensively describe the video content of the target video from multiple dimensions. When the target audio of the target video is determined according to the content description text, the key information of the target video can therefore be accurately located from the multi-dimensional information in the content description text, which ensures a high degree of adaptation between the obtained target audio and the target video and improves user experience.
Fig. 3 is a flowchart based on step S230 of fig. 2 in an exemplary embodiment.
As shown in fig. 3, step S230 in the above-mentioned embodiment of fig. 2 may further include the following steps.
In step S231, emotion information of the target video is determined from the content description text.
In the embodiment of the disclosure, the content description text can be processed through a deep learning algorithm to obtain the emotion information of the target video. The emotion information may be, for example but not limited to, sadness, worry, tension, happiness, calm, or excitement. Taking the video screenshot shown in fig. 11 as an example, when the content description text is "Two people walk in the forest", the determined emotion information may be, for example: calm.
In an exemplary embodiment, when the target video is a set of images, the information it can express is reduced compared with a video stream, so the content description text obtained from the target video may not be rich. In this case, if determining the emotion information of the target video from the content description text fails, the target audio of the target video may be determined directly from the content description text of the target video.
In step S232, the target audio of the target video is determined according to the emotion information.
In the embodiment of the disclosure, the emotion information can be matched against the labels of candidate audio to determine the target audio of the target video. The label of an audio may, for example, be a preset melody tone label. For instance, a deep melody tone may correspond to emotion information such as sadness or worry; this is merely an example, and the types of melody tones may be set according to the actual situation.
According to the video dubbing method, the emotion information is determined from the content description text, and the target audio of the target video is determined from the emotion information. Accurate emotion information can thus be generated on the basis of the multi-dimensional information expressed by the content description text, which improves the degree of adaptation between the determined target audio and the target video and improves user experience.
Fig. 4 is a flowchart in an exemplary embodiment based on step S231 of fig. 3.
As shown in fig. 4, step S231 in the above-mentioned fig. 3 embodiment may further include the following steps.
In step S2311, the content description text is processed through the first deep learning model to obtain an emotion information vector of the target video.
In an embodiment of the disclosure, the first deep learning model may be a trained deep learning model. The training samples of the first deep learning model may include content description texts and the emotion information labels of those texts. Each dimension of the emotion information vector corresponds to a specific emotion label, and the numerical value of a dimension represents the score with which the current content description text belongs to the emotion corresponding to that dimension.
In step S2312, the tags whose score values are greater than the preset score threshold value in the emotion information vector are determined as the emotion information of the target video.
In the embodiment of the disclosure, at least one label whose score in the emotion information vector is greater than a preset score threshold may be selected as the emotion information of the target video. Preferably, the label whose score is greater than the preset score threshold and is the largest in the emotion information vector is selected as the emotion information of the target video. Alternatively, when all scores in the emotion information vector are less than or equal to the preset score threshold, the label with the largest score may be determined as the emotion information of the target video.
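The selection rule described above can be sketched as follows, assuming the emotion information vector is available as a mapping from emotion labels to scores; the labels, scores, and threshold in the example are illustrative only.

def select_emotion_information(emotion_vector, score_threshold=0.5):
    """Return the emotion label(s) selected from an emotion information vector.

    Labels scoring above the threshold are candidates; preferably only the
    highest-scoring one is kept. If no score exceeds the threshold, fall back
    to the label with the largest score, as described above.
    """
    above = {label: score for label, score in emotion_vector.items() if score > score_threshold}
    if above:
        return [max(above, key=above.get)]
    return [max(emotion_vector, key=emotion_vector.get)]

# Example with hypothetical scores for the caption "Two people walk in the forest".
vector = {"sadness": 0.12, "worry": 0.08, "calm": 0.71, "excited": 0.35}
print(select_emotion_information(vector))  # ['calm']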
According to the video dubbing method, the content description text is processed through the first deep learning model, so the emotion information of the target video can be obtained accurately on the basis of the multi-dimensional information in the content description text, and emotion information with high accuracy can be obtained by judging the scores in the emotion information vector.
Fig. 5 is a flowchart in an exemplary embodiment based on step S232 of fig. 3.
As shown in fig. 5, step S232 in the above-mentioned fig. 3 embodiment may further include the following steps.
In step S2321, emotion categories of emotion information are determined, where the emotion categories include a first emotion category and a second emotion category.
In the embodiment of the present disclosure, the set of emotion information in the first emotion category and the set of emotion information in the second emotion category may be complementary. The emotion information of a given target video can belong only to the first emotion category or to the second emotion category, never to both at the same time. When the emotion information of the target video belongs to the first emotion category, it is information that needs to be focused on in the target video; when it belongs to the second emotion category, it is information that does not need to be focused on. For example, the emotion information in the first emotion category may be negative emotions and the emotion information in the second emotion category may be positive emotions; when the emotion information of the target video is a negative emotion, it is information that needs particular attention. The emotion information in the first emotion category may include, for example: sadness, worry, tension, and urgency. The emotion information in the second emotion category may include, for example: happiness, excitement, cheerfulness, and calm.
In step S2322, if the emotion type is the first emotion type, a first music set with melody tone labels matching with emotion information is obtained.
In the embodiment of the present disclosure, a mapping table between melody tone labels of audio and emotion information may be preset. Preferably, a mapping table between melody tone labels and the emotion information of the first emotion category may be preset. The melody tone label of each audio may also be predetermined. For example, the emotion information mapped to the melody tone label "deep" may be "sadness" (or "worry"), and the emotion information mapped to the melody tone label "loud" may be "tension" (or "urgency"). This is merely an example, and the specific categories of melody tone labels may be determined according to the actual application scenario.
When the first music set whose melody tone labels match the emotion information is obtained, for example, if the emotion information of the target video is "sadness", the melody tone label "deep" is matched, and the audio whose melody tone label is "deep" is gathered into the first music set. Alternatively, the audio set under the directory corresponding to the preset "deep" label may be obtained as the first music set.
In step S2323, the target audio of the target video is determined in the first music set.
In the embodiment of the disclosure, the target audio of the target video may be determined in the first music set according to a random algorithm.
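Steps S2322 and S2323 might be sketched as follows; the mapping table, the audio catalog, and the label values are illustrative assumptions rather than values fixed by the disclosure.

import random

# Preset mapping table from emotion information of the first emotion category
# to melody tone labels (illustrative values).
EMOTION_TO_MELODY_TONE = {"sadness": "deep", "worry": "deep", "tension": "loud", "urgency": "loud"}

# Audio catalog with predetermined melody tone labels (illustrative entries).
AUDIO_CATALOG = [
    {"title": "track_a", "melody_tone": "deep"},
    {"title": "track_b", "melody_tone": "loud"},
    {"title": "track_c", "melody_tone": "deep"},
]

def first_music_set(emotion):
    """Gather all audio whose melody tone label matches the emotion information."""
    tone = EMOTION_TO_MELODY_TONE[emotion]
    return [audio for audio in AUDIO_CATALOG if audio["melody_tone"] == tone]

# Step S2323: determine the target audio in the first music set with a random algorithm.
target_audio = random.choice(first_music_set("sadness"))
print(target_audio["title"])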
According to the video dubbing method, when the emotion information is determined to belong to the first emotion category, i.e., information that needs to be focused on in the target video, the emotion information is matched with the melody tone labels and the target audio is determined from the resulting first music set. A suitable target audio can thus be matched to the target video on the basis of the information that needs attention, which improves the degree of adaptation between the target audio and the target video and improves user experience.
Fig. 6 is a flowchart in an exemplary embodiment based on step S232 of fig. 3.
As shown in fig. 6, step S232 in the embodiment of fig. 3 may further include the following steps.
In step S2324, if the emotion category is the second emotion category, the subject information of the target video is obtained according to the content description text.
In the embodiment of the disclosure, because emotion information of the second emotion category does not need to be focused on in the target video, the subject information of the target video is further obtained from the content description text. In the content description text, the subject information may be, for example, the grammatical subject. Taking fig. 11 as an example, when the content description text is "Two people walk in the forest", the subject information may be the subject of the content description text: "two people". In one embodiment, the subject information may be determined according to the output format of the content description text output by the second deep learning model. Preferably, the subject information may be obtained through semantic analysis of the content description text.
In step S2325, a subject category of the subject information is determined, where the subject category includes a first subject category and a second subject category.
In the embodiment of the present disclosure, the set of subject information in the first subject category and the set of subject information in the second subject category may be complementary. The subject information of a given target video can belong only to the first subject category or to the second subject category, never to both at the same time. When the subject information of the target video belongs to the first subject category, it is information that needs to be focused on in the target video; when it belongs to the second subject category, it is information that does not need to be focused on. For example, the subject information in the first subject category may be special behavior subjects, such as animals (e.g., cats and dogs), elderly people, and babies, while the subject information in the second subject category may be general behavior subjects, such as ordinary people. When the subject information of the target video is a special behavior subject, it is information that needs particular attention.
In step S2326, if the subject information is of the first subject category, a second music set with lyric tags matching the subject information is obtained.
In the embodiment of the present disclosure, a mapping table between lyric tags of audio and subject information may be preset. Preferably, a mapping table between lyric tags and the subject information of the first subject category may be preset. The lyric tag of each audio may also be predetermined. For example, an audio whose lyrics contain "my good baby" may be given the lyric tag "baby", and the subject information mapped to the lyric tag "baby" may be, for example, a baby. This is merely an example, and the specific categories of lyric tags may be determined according to the actual application scenario.
When the second music set whose lyric tags match the subject information is obtained, continuing the above example, if the subject information of the target video is "baby", the lyric tag "baby" is matched, and the audio whose lyric tag is "baby" is gathered into the second music set. Alternatively, the audio set under the directory corresponding to the preset "baby" label may be obtained as the second music set.
In step S2327, the target audio of the target video is determined in the second music set.
In the embodiment of the disclosure, the target audio of the target video may be determined in the second music set according to a random algorithm.
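The lyric-tag branch of steps S2326 and S2327 can be sketched in the same spirit; the keyword list, catalog entries, and tag names are assumptions made for illustration.

import random

# Keywords used to pre-assign lyric tags to audio (illustrative values).
LYRIC_TAG_KEYWORDS = {"baby": ["my good baby", "little baby"]}

def assign_lyric_tags(lyrics):
    """Pre-assign lyric tags to an audio track by scanning its lyrics."""
    text = lyrics.lower()
    return {tag for tag, keywords in LYRIC_TAG_KEYWORDS.items() if any(k in text for k in keywords)}

# Audio catalog with pre-assigned lyric tags (illustrative entries).
AUDIO_CATALOG = [
    {"title": "lullaby_1", "lyric_tags": assign_lyric_tags("Sleep now, my good baby")},
    {"title": "rock_song", "lyric_tags": assign_lyric_tags("Drive all night")},
]

def second_music_set(subject_information):
    """Gather audio whose lyric tags match the subject information, e.g. 'baby'."""
    return [audio for audio in AUDIO_CATALOG if subject_information in audio["lyric_tags"]]

# Step S2327: determine the target audio in the second music set with a random algorithm.
target_audio = random.choice(second_music_set("baby"))
print(target_audio["title"])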
According to the video dubbing method, when the subject information is determined to belong to the first subject category, i.e., information that needs to be focused on in the target video, the subject information is matched with the lyric tags and the target audio is determined from the resulting second music set. A suitable target audio can thus be matched to the target video on the basis of the information that needs attention, which improves the degree of adaptation between the target audio and the target video and improves user experience.
Fig. 7 is a flowchart in an exemplary embodiment based on step S232 of fig. 3.
As shown in fig. 7, step S232 in the embodiment of fig. 3 may further include the following steps.
In step S2328, if the subject information is of the second subject category, the behavior information of the target video is obtained according to the content description text.
In the embodiment of the present disclosure, because subject information of the second subject category does not need to be focused on in the target video, the behavior information of the target video is further obtained from the content description text. In the content description text, the behavior information may be, for example, the predicate. Taking fig. 11 as an example, when the content description text is "Two people walk in the forest", the behavior information may be the predicate of the content description text: "walk". In one embodiment, the behavior information may be determined according to the output format of the content description text output by the second deep learning model. Preferably, the behavior information may be obtained through semantic analysis of the content description text.
In step S2329, a third music set in which the rhythm label matches the behavior information is obtained.
In the embodiment of the present disclosure, a mapping table between rhythm labels of audio and behavior information may be preset. The rhythm label of each audio may also be predetermined; it may be derived from the tempo information of the audio, which may include, for example but not limited to, fast tempo and slow tempo. For example, the behavior information mapped to the rhythm label "fast tempo" may be relatively intense actions such as "playing basketball" or "parkour", while the behavior information mapped to the rhythm label "slow tempo" may be slower, more relaxed actions such as "watching TV", "walking", or "embroidering". This is merely an example, and the specific categories of rhythm labels and the mapped behavior information may be determined according to the actual application scenario.
When the third music set whose rhythm labels match the behavior information is obtained, taking fig. 11 as an example, if the behavior information of the target video is "walk", the rhythm label "slow tempo" is matched, and the audio whose rhythm label is "slow tempo" is gathered into the third music set. Alternatively, the audio set under the directory corresponding to the preset "slow tempo" label may be obtained as the third music set.
In step S2330, the target audio of the target video is determined in the third music set.
In the embodiment of the present disclosure, the target audio of the target video may be determined in the third music set according to a random algorithm.
According to the video dubbing method, when the subject information is determined to belong to the second subject category, i.e., information that does not need to be focused on in the target video, the behavior information is obtained and matched with the rhythm labels, and the target audio is determined from the resulting third music set. A suitable target audio can thus be matched to the target video on the basis of the information that needs attention, which improves the degree of adaptation between the target audio and the target video and improves user experience.
Fig. 8 is a flowchart in an exemplary embodiment based on step S240 of fig. 2.
As shown in fig. 8, step S240 in the above embodiment of fig. 2 may further include the following steps.
In step S241, the target audio is intercepted or spliced according to the video duration of the target video.
In step S242, the target video and the intercepted or spliced target audio are synthesized.
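Steps S241 and S242 can be sketched on raw audio samples as follows; the NumPy representation and the sample rate are assumptions, and a production implementation would normally go through an audio/video processing toolchain rather than raw arrays.

import numpy as np

def fit_audio_to_video(audio_samples, sample_rate, video_duration_s):
    """Trim (intercept) the target audio if it is longer than the video, or
    loop (splice) it until it covers the video duration and then cut to length."""
    needed = int(round(video_duration_s * sample_rate))
    if len(audio_samples) >= needed:
        return audio_samples[:needed]
    repeats = -(-needed // len(audio_samples))  # ceiling division
    return np.tile(audio_samples, repeats)[:needed]

# Example: a 3 s tone fitted to a 7.5 s video (values are illustrative).
sr = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(3 * sr) / sr)
fitted = fit_audio_to_video(tone, sr, 7.5)
print(len(fitted) / sr)  # 7.5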
Fig. 9 schematically shows a flow diagram of a video dubbing method according to one embodiment of the present disclosure.
As shown in fig. 9, the video dubbing method provided by the present embodiment includes the following steps.
In step S910, a content description text is obtained from the target video.
In the embodiment of the disclosure, the target video may be processed through the second deep learning model, so as to obtain the content description text of the target video.
In step S920, the subject information and behavior information of the target video are determined from the content description text.
In step S930, emotion information of the target video is determined from the content description text.
In the embodiment of the disclosure, the content description text can be processed through the first deep learning model to obtain the emotion information vector of the target video, and the labels whose scores in the emotion information vector are greater than a preset score threshold are determined as the emotion information of the target video.
In step S940, logical judgment is performed on the subject information, the behavior information, and the emotion information to obtain a target music tag matching the target video.
In step S950, a target music set is determined according to the target music tag, and a target audio of the target video is determined in the target music set.
In the embodiment of the present disclosure, a plurality of corresponding audio tracks may be registered in advance under each music tag. After the target music tag is determined, the audio set corresponding to the target music tag is taken as the target music set, and the target audio is determined within that set based on a random algorithm.
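One possible arrangement, assuming the audio registered under each music tag is stored in a directory named after that tag; the directory layout and the file extension below are illustrative only.

import random
from pathlib import Path

# Assumed layout: music_library/<music_tag>/<track>.mp3
MUSIC_LIBRARY = Path("music_library")

def pick_target_audio(target_music_tag):
    """Take the audio set under the tag's directory as the target music set
    and determine the target audio within it at random."""
    target_music_set = sorted((MUSIC_LIBRARY / target_music_tag).glob("*.mp3"))
    if not target_music_set:
        raise FileNotFoundError(f"no audio registered under tag {target_music_tag!r}")
    return random.choice(target_music_set)

# Hypothetical usage: pick_target_audio("slow_tempo") might return
# Path("music_library/slow_tempo/track_07.mp3").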
Fig. 10 is a flowchart in an exemplary embodiment based on step S940 of fig. 9.
As shown in fig. 10, step S940 in the embodiment shown in fig. 9 may further include the following steps.
In step S941, the emotion category of the emotion information is determined, the emotion categories including a first emotion category and a second emotion category.
In step S942, if the emotion category is the first emotion category, the melody tone label matching the emotion information is acquired as the target music tag.
In step S943, if the emotion category is the second emotion category, the subject category of the subject information is determined, the subject categories including a first subject category and a second subject category.
In step S944, if the subject information belongs to the first subject category, the lyric tag matching the subject information is acquired as the target music tag.
In step S945, if the subject information belongs to the second subject category, the rhythm label matching the behavior information is acquired as the target music tag.
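Putting the branches of steps S941 to S945 together, a compact sketch of the logical judgment might look as follows; every category set and mapping table below is an illustrative placeholder rather than a value fixed by the disclosure.

# Illustrative category sets and mapping tables (placeholders).
FIRST_EMOTION_CATEGORY = {"sadness", "worry", "tension", "urgency"}
FIRST_SUBJECT_CATEGORY = {"cat", "dog", "elderly person", "baby"}
EMOTION_TO_MELODY_TONE = {"sadness": "deep", "worry": "deep", "tension": "loud", "urgency": "loud"}
BEHAVIOR_TO_RHYTHM = {"walk": "slow_tempo", "watch tv": "slow_tempo", "basketball": "fast_tempo", "parkour": "fast_tempo"}

def target_music_tag(emotion, subject, behavior):
    """Steps S941 to S945: choose the tag by which the target music set is retrieved."""
    if emotion in FIRST_EMOTION_CATEGORY:
        return EMOTION_TO_MELODY_TONE[emotion]  # S942: melody tone label
    if subject in FIRST_SUBJECT_CATEGORY:
        return subject                          # S944: lyric tag matching the subject
    return BEHAVIOR_TO_RHYTHM[behavior]         # S945: rhythm label matching the behavior

# Example from fig. 11: "Two people walk in the forest", emotion "calm".
print(target_music_tag("calm", "two people", "walk"))  # slow_tempo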
According to the video dubbing method, the generated content description text comprehensively describes the video content of the target video from multiple dimensions, and logical judgment over the subject information, behavior information, and emotion information in the content description text accurately locates the key information of the target video, ensuring a high degree of adaptation between the obtained target audio and the target video and improving user experience. When the target audio of the target video is determined according to the content description text, this multi-level logical judgment determines whether the information that needs to be focused on in the target video is the emotion information, the subject information, or the behavior information, and a suitable target audio is then matched to the target video according to that information, which improves the degree of adaptation between the target audio and the target video and the user experience.
The following describes embodiments of the apparatus of the present disclosure, which can be used to execute the video dubbing method of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the video dubbing method described above in the present disclosure.
Fig. 12 schematically illustrates a block diagram of a video dubbing apparatus according to an embodiment of the present disclosure.
Referring to fig. 12, a video dubbing apparatus 1200 according to an embodiment of the present disclosure may include: a video acquisition module 1210, a content extraction module 1220, an audio matching module 1230, and an audio-video synthesis module 1240.
In the video dubbing apparatus, the video acquisition module 1210 may be configured to acquire a target video.
The content extraction module 1220 may be configured to extract content of the target video to obtain a content description text of the target video.
The audio matching module 1230 may be configured to determine the target audio of the target video from the content description text.
The audio video synthesis module 1240 may be configured to synthesize the target audio with the target video.
In an exemplary embodiment, the audio matching module 1230 may include an emotion information unit and an audio matching unit. The emotion information unit may be configured to determine the emotion information of the target video according to the content description text. The audio matching unit may be configured to determine the target audio of the target video according to the emotion information and the content description text.
In an exemplary embodiment, the emotion information unit may include a first model subunit and an emotion information subunit. The first model subunit may be configured to process the content description text through the first deep learning model to obtain the emotion information vector of the target video. The emotion information subunit may be configured to determine the labels whose scores in the emotion information vector are greater than a preset score threshold as the emotion information of the target video.
In an exemplary embodiment, the audio matching unit may include an emotion category subunit, a first audio set subunit, and a first audio matching subunit. The emotion category subunit may be configured to determine the emotion category of the emotion information, the emotion categories including a first emotion category and a second emotion category. The first audio set subunit may be configured to obtain, if the emotion category is the first emotion category, a first music set whose melody tone labels match the emotion information. The first audio matching subunit may be configured to determine the target audio of the target video in the first music set.
In an exemplary embodiment, the audio matching unit may further include a subject information subunit, a subject category subunit, a second audio set subunit, and a second audio matching subunit. The subject information subunit may be configured to obtain, if the emotion category is the second emotion category, the subject information of the target video according to the content description text. The subject category subunit may be configured to determine the subject category of the subject information, the subject categories including a first subject category and a second subject category. The second audio set subunit may be configured to obtain, if the subject information belongs to the first subject category, a second music set whose lyric tags match the subject information. The second audio matching subunit may be configured to determine the target audio of the target video in the second music set.
In an exemplary embodiment, the audio matching unit may further include a behavior information subunit, a third audio set subunit, and a third audio matching subunit. The behavior information subunit may be configured to obtain, if the subject information belongs to the second subject category, the behavior information of the target video according to the content description text. The third audio set subunit may be configured to obtain a third music set whose rhythm labels match the behavior information. The third audio matching subunit may be configured to determine the target audio of the target video in the third music set.
In an exemplary embodiment, the content extraction module 1220 may be configured to process the target video through the second deep learning model to obtain the content description text of the target video.
In an exemplary embodiment, the audio video synthesis module 1240 may include an audio duration unit and an audio video synthesis unit. The audio duration unit may be configured to intercept or splice the target audio according to the video duration of the target video. The audio and video synthesis unit can be configured to synthesize the target video and the intercepted or spliced target audio.
According to the video dubbing apparatus provided by the embodiment of the disclosure, the generated content description text can comprehensively describe the video content of the target video from multiple dimensions. When the target audio of the target video is determined according to the content description text, the key information of the target video can be accurately located from the multi-dimensional information in the content description text, which ensures a high degree of adaptation between the obtained target audio and the target video and improves user experience.
FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure. It should be noted that the electronic device 1300 shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 13, the electronic apparatus 1300 includes a Central Processing Unit (CPU)1301 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for system operation are also stored. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
The following components are connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output portion 1307 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage portion 1308 including a hard disk and the like; and a communication portion 1309 including a network interface card such as a LAN card or a modem. The communication portion 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read therefrom can be installed into the storage portion 1308 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the methods illustrated in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1309 and/or installed from the removable medium 1311. When executed by the Central Processing Unit (CPU) 1301, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having at least one wire, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units and/or sub-units described in the embodiments of the present disclosure may be implemented by software or by hardware, and the described modules and/or units and/or sub-units may also be disposed in a processor. The names of these modules and/or units and/or sub-units do not, in some cases, limit the modules and/or units and/or sub-units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the electronic device may implement the steps shown in FIG. 2, 3, 4, 5, 6, 7, 8, 9, or 10.
It should be noted that although several modules or units or sub-units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units or sub-units described above may be embodied in one module or unit or sub-unit. Conversely, the features and functions of one module or unit or sub-unit described above may be further divided so as to be embodied by a plurality of modules or units or sub-units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (7)

1. A video dubbing method, comprising:
acquiring a target video;
extracting content of the target video through a second deep learning model to obtain a content description text of the target video, wherein the content description text comprises subject information and behavior information, and a subject category of the subject information comprises a first subject category and a second subject category;
determining emotion information of the target video according to the content description text through a first deep learning model, and determining an emotion category of the emotion information, wherein the emotion category comprises a first emotion category and a second emotion category;
if the emotion category is the first emotion category, determining a target audio of the target video from a first music set with melody tone labels matching the first emotion category;
if the emotion category is the second emotion category and the subject category is the first subject category, determining a target audio of the target video from a second music set with lyric tags matching the first subject category;
if the emotion category is the second emotion category and the subject category is the second subject category, determining a target audio of the target video from a third music set with rhythm labels matching the behavior information; and
synthesizing the target audio with the target video.
2. The method of claim 1, wherein determining the emotion information of the target video according to the content description text through the first deep learning model comprises:
processing the content description text through the first deep learning model to obtain an emotion information vector of the target video; and
determining a label whose score in the emotion information vector is greater than a preset score threshold as the emotion information of the target video.
3. The method of claim 1, wherein extracting content of the target video to obtain the content description text of the target video comprises:
processing the target video through the second deep learning model to obtain the content description text of the target video.
4. The method of claim 1, wherein synthesizing the target audio with the target video comprises:
trimming or splicing the target audio according to the video duration of the target video; and
synthesizing the target video with the trimmed or spliced target audio.
5. A video dubbing apparatus comprising:
the video acquisition module is configured to acquire a target video;
the content extraction module is configured to extract content of the target video through a second deep learning model to obtain a content description text of the target video, wherein the content description text comprises subject information and behavior information, and a subject category of the subject information comprises a first subject category and a second subject category;
the audio matching module is configured to determine emotion information of the target video according to the content description text through a first deep learning model, and determine an emotion category of the emotion information, wherein the emotion category comprises a first emotion category and a second emotion category;
if the emotion category is the first emotion category, determine a target audio of the target video from a first music set with melody tone labels matching the first emotion category;
if the emotion category is the second emotion category and the subject category is the first subject category, determine a target audio of the target video from a second music set with lyric tags matching the first subject category;
if the emotion category is the second emotion category and the subject category is the second subject category, determine a target audio of the target video from a third music set with rhythm labels matching the behavior information; and
the audio and video synthesis module is configured to synthesize the target audio with the target video.
6. An electronic device, comprising:
at least one processor;
storage means for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-4.
7. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-4.
CN202010506355.3A 2020-06-05 2020-06-05 Video dubbing method and device, electronic equipment and computer readable medium Active CN111800650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010506355.3A CN111800650B (en) 2020-06-05 2020-06-05 Video dubbing method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506355.3A CN111800650B (en) 2020-06-05 2020-06-05 Video dubbing method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN111800650A CN111800650A (en) 2020-10-20
CN111800650B true CN111800650B (en) 2022-03-25

Family

ID=72802876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506355.3A Active CN111800650B (en) 2020-06-05 2020-06-05 Video dubbing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN111800650B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548B (en) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN114512113B (en) * 2022-04-11 2023-04-04 科大讯飞(苏州)科技有限公司 Audio synthesis method and related method and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9659014B1 (en) * 2013-05-01 2017-05-23 Google Inc. Audio and video matching using a hybrid of fingerprinting and content based classification
CN110188236A (en) * 2019-04-22 2019-08-30 北京达佳互联信息技术有限公司 A kind of recommended method of music, apparatus and system
CN110958386A (en) * 2019-11-12 2020-04-03 北京达佳互联信息技术有限公司 Video synthesis method and device, electronic equipment and computer-readable storage medium
CN111008287A (en) * 2019-12-19 2020-04-14 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130023343A1 (en) * 2011-07-20 2013-01-24 Brian Schmidt Studios, Llc Automatic music selection system
US9519644B2 (en) * 2014-04-04 2016-12-13 Facebook, Inc. Methods and devices for generating media items
CN110830845A (en) * 2018-08-09 2020-02-21 优视科技有限公司 Video generation method and device and terminal equipment
CN109063163B (en) * 2018-08-14 2022-12-02 腾讯科技(深圳)有限公司 Music recommendation method, device, terminal equipment and medium
CN109587554B (en) * 2018-10-29 2021-08-03 百度在线网络技术(北京)有限公司 Video data processing method and device and readable storage medium
CN111198958A (en) * 2018-11-19 2020-05-26 Tcl集团股份有限公司 Method, device and terminal for matching background music
CN109862393B (en) * 2019-03-20 2022-06-14 深圳前海微众银行股份有限公司 Method, system, equipment and storage medium for dubbing music of video file
CN110602550A (en) * 2019-08-09 2019-12-20 咪咕动漫有限公司 Video processing method, electronic equipment and storage medium
CN110704682B (en) * 2019-09-26 2022-03-18 新华智云科技有限公司 Method and system for intelligently recommending background music based on video multidimensional characteristics
CN110971969B (en) * 2019-12-09 2021-09-07 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9659014B1 (en) * 2013-05-01 2017-05-23 Google Inc. Audio and video matching using a hybrid of fingerprinting and content based classification
CN110188236A (en) * 2019-04-22 2019-08-30 北京达佳互联信息技术有限公司 A kind of recommended method of music, apparatus and system
CN110958386A (en) * 2019-11-12 2020-04-03 北京达佳互联信息技术有限公司 Video synthesis method and device, electronic equipment and computer-readable storage medium
CN111008287A (en) * 2019-12-19 2020-04-14 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A dynamic music adding system based on cloud computing";J. Cloud Yu等;《 2015 IEEE International Conference on Consumer Electronics》;20150824;全文 *
"视频背景音乐选配的人工神经网络模型";郄子涵等;《电脑知识与技术》;20170725;全文 *

Also Published As

Publication number Publication date
CN111800650A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
Ye et al. Recognizing american sign language gestures from within continuous videos
EP3579140A1 (en) Method and apparatus for processing video
CN109325148A (en) The method and apparatus for generating information
CN110446063B (en) Video cover generation method and device and electronic equipment
CN109871450B (en) Multi-mode interaction method and system based on textbook reading
CN109117777A (en) The method and apparatus for generating information
CN109697239B (en) Method for generating teletext information
CN112100438A (en) Label extraction method and device and computer readable storage medium
WO2020019591A1 (en) Method and device used for generating information
US20240070397A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN111026861A (en) Text abstract generation method, text abstract training method, text abstract generation device, text abstract training device, text abstract equipment and text abstract training medium
CN109582825B (en) Method and apparatus for generating information
CN107832720B (en) Information processing method and device based on artificial intelligence
CN111800650B (en) Video dubbing method and device, electronic equipment and computer readable medium
CN113392687A (en) Video title generation method and device, computer equipment and storage medium
CN111414506A (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
JP2022075668A (en) Method for processing video, apparatus, device, and storage medium
CN116824278A (en) Image content analysis method, device, equipment and medium
CN114419515A (en) Video processing method, machine learning model training method, related device and equipment
CN112699758A (en) Sign language translation method and device based on dynamic gesture recognition, computer equipment and storage medium
CN113407778A (en) Label identification method and device
CN111931036A (en) Multi-mode fusion interaction system and method, intelligent robot and storage medium
CN116485943A (en) Image generation method, electronic device and storage medium
CN114390368B (en) Live video data processing method and device, equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030891

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant