CN110392273B - Audio and video processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN110392273B
CN110392273B (application CN201910641537.9A)
Authority
CN
China
Prior art keywords
video
audio
unmanned
dubbed
dubbing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910641537.9A
Other languages
Chinese (zh)
Other versions
CN110392273A (en)
Inventor
李美卓
范威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910641537.9A
Publication of CN110392273A
Application granted
Publication of CN110392273B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/475 End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application

Abstract

Embodiments of the present disclosure provide an audio and video processing method and apparatus, an electronic device, and a storage medium. Applied to a server, the method includes: acquiring a dubbing instruction sent by a first electronic device in a virtual space, the first electronic device being an electronic device with live-streaming permission in the virtual space; determining a preset dubbing type corresponding to the dubbing instruction; determining a video to be dubbed; when a dubbing start instruction sent by the first electronic device is acquired, playing, according to the preset dubbing type, the voice-free video corresponding to the video to be dubbed (that is, the video with the human voice removed); and, while the voice-free video is playing, acquiring the dubbing audio recorded for it and sending that audio to second electronic devices, each of which is an electronic device with permission to watch the live stream in the virtual space. With this scheme, users can interact in the virtual space by dubbing, which diversifies the available interaction modes and improves the user experience.

Description

Audio and video processing method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to an audio and video processing method, an audio and video processing device, electronic equipment and a storage medium.
Background
In recent years, live streaming over the network has developed rapidly and become widely popular. In this field, a terminal on which a live-streaming application is installed may be called a user terminal, and a user terminal used to watch an anchor's live stream is a viewer terminal.
During a live stream, an anchor can broadcast in various modes and can also interact with viewers or with other anchors. For example, viewers may chat with the anchor or send the anchor gifts, and anchors may co-stream or battle with one another. In current network live streaming, however, the modes of interaction between anchor and viewers, or between anchors, remain limited.
Disclosure of Invention
In order to overcome the problems in the related art, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for processing an audio/video. The specific technical scheme is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an audio and video processing method applied to a server, the method including:
acquiring a dubbing instruction sent by a first electronic device in a virtual space, wherein the first electronic device is an electronic device with live-streaming permission in the virtual space;
determining a preset dubbing type corresponding to the dubbing instruction;
determining a video to be dubbed;
when a dubbing start instruction sent by the first electronic device is acquired, playing, according to the preset dubbing type, the voice-free video corresponding to the video to be dubbed;
and, in the process of playing the voice-free video, acquiring the dubbing audio corresponding to the voice-free video and sending the dubbing audio to a second electronic device, wherein the second electronic device is an electronic device with permission to watch the live stream in the virtual space.
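The five server-side steps above can be sketched as a minimal session object. This is only an illustration of the claimed flow; every name here (`DubbingSession`, its methods, the list-based viewer sinks) is a hypothetical stand-in, not the disclosed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DubbingSession:
    """Hypothetical sketch of the server-side flow in the first aspect."""
    dub_type: str = ""                            # preset dubbing type (steps 1-2)
    video: str = ""                               # video to be dubbed (step 3)
    viewers: list = field(default_factory=list)   # second electronic devices
    started: bool = False

    def on_dubbing_instruction(self, dub_type: str) -> None:
        # steps 1-2: acquire the dubbing instruction and fix the preset type
        self.dub_type = dub_type

    def on_video(self, video: str) -> None:
        # step 3: determine the video to be dubbed
        self.video = video

    def on_start_instruction(self) -> None:
        # step 4: begin playing the voice-free video per the preset type
        self.started = True

    def on_dubbing_audio(self, chunk: bytes) -> int:
        # step 5: forward the anchor's dubbing audio to every viewer device
        if not self.started:
            return 0
        for sink in self.viewers:
            sink.append(chunk)
        return len(self.viewers)
```

Note that step 5 only forwards audio once the start instruction has arrived, mirroring the claim's ordering of the start instruction before dubbing-audio delivery.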
As one embodiment, the preset dubbing type is an anchor performance type;
the step of playing the voice-free video corresponding to the video to be dubbed according to the preset dubbing type includes:
controlling the first electronic device and the second electronic device to play the voice-free video corresponding to the video to be dubbed simultaneously.
As one embodiment, the preset dubbing type is a multi-anchor battle type;
the step of playing the voice-free video corresponding to the video to be dubbed according to the preset dubbing type includes:
determining a battle order for the first electronic device corresponding to each anchor;
and controlling each first electronic device and its corresponding second electronic devices to play the voice-free video corresponding to the video to be dubbed in turn, according to the battle order.
As one embodiment, the preset dubbing type is a multi-person dubbing type;
the step of playing the voice-free video corresponding to the video to be dubbed according to the preset dubbing type includes:
controlling the second electronic devices corresponding to the users in the instant messaging area of the virtual space to play the voice-free video corresponding to the video to be dubbed simultaneously.
As an implementation, the step of controlling the second electronic devices corresponding to the users in the instant messaging area of the virtual space to play the voice-free video corresponding to the video to be dubbed simultaneously includes:
when a broadcast message sent by the first electronic device is acquired, sending the video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area of the virtual space, so that each second electronic device, upon receiving the start instruction, plays the voice-free video corresponding to the video to be dubbed simultaneously.
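One plausible way to realize "play simultaneously upon receiving the start instruction" is for the server to broadcast a shared wall-clock start time along with the video identifier. The sketch below makes that assumption; the message fields and function name are invented for illustration and are not specified by the disclosure.

```python
import json
import time

def broadcast_start(senders, video_id, lead_seconds=2.0):
    """Send every viewer device the video id plus a common start time,
    so all second electronic devices can begin playback together."""
    start_at = time.time() + lead_seconds   # shared wall-clock start moment
    message = json.dumps({"type": "dub_start",
                          "video": video_id,
                          "start_at": start_at})
    for send in senders:   # each sender delivers to one second electronic device
        send(message)
    return start_at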
As an embodiment, the step of determining the video to be dubbed includes:
acquiring a video uploaded by the first electronic device;
and determining the uploaded video as the video to be dubbed.
As an implementation, the voice-free video is acquired as follows:
determining the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
inputting the amplitude spectrum into a pre-trained network model to obtain the human-voice mask matrix corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding human-voice mask matrices and encodes the correspondence between amplitude spectra and human-voice mask matrices;
computing a voice-free amplitude spectrum from the human-voice mask matrix and the amplitude spectrum;
and determining the voice-free video corresponding to the video to be dubbed based on the voice-free amplitude spectrum.
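The mask-based steps above amount to: take the magnitude of a short-time Fourier transform of the audio, let a model predict a per-bin human-voice mask, attenuate those bins, and resynthesize with the mixture's original phase. A minimal NumPy sketch follows; the trained network model is replaced by a caller-supplied placeholder (`vocal_mask_fn`), and the frame/hop sizes are arbitrary illustrative choices.

```python
import numpy as np

FRAME, HOP = 512, 256

def stft(x):
    # Hann-windowed short-time Fourier transform (analysis)
    win = np.hanning(FRAME)
    frames = range(1 + (len(x) - FRAME) // HOP)
    return np.stack([np.fft.rfft(win * x[i * HOP:i * HOP + FRAME]) for i in frames])

def istft(Z):
    # weighted overlap-add synthesis, normalized by the summed window energy
    win = np.hanning(FRAME)
    out = np.zeros((len(Z) - 1) * HOP + FRAME)
    norm = np.zeros_like(out)
    for i, spec in enumerate(Z):
        out[i * HOP:i * HOP + FRAME] += win * np.fft.irfft(spec, FRAME)
        norm[i * HOP:i * HOP + FRAME] += win ** 2
    return out / np.maximum(norm, 1e-8)

def remove_vocals(x, vocal_mask_fn):
    Z = stft(x)
    mag, phase = np.abs(Z), np.angle(Z)
    mask = vocal_mask_fn(mag)              # stand-in for the pre-trained network model
    voice_free_mag = (1.0 - mask) * mag    # the "voice-free amplitude spectrum"
    return istft(voice_free_mag * np.exp(1j * phase))
```

With a mask of all zeros (no voice detected) the pipeline reconstructs the input; with a mask of all ones it silences everything, which brackets the behavior a real learned mask would interpolate between.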
As an implementation, the voice-free video may instead be acquired as follows:
determining the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
inputting the amplitude spectrum into a pre-trained network model to obtain the voice-free audio corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding voice-free audio and encodes the correspondence between amplitude spectra and voice-free audio;
and determining the voice-free video corresponding to the video to be dubbed based on the voice-free audio.
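This second acquisition method has the model regress the voice-free spectrum directly rather than predicting a mask. A sketch of that variant, assuming SciPy is available; `predict_voice_free_mag` stands in for the pre-trained network model, and phase is borrowed from the mixture as a simplifying assumption.

```python
import numpy as np
from scipy.signal import stft, istft

def voice_free_audio(x, predict_voice_free_mag, nperseg=512):
    """Regress the voice-free amplitude spectrum directly, pair it with
    the mixture's phase, and invert back to a waveform."""
    _, _, Z = stft(x, nperseg=nperseg)
    free_mag = predict_voice_free_mag(np.abs(Z))   # network model's output
    _, y = istft(free_mag * np.exp(1j * np.angle(Z)), nperseg=nperseg)
    return y[:len(x)]
```

An identity predictor returns the mixture unchanged, which is a convenient sanity check before plugging in a real model.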
According to a second aspect of embodiments of the present disclosure, there is provided an audio and video processing method applied to a first electronic device, the first electronic device being an electronic device with live-streaming permission in a virtual space, the method including:
acquiring a dubbing instruction in the virtual space;
determining a preset dubbing type corresponding to the dubbing instruction;
determining a video to be dubbed;
when a dubbing start instruction is acquired, playing, according to the preset dubbing type, the voice-free video corresponding to the video to be dubbed;
and, in the process of playing the voice-free video, acquiring the dubbing audio corresponding to the voice-free video and sending the dubbing audio to a server.
As one embodiment, the preset dubbing type is an anchor performance type;
the step of playing the voice-free video corresponding to the video to be dubbed according to the preset dubbing type includes:
playing the voice-free video corresponding to the video to be dubbed, and controlling a second electronic device to play that voice-free video at the same time, wherein the second electronic device is an electronic device with permission to watch the live stream in the virtual space.
As one embodiment, the preset dubbing type is a multi-anchor battle type;
the step of playing the voice-free video corresponding to the video to be dubbed according to the preset dubbing type includes:
determining a battle order for the first electronic device corresponding to each anchor;
and controlling each first electronic device and its corresponding second electronic devices to play the voice-free video corresponding to the video to be dubbed in turn, according to the battle order.
As one embodiment, the preset dubbing type is a multi-person dubbing type;
the step of playing the voice-free video corresponding to the video to be dubbed according to the preset dubbing type includes:
controlling the second electronic devices corresponding to the users in the instant messaging area of the virtual space to play the voice-free video corresponding to the video to be dubbed simultaneously.
As an implementation, the step of controlling the second electronic devices corresponding to the users in the instant messaging area of the virtual space to play the voice-free video corresponding to the video to be dubbed simultaneously includes:
sending a broadcast message to the server, so that the server sends the video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area of the virtual space, each second electronic device playing the voice-free video corresponding to the video to be dubbed simultaneously upon receiving the start instruction.
As an embodiment, the step of determining the audio/video to be dubbed includes:
acquiring a video uploaded by a user;
and determining the uploaded video as the video to be dubbed.
As an implementation, the voice-free video is acquired as follows:
determining the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
inputting the amplitude spectrum into a pre-trained network model to obtain the human-voice mask matrix corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding human-voice mask matrices and encodes the correspondence between amplitude spectra and human-voice mask matrices;
computing a voice-free amplitude spectrum from the human-voice mask matrix and the amplitude spectrum;
and determining the voice-free video corresponding to the video to be dubbed based on the voice-free amplitude spectrum.
As an implementation, the voice-free video may instead be acquired as follows:
determining the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
inputting the amplitude spectrum into a pre-trained network model to obtain the voice-free audio corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding voice-free audio and encodes the correspondence between amplitude spectra and voice-free audio;
and determining the voice-free video corresponding to the video to be dubbed based on the voice-free audio.
According to a third aspect of embodiments of the present disclosure, there is provided an audio and video processing method applied to a second electronic device, the second electronic device being an electronic device with permission to watch the live stream in the virtual space, the method including:
when a dubbing start instruction in the virtual space is acquired, playing the voice-free video corresponding to the pre-acquired video to be dubbed;
and, in the process of playing the voice-free video, playing the dubbing audio when the dubbing audio corresponding to the voice-free video is acquired.
As an implementation, the step of playing the voice-free video corresponding to the pre-acquired video to be dubbed when a dubbing start instruction in the virtual space is acquired includes:
when the video to be dubbed and a start instruction in the virtual space sent by the server are received, playing the received voice-free video corresponding to the video to be dubbed.
As an implementation, the voice-free video is acquired as follows:
determining the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
inputting the amplitude spectrum into a pre-trained network model to obtain the human-voice mask matrix corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding human-voice mask matrices and encodes the correspondence between amplitude spectra and human-voice mask matrices;
computing a voice-free amplitude spectrum from the human-voice mask matrix and the amplitude spectrum;
and determining the voice-free video corresponding to the video to be dubbed based on the voice-free amplitude spectrum.
As an implementation, the voice-free video may instead be acquired as follows:
determining the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
inputting the amplitude spectrum into a pre-trained network model to obtain the voice-free audio corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding voice-free audio and encodes the correspondence between amplitude spectra and voice-free audio;
and determining the voice-free video corresponding to the video to be dubbed based on the voice-free audio.
According to a fourth aspect of embodiments of the present disclosure, there is provided an audio and video processing apparatus applied to a server, the apparatus including:
a first dubbing instruction acquisition module configured to acquire a dubbing instruction sent by a first electronic device in a virtual space, wherein the first electronic device is an electronic device with live-streaming permission in the virtual space;
a first preset dubbing type determining module configured to determine the preset dubbing type corresponding to the dubbing instruction;
a first to-be-dubbed video determining module configured to determine the video to be dubbed;
a first voice-free video playing module configured to play, according to the preset dubbing type, the voice-free video corresponding to the video to be dubbed when the dubbing start instruction sent by the first electronic device is acquired;
and a first dubbing audio sending module configured to acquire, in the process of playing the voice-free video, the dubbing audio corresponding to the voice-free video and send the dubbing audio to a second electronic device, wherein the second electronic device is an electronic device with permission to watch the live stream in the virtual space.
As one embodiment, the preset dubbing type is an anchor performance type;
the first voice-free video playing module includes:
a first voice-free video playing submodule configured to control the first electronic device and the second electronic device to play the voice-free video corresponding to the video to be dubbed simultaneously.
As one embodiment, the preset dubbing type is a multi-anchor battle type;
the first voice-free video playing module includes:
a battle order determining submodule configured to determine the battle order for the first electronic device corresponding to each anchor;
and a second voice-free video playing submodule configured to control each first electronic device and its corresponding second electronic devices to play the voice-free video corresponding to the video to be dubbed in turn, according to the battle order.
As one embodiment, the preset dubbing type is a multi-person dubbing type;
the first voice-free video playing module includes:
a third voice-free video playing submodule configured to control the second electronic devices corresponding to the users in the instant messaging area of the virtual space to play the voice-free video corresponding to the video to be dubbed simultaneously.
As an embodiment, the third voice-free video playing submodule includes:
a first voice-free video playing unit configured to send, when the broadcast message sent by the first electronic device is acquired, the video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area of the virtual space, so that each second electronic device plays the voice-free video corresponding to the video to be dubbed simultaneously upon receiving the start instruction.
As an implementation, the first to-be-dubbed video determining module includes:
a first video acquisition submodule configured to acquire the video uploaded by the first electronic device;
and a first to-be-dubbed video determining submodule configured to determine the uploaded video as the video to be dubbed.
As an implementation, the audio and video processing apparatus further includes a first voice-free video determining module;
the first voice-free video determining module includes:
a first amplitude spectrum determining submodule configured to determine the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
a first human-voice mask matrix determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain the human-voice mask matrix corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding human-voice mask matrices and encodes the correspondence between amplitude spectra and human-voice mask matrices;
a first voice-free amplitude spectrum determining submodule configured to compute a voice-free amplitude spectrum from the human-voice mask matrix and the amplitude spectrum;
and a first voice-free video determining submodule configured to determine the voice-free video corresponding to the video to be dubbed based on the voice-free amplitude spectrum.
As an implementation, the audio and video processing apparatus further includes a second voice-free video determining module;
the second voice-free video determining module includes:
a second amplitude spectrum determining submodule configured to determine the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
a first voice-free audio determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain the voice-free audio corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding voice-free audio and encodes the correspondence between amplitude spectra and voice-free audio;
and a second voice-free video determining submodule configured to determine the voice-free video corresponding to the video to be dubbed based on the voice-free audio.
According to a fifth aspect of embodiments of the present disclosure, there is provided an audio and video processing apparatus applied to a first electronic device, the first electronic device being an electronic device with live-streaming permission in a virtual space, the apparatus including:
a second dubbing instruction acquisition module configured to acquire a dubbing instruction in the virtual space;
a second preset dubbing type determining module configured to determine the preset dubbing type corresponding to the dubbing instruction;
a second to-be-dubbed video determining module configured to determine the video to be dubbed;
a second voice-free video playing module configured to play, according to the preset dubbing type, the voice-free video corresponding to the video to be dubbed when the dubbing start instruction is acquired;
and a second dubbing audio sending module configured to acquire, in the process of playing the voice-free video, the dubbing audio corresponding to the voice-free video and send the dubbing audio to a server.
As one embodiment, the preset dubbing type is an anchor performance type;
the second voice-free video playing module includes:
a fourth voice-free video playing submodule configured to play the voice-free video corresponding to the video to be dubbed and to control a second electronic device to play that voice-free video at the same time, wherein the second electronic device is an electronic device with permission to watch the live stream in the virtual space.
As one embodiment, the preset dubbing type is a multi-anchor battle type;
the second voice-free video playing module includes:
a battle order determining submodule configured to determine the battle order for the first electronic device corresponding to each anchor;
and a fifth voice-free video playing submodule configured to control each first electronic device and its corresponding second electronic devices to play the voice-free video corresponding to the video to be dubbed in turn, according to the battle order.
As one embodiment, the preset dubbing type is a multi-person dubbing type;
the second voice-free video playing module includes:
a sixth voice-free video playing submodule configured to control the second electronic devices corresponding to the users in the instant messaging area of the virtual space to play the voice-free video corresponding to the video to be dubbed simultaneously.
As an embodiment, the sixth voice-free video playing submodule includes:
a second voice-free video playing unit configured to send a broadcast message to the server, so that the server sends the video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area of the virtual space, each second electronic device playing the voice-free video corresponding to the video to be dubbed simultaneously upon receiving the start instruction.
As an implementation, the second to-be-dubbed video determining module includes:
a second video acquisition submodule configured to acquire the video uploaded by the user;
and a second to-be-dubbed video determining submodule configured to determine the uploaded video as the video to be dubbed.
As an implementation, the audio and video processing apparatus further includes a third voice-free video determining module;
the third voice-free video determining module includes:
a third amplitude spectrum determining submodule configured to determine the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
a second human-voice mask matrix determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain the human-voice mask matrix corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding human-voice mask matrices and encodes the correspondence between amplitude spectra and human-voice mask matrices;
a second voice-free amplitude spectrum determining submodule configured to compute a voice-free amplitude spectrum from the human-voice mask matrix and the amplitude spectrum;
and a third voice-free video determining submodule configured to determine the voice-free video corresponding to the video to be dubbed based on the voice-free amplitude spectrum.
As an implementation, the audio and video processing apparatus further includes a fourth voice-free video determining module;
the fourth voice-free video determining module includes:
a fourth amplitude spectrum determining submodule configured to determine the amplitude spectrum corresponding to the audio signal of the video to be dubbed;
a second voice-free audio determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain the voice-free audio corresponding to the video to be dubbed, wherein the network model is trained on pre-acquired amplitude spectrum samples and their corresponding voice-free audio and encodes the correspondence between amplitude spectra and voice-free audio;
and a fourth voice-free video determining submodule configured to determine the voice-free video corresponding to the video to be dubbed based on the voice-free audio.
According to a sixth aspect of embodiments of the present disclosure, there is provided an audio and video processing apparatus applied to a second electronic device, the second electronic device being an electronic device with permission to watch the live stream in the virtual space, the apparatus including:
a third voice-free video playing module configured to play the voice-free video corresponding to the pre-acquired video to be dubbed when the dubbing start instruction in the virtual space is acquired;
and a dubbing audio playing module configured to play the dubbing audio when the dubbing audio corresponding to the voice-free video is acquired in the process of playing the voice-free video.
As an implementation, the third voice-free video playing module includes:
a seventh voice-free video playing submodule configured to play, when the video to be dubbed and a start instruction in the virtual space sent by the server are received, the received voice-free video corresponding to the video to be dubbed.
As one implementation manner, the audio/video processing device further comprises a fifth unmanned audio/video determining module;
the fifth unmanned audio video determining module comprises:
a fifth amplitude spectrum determining sub-module configured to perform determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed;
the third voice mask matrix determining submodule is configured to perform input of the amplitude spectrum into a pre-trained network model to obtain a voice mask matrix corresponding to the video to be dubbed, wherein the network model is obtained by training based on a pre-obtained amplitude spectrum sample and a corresponding voice mask matrix thereof, and the network model comprises a corresponding relation between the amplitude spectrum and the voice mask matrix;
a third unmanned audio amplitude spectrum determining sub-module configured to perform calculating an unmanned audio amplitude spectrum by using the voice mask matrix and the amplitude spectrum;
and the fifth unmanned audio video determining sub-module is configured to perform determining the unmanned audio video corresponding to the video to be dubbed based on the unmanned audio amplitude spectrum.
As one implementation manner, the audio/video processing device further comprises a fifth unmanned audio/video determining module;
the fifth unmanned audio video determining module comprises:
a sixth amplitude spectrum determining submodule configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video to be dubbed;
the third unmanned audio determining sub-module is configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain the unmanned audio corresponding to the video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and its corresponding unmanned audio, and the network model comprises a corresponding relation between the amplitude spectrum and the unmanned audio;
and the sixth unmanned audio video determining sub-module is configured to perform determining the unmanned audio video corresponding to the video to be dubbed based on the unmanned audio.
According to a seventh aspect of embodiments of the present disclosure, there is provided a server comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to execute the instructions to implement the audio/video processing method according to the first aspect.
According to an eighth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio/video processing method according to the second aspect or the third aspect.
According to a ninth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio/video processing method according to any one of the above aspects.
In the scheme provided by the embodiments of the present disclosure, the server can acquire a dubbing instruction sent by the first electronic device in the virtual space, determine the preset dubbing type corresponding to the dubbing instruction, and then determine the video to be dubbed. Further, when acquiring a dubbing start instruction sent by the first electronic device, the server plays the unmanned audio video corresponding to the video to be dubbed according to the preset dubbing type, and in the process of playing the unmanned audio video, acquires the dubbing audio corresponding to the unmanned audio video and sends it to the second electronic device. The first electronic device is an electronic device having the right to broadcast live in the virtual space, and the second electronic device is an electronic device having the right to watch the live broadcast in the virtual space. By adopting this scheme, users can interact in the virtual space by way of dubbing, which increases the diversity of interaction modes and improves the user experience. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flowchart illustrating a first audio video processing method according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a dubbing button shown according to an example embodiment;
FIG. 3 is a first flowchart of step S104 in the embodiment of FIG. 1, shown in accordance with an exemplary embodiment;
FIG. 4 is a first flowchart illustrating a manner of acquiring an unmanned audio video according to an exemplary embodiment;
FIG. 5 is a second flowchart illustrating a method of acquiring an unmanned audio video according to an exemplary embodiment;
FIG. 6 is a flowchart illustrating a second audio video processing method according to an exemplary embodiment;
FIG. 7 is a flowchart illustrating a third audio video processing method according to an exemplary embodiment;
FIG. 8 is a block diagram showing the structure of a first audio-video processing apparatus according to an exemplary embodiment;
FIG. 9 is a block diagram showing the structure of a second audio-video processing apparatus according to an exemplary embodiment;
FIG. 10 is a block diagram showing the structure of a third audio-video processing apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram showing the structure of an electronic device according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In order to enrich the interaction modes in the virtual space and improve the user experience, the embodiment of the disclosure provides an audio and video processing method, an audio and video processing device, a server, electronic equipment and a computer readable storage medium.
The following first describes a first audio/video processing method provided by an embodiment of the present disclosure. The first audio and video processing method provided by the embodiment of the disclosure can be applied to a server of a live broadcast application program.
As shown in fig. 1, a processing method of audio and video is applied to a server, and the method includes:
in step S101, a dubbing instruction sent by a first electronic device in a virtual space is obtained;
the first electronic device is an electronic device with live broadcast authority in the virtual space.
In step S102, determining a preset dubbing type corresponding to the dubbing instruction;
in step S103, determining an audio/video to be dubbed;
in step S104, when a dubbing start instruction sent by the first electronic device is obtained, playing an unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type;
in step S105, in the process of playing the unmanned audio-video, the dubbing audio corresponding to the unmanned audio-video is obtained, and the dubbing audio is sent to the second electronic device.
The second electronic device is an electronic device with the right to watch live broadcast in the virtual space.
It can be seen that, in the scheme provided by the embodiment of the present disclosure, the server may obtain a dubbing instruction sent by the first electronic device in the virtual space, determine the preset dubbing type corresponding to the dubbing instruction, and then determine the video to be dubbed. Further, when obtaining a dubbing start instruction sent by the first electronic device, the server plays the unmanned audio video corresponding to the video to be dubbed according to the preset dubbing type, and in the process of playing the unmanned audio video, obtains the dubbing audio corresponding to the unmanned audio video and sends it to the second electronic device. The first electronic device is an electronic device having the right to broadcast live in the virtual space, and the second electronic device is an electronic device having the right to watch the live broadcast in the virtual space. By adopting this scheme, users can interact in the virtual space by way of dubbing, which increases the diversity of interaction modes and improves the user experience.
The first electronic device is an electronic device with a live broadcast authority in the virtual space, and the host can utilize the first electronic device to conduct live broadcast. In the live broadcasting process of the anchor, interaction with the audience or other anchors can be performed in a dubbing mode, and at the moment, the anchor can send out a dubbing instruction. To facilitate user operation, a user interface may be provided in the live interface of the first electronic device, e.g., as shown in fig. 2, the live interface of the first electronic device may display a "play dubbing" button 201, which button 201 may be clicked by the host to issue a dubbing instruction.
In step S101, the server may obtain a dubbing instruction sent by the first electronic device in the virtual space, where the dubbing instruction indicates that the host needs to interact with the audience or other hosts in a dubbing manner. Since there are multiple dubbing manners in the virtual space, the server may determine the preset dubbing type corresponding to the obtained dubbing instruction at this time, that is, execute step S102.
In one embodiment, different user interfaces may be provided in the live interface of the first electronic device, each corresponding to a different preset dubbing type. The preset dubbing type corresponding to a dubbing instruction can then be determined as the one corresponding to the user interface through which the user issued the instruction.
The preset dubbing types can be set according to user requirements: for example, a single anchor may dub alone, multiple anchors may hold a dubbing fight, or the anchor and the audience may cooperate to complete a piece of dubbing together, which is not specifically limited herein.
After acquiring the dubbing instruction sent by the first electronic device in the virtual space, the server may execute step S103 described above, that is, determine the video to be dubbed. In order to facilitate the user in selecting a video suited to their needs as the video to be dubbed, the live interface of the first electronic device may display a video selection panel, which may include videos downloaded by the anchor, videos popular on the network, videos recommended as suitable for the user, and the like, which is not specifically limited herein. The anchor can select one of the videos, and the server can then determine that video as the video to be dubbed.
In order to facilitate users to be familiar with the content of the video to be dubbed, so that the dubbing effect is better, the first electronic equipment can play the video to be dubbed for watching by a host, and meanwhile, the server can control each second electronic equipment to synchronously play the video to be dubbed for watching by each audience. The second electronic device is an electronic device with the right to watch live broadcast in the virtual space, and a spectator can watch live broadcast of the host by using the second electronic device.
Next, when the dubbing start instruction sent by the first electronic device is obtained, it is indicated that the user needs to start dubbing, and then the unmanned audio and video corresponding to the audio and video to be dubbed can be played according to the preset dubbing type. For convenience of user operation, a corresponding user interface may be provided in the live interface of the first electronic device, for example, the live interface of the first electronic device may display a "start dubbing" button, and the host may issue a dubbing start instruction by clicking the button.
When a dubbing start instruction sent by the first electronic device is obtained, the server can control the first electronic device, the second electronic devices, and the first electronic devices used by other anchors to start playing the unmanned audio video corresponding to the video to be dubbed according to the preset dubbing type. An unmanned audio video is a video in which the human voice has been removed and only the background music is retained.
As one implementation manner, the unmanned audio video may be pre-stored in the server or locally on the electronic device used by each user. When it is stored in the server, the server may send the unmanned audio video to the electronic device used by each user, so that each such device can play it. As another implementation manner, after determining the video to be dubbed, the server may process the video to be dubbed to obtain the corresponding unmanned audio video for later use. Both approaches are reasonable.
In the process of playing the unmanned audio-video, the server may acquire the dubbing audio corresponding to the unmanned audio-video, and simultaneously send the dubbing audio to the second electronic device, that is, execute the step S105, so that the audience can watch the dubbing performance. In the process of playing the unmanned audio and video, the anchor and/or audience and/or other anchor can send out audio signals to carry out dubbing of roles in the video, and at the moment, the corresponding user side electronic equipment can collect the audio signals sent by the user, namely dubbing audio, and then send the audio signals to the server.
The server can also receive the dubbing audio sent by each user side electronic device, and then send the dubbing audio to each second electronic device. At this time, each second electronic device is playing the unmanned audio-video corresponding to the audio-video to be dubbed, so that the dubbed audio and the unmanned audio-video are played together, and the spectator can watch the dubbing performance.
As an implementation of the disclosed embodiment, the preset dubbing type may be a hosting performance type. That is, during dubbing, only one of the anchor is dubbed and the audience views the anchor's dubbing performance.
For the case that the preset dubbing type is the hosting performance type, the step of playing the unmanned audio video corresponding to the audio/video to be dubbed according to the preset dubbing type may include:
and controlling the first electronic equipment and the second electronic equipment to play the unmanned sound video corresponding to the audio/video to be dubbed at the same time.
Because in this case the anchor performs the dubbing and the audience watches the anchor's dubbing performance, the server can control the first electronic device and the corresponding second electronic devices to play the unmanned audio video corresponding to the video to be dubbed at the same time. Thus, when the anchor dubs, the server sends the dubbing audio to the second electronic devices, and each second electronic device can play the dubbing audio while playing the unmanned audio video, so that the audience can watch the anchor's dubbing performance.
Therefore, in the embodiment, the host can perform dubbing performance to interact with the audience, so that interactivity and interestingness of the virtual space can be enhanced, and user experience is improved.
As an implementation of the embodiment of the present disclosure, the preset dubbing type may be a multi-anchor fight type. That is, multiple anchors can each dub the video to be dubbed, and the audience can watch the dubbing fight performance among the multiple anchors.
In one embodiment, a "play dubbing" button may be displayed in a secondary menu of the fight function in the live interface of the first electronic device, and the anchor clicking the "play dubbing" button indicates that the anchor wants to perform multi-anchor fight dubbing.
The server can match the first electronic devices corresponding to the anchors who currently select multi-anchor fight dubbing as the first electronic devices that will participate in the fight dubbing. The video to be dubbed may be selected by any one of the anchors, or may be determined according to other rules; for example, the video may be chosen by the anchor with the fewest viewers in the virtual space, so as to increase that anchor's popularity.
For the case that the preset dubbing type is the multi-anchor fight type, as shown in fig. 3, the step of playing the unmanned audio video corresponding to the video to be dubbed according to the preset dubbing type may include:
S301, determining a fight sequence corresponding to first electronic equipment corresponding to each anchor;
Because multiple anchors need to perform a dubbing fight, and in order to ensure the audience's viewing experience of the dubbing fight, each anchor needs to give a dubbing performance one by one. The server can therefore determine the fight sequence corresponding to the first electronic device of each anchor.
In one embodiment, the server may randomly determine the fight sequence corresponding to each first electronic device and inform each first electronic device of the sequence. In another embodiment, one of the anchors may determine the fight sequence corresponding to each first electronic device. In yet another embodiment, it is also reasonable for the anchors to agree on the fight sequence by linking mics (co-hosting).
S302, controlling the first electronic equipment and the corresponding second electronic equipment to sequentially play the unmanned audio-video corresponding to the audio-video to be dubbed according to the fight sequence.
After the above-mentioned fight sequence is determined, each anchor can start dubbing fight, that is, dubbing is performed for the video to be dubbed from the first anchor according to the fight sequence until the dubbing of the last anchor is completed. In the process, the server can control each first electronic device and the corresponding second electronic device to sequentially play the unmanned audio and video corresponding to the audio and video to be dubbed, the anchor can dubbed, and the audience can watch the dubbing performance of each anchor.
When each anchor dubs, the corresponding first electronic equipment can collect the voice signal sent by the anchor and send the voice signal to the server, the server can send the voice signal to other first electronic equipment and the second electronic equipment corresponding to all the first electronic equipment as dubbing audio, and each anchor and audience can watch dubbing fight performance.
Therefore, in this embodiment, dubbing fight performance can be performed among a plurality of anchor, so as to interact with other anchor and spectators, so that interactivity and interestingness of the virtual space can be further enhanced, and user experience is further improved.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type may be a multi-person dubbing type. That is, the anchor and the audience can each dub different roles in the video to be dubbed, completing the dubbing of the video together. Typically, the audience here consists of users in an instant messaging area in the virtual space, for example, users in a chat room of a live room.
In this case, the video to be dubbed may be selected by the anchor according to the number of persons participating in the dubbing. It is also reasonable for the server to recommend suitable videos according to the number of users in the instant messaging area in the virtual space, which is not specifically limited herein. To facilitate the dubbing, the anchor and the users participating in the dubbing may agree on the assignment of roles in the instant messaging area.
For the case that the preset dubbing type is a multi-person dubbing type, the step of playing the unmanned audio video corresponding to the audio video to be dubbed according to the preset dubbing type may include:
and controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space, and simultaneously playing the unmanned audio-video corresponding to the audio-video to be dubbed.
In order to ensure that users in the anchor and the instant messaging area can smoothly complete dubbing of the video to be dubbed, the first electronic equipment and each second electronic equipment corresponding to the users in the instant messaging area need to synchronously play the unmanned audio video corresponding to the video to be dubbed, so that the anchor and each user can smoothly complete dubbing interaction.
Therefore, in this embodiment, users in the anchor and the instant messaging area can cooperate with each other to complete dubbing performance, the interaction between the anchor and the audience is stronger, the participation of the audience is enhanced, the interactivity and the interestingness of the virtual space can be further enhanced, and the user experience is further improved.
As an implementation manner of the embodiment of the present disclosure, the step of controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space and simultaneously playing the unmanned audio/video corresponding to the audio/video to be dubbed may include:
And when the broadcast message sent by the first electronic equipment is obtained, sending the audio/video to be dubbed and a start instruction to each second electronic equipment corresponding to a user in the instant messaging area in the virtual space, so that each second electronic equipment plays the unmanned audio/video corresponding to the audio/video to be dubbed simultaneously when receiving the start instruction.
In order to ensure that the first electronic device and the second electronic devices corresponding to the users participating in the dubbing can play the unmanned audio video simultaneously, the first electronic device can issue the command to start dubbing by sending a broadcast message. When the server acquires the broadcast message sent by the first electronic device, it sends the video to be dubbed and the start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space.
Therefore, when each second electronic device receives the starting instruction, the unmanned audio-video corresponding to the audio-video to be dubbed starts to be played, and the unmanned audio-video is ensured to be played at the same time by each second electronic device.
In the dubbing process, in order to ensure that the dubbing audio played at each user terminal is synchronized, in one implementation a real-time communication mode can be adopted for voice transmission. For example, the voice signal may be collected at 20 ms intervals, the encoded data transmitted in UDP (User Datagram Protocol) packets, and network packet loss handled through FEC (Forward Error Correction). After the receiving end receives the packets, it can reorder them by sequence number, and lost packets can be recovered through PLC (Packet Loss Concealment). In this way it can be ensured that a data packet from the sending end reaches the receiving end within 400 ms, and the synchronous playing of the dubbing audio at all user terminals can be guaranteed.
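As a hedged illustration of the receive-side handling just described, the sketch below reorders packets by sequence number and marks lost packets so a concealment step could fill them in. The names (`AudioPacket`, `reorder`) and the Python setting are illustrative assumptions, not part of the original scheme.

```python
# Minimal sketch: sequence-numbered audio packets arrive out of order over
# UDP; the receiver puts them back in order and marks gaps for PLC.
from dataclasses import dataclass

@dataclass
class AudioPacket:
    seq: int          # monotonically increasing sequence number
    payload: bytes    # one 20 ms encoded audio frame

def reorder(received, first_seq, count):
    """Return payloads in sequence order; None marks a lost packet (for PLC)."""
    by_seq = {p.seq: p.payload for p in received}
    return [by_seq.get(first_seq + i) for i in range(count)]

# Packet 3 arrives before packet 2, and packet 4 is lost entirely
pkts = [AudioPacket(1, b"a"), AudioPacket(3, b"c"), AudioPacket(2, b"b")]
ordered = reorder(pkts, 1, 4)   # [b"a", b"b", b"c", None]
```

A real receiver would additionally bound the reorder buffer by a playout deadline (the 400 ms budget mentioned above) rather than waiting indefinitely for late packets.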
In this embodiment, the server may send the audio/video to be dubbed and the start instruction to each second electronic device corresponding to the user in the instant messaging area in the virtual space when acquiring the broadcast message sent by the first electronic device, so that each second electronic device plays the unmanned audio/video corresponding to the audio/video to be dubbed simultaneously when receiving the start instruction, thereby ensuring that the unmanned audio/video is played synchronously and the dubbing can be performed smoothly.
As an implementation manner of the embodiment of the present disclosure, the step of determining the video to be dubbed may include:
acquiring a video uploaded by the first electronic equipment; and determining the uploaded video as the video to be dubbed.
When determining the video to be dubbed, the anchor can select the favorite video, upload the favorite video to the server through the first electronic equipment, and the server can acquire the video uploaded by the first electronic equipment, so that the server can determine the video uploaded by the first electronic equipment as the video to be dubbed.
The server can also recognize the subtitles of the video uploaded by the first electronic device to obtain a recognition result and add the result to the uploaded video, so that each user can conveniently read the subtitles while dubbing. The specific subtitle recognition method is not limited in the embodiments of the present disclosure, as long as the subtitles of the video can be recognized. The first electronic device may also store the video selected by the anchor locally, so that the uploaded video can be reused in each live broadcast.
In this embodiment, the server may obtain the video uploaded by the first electronic device, and further determine the uploaded video as the video to be dubbed. Therefore, the requirements of the anchor can be met, and the user experience is further improved.
As shown in fig. 4, the method for obtaining the unmanned audio video may include:
s401, determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed;
in order to process the video to be dubbed to obtain the corresponding unmanned audio video, the amplitude spectrum corresponding to the audio signal of the video to be dubbed needs to be determined first. Specifically, the audio signal of the video to be dubbed can be divided into frames to obtain each frame of the audio signal, and then each frame is transformed into a frequency-domain signal to obtain its amplitude spectrum.
For example, if the audio signal of the video to be dubbed is a 16 kHz, mono, 16-bit quantized audio signal, the audio signal may first be divided into frames with a frame length of 512 sampling points and a frame shift of 256 sampling points, so as to obtain each frame of the audio signal. Then, a short-time Fourier transform is applied to each frame, yielding the phase spectrum and amplitude spectrum corresponding to each frame of the audio signal.
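The framing and short-time Fourier transform described above can be sketched as follows. This is a minimal numpy-only illustration; the function name and the Hann window choice are assumptions, not from the original.

```python
# Sketch: frame a mono 16 kHz signal (frame length 512, hop 256) and compute
# per-frame magnitude and phase spectra via the short-time Fourier transform.
import numpy as np

def stft_mag_phase(signal, frame_len=512, hop=256):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)       # one spectrum per frame
    return np.abs(spectrum), np.angle(spectrum)  # magnitude, phase

# Example: one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
mag, phase = stft_mag_phase(np.sin(2 * np.pi * 440 * t))
# mag has shape (n_frames, frame_len // 2 + 1) = (61, 257)
```

The phase spectra are kept alongside the magnitudes because, as described later, the separated magnitude spectrum is recombined with the original phase before transforming back to the time domain.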
S402, inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed;
the server may then input the corresponding magnitude spectrum for each frame of audio signal into a pre-trained network model. The network model may be obtained by training based on a pre-obtained amplitude spectrum sample and a corresponding human voice mask matrix, and may include a correspondence between an amplitude spectrum and the human voice mask matrix. Therefore, the network model can determine the human voice mask matrix corresponding to the amplitude spectrum corresponding to each frame of audio signal according to the corresponding relation between the amplitude spectrum and the human voice mask matrix.
The human voice mask matrix is a mask matrix capable of removing the human voice. The value of each element in the matrix ranges from 0 to 1: the closer an element is to 0, the more the corresponding time-frequency bin is dominated by the human voice, and the closer it is to 1, the less human voice it contains. Therefore, by setting a threshold and setting all elements below the threshold to 0, the corresponding parts of the audio signal, which are the human voice, can be removed.
The network model may be a deep learning network model such as a convolutional neural network or a recurrent neural network, and is not particularly limited herein.
S403, calculating an unmanned audio amplitude spectrum by using the human voice mask matrix and the amplitude spectrum;
then, the server may take the element-wise product of the human voice mask matrix and the amplitude spectrum of each frame of the audio signal, so as to obtain the amplitude spectrum of the separated signal, which can be understood as the amplitude spectrum of the unmanned audio.
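The element-wise masking step can be sketched as below. The threshold step from the mask description is included; the 0.5 cutoff and the function name are illustrative assumptions, not values from the original.

```python
# Sketch: multiply the voice mask element-wise with the magnitude spectrum to
# suppress time-frequency bins dominated by the human voice.
import numpy as np

def remove_voice(magnitude, voice_mask, threshold=0.5):
    # Zero out mask entries below the cutoff (voice-dominated bins) ...
    mask = np.where(voice_mask < threshold, 0.0, voice_mask)
    # ... then attenuate the spectrum bin by bin.
    return magnitude * mask

mag  = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
mask = np.array([[0.1, 0.9, 1.0],   # 0.1 -> voice-dominated bin, zeroed
                 [1.0, 0.2, 0.8]])
out = remove_voice(mag, mask)
# out: [[0., 1.8, 3.], [4., 0., 4.8]]
```

Both arrays have one row per frame and one column per frequency bin, matching the per-frame amplitude spectra produced in step S401.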
S404, determining the unmanned audio video corresponding to the video to be dubbed based on the unmanned audio amplitude spectrum.
After the amplitude spectrum of the separated signal is obtained, the server can combine the amplitude spectrum of the separated signal with the phase spectrum and then convert the combined amplitude spectrum into a time domain signal, so that the time domain signal of the separated signal, namely the unmanned audio frequency, can be obtained.
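A minimal sketch of this reconstruction step, assuming the 512/256 framing from the earlier example and a plain overlap-add without a synthesis window (so it is illustrative rather than a perfect-reconstruction inverse):

```python
# Sketch: recombine the separated magnitude spectrum with the original phase
# and invert back to a time-domain signal via inverse FFT plus overlap-add.
import numpy as np

def istft(magnitude, phase, frame_len=512, hop=256):
    frames = np.fft.irfft(magnitude * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):          # overlap-add synthesis
        out[i * hop : i * hop + frame_len] += frame
    return out

# Shape check on arbitrary spectra: 10 frames -> 9*256 + 512 = 2816 samples
mag = np.abs(np.random.randn(10, 257))
ph  = np.random.uniform(-np.pi, np.pi, (10, 257))
audio = istft(mag, ph)
```

In practice a windowed inverse STFT (for example `scipy.signal.istft`) would be used to satisfy the overlap-add reconstruction condition; the version above only shows how magnitude and phase are recombined.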
Furthermore, the unmanned audio is combined with the image part of the video to be dubbed, so that the unmanned audio corresponding to the video to be dubbed can be obtained.
It can be seen that, in this embodiment, the server may obtain the unmanned audio-video corresponding to the audio-video to be dubbed by using the pre-training completion network model, and may quickly and accurately determine the unmanned audio-video corresponding to the audio-video to be dubbed, so as to further improve user experience.
As shown in fig. 5, the method for obtaining the unmanned audio video may include:
S501, determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed;
step S501 is the same as step S401 described above, and the description and illustration of the portion of step S401 will be referred to, and will not be repeated here.
S502, inputting the amplitude spectrum into a pre-trained network model to obtain an unmanned audio corresponding to the audio/video to be dubbed;
the server may input the amplitude spectrum obtained in step S501 into a pre-trained network model, where the network model may be obtained based on a pre-obtained amplitude spectrum sample and its corresponding unmanned audio training, and may include a correspondence between the amplitude spectrum and the unmanned audio. Therefore, the network model can determine the unmanned audio corresponding to the input amplitude spectrum according to the corresponding relation between the amplitude spectrum and the unmanned audio, and then output the unmanned audio.
Specifically, the network model may determine a human voice mask matrix corresponding to an amplitude spectrum corresponding to each frame of audio signal, then dot-multiply the human voice mask matrix and the amplitude spectrum of each frame of audio signal to obtain an amplitude spectrum of a separated signal, then combine the amplitude spectrum of the separated signal with the phase spectrum, and then transform the amplitude spectrum of the separated signal into a time domain signal, so as to obtain a time domain signal of the separated signal, that is, an unmanned audio.
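The per-frame masking step above can be illustrated with an oracle soft mask: a ratio mask computed from known vocal and accompaniment magnitudes stands in for the network's prediction, and the matrix shapes are illustrative.

```python
import numpy as np

# Oracle stand-in for the network output: a soft human-voice mask built
# from known vocal/accompaniment magnitudes. The real model predicts this
# matrix from the mixture amplitude spectrum alone.
def ratio_mask(keep_mag, remove_mag, eps=1e-10):
    return keep_mag / (keep_mag + remove_mag + eps)

rng = np.random.default_rng(0)
vocal_mag = rng.random((513, 100))       # frequency bins x frames
accomp_mag = rng.random((513, 100))
mixture_mag = vocal_mag + accomp_mag     # additive-magnitude assumption

mask = ratio_mask(accomp_mag, vocal_mag)
# Dot-multiply (element-wise) the mask with each frame's amplitude
# spectrum to get the separated-signal amplitude spectrum.
separated_mag = mask * mixture_mag
```

Because the mask values lie in [0, 1], the masked spectrum closely recovers the accompaniment component of the mixture.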
The network model may be a deep learning model such as a convolutional neural network or a recurrent neural network, which is not specifically limited here.
S503, determining the unmanned audio-video corresponding to the audio-video to be dubbed based on the unmanned audio-video.
Furthermore, the server can combine the unmanned audio output by the above network model with the image part of the audio-video to be dubbed, so that the unmanned sound video corresponding to the audio-video to be dubbed can be obtained.
It can be seen that, in this embodiment, the server may obtain the unmanned sound video corresponding to the audio-video to be dubbed by using the pre-trained network model, so that the unmanned sound video can be determined quickly and accurately, further improving user experience.
The unmanned sound video may be a video in which the human voice is removed and the background music is retained, a video in which both the human voice and the background music are removed and only some rhythm information is retained, or a video with no sound at all; all of these are reasonable. Specifically, the human voice mask matrix may be set according to the dubbing requirement to achieve the corresponding effect.
As an implementation manner of the embodiment of the present disclosure, after obtaining the unmanned audio, in a first implementation manner, the method may further include:
Determining an amplitude spectrum corresponding to the unmanned audio; inputting the amplitude spectrum into a pre-trained network model to obtain an instrument mask matrix corresponding to the unmanned audio, and calculating a target instrument amplitude spectrum using the instrument mask matrix and the amplitude spectrum; and determining the target instrument audio corresponding to the unmanned audio based on the target instrument amplitude spectrum.
The network model is trained based on pre-acquired amplitude spectrum samples and their corresponding instrument mask matrices, and includes a correspondence between amplitude spectra and instrument mask matrices. The instrument mask matrix is a matrix that removes other audio signals and retains the audio signal of a certain instrument.
Since the target instrument audio is determined in substantially the same manner as the unmanned audio in the first implementation, details are not repeated here.
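As a concrete illustration of an instrument mask matrix, a crude hand-built band mask is shown below. The low-band boundary and the "drum" label are arbitrary assumptions; in the patented method the mask is learned by a network rather than fixed by hand.

```python
import numpy as np

n_bins, n_frames = 513, 200
rng = np.random.default_rng(1)
# Amplitude spectrum of the unmanned (vocal-removed) audio.
no_vocal_mag = np.abs(rng.normal(size=(n_bins, n_frames)))

# Keep only the lowest 80 frequency bins, suppressing everything else --
# a toy stand-in for a learned "drum" instrument mask matrix.
drum_mask = np.zeros((n_bins, n_frames))
drum_mask[:80, :] = 1.0

# Target-instrument amplitude spectrum = mask (element-wise) * amplitude spectrum.
drum_mag = drum_mask * no_vocal_mag
```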
In a second embodiment, the method may further include:
determining an amplitude spectrum corresponding to the unmanned audio; and inputting the amplitude spectrum into a pre-trained network model to obtain the target musical instrument audio corresponding to the unmanned audio.
The network model is trained based on pre-obtained amplitude spectrum samples and their corresponding instrument audio, and includes a correspondence between amplitude spectra and instrument audio. Since the target instrument audio is determined in substantially the same manner as the unmanned audio in the second implementation, details are not repeated here.
The target musical instrument may be set according to actual needs, and may be, for example, a piano, guitar, drum, or the like.
Thus, various target instrument audios can be obtained in the above two ways. The server may replace the target instrument audio in the unmanned audio with another instrument's audio, or determine rhythm information of the unmanned audio from the instrument audio. This provides convenience for various dubbing modes, further enriches dubbing interaction, and improves user experience.
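One simple way to derive rhythm information from a separated instrument track, as the paragraph above suggests, is frame-energy thresholding. The impulse train, frame size, and threshold below are all illustrative choices, not specified by the patent.

```python
import numpy as np

fs = 1000
drum_audio = np.zeros(fs)
for beat_sample in (100, 350, 600, 850):   # synthetic drum hits
    drum_audio[beat_sample] = 1.0

frame = 50
# Energy per non-overlapping frame; frames whose energy crosses the
# threshold are treated as beats (the rhythm information).
energy = drum_audio.reshape(-1, frame).sum(axis=1)
beat_frames = np.flatnonzero(energy > 0.5)
```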
As an implementation manner of the embodiment of the present disclosure, after dubbing is completed and an upload instruction sent by the user is received, the server may encode the dubbing audio and the unmanned sound video into a dubbed video and publish it to the live-broadcast software platform for users to download and view.
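Encoding the dubbing audio together with the unmanned sound video into one dubbed clip could, for example, be done by muxing with ffmpeg. The tool choice and file names below are assumptions; the patent does not specify an encoder.

```python
def build_mux_command(no_vocal_video, dub_audio, out_path):
    # Take the video stream from the unmanned sound video and the audio
    # stream from the recorded dubbing, copying the video unchanged.
    return [
        "ffmpeg",
        "-i", no_vocal_video,
        "-i", dub_audio,
        "-map", "0:v", "-map", "1:a",
        "-c:v", "copy",
        "-shortest",
        out_path,
    ]

cmd = build_mux_command("no_vocal.mp4", "dub.wav", "dubbed.mp4")
# The command would then be executed with e.g. subprocess.run(cmd, check=True).
```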
The embodiment of the disclosure also provides a second audio/video processing method, which can be applied to the first electronic device provided with the live broadcast application program.
The first electronic device is an electronic device with live broadcast authority in the virtual space, and the anchor can conduct a live broadcast through the first electronic device.
As shown in fig. 6, a processing method of audio and video is applied to a first electronic device, where the first electronic device is an electronic device with a live broadcast authority in a virtual space, and the method includes:
In step S601, a dubbing instruction in the virtual space is acquired;
in step S602, a preset dubbing type corresponding to the dubbing instruction is determined;
in step S603, determining an audio/video to be dubbed;
in step S604, when a dubbing start instruction is acquired, playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type;
in step S605, in the process of playing the unmanned audio-video, the dubbing audio corresponding to the unmanned audio-video is obtained, and the dubbing audio is sent to a server.
It can be seen that, in the scheme provided by the embodiment of the present disclosure, the first electronic device may obtain a dubbing instruction in the virtual space, determine a preset dubbing type corresponding to the dubbing instruction, then determine a video to be dubbed, and further play an unmanned audio/video corresponding to the video to be dubbed according to the preset dubbing type when obtaining a dubbing start instruction, and obtain a dubbing audio corresponding to the unmanned audio/video in the process of playing the unmanned audio/video, and send the dubbing audio to the server. By adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
During a live broadcast, the anchor can interact with the audience or other anchors by dubbing; at this time, the anchor can send a dubbing instruction through the first electronic device. In step S601, the first electronic device may acquire the dubbing instruction sent by the anchor in the virtual space, where the dubbing instruction indicates that the anchor needs to interact with the audience or other anchors by dubbing. Since there may be multiple dubbing manners in the virtual space, the first electronic device may then determine the preset dubbing type corresponding to the acquired dubbing instruction, that is, execute step S602.
After acquiring the dubbing instruction issued by the anchor in the virtual space, the first electronic device may execute step S603 described above, that is, determine the audio-video to be dubbed. Then, when a dubbing start instruction sent by the anchor is acquired, indicating that the anchor needs to start dubbing, the unmanned audio-video corresponding to the audio-video to be dubbed can be played according to the preset dubbing type.
When the dubbing start instruction sent by the anchor is obtained, the first electronic device can play, according to the preset dubbing type, the unmanned audio-video corresponding to the audio-video to be dubbed; in the process of playing the unmanned audio-video, it obtains the dubbing audio corresponding to the unmanned audio-video and sends the dubbing audio to the server. The server may send the dubbing audio to the second electronic device, as well as to first electronic devices used by other anchors. The unmanned sound video is a video in which the human voice is removed and only the background music is retained. The second electronic device is an electronic device having the right to watch the live broadcast in the virtual space.
The manner in which the first electronic device determines the preset dubbing type corresponding to the dubbing instruction, determines the audio-video to be dubbed, and acquires the dubbing audio corresponding to the unmanned audio-video may be the same as the corresponding manner used by the server, so details are not repeated here.
As an implementation of the disclosed embodiment, the preset dubbing type may be an anchor performance type.
Correspondingly, the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type may include:
and playing the unmanned audio-video corresponding to the audio-video to be dubbed, and controlling the second electronic equipment to simultaneously play the unmanned audio-video corresponding to the audio-video to be dubbed.
When the first electronic device plays the unmanned audio-video corresponding to the audio-video to be dubbed, it can send a request to the server, so that the server controls the second electronic device to play the same unmanned audio-video simultaneously. This ensures that the anchor's audience can watch the anchor's dubbing performance at the same time.
Therefore, in this embodiment, the anchor can give a dubbing performance to interact with the audience, which enhances the interactivity and interest of the virtual space and improves user experience.
As an implementation of the embodiment of the present disclosure, the preset dubbing type may be a multi-anchor fight type.
Correspondingly, the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type may include:
determining a fight sequence corresponding to first electronic equipment corresponding to each anchor; and controlling the first electronic equipment and the corresponding second electronic equipment to sequentially play the unmanned audio-video corresponding to the audio-video to be dubbed according to the fight sequence.
Since multiple anchors need to perform a dubbing fight, and to give the audience the experience of watching a fight, the anchors must give their dubbing performances one by one. The first electronic device used by an anchor may determine the fight sequence corresponding to the first electronic device of each anchor, and then control the first electronic devices and their corresponding second electronic devices to play, in that fight sequence, the unmanned audio-video corresponding to the audio-video to be dubbed.
In one embodiment, the first electronic device may send a dubbing switching request to the server; after receiving the request, the server may control each first electronic device and its corresponding second electronic devices to play, in turn, the unmanned audio-video corresponding to the audio-video to be dubbed, so that each anchor may dub and its audience may watch every anchor's dubbing performance.
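The switching flow described above can be sketched as a small scheduler. The data structures and anchor names are illustrative, not taken from the patent.

```python
from collections import deque

def run_fight(anchors):
    # Each anchor's first electronic device (and its viewers' second
    # electronic devices) plays the unmanned audio-video in turn; each
    # dubbing-switch request advances playback to the next anchor.
    order = deque(anchors)
    play_log = []
    while order:
        current = order.popleft()
        play_log.append(current)   # current anchor dubs; the rest wait
    return play_log

log = run_fight(["anchor_a", "anchor_b", "anchor_c"])
```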
Therefore, in this embodiment, a dubbing fight performance can be carried out among multiple anchors to interact with other anchors and spectators, which further enhances the interactivity and interest of the virtual space and further improves user experience.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type may be a multi-person dubbing type.
Correspondingly, the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type may include:
and controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space, and simultaneously playing the unmanned audio-video corresponding to the audio-video to be dubbed.
In order to ensure that the anchor and the users in the instant messaging area can smoothly complete the dubbing of the audio-video to be dubbed, the first electronic device and each second electronic device corresponding to a user in the instant messaging area need to play the unmanned audio-video corresponding to the audio-video to be dubbed synchronously, so that the anchor and each user can smoothly complete the dubbing interaction.
Therefore, in this embodiment, the anchor and the users in the instant messaging area can cooperate to complete a dubbing performance. The interaction between the anchor and the audience is stronger and audience participation is enhanced, which further strengthens the interactivity and interest of the virtual space and further improves user experience.
As an implementation manner of the embodiment of the present disclosure, the step of controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space and simultaneously playing the unmanned audio/video corresponding to the audio/video to be dubbed may include:
and sending the broadcast message to the server so that the server sends the audio and video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space, and playing the unmanned audio and video corresponding to the audio and video to be dubbed simultaneously when each second electronic device receives the start instruction.
In order to ensure that the first electronic device and the second electronic devices corresponding to the users participating in the dubbing can play the unmanned audio-video simultaneously, the first electronic device can signal the start of dubbing by sending a broadcast message; when the server receives the broadcast message sent by the first electronic device, it sends the audio-video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space.
Therefore, when each second electronic device receives the starting instruction, the unmanned audio-video corresponding to the audio-video to be dubbed starts to be played, and the unmanned audio-video is ensured to be played at the same time by each second electronic device.
In this embodiment, the server may send the audio/video to be dubbed and the start instruction to each second electronic device corresponding to the user in the instant messaging area in the virtual space when acquiring the broadcast message sent by the first electronic device, so that each second electronic device plays the unmanned audio/video corresponding to the audio/video to be dubbed simultaneously when receiving the start instruction, thereby ensuring that the unmanned audio/video is played synchronously and the dubbing can be performed smoothly.
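The broadcast-then-start handshake can be sketched as follows. The future start timestamp is an assumption of this sketch; the patent only requires that the server deliver the audio-video to be dubbed together with a start instruction.

```python
def fan_out_start(video_id, viewer_ids, now, delay=2.0):
    # On receiving the host's broadcast message, the server sends every
    # second electronic device the video to dub plus a start instruction
    # carrying a shared timestamp, so all devices begin playback together.
    start_at = now + delay
    return [
        {"to": viewer, "video": video_id, "start_at": start_at}
        for viewer in viewer_ids
    ]

msgs = fan_out_start("v42", ["viewer_1", "viewer_2"], now=100.0)
```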
As an implementation manner of the embodiment of the present disclosure, the step of determining the video to be dubbed may include:
acquiring a video uploaded by a user; and determining the uploaded video as the video to be dubbed.
When determining the video to be dubbed, the anchor can select a favorite video to upload; the first electronic device can then acquire the uploaded video and determine it as the video to be dubbed.
In this embodiment, the first electronic device may obtain the video uploaded by the user, and further determine the uploaded video as the video to be dubbed. Therefore, the requirements of the anchor can be met, and the user experience is further improved.
As an implementation manner of the embodiment of the present disclosure, the method for acquiring the unmanned sound video may include:
Determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-obtained amplitude spectrum sample and a human voice mask matrix corresponding to the amplitude spectrum sample, and the network model comprises a corresponding relation between the amplitude spectrum and the human voice mask matrix; calculating to obtain an unmanned voice amplitude spectrum by using the voice mask matrix and the amplitude spectrum; and determining the unmanned sound video corresponding to the audio-video to be dubbed based on the unmanned sound amplitude spectrum.
As an implementation manner of the embodiment of the present disclosure, the method for acquiring the unmanned sound video may include:
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio-video to be dubbed, wherein the network model is obtained based on a pre-obtained amplitude spectrum sample and the corresponding unmanned audio training, and comprises a corresponding relation between the amplitude spectrum and the unmanned audio; and determining the unmanned audio-video corresponding to the audio-video to be dubbed based on the unmanned audio-video.
Because the manner in which the first electronic device obtains the unmanned sound video is the same as the manner in which the server obtains it, refer to the description of the server's manner, which is not repeated here.
The embodiment of the disclosure also provides a third audio/video processing method, which can be applied to a second electronic device provided with a live broadcast application program.
The second electronic device is an electronic device with the right to watch live broadcast in the virtual space, and a viewer can watch live broadcast through the second electronic device.
As shown in fig. 7, a processing method of audio and video is applied to a second electronic device, and the method includes:
in step S701, when a dubbing start instruction in a virtual space is acquired, playing an unmanned audio video corresponding to a pre-acquired audio video to be dubbed;
in step S702, when the dubbing audio corresponding to the unmanned audio-video is obtained in the process of playing the unmanned audio-video, the dubbing audio is played.
Therefore, in the scheme provided by the embodiment of the disclosure, the second electronic device may play the unmanned audio video corresponding to the audio-video to be dubbed obtained in advance when the dubbing start instruction in the virtual space is obtained, and play the dubbing audio when the dubbing audio corresponding to the unmanned audio-video is obtained in the process of playing the unmanned audio-video. By adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
The audience can watch the anchor's live broadcast through the second electronic device. When the second electronic device acquires a dubbing start instruction in the virtual space, the anchor or another audience member has started a dubbing performance, and the second electronic device can play the unmanned audio-video corresponding to the pre-acquired audio-video to be dubbed.
The dubbing start instruction may be generated by the server and sent to the second electronic device, or may be sent by the first electronic device to the server and forwarded by the server to the second electronic device; both are reasonable.
After the server or the first electronic device determines the audio-video to be dubbed, it may send the audio-video to be dubbed to the second electronic device, or send only its identifier, in which case the second electronic device determines the video corresponding to the identifier as the audio-video to be dubbed and thereby obtains the corresponding unmanned sound video.
In step S702, when the second electronic device obtains the dubbing audio corresponding to the unmanned audio-video during playback, it may play the dubbing audio so that the spectator can also experience the dubbing performance. The dubbing audio may be audio that the server receives from the first electronic device, or from a second electronic device used by another spectator, and forwards to this second electronic device.
When the anchor performs dubbing, the first electronic device can acquire the dubbing audio from the anchor and send it to the server. Likewise, when another spectator performs dubbing, the second electronic device used by that spectator can acquire the spectator's dubbing audio and send it to the server.
As an implementation manner of the embodiment of the present disclosure, when the dubbing start instruction in the virtual space is obtained, the step of playing the unmanned audio-video corresponding to the pre-acquired audio-video to be dubbed may include:
and when receiving the audio/video to be dubbed and a starting instruction in the virtual space sent by the server, playing the received unmanned audio/video corresponding to the audio/video to be dubbed.
In order to ensure that the first electronic device and the second electronic devices corresponding to the users participating in the dubbing can play the unmanned audio-video simultaneously, the first electronic device can signal the start of dubbing by sending a broadcast message; when the server receives the broadcast message sent by the first electronic device, it sends the audio-video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space.
Therefore, when each second electronic device receives the starting instruction, the unmanned audio-video corresponding to the audio-video to be dubbed starts to be played, and the unmanned audio-video is ensured to be played at the same time by each second electronic device.
In this embodiment, the server may send the audio/video to be dubbed and the start instruction to each second electronic device corresponding to the user in the instant messaging area in the virtual space when acquiring the broadcast message sent by the first electronic device, so that each second electronic device plays the unmanned audio/video corresponding to the audio/video to be dubbed simultaneously when receiving the start instruction, thereby ensuring that the unmanned audio/video is played synchronously and the dubbing can be performed smoothly.
As an implementation manner of the embodiment of the present disclosure, the method for acquiring the unmanned sound video may include:
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-obtained amplitude spectrum sample and a human voice mask matrix corresponding to the amplitude spectrum sample, and the network model comprises a corresponding relation between the amplitude spectrum and the human voice mask matrix; calculating to obtain an unmanned voice amplitude spectrum by using the voice mask matrix and the amplitude spectrum; and determining the unmanned sound video corresponding to the audio-video to be dubbed based on the unmanned sound amplitude spectrum.
As an implementation manner of the embodiment of the present disclosure, the method for acquiring the unmanned sound video may include:
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio-video to be dubbed, wherein the network model is obtained based on a pre-obtained amplitude spectrum sample and the corresponding unmanned audio training, and comprises a corresponding relation between the amplitude spectrum and the unmanned audio; and determining the unmanned audio-video corresponding to the audio-video to be dubbed based on the unmanned audio-video.
Because the manner in which the second electronic device acquires the unmanned sound video is the same as the manner in which the server acquires it, refer to the description of the server's manner, which is not repeated here.
Fig. 8 is a block diagram of a first audio/video processing device according to an exemplary embodiment.
As shown in fig. 8, an audio/video processing device is applied to a server, and the device includes:
a first dubbing instruction obtaining module 810 configured to perform obtaining a dubbing instruction sent by a first electronic device in a virtual space;
The first electronic device is an electronic device with live broadcast authority in the virtual space.
A first preset dubbing type determining module 820 configured to perform determining a preset dubbing type corresponding to the dubbing instruction;
a first to-be-dubbed video determining module 830 configured to perform determining to-be-dubbed videos;
the first unmanned audio video playing module 840 is configured to play the unmanned audio video corresponding to the audio/video to be dubbed according to the preset dubbing type when the dubbing start instruction sent by the first electronic device is acquired;
the first dubbing audio sending module 850 is configured to obtain dubbing audio corresponding to the unmanned audio and video in the process of playing the unmanned audio and video, and send the dubbing audio to the second electronic device.
The second electronic device is an electronic device with the right to watch live broadcast in the virtual space.
It can be seen that, in the scheme provided by the embodiment of the disclosure, the server may obtain a dubbing instruction sent by the first electronic device in the virtual space, determine a preset dubbing type corresponding to the dubbing instruction, then determine a video to be dubbed, further play an unmanned audio/video corresponding to the video to be dubbed according to the preset dubbing type when obtaining a dubbing start instruction sent by the first electronic device, and obtain a dubbing audio corresponding to the unmanned audio/video in the process of playing the unmanned audio/video, and send the dubbing audio to the second electronic device. The first electronic device is an electronic device with live broadcast authority in the virtual space, and the second electronic device is an electronic device with live broadcast authority for watching in the virtual space. By adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type may be an anchor performance type;
the first unmanned audio video playing module 840 may include:
a first unmanned audio-video playing sub-module (not shown in fig. 8) configured to control the first electronic device and the second electronic device to simultaneously play the unmanned audio-video corresponding to the audio-video to be dubbed.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type may be a multi-anchor fight type;
the first unmanned audio video playing module 840 may include:
a fight sequence determination sub-module (not shown in fig. 8) configured to determine the fight sequence corresponding to the first electronic device of each anchor;
and a second unmanned audio-video playing sub-module (not shown in fig. 8) configured to control the first electronic device and the corresponding second electronic device to sequentially play the unmanned audio-video corresponding to the audio-video to be dubbed according to the fight sequence.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type may be a multi-person dubbing type;
the first unmanned audio video playing module 840 may include:
And a third unmanned audio-video playing sub-module (not shown in fig. 8) configured to perform control on each second electronic device corresponding to the user in the instant messaging area in the virtual space, and simultaneously play the unmanned audio-video corresponding to the audio-video to be dubbed.
As an implementation manner of the embodiment of the present disclosure, the third unmanned audio video playing sub-module may include:
the first unmanned audio-video playing unit (not shown in fig. 8) is configured to, when the broadcast message sent by the first electronic device is acquired, send the audio-video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space, so that each second electronic device simultaneously plays the unmanned audio-video corresponding to the audio-video to be dubbed upon receiving the start instruction.
As an implementation manner of the embodiment of the present disclosure, the first to-be-dubbed audio/video determining module 830 may include:
a first video acquisition sub-module (not shown in fig. 8) configured to perform acquisition of video uploaded by the first electronic device;
a first video to be dubbed determination submodule (not shown in fig. 8) configured to perform a determination of the uploaded video as a video to be dubbed.
As an implementation of the embodiment of the present disclosure, the audio-video processing apparatus may further include a first unmanned audio-video determining module (not shown in fig. 8);
the first unmanned audio-video determining module may include:
a first amplitude spectrum determination submodule (not shown in fig. 8) configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video-to-be-dubbed;
a first human voice mask matrix determining sub-module (not shown in fig. 8) configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed;
the network model is obtained by training based on a pre-acquired amplitude spectrum sample and a corresponding human voice mask matrix, and comprises a corresponding relation between an amplitude spectrum and the human voice mask matrix.
A first unmanned-sound amplitude spectrum determination sub-module (not shown in fig. 8) configured to calculate an unmanned-sound amplitude spectrum using the human voice mask matrix and the amplitude spectrum;
a first unmanned audio video determination sub-module (not shown in fig. 8) configured to perform a determination of an unmanned audio video corresponding to the audio video to be dubbed based on the unmanned audio amplitude spectrum.
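The mask-based separation performed by these sub-modules can be sketched roughly as: compute the short-time amplitude spectrum of the audio signal, obtain a human-voice mask matrix (in the disclosure this comes from the pre-trained network model; here a placeholder function stands in for it), attenuate the masked bins, and reconstruct the voice-free waveform with the original phase. All function names and parameters below are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

def predict_voice_mask(amplitude):
    # Placeholder for the pre-trained network model described in the
    # disclosure; a real model would return per-bin values in [0, 1].
    return np.zeros_like(amplitude)

def remove_vocals(audio, frame_len=1024, hop=256):
    """Sketch: STFT -> amplitude spectrum -> voice mask -> voice-free audio."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    # Short-time Fourier transform: one spectrum per windowed frame
    stft = np.stack([
        np.fft.rfft(window * audio[i * hop:i * hop + frame_len])
        for i in range(n_frames)
    ])
    amplitude, phase = np.abs(stft), np.angle(stft)

    # Suppress the bins the mask attributes to the human voice
    voice_mask = predict_voice_mask(amplitude)
    unvoiced_amplitude = amplitude * (1.0 - voice_mask)

    # Inverse STFT with overlap-add, reusing the original phase
    out = np.zeros(len(audio))
    for i, spec in enumerate(unvoiced_amplitude * np.exp(1j * phase)):
        out[i * hop:i * hop + frame_len] += np.fft.irfft(spec, frame_len) * window
    return out
```

With the zero placeholder mask, the function simply resynthesizes the input; substituting a trained model's mask would yield the unmanned sound amplitude spectrum described above.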
As an implementation of the disclosed embodiments, the apparatus may further include a second unmanned audio video determination module (not shown in fig. 8);
the second unmanned audio/video determining module may include:
a second amplitude spectrum determination sub-module (not shown in fig. 8) configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video-to-be-dubbed;
a first unmanned audio determination sub-module (not shown in fig. 8) configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain an unmanned audio corresponding to the audio-to-be-dubbed video;
the network model is obtained based on a pre-acquired amplitude spectrum sample and corresponding unmanned audio training, and comprises a corresponding relation between an amplitude spectrum and unmanned audio.
A second unmanned audio video determination sub-module (not shown in fig. 8) configured to perform a determination of an unmanned audio video corresponding to the audio-video to be dubbed based on the unmanned audio.
Fig. 9 is a block diagram of a second audio/video processing apparatus according to an exemplary embodiment.
As shown in fig. 9, an audio/video processing apparatus is applied to a first electronic device, where the first electronic device is an electronic device having a live broadcast authority in a virtual space, and the apparatus includes:
A second dubbing instruction acquisition module 910 configured to perform acquisition of dubbing instructions in the virtual space;
a second preset dubbing type determining module 920 configured to perform determining a preset dubbing type corresponding to the dubbing instruction;
a second to-be-dubbed video determining module 930 configured to perform determining to-be-dubbed videos;
the second unmanned audio video playing module 940 is configured to execute playing the unmanned audio video corresponding to the audio video to be dubbed according to the preset dubbing type when the dubbing start instruction is acquired;
the second dubbing audio sending module 950 is configured to obtain dubbing audio corresponding to the unmanned audio and video in the process of playing the unmanned audio and video, and send the dubbing audio to a server.
It can be seen that, in the scheme provided by the embodiment of the present disclosure, the first electronic device may obtain a dubbing instruction in the virtual space, determine a preset dubbing type corresponding to the dubbing instruction, then determine a video to be dubbed, and further play an unmanned audio/video corresponding to the video to be dubbed according to the preset dubbing type when obtaining a dubbing start instruction, and obtain a dubbing audio corresponding to the unmanned audio/video in the process of playing the unmanned audio/video, and send the dubbing audio to the server. By adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
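The anchor-side flow of modules 910-950 can be pictured as a small session object: receive the dubbing instruction, record its preset type, note the selected video, and on start forward recorded dubbing-audio chunks to the server. This is an illustrative sketch under assumed names, not the disclosed implementation.

```python
class AnchorDubbingSession:
    """Hypothetical sketch of the first-device flow (modules 910-950)."""

    def __init__(self, send_to_server):
        # send_to_server is a callable standing in for module 950's upload path
        self.send_to_server = send_to_server
        self.dubbing_type = None
        self.video_to_dub = None

    def on_dubbing_instruction(self, preset_type):
        # Modules 910/920: map the dubbing instruction to its preset type
        self.dubbing_type = preset_type  # e.g. "anchor_performance"

    def on_video_uploaded(self, video_id):
        # Module 930: the uploaded video becomes the video to be dubbed
        self.video_to_dub = video_id

    def on_dubbing_start(self, recorded_chunks):
        # Module 940 would start playback of the voice-free video here;
        # module 950 forwards each recorded dubbing-audio chunk to the server.
        for chunk in recorded_chunks:
            self.send_to_server(self.video_to_dub, chunk)
```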
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type is an anchor performance type;
the second unmanned audio video playing module 940 may include:
and a fourth unmanned audio video playing sub-module (not shown in fig. 9) configured to perform playing of the unmanned audio video corresponding to the audio-video to be dubbed, and control a second electronic device to simultaneously play the unmanned audio video corresponding to the audio-video to be dubbed, wherein the second electronic device is an electronic device with the authority of watching live broadcast in the virtual space.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type is a multicast fight type;
the second unmanned audio video playing module 940 may include:
a fight sequence determination sub-module (not shown in fig. 9) configured to perform a determination of a fight sequence corresponding to the first electronic device corresponding to each anchor;
and a fifth unmanned audio-video playing sub-module (not shown in fig. 9) configured to control the first electronic device and the corresponding second electronic device to sequentially play the unmanned audio-video corresponding to the audio-video to be dubbed according to the fight sequence.
As an implementation manner of the embodiment of the present disclosure, the preset dubbing type is a multi-person dubbing type;
The second unmanned audio video playing module 940 may include:
a sixth unmanned audio/video playing sub-module (not shown in fig. 9) is configured to perform control on each second electronic device corresponding to the user in the instant messaging area in the virtual space, and simultaneously play the unmanned audio/video corresponding to the audio/video to be dubbed.
As an implementation manner of the embodiment of the present disclosure, the sixth unmanned audio video playing sub-module may include:
and a second unmanned audio/video playing unit (not shown in fig. 9) configured to send a broadcast message to the server, so that the server sends the audio/video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space, and each second electronic device, upon receiving the start instruction, simultaneously plays the unmanned audio/video corresponding to the audio/video to be dubbed.
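The broadcast/start exchange just described can be sketched as a simple message flow: the anchor's device broadcasts, and the server fans out a start instruction carrying a shared start time so every viewer device begins playback together. The message fields and function names below are hypothetical illustrations, not part of the disclosure.

```python
import json

def make_broadcast_message(room_id, video_id):
    # Sent by the first (anchor) device to the server
    return json.dumps({"type": "dubbing_broadcast", "room": room_id,
                       "video": video_id})

def make_start_instruction(video_id, start_at_ms):
    # A shared start timestamp lets every second device begin playback
    # of the voice-free video at the same moment.
    return json.dumps({"type": "dubbing_start", "video": video_id,
                       "start_at": start_at_ms})

def handle_on_server(broadcast_json, viewers, clock_ms):
    # Fan the start instruction out to every viewer in the IM area,
    # scheduling playback a short fixed delay ahead of the server clock.
    msg = json.loads(broadcast_json)
    instruction = make_start_instruction(msg["video"], clock_ms + 500)
    return {viewer: instruction for viewer in viewers}
```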
As an implementation manner of the embodiment of the present disclosure, the second to-be-dubbed video determining module 930 may include:
a second video acquisition sub-module (not shown in fig. 9) configured to perform acquisition of video uploaded by the user;
a second video to be dubbed determination submodule (not shown in fig. 9) configured to perform determination of the uploaded video as a video to be dubbed.
As an implementation manner of the embodiment of the present disclosure, the foregoing audio/video processing apparatus may further include a third unmanned audio/video determining module (not shown in fig. 9);
the third unmanned audio/video determining module may include:
a third amplitude spectrum determination sub-module (not shown in fig. 9) configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video-to-be-dubbed;
a second voice mask matrix determining sub-module (not shown in fig. 9) configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain a voice mask matrix corresponding to the audio/video to be dubbed;
the network model is obtained by training based on a pre-acquired amplitude spectrum sample and a corresponding human voice mask matrix, and comprises a corresponding relation between an amplitude spectrum and the human voice mask matrix.
A second unmanned sound amplitude spectrum determination sub-module (not shown in fig. 9) configured to perform a calculation of an unmanned sound amplitude spectrum using the human voice mask matrix and the amplitude spectrum;
a third unmanned audio video determination sub-module (not shown in fig. 9) configured to perform a determination of the unmanned audio video corresponding to the audio video to be dubbed based on the unmanned audio amplitude spectrum.
As an implementation of the embodiment of the present disclosure, the audio-video processing apparatus may further include a fourth unmanned audio-video determining module (not shown in fig. 9);
the fourth unmanned audio/video determining module may include:
a fourth amplitude spectrum determination sub-module (not shown in fig. 9) configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video-to-be-dubbed;
a second unmanned audio determination sub-module (not shown in fig. 9) configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain an unmanned audio corresponding to the audio-video to be dubbed;
the network model is obtained based on a pre-acquired amplitude spectrum sample and corresponding unmanned audio training, and comprises a corresponding relation between an amplitude spectrum and unmanned audio.
A fourth unmanned audio video determination sub-module (not shown in fig. 9) configured to perform a determination of an unmanned audio video corresponding to the audio-video to be dubbed based on the unmanned audio.
Fig. 10 is a block diagram of a third audio/video processing device according to an exemplary embodiment.
As shown in fig. 10, an audio/video processing apparatus is applied to a second electronic device, where the second electronic device is an electronic device having a right to watch live broadcast in the virtual space, and the apparatus includes:
A third unmanned audio-video playing module 1010 configured to perform playing of an unmanned audio-video corresponding to the audio-video to be dubbed acquired in advance when a dubbing start instruction in the virtual space is acquired;
and the dubbing audio playing module 1020 is configured to play the dubbing audio when the dubbing audio corresponding to the unmanned audio video is acquired in the process of playing the unmanned audio video.
Therefore, in the scheme provided by the embodiment of the disclosure, the second electronic device may play the unmanned audio video corresponding to the audio-video to be dubbed obtained in advance when the dubbing start instruction in the virtual space is obtained, and play the dubbing audio when the dubbing audio corresponding to the unmanned audio-video is obtained in the process of playing the unmanned audio-video. By adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
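On the second device, the behaviour above amounts to playing the voice-free audio and layering the arriving dubbing audio on top of it. One simple way to picture this is mixing dubbing chunks into the voice-free track at their playback offsets; the function below is an illustrative sketch under assumed names, not the disclosed implementation.

```python
import numpy as np

def mix_playback(unvoiced, dubbing_chunks, gain=1.0):
    """Mix dubbing audio into the voice-free track.

    unvoiced: 1-D float array (the voice-free audio samples);
    dubbing_chunks: list of (sample_offset, samples) pairs as they arrive.
    """
    out = unvoiced.copy()
    for offset, chunk in dubbing_chunks:
        end = min(offset + len(chunk), len(out))
        out[offset:end] += gain * chunk[:end - offset]
    # Keep the mixed signal in the valid sample range
    return np.clip(out, -1.0, 1.0)
```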
As an implementation manner of the embodiment of the present disclosure, the third unmanned audio video playing module 1010 may include:
the seventh unmanned audio/video playing sub-module (not shown in fig. 10) is configured to play, upon receiving the audio/video to be dubbed and a start instruction in the virtual space sent by the server, the unmanned audio/video corresponding to the received audio/video to be dubbed.
As an implementation manner of the embodiment of the present disclosure, the audio/video processing apparatus may further include a fifth unmanned audio/video determining module;
the fifth unmanned sound video determination module may include:
a fifth amplitude spectrum determination submodule (not shown in fig. 10) configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video-to-be-dubbed;
a third voice mask matrix determining sub-module (not shown in fig. 10) configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain a voice mask matrix corresponding to the audio/video to be dubbed;
the network model is obtained by training based on a pre-acquired amplitude spectrum sample and a corresponding human voice mask matrix, and comprises a corresponding relation between an amplitude spectrum and the human voice mask matrix.
A third unmanned sound amplitude spectrum determination submodule (not shown in fig. 10) configured to perform a calculation using the human voice mask matrix and the amplitude spectrum to obtain an unmanned sound amplitude spectrum;
a fifth unmanned audio video determination sub-module (not shown in fig. 10) configured to perform a determination of an unmanned audio video corresponding to the audio video to be dubbed based on the unmanned audio amplitude spectrum.
As an implementation of the embodiment of the present disclosure, the audio-video processing apparatus may further include a sixth unmanned audio/video determining module (not shown in fig. 10);
the sixth unmanned audio/video determining module may include:
a sixth amplitude spectrum determination submodule (not shown in fig. 10) configured to perform determination of an amplitude spectrum corresponding to the audio signal of the video-to-be-dubbed;
a third unmanned audio determining sub-module (not shown in fig. 10) configured to perform inputting the amplitude spectrum into a pre-trained network model to obtain an unmanned audio corresponding to the audio-video to be dubbed;
the network model is obtained based on a pre-acquired amplitude spectrum sample and corresponding unmanned audio training, and comprises a corresponding relation between an amplitude spectrum and unmanned audio.
A sixth unmanned audio video determination sub-module (not shown in fig. 10) configured to perform a determination of an unmanned audio video corresponding to the audio-video to be dubbed based on the unmanned audio.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be repeated here.
The disclosed embodiments also provide an electronic device, as shown in fig. 11, which may include a processor 1101, a communication interface 1102, a memory 1103, and a communication bus 1104, where the processor 1101, the communication interface 1102, and the memory 1103 communicate with each other through the communication bus 1104;
A memory 1103 for storing a computer program;
the processor 1101 is configured to implement any one of the audio/video processing methods described in the above embodiments when executing the program stored in the memory 1103. Specifically, the electronic device may be a server, and the processor 1101 is configured to implement the first audio/video processing method according to any one of the above embodiments when executing the program stored in the memory 1103. The electronic device may be the first electronic device, and the processor 1101 is configured to implement the second audio/video processing method according to any one of the above embodiments when executing the program stored in the memory 1103. The electronic device may be the second electronic device, and the processor 1101 is configured to implement the third audio/video processing method according to any one of the above embodiments when executing the program stored in the memory 1103.
Therefore, by adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The disclosed embodiments also provide a computer-readable storage medium storing a computer program which, when executed by a processor of a server, enables the server to perform the audio/video processing method described in any of the above embodiments.
Therefore, by adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
The embodiment of the disclosure also provides an application program product which, when run, performs the audio/video processing method of any one of the above embodiments.
Therefore, by adopting the scheme, the user can interact in the virtual space in a dubbing mode, so that the diversity of the interaction mode is increased, and the user experience is improved.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (31)

1. A method for processing audio and video, which is applied to a server, the method comprising:
Acquiring a dubbing instruction sent by a first electronic device in a virtual space, wherein the first electronic device is an electronic device with live broadcast authority in the virtual space;
determining a preset dubbing type corresponding to the dubbing instruction;
determining an audio/video to be dubbed;
when a dubbing starting instruction sent by the first electronic equipment is obtained, playing an unmanned audio-video corresponding to the to-be-dubbed video according to the preset dubbing type, wherein the unmanned audio-video is obtained by processing the to-be-dubbed video;
during the process of playing the unmanned sound video, acquiring dubbing audio corresponding to the unmanned sound video, and simultaneously sending the dubbing audio to second electronic equipment, wherein the second electronic equipment is electronic equipment with live broadcast watching authority in the virtual space;
the method for processing the audio/video to be dubbed comprises the following steps:
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and a human voice mask matrix corresponding to the amplitude spectrum sample, and the network model comprises a corresponding relation between the amplitude spectrum and the human voice mask matrix; calculating an unmanned voice amplitude spectrum by using the human voice mask matrix and the amplitude spectrum; and determining an unmanned sound video corresponding to the audio/video to be dubbed based on the unmanned voice amplitude spectrum; or alternatively,
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and the corresponding unmanned audio, and the network model comprises a corresponding relation between the amplitude spectrum and the unmanned audio; and determining the unmanned audio/video corresponding to the audio/video to be dubbed based on the unmanned audio.
2. The method of claim 1, wherein the preset dubbing type is an anchor performance type;
the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type comprises the following steps:
and controlling the first electronic equipment and the second electronic equipment to play the unmanned sound video corresponding to the audio/video to be dubbed at the same time.
3. The method of claim 1, wherein the preset dubbing type is a multicast fight type;
the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type comprises the following steps:
determining a fight sequence corresponding to the first electronic device corresponding to each anchor, wherein the fight sequence indicates the dubbing order of each anchor;
According to the fight sequence, controlling the first electronic equipment and the corresponding second electronic equipment to sequentially play the unmanned audio-video corresponding to the audio-video to be dubbed, and acquiring a voice signal provided by the first electronic equipment corresponding to each anchor when the anchor dubs according to the fight sequence;
and sending the voice signal to other first electronic devices and second electronic devices corresponding to all the first electronic devices as dubbing audio.
4. The method of claim 1, wherein the preset dubbing type is a multi-person dubbing type;
the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type comprises the following steps:
and controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space, and simultaneously playing the unmanned audio-video corresponding to the audio-video to be dubbed.
5. The method of claim 4, wherein the step of controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space to simultaneously play the unmanned audio-video corresponding to the audio-video to be dubbed comprises:
and when the broadcast message sent by the first electronic equipment is obtained, sending the audio/video to be dubbed and a start instruction to each second electronic equipment corresponding to a user in the instant messaging area in the virtual space, so that each second electronic equipment plays the unmanned audio/video corresponding to the audio/video to be dubbed simultaneously when receiving the start instruction.
6. The method of any of claims 1-5, wherein the step of determining the video to be dubbed comprises:
acquiring a video uploaded by the first electronic equipment;
and determining the uploaded video as the video to be dubbed.
7. The audio and video processing method is characterized by being applied to a first electronic device, wherein the first electronic device is an electronic device with live broadcast authority in a virtual space, and the method comprises the following steps:
acquiring a dubbing instruction in the virtual space;
determining a preset dubbing type corresponding to the dubbing instruction;
determining an audio/video to be dubbed;
when a dubbing starting instruction is acquired, playing an unmanned audio video corresponding to the to-be-dubbed video according to the preset dubbing type, wherein the unmanned audio video is obtained by processing the to-be-dubbed video; in the process of playing the unmanned sound video, acquiring dubbing audio corresponding to the unmanned sound video, and simultaneously sending the dubbing audio to a server, wherein the dubbing audio is used for being played in second electronic equipment, and the second electronic equipment is electronic equipment with live broadcast watching authority in the virtual space;
The method for processing the audio/video to be dubbed comprises the following steps:
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and a human voice mask matrix corresponding to the amplitude spectrum sample, and the network model comprises a corresponding relation between the amplitude spectrum and the human voice mask matrix; calculating an unmanned voice amplitude spectrum by using the human voice mask matrix and the amplitude spectrum; and determining an unmanned sound video corresponding to the audio/video to be dubbed based on the unmanned voice amplitude spectrum; or alternatively,
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and the corresponding unmanned audio, and the network model comprises a corresponding relation between the amplitude spectrum and the unmanned audio; and determining the unmanned audio/video corresponding to the audio/video to be dubbed based on the unmanned audio.
8. The method of claim 7, wherein the preset dubbing type is an anchor performance type;
the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type comprises the following steps:
and playing the unmanned audio video corresponding to the audio-video to be dubbed, and controlling a second electronic device to play the unmanned audio video corresponding to the audio-video to be dubbed at the same time, wherein the second electronic device is an electronic device with the authority of watching live broadcast in the virtual space.
9. The method of claim 7, wherein the preset dubbing type is a multicast fight type;
the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type comprises the following steps:
determining a fight sequence corresponding to the first electronic device corresponding to each anchor, wherein the fight sequence indicates the dubbing order of each anchor;
and controlling the first electronic device and the corresponding second electronic device to sequentially play the unmanned audio/video corresponding to the audio/video to be dubbed according to the fight sequence, acquiring a voice signal sent by the anchor when dubbing according to the fight sequence, and sending the voice signal to a server, wherein the voice signal serves as dubbing audio played by the other first electronic devices and the second electronic devices corresponding to all the first electronic devices.
10. The method of claim 7, wherein the preset dubbing type is a multi-person dubbing type;
the step of playing the unmanned audio-video corresponding to the audio-video to be dubbed according to the preset dubbing type comprises the following steps:
and controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space, and simultaneously playing the unmanned audio-video corresponding to the audio-video to be dubbed.
11. The method of claim 10, wherein the step of controlling each second electronic device corresponding to the user in the instant messaging area in the virtual space to simultaneously play the unmanned audio-video corresponding to the audio-video to be dubbed comprises:
and sending the broadcast message to the server so that the server sends the audio and video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space, and playing the unmanned audio and video corresponding to the audio and video to be dubbed simultaneously when each second electronic device receives the start instruction.
12. The method according to any of claims 7-11, wherein the step of determining the video to be dubbed comprises:
acquiring a video uploaded by a user;
And determining the uploaded video as the video to be dubbed.
13. The audio and video processing method is characterized by being applied to a second electronic device, wherein the second electronic device is an electronic device with live broadcast watching authority in a virtual space, and the method comprises the following steps:
when a dubbing starting instruction in a virtual space is acquired, playing an unmanned audio video corresponding to a pre-acquired audio-video to be dubbed, wherein the unmanned audio video is obtained by processing the audio-video to be dubbed;
in the process of playing the unmanned sound video, playing the dubbing audio corresponding to the unmanned sound video when the dubbing audio is acquired, wherein the dubbing audio is a voice signal provided by a first electronic device;
the method for processing the audio/video to be dubbed comprises the following steps:
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and a human voice mask matrix corresponding to the amplitude spectrum sample, and the network model comprises a corresponding relation between the amplitude spectrum and the human voice mask matrix; calculating an unmanned voice amplitude spectrum by using the human voice mask matrix and the amplitude spectrum; and determining an unmanned sound video corresponding to the audio/video to be dubbed based on the unmanned voice amplitude spectrum; or alternatively,
determining an amplitude spectrum corresponding to the audio signal of the video to be dubbed; inputting the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio/video to be dubbed, wherein the network model is obtained by training based on a pre-acquired amplitude spectrum sample and the corresponding unmanned audio, and the network model comprises a corresponding relation between the amplitude spectrum and the unmanned audio; and determining the unmanned audio/video corresponding to the audio/video to be dubbed based on the unmanned audio.
14. The method of claim 13, wherein the step of playing the pre-acquired unmanned audio-video corresponding to the audio-video to be dubbed when the dubbing start instruction in the virtual space is acquired, comprises:
and when receiving the audio/video to be dubbed and a starting instruction in the virtual space sent by the server, playing the received unmanned audio/video corresponding to the audio/video to be dubbed.
15. An audio/video processing device, applied to a server, comprising:
the system comprises a first dubbing instruction acquisition module, a second dubbing instruction acquisition module and a first control module, wherein the first dubbing instruction acquisition module is configured to execute and acquire a dubbing instruction sent by a first electronic device in a virtual space, and the first electronic device is an electronic device with live broadcast authority in the virtual space;
a first preset dubbing type determining module configured to determine a preset dubbing type corresponding to the dubbing instruction;
a first to-be-dubbed video determining module configured to determine the audio/video to be dubbed;
a first unmanned audio video playing module configured to play, according to the preset dubbing type, the unmanned audio video corresponding to the audio/video to be dubbed when a dubbing start instruction sent by the first electronic device is acquired, wherein the unmanned audio video is obtained by processing the audio/video to be dubbed;
a first dubbing audio sending module configured to acquire dubbing audio corresponding to the unmanned audio video in the process of playing the unmanned audio video, and send the dubbing audio to a second electronic device, wherein the second electronic device is an electronic device having a live broadcast watching right in the virtual space;
the audio and video processing device further comprises a first unmanned audio and video determining module and a second unmanned audio and video determining module;
the first unmanned audio video determining module comprises:
a first amplitude spectrum determining submodule configured to determine an amplitude spectrum corresponding to the audio signal of the audio/video to be dubbed;
a first human voice mask matrix determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is trained based on pre-acquired amplitude spectrum samples and the human voice mask matrices corresponding to the amplitude spectrum samples, and the network model comprises a correspondence between amplitude spectra and human voice mask matrices;
a first unmanned sound amplitude spectrum determining submodule configured to calculate an unmanned sound amplitude spectrum by using the human voice mask matrix and the amplitude spectrum;
a first unmanned audio video determining submodule configured to determine the unmanned audio video corresponding to the audio/video to be dubbed based on the unmanned sound amplitude spectrum;
the second unmanned audio video determining module comprises:
a second amplitude spectrum determining submodule configured to determine an amplitude spectrum corresponding to the audio signal of the audio/video to be dubbed;
a first unmanned audio determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio/video to be dubbed, wherein the network model is trained based on pre-acquired amplitude spectrum samples and the unmanned audio corresponding to the samples, and the network model comprises a correspondence between amplitude spectra and unmanned audio;
a second unmanned audio video determining submodule configured to determine the unmanned audio video corresponding to the audio/video to be dubbed based on the unmanned audio.
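The claims state only that the network model is trained on amplitude spectrum samples paired with human voice mask matrices; they do not say how those target matrices are built. One common construction (an assumption here, not taken from the patent) is the ideal ratio mask computed from separated vocal and accompaniment stems:

```python
import numpy as np

def ideal_ratio_mask(voice_amplitude, accompaniment_amplitude, eps=1e-12):
    """Build a training-target human voice mask from separated stems.

    Both inputs are amplitude (magnitude) spectra of the same shape, e.g.
    |STFT| of the isolated vocal and of the accompaniment. Each returned
    value is the fraction of each bin's magnitude attributable to the voice.
    eps avoids division by zero in silent bins.
    """
    return voice_amplitude / (voice_amplitude + accompaniment_amplitude + eps)

# Toy three-bin example: voice dominates bin 0, is absent in bin 1.
voice = np.array([3.0, 0.0, 1.0])
accomp = np.array([1.0, 2.0, 1.0])
mask = ideal_ratio_mask(voice, accomp)
# Applying (1 - mask) to the mixture magnitude (voice + accomp here)
# approximately recovers the accompaniment magnitude.
```

This pairing is exactly what the "correspondence between amplitude spectra and human voice mask matrices" in the claim requires: the mixture's amplitude spectrum is the input sample, the mask is the supervised target.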
16. The apparatus of claim 15, wherein the preset dubbing type is an anchor performance type;
the first unmanned audio video playing module comprises:
a first unmanned audio video playing sub-module configured to control the first electronic device and the second electronic device to simultaneously play the unmanned audio video corresponding to the audio/video to be dubbed.
17. The apparatus of claim 15, wherein the preset dubbing type is a multi-anchor fight type;
the first unmanned audio video playing module comprises:
a fight sequence determining submodule configured to determine a fight sequence corresponding to the first electronic device of each anchor, wherein the fight sequence indicates the order in which each anchor dubs;
a second unmanned audio video playing sub-module configured to control the first electronic devices and the corresponding second electronic devices to sequentially play the unmanned audio video corresponding to the audio/video to be dubbed according to the fight sequence, acquire the voice signal provided by the first electronic device of each anchor when that anchor dubs according to the fight sequence, and send the voice signal as dubbing audio to the other first electronic devices and the second electronic devices corresponding to all the first electronic devices.
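The fight-sequence control above amounts to a turn-taking protocol: for each anchor in the fight sequence, capture that anchor's dubbing voice over the voice-free clip and fan it out to the other anchors' devices and to every viewer device. A hypothetical in-memory sketch (device ids and the `capture_voice`/`deliver` callbacks are illustrative, not from the patent):

```python
def run_fight(anchors, fight_sequence, capture_voice, deliver):
    """Turn-taking dubbing battle.

    anchors:        mapping anchor_id -> list of that anchor's viewer device ids
    fight_sequence: anchor ids in dubbing order
    capture_voice:  callable(anchor_id) -> dubbing audio for that anchor's turn
    deliver:        callable(device_id, anchor_id, audio) forwarding the audio
    """
    for anchor_id in fight_sequence:
        audio = capture_voice(anchor_id)       # anchor dubs over the voice-free video
        for other in anchors:                  # forward to the other anchors' devices
            if other != anchor_id:
                deliver(other, anchor_id, audio)
        for viewers in anchors.values():       # and to every anchor's viewer devices
            for device in viewers:
                deliver(device, anchor_id, audio)

log = []
anchors = {"anchor_a": ["viewer_1"], "anchor_b": ["viewer_2"]}
run_fight(anchors, ["anchor_a", "anchor_b"],
          capture_voice=lambda a: f"voice-of-{a}",
          deliver=lambda dev, a, audio: log.append((dev, a, audio)))
```

The fight sequence fully serializes the turns, so every delivery from the first anchor's turn precedes every delivery from the second.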
18. The apparatus of claim 15, wherein the preset dubbing type is a multi-person dubbing type;
the first unmanned audio video playing module comprises:
a third unmanned audio video playing sub-module configured to control each second electronic device corresponding to a user in an instant messaging area in the virtual space to simultaneously play the unmanned audio video corresponding to the audio/video to be dubbed.
19. The apparatus of claim 18, wherein the third unmanned audio video playing sub-module comprises:
a first unmanned audio video playing unit configured to send, when a broadcast message sent by the first electronic device is acquired, the audio/video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space, so that each second electronic device simultaneously plays the unmanned audio video corresponding to the audio/video to be dubbed upon receiving the start instruction.
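The synchronized start in claim 19 can be sketched as a server-side fan-out: on receiving the anchor's broadcast message, the server pushes the video reference plus a start instruction to every viewer device in the instant messaging area, and each device begins playback on receipt. Message shapes and names below are assumptions for illustration:

```python
import time

def broadcast_start(im_area_devices, video_id, send):
    """Fan out the to-be-dubbed video reference and a start instruction.

    im_area_devices: device ids of users in the instant messaging area
    video_id:        identifier of the audio/video to be dubbed
    send:            callable(device_id, message) -- the transport
    """
    start_at = time.time() + 1.0   # small lead time so all devices start together
    message = {"type": "start", "video": video_id, "start_at": start_at}
    for device in im_area_devices:
        send(device, message)
    return message

inbox = {}
msg = broadcast_start(["dev1", "dev2", "dev3"], "clip-42",
                      send=lambda d, m: inbox.setdefault(d, m))
```

Sending one shared `start_at` rather than "play now" absorbs per-device delivery jitter, which is what makes the simultaneous playback of the claim achievable in practice.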
20. The apparatus of any of claims 15-19, wherein the first to-be-dubbed video determination module comprises:
a first video acquisition sub-module configured to acquire a video uploaded by the first electronic device;
a first to-be-dubbed video determining submodule configured to determine the uploaded video as the video to be dubbed.
21. An audio/video processing apparatus, applied to a first electronic device, wherein the first electronic device is an electronic device having a live broadcast right in a virtual space, the apparatus comprising:
a second dubbing instruction acquisition module configured to acquire a dubbing instruction in the virtual space;
a second preset dubbing type determining module configured to determine a preset dubbing type corresponding to the dubbing instruction;
a second to-be-dubbed video determining module configured to determine the audio/video to be dubbed;
a second unmanned audio video playing module configured to play, according to the preset dubbing type, the unmanned audio video corresponding to the audio/video to be dubbed when a dubbing start instruction is acquired, wherein the unmanned audio video is obtained by processing the audio/video to be dubbed;
a second dubbing audio sending module configured to acquire dubbing audio corresponding to the unmanned audio video in the process of playing the unmanned audio video and send the dubbing audio to a server, wherein the dubbing audio is to be played on a second electronic device, and the second electronic device is an electronic device having a live broadcast watching right in the virtual space;
The audio and video processing device further comprises a third unmanned audio and video determining module and a fourth unmanned audio and video determining module;
the third unmanned audio video determining module comprises:
a third amplitude spectrum determining sub-module configured to determine an amplitude spectrum corresponding to the audio signal of the audio/video to be dubbed;
a second human voice mask matrix determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is trained based on pre-acquired amplitude spectrum samples and the human voice mask matrices corresponding to the samples, and the network model comprises a correspondence between amplitude spectra and human voice mask matrices;
a second unmanned sound amplitude spectrum determining submodule configured to calculate an unmanned sound amplitude spectrum by using the human voice mask matrix and the amplitude spectrum;
a third unmanned audio video determining submodule configured to determine the unmanned audio video corresponding to the audio/video to be dubbed based on the unmanned sound amplitude spectrum;
the fourth unmanned audio video determining module comprises:
a fourth amplitude spectrum determining sub-module configured to determine an amplitude spectrum corresponding to the audio signal of the audio/video to be dubbed;
a second unmanned audio determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio/video to be dubbed, wherein the network model is trained based on pre-acquired amplitude spectrum samples and the unmanned audio corresponding to the samples, and the network model comprises a correspondence between amplitude spectra and unmanned audio;
a fourth unmanned audio video determining submodule configured to determine the unmanned audio video corresponding to the audio/video to be dubbed based on the unmanned audio.
22. The apparatus of claim 21, wherein the preset dubbing type is an anchor performance type;
the second unmanned sound video playing module comprises:
a fourth unmanned audio video playing sub-module configured to play the unmanned audio video corresponding to the audio/video to be dubbed and control a second electronic device to simultaneously play the unmanned audio video corresponding to the audio/video to be dubbed, wherein the second electronic device is an electronic device having a live broadcast watching right in the virtual space.
23. The apparatus of claim 21, wherein the preset dubbing type is a multi-anchor fight type;
The second unmanned sound video playing module comprises:
a fight sequence determining submodule configured to determine a fight sequence corresponding to the first electronic device of each anchor, wherein the fight sequence indicates the order in which each anchor dubs;
a fifth unmanned audio video playing sub-module configured to control the first electronic devices and the corresponding second electronic devices to sequentially play the unmanned audio video corresponding to the audio/video to be dubbed according to the fight sequence, acquire a voice signal sent by an anchor when dubbing according to the fight sequence, and send the voice signal to a server, wherein the voice signal serves as dubbing audio played by the other first electronic devices and the second electronic devices corresponding to all the first electronic devices.
24. The apparatus of claim 21, wherein the preset dubbing type is a multi-person dubbing type;
the second unmanned sound video playing module comprises:
a sixth unmanned audio video playing sub-module configured to control each second electronic device corresponding to a user in an instant messaging area in the virtual space to simultaneously play the unmanned audio video corresponding to the audio/video to be dubbed.
25. The apparatus of claim 24, wherein the sixth unmanned audio video playing sub-module comprises:
a second unmanned audio video playing unit configured to send a broadcast message to the server so that the server sends the audio/video to be dubbed and a start instruction to each second electronic device corresponding to a user in the instant messaging area in the virtual space, and each second electronic device simultaneously plays the unmanned audio video corresponding to the audio/video to be dubbed upon receiving the start instruction.
26. The apparatus of any of claims 21-25, wherein the second to-be-dubbed video determining module comprises:
a second video acquisition sub-module configured to acquire a video uploaded by the user;
a second to-be-dubbed video determining submodule configured to determine the uploaded video as the video to be dubbed.
27. An audio/video processing apparatus, applied to a second electronic device, wherein the second electronic device is an electronic device having a live broadcast watching right in a virtual space, the apparatus comprising:
a third unmanned audio video playing module configured to play, when a dubbing start instruction in the virtual space is acquired, the pre-acquired unmanned audio video corresponding to the audio/video to be dubbed, wherein the unmanned audio video is obtained by processing the audio/video to be dubbed;
a dubbing audio playing module configured to play dubbing audio corresponding to the unmanned audio video when the dubbing audio is acquired in the process of playing the unmanned audio video, wherein the dubbing audio is a voice signal provided by a first electronic device;
the audio and video processing device further comprises a fifth unmanned audio and video determining module and a sixth unmanned audio and video determining module;
the fifth unmanned audio video determining module comprises:
a fifth amplitude spectrum determining sub-module configured to determine an amplitude spectrum corresponding to the audio signal of the audio/video to be dubbed;
a third human voice mask matrix determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain a human voice mask matrix corresponding to the audio/video to be dubbed, wherein the network model is trained based on pre-acquired amplitude spectrum samples and the human voice mask matrices corresponding to the samples, and the network model comprises a correspondence between amplitude spectra and human voice mask matrices;
a third unmanned sound amplitude spectrum determining sub-module configured to calculate an unmanned sound amplitude spectrum by using the human voice mask matrix and the amplitude spectrum;
a fifth unmanned audio video determining submodule configured to determine the unmanned audio video corresponding to the audio/video to be dubbed based on the unmanned sound amplitude spectrum;
the sixth unmanned audio video determining module comprises:
a sixth amplitude spectrum determining submodule configured to determine an amplitude spectrum corresponding to the audio signal of the audio/video to be dubbed;
a third unmanned audio determining submodule configured to input the amplitude spectrum into a pre-trained network model to obtain unmanned audio corresponding to the audio/video to be dubbed, wherein the network model is trained based on pre-acquired amplitude spectrum samples and the unmanned audio corresponding to the samples, and the network model comprises a correspondence between amplitude spectra and unmanned audio;
a sixth unmanned audio video determining submodule configured to determine the unmanned audio video corresponding to the audio/video to be dubbed based on the unmanned audio.
28. The apparatus of claim 27, wherein the third unmanned audio video playing module comprises:
a seventh unmanned audio video playing sub-module configured to play the unmanned audio video corresponding to the received audio/video to be dubbed when receiving the audio/video to be dubbed and a start instruction in the virtual space sent by the server.
29. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio/video processing method of any one of claims 1 to 6.
30. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio/video processing method of any one of claims 7 to 12 or 13 to 14.
31. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio/video processing method of any one of claims 1 to 6, 7 to 12, or 13 to 14.
CN201910641537.9A 2019-07-16 2019-07-16 Audio and video processing method and device, electronic equipment and storage medium Active CN110392273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910641537.9A CN110392273B (en) 2019-07-16 2019-07-16 Audio and video processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110392273A CN110392273A (en) 2019-10-29
CN110392273B true CN110392273B (en) 2023-08-08

Family

ID=68284991

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640442B (en) * 2020-06-01 2023-05-23 北京猿力未来科技有限公司 Method for processing audio packet loss, method for training neural network and respective devices
CN112261435B (en) * 2020-11-06 2022-04-08 腾讯科技(深圳)有限公司 Social interaction method, device, system, equipment and storage medium
CN112954377B (en) * 2021-02-04 2023-07-28 广州繁星互娱信息科技有限公司 Live-broadcast fight picture display method, live-broadcast fight method and device

Citations (23)

Publication number Priority date Publication date Assignee Title
CN101261864A (en) * 2008-04-21 2008-09-10 中兴通讯股份有限公司 A method and system for mixing recording voice at a mobile terminal
US8010692B1 (en) * 2009-11-05 2011-08-30 Adobe Systems Incorporated Adapting audio and video content for hardware platform
CN102325173A (en) * 2011-08-30 2012-01-18 重庆抛物线信息技术有限责任公司 Mixed audio and video sharing method and system
CN102752499A (en) * 2011-12-29 2012-10-24 新奥特(北京)视频技术有限公司 System dubbing through dubbing-free workstation
CN104135667A (en) * 2014-06-10 2014-11-05 腾讯科技(深圳)有限公司 Video remote explanation synchronization method, terminal equipment and system
CN105847913A (en) * 2016-05-20 2016-08-10 腾讯科技(深圳)有限公司 Live video broadcast control method, mobile terminal and system
WO2016184295A1 (en) * 2015-05-19 2016-11-24 腾讯科技(深圳)有限公司 Instant messenger method, user equipment and system
CN106534618A (en) * 2016-11-24 2017-03-22 广州爱九游信息技术有限公司 Method, device and system for realizing pseudo field interpretation
WO2017181594A1 (en) * 2016-04-19 2017-10-26 乐视控股(北京)有限公司 Video display method and apparatus
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN107484016A (en) * 2017-09-05 2017-12-15 深圳Tcl新技术有限公司 Video dubs switching method, television set and computer-readable recording medium
CN107492383A (en) * 2017-08-07 2017-12-19 上海六界信息技术有限公司 Screening technique, device, equipment and the storage medium of live content
WO2018018482A1 (en) * 2016-07-28 2018-02-01 北京小米移动软件有限公司 Method and device for playing sound effects
WO2018095219A1 (en) * 2016-11-24 2018-05-31 腾讯科技(深圳)有限公司 Media information processing method and device
CN108668151A (en) * 2017-03-31 2018-10-16 腾讯科技(深圳)有限公司 Audio/video interaction method and device
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109151565A (en) * 2018-09-04 2019-01-04 北京达佳互联信息技术有限公司 Play method, apparatus, electronic equipment and the storage medium of voice
CN109151592A (en) * 2018-09-21 2019-01-04 广州华多网络科技有限公司 Connect the interactive approach, device and server of wheat across channel
CN109361930A (en) * 2018-11-12 2019-02-19 广州酷狗计算机科技有限公司 Method for processing business, device and computer readable storage medium
CN109361954A (en) * 2018-11-02 2019-02-19 腾讯科技(深圳)有限公司 Method for recording, device, storage medium and the electronic device of video resource
CN109587509A (en) * 2018-11-27 2019-04-05 广州市百果园信息技术有限公司 Live-broadcast control method, device, computer readable storage medium and terminal
CN109710798A (en) * 2018-12-28 2019-05-03 北京金山安全软件有限公司 Music performance evaluation method and device
CN109830245A (en) * 2019-01-02 2019-05-31 北京大学 A kind of more speaker's speech separating methods and system based on beam forming

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US9113132B2 (en) * 2009-07-13 2015-08-18 Genesys Telecommunications Laboratories, Inc. System and methods for recording a compressed video and audio stream
US20140098715A1 (en) * 2012-10-09 2014-04-10 Tv Ears, Inc. System for streaming audio to a mobile device using voice over internet protocol
US20140143218A1 (en) * 2012-11-20 2014-05-22 Apple Inc. Method for Crowd Sourced Multimedia Captioning for Video Content
CN105740029B (en) * 2016-03-03 2019-07-05 腾讯科技(深圳)有限公司 A kind of method, user equipment and system that content is presented

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant