WO2020177483A1 - Audio and video processing method and apparatus, electronic device, and storage medium

Audio and video processing method and apparatus, electronic device, and storage medium

Info

Publication number: WO2020177483A1
Application number: PCT/CN2020/070597
Authority: WO (WIPO, PCT)
Other languages: English (en), French (fr)
Inventors: 赵鹏, 马达, 黄新华
Original assignee: 苏州臻迪智能科技有限公司
Priority claimed from CN201910155598.4A external-priority patent/CN110022449A/zh
Priority claimed from CN201910850136.4A external-priority patent/CN110691204B/zh
Priority claimed from CN201910850137.9A external-priority patent/CN110691218B/zh
Application filed by 苏州臻迪智能科技有限公司
Publication of WO2020177483A1 publication Critical patent/WO2020177483A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording

Definitions

  • the present disclosure relates to the technical field of audio and video processing, and in particular to an audio and video processing method, device, electronic equipment, and storage medium.
  • smart devices are used to record videos during events such as a live broadcast of a party.
  • the sound recording effect of the smart device is poor.
  • smart devices will also produce noise when they are working. If a smart device is used for voice recording, the recorded sound will be mixed with the noise generated by the smart device.
  • the embodiments of the present disclosure provide an audio and video processing method, device, electronic equipment, and storage medium, to solve the prior-art problem that audio and video files synthesized from audio information and video information both recorded by a smart device have poor quality.
  • the embodiments of the present disclosure provide an audio and video processing method applied to a smart device, and the method includes:
  • receiving a control message sent by the first terminal; and recording video information according to the control message, so that an audio and video file is synthesized according to the video information and the audio information recorded by the first terminal.
  • optionally, the control message carries time information; if the time information is information of a delay duration, recording video information includes: after receiving the control message, the smart device waits for the delay duration and then records the video information.
  • optionally, the control message carries time information; if the time information is a time point for video recording, recording video information includes: the smart device records the video information when the time point is reached.
  • optionally, the method further includes: receiving the audio information sent by the first terminal; and synthesizing the audio information and the video information into an audio and video file.
  • optionally, the method further includes: sending the video information to the first terminal, so that the first terminal synthesizes an audio and video file according to the audio information recorded by itself and the received video information.
  • optionally, the method further includes: sending the video information to a second terminal, so that the second terminal synthesizes an audio and video file according to the audio information recorded by the first terminal and the video information.
  • optionally, the method further includes:
  • the text information corresponding to the video information is obtained, and the video information, the audio information, and the text information are synthesized into an audio and video file with subtitles.
  • optionally, the video information, the audio information, and the text information all include first time information; synthesizing the video information, the audio information, and the text information into an audio and video file with subtitles includes:
  • aligning the video information, the audio information, and the text information according to the first time information, and synthesizing them into an audio and video file with subtitles.
  • optionally, the video information includes a person, and synthesizing the video information, the audio information, and the text information into an audio and video file with subtitles includes:
  • recognizing the mouth-shape (lip) change feature of the person in the video information and obtaining the text corresponding to the lip change feature; and synthesizing the video information, audio information, and text information into the audio and video file with subtitles according to the text corresponding to the lip change feature.
  • optionally, the obtaining text information corresponding to the audio information includes: receiving the text information generated by the first terminal according to the audio information, or generating the corresponding text information according to the audio information.
  • the synthesizing an audio and video file according to the video information and the audio information recorded by the first terminal includes:
  • the video information that has been buffered for the first time length in the storage device and the corresponding audio information are synthesized into an audio and video file.
  • the audio information includes at least one first time stamp and audio content information corresponding to each first time stamp;
  • the video information includes at least one second time stamp and video content information corresponding to each second time stamp.
  • the synthesizing the video information and the corresponding audio information that have been buffered for the first time length in the storage device into an audio and video file includes: matching the first time stamps with the second time stamps, and synthesizing the audio content information and the video content information whose time stamps match into the audio and video file.
  • optionally, the method further includes: if the smart device is disconnected from the first terminal for a second time length and then reconnected, receiving the audio information corresponding to the second time length sent by the first terminal, where the first time length is the maximum buffer time length of the smart device, the second time length is less than or equal to the first time length, and the audio information corresponding to the video information buffered for the first time length includes the audio information corresponding to the second time length;
  • if the smart device is disconnected from the first terminal for a third time length and then reconnected, receiving the audio information corresponding to the most recent first time length within the third time length sent by the first terminal, where the third time length is greater than the first time length;
  • optionally, the audio information includes audio content information and audio redundant data, where the audio redundant data is obtained by encoding the audio content information (an illustrative packetization sketch follows this item).
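  • As a minimal illustration of carrying audio redundant data, the sketch below assumes a simple packetization scheme (not specified by the disclosure) in which each packet carries the current audio frame plus a copy of the previous frame, so a single lost packet can be recovered from its successor:

```python
from dataclasses import dataclass

@dataclass
class AudioPacket:
    seq: int           # sequence number (stands in for the first time stamp)
    content: bytes     # audio content information for this frame
    redundant: bytes   # copy of the previous frame's content (redundant data)

def packetize(frames: list[bytes]) -> list[AudioPacket]:
    packets, prev = [], b""
    for i, frame in enumerate(frames):
        packets.append(AudioPacket(seq=i, content=frame, redundant=prev))
        prev = frame
    return packets

def recover(received: dict[int, AudioPacket], total: int) -> list[bytes]:
    """Rebuild the frame sequence, recovering a lost frame i from packet i+1."""
    frames = []
    for i in range(total):
        if i in received:
            frames.append(received[i].content)
        elif (i + 1) in received:          # recover from the next packet's redundancy
            frames.append(received[i + 1].redundant)
        else:
            frames.append(b"")             # unrecoverable gap
    return frames
```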
  • the embodiment of the present disclosure provides an audio and video processing method applied to a first terminal, and the method includes:
  • sending, to the smart device, a control message for controlling the smart device to record video information; and recording audio information according to the control message, so that an audio and video file is synthesized according to the audio information and the video information recorded by the smart device.
  • optionally, the control message carries time information; if the time information is information of a delay duration, recording audio information includes: after sending the control message, the first terminal waits for the delay duration and then records the audio information.
  • optionally, the control message carries time information; if the time information is a time point for audio recording, recording audio information includes: the first terminal records the audio information when the time point is reached.
  • optionally, the method further includes: receiving the video information sent by the smart device; and synthesizing the audio information and the video information into an audio and video file.
  • optionally, the method further includes: sending the audio information to the smart device, so that the smart device synthesizes an audio and video file according to the video information recorded by itself and the received audio information.
  • optionally, the method further includes: sending the audio information to the second terminal, so that the second terminal synthesizes an audio and video file according to the video information recorded by the smart device and the audio information.
  • optionally, the method further includes: generating corresponding text information according to the audio information, and synthesizing the video information, the audio information, and the text information into an audio and video file with subtitles.
  • optionally, the video information, the audio information, and the text information all include first time information; synthesizing the video information, the audio information, and the text information into an audio and video file with subtitles includes:
  • aligning the video information, the audio information, and the text information according to the first time information, and synthesizing them into an audio and video file with subtitles.
  • optionally, the audio information includes audio content information, and the method further includes: sending the audio information including the audio content information to the smart device.
  • optionally, the sending audio information including audio content information to the smart device includes: if the first terminal is disconnected from the smart device for a second time length and then reconnects, sending the audio information corresponding to the second time length to the smart device, where the second time length is less than or equal to the first time length; and if the first terminal is disconnected from the smart device for a third time length and then reconnects, sending the latest audio information corresponding to the first time length to the smart device, where the third time length is greater than the first time length.
  • optionally, the sending audio information including audio content information to the smart device includes: sending the audio information including the audio content information and audio redundant data to the smart device (a buffering sketch follows this item).
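  • One plausible realization of this reconnect-and-resend behavior is a bounded buffer on the first terminal that always holds the most recent "first time length" of audio; the sketch below assumes fixed-size frames, and the class and parameter names are illustrative:

```python
from collections import deque

class AudioResendBuffer:
    """Keeps the most recent `max_seconds` of audio frames (the 'first time length')."""

    def __init__(self, max_seconds: float, frame_seconds: float = 0.02):
        self.frame_seconds = frame_seconds
        self.frames = deque(maxlen=int(max_seconds / frame_seconds))

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)  # oldest frames fall out automatically

    def frames_for_gap(self, gap_seconds: float) -> list[bytes]:
        """Frames to resend after a disconnection lasting `gap_seconds`.

        If the gap is shorter than the buffer, resend exactly the gap
        (the 'second time length' case); otherwise resend the whole buffer,
        i.e. the latest first time length (the 'third time length' case)."""
        wanted = int(gap_seconds / self.frame_seconds)
        if wanted <= 0:
            return []
        frames = list(self.frames)
        return frames[-wanted:] if wanted < len(frames) else frames
```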
  • the embodiment of the present disclosure provides an audio and video processing method applied to a second terminal, and the method includes:
  • receiving the video information recorded by the smart device and the audio information recorded by the first terminal; and synthesizing the video information and the audio information into an audio and video file.
  • optionally, text information corresponding to the audio information is obtained, and the video information, audio information, and text information are combined into an audio and video file with subtitles.
  • optionally, the video information, the audio information, and the text information all include first time information; synthesizing the video information, the audio information, and the text information into an audio and video file with subtitles includes:
  • aligning the video information, the audio information, and the text information according to the first time information, and synthesizing them into an audio and video file with subtitles.
  • obtaining text information corresponding to the audio information includes:
  • the text information is generated by the first terminal according to the audio information.
  • the second terminal generates the corresponding text information according to the audio information.
  • the text information is generated by the smart device according to the video information.
  • the second terminal generates corresponding text information according to the video information.
  • the embodiment of the present disclosure provides an audio and video processing device, which is applied to a smart device, and the device includes:
  • a receiving and sending module for receiving a control message sent by the first terminal
  • the recording module is configured to record video information according to the control message, so that an audio and video file is synthesized according to the video information and the audio information recorded by the first terminal.
  • the embodiment of the present disclosure provides an audio and video processing device applied to a first terminal, and the device includes:
  • the receiving and sending module is used to send to the smart device a control message for controlling the smart device to record video information
  • the recording module is configured to record audio information according to the control message, so that an audio and video file is synthesized according to the audio information and the video information recorded by the smart device.
  • the present disclosure provides an electronic device, including: a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
  • a computer program is stored in the memory, and when the program is executed by the processor, the processor is caused to execute the steps of the method applied to the smart device.
  • the present disclosure provides an electronic device, including: a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
  • a computer program is stored in the memory, and when the program is executed by the processor, the processor is caused to execute the steps of the method applied to the first terminal.
  • the present disclosure provides an electronic device, including: a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
  • a computer program is stored in the memory, and when the program is executed by the processor, the processor is caused to execute the steps of the method applied to the second terminal.
  • the embodiments of the present disclosure provide a computer-readable storage medium that stores a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device executes the steps of the above method applied to the smart device.
  • the embodiments of the present disclosure provide a computer-readable storage medium that stores a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device executes the steps of the above method applied to the first terminal.
  • the embodiment of the present disclosure provides a computer-readable storage medium that stores a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device executes the steps of the above method applied to the second terminal.
  • the embodiments of the present disclosure provide an audio and video processing method, device, electronic equipment, and storage medium.
  • the method includes: receiving a control message sent by a first terminal; and recording video information according to the control message, so that the video information and the audio information recorded by the first terminal are combined into an audio and video file. In the embodiments of the present disclosure, after the smart device receives the control message sent by the first terminal, it records the video information according to the control message, while the audio information is recorded by the first terminal; the recorded audio is therefore not degraded by the smart device, and the quality of the synthesized audio and video file is guaranteed.
  • FIG. 1 is a schematic flowchart of an audio and video processing method provided by an embodiment of the disclosure
  • FIG. 2 is a schematic flowchart of an audio and video processing method provided by an embodiment of the disclosure
  • FIG. 3 is a schematic diagram of a process of synthesizing audio and video files by a smart device according to an embodiment of the disclosure
  • FIG. 4 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the disclosure.
  • FIG. 5 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the disclosure.
  • FIG. 6 is an electronic device provided by an embodiment of the disclosure.
  • FIG. 7 is an electronic device provided by an embodiment of the disclosure.
  • FIG. 8 is a schematic diagram of a scene for synthesizing audio and video files with subtitles according to an embodiment of the disclosure
  • FIG. 9 is a schematic flowchart of an audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 10 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 11 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 12 is a signaling interaction diagram of another audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 13 is a schematic diagram of another scene of synthesis of audio and video files with subtitles according to an embodiment of the disclosure.
  • FIG. 14 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 15 is a schematic diagram of another scene of synthesis of audio and video files with subtitles provided by an embodiment of the disclosure.
  • FIG. 16 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 17 is a signaling interaction diagram of another audio and video processing method provided by an embodiment of the disclosure.
  • FIG. 18 is a schematic flowchart of a specific implementation manner of an audio and video processing method provided by an embodiment of this application.
  • FIG. 19 is a schematic flowchart of specific steps of S130 in FIG. 18;
  • FIG. 21 is a schematic flowchart of specific steps of S2220 in FIG. 20.
  • a smart device collects video information.
  • the smart device may be any device, appliance, or machine with a video collection function.
  • the smart device in the present disclosure may include devices with self-detection and self-diagnosis.
  • the smart device in the present disclosure may be provided with a communication module, and the communication module may communicate with the terminal or another smart device.
  • the foregoing communication mode may be a communication mode such as WIFI, infrared, Bluetooth, 4G, or 5G, and the embodiments of the present disclosure are not limited thereto.
  • smart devices in the embodiments of the present disclosure include, but are not limited to, drones, unmanned vehicles, unmanned ships, handheld DVs, robots, and so on. In the following embodiments, a drone is taken as an example of the smart device for description.
  • a drone with a camera function can be used to record a program on the performance stage to obtain a better shooting perspective, or a drone can be used to fly in the air to record landscape scenery.
  • as the drone is in flight, its rotating wings and engine make sounds; if a drone is used to record video and collect audio at the same time, the sound of the drone itself will also be recorded.
  • in addition, the drone may be far away from the sound source, so that in the recorded audio the noise dominates and the sound of the sound source is rather faint.
  • the embodiments of the present disclosure provide an audio and video processing method.
  • the method uses a first terminal to collect audio information and a smart device such as a drone to collect video information, and then synthesizes the audio information and the video information into an audio and video file. Since the sound is collected by the first terminal, the noise generated by smart devices such as drones is avoided, high-quality audio information can be recorded, and the effect of the synthesized audio and video file is ensured.
  • FIG. 1 is a schematic diagram of a process of an audio and video processing method provided by an embodiment of the disclosure. The process includes the following steps performed by a smart device:
  • S101 Receive a control message sent by the first terminal.
  • the smart device is an unmanned aerial vehicle as an example for illustration.
  • the unmanned aerial vehicle can exchange information with the first terminal; a communication module is provided in the drone, and information interaction with the first terminal is realized through this communication module.
  • the first terminal can be a mobile phone, a tablet, a smart wearable device, etc. In order to be able to send a control message to the drone, an APP (Application) for controlling the drone can be pre-installed on the first terminal. Specifically, the APP is provided with a control button, and when it detects that the control button is pressed, the APP sends a control message to the drone.
  • the first terminal and the drone can agree on the control message format in advance. When it is detected that the control button is pressed, the APP on the first terminal sends a message in the preset format to the drone; upon receiving a message in this format, the drone confirms that the control message sent by the first terminal has been received and responds to it.
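  • The disclosure does not fix a concrete wire format for the control message; the JSON layout and field names in the sketch below are illustrative assumptions only:

```python
import json

# Hypothetical control message in a JSON-based preset format; the field
# names ("action", "delay_s", "start_at") are illustrative assumptions.
def build_control_message(delay_s=None, start_at=None) -> bytes:
    msg = {"action": "start_video_recording"}
    if delay_s is not None:
        msg["delay_s"] = delay_s      # delay-duration variant
    if start_at is not None:
        msg["start_at"] = start_at    # absolute time-point variant, e.g. "08:00"
    return json.dumps(msg).encode("utf-8")

# e.g. build_control_message(delay_s=3) for "start recording in 3 seconds"
```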
  • S102 Record video information according to the control message, so that an audio and video file is synthesized according to the video information and the audio information recorded by the first terminal.
  • the first terminal can record audio information through its own recording device or through a separate audio recording device; for example, when the first terminal is a mobile phone or tablet, audio can be recorded directly by the mobile phone or tablet, or through a separate device such as a microphone or earphone connected to the mobile phone or tablet.
  • the earphone can be connected to the mobile phone or tablet via wireless means such as Bluetooth or WiFi or wired means.
  • the drone is pre-configured with image acquisition equipment with video recording function, such as a pan-tilt camera.
  • the first terminal and the drone can agree on the control message format in advance; when it is detected that the control button is pressed, the APP on the first terminal sends a message in the preset format to the drone. Upon receiving a message in this format, the drone confirms that the control message sent by the first terminal has been received and records the video information; specifically, the video information is recorded through the image acquisition device in the drone.
  • after the drone receives the control message sent by the first terminal, it records the video information according to the control message, while the audio information is recorded by the first terminal; the recorded audio is therefore not degraded, which guarantees the effect of the synthesized audio and video files.
  • a smart device can be a multifunctional product capable of video capture in a broad sense, and can be transformed from one device form to another by deforming it or by adding or removing accessories; for example, a UAV with flying functions plus a wrist strap or bracket becomes a handheld camera, and with other accessories becomes land or surface equipment.
  • the smart device may also be designed for waterproofing, heat insulation, frost and snow protection, etc. The present disclosure does not limit this.
  • in some embodiments, the control message carries time information; if the time information is information of a delay duration, recording video information includes: after receiving the control message, the smart device waits for the delay duration and then records the video information.
  • in order to realize synchronous recording of audio information and video information, after the first terminal sends the control message and the drone receives it, each side can be preset to wait for a certain period of time before recording the video information or audio information; therefore, the control message sent by the first terminal to the drone may carry time information for recording the video information, and the time information may be information of the delay duration the drone needs to wait.
  • the user can input the time information through the APP installed on the first terminal; for example, when inputting the time information, the user can select from the options provided in the APP, or enter it through keyboard, voice, etc.
  • the drone waits for the delay time to record the video information.
  • for example, the first terminal sends a control message with a delay of 3 seconds to the drone; the first terminal waits 3 seconds before recording audio information, and the drone waits 3 seconds after receiving the control message before recording video information, so as to realize synchronous recording of audio information and video information.
  • for another example, the first terminal sends a control message with a delay of 10 minutes to the drone; when the drone receives the control message, it starts timing and collects video information 10 minutes after receiving the control message.
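  • A minimal sketch of this delayed-start synchronization (the threading approach and the recorder callbacks are assumptions): both sides simply arm a timer for the agreed delay, the first terminal right after sending the control message and the drone right after receiving it:

```python
import threading

def start_after_delay(delay_s: float, start_recording) -> threading.Timer:
    """Arm recording to begin `delay_s` seconds from now.

    The first terminal calls this right after sending the control message;
    the drone calls it right after receiving the message, so both start
    within the message's transmission latency of each other."""
    timer = threading.Timer(delay_s, start_recording)
    timer.start()
    return timer

# first terminal: start_after_delay(3.0, record_audio)  # record_audio is assumed
# drone:          start_after_delay(3.0, record_video)  # record_video is assumed
```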
  • in some embodiments, the control message carries time information; if the time information is a time point for video recording, recording video information includes: the drone records the video information when the time point is reached.
  • the control message sent by the first terminal to the drone may carry time information for recording video information, and the time information may be the time point when the drone performs video recording.
  • the user can input the time information through the APP installed on the first terminal; for example, when inputting the time information, the user can select from the available time options provided in the APP, or enter it through keyboard, voice, etc. After the drone receives the control message carrying the time point for video recording, the drone records the video information once the time point is reached.
  • for example, the user enters 8:00 as the time point of video recording in the APP of the first terminal; the first terminal sends the drone a control message carrying the video recording time point of 8:00, and records audio information when 8:00 is reached. After receiving the control message, the drone records the video according to the time point carried in the control message, i.e., records the video information when 8:00 is reached, so as to realize synchronous recording of audio information and video information.
  • the time of video capture should be later than the time when the control message is sent.
  • there may be no time information in the control message; in this case, video collection starts immediately after the drone receives the control message. Similarly, in order to ensure synchronous recording between the first terminal and the drone, when the first terminal sends the control message and the drone starts to record video according to it, the first terminal also starts recording audio at the same time. A sketch of the time-point variant follows.
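  • A corresponding sketch for the absolute time-point variant, again with an assumed recorder callback; it only works as well as the two devices' clocks agree, which is why the time point should be later than the moment the control message is sent:

```python
import datetime
import threading

def start_at(time_point: datetime.datetime, start_recording) -> threading.Timer:
    """Begin recording when the wall clock reaches `time_point`.

    Both the first terminal and the drone schedule against the same
    time point carried in the control message."""
    delay_s = max(0.0, (time_point - datetime.datetime.now()).total_seconds())
    timer = threading.Timer(delay_s, start_recording)
    timer.start()
    return timer

# both sides: start_at(datetime.datetime(2020, 3, 1, 8, 0), record_media)
```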
  • in a possible implementation, the method further includes the following steps performed by the smart device: receiving the audio information sent by the first terminal, and synthesizing the audio information and the video information into an audio and video file.
  • the first terminal sends the recorded audio information to the drone; specifically, the first terminal transmits the recorded audio information to the drone through the communication module. The drone receives the audio information sent by the first terminal, and synthesizes the audio information and the video information recorded by itself into an audio and video file.
  • the method further includes the following steps performed by the smart device:
  • the video information is sent to the first terminal, so that the first terminal synthesizes an audio and video file according to the audio information recorded by itself and the received video information.
  • audio information and video information can be synthesized in the first terminal.
  • the drone sends the recorded video information to the first terminal; the first terminal receives the video information sent by the drone, and synthesizes the video information and the audio information recorded by itself into an audio and video file.
  • in a possible implementation, the method further includes the following steps performed by the smart device: sending the video information to a second terminal. The second terminal can be a mobile phone, a computer, a server, etc., and both the smart device and the first terminal can transmit information to it. Specifically, the smart device sends the recorded video information to the second terminal, the first terminal sends the recorded audio information to the second terminal, and the second terminal receives the video information and the audio information and synthesizes them into an audio and video file.
  • FIG. 2 is a process schematic diagram of an audio and video processing method applied to a first terminal according to an embodiment of the present disclosure. The process includes the following steps:
  • S201 Send a control message for controlling the drone to record video information to the drone.
  • the method provided in the embodiments of the present disclosure is applied to a first terminal, and the first terminal can perform information interaction with the drone.
  • the first terminal may be a user terminal with a sound pickup function, such as a mobile phone, a tablet, or a Bluetooth headset.
  • an APP for controlling the drone can be pre-installed on the first terminal; the APP is provided with a control button, and when it is detected that the control button is pressed, the APP sends a control message to the drone to control the drone to record video information.
  • the first terminal and the drone can agree on the control message format in advance; when it is detected that the control button is pressed, the APP on the first terminal sends a message in the preset format to the drone, and the drone thereby confirms that the control message sent by the first terminal has been received.
  • S202 Record audio information according to the control message, so that an audio and video file is synthesized according to the audio information and the video information recorded by the drone.
  • the first terminal and the drone can agree on the control message format in advance; the APP on the first terminal sends a message in the preset format to the drone, the drone records the video information accordingly, and the first terminal records the audio information according to the control message.
  • the first terminal may record audio information through voice recording software.
  • the first terminal can record audio information through its own recording device or through a separate audio recording device; for example, when the first terminal is a mobile phone or tablet, audio can be recorded directly by the mobile phone or tablet, or through a separate device such as a microphone or earphone connected to the mobile phone or tablet.
  • the earphone can be connected to the mobile phone or tablet via wireless means such as Bluetooth and WiFi or wired means.
  • after the first terminal sends the control message for controlling the drone to record video information, the first terminal records audio information according to the control message and the drone records the video information; the recorded audio is therefore not degraded, and the effect of the synthesized audio and video files is guaranteed.
  • in some embodiments, the control message carries time information; if the time information is information of a delay duration, recording audio information includes: after sending the control message, the first terminal waits for the delay duration and then records the audio information.
  • in order to realize synchronous recording of audio information and video information, after the first terminal sends the control message and the drone receives it, each side can be preset to wait for a certain period of time before recording the video information or audio information; therefore, the control message sent by the first terminal to the drone may carry time information for recording, and the time information may be information of the delay duration.
  • the user can input the time information through the APP installed on the first terminal; for example, when inputting the time information, the user can select from the options provided in the APP, or enter it through keyboard, voice, etc.
  • after the first terminal sends the control message carrying the information of the delay duration, it waits for the delay duration and then records the audio information; after the drone receives the control message carrying the information of the delay duration, it waits for the delay duration and then records the video information.
  • for example, the first terminal sends a control message with a delay of 3 seconds to the drone; the first terminal waits 3 seconds before recording audio information, and the drone waits 3 seconds after receiving the control message before recording video information, so as to realize synchronous recording of audio information and video information.
  • in some embodiments, the control message carries time information; if the time information is a time point for audio recording, recording audio information includes: the first terminal records the audio information when the time point is reached.
  • the control message sent by the first terminal to the drone may carry time information for recording video information, and the time information may be the time point when the drone performs video recording.
  • the user can input the time information through the APP installed on the first terminal; for example, when inputting the time information, the user can select from the available time options provided in the APP, or enter it through keyboard, voice, etc.
  • when the first terminal reaches this time point, the audio information is recorded.
  • after the drone receives the control message carrying the time point for video recording, the drone records the video information once the time point is reached.
  • for example, the user selects 8:00 as the time point for video recording in the APP of the first terminal; the first terminal sends the drone a control message carrying the video recording time point of 8:00, and records the audio information when 8:00 is reached. After receiving the control message, the drone records the video according to the time point carried in the control message, i.e., records the video information when 8:00 is reached, so as to realize synchronous recording of audio information and video information.
  • in a possible implementation, the method further includes the following steps executed by the first terminal: receiving the video information sent by the drone, and synthesizing the video information and the audio information recorded by the first terminal itself into an audio and video file. That is, audio information and video information can be synthesized in the first terminal.
  • in a possible implementation, the method further includes the following steps executed by the first terminal: sending the audio information to the drone, so that the drone synthesizes an audio and video file according to the video information recorded by itself and the received audio information.
  • audio information and video information can be synthesized in the drone.
  • the first terminal sends the recorded audio information to the drone, and the drone receives the audio information, and synthesizes the audio information with the video information recorded by itself into an audio and video file.
  • FIG. 3 taking the first terminal as a mobile phone and the smart device as a drone as an example, a schematic diagram of a process of synthesizing audio and video files by the drone is shown.
  • after the mobile phone sends a control message to the drone, it records audio information according to the control message, and encodes the audio information to generate an Advanced Audio Coding (AAC) format file or a Pulse Code Modulation (PCM) format file.
  • AAC format file or PCM format file is transmitted to the UAV via USB (Universal Serial Bus).
  • the drone receives the control message sent by the first terminal, it records video information according to the control message, and encodes the video information.
  • the drone synthesizes the received audio information and the video information it has recorded into an audio and video file, where the audio and video file is in MP4 format; the drone stores the generated audio and video file for subsequent playback and other operations. A muxing sketch is given below.
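  • A minimal sketch of this final muxing step, assuming the ffmpeg command-line tool is available on the device and that the recorded video is an H.264 elementary stream (both assumptions; the disclosure only specifies an MP4 output):

```python
import subprocess

def mux_to_mp4(video_path: str, audio_path: str, out_path: str) -> None:
    """Combine a recorded video stream and an AAC audio file into one MP4.

    `-c copy` remuxes without re-encoding; `-shortest` trims the output
    to the shorter of the two streams so audio and video stay matched."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
         "-c", "copy", "-shortest", out_path],
        check=True,
    )

# e.g. mux_to_mp4("flight.h264", "voice.aac", "clip.mp4")
```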
  • in a possible implementation, the method further includes the following steps executed by the first terminal: sending the audio information to a second terminal. The second terminal can be a mobile phone, a computer, a server, etc.; the drone sends the recorded video information to the second terminal, the first terminal sends the recorded audio information to the second terminal, and the second terminal receives the video information and the audio information and combines them into an audio and video file.
  • FIG. 4 is a structural diagram of an audio and video processing device applied to smart devices according to an embodiment of the disclosure, and the device includes:
  • the receiving and sending module 401 is configured to receive a control message sent by the first terminal.
  • the recording module 402 is configured to record video information according to the control message, so that an audio and video file is synthesized according to the video information and the audio information recorded by the first terminal.
  • the recording module 402 is specifically configured to: when the control message carries time information and the time information is information of a delay duration, wait for the delay duration after the control message is received and then record the video information.
  • the recording module 402 is specifically configured to: when the control message carries time information and the time information is a time point for video recording, record the video information when the time point is reached.
  • the receiving and sending module 401 is further configured to receive audio information sent by the first terminal.
  • the device also includes:
  • the synthesis module 403 is used for synthesizing the audio information and the video information into an audio and video file.
  • the receiving and sending module 401 is further configured to send the video information to the first terminal, so that the first terminal synthesizes audio and video according to the audio information recorded by itself and the received video information file.
  • the receiving and sending module 401 is further configured to send the video information to a second terminal, so that the second terminal synthesizes an audio and video file according to the audio information recorded by the first terminal and the video information .
  • FIG. 5 is a structural diagram of an audio and video processing device applied to a first terminal according to an embodiment of the disclosure, and the device includes:
  • the receiving and sending module 501 is configured to send a control message for controlling the smart device to record video information to the smart device.
  • the recording module 502 is configured to record audio information according to the control message, so that an audio and video file is synthesized according to the audio information and the video information recorded by the smart device.
  • the recording module 502 is specifically configured to: when the control message carries time information and the time information is information of a delay duration, wait for the delay duration after the control message is sent and then record the audio information.
  • the recording module 502 is specifically configured to: when the control message carries time information and the time information is a time point for audio recording, record the audio information when the time point is reached.
  • the receiving and sending module 501 is further configured to receive video information sent by the smart device.
  • the device also includes:
  • the synthesis module 503 is used for synthesizing the audio information and the video information into an audio and video file.
  • the receiving and sending module 501 is further configured to send the audio information to the smart device, so that the smart device synthesizes an audio and video file according to the video information recorded by itself and the received audio information.
  • the receiving and sending module 501 is further configured to send the audio information to a second terminal, so that the second terminal synthesizes an audio and video file according to the video information recorded by the smart device and the audio information.
  • an embodiment of the present disclosure also provides an electronic device 600, as shown in FIG. 6, including: a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601 , The communication interface 602 and the memory 603 communicate with each other through the communication bus 604.
  • a computer program is stored in the memory 603; when the program is executed by the processor 601, the processor 601 executes the following steps on the smart device side: receiving a control message sent by the first terminal; and recording video information according to the control message, so that an audio and video file is synthesized according to the video information and the audio information recorded by the first terminal.
  • control message carries time information.
  • the recording video information according to the control message includes: if the time information is information of a delay time, the smart device waits for the delay time to record the video information after receiving the control message . If the time information is the time point for video recording, the smart device records the video information when the time point is reached.
  • the method further includes: receiving audio information sent by the first terminal; and synthesizing the audio information and the video information into an audio and video file.
  • in a possible implementation, the method further includes: sending the video information to the first terminal, so that the first terminal synthesizes an audio and video file according to the audio information recorded by itself and the received video information.
  • in a possible implementation, the method further includes: sending the video information to a second terminal, so that the second terminal synthesizes an audio and video file according to the audio information recorded by the first terminal and the video information.
  • the communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the communication bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 602 is used for communication between the aforementioned electronic device and other devices.
  • the memory 603 may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
  • the memory 603 may also be at least one storage device located far away from the foregoing processor 601.
  • the aforementioned processor 601 may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • an embodiment of the present disclosure also provides an electronic device 700, as shown in FIG. 7, including: a processor 701, a communication interface 702, a memory 703, and a communication bus 704, wherein the processor 701 , The communication interface 702 and the memory 703 communicate with each other through the communication bus 704.
  • a computer program is stored in the memory 703; when the program is executed by the processor 701, the processor 701 executes the following steps on the first terminal side: sending, to the smart device, a control message for controlling the smart device to record video information; and recording audio information according to the control message, so that an audio and video file is synthesized according to the audio information and the video information recorded by the smart device.
  • control message carries time information.
  • the recording of audio information according to the control message includes: if the time information is information of a delay duration, the first terminal waits for the delay duration to record audio information after sending the control message. If the time information is the time point for audio recording, the first terminal records the audio information when the time point is reached.
  • the method includes: receiving video information sent by the smart device; and synthesizing the audio information and the video information into an audio and video file.
  • in a possible implementation, the method further includes: sending the audio information to the smart device, so that the smart device synthesizes an audio and video file according to the video information recorded by itself and the received audio information.
  • in a possible implementation, the method further includes: sending the audio information to a second terminal, so that the second terminal synthesizes an audio and video file according to the video information recorded by the smart device and the audio information.
  • the communication bus 704 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the communication bus 704 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 702 is used for communication between the aforementioned electronic device and other devices.
  • the memory 703 may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
  • the memory 703 may also be at least one storage device located far away from the foregoing processor.
  • the above-mentioned processor 701 may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • in the above embodiments, the first terminal is used to record audio information, the video information is collected by a smart device such as a drone, and the audio information and video information are synthesized, ensuring the effect of the synthesized audio and video files.
  • in order that the audio and video file can express its content more clearly, corresponding text information (i.e., subtitles) can be generated according to the audio information, and the audio information, video information, and text information can be synthesized to obtain an audio and video file with subtitles.
  • the corresponding text information can also be obtained according to the video information, and then the video information, audio information, and text information can be synthesized to obtain an audio and video file with subtitles.
  • FIG. 8 is a schematic diagram of a scene for synthesizing audio and video files with subtitles provided by an embodiment of the disclosure, as shown in FIG. 8, including drones, mobile phones, and users.
  • the drone has a video recording function and can communicate with mobile phones.
  • the mobile phone is used to collect the user's voice. In order to collect the user's audio information more clearly, the mobile phone can be placed near the user.
  • using a mobile phone to collect audio is a feasible implementation, and it can also be replaced by other electronic devices with audio recording functions, such as tablets, voice recorders, smart wearable devices such as Bluetooth headsets, etc.
  • the device that records audio can be called the first terminal. After the drone obtains the video information, audio information and text information, it synthesizes the video information, audio information and text information to obtain audio and video files with subtitles.
  • Fig. 9 is a schematic flow chart of an audio and video processing method provided by an embodiment of the disclosure. As shown in Fig. 9, the method is applied to a smart device, such as a drone. It should be noted that the drone has a video capture function. The method includes:
  • Step 210 Collect video information, and obtain audio information and text information corresponding to the audio information; wherein the audio information is collected by the first terminal.
  • the video recording parameters can be set on the drone in advance; after the setting is completed, the drone takes off and records video according to the set parameters.
  • the user may also remotely control the drone.
  • the first terminal may communicate with the drone, and the first terminal may send control messages of video recording parameters to the drone, so as to control the drone.
  • a communication module is provided in the drone, and the drone communicates with the first terminal through the communication module.
  • the first terminal may be placed near the sound source, and the audio information of the sound source may be collected through the first terminal.
  • the first terminal can send the recorded audio information to the drone. It should be noted that the audio information collected by the first terminal can be synchronized with the video information collected by the drone.
  • the text information is generated based on the audio information
  • the step of generating the text information can be performed in the first terminal, the drone, or the second terminal. That is, after receiving the audio information sent by the first terminal, the drone can generate corresponding text information based on the audio information, or the second terminal can generate corresponding text information based on the audio information after receiving the audio information.
  • the drone can also receive audio information and text information sent by the first terminal. In this case, the first terminal generates the corresponding text information after collecting the audio information, and sends the audio information and text information to the drone .
  • the specific method for generating text information from audio information is described in detail in the following embodiments.
  • Step 220 Synthesize the video information, the audio information, and the text information into an audio and video file with subtitles.
  • after the drone obtains the video information, audio information, and text information, it synthesizes them to obtain an audio and video file with subtitles.
  • video information includes multiple frames of video images
  • the synthesis of video information and text information refers to adding text information to the video image of the corresponding frame.
  • audio information is collected through the first terminal, video information is collected through the drone, and corresponding text information is generated according to the audio information; finally, the audio information, video information, and text information are synthesized. On the one hand, this guarantees the quality of both the audio and the video; on the other hand, through the text information, users can obtain more accurate audio content and better understand the audio and video. A subtitle-track sketch follows.
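  • One common way to attach the timed text to the corresponding frames is a subtitle track; the sketch below renders recognized segments (start, end, text) as SRT, a format muxers can embed or burn in (the choice of SRT is an illustrative assumption):

```python
def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start_s, end_s, text) segments as an SRT subtitle track."""
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)

# e.g. open("subs.srt", "w").write(to_srt([(0.0, 2.5, "hello"), (2.5, 4.0, "world")]))
```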
  • S301 Audio preprocessing; acquiring parameter information in the audio information, the parameter information at least including the number of channels, encoding mode, and sampling rate, and converting the parameter information in the audio information into a standard format.
  • for example, in the standard format the number of channels is mono, the sampling rate is 16000 Hz, and the encoding is in WAV format.
  • S302 Noise reduction; the sound of the first 0.5 seconds of the audio information is selected as the noise sample, the noise sample is divided into frames through a Hanning window, and the intensity value corresponding to each frame is obtained and used as the noise gate threshold. The audio information is then divided into frames through the Hanning window and the intensity value corresponding to each frame is computed to obtain the audio signal intensity values; the audio signal intensity values are compared with the noise gate threshold frame by frame, the audio whose signal intensity value is greater than the noise gate threshold is retained, and a noise-reduced audio file is finally obtained. A sketch of this noise gate appears below.
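  • A minimal numpy sketch of the described noise gate, assuming 16 kHz mono samples and using per-frame RMS as the "intensity value" (the frame length, the RMS choice, and taking the maximum noise-frame intensity as the threshold are all assumptions):

```python
import numpy as np

def noise_gate(audio: np.ndarray, sr: int = 16000, frame_len: int = 512) -> np.ndarray:
    """Keep only frames louder than a threshold learned from the first 0.5 s."""
    window = np.hanning(frame_len)

    def frame_intensity(x: np.ndarray) -> np.ndarray:
        n = len(x) // frame_len
        frames = x[: n * frame_len].reshape(n, frame_len) * window
        return np.sqrt((frames ** 2).mean(axis=1))       # per-frame RMS

    # noise-gate threshold from the first 0.5 s (max over noise frames; an assumption)
    threshold = frame_intensity(audio[: sr // 2]).max()
    intensities = frame_intensity(audio)
    kept = [audio[i * frame_len:(i + 1) * frame_len]
            for i, v in enumerate(intensities) if v > threshold]
    return np.concatenate(kept) if kept else np.array([], dtype=audio.dtype)
```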
  • S303 Audio information segmentation; dual-threshold voice endpoint detection technology is adopted to segment the noise-reduced audio information into available audio segments, and audio segments that do not meet the thresholds are treated as silence or noise and are not processed.
  • S304 Fragment recognition; according to the default minimum silence length and shortest effective sound parameters, the selected audio samples are further segmented to obtain a series of speech fragments; the obtained speech fragments are then recognized by calling third-party speech recognition software, and the text information corresponding to all the audio information is sorted out. A dual-threshold segmentation sketch follows.
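  • A simplified sketch of dual-threshold endpoint detection over the per-frame intensities from the previous step; the high threshold opens a segment, the low threshold keeps it open, and the exact threshold values and frame units are assumptions:

```python
def dual_threshold_segments(intensity, high, low, min_silence=10, min_voice=5):
    """Return (start, end) frame indices of speech segments.

    A segment opens when intensity exceeds `high`; it stays open while
    intensity stays above `low` (allowing up to `min_silence` quiet frames),
    and is kept only if it is at least `min_voice` frames long."""
    segments, start, quiet = [], None, 0
    for i, v in enumerate(intensity):
        if start is None:
            if v >= high:
                start, quiet = i, 0
        elif v >= low:
            quiet = 0
        else:
            quiet += 1
            if quiet >= min_silence:
                end = i - quiet + 1
                if end - start >= min_voice:
                    segments.append((start, end))
                start, quiet = None, 0
    if start is not None and len(intensity) - start >= min_voice:
        segments.append((start, len(intensity)))
    return segments
```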
  • each audio segment has a corresponding time stamp. Therefore, after being converted into corresponding text information, the text information also has the same time stamp.
  • the audio information and text information can be aligned in time through the time stamp.
  • a dialect database and a foreign language translation database can be constructed in advance, so that when the audio information is a dialect or a foreign language, the audio information can also be generated into corresponding text information.
  • the audio information is segmented twice and then text recognition is performed, so that more accurate text information can be obtained.
  • in some embodiments, both the video information and the audio information include first time information.
  • since the text information is generated based on the audio information, the obtained text information also includes the first time information.
  • for example, the performance video of the actors on the stage is collected through the drone, and the audio information of the actors on the stage is synchronously collected through the first terminal. Since the first time information represents the absolute time of collection, and since the video information and audio information are collected synchronously, during synthesis the video information, audio information, and text information can be aligned on time points according to the first time information, so that the obtained audio and video file with subtitles is synchronized in time.
  • the implementation of the present disclosure synthesizes the audio information, video information, and text information according to the first time information, so that the three can be played synchronously and there is no time mismatch between the played video, audio, and subtitles; a minimal alignment sketch follows.
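  • A minimal sketch of this first-time-information alignment, assuming each stream is a time-sorted list of (absolute_timestamp_s, payload) pairs and that nearest-start matching within one frame interval is acceptable:

```python
def align_by_time(video, audio, text, tolerance_s=0.04):
    """Group samples from three time-sorted (timestamp_s, payload) streams
    captured at (nearly) the same absolute time.

    `tolerance_s` is roughly one video frame at 25 fps."""
    aligned = []
    for t_v, frame in video:
        snd = next((p for t, p in audio if abs(t - t_v) <= tolerance_s), None)
        sub = None
        for t, p in text:                 # latest subtitle whose start <= t_v
            if t > t_v:
                break
            sub = p
        aligned.append((t_v, frame, snd, sub))
    return aligned

# e.g. align_by_time([(0.0, "f0"), (0.04, "f1")], [(0.0, "a0")], [(0.0, "hi")])
```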
  • in some scenarios, the mouth movements, voice, and subtitles of the speaker need to be played synchronously.
  • after the UAV collects the video information, it can obtain multiple frames of video images in the video information and recognize them to obtain the mouth-shape change features of the person in the video information. It should be noted that, before the multi-frame video images are recognized, they can be divided to obtain the video frames corresponding to each word spoken by the person in the video information.
  • the corresponding text can be obtained according to the mouth shape change characteristics.
  • a character recognition model can be constructed in advance, and the character of the mouth shape change can be analyzed through the character recognition model, and the corresponding characters can be output.
  • the main purpose of obtaining the corresponding text through the lip change feature is to align and synthesize video information, audio information, and text information. Therefore, the video information, audio information, and text information can be synthesized into an audio and video file with subtitles according to the text corresponding to the lip change feature.
  • the embodiment of the present disclosure obtains the words spoken by the person according to the mouth-shape change features of the person in the video, and then aligns and synthesizes the video information, audio information, and text information according to those words, so that when the synthesized audio and video file is played, the video, audio, and subtitles are kept synchronized in time; a sketch of this word-level alignment is given below.
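  • A sketch of this lip-based alignment under strong assumptions: the per-word video segmentation and the lip-read words are taken as given (the recognition model producing them is hypothetical), and matching proceeds word by word:

```python
def align_text_to_video(video_word_segments, lip_words, subtitle_words):
    """Pair each subtitle word with the video span whose lip-read word matches.

    video_word_segments: (start_s, end_s) per spoken word in the video
    lip_words: word recognized from each segment's mouth-shape changes
               (output of a hypothetical lip-reading model)
    subtitle_words: words obtained from speech recognition of the audio
    """
    timed, j = [], 0
    for (start, end), lip_word in zip(video_word_segments, lip_words):
        if j < len(subtitle_words) and subtitle_words[j] == lip_word:
            timed.append((start, end, subtitle_words[j]))
            j += 1
    return timed
```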
  • FIG. 11 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the present disclosure. As shown in FIG. 11, the interaction involves a drone and a first terminal, and the method includes:
  • S401 The drone collects video information; the drone, which has a video recording function, collects the video information;
  • S402 The first terminal collects audio information; the first terminal, which has an audio recording function, collects the audio information. It should be noted that S401 and S402 can be performed simultaneously;
  • S403 The first terminal sends the audio information to the drone; the first terminal sends the collected audio information to the drone. It should be noted that a communication connection exists between the first terminal and the drone;
  • S404 Generate text information according to the audio information; after receiving the audio information sent by the first terminal, the drone generates the corresponding text information according to the audio information;
  • S405 Perform synthesis; the drone synthesizes the video information, the audio information, and the text information to obtain a subtitled audio and video file.
  • In this way, the audio information collected by the first terminal is sent to the drone, and the drone generates the text information corresponding to the audio information and synthesizes the collected video information, the received audio information, and the generated text information. On the one hand, this ensures that clear video information and clear audio information are obtained at the same time; on the other hand, through the subtitles, users who watch the audio and video can understand the audio more clearly, preventing the problem that users fail to grasp the correct meaning when the collected audio is in a dialect or a foreign language.
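The S401–S405 exchange can be pictured with the toy message flow below; every name in it (the classes and the speech_to_text and mux stand-ins) is illustrative scaffolding assumed for this sketch, not an API from the disclosure.

```python
def speech_to_text(audio):                    # stand-in recognizer (S404)
    return [("hello", audio["t0"])]

def mux(video, audio, subtitles):             # stand-in synthesis step (S405)
    return {"video": video, "audio": audio, "subtitles": subtitles}

class Drone:
    def __init__(self):
        self.video = None

    def record_video(self):                   # S401
        self.video = {"frames": [], "t0": 0.0}    # placeholder payload

    def on_audio(self, audio):                # S404 + S405 on receipt
        return mux(self.video, audio, speech_to_text(audio))

class FirstTerminal:
    def record_audio(self):                   # S402
        return {"pcm": b"\x00\x00", "t0": 0.0}    # placeholder payload

    def send_audio(self, drone, audio):       # S403
        return drone.on_audio(audio)

drone, phone = Drone(), FirstTerminal()
drone.record_video()                          # S401 (parallel with S402)
av_file = phone.send_audio(drone, phone.record_audio())
```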
  • FIG. 12 is a signaling interaction diagram of another audio and video processing method provided by an embodiment of the disclosure. As shown in FIG. 12, the interaction involves a drone and a first terminal, and the method includes:
  • S501 The drone collects video information; the drone, which has a video recording function, collects the video information;
  • S502 The first terminal collects audio information; the first terminal, which has an audio recording function, collects the audio information. It should be noted that S501 and S502 can be performed simultaneously;
  • S503 Generate text information according to the audio information; after collecting the audio information, the first terminal generates the corresponding text information according to the audio information;
  • S504 The first terminal sends the audio information and the text information to the drone; the first terminal sends the collected audio information and the generated text information to the drone. It should be noted that a communication connection exists between the first terminal and the drone;
  • S505 Perform synthesis; the drone synthesizes the video information, the audio information, and the text information to obtain a subtitled audio and video file.
  • In this way, the audio information is collected by the first terminal, the video information is collected by the drone, the corresponding text information is generated from the audio information, and the three are then synthesized. On the one hand, a high-quality audio and video file is obtained; on the other hand, the subtitles help users better understand the audio information.
  • FIG. 13 is a schematic diagram of a scene for synthesizing audio and video files with subtitles according to an embodiment of the present disclosure.
  • The scene in FIG. 13 includes a drone, a mobile phone, and a user; the drone has a video recording function and can communicate with the mobile phone;
  • the mobile phone is used to collect the user's voice, and in order to collect the user's audio information more clearly, the mobile phone can be placed near the user.
  • Using a mobile phone to collect the audio is one feasible implementation; it can be replaced by any other electronic device with an audio recording function, such as a tablet, a voice recorder, or a smart wearable device such as a Bluetooth headset.
  • The device that records the audio can be called the first terminal. After the mobile phone collects the audio information, it generates the corresponding text information based on the audio information, receives the video information sent by the drone, and then synthesizes the video information, audio information, and text information to obtain a subtitled audio and video file.
  • FIG. 14 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the disclosure. As shown in FIG. 14, the method is applied to a first terminal, which may be an electronic device with a recording function such as a mobile phone or a tablet computer. The method includes:
  • S701 Collect video information; in order to obtain clearer and broader video information, a drone is used to collect the video information.
  • The recording operations of the first terminal and the drone can be performed simultaneously.
  • The drone can send the entire video information to the first terminal after collection is complete, or it can send the collected video information to the first terminal in real time.
  • S702 Collect audio information; in order to collect clearer audio information, the first terminal used to collect the audio information may be placed near the sound source.
  • S703 Generate text information; after the first terminal collects the audio information, it generates the corresponding text information from the audio information, where the manner in which the first terminal generates text information according to the audio information is the same as the method described with reference to FIG. 10 and is not repeated here. Moreover, the first terminal may generate the text information after all of the audio information has been collected, or may generate the corresponding text information from the collected audio in real time.
  • S704 Receive the video information sent by the drone; after the drone collects the video information, it sends the collected video information to the first terminal. It should be noted that the first terminal and the drone can be connected in advance, and the video information can be transmitted via wireless signals.
  • S705 Synthesize a subtitled audio and video file; the first terminal synthesizes the audio information, the text information, and the video information to obtain a subtitled audio and video file.
  • The first time information is included in the video information, the audio information, and the text information.
  • The first time information in the video information is the time at which the drone recorded the video information.
  • The first time information in the audio information is the time at which the first terminal recorded the audio information, the drone and the first terminal recording simultaneously.
  • The text information is generated from the audio information, so the text information and the audio information are synchronized in time.
  • The first terminal aligns the video information, audio information, and text information at points in time according to the first time information, so as to obtain a subtitled audio and video file that is synchronized.
  • In this way, the first terminal collects the audio information and generates the corresponding text information from it, the drone collects the video information, and the first terminal synthesizes the video information received from the drone, the collected audio information, and the generated text information to obtain a subtitled audio and video file; while a high-quality audio and video file is ensured, the added subtitles enable users to correctly understand the audio information.
  • Before receiving the video information sent by the drone, the first terminal may also send a control message to the drone, so that the drone collects the video information according to the control message.
  • The recording start time and recording parameters can be set on the drone in advance, and the drone then records the video according to the set parameters.
  • The first terminal can also communicate with the drone so that the drone records the video. Before recording, an APP that can control the drone can be pre-installed on the first terminal; the user can use the APP to send a control message to the drone, and when the drone receives the control message, it performs the corresponding operations according to the control message.
  • The control message may simply instruct the drone to start video recording.
  • The control message can also include time information, which can be the time point at which to start recording or a delay time. The time information is used to control when the drone performs the video recording operation, ensuring that the recording of the first terminal and that of the drone are synchronized in time, which facilitates better alignment of audio, video, and text during the synthesis process.
  • The control message may also include other parameters required for video recording, such as settings for focus distance and brightness.
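Such a control message might be serialized as a small JSON payload, as in the sketch below; the field names (cmd, start_at, delay_ms, focus_m, brightness) are assumptions of this sketch, not a format defined by the disclosure.

```python
import json, time
from typing import Optional

def make_control_message(start_at: Optional[float] = None,
                         delay_ms: Optional[int] = None,
                         focus_m: float = 3.5,
                         brightness: int = 60) -> bytes:
    """Build the APP -> drone control message (hypothetical format)."""
    msg = {"cmd": "start_recording",
           "focus_m": focus_m,               # focus distance parameter
           "brightness": brightness}         # brightness parameter
    if start_at is not None:
        msg["start_at"] = start_at           # absolute start time (epoch secs)
    elif delay_ms is not None:
        msg["delay_ms"] = delay_ms           # wait this long after receipt
    return json.dumps(msg).encode()

# e.g. ask the drone to start recording 3 seconds from now:
payload = make_control_message(start_at=time.time() + 3.0)
```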
  • FIG. 15 is a schematic diagram of a scene for synthesizing a subtitled audio and video file according to another embodiment of the present disclosure. As shown in FIG. 15, the scene includes a drone, a mobile phone, a server (a third party, such as the second terminal described above), and a user.
  • The drone has a video recording function and can communicate with the mobile phone; the mobile phone is used to collect the user's voice and, in order to collect the user's audio information more clearly, can be placed near the user; the server is used to synthesize the subtitled audio and video file.
  • Using a mobile phone for audio collection is one feasible implementation; it can be replaced by any other device with an audio recording function, such as a tablet, a voice recorder, or a smart wearable device such as a Bluetooth headset.
  • The device that records the audio is called the first terminal. After the mobile phone collects the audio information, it can generate the corresponding text information from the audio information and send the audio information and the text information to a third party such as the server, or it can send only the audio information to the server, with the server generating the text information.
  • FIG. 16 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the disclosure.
  • The processing involves a first terminal, a drone, and a server, and the method includes:
  • S901 Collect audio information; the first terminal collects the audio information. The first terminal may be placed near the sound source.
  • S902 Collect video information; the drone collects the video information, where the drone can communicate with the first terminal in advance, receive the control message sent by the first terminal, and collect the video information according to the control message. It should be noted that S901 and S902 can be performed simultaneously.
  • S903 Send the audio information; after collecting the audio information, the first terminal sends the audio information to the server. It should be noted that the first terminal may send the collected audio information to the server in real time, or may send the audio information to the server after all of it has been collected.
  • S904 Send the video information; the drone sends the collected video information to the server. It should be noted that the drone may send the collected video information to the server in real time, or may send the video information to the server after all of it has been collected, and that S903 and S904 can be performed simultaneously.
  • S905 Generate text information; after receiving the audio information, the server generates the corresponding text information according to the audio information.
  • S906 Synthesize a subtitled audio and video file; after generating the text information, the server synthesizes the audio information, the text information, and the video information to obtain a subtitled audio and video file.
  • In this way, the audio information is collected by the first terminal, the server generates the corresponding text information based on the audio information, and the video information is collected by the drone.
  • The server synthesizes the video information, audio information, and text information, which ensures a high-quality audio and video file while also reducing the load on the first terminal and the drone; the drone does not need an audio and video synthesis function of its own, which lowers the requirements placed on it.
  • FIG. 17 is a signaling interaction diagram of yet another audio and video processing method provided by an embodiment of the disclosure. As shown in FIG. 17, the method includes:
  • S1001 Collect audio information and generate text information; the first terminal collects the audio information. The first terminal may be placed near the sound source. After collecting the audio information, the first terminal generates the corresponding text information according to the audio information. It should be noted that the method for generating the text information can be consistent with the above embodiments and is not repeated here.
  • S1002 Collect video information; the drone can communicate with the first terminal in advance, and the drone can receive the control message sent by the first terminal and then start collecting the video information.
  • S1003 Send the audio information and the text information; the first terminal sends the audio information and the text information to the server.
  • S1004 Send the video information; the drone sends the collected video information to the server. It should be noted that the drone can also establish a communication connection with the server in advance.
  • S1005 Synthesize a subtitled audio and video file; the server synthesizes the received audio information, video information, and text information to obtain a subtitled audio and video file.
  • In this way, the first terminal collects the audio information and generates the corresponding text information based on it, the drone collects the video information, and the server synthesizes the video information, audio information, and text information, ensuring that a high-quality audio and video file is obtained while also reducing the load on the first terminal and the drone; the drone does not need an audio and video synthesis function, which lowers the requirements placed on it.
  • In some scenes, the mouth movements, voice, and subtitles of a speaker need to be played synchronously.
  • After the drone collects the video information, it can obtain multiple frames of video images from the video information and recognize them to obtain the mouth-shape change characteristics of the person in the video information. It should be noted that, before the multiple frames of video images are recognized, they can be divided so as to obtain the video images corresponding to each word spoken by the person in the video information.
  • The corresponding text can then be obtained according to the mouth-shape change characteristics.
  • For example, a character recognition model can be constructed in advance; the mouth-shape change characteristics are analyzed by the model, and the corresponding characters are output.
  • The main purpose of obtaining text from the mouth-shape change characteristics is to align video information, audio information, and text information during synthesis; therefore, the video information, audio information, and text information can be synthesized into the subtitled audio and video file according to the text corresponding to the mouth-shape change characteristics.
  • That is, the present disclosure recognizes the mouth shapes of a person speaking in the video, learns what the person says, and synthesizes the video information, audio information, and text information according to the person's words, so as to keep audio, video, and subtitles synchronized in time.
  • In addition, both the audio collection by the first terminal and the video collection by the drone can be controlled by the server; that is, an APP that can control the audio collection of the first terminal and the video collection of the drone is installed on the server.
  • To make the first terminal and the drone collect at the same time, the server can send a control message to the first terminal and the drone simultaneously.
  • When the first terminal and the drone receive the control message, they start audio and video collection.
  • Of course, other smart devices can also be used to control the first terminal and the drone.
  • This embodiment also provides an audio and video processing device capable of executing each step involved in the method embodiment of FIG. 9. The device may be a module, program segment, or code on an electronic device.
  • The device includes a receiving and sending module for collecting video information and obtaining audio information and the text information corresponding to the audio information, where the audio information is collected by a first terminal, and a synthesis module for synthesizing the video information, the audio information, and the text information into a subtitled audio and video file.
  • This embodiment also provides an audio and video processing device capable of executing each step involved in the method embodiment of FIG. 14. The device may be a module, program segment, or code on an electronic device.
  • The device includes a receiving and sending module for collecting audio information, generating the corresponding text information according to the audio information, and receiving the video information sent by a drone (smart device), and a synthesis module for synthesizing the video information, the audio information, and the text information into a subtitled audio and video file.
  • This embodiment also provides an audio and video processing device capable of executing each step involved in the method embodiment of FIG. 16. The device may be a module, program segment, or code on an electronic device.
  • The device includes a receiving and sending module for obtaining video information, audio information, and the text information corresponding to the audio information, where the video information is collected by a drone (smart device) and the audio information is collected by a first terminal, and a synthesis module for synthesizing the video information, the audio information, and the text information into a subtitled audio and video file.
  • In these devices, subtitles are added during audio and video synthesis, which helps the user understand the audio.
  • The first terminal and the smart device can communicate with each other to realize information transmission.
  • For example, the first terminal may collect audio information (also referred to as audio data) and transmit the collected audio information to the drone.
  • As long as the first terminal and the drone can realize information transmission, this embodiment applies; this is now illustrated by the following examples.
  • The drone can communicate wirelessly with the wireless module of the first terminal through its WiFi module.
  • The wireless module of the first terminal may be a WiFi module or a 4G module.
  • A ground repeater can also be used; that is, the drone can communicate with the high-power WiFi module of the ground repeater through its own high-power WiFi module, and the high-power WiFi module of the ground repeater then communicates with the first terminal.
  • The drone and the first terminal can also communicate through other short-range wireless communication technologies, such as Bluetooth or ZigBee.
  • The specific communication method between the drone and the first terminal should not be understood as a limitation on this disclosure.
  • FIG. 18 is a schematic flowchart of a specific implementation manner for information transmission between the first terminal and the drone provided by the embodiments of the disclosure.
  • The method can be executed by a smart device with a camera, such as a drone, and specifically includes S110 to S130:
  • S110 Collect video information, and buffer the video information in a storage device.
  • the video information may include at least one second time stamp and video content information corresponding to each second time stamp.
  • the smart device can shoot and collect video content information through its own camera, and the smart device can add a corresponding second time stamp to the video content information according to the time when the video is taken. Subsequently, the smart device can cache the video information including the video content information and the second time stamp in the storage device.
  • the storage device is a memory that stores cached data, such as random access memory (Random Access Memory, RAM for short).
  • S120 Receive audio information corresponding to the video information sent by a first terminal (such as a user terminal), and buffer the audio information in the storage device.
  • the audio information includes at least one first time stamp and audio content information corresponding to each first time stamp.
  • the first terminal may collect audio content information, and the first terminal may add a corresponding first time stamp to the audio content information according to the time of audio collection.
  • The audio information corresponding to the video information can mean that the shooting time corresponding to the second time stamp of the video information is the same as the collection time corresponding to the first time stamp of the audio information, or that the shooting time corresponding to the second time stamp and the collection time corresponding to the first time stamp differ by a certain length of time, which can be, for example, 1 second or 0.5 second.
  • For example, the shooting time corresponding to the second time stamp may be 1 second earlier than the collection time corresponding to the first time stamp, or 0.5 second later than the collection time corresponding to the first time stamp.
  • In practice, the smart device can continuously shoot video content information, add the corresponding second time stamps to the video content information, and cache the video information including the video content information and the second time stamps in the storage device.
  • the smart device may also continuously receive the audio information sent by the first terminal, and buffer the audio information in the storage device, so as to wait to synthesize an audio and video file with the buffered video information.
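A minimal sketch of this buffering scheme follows, with integer-millisecond time stamps used as cache keys; the dict-based cache and the fixed maximum buffer duration are assumptions of this sketch.

```python
import collections

class AVBuffer:
    """Cache time-stamped video and audio until synthesis time."""
    def __init__(self, max_buffer_ms: int = 10_000):    # first time length
        self.max_buffer_ms = max_buffer_ms
        self.video = collections.OrderedDict()   # second time stamp -> frame
        self.audio = {}                           # first time stamp -> chunk

    def put_video(self, ts_ms: int, frame: bytes):      # S110
        self.video[ts_ms] = frame

    def put_audio(self, ts_ms: int, chunk: bytes):      # S120
        self.audio[ts_ms] = chunk

    def ripe_video(self, now_ms: int):
        """Yield frames that have been cached for the first time length."""
        for ts in list(self.video):
            if now_ms - ts >= self.max_buffer_ms:
                yield ts, self.video.pop(ts)
```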
  • Optionally, if the smart device is disconnected from the first terminal for a second time length and then reconnected, it receives the audio information corresponding to the second time length sent by the first terminal, where the first time length is the maximum buffer duration of the smart device, the second time length is less than or equal to the first time length, and the audio information corresponding to the video information that has been buffered for the first time length includes the audio information corresponding to the second time length.
  • For example, take a first time length (that is, a maximum video buffer duration) of 10 seconds and a second time length of 5 seconds. A maximum video buffer duration of 10 seconds means that, counting from the moment of collection, video is synthesized with the corresponding audio information only after a 10-second delay; for example, assuming the video collection moment is the 0th second, the video collected at the 0th second is combined with the audio at the 10th second to synthesize an audio and video file.
  • In this case, when the first terminal reconnects after the 5-second disconnection, it not only starts collecting new audio information and sending it to the smart device, it also sends the audio information collected within the 5 seconds of disconnection to the smart device.
  • The 5 seconds of disconnection can be any 5 seconds of the aforementioned 10-second period from the 0th to the 10th second: for example, the 0th to the 5th second, the 3rd to the 8th second, or the 5th to the 10th second.
  • When the video collected at the 0th second begins to be synthesized with the corresponding audio information, the 5 seconds during which the smart device was disconnected from the first terminal fall between the 0th and the 10th second; therefore, the disconnection between the smart device and the first terminal does not affect the synthesis of the audio and video file.
  • In other words, as long as the disconnection does not exceed the maximum buffer duration, the first terminal can still deliver the audio that was not successfully sent during the disconnection to the smart device so that the smart device synthesizes the audio and video file, thereby making the synthesis of audio and video files more stable.
  • Optionally, if the smart device is disconnected from the first terminal for a third time length and then reconnected, it receives the audio information corresponding to the most recent first time length within the third time length sent by the first terminal, where the third time length is greater than the first time length.
  • For example, suppose the smart device is disconnected from the first terminal for 15 seconds, from the 0th to the 15th second.
  • Since the smart device is still disconnected from the first terminal when the 10th second arrives, the video collected at the 0th second has no corresponding audio information, and because the maximum video buffer duration is 10 seconds, the video collected at the 0th second generates a video file without audio.
  • Likewise, if the video collected at the 1st second, the 2nd second, and the 3rd second has no corresponding audio information, video files without audio are generated.
  • The video collected at the 5th second, however, is synthesized with its corresponding audio information at the 15th second, and the 15th second is exactly the moment at which the smart device reconnects to the first terminal after the disconnection. Therefore, the first terminal sends the most recent audio information corresponding to a period equal to the first time length to the smart device so that the smart device can perform the synthesis on the cached video information.
  • That is, after a disconnection longer than the maximum buffer duration, the first terminal sends the latest audio information to the smart device, and the duration of that latest audio information may equal the maximum video buffer duration, so that the smart device can synthesize as many audio and video files with audio as possible.
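The two reconnection rules can be collapsed into one window calculation, sketched below from the first-terminal side; the outbox dictionary and millisecond units are assumptions of this sketch.

```python
def audio_to_resend(outbox: dict, disconnect_ms: int, now_ms: int,
                    max_buffer_ms: int = 10_000) -> dict:
    """outbox maps first time stamp (ms) -> audio chunk not yet delivered.
    Short outage (second time length): resend the whole gap. Long outage
    (third time length): resend only the most recent maximum-buffer window."""
    window = min(disconnect_ms, max_buffer_ms)
    window_start = now_ms - window
    return {ts: chunk for ts, chunk in outbox.items() if ts >= window_start}
```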
  • S130 Synthesize the video information and corresponding audio information that have been buffered for the first time length in the storage device into an audio and video file.
  • In this way, after the smart device collects the video information, it does not immediately synthesize the video information and the corresponding audio information into an audio and video file; instead, it first caches the video information for a period of time and only then synthesizes it with the corresponding audio information, the audio information being sent by the first terminal.
  • Thus, if the smart device is disconnected from the first terminal and then reconnected, the synthesis of the audio information and the video information is not affected, which makes the synthesis of audio and video files more stable and improves the poor-sound-signal problem of the prior art.
  • FIG. 19 shows the specific steps of S130. The specific process of synthesizing the audio and video file may include the following S131 to S134:
  • S131 For the video information that has been cached for the first time length, extract its second time stamp, that is, obtain the collection time corresponding to the video content information included in the video information.
  • S132 Determine whether a first time stamp corresponding to the second time stamp exists among the at least one first time stamp of the audio information; if so, execute S133; if not, execute S134.
  • The first time stamp corresponding to the second time stamp may mean that the collection time corresponding to the second time stamp is the same as the collection time corresponding to the first time stamp, or that the collection time corresponding to the second time stamp is earlier than the collection time corresponding to the first time stamp by a fixed time length, or later than it by a fixed time length.
  • After the second time stamp of the video information that has been cached for the first time length is obtained, the storage device of the smart device is searched for a first time stamp corresponding to that second time stamp. If the first time stamp corresponding to the second time stamp is found, the video information has corresponding audio information that can be synthesized, and S133 is executed; if it is not found, the video information has no corresponding audio information that can be synthesized, and S134 is executed.
  • S133 Synthesize the audio content information corresponding to the first time stamp that corresponds to the second time stamp and the video information that has been buffered for the first time length into an audio and video file.
  • Each first time stamp among the at least one first time stamp has a corresponding time, and each second time stamp among the at least one second time stamp also has a corresponding time, so the correspondence between the audio content information and the video content information can be established through the correspondence between the time stamps; even if the video is not synthesized with the audio in real time, the audio and video file can still be synthesized.
  • S134 Add prompt information indicating missing voice to the video information that has been cached for the first time length. That is, when the audio information corresponding to the currently cached first-time-length video information is not in the buffer and the audio and video file cannot be synthesized, the missing-voice prompt information is added so that this passage is better distinguished from video passages with audio, and the voice-missing video can conveniently be filtered and processed later.
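Continuing the AVBuffer sketch above, S131–S134 reduce to a time stamp lookup with a fixed offset; offset_ms and the missing-voice marker are assumptions of this sketch.

```python
def synthesize_ripe(buf: "AVBuffer", now_ms: int, offset_ms: int = 0):
    """For each frame cached for the first time length, look up the audio
    chunk whose first time stamp equals the frame's second time stamp plus
    a fixed offset (S131-S132); pair them (S133) or mark the voice as
    missing (S134)."""
    out = []
    for ts, frame in buf.ripe_video(now_ms):
        chunk = buf.audio.pop(ts + offset_ms, None)
        if chunk is not None:
            out.append({"t": ts, "video": frame, "audio": chunk})            # S133
        else:
            out.append({"t": ts, "video": frame, "note": "voice missing"})   # S134
    return out
```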
  • Optionally, if the smart device cannot reconnect to the first terminal after the disconnection, a disconnection prompt message is issued.
  • The disconnection prompt message is a message that reminds the operator of the smart device that the connection is broken; it can be a flashing light signal from the body of the smart device or an acoustic signal.
  • In other words, when the smart device cannot reconnect to the first terminal, it can issue the disconnection prompt message so that the operator of the smart device becomes aware of the problem and can take remedial measures.
  • Optionally, if the audio content information is lost and the audio redundant data is not lost, the audio redundant data is decoded to obtain data identical to the lost audio content information.
  • The audio information includes audio content information and audio redundant data, the audio redundant data being obtained by encoding the audio content information.
  • The audio redundant data may be obtained by the first terminal encoding the audio content information; the specific encoding manner may be a preset rule, the preset rule being a rule known to both the smart device and the first terminal.
  • In other words, the audio information transmitted by the first terminal to the smart device may include both audio content information and audio redundant data; if the audio content information is lost while the audio redundant data is not, the smart device can decode the audio redundant data to obtain data identical to the lost audio content information, which further improves the reliability of data transmission.
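The disclosure does not fix the encoding rule, so purely as an assumption of this sketch, the redundant copy is derived by XOR with a mask known to both sides; losing the plain copy still lets the receiver recover identical data.

```python
MASK = 0x5A   # preset rule known to both the smart device and the terminal

def encode_redundant(content: bytes) -> bytes:
    """Derive audio redundant data from the audio content information."""
    return bytes(b ^ MASK for b in content)

def decode_redundant(redundant: bytes) -> bytes:
    """Recover data identical to the lost audio content information."""
    return bytes(b ^ MASK for b in redundant)

# receiver side: the content packet was lost, the redundant one survived
recovered = decode_redundant(encode_redundant(b"audio frame 42"))
assert recovered == b"audio frame 42"
```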
  • FIG. 20 shows a schematic flowchart of another specific implementation manner of an audio and video processing method provided by an embodiment of the present disclosure.
  • The method may be executed by the first terminal and specifically includes the following S2210 to S2220:
  • S2210 Collect audio content information.
  • S2220 Send the audio information including the audio content information to the smart device, so that the smart device synthesizes the video information buffered for the first time length and the corresponding audio information into an audio and video file.
  • In other words, the first terminal collects the audio content information, adds the corresponding first time stamp to it, and then sends the audio information including the audio content information and the first time stamp to the smart device, so that the smart device synthesizes the video information it has cached for a period of time with the corresponding audio information.
  • Optionally, S2220 specifically includes: if the first terminal is disconnected from the smart device for a second time length and then reconnected, sending the audio information corresponding to the second time length to the smart device, where the second time length is less than or equal to the first time length.
  • That is, when the smart device is disconnected from the first terminal and then reconnected and the disconnection does not exceed the maximum video buffer duration, the first terminal can, upon reconnection, still deliver the audio that was not successfully sent during the disconnection to the smart device so that the smart device synthesizes the audio and video file, thereby making the synthesis of audio and video files more stable.
  • Optionally, S2220 further includes: if the first terminal is disconnected from the smart device for a third time length and then reconnected, sending the latest audio information corresponding to the first time length to the smart device, where the third time length is greater than the first time length.
  • In other words, after such a disconnection the first terminal sends the latest audio information to the smart device, and the duration of that latest audio information may equal the maximum video buffer duration, so that the smart device can synthesize as many audio and video files with audio as possible.
  • Optionally, S2220 specifically includes the following S221 to S222:
  • S221 Encode the audio content information to obtain audio redundant data.
  • The first terminal can encode the audio content information according to the preset rule. For example, if the audio content information consists of A, B, C, and D, the first terminal can encode A, B, C, and D separately to obtain the audio redundant data a, b, c, and d, where a corresponds to A, b corresponds to B, c corresponds to C, and d corresponds to D.
  • S222 Send the audio information including the audio content information and the audio redundant data to the smart device.
  • For example, the first terminal can send the audio information including the audio content information A, B, C, D and the audio redundant data a, b, c, d to the smart device.
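Under the same assumed XOR rule as in the decoding sketch above, S221–S222 on the terminal side could interleave content packets with their redundant counterparts, mirroring the A, B, C, D to a, b, c, d pairing:

```python
MASK = 0x5A   # same preset rule as in the decoding sketch above

def encode_redundant(content: bytes) -> bytes:
    return bytes(b ^ MASK for b in content)

def build_audio_packets(chunks):
    """S221: derive redundant data from each content chunk;
    S222: emit both kinds of packet in the audio information."""
    packets = []
    for seq, content in enumerate(chunks):
        packets.append({"seq": seq, "kind": "content", "data": content})
        packets.append({"seq": seq, "kind": "redundant",
                        "data": encode_redundant(content)})
    return packets

pkts = build_audio_packets([b"A", b"B", b"C", b"D"])
```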
  • This embodiment also provides an audio and video processing device capable of executing each step involved in the method embodiment of FIG. 18. The device may be a module, program segment, or code on an electronic device.
  • The device includes a receiving and sending module for collecting video information, buffering the video information in a storage device, receiving the audio information corresponding to the video information sent by a first terminal, and buffering the audio information in the storage device, and a synthesis module configured to synthesize the video information that has been buffered for the first time length in the storage device and the corresponding audio information into an audio and video file.
  • Correspondingly, a device executing the method embodiment of FIG. 20 includes a receiving and sending module for collecting audio content information and sending the audio information including the audio content information to the smart device, so that the smart device synthesizes the video information buffered for the first time length and the corresponding audio information into an audio and video file.
  • The embodiments of the present disclosure also provide a computer-readable storage medium that stores a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device carries out the steps of the method applied to the smart device, for example the following steps: receiving a control message sent by the first terminal; and recording video information according to the control message, so that an audio and video file is synthesized from the video information and the audio information recorded by the first terminal.
  • The embodiments of the present disclosure also provide a computer-readable storage medium that stores a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device carries out the steps of the method applied to the first terminal, for example the following steps: sending the smart device a control message for controlling the smart device to record video information; and recording audio information according to the control message, so that an audio and video file is synthesized from the audio information and the video information recorded by the smart device.
  • The aforementioned computer-readable storage medium may be any available medium or data storage device accessible to the processor in the electronic device, including but not limited to magnetic storage such as floppy disks, hard disks, magnetic tape, and magneto-optical disks (MO); optical storage such as CD, DVD, BD, and HVD; and semiconductor memory such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid-state drives (SSD).
  • The present disclosure may also provide an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus; a computer program is stored in the memory, and when the program is executed by the processor, the processor carries out the steps of the method applied to the second terminal.
  • The embodiments of the present disclosure may also provide a computer-readable storage medium that stores a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device carries out the steps of the method applied to the second terminal.
  • It should be noted that a smart device and a terminal may be connected directly, for example wirelessly, or through a relay, such as another terminal or a base station.
  • The information to be synthesized may also be something other than video information and audio information: for example, pictures obtained by the smart device may be synthesized with coordinates obtained by the terminal, or a picture obtained by the smart device may be synthesized with text obtained by the terminal.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operating steps is executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
  • In summary, the audio and video processing solutions provided by the present disclosure can record high-quality audio information and guarantee the effect of the synthesized audio and video file; by producing subtitled audio and video files, the audio and video files can express their content more clearly.
  • Moreover, if the smart device is disconnected from the first terminal and reconnected within the video caching time, the synthesis of the audio information and the video information is not affected, which makes the synthesis of audio and video files more stable and improves the poor-sound-signal problem of the prior art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present disclosure provides an audio and video processing method, device, electronic equipment, and storage medium. The method includes: receiving a control message sent by a first terminal; and recording video information according to the control message, so that an audio and video file is synthesized from the video information and the audio information recorded by the first terminal. In the embodiments of the present disclosure, after the smart device receives the control message sent by the first terminal, it records the video information according to the control message while the audio information is recorded by the first terminal, so poor recorded-audio quality does not occur and the effect of the synthesized audio and video file is guaranteed.

Description

Audio and video processing method, device, electronic equipment, and storage medium
Cross-reference to related applications
The present disclosure claims priority to the Chinese patent application No. CN2019101555984, entitled "Audio and video synthesis method, device, electronic equipment, and storage medium" and filed with the Chinese Patent Office on March 1, 2019, the Chinese patent application No. CN2019108501364, entitled "Audio and video processing method, device, electronic equipment, and storage medium" and filed with the Chinese Patent Office on September 9, 2019, and the Chinese patent application No. CN2019108501379, entitled "Audio data transmission method, device, electronic equipment, and readable storage medium" and filed with the Chinese Patent Office on September 9, 2019, the entire contents of which are incorporated into the present disclosure by reference.
Optionally, the video information, the audio information, and the text information each include first time information; synthesizing the video information, the audio information, and the text information into a subtitled audio and video file includes:
synthesizing the video information, the audio information, and the text information into the subtitled audio and video file according to the first time information.
Optionally, the video information includes a person, and synthesizing the video information, the audio information, and the text information into a subtitled audio and video file includes:
obtaining multiple frames of video images corresponding to the video information and recognizing the multiple frames of video images to obtain mouth-shape change characteristics of the person in the video information;
obtaining corresponding text according to the mouth-shape change characteristics; and
synthesizing the video information, the audio information, and the text information into the subtitled audio and video file according to the text corresponding to the mouth-shape change characteristics.
Optionally, obtaining the text information corresponding to the audio information includes:
preprocessing the audio information to obtain processed audio information;
performing endpoint segmentation on the processed audio information to obtain audio samples;
segmenting the audio samples again according to a preset minimum silence length and a shortest effective sound to obtain multiple audio fragments; and
performing text recognition on each audio fragment to obtain the text information.
Optionally, the method further includes:
buffering the video information in a storage device; and
buffering the audio information in the storage device;
where synthesizing the audio and video file from the video information and the audio information recorded by the first terminal includes:
synthesizing the video information that has been buffered in the storage device for a first time length and the corresponding audio information into an audio and video file.
Optionally, the audio information includes at least one first time stamp and audio content information corresponding to each first time stamp, and the video information includes at least one second time stamp and video content information corresponding to each second time stamp;
synthesizing the video information that has been buffered in the storage device for the first time length and the corresponding audio information into an audio and video file includes:
for the video information that has been buffered for the first time length, extracting the second time stamp of the video information;
determining whether a first time stamp corresponding to the second time stamp exists among the multiple first time stamps of the audio information;
if so, synthesizing the audio content information corresponding to the first time stamp that corresponds to the second time stamp and the video information that has been buffered for the first time length into an audio and video file; and
if not, adding prompt information indicating missing voice to the video information that has been buffered for the first time length.
Optionally, the method further includes:
if the smart device is disconnected from the first terminal for a second time length and then reconnected, receiving the audio information corresponding to the second time length sent by the first terminal, where the first time length is the maximum buffer duration of the smart device, the second time length is less than or equal to the first time length, and the audio information corresponding to the video information that has been buffered for the first time length includes the audio information corresponding to the second time length;
if the smart device is disconnected from the first terminal for a third time length and then reconnected, receiving the audio information corresponding to the most recent first time length within the third time length sent by the first terminal, where the third time length is greater than the first time length; and
if the smart device cannot reconnect to the first terminal after the disconnection, issuing prompt information indicating the disconnection.
Optionally, the audio information includes audio content information and audio redundant data, the audio redundant data being obtained by encoding the audio content information, and the method further includes:
if the audio content information is lost and the audio redundant data is not lost, decoding the audio redundant data to obtain data identical to the lost audio content information. The embodiments of the present disclosure provide an audio and video processing method applied to a first terminal, the method including:
sending the smart device a control message for controlling the smart device to record video information; and
recording audio information according to the control message, so that an audio and video file is synthesized from the audio information and the video information recorded by the smart device.
Optionally, the control message carries time information;
recording the audio information according to the control message includes:
if the time information is information on a delay time, after sending the control message, the first terminal waits for the delay time and then records the audio information.
Optionally, the control message carries time information;
recording the audio information according to the control message includes:
if the time information is a time point at which to perform audio recording, the first terminal records the audio information when the time point is reached.
Optionally, after the audio information is recorded, the method includes:
receiving the video information sent by the smart device; and
synthesizing the audio information and the video information into an audio and video file.
Optionally, after the audio information is recorded, the method includes:
sending the audio information to the smart device, so that the smart device synthesizes an audio and video file from the video information recorded by itself and the received audio information.
Optionally, after the audio information is recorded, the method includes:
sending the audio information to a second terminal, so that the second terminal synthesizes an audio and video file from the video information recorded by the smart device and the audio information.
Optionally, after the audio information is recorded, the method includes:
generating corresponding text information according to the audio information and synthesizing the video information, the audio information, and the text information into a subtitled audio and video file; or
generating corresponding text information according to the video information and synthesizing the video information, the audio information, and the text information into a subtitled audio and video file.
Optionally, the video information, the audio information, and the text information each include first time information; synthesizing the video information, the audio information, and the text information into a subtitled audio and video file includes:
synthesizing the video information, the audio information, and the text information into the subtitled audio and video file according to the first time information.
Optionally, the audio information includes audio content information, and after the audio information is recorded, the method includes:
sending the audio information including the audio content information to the smart device, so that the smart device synthesizes the video information buffered for a first time length and the corresponding audio information into an audio and video file.
Optionally, sending the audio information including the audio content information to the smart device includes:
if the first terminal is disconnected from the smart device for a second time length and then reconnected, sending the audio information corresponding to the second time length to the smart device, where the second time length is less than or equal to the first time length; and
if the first terminal is disconnected from the smart device for a third time length and then reconnected, sending the audio information corresponding to the most recent first time length to the smart device, where the third time length is greater than the first time length.
Optionally, sending the audio information including the audio content information to the smart device includes:
encoding the audio content information to obtain audio redundant data; and
sending the audio information including the audio content information and the audio redundant data to the smart device. The embodiments of the present disclosure provide an audio and video processing method applied to a second terminal, the method including:
obtaining video information and audio information, where the video information is collected by a smart device and the audio information is collected by a first terminal; and
synthesizing the video information and the audio information into an audio and video file;
generating corresponding text information according to the audio information and synthesizing the video information, the audio information, and the text information into a subtitled audio and video file; or
generating corresponding text information according to the video information; and
synthesizing the video information, the audio information, and the text information into a subtitled audio and video file.
Optionally, the video information, the audio information, and the text information each include first time information; synthesizing the video information, the audio information, and the text information into a subtitled audio and video file includes:
synthesizing the video information, the audio information, and the text information into the subtitled audio and video file according to the first time information.
Optionally, obtaining the text information corresponding to the audio information includes:
the text information being generated by the first terminal according to the audio information; or
the second terminal generating the corresponding text information according to the audio information.
Likewise, the text information may be generated by the smart device according to the video information; or
the second terminal may generate the corresponding text information according to the video information. The embodiments of the present disclosure provide an audio and video processing device applied to a smart device, the device including:
a receiving and sending module configured to receive a control message sent by a first terminal; and
a recording module configured to record video information according to the control message, so that an audio and video file is synthesized from the video information and the audio information recorded by the first terminal.
The embodiments of the present disclosure provide an audio and video processing device applied to a first terminal, the device including:
a receiving and sending module configured to send the smart device a control message for controlling the smart device to record video information; and
a recording module configured to record audio information according to the control message, so that an audio and video file is synthesized from the audio information and the video information recorded by the smart device.
The present disclosure provides an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a computer program is stored in the memory, and when the program is executed by the processor, the processor carries out the steps of the above method applied to the smart device.
The present disclosure provides an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a computer program is stored in the memory, and when the program is executed by the processor, the processor carries out the steps of the above method applied to the first terminal.
The present disclosure provides an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a computer program is stored in the memory, and when the program is executed by the processor, the processor carries out the steps of the above method applied to the second terminal.
The embodiments of the present disclosure provide a computer-readable storage medium storing a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device carries out the steps of the above method applied to the smart device.
The embodiments of the present disclosure provide a computer-readable storage medium storing a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device carries out the steps of the method applied to the first terminal.
The embodiments of the present disclosure provide a computer-readable storage medium storing a computer program executable by an electronic device; when the program runs on the electronic device, the electronic device carries out the steps of the method applied to the second terminal.
The embodiments of the present disclosure provide an audio and video processing method, device, electronic equipment, and storage medium, the method including: receiving a control message sent by a first terminal; and recording video information according to the control message, so that an audio and video file is synthesized from the video information and the audio information recorded by the first terminal. In the embodiments of the present disclosure, after the smart device receives the control message sent by the first terminal, it records the video information according to the control message while the audio information is recorded by the first terminal, so poor recorded-audio quality does not occur and the effect of the synthesized audio and video file is guaranteed.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a process in which a smart device synthesizes an audio and video file according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the present disclosure;
FIG. 6 is an electronic device provided by an embodiment of the present disclosure;
FIG. 7 is an electronic device provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a scene for synthesizing a subtitled audio and video file according to an embodiment of the present disclosure;
FIG. 9 is a schematic flowchart of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 10 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 11 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 12 is a signaling interaction diagram of another audio and video processing method provided by an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of another scene for synthesizing a subtitled audio and video file according to an embodiment of the present disclosure;
FIG. 14 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 15 is a schematic diagram of another scene for synthesizing a subtitled audio and video file according to an embodiment of the present disclosure;
FIG. 16 is a signaling interaction diagram of an audio and video processing method provided by an embodiment of the present disclosure;
FIG. 17 is a signaling interaction diagram of yet another audio and video processing method provided by an embodiment of the present disclosure;
FIG. 18 is a schematic flowchart of a specific implementation of the audio and video processing method provided by an embodiment of the present application;
FIG. 19 is a schematic flowchart of the specific steps of S130 in FIG. 18;
FIG. 20 is a schematic flowchart of another specific implementation of the audio and video processing method provided by an embodiment of the present application;
FIG. 21 is a schematic flowchart of the specific steps of S2220 in FIG. 20.
Detailed description
The present disclosure is further described in detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.
In the audio and video processing solutions provided by the embodiments of the present disclosure, the video information is collected by a smart device. It should be noted that the smart device can be any device, instrument, or machine with a video collection function. The smart device in the present disclosure may include a device with self-detection and self-diagnosis. It should be understood that the smart device in the present disclosure may be provided with a communication module through which it can communicate with a terminal or with another smart device. The communication may take the form of WiFi, infrared, Bluetooth, 4G, 5G, or the like, and the embodiments of the present disclosure are not limited thereto. In addition, the smart devices in the embodiments of the present disclosure include, but are not limited to, drones, unmanned vehicles, unmanned ships, handheld DVs, robots, and the like. In the following embodiments, the smart device is described as a drone.
Before the present disclosure, as drone technology became more and more mature, aerial photography also gained more and more favor. For example, a program on a performance stage can be recorded by a drone with a camera function to obtain a better shooting angle, and landscape scenery can be recorded by a drone flying in the air. While a drone is flying, its rotating blades and its working motor both emit sound. If a drone is used to record video while also collecting audio, the sound emitted by the drone itself is recorded as well; furthermore, the drone may be far from the sound source, so that the recorded audio is noisy while the sound of the source is faint.
Therefore, the embodiments of the present disclosure provide an audio and video processing method in which a first terminal records the audio information, a smart device such as a drone collects the video information, and the audio information and the video information are then synthesized to obtain an audio and video file. Because the sound is collected by the first terminal, the noise produced by a smart device such as a drone is avoided, high-quality audio information can be recorded, and the effect of the synthesized audio and video file is ensured.
FIG. 1 is a schematic diagram of the process of an audio and video processing method provided by an embodiment of the present disclosure; the process includes the following steps performed by the smart device:
S101: Receive a control message sent by a first terminal.
The method provided by this embodiment of the present disclosure is applied to a smart device; in this embodiment a drone is taken as an example of the smart device. The drone can exchange information with the first terminal; specifically, the drone is provided with a communication module having a communication function, and the information exchange with the first terminal is realized through this communication module.
The first terminal may be a mobile phone, a tablet, a smart wearable device, or the like. In order to send the control message to the drone, an APP (Application) for controlling the drone may be pre-installed on the first terminal. Specifically, the APP is provided with a control button; when it is detected that the control button is pressed, the APP sends a control message to the drone. The first terminal and the drone may configure the control message in advance: when the control button is detected as pressed, the APP on the first terminal sends a message in a preset format to the drone, and when the drone receives the message in the preset format, it confirms receipt of the control message sent by the first terminal and responds to it.
S102: According to the control message, record video information, so that an audio and video file is synthesized from the video information and the audio information recorded by the first terminal.
The first terminal may record the audio information through its own recording device, or through a separate audio recording device. For example, when the first terminal is a mobile phone or a tablet, the audio can be recorded directly by the phone or tablet, or by a separate device connected to the phone or tablet, such as a microphone or an earphone; the earphone may be connected to the phone or tablet wirelessly, for example via Bluetooth or WiFi, or by wire.
The drone is pre-configured with an image collection device with a video recording function, such as a gimbal camera. The first terminal and the drone may configure the control message in advance: when the control button is detected as pressed, the APP on the first terminal sends a message in a preset format to the drone, and when the drone receives the message in the preset format, it confirms receipt of the control message sent by the first terminal and records the video information, specifically through the image collection device on the drone.
In this embodiment of the present disclosure, after the drone receives the control message sent by the first terminal, it records the video information according to the control message while the audio information is recorded by the first terminal, so poor recorded-audio quality does not occur and the effect of the synthesized audio and video file is guaranteed.
It can be understood that, in the embodiments of the present disclosure, the drone is only an illustrative example of the smart device; in different scenes, the smart device can be something else. For example, the smart device may be a multifunctional product in the broad sense capable of video collection, a product that can transform from one device form to another by deformation or by adding or removing accessories: adding arms turns it into a drone or aircraft with a flight function, adding a wrist strap or bracket turns it into a handheld camera, and adding other accessories turns it into a land or water-surface device. To ensure that the smart device can adapt to various weather conditions and application scenes, the smart device can also be designed to be waterproof, heat-insulated, and frost- and snow-proof, which the present disclosure does not limit.
In order to ensure the accuracy of the synthesized audio and video file, on the basis of the above embodiment, in this embodiment of the present disclosure the control message carries time information.
Recording the video information according to the control message includes:
if the time information is information on a delay time, after receiving the control message, the smart device waits for the delay time and then records the video information.
Taking a drone as the smart device as an example, in order to realize synchronized recording of the audio information and the video information, it can be preset that the first terminal, after sending the control message, and the drone, after receiving it, wait a certain time before recording the audio or video information. Therefore, the control message sent by the first terminal to the drone may carry time information for the video recording, and this time information may be the delay time the drone needs to wait.
To make it convenient for the drone to learn the time information, so that the drone knows how long to delay before recording the video information, the user can input the time information through the APP installed on the first terminal, for example by selecting from the durations offered in the APP or by inputting via keyboard or voice. After the drone receives the control message carrying the delay-time information, the drone waits for the delay time and then records the video information.
For example, the first terminal sends the drone a control message carrying a delay time of 3 seconds; the first terminal records the audio information after waiting 3 seconds, and the drone records the video information 3 seconds after receiving the control message, thereby realizing synchronized recording of the audio information and the video information. As another example, the first terminal sends the drone a control message carrying a delay time of 10 minutes; after receiving the control message, the drone can start timing and begin collecting video information 10 minutes after receipt.
In another implementation, on the basis of the above embodiments, the control message carries time information.
Recording the video information according to the control message includes:
if the time information is a time point at which to perform video recording, the drone records the video information when the time point is reached.
In order to realize synchronized recording of the audio information and the video information, it can be preset that the drone and the first terminal both record the video information and the audio information at the same time point. Therefore, the control message sent by the first terminal to the drone may carry time information for the video recording, and this time information may be the time point at which the drone performs the video recording.
To make it convenient for the drone to know when to record the video information, the user can input the time information through the APP installed on the first terminal, for example by selecting from the time information offered in the APP or by inputting via keyboard or voice. After the drone receives the control message carrying the time point for video recording, the drone records the video information when the time point arrives.
For example, the user inputs the time point for video recording, say 8:00, in the APP of the first terminal. When the control button is detected as pressed, the first terminal sends the drone a control message carrying the video-recording time point of 8:00; the first terminal records the audio information when 8:00 arrives, and the drone, after receiving the control message, records the video information at 8:00 according to the time point carried in the message, thereby realizing synchronized recording of the audio information and the video information. It should be noted that the time point for video collection should be later than the time at which the control message is sent.
It should also be noted that the control message may carry no time information; in that case, video collection starts immediately after the drone receives the control message. Likewise, to ensure that the first terminal and the drone record synchronously, when the first terminal has sent the control message and the drone starts recording video according to it, the first terminal also starts recording audio at the same time.
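The delay-time and time-point variants can be sketched on the receiving side as follows, reusing the hypothetical message fields from the control-message sketch earlier in this document:

```python
import json, time

def handle_control_message(raw: bytes, start_recording) -> None:
    """Start recording according to the (hypothetical) control message:
    wait for `delay_ms`, wait until `start_at`, or start immediately."""
    msg = json.loads(raw)
    if "delay_ms" in msg:                                  # delay-time variant
        time.sleep(msg["delay_ms"] / 1000.0)
    elif "start_at" in msg:                                # time-point variant
        time.sleep(max(0.0, msg["start_at"] - time.time()))
    start_recording()                                      # otherwise: now
```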
In order to synthesize the audio and video file, on the basis of the above embodiments, after the video information is recorded, the method further includes the following steps performed by the smart device:
receiving the audio information sent by the first terminal; and
synthesizing the audio information and the video information into an audio and video file.
Taking a drone as the smart device as an example, the audio information and the video information can be synthesized on the drone. To enable the drone to synthesize them, the first terminal sends the recorded audio information to the drone; specifically, the first terminal transmits the recorded audio information to the drone through the communication module, and the drone receives the audio information sent by the first terminal and synthesizes it with the video information it recorded itself into an audio and video file.
The process by which the drone synthesizes the audio information and the video information into an audio and video file is the prior art and is not described in detail in the embodiments of the present disclosure.
In order to synthesize the audio and video file, on the basis of the above embodiments, after the video information is recorded, the method further includes the following step performed by the smart device:
sending the video information to the first terminal, so that the first terminal synthesizes an audio and video file from the audio information it recorded itself and the received video information.
In a specific implementation, the audio information and the video information can be synthesized on the first terminal. Specifically, taking a drone as the smart device, the drone sends the recorded video information to the first terminal, and the first terminal receives the video information and synthesizes it with the audio information it recorded itself into an audio and video file.
The process by which the first terminal synthesizes the audio information and the video information into an audio and video file is the prior art and is not described in detail in the embodiments of the present disclosure.
In order to synthesize the audio and video file, on the basis of the above embodiments, after the video information is recorded, the method further includes the following step performed by the smart device:
sending the video information to a second terminal, so that the second terminal synthesizes an audio and video file from the audio information recorded by the first terminal and the video information.
In a specific implementation, in order to reduce the computing pressure on the smart device and the first terminal, the audio and video file may be synthesized neither on the first terminal nor on the smart device but on a second terminal, which may be a mobile phone, a computer, a server, or the like. Both the smart device and the first terminal can transmit information to the second terminal. Specifically, the smart device sends the recorded video information to the second terminal, the first terminal sends the recorded audio information to the second terminal, and the second terminal receives the video information and the audio information and synthesizes them into an audio and video file.
The process by which the second terminal synthesizes the audio information and the video information into an audio and video file is the prior art and is not described in detail in the embodiments of the present disclosure.
Taking a drone as the smart device, FIG. 2 is a schematic diagram of the process of an audio and video processing method applied to a first terminal according to an embodiment of the present disclosure; the process includes the following steps:
S201: Send the drone a control message for controlling the drone to record video information.
The method provided by this embodiment of the present disclosure is applied to a first terminal, which can exchange information with the drone. The first terminal may be a user terminal with a sound pickup function, such as a mobile phone, a tablet, or a Bluetooth headset. In order to send the drone the control message for controlling the drone to record video information, an APP for controlling the drone may be pre-installed on the first terminal. Specifically, the APP is provided with a control button; when it is detected that the control button is pressed, the APP sends the drone the control message for controlling the drone to record video information. The first terminal and the drone may configure the control message in advance: when the control button is detected as pressed, the APP on the first terminal sends a message in a preset format to the drone, and when the drone receives the message in the preset format, it confirms receipt of the control message sent by the first terminal.
S202: According to the control message, record audio information, so that an audio and video file is synthesized from the audio information and the video information recorded by the drone.
The first terminal and the drone may configure the control message in advance: when the control button is detected as pressed, the APP on the first terminal sends a message in a preset format to the drone; when the drone receives the message in the preset format, it confirms receipt of the control message sent by the first terminal and records the video information, and the first terminal records the audio information according to the control message. Specifically, the first terminal can record the audio information through sound recording software.
The first terminal may record the audio information through its own recording device, or through a separate audio recording device. For example, when the first terminal is a mobile phone or a tablet, the audio can be recorded directly by the phone or tablet, or by a separate device connected to the phone or tablet, such as a microphone or an earphone; the earphone may be connected to the phone or tablet wirelessly, for example via Bluetooth or WiFi, or by wire.
In this embodiment of the present disclosure, after the first terminal sends the drone the control message for controlling the drone to record video information, the first terminal records the audio information according to the control message while the drone records the video information, so poor recorded-audio quality does not occur and the effect of the synthesized audio and video file is guaranteed.
In order to ensure the accuracy of the synthesized audio and video file, on the basis of the above embodiments, in this embodiment of the present disclosure the control message carries time information.
Recording the audio information according to the control message includes:
if the time information is information on a delay time, after sending the control message, the first terminal waits for the delay time and then records the audio information.
Taking a drone as the smart device as an example, in order to realize synchronized recording of the audio information and the video information, it can be preset that the first terminal, after sending the control message, and the drone, after receiving it, wait a certain time before recording the video or audio information. Therefore, the control message sent by the first terminal to the drone may carry time information for the video recording, which may be the delay time the drone needs to wait.
To make it convenient for the drone to learn the time information, so that the drone knows how long to delay before recording the video information, the user can input the time information through the APP installed on the first terminal, for example by selecting from the durations offered in the APP or by inputting via keyboard or voice. After the first terminal sends the control message carrying the delay-time information, it waits for the delay time and then records the audio information; after the drone receives the control message carrying the delay-time information, it waits for the delay time and then records the video information.
For example, the first terminal sends the drone a control message carrying a delay time of 3 seconds; the first terminal records the audio information after waiting 3 seconds, and the drone records the video information 3 seconds after receiving the control message, thereby realizing synchronized recording of the audio information and the video information.
In another implementation, on the basis of the above embodiments, the control message carries time information.
Recording the audio information according to the control message includes:
if the time information is a time point at which to perform audio recording, the first terminal records the audio information when the time point is reached.
In order to realize synchronized recording of the audio information and the video information, it can be preset that the drone and the first terminal both record the video information and the audio information at the same time point. Therefore, the control message sent by the first terminal to the drone may carry time information for the video recording, which may be the time point at which the drone performs the video recording.
To make it convenient for the drone to know when to record the video information, the user can input the time information through the APP installed on the first terminal, for example by selecting from the time information offered in the APP or by inputting via keyboard or voice. The first terminal records the audio information when the time point is reached; after the drone receives the control message carrying the time point for video recording, it records the video information when the time point is reached.
For example, the user selects the time point for video recording, 8:00, in the APP of the first terminal. When the control button is detected as pressed, the first terminal sends the drone a control message carrying the video-recording time point of 8:00; the first terminal records the audio information when 8:00 arrives, and the drone, after receiving the control message, records the video information at 8:00 according to the time point carried in the message, thereby realizing synchronized recording of the audio information and the video information.
Taking a drone as the smart device, in order to synthesize the audio and video file, on the basis of the above embodiments, after the audio information is recorded, the method includes the following steps performed by the first terminal:
receiving the video information sent by the drone; and synthesizing the audio information and the video information into an audio and video file.
In order to synthesize the audio and video file, the audio information and the video information can be synthesized on the first terminal. To enable the first terminal to synthesize them, the drone sends the recorded video information to the first terminal, and the first terminal receives the video information and synthesizes it with the audio information it recorded itself into an audio and video file.
The process by which the first terminal synthesizes the audio information and the video information into an audio and video file is the prior art and is not described in detail in the embodiments of the present disclosure.
Taking a drone as the smart device, in order to synthesize the audio and video file, on the basis of the above embodiments, after the audio information is recorded, the method further includes the following step performed by the first terminal:
sending the audio information to the drone, so that the drone synthesizes an audio and video file from the video information it recorded itself and the received audio information.
In a specific implementation, the audio information and the video information can be synthesized on the drone. Specifically, the first terminal sends the recorded audio information to the drone, and the drone receives the audio information and synthesizes it with the video information it recorded itself into an audio and video file.
The process by which the drone synthesizes the audio information and the video information into an audio and video file is the prior art and is not described in detail in the embodiments of the present disclosure.
For example, as shown in FIG. 3, taking a mobile phone as the first terminal and a drone as the smart device, a process by which the drone synthesizes an audio and video file is illustrated. After the mobile phone sends the control message to the drone, it records the audio information according to the control message and encodes the audio information to generate an Advanced Audio Coding (AAC) format file or a Pulse Code Modulation (PCM) format file, and transmits the generated AAC or PCM file to the drone through USB (Universal Serial Bus). After receiving the control message sent by the first terminal, the drone records the video information according to the control message and encodes it; the drone synthesizes the received audio information with the video information it recorded into an audio and video file in MP4 format, and stores the generated file for subsequent playback and other operations.
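On the drone side, the final MP4 muxing step of FIG. 3 could be delegated to an ffmpeg binary, as sketched below; this assumes ffmpeg is installed and that the AAC audio received over USB and the recorded video stream have been written to local files (all paths are placeholders).

```python
import subprocess

def mux_to_mp4(video_path: str, audio_path: str, out_path: str) -> None:
    """Wrap the recorded video and the received AAC audio into an MP4
    container without re-encoding (stream copy)."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", video_path,    # video recorded and encoded by the drone
         "-i", audio_path,    # AAC file received from the phone over USB
         "-c", "copy",        # remux only, no re-encoding
         out_path],
        check=True)

# mux_to_mp4("flight_video.h264", "voice.aac", "synthesized.mp4")
```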
Taking a drone as the smart device, in order to synthesize the audio and video file, on the basis of the above embodiments, after the audio information is recorded, the method further includes the following step performed by the first terminal:
sending the audio information to a second terminal, so that the second terminal synthesizes an audio and video file from the video information recorded by the drone and the audio information.
In a specific implementation, in order to reduce the computing pressure on the drone and the first terminal, the audio and video file may be synthesized neither on the first terminal nor on the drone but on a second terminal, which may be a mobile phone, a computer, a server, or the like. Specifically, the drone sends the recorded video information to the second terminal, the first terminal sends the recorded audio information to the second terminal, and the second terminal receives the video information and the audio information and synthesizes them into an audio and video file.
The process by which the second terminal synthesizes the audio information and the video information into an audio and video file is the prior art and is not described in detail in the embodiments of the present disclosure.
FIG. 4 is a structural diagram of an audio and video processing device applied to a smart device according to an embodiment of the present disclosure; the device includes:
a receiving and sending module 401 configured to receive a control message sent by a first terminal; and
a recording module 402 configured to record video information according to the control message, so that an audio and video file is synthesized from the video information and the audio information recorded by the first terminal.
Optionally, the control message carries time information, and the recording module 402 is specifically configured so that, if the time information is information on a delay time, the smart device, after receiving the control message, waits for the delay time and then records the video information.
Optionally, the control message carries time information, and the recording module 402 is specifically configured so that, if the time information is a time point at which to perform video recording, the smart device records the video information when the time point is reached.
Optionally, the receiving and sending module 401 is further configured to receive the audio information sent by the first terminal.
The device further includes:
a synthesis module 403 configured to synthesize the audio information and the video information into an audio and video file.
Optionally, the receiving and sending module 401 is further configured to send the video information to the first terminal, so that the first terminal synthesizes an audio and video file from the audio information it recorded itself and the received video information.
Optionally, the receiving and sending module 401 is further configured to send the video information to a second terminal, so that the second terminal synthesizes an audio and video file from the audio information recorded by the first terminal and the video information.
FIG. 5 is a structural diagram of an audio and video processing device applied to a first terminal according to an embodiment of the present disclosure; the device includes:
a receiving and sending module 501 configured to send the smart device a control message for controlling the smart device to record video information; and
a recording module 502 configured to record audio information according to the control message, so that an audio and video file is synthesized from the audio information and the video information recorded by the smart device.
Optionally, the control message carries time information, and the recording module 502 is specifically configured so that, if the time information is information on a delay time, the first terminal, after sending the control message, waits for the delay time and then records the audio information.
Optionally, the control message carries time information, and the recording module 502 is specifically configured so that, if the time information is a time point at which to perform audio recording, the first terminal records the audio information when the time point is reached.
Optionally, the receiving and sending module 501 is further configured to receive the video information sent by the smart device.
The device further includes:
a synthesis module 503 configured to synthesize the audio information and the video information into an audio and video file.
Optionally, the receiving and sending module 501 is further configured to send the audio information to the smart device, so that the smart device synthesizes an audio and video file from the video information it recorded itself and the received audio information.
Optionally, the receiving and sending module 501 is further configured to send the audio information to a second terminal, so that the second terminal synthesizes an audio and video file from the video information recorded by the smart device and the audio information.
On the basis of the above embodiments, an embodiment of the present disclosure further provides an electronic device 600. As shown in Fig. 6, it includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with one another via the communication bus 604.
The memory 603 stores a computer program that, when executed by the processor 601, causes the processor 601 to perform the following smart-device-side steps: receiving a control message sent by a first terminal; and recording video information according to the control message, so that an audio-video file is synthesized from the video information and the audio information recorded by the first terminal.
Optionally, the control message carries time information, and recording the video information according to the control message includes: if the time information is delay duration information, the smart device, after receiving the control message, waits for the delay duration before recording the video information; if the time information is a time point for video recording, the smart device records the video information when that time point is reached.
Optionally, after the video information is recorded, the method further includes: receiving the audio information sent by the first terminal; and synthesizing the audio information and the video information into an audio-video file.
Optionally, after the video information is recorded, the method further includes: sending the video information to the first terminal, so that the first terminal synthesizes an audio-video file from the audio information it recorded itself and the received video information.
Optionally, after the video information is recorded, the method further includes: sending the video information to a second terminal, so that the second terminal synthesizes an audio-video file from the audio information recorded by the first terminal and the video information.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is drawn in the figure, but this does not mean there is only one bus or one type of bus.
The communication interface 602 is used for communication between the above electronic device and other devices.
The memory 603 may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory 603 may also be at least one storage device located remotely from the aforementioned processor 601.
The above processor 601 may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
On the basis of the above embodiments, an embodiment of the present disclosure further provides an electronic device 700. As shown in Fig. 7, it includes a processor 701, a communication interface 702, a memory 703, and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 communicate with one another via the communication bus 704.
The memory 703 stores a computer program that, when executed by the processor 701, causes the processor 701 to perform the following first-terminal-side steps: sending a smart device a control message instructing the smart device to record video information; and recording audio information according to the control message, so that an audio-video file is synthesized from the audio information and the video information recorded by the smart device.
Optionally, the control message carries time information, and recording the audio information according to the control message includes: if the time information is delay duration information, the first terminal, after sending the control message, waits for the delay duration before recording the audio information; if the time information is a time point for audio recording, the first terminal records the audio information when that time point is reached.
Optionally, after the audio information is recorded, the method includes: receiving the video information sent by the smart device; and synthesizing the audio information and the video information into an audio-video file.
Optionally, after the audio information is recorded, the method further includes: sending the audio information to the smart device, so that the smart device synthesizes an audio-video file from the video information it recorded itself and the received audio information.
Optionally, after the audio information is recorded, the method further includes: sending the audio information to a second terminal, so that the second terminal synthesizes an audio-video file from the video information recorded by the smart device and the audio information.
The communication bus 704 mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The communication bus 704 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is drawn in the figure, but this does not mean there is only one bus or one type of bus.
The communication interface 702 is used for communication between the above electronic device and other devices.
The memory 703 may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory 703 may also be at least one storage device located remotely from the aforementioned processor.
The above processor 701 may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
On the basis of the above embodiments, in which the first terminal captures the audio information, a smart device such as a drone captures the video information, and the two are synthesized to ensure the quality of the resulting audio-video file, this embodiment may, to let the audio-video file express its content more clearly, further obtain corresponding text information (i.e., subtitles) from the audio information and finally synthesize the audio information, the video information, and the text information into an audio-video file with subtitles. Because the sound is captured by the first terminal, this subtitled audio-video file avoids the noise produced by the smart device and contains high-quality audio information, and the displayed text makes the content of the file clearer still.
Optionally, the corresponding text information may instead be obtained from the video information, and the video information, the audio information, and the text information are then synthesized into an audio-video file with subtitles.
Fig. 8 is a schematic diagram of a scenario for synthesizing an audio-video file with subtitles according to an embodiment of the present disclosure. As shown in Fig. 8, the scenario includes a drone, a mobile phone, and a user. The drone has a video recording function and can communicate with the phone. The phone is used to capture the user's speech; to capture the user's audio information more clearly, the phone may be placed near the user. It should be noted that using a phone for audio capture is only one feasible implementation; it may be replaced by another electronic device with an audio recording function, for example a tablet, a voice recorder, or a smart wearable device such as a Bluetooth headset. The device that records the audio may be called the first terminal. After obtaining the video information, the audio information, and the text information, the drone synthesizes them into an audio-video file with subtitles.
Fig. 9 is a schematic flowchart of an audio-video processing method according to an embodiment of the present disclosure. As shown in Fig. 9, the method is applied to a smart device such as a drone; it should be noted that the drone has a video capture function. The method includes:
Step 210: capturing video information, and obtaining audio information and text information corresponding to the audio information, where the audio information is captured by a first terminal.
In a specific implementation, taking a drone as an example, when using the drone to capture video information, video recording parameters may be set on the drone in advance, after which the drone takes off and records video according to the set parameters. Alternatively, the user may control the drone remotely: for example, the first terminal may establish a communication connection with the drone and send it a control message carrying the video recording parameters, thereby controlling the drone. Optionally, the drone is provided with a communication module through which it communicates with the first terminal.
To obtain higher-quality audio information, the first terminal may be placed near the sound source and used to capture the source's audio information. The first terminal may send the recorded audio information to the drone. It should be noted that the audio information captured by the first terminal may be synchronized with the video information captured by the drone.
In addition, the text information is generated from the audio information; the generation step may be performed on the first terminal, on the drone, or on a second terminal. That is, the drone may generate the corresponding text information after receiving the audio information sent by the first terminal, or the second terminal may generate it after receiving the audio information. The drone may also receive both the audio information and the text information from the first terminal; in that case, the first terminal generates the corresponding text information after capturing the audio information and sends both to the drone. The specific way of generating text information from audio information is described in detail in the following embodiments.
Step 220: synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles.
In a specific implementation, after obtaining the video information, the audio information, and the text information, the drone synthesizes them to obtain an audio-video file with subtitles.
It should be noted that the video information includes multiple frames of video images, and synthesizing the video information with the text information means adding the text information to the corresponding frames of video images.
In the embodiments of the present disclosure, the first terminal captures the audio information, the drone captures the video information, the corresponding text information is generated from the audio information, and finally the three are synthesized. On the one hand, this guarantees the quality of both the audio and the video; on the other hand, the text information gives the user a more accurate rendering of the audio, making the audio-video easier to understand.
There are several ways to generate the corresponding text information from the audio information. One feasible implementation, shown in Fig. 10, includes:
S301, audio preprocessing: obtain the parameter information of the audio, which includes at least the number of channels, the encoding scheme, and the sampling rate, and convert it to a standard format, for example mono, a 16000 Hz sampling rate, and WAV encoding.
S302, noise reduction: take the first 0.5 seconds of the audio as a noise sample, divide the noise sample into frames with a Hanning window and compute each frame's intensity value to obtain the noise-gate threshold; then divide the whole audio into frames with a Hanning window and compute each frame's intensity value to obtain the audio signal intensity values; finally compare the audio signal intensity values with the noise-gate threshold frame by frame, keeping the audio whose intensity exceeds the threshold, which yields the denoised audio file.
S303, audio segmentation: apply dual-threshold voice endpoint detection to the denoised audio to cut it at the endpoints into usable audio segments; the parts that do not meet the thresholds are treated as silence or noise and are not processed.
S304, segment recognition: further split the selected audio samples according to the default minimum silence length and minimum effective sound duration to obtain a series of speech segments, then pass the segments to third-party speech recognition software for speech recognition and assemble the text information corresponding to the whole audio.
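The noise-gate portion of this pipeline (S301-S302) can be sketched in a few lines of Python with NumPy. This is one illustrative reading of the steps, not the patent's code; the frame length and the use of per-frame RMS as the "intensity value" are assumptions:

    import numpy as np

    def noise_gate(samples: np.ndarray, sr: int = 16000,
                   frame_len: int = 512, noise_head_s: float = 0.5) -> np.ndarray:
        """Keep only frames louder than a gate learned from the leading noise.

        The first 0.5 s is treated as a noise sample, frames are
        Hanning-windowed, and per-frame RMS intensity is compared
        against the loudest noise frame.
        """
        window = np.hanning(frame_len)

        def frame_rms(x: np.ndarray) -> np.ndarray:
            n = len(x) // frame_len * frame_len
            frames = x[:n].reshape(-1, frame_len) * window
            return np.sqrt((frames ** 2).mean(axis=1))

        noise = samples[: int(noise_head_s * sr)]
        gate = frame_rms(noise).max()          # noise-gate threshold

        rms = frame_rms(samples)               # per-frame intensity values
        keep = rms > gate                      # frame-by-frame comparison
        mask = np.repeat(keep, frame_len)
        return samples[: len(mask)][mask]      # denoised audio samples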
It should be noted that each audio segment has a corresponding timestamp, so after conversion the text information carries the same timestamp, and the timestamps allow the audio information and the text information to be aligned in time.
It should be noted that dialect and foreign-language translation libraries may be built in advance, so that the corresponding text information can still be generated when the audio is in a dialect or a foreign language.
By splitting the audio information twice before text recognition, the embodiments of the present disclosure obtain relatively accurate text information.
On the basis of the above embodiments, when the first terminal and the drone capture audio and video synchronously, the capture time may be recorded, so both the video information and the audio information include first time information. Moreover, since the text information is generated from the audio information, the obtained text information also includes the first time information.
For example, when recording a stage play, the drone captures video of the performers on stage while the first terminal synchronously captures their audio. Since the first time information represents the absolute time of capture, and the video and audio are captured synchronously, at synthesis time the video information, the audio information, and the text information can be aligned at the same time points according to the first time information, so that the resulting subtitled audio-video file is synchronized in time.
By synthesizing the audio information, the video information, and the text information according to the first time information, the embodiments of the present disclosure keep the three synchronized during playback, avoiding timing mismatches among the played video, audio, and subtitles.
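A minimal sketch of this alignment, assuming each subtitle cue carries the absolute capture time (the first time information) and that the video's own start time is known; the SubtitleCue type and field names are illustrative:

    from dataclasses import dataclass

    @dataclass
    class SubtitleCue:
        start: float   # first-time-information timestamp, epoch seconds
        end: float
        text: str

    def align_cues(cues: list, video_start: float) -> list:
        """Shift cues from absolute capture time to video-relative time.

        Because audio and video are captured synchronously, subtracting
        the video's first-time-information start puts all three tracks
        on one timeline.
        """
        return [SubtitleCue(c.start - video_start, c.end - video_start, c.text)
                for c in cues]

    cues = [SubtitleCue(1_700_000_010.0, 1_700_000_012.5, "hello")]
    print(align_cues(cues, video_start=1_700_000_000.0))  # cue at 10.0-12.5 s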
In another embodiment, for scenes in which the video shows a person speaking, such as a song or spoken-word act at a gala, or a news broadcast, the speaker's mouth movements, voice, and subtitles need to play in sync. After the drone captures the video information, it may obtain the multiple frames of video images in the video information and run recognition on them to obtain the mouth-shape variation features of the person in the video. It should be noted that, before recognizing the frames, the frames may be partitioned so as to obtain the frames corresponding to each word spoken by the person in the video.
After the mouth-shape variation features are obtained, the corresponding text can be derived from them. It should be noted that a text recognition model may be built in advance to analyze the mouth-shape variation features and output the corresponding text.
It should be noted that the main purpose of deriving text from the mouth-shape variation features is to align the video information, the audio information, and the text information for synthesis. The text corresponding to the mouth-shape variation features can therefore be used to synthesize the video information, the audio information, and the text information into an audio-video file with subtitles.
By obtaining the words a person speaks from the mouth-shape variation features in the video and aligning the video information, the audio information, and the text information accordingly, the embodiments of the present disclosure keep the video, audio, and subtitles synchronized in time when the synthesized file is played.
In another embodiment, Fig. 11 is a signaling interaction diagram of an audio-video processing method according to an embodiment of the present disclosure. As shown in Fig. 11, the participants are a drone and a first terminal, and the method includes:
S401: the drone captures video information; a drone with a video recording function captures the video information;
S402: the first terminal captures audio information; a first terminal with an audio recording function captures the audio information; it should be noted that S401 and S402 may proceed simultaneously;
S403: the first terminal sends the audio information to the drone; the first terminal sends the captured audio information to the drone; it should be noted that the first terminal has a communication connection with the drone;
S404: text information is generated from the audio information; after receiving the audio information sent by the first terminal, the drone generates the corresponding text information from it;
S405: synthesis; the drone synthesizes the video information, the audio information, and the text information into an audio-video file with subtitles.
In this embodiment, the first terminal sends the captured audio information to the drone, and the drone generates the corresponding text information and synthesizes the captured video information, the received audio information, and the generated text information. On the one hand, this guarantees clear video information and audio information at the same time; on the other hand, subtitles let the viewer understand the audio more clearly, avoiding the problem of the user failing to grasp the correct meaning when the captured audio is in a dialect or a foreign language.
Fig. 12 is a signaling interaction diagram of another audio-video processing method according to an embodiment of the present disclosure. As shown in Fig. 12, the participants are a drone and a first terminal, and the method includes:
S501: the drone captures video information; a drone with a video recording function captures the video information;
S502: the first terminal captures audio information; a first terminal with an audio recording function captures the audio information; it should be noted that S501 and S502 may proceed simultaneously;
S503: text information is generated from the audio information; after capturing the audio information, the first terminal generates the corresponding text information from it;
S504: the first terminal sends the audio information and the text information to the drone; the first terminal sends the captured audio information and the generated text information to the drone; it should be noted that the first terminal has a communication connection with the drone;
S505: synthesis; the drone synthesizes the video information, the audio information, and the text information into an audio-video file with subtitles.
In this embodiment, the first terminal captures the audio information, the drone captures the video information, the corresponding text information is generated from the audio information, and the three are then synthesized. On the one hand, a high-quality audio-video file is obtained; on the other hand, the subtitles help the user understand the audio information better.
Fig. 13 is a schematic diagram of a scenario for synthesizing an audio-video file with subtitles according to an embodiment of the present disclosure. As shown in Fig. 13, the scenario includes a drone, a mobile phone, and a user. The drone has a video recording function and can communicate with the phone; the phone is used to capture the user's speech, and to capture the user's audio information more clearly it may be placed near the user. It should be noted that using a phone for audio capture is only one feasible implementation; it may be replaced by another electronic device with an audio recording function, for example a tablet, a voice recorder, or a smart wearable device such as a Bluetooth headset. The device that records the audio may be called the first terminal. After capturing the audio information, the phone generates the corresponding text information, receives the video information sent by the drone, and then synthesizes the video information, the audio information, and the text information into an audio-video file with subtitles.
Fig. 14 is a signaling interaction diagram of an audio-video processing method according to an embodiment of the present disclosure. As shown in Fig. 14, the method is applied to a first terminal, which may be a mobile phone, a tablet, or another electronic device with a recording function, and the method includes:
S701: capturing video information; to obtain clearer and broader video information, a drone is used to capture the video. The recording operations of the first terminal and the drone may proceed synchronously. It should be noted that the drone may send the whole video to the first terminal after capture is complete, may stream the captured video to the first terminal in real time, or may send the captured video to the first terminal at preset intervals.
S702: capturing audio information; to capture clearer audio information, the first terminal used for audio capture may be placed near the sound source.
S703: generating text information; after the first terminal captures the audio information, it generates the corresponding text information from it, using the same method as in Fig. 10 above, which is not repeated here. The first terminal may generate the text information after the entire audio has been captured, or may generate it from the captured audio in real time.
S704: receiving the video information sent by the drone; after capturing the video information, the drone sends it to the first terminal. It should be noted that the first terminal and the drone may establish a communication connection in advance, and the video information may be transmitted over a wireless signal.
S705: synthesizing the audio-video file with subtitles; the first terminal synthesizes the audio information, the text information, and the video information into an audio-video file with subtitles.
At synthesis time, to keep the video, audio, and subtitles synchronized, the video information, the audio information, and the text information all include first time information: the first time information in the video information is the time point at which the drone recorded it, the first time information in the audio information is the time point at which the first terminal recorded it, and the drone and the first terminal record at the same time. In addition, the text information is generated from the audio information and is therefore synchronized in time with it. When synthesizing, the first terminal aligns the video information, the audio information, and the text information at the same time points according to the first time information, thereby obtaining a synchronized subtitled audio-video file.
In this embodiment, the first terminal captures the audio information and generates the corresponding text information from it, the drone captures the video information, and the first terminal synthesizes the received video information, the captured audio information, and the generated text information into an audio-video file with subtitles. While a high-quality audio-video file is guaranteed, the added subtitles also let the user understand the audio information correctly.
On the basis of the above embodiments, before receiving the video information sent by the drone, the first terminal also sends the drone a control message, so that the drone captures the video information according to the control message.
In a specific implementation, when using the drone to record video, the recording start time, recording parameters, and so on may be set on the drone in advance, after which the drone records video according to the set parameters. Alternatively, the recording may be driven by communication between the first terminal and the drone: before recording, an APP capable of controlling the drone may be installed on the first terminal, the user sends the drone a control message through the APP, and the drone, upon receiving the control message, performs the corresponding operation.
Optionally, the control message may simply be "start recording video", upon receipt of which the drone immediately starts recording. The control message may also include time information, which may be a start time point or a delay duration; controlling the drone's video recording through the time information keeps the recording functions of the first terminal and the drone synchronized in time, making it easier to align the audio, video, and text during synthesis. The control message may further include other parameters needed for video recording, such as focal length and brightness settings.
Fig. 15 is a schematic diagram of a scenario for synthesizing an audio-video file with subtitles according to another embodiment of the present disclosure. As shown in Fig. 15, the scenario includes a drone, a mobile phone, a server (a third party, such as the second terminal above), and a user. The drone has a video recording function and can communicate with the phone; the phone is used to capture the user's speech, and may be placed near the user to capture the audio information more clearly; the server is used to synthesize the subtitled audio-video file. It should be noted that using a phone for audio capture is only one feasible implementation; it may be replaced by another device with an audio recording function, such as a tablet, a voice recorder, or a smart wearable device such as a Bluetooth headset. The device that records the audio may be called the first terminal. After capturing the audio information, the phone may generate the corresponding text information and send both the audio information and the text information to a third party such as the server, or may send only the audio information and let the server generate the text information.
Fig. 16 is a signaling interaction diagram of an audio-video processing method according to an embodiment of the present disclosure. As shown in Fig. 16, the participants are a first terminal, a drone, and a server, and the method includes:
S901: capturing audio information; the first terminal captures the audio information, and to obtain clearer audio information it may be placed near the sound source.
S902: capturing video information; the drone captures the video information; the drone may establish a communication connection with the first terminal in advance, receive the control message sent by the first terminal, and capture the video information according to the control message. It should be noted that S901 and S902 may proceed synchronously.
S903: sending the audio information; after capturing the audio information, the first terminal sends it to the server. It should be noted that the first terminal may send the captured audio information to the server in real time, or may send it only after all of the audio has been captured.
S904: sending the video information; the drone sends the captured video information to the server. It should be noted that the drone may send the captured video information to the server in real time, or may send it only after all of the video has been captured. It should also be noted that S903 and S904 may proceed synchronously.
S905: generating text information; after receiving the audio information, the server generates the corresponding text information from it. There are several specific ways to generate the text information; for example, the method may be the same as in the above embodiments and is not repeated here.
S906: synthesizing the audio-video file with subtitles; after generating the text information, the server synthesizes the audio information, the text information, and the video information into an audio-video file with subtitles.
In this embodiment, the first terminal captures the audio information and the server generates the corresponding text information from it, the drone captures the video information, and the server synthesizes the video information, the audio information, and the text information. While guaranteeing a high-quality audio-video file, this also reduces the load on the first terminal and the drone, and the drone does not need an audio-video synthesis capability, placing lower demands on it.
Fig. 17 is a signaling interaction diagram of yet another audio-video processing method according to an embodiment of the present disclosure. As shown in Fig. 17, the method includes:
S1001: capturing audio information and generating text information; the first terminal captures the audio information, and may be placed near the sound source to capture it clearly. After capturing the audio information, the first terminal generates the corresponding text information from it. It should be noted that the generation method may be the same as in the above embodiments and is not repeated here.
S1002: capturing video information; the drone may establish a communication connection with the first terminal in advance, receive the control message sent by the first terminal, and then start capturing the video information.
S1003: sending the audio information and the text information; the first terminal sends the audio information and the text information to the server.
S1004: sending the video information; the drone sends the captured video information to the server. It should be noted that the drone may also establish a communication connection with the server in advance.
S1005: synthesizing the audio-video file with subtitles; the server synthesizes the received audio information, video information, and text information into an audio-video file with subtitles.
In this embodiment, the first terminal captures the audio information and generates the corresponding text information from it, the drone captures the video information, and the server synthesizes the video information, the audio information, and the text information. While guaranteeing a high-quality audio-video file, this also reduces the load on the first terminal and the drone, and the drone does not need an audio-video synthesis capability, placing lower demands on it.
In another embodiment, for scenes in which the video shows a person who is speaking, such as a song or spoken-word act at a gala, or a news broadcast, the speaker's mouth movements, voice, and subtitles need to play in sync. After the drone captures the video information, it may obtain the multiple frames of video images in the video information and run recognition on them to obtain the mouth-shape variation features of the person in the video. It should be noted that, before recognizing the frames, the frames may be partitioned so as to obtain the frames corresponding to each word spoken by the person in the video.
After the mouth-shape variation features are obtained, the corresponding text can be derived from them. It should be noted that a text recognition model may be built in advance to analyze the mouth-shape variation features and output the corresponding text.
It should be noted that the main purpose of deriving text from the mouth-shape variation features is to align the video information, the audio information, and the text information for synthesis. The text corresponding to the mouth-shape variation features can therefore be used to synthesize the video information, the audio information, and the text information into an audio-video file with subtitles.
By recognizing the mouth shapes of the person speaking in the video, the present disclosure learns what the person says and synthesizes the video information, the audio information, and the text information accordingly, thereby keeping the audio, video, and text synchronized in time.
In one implementation, both the first terminal's audio capture and the drone's video capture may be controlled through the server: an APP capable of controlling both the first terminal's audio capture and the drone's video capture is installed on the server, and when the first terminal and the drone need to capture at the same time, control messages may be sent to both simultaneously; upon receiving the control messages, the first terminal and the drone start capturing audio and video. Of course, other smart devices may also be used to control the first terminal and the drone.
It should be understood that, corresponding to the method embodiment of Fig. 9 above, this embodiment further provides an audio-video processing apparatus capable of performing the steps involved in that method embodiment. The apparatus may be a module, program segment, or code on an electronic device; for its specific functions, refer to the description above, and detailed description is appropriately omitted here to avoid repetition. The apparatus includes a receiving/sending module, configured to capture video information and obtain audio information and the text information corresponding to the audio information, where the audio information is captured by a first terminal; and a synthesis module, configured to synthesize the video information, the audio information, and the text information into an audio-video file with subtitles.
Corresponding to the method embodiment of Fig. 14 above, this embodiment further provides an audio-video processing apparatus capable of performing the steps involved in that method embodiment. The apparatus may be a module, program segment, or code on an electronic device; for its specific functions, refer to the description above, and detailed description is appropriately omitted here to avoid repetition. The apparatus includes a receiving/sending module, configured to capture audio information, generate the corresponding text information from the audio information, and receive the video information sent by the drone (smart device); and a synthesis module, configured to synthesize the video information, the audio information, and the text information into an audio-video file with subtitles.
Corresponding to the method embodiment of Fig. 16 above, this embodiment further provides an audio-video processing apparatus capable of performing the steps involved in that method embodiment. The apparatus may be a module, program segment, or code on an electronic device; for its specific functions, refer to the description above, and detailed description is appropriately omitted here to avoid repetition. The apparatus includes a receiving/sending module, configured to obtain video information, audio information, and the text information corresponding to the audio information, where the video information is captured by a drone (smart device) and the audio information is captured by a first terminal; and a synthesis module, configured to synthesize the video information, the audio information, and the text information into an audio-video file with subtitles.
By adding subtitles at audio-video synthesis time, this embodiment helps the user understand the audio.
In the audio-video processing provided by the embodiments of the present disclosure, the first terminal and the smart device (e.g., a drone) can communicate with each other to transfer information. For example, the first terminal may capture audio information (also called audio data) and transmit it to the drone. There are several ways for the first terminal and the drone to transfer information; examples follow.
The drone may communicate wirelessly with the first terminal's wireless module through a WiFi module; the first terminal's wireless module may be a WiFi module or a 4G module, among others. Optionally, the drone may also communicate with the first terminal via a ground relay: the drone communicates through its own high-power WiFi module with the ground relay's high-power WiFi module, which in turn communicates with the first terminal.
It will be appreciated that, besides WiFi, the drone and the first terminal may also communicate through other short-range wireless technologies such as Bluetooth or ZigBee; the specific communication method between the drone and the first terminal should not be construed as limiting the present disclosure.
Fig. 18 is a schematic flowchart of a specific implementation of information transfer between the first terminal and the drone according to an embodiment of the present disclosure. The method may be performed by a camera-equipped smart device such as a drone, and specifically includes S110 to S130:
S110, capturing video information, and buffering the video information in a storage device.
The video information may include at least one second timestamp and video content information corresponding to each second timestamp.
The smart device may shoot and capture the video content information through its own camera, and may add a corresponding second timestamp to the video content information according to the time of shooting. It may then buffer the video information, comprising the video content information and the second timestamps, in a storage device, i.e., a memory for buffered data such as a Random Access Memory (RAM).
S120, receiving audio information corresponding to the video information from a first terminal (e.g., a user terminal), and buffering the audio information in the storage device.
The audio information includes at least one first timestamp and audio content information corresponding to each first timestamp. The first terminal may capture the audio content information and add a corresponding first timestamp to it according to the time of capture.
Audio information corresponding to the video information may mean that the shooting time of the video's second timestamp equals the capture time of the audio's first timestamp, or that the two differ by a fixed length of time, for example 1 second or 0.5 seconds: the shooting time of the second timestamp may be 1 second earlier than the capture time of the first timestamp, or 0.5 seconds later than it.
If the smart device and the first terminal remain well connected, the smart device can continuously shoot video content information, add the corresponding second timestamps, and buffer the resulting video information in the storage device; it can also continuously receive audio information from the first terminal and buffer it in the storage device, waiting to synthesize an audio-video file with the buffered video information.
Optionally, if the smart device reconnects with the first terminal after being disconnected for a second time length, it receives from the first terminal the audio information corresponding to the second time length.
Here, the first time length is the smart device's maximum buffering duration, the second time length is less than or equal to the first time length, and the audio information corresponding to video information that has been buffered for the first time length includes the audio information corresponding to the second time length.
For ease of description, take the first time length (the maximum video buffering duration) as 10 seconds and the second time length as 5 seconds:
A maximum video buffering duration of 10 seconds means that video is synthesized with its corresponding audio information 10 seconds after the moment of capture. For example, if a piece of video is captured at second 0, it is synthesized with the audio at second 10.
If the smart device is disconnected from the first terminal for 5 seconds and reconnects afterwards, the first terminal not only sends the new audio information captured from the reconnection onward, but also sends the audio information captured during the 5 seconds of disconnection.
The 5 seconds of disconnection may be any 5-second span within the 10-second period from second 0 to second 10: for example, from second 0 to second 5, from second 3 to second 8, or from second 5 to second 10.
When second 10 is reached, the video captured at second 0 starts being synthesized with its corresponding audio information; since the smart device's 5-second disconnection from the first terminal lies within the period from second 0 to second 10, this disconnection does not affect the synthesis of the audio-video file.
When the smart device reconnects with the first terminal after a disconnection no longer than the maximum video buffering duration, the first terminal can still, upon reconnection, send the audio that failed to be delivered during the disconnection, so that the smart device can synthesize the audio-video file; this makes the synthesis of audio-video files more stable.
Optionally, if the smart device reconnects with the first terminal after being disconnected for a third time length, it receives from the first terminal the audio information corresponding to the most recent first time length within the third time length, where the third time length is greater than the first time length.
Let the third time length be 15 seconds, continuing the example above:
Suppose the 15 seconds of disconnection run from second 0 to second 15. When second 10 is reached, the smart device is still disconnected from the first terminal, so the video captured at second 0 has no corresponding audio information; since the maximum video buffering duration is 10 seconds, that video generates a silent video file. Likewise, the video captured at second 1, second 2, second 3, and so on each generates a silent video file when no corresponding audio information is available. The video captured at second 5, however, would be synthesized with its corresponding audio information at second 15, which is exactly when the smart device and the first terminal reconnect; the first terminal can therefore send the smart device the audio information for the most recent period equal to the first time length (the 10-second period from second 5 to second 15), so that the smart device can run synthesis on the video information still in the buffer.
When the smart device reconnects with the first terminal after a disconnection longer than the maximum video buffering duration, the first terminal can, upon reconnection, send the most recent audio information, whose duration may equal the maximum video buffering duration, so that the smart device synthesizes as many audio-video files with sound as possible.
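The terminal-side decision for both reconnection cases can be sketched as follows; the chunk buffer layout ((timestamp, chunk) pairs) is an assumption:

    def audio_to_resend(disconnect_s: float, max_cache_s: float,
                        buffered_audio: list) -> list:
        """Decide which buffered audio chunks to send after a reconnect.

        buffered_audio is a list of (timestamp, chunk) pairs the
        terminal kept while the link was down. If the outage was
        shorter than the drone's video cache window, everything missed
        can still be merged; otherwise only the most recent
        cache-window's worth is useful.
        """
        if not buffered_audio:
            return []
        if disconnect_s <= max_cache_s:
            return buffered_audio                  # second-time-length case
        newest = buffered_audio[-1][0]
        cutoff = newest - max_cache_s              # third-time-length case
        return [(t, c) for (t, c) in buffered_audio if t >= cutoff]

With max_cache_s = 10, a 5-second outage returns everything missed (the second-time-length case), while a 15-second outage returns only the newest 10 seconds of audio (the third-time-length case).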
S130, synthesizing the video information in the storage device that has been buffered for the first time length with the corresponding audio information into an audio-video file.
In this embodiment, after capturing the video information, the smart device does not immediately synthesize it with the corresponding audio information; it first buffers the video information for a period of time and then synthesizes the buffered video information with the corresponding audio information, which is sent over by the first terminal. If the smart device disconnects from and reconnects to the first terminal while the video information is buffered, the synthesis of the audio information and the video information is unaffected, making the synthesis of audio-video files more stable and improving on the poor sound signals of the prior art.
Referring to Fig. 19, which shows the specific steps of S130, the synthesis of the audio-video file may include the following S131 to S134:
S131, for video information that has been buffered for the first time length, extracting the second timestamp of the video information.
Extracting the second timestamp of video information that has been buffered for the first time length yields the capture time corresponding to the video content information included in that video information.
S132, determining whether a first timestamp corresponding to the second timestamp exists among the at least one first timestamp of the audio information; if so, performing S133; if not, performing S134.
A first timestamp corresponding to the second timestamp may mean that the capture time of the second timestamp equals that of the first timestamp, or that the capture time of the second timestamp is a fixed duration earlier, or a fixed duration later, than that of the first timestamp.
After obtaining the second timestamp of the video information that has been buffered for the first time length, the smart device searches its storage device for a first timestamp corresponding to that second timestamp. If one is found, the video information has corresponding audio information to synthesize with, and S133 is performed; if none is found, the video information has no corresponding audio information to synthesize with, and S134 is performed.
S133, synthesizing the audio content information corresponding to the first timestamp that corresponds to the second timestamp with the video information buffered for the first time length into an audio-video file.
Each of the at least one first timestamp has its own corresponding moment, and likewise each of the at least one second timestamp. The correspondence between timestamps can be used to match audio content information with video content information, so that even if the video is not synthesized with the audio in real time, the resulting audio-video file is still synchronized.
S134, adding a speech-missing prompt to the video information buffered for the first time length.
If no first timestamp corresponding to the second timestamp exists in the audio information, the audio information corresponding to the currently buffered video information is not in the buffer and no audio-video file can be synthesized; a speech-missing prompt may then be added to the video information so that it is clearly distinguished from video passages that have audio, making it easy to filter and process speech-missing video later.
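Steps S131-S134 amount to a timestamp lookup with a fallback. A minimal sketch, assuming the audio buffer is indexed by first timestamp and that any fixed offset between the two clocks has already been folded into the keys; mux() is a hypothetical stand-in for the actual synthesis routine:

    def mux(video_bytes, audio_bytes):
        # Hypothetical stand-in for the real muxing routine (see the
        # ffmpeg sketch earlier); here it just pairs the two payloads.
        return {"video": video_bytes, "audio": audio_bytes}

    def synthesize_due_video(video_item: dict, audio_index: dict):
        """Handle one video chunk whose cache window has just expired."""
        ts = video_item["second_timestamp"]           # S131: extract timestamp
        audio = audio_index.get(ts)                   # S132: look up a match
        if audio is not None:
            return mux(video_item["content"], audio)  # S133: synthesize
        video_item["note"] = "audio missing"          # S134: tag the gap
        return video_item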
In a specific implementation, if the smart device cannot reconnect with the first terminal after the disconnection, it issues a disconnection prompt.
Optionally, if the smart device is disconnected from the first terminal for longer than a preset duration, it may be determined that the smart device and the first terminal cannot reconnect. The disconnection prompt is information reminding the smart device's operator of the broken connection; it may be a flashing light on the smart device's body or an audible signal. The smart device can issue the disconnection prompt so that the operator notices it and takes remedial measures.
In a specific implementation, if the audio content information is lost but the audio redundancy data is not lost, the audio redundancy data is decoded to obtain data identical to the lost audio content information.
Here, the audio information includes audio content information and audio redundancy data, the latter obtained by encoding the former.
The audio redundancy data may be obtained by the first terminal encoding the audio content information; the specific encoding scheme may be a preset rule known to both the smart device and the first terminal.
The audio information the first terminal transfers to the smart device may include both the audio content information and the audio redundancy data. If the audio content information is lost but the audio redundancy data is not, the smart device can decode the redundancy data to obtain data identical to the lost audio content information, further improving the reliability of the data transfer.
Referring to Fig. 20, which shows a schematic flowchart of another specific implementation of the audio-video processing method according to an embodiment of the present disclosure, the method may be performed by the first terminal and specifically includes S210 to S220:
S210, capturing audio content information.
S220, sending audio information including the audio content information to the smart device, so that the smart device synthesizes video information buffered for the first time length with the corresponding audio information into an audio-video file.
The first terminal may capture the audio content information, add the corresponding first timestamp to it, and then send the audio information, comprising the audio content information and the first timestamp, to the smart device, so that the smart device synthesizes the video information it has buffered for a period of time with the corresponding audio information.
Optionally, S220 specifically includes: if the first terminal reconnects with the smart device after being disconnected for a second time length, sending the smart device the audio information corresponding to the second time length, where the second time length is less than or equal to the first time length.
When the smart device reconnects with the first terminal after a disconnection no longer than the maximum video buffering duration, the first terminal can still, upon reconnection, send the audio that failed to be delivered during the disconnection, so that the smart device can synthesize the audio-video file; this makes the synthesis of audio-video files more stable.
Optionally, S220 further includes: if the first terminal reconnects with the smart device after being disconnected for a third time length, sending the smart device the audio information corresponding to the most recent first time length, where the third time length is greater than the first time length.
When the smart device reconnects with the first terminal after a disconnection longer than the maximum video buffering duration, the first terminal can, upon reconnection, send the most recent audio information, whose duration may equal the maximum video buffering duration, so that the smart device synthesizes as many audio-video files with sound as possible.
In a specific implementation, referring to Fig. 21, S220 specifically includes the following S221 to S222:
S221, encoding the audio content information to obtain audio redundancy data.
The first terminal may encode the audio content information according to a preset rule. For example, if the audio content information is A, B, C, D, the first terminal may encode each of A, B, C, D to obtain audio redundancy data a, b, c, d, where a corresponds to A, b to B, c to C, and d to D.
S222, sending the smart device the audio information including the audio content information and the audio redundancy data.
Continuing the example above, the first terminal may send the smart device audio information that includes the audio content information A, B, C, D and the audio redundancy data a, b, c, d.
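A toy version of this scheme, where the agreed encoding rule is assumed to be a plain copy (a real system might use parity or a forward-error-correction code instead):

    from typing import Optional, Tuple

    def encode_with_redundancy(chunk: bytes) -> Tuple[bytes, bytes]:
        """Produce the audio content and its redundancy under a toy rule:
        the redundancy is simply a copy of the content."""
        return chunk, bytes(chunk)

    def recover(content: Optional[bytes],
                redundancy: Optional[bytes]) -> Optional[bytes]:
        """Use the redundancy when the content packet was lost."""
        if content is not None:
            return content
        if redundancy is not None:
            return bytes(redundancy)   # data identical to the lost content
        return None

    # Example: content A is lost in transit, but its redundancy a arrives.
    a_content, a_redundancy = encode_with_redundancy(b"A-frame-bytes")
    assert recover(None, a_redundancy) == a_content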
Corresponding to the method embodiment of Fig. 18 above, this embodiment further provides an audio-video processing apparatus capable of performing the steps involved in that method embodiment. The apparatus may be a module, program segment, or code on an electronic device; for its specific functions, refer to the description above, and detailed description is appropriately omitted here to avoid repetition. The apparatus includes a receiving/sending module, configured to capture video information, buffer the video information in a storage device, receive audio information corresponding to the video information from a first terminal, and buffer the audio information in the storage device.
It also includes a synthesis module, configured to synthesize the video information in the storage device that has been buffered for the first time length with the corresponding audio information into an audio-video file.
This embodiment also shows another specific implementation of the audio-video processing apparatus. That apparatus includes a receiving/sending module, configured to capture audio content information and send audio information including the audio content information to a smart device, so that the smart device synthesizes video information buffered for the first time length with the corresponding audio information into an audio-video file.
On the basis of the above embodiments, an embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program executable by an electronic device; when the program runs on the electronic device, it causes the electronic device to perform the steps of the method applied to the smart device, for example the following steps:
the memory stores a computer program that, when executed by the processor, causes the processor to perform the following steps: receiving a control message sent by a first terminal; and recording video information according to the control message, so that an audio-video file is synthesized from the video information and the audio information recorded by the first terminal.
On the basis of the above embodiments, an embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program executable by an electronic device; when the program runs on the electronic device, it causes the electronic device to perform the steps of the method applied to the first terminal, for example the following steps:
the memory stores a computer program that, when executed by the processor, causes the processor to perform the following steps: sending a smart device a control message instructing the smart device to record video information; and recording audio information according to the control message, so that an audio-video file is synthesized from the audio information and the video information recorded by the smart device.
The above computer-readable storage medium may be any available medium or data storage device accessible to the processor in the electronic device, including but not limited to magnetic memories such as floppy disks, hard disks, magnetic tapes, and magneto-optical disks (MO); optical memories such as CD, DVD, BD, and HVD; and semiconductor memories such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid-state drives (SSD).
It will be appreciated that, corresponding to the actions performed by the second terminal (server), the present disclosure may also provide an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another via the communication bus; the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of the above method applied to the second terminal. An embodiment of the present disclosure may also provide a computer-readable storage medium storing a computer program executable by an electronic device; when the program runs on the electronic device, it causes the electronic device to perform the steps of the method applied to the second terminal.
It will be appreciated that the above description covers only exemplary implementations of the present disclosure; in other scenarios the disclosure may be implemented in many varied ways. For example, the smart device and the terminal may be connected directly, e.g., wirelessly; or they may be connected via a relay, e.g., through another terminal or another base station. The information to be synthesized need not be limited to video information and audio information: in other scenarios, a picture obtained by the smart device may be synthesized with coordinates obtained by the terminal, or with text obtained by the terminal; or at least one of the audio information, coordinates, etc. obtained by the smart device may be synthesized with at least one of the video, pictures, etc. obtained by the terminal. The present disclosure does not enumerate every such case.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although preferred embodiments of the present disclosure have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present disclosure.
Obviously, those skilled in the art can make various changes and variations to the present disclosure without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present disclosure and their technical equivalents, the present disclosure is also intended to encompass them.
Industrial Applicability
The audio-video processing solution provided by the present disclosure can record high-quality audio information and ensure the quality of the synthesized audio-video file. By producing an audio-video file with subtitles, it lets the file express its content more clearly. By buffering the video information and the audio information, a disconnection and reconnection between the smart device and the first terminal during the video buffering period does not affect the synthesis of the audio information and the video information, making the synthesis of audio-video files more stable and improving on the poor sound signals of the prior art.

Claims (28)

  1. An audio-video processing method, applied to a smart device, the method comprising:
    receiving a control message sent by a first terminal; and
    recording video information according to the control message, so that an audio-video file is synthesized from the video information and audio information recorded by the first terminal.
  2. The method according to claim 1, wherein the control message carries time information;
    the recording of the video information according to the control message comprises:
    if the time information is delay duration information, the smart device, after receiving the control message, waiting for the delay duration before recording the video information;
    if the time information is a time point for video recording, the smart device recording the video information when the time point is reached.
  3. The method according to claim 1 or 2, wherein, after the recording of the video information, the method further comprises:
    receiving the audio information sent by the first terminal; and
    synthesizing the audio information and the video information into an audio-video file; or,
    sending the video information to the first terminal, so that the first terminal synthesizes an audio-video file from the audio information it recorded itself and the received video information; or,
    sending the video information to a second terminal, so that the second terminal synthesizes an audio-video file from the audio information recorded by the first terminal and the video information.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    obtaining text information corresponding to the audio information, and synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles; or,
    obtaining text information corresponding to the video information, and synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles.
  5. The method according to claim 4, wherein the video information, the audio information, and the text information each include first time information, and the synthesizing of the video information, the audio information, and the text information into the audio-video file with subtitles comprises:
    synthesizing the video information, the audio information, and the text information into the audio-video file with subtitles according to the first time information.
  6. The method according to claim 4 or 5, wherein the video information includes a person, and the synthesizing of the video information, the audio information, and the text information into the audio-video file with subtitles comprises:
    obtaining multiple frames of video images corresponding to the video information, and recognizing the multiple frames of video images to obtain mouth-shape variation features of the person in the video information;
    obtaining corresponding text according to the mouth-shape variation features; and
    synthesizing the video information, the audio information, and the text information into the audio-video file with subtitles according to the text corresponding to the mouth-shape variation features.
  7. The method according to any one of claims 4 to 6, wherein the obtaining of the text information corresponding to the audio information comprises:
    preprocessing the audio information to obtain processed audio information;
    performing endpoint segmentation on the processed audio information to obtain audio samples;
    splitting the audio samples again according to a preset minimum silence length and a preset minimum effective sound duration to obtain multiple audio segments; and
    performing text recognition on each audio segment to obtain the text information.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    buffering the video information in a storage device; and
    buffering the audio information in the storage device;
    wherein the synthesizing of the audio-video file from the video information and the audio information recorded by the first terminal comprises:
    synthesizing the video information in the storage device that has been buffered for a first time length with the corresponding audio information into an audio-video file.
  9. The method according to claim 8, wherein the audio information includes at least one first timestamp and audio content information corresponding to each first timestamp, and the video information includes at least one second timestamp and video content information corresponding to each second timestamp;
    the synthesizing of the video information in the storage device that has been buffered for the first time length with the corresponding audio information into an audio-video file comprises:
    for video information that has been buffered for the first time length, extracting the second timestamp of the video information;
    determining whether a first timestamp corresponding to the second timestamp exists among the plurality of first timestamps of the audio information;
    if so, synthesizing the audio content information corresponding to the first timestamp that corresponds to the second timestamp with the video information buffered for the first time length into an audio-video file;
    if not, adding a speech-missing prompt to the video information buffered for the first time length.
  10. The method according to claim 9, wherein the method further comprises:
    if the smart device reconnects with the first terminal after being disconnected for a second time length, receiving from the first terminal the audio information corresponding to the second time length, wherein the first time length is the smart device's maximum buffering duration, the second time length is less than or equal to the first time length, and the audio information corresponding to the video information buffered for the first time length includes the audio information corresponding to the second time length; if the smart device reconnects with the first terminal after being disconnected for a third time length, receiving from the first terminal the audio information corresponding to the most recent first time length within the third time length, wherein the third time length is greater than the first time length; and
    if the smart device cannot reconnect with the first terminal after the disconnection, issuing a disconnection prompt.
  11. The method according to any one of claims 8 to 10, wherein the audio information includes audio content information and audio redundancy data obtained by encoding the audio content information, and the method further comprises:
    if the audio content information is lost and the audio redundancy data is not lost, decoding the audio redundancy data to obtain data identical to the lost audio content information.
  12. An audio-video processing method, applied to a first terminal, the method comprising:
    sending a smart device a control message instructing the smart device to record video information; and
    recording audio information according to the control message, so that an audio-video file is synthesized from the audio information and video information recorded by the smart device.
  13. The method according to claim 12, wherein the control message carries time information;
    the recording of the audio information according to the control message comprises:
    if the time information is delay duration information, the first terminal, after sending the control message, waiting for the delay duration before recording the audio information;
    if the time information is a time point for audio recording, the first terminal recording the audio information when the time point is reached.
  14. The method according to claim 12 or 13, wherein, after the recording of the audio information, the method comprises:
    receiving the video information sent by the smart device; and
    synthesizing the audio information and the video information into an audio-video file; or,
    sending the audio information to the smart device, so that the smart device synthesizes an audio-video file from the video information it recorded itself and the received audio information; or,
    sending the audio information to a second terminal, so that the second terminal synthesizes an audio-video file from the video information recorded by the smart device and the audio information.
  15. The method according to any one of claims 12 to 14, wherein, after the recording of the audio information, the method comprises:
    generating corresponding text information from the audio information, and synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles; or,
    generating corresponding text information from the video information, and synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles.
  16. The method according to claim 15, wherein the video information, the audio information, and the text information each include first time information, and the synthesizing of the video information, the audio information, and the text information into the audio-video file with subtitles comprises:
    synthesizing the video information, the audio information, and the text information into the audio-video file with subtitles according to the first time information.
  17. The method according to any one of claims 12 to 16, wherein the audio information includes audio content information, and after the recording of the audio information the method comprises:
    sending audio information including the audio content information to the smart device, so that the smart device synthesizes video information buffered for a first time length with the corresponding audio information into an audio-video file.
  18. The method according to claim 17, wherein the sending of the audio information including the audio content information to the smart device comprises:
    if the first terminal reconnects with the smart device after being disconnected for a second time length, sending the smart device the audio information corresponding to the second time length, wherein the second time length is less than or equal to the first time length;
    if the first terminal reconnects with the smart device after being disconnected for a third time length, sending the smart device the audio information corresponding to the most recent first time length, wherein the third time length is greater than the first time length.
  19. The method according to claim 17 or 18, wherein the sending of the audio information including the audio content information to the smart device comprises:
    encoding the audio content information to obtain audio redundancy data; and
    sending the smart device the audio information including the audio content information and the audio redundancy data.
  20. An audio-video processing method, applied to a second terminal, the method comprising:
    obtaining video information and audio information, wherein the video information is captured by a smart device and the audio information is captured by a first terminal; and
    synthesizing the video information and the audio information into an audio-video file.
  21. The method according to claim 20, wherein the method further comprises:
    generating corresponding text information from the audio information, and synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles; or,
    generating corresponding text information from the video information; and
    synthesizing the video information, the audio information, and the text information into an audio-video file with subtitles.
  22. The method according to claim 21, wherein the video information, the audio information, and the text information each include first time information, and the synthesizing of the video information, the audio information, and the text information into the audio-video file with subtitles comprises:
    synthesizing the video information, the audio information, and the text information into the audio-video file with subtitles according to the first time information.
  23. The method according to claim 21, wherein obtaining the text information corresponding to the audio information comprises:
    the text information being generated by the first terminal from the audio information; or,
    the second terminal generating the corresponding text information from the audio information.
  24. The method according to claim 21, wherein obtaining the text information corresponding to the video information comprises:
    the text information being generated by the smart device from the video information; or,
    the second terminal generating the corresponding text information from the video information.
  25. An audio-video processing apparatus, applied to a smart device, the apparatus comprising:
    a receiving/sending module, configured to receive a control message sent by a first terminal; and
    a recording module, configured to record video information according to the control message, so that an audio-video file is synthesized from the video information and audio information recorded by the first terminal.
  26. An audio-video processing apparatus, applied to a first terminal, the apparatus comprising:
    a receiving/sending module, configured to send a smart device a control message instructing the smart device to record video information; and
    a recording module, configured to record audio information according to the control message, so that an audio-video file is synthesized from the audio information and video information recorded by the smart device.
  27. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
    the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 22.
  28. A computer-readable storage medium storing a computer program executable by an electronic device, wherein, when the program runs on the electronic device, it causes the electronic device to perform the steps of the method according to any one of claims 1 to 24.
PCT/CN2020/070597 2019-03-01 2020-01-07 Audio and video processing method and apparatus, electronic device, and storage medium WO2020177483A1 (zh)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201910155598.4 2019-03-01
CN201910155598.4A CN110022449A (zh) 2019-03-01 2019-03-01 Audio and video synthesis method and apparatus, electronic device, and storage medium
CN201910850137.9 2019-09-09
CN201910850136.4A CN110691204B (zh) 2019-09-09 2019-09-09 Audio and video processing method and apparatus, electronic device, and storage medium
CN201910850136.4 2019-09-09
CN201910850137.9A CN110691218B (zh) 2019-09-09 2019-09-09 Audio data transmission method and apparatus, electronic device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2020177483A1 true WO2020177483A1 (zh) 2020-09-10

Family

ID=72337438

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070597 WO2020177483A1 (zh) 2019-03-01 2020-01-07 音视频处理方法、装置、电子设备及存储介质

Country Status (1)

Country Link
WO (1) WO2020177483A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106231226A (zh) * 2015-09-21 2016-12-14 零度智控(北京)智能科技有限公司 Audio-video synthesis method, apparatus, and system
WO2017123307A2 (en) * 2015-12-01 2017-07-20 Qualcomm Incorporated Electronic device for generating video data
CN107872605A (zh) * 2016-09-26 2018-04-03 青柠优视科技(北京)有限公司 UAV system and UAV audio and video processing method
CN207200853U (zh) * 2017-05-08 2018-04-06 北京臻迪科技股份有限公司 Audio-video fusion system
CN110022449A (zh) * 2019-03-01 2019-07-16 苏州臻迪智能科技有限公司 Audio and video synthesis method and apparatus, electronic device, and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20766066; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20766066; Country of ref document: EP; Kind code of ref document: A1)