CN115243087A - Audio and video co-shooting processing method and device, terminal equipment and storage medium - Google Patents

Audio and video co-shooting processing method and device, terminal equipment and storage medium

Info

Publication number
CN115243087A
Authority
CN
China
Prior art keywords
audio data
data
audio
specified
time difference
Prior art date
Legal status
Pending
Application number
CN202210786617.5A
Other languages
Chinese (zh)
Inventor
查航
张远
Current Assignee
Beijing Small Sugar Technology Co ltd
Original Assignee
Beijing Small Sugar Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Small Sugar Technology Co ltd
Priority to CN202210786617.5A
Publication of CN115243087A


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/433Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N21/4334Recording operations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g. 3D video

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The application discloses an audio and video co-shooting processing method and apparatus, a terminal device, a server and a storage medium. The method comprises the following steps: in response to receiving a start instruction, controlling the terminal device to start recording audio and video, and simultaneously controlling an electronic device to start playing specified audio data out loud; in response to receiving an end instruction, controlling the terminal device to stop recording the audio and the video to obtain first audio data and first video data; matching the first audio data against the specified audio data to determine the time difference between the first audio data and the specified audio data; offsetting the specified audio data based on the time difference to obtain target audio data; and synthesizing the target audio data and the first video data to obtain an audio and video file in which the audio and the video are in sync. With the embodiments disclosed in the application, an audio and video file whose sound and picture keep time with each other can be obtained, improving user satisfaction.

Description

Audio and video co-shooting processing method and device, terminal equipment and storage medium
Technical Field
The present application relates to the field of audio and video synthesis technologies, and in particular to an audio and video co-shooting processing method and apparatus, a terminal device, a computer-readable storage medium, and a computer program product.
Background
With the popularization of terminal devices such as mobile phones and tablet computers, recording and sharing everyday life through photos and video has become popular. A user can record a dance video with the phone camera, finish the dance work through post-processing, and share it in their friends circle or on a video platform for more enjoyment.
However, when a dance video is produced on a mobile phone, differences in the phone's hardware configuration often leave the rhythm of the audio in the finished work out of step with the dancer's movements in the video, which makes for a poor experience and reduces user satisfaction.
Disclosure of Invention
In view of this, embodiments of the present application provide an audio and video co-shooting processing method and apparatus, a terminal device, a server, a computer-readable storage medium, and a computer program product, which address at least one of the above technical problems.
In a first aspect, an embodiment of the present application provides an audio and video co-shooting processing method, applied to a terminal device, the method comprising: in response to receiving a start instruction, controlling the terminal device to start recording audio and video, and simultaneously controlling an electronic device to start playing specified audio data out loud; recording the audio comprises recording the sound while the specified audio data are played out loud by the electronic device, and recording the video comprises recording a specified target while the specified audio data are played out loud by the electronic device; in response to receiving an end instruction, controlling the terminal device to stop recording the audio and the video to obtain first audio data and first video data; matching the first audio data against the specified audio data to determine the time difference between the first audio data and the specified audio data; offsetting the specified audio data based on the time difference to obtain target audio data; and synthesizing the target audio data and the first video data to obtain an audio and video file in which the audio and the video are in sync.
According to the method of the embodiment of the application, matching the first audio data against the specified audio data to determine the time difference between the first audio data and the specified audio data includes: extracting data segments of a specified time period from the first audio data; extracting data segments of a specified time period from the specified audio data; and matching the data segments of the first audio data against the data segments of the specified audio data to determine the time difference between the first audio data and the specified audio data.
According to the method of the embodiment of the application, the time length of the first audio data and the specified audio data is greater than a first time length threshold, and the matching processing is performed on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, including: acquiring a head data fragment A1, a middle data fragment B1 and a tail data fragment C1 in the first audio data; acquiring a head data segment A2, a middle data segment B2 and a tail data segment C2 in the designated audio data; respectively matching the head data segment A1, the middle data segment B1 and the tail data segment C1 with the head data segment A2, the middle data segment B2 and the tail data segment C2 to obtain a head data time difference, a middle data time difference and a tail data time difference; and taking an average value of the head data time difference, the middle data time difference and the tail data time difference as the time difference between the first audio data and the specified audio data.
According to the method of the embodiment of the application, the first time length threshold value is 15s.
According to the method of the embodiment of the application, the determining the time difference between the first audio data and the specified audio data by matching the first audio data and the specified audio data, wherein the time length of the first audio data and the specified audio data both fall into a first time length range, comprises: acquiring a head data segment D1 and a tail data segment E1 in the first audio data; acquiring a head data segment D2 and a tail data segment E2 in the specified audio data; respectively matching the head data segment D1 and the tail data segment E1 with the head data segment D2 and the tail data segment E2 to obtain a head data time difference and a tail data time difference; and taking the average value of the head data time difference and the tail data time difference as the time difference between the first audio data and the specified audio data.
According to the method of the embodiment of the application, the first time length range is 10s-15s.
According to the method of the embodiment of the application, the determining the time difference between the first audio data and the specified audio data by matching the first audio data and the specified audio data, wherein the time length of the first audio data and the specified audio data both fall into a second time length range, comprises: acquiring a head data fragment F1 in the first audio data; acquiring a head data fragment F2 in the specified audio data; respectively matching the head data fragment F1 and the head data fragment F2 to obtain a head data time difference; the header data time difference is taken as a time difference between the first audio data and the specified audio data.
According to an embodiment of the method, the second time period ranges from 5s to 10s.
According to the method of the embodiment of the application, the offset processing of the specified audio data based on the time difference comprises: adding a mute frame to the head of the specified audio data, wherein the duration of the mute frame is determined according to the time difference.
According to the method of the embodiment of the application, the specified audio data comprises audio data of a song; the specified target comprises a user; the electronic device and the terminal device are the same device, or the electronic device and the terminal device are not the same device.
According to the method of the embodiment of the application, the matching the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data includes: transforming the first audio data and the specified audio data into first frequency domain data and second frequency domain data, respectively, using a short-time fourier transform; comparing the characteristics of the first frequency domain data and the second frequency domain data by using a sliding time window, and when the difference between the first frequency domain data and the second frequency domain data is minimum, acquiring the offset of the first frequency domain data relative to the second frequency domain data; determining a time difference between the first audio data and the specified audio data based on the offset.
According to the method of the embodiment of the application, the determining the time difference between the first audio data and the specified audio data based on the offset comprises: the time difference is calculated using the following equation:
delay=window_offset×window_length/sample_rate;
wherein delay is a time difference between the first audio data and the specified audio data; window_offset is an offset of the first frequency-domain data relative to the second frequency-domain data; sample_rate is the audio sampling rate within the sliding time window and window_length is the number of sampling points within the sliding time window.
According to the method of the embodiment of the application, before controlling the terminal device to start recording the audio and the video and simultaneously controlling the electronic device to start playing the specified audio data out loud, the method further comprises: waiting for a time interval of a predetermined length.
According to the method of the embodiment of the application, before the matching processing is performed on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, the method further includes: if the audio parameters of the first audio data and the specified audio data are inconsistent, performing audio resampling on the specified audio data to unify the audio parameters of the first audio data and the specified audio data; the audio parameters include sampling frequency, channel number and quantization bit number.
In a second aspect, an embodiment of the present application provides an audio and video co-shooting processing apparatus, including: a first control module, configured to, in response to receiving a start instruction, control the terminal device to start recording audio and video and simultaneously control the electronic device to start playing specified audio data out loud; recording the audio comprises recording the sound while the specified audio data are played out loud by the electronic device, and recording the video comprises recording a specified target while the specified audio data are played out loud by the electronic device; a second control module, configured to, in response to receiving an end instruction, control the terminal device to stop recording the audio and the video to obtain first audio data and first video data; a matching processing module, configured to match the first audio data against the specified audio data to determine the time difference between the first audio data and the specified audio data; an offset processing module, configured to offset the specified audio data based on the time difference to obtain target audio data; and a synthesis processing module, configured to synthesize the target audio data and the first video data to obtain an audio and video file in which the audio and the video are in sync.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device includes: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, performs the steps of the method as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method as described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer program instructions which, when executed by a processor, implement a method as described above.
According to the scheme provided by the embodiments of the application, the first audio data recorded by the terminal device are matched against the specified audio data played out loud to determine the time difference between the first audio data and the specified audio data, and the specified audio data are offset based on the time difference to obtain the target audio data, thereby eliminating the delay introduced when the terminal device plays the specified audio data out loud. Synthesizing the target audio data with the video file recorded by the terminal device then yields an audio-visual result in which the rhythms of audio and video are synchronized, improving user satisfaction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings in the embodiments of the present application are briefly described below.
Fig. 1 is a block flow diagram of an audio and video co-shooting processing method according to an embodiment of the present application;
Fig. 2 is a timing diagram of the operation of a terminal device according to an embodiment of the application;
Fig. 3 is a schematic diagram of the operation of the terminal device modules according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio and video co-shooting processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 6 is a schematic diagram of a software structure of a terminal device according to an embodiment of the present application.
Detailed Description
The principles and spirit of the present application will be described below with reference to a number of exemplary embodiments. It is to be understood that these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the principles and spirit of the disclosure to those skilled in the art. The exemplary embodiments provided herein are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort are within the scope of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method, computer-readable storage medium, or computer program product. Accordingly, the present application may be embodied in at least one of the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software. According to the implementations of the application, the present application claims an audio and video co-shooting processing method and apparatus, a terminal device, a server and a computer-readable storage medium.
In this document, terms such as first, second, third, and the like are used solely to distinguish one entity (or operation) from another entity (or operation), without necessarily requiring or implying any order or relationship between such entities (or operations).
When the terminal device is used to record video, the terminal device system collects audio and video through the audio acquisition module and the video acquisition module, and plays audio data (such as a piece of music) through the audio playing module. After recording ends, the captured audio may contain environmental noise, so the recorded audio track is not used directly; instead, the original audio data of the music are synthesized with the captured video data so that the noise does not interfere. However, the audio playing module may incur a delay when it invokes the system playback interface: by the time the user actually hears the audio, the video has already been recording for a while. If the head of the captured video data and the head of the played audio data are simply aligned and synthesized, the resulting audio and video are out of sync. For example, when the terminal device records a user dancing, the audio acquisition module captures the sound of the environment while the user dances, the video acquisition module captures the user's dance picture, and the audio playing module plays the song the dance moves follow. When the user starts to dance, the audio playing module emits the audio data through the loudspeaker, and this process involves a delay; as a result, after the captured video data and the played audio data are synthesized, the user's dance moves in the captured video do not keep time with the rhythm of the played audio data, and the audio and video appear out of sync.
In order to solve the above problems, the present application provides an audio and video co-shooting processing method. Fig. 1 is a flowchart of an audio and video co-shooting processing method according to an embodiment of the present application; the method is applied to a terminal device and includes the following steps:
s101, in response to receiving a start instruction, controlling the terminal equipment to start recording audio and video, and simultaneously controlling the electronic equipment to start playing out specified audio data; recording the audio comprises recording the sound when the appointed audio data are played outside the electronic equipment, and recording the video comprises recording the appointed target when the appointed audio data are played outside the electronic equipment;
s102, in response to receiving the ending instruction, controlling the terminal equipment to stop recording the audio and the video to obtain first audio data and first video data;
s103, matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data;
s104, carrying out offset processing on the specified audio data based on the time difference to obtain target audio data;
and S105, synthesizing the target audio data and the first video data to obtain an audio and video file in which the audio and the video are in sync.
According to the method and the device of the application, the first audio data recorded by the terminal device are matched against the specified audio data played out loud to determine the time difference between them, and the specified audio data are offset based on the time difference to obtain the target audio data. This eliminates the delay introduced when the terminal device plays the specified audio data out loud, so that the synthesized audio and the rhythm of the video content recorded by the terminal device are synchronized and in time, improving user satisfaction.
In this embodiment of the present application, optionally, the matching process is performed on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, and the following processes may be implemented:
s11, extracting data segments of a specified time period in the first audio data;
s12, extracting a data segment of a specified time period in the specified audio data;
and S13, matching the data segment of the first audio data and the data segment of the designated audio data to determine the time difference between the first audio data and the designated audio data.
If matching were performed over the entirety of the first audio data and of the specified audio data, it would take longer and matching efficiency would be low. Therefore, the embodiment of the application can be configured to extract data segments of a specified time period from the first audio data and from the specified audio data respectively, and to determine the time difference between the first audio data and the specified audio data by matching those data segments, which shortens the matching time, increases the matching speed and improves the user experience.
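By way of illustration only (not part of the claimed method), extracting a data segment of a specified time period from decoded PCM audio amounts to index slicing. A minimal sketch follows, assuming mono float samples at 44.1 kHz and a 5 s segment; the names and stand-in data are hypothetical:

import numpy as np

def extract_segment(pcm, sample_rate, start_s, duration_s):
    # Return the slice of mono PCM samples covering [start_s, start_s + duration_s)
    begin = int(start_s * sample_rate)
    return pcm[begin:begin + int(duration_s * sample_rate)]

# Stand-in data: 20 s each of recorded audio (first audio data) and specified audio data
sr = 44100
recorded = np.random.randn(sr * 20).astype(np.float32)
specified = np.random.randn(sr * 20).astype(np.float32)
head_recorded = extract_segment(recorded, sr, 0.0, 5.0)   # first 5 s of the first audio data
head_specified = extract_segment(specified, sr, 0.0, 5.0) # first 5 s of the specified audio data

The two extracted segments are what the matching step described next would compare.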
In this embodiment of the present application, optionally, the durations of the first audio data and the specified audio data are both greater than a first duration threshold, and the matching processing is performed on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, and may be implemented by:
s21, acquiring a head data segment A1, a middle data segment B1 and a tail data segment C1 in the first audio data;
s22, acquiring a head data segment A2, a middle data segment B2 and a tail data segment C2 in the specified audio data;
s23, respectively matching the head data segment A1, the middle data segment B1 and the tail data segment C1 with the head data segment A2, the middle data segment B2 and the tail data segment C2 to obtain a head data time difference, a middle data time difference and a tail data time difference; and
and S24, taking the average value of the head data time difference, the middle data time difference and the tail data time difference as the time difference between the first audio data and the specified audio data.
When extracting data segments from the first audio data and the specified audio data, the number of extracted segments should be considered. Too few data segments may hurt the accuracy of the matching result, while too many increase the matching time and defeat the purpose of speeding matching up. Therefore, when the durations of the first audio data and the specified audio data are greater than the first duration threshold, three data segments are obtained from each: the head data segment A1, the middle data segment B1 and the tail data segment C1 of the first audio data, and likewise the head data segment A2, the middle data segment B2 and the tail data segment C2 of the specified audio data. This preserves the accuracy of the matching result while keeping matching fast.
In the embodiment of the present application, the first duration threshold may optionally be 12s, 15s, 17s or 20s, or longer or shorter. For example, when the first duration threshold is 22s, the durations of the head data segment, the middle data segment and the tail data segment may each be 5s, 6s or another duration.
According to the embodiment of the application, data segments at different positions of the audio data are extracted according to the total duration of the audio data, and the average of the time differences of the data segments is used as the final time difference, which reduces the influence of noise on the determination of the time difference and improves its accuracy.
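As a sketch only of the head/middle/tail strategy for long audio, the following assumes mono float PCM and a 5 s segment length, and uses plain time-domain cross-correlation as a stand-in for the matching step (the embodiment may instead use the frequency-domain comparison described later):

import numpy as np

def delay_by_xcorr(rec_seg, ref_seg, sample_rate):
    # Estimate how far rec_seg lags ref_seg, in seconds, from the cross-correlation peak
    corr = np.correlate(rec_seg, ref_seg, mode="full")
    lag = int(np.argmax(corr)) - (len(ref_seg) - 1)
    return lag / sample_rate

def segment(pcm, sample_rate, start_s, dur_s):
    i = int(start_s * sample_rate)
    return pcm[i:i + int(dur_s * sample_rate)]

def average_delay_long(recorded, specified, sample_rate, seg_s=5.0):
    # Durations above the first duration threshold: compare head, middle and tail
    # segments and average the three time differences (steps S21-S24).
    total_s = len(specified) / sample_rate
    starts = [0.0, total_s / 2 - seg_s / 2, total_s - seg_s]   # head, middle, tail
    diffs = [delay_by_xcorr(segment(recorded, sample_rate, s, seg_s),
                            segment(specified, sample_rate, s, seg_s),
                            sample_rate)
             for s in starts]
    return float(np.mean(diffs))

The 10s-15s and 5s-10s cases described next differ only in comparing two segments (head and tail) or a single head segment.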
In this embodiment of the application, optionally, the durations of the first audio data and the specified audio data both fall within a first duration range, and the matching processing is performed on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, which may be implemented by:
s31, acquiring a head data segment D1 and a tail data segment E1 in the first audio data;
s32, acquiring a head data segment D2 and a tail data segment E2 in the specified audio data;
s33, respectively matching the head data segment D1 and the tail data segment E1 with the head data segment D2 and the tail data segment E2 to obtain a head data time difference and a tail data time difference;
s34, taking an average value of the head data time difference and the tail data time difference as the time difference between the first audio data and the specified audio data.
In the embodiment of the present application, optionally, the first time length range is 10s to 15s.
When the duration of the first audio data and the duration of the specified audio data both fall within a first duration range, namely 10s-15s, acquiring a head data segment D1 and a tail data segment E1 in the first audio data; the head data segment D2 and the tail data segment E2 in the specified audio data are acquired. Therefore, the accuracy of determining the time difference can be ensured, and the matching speed can be increased. The duration of the head data segment and the tail data segment obtained therein may be 5s, 6s, or other duration.
In this embodiment of the present application, optionally, the durations of the first audio data and the specified audio data both fall within a second duration range, and the matching processing is performed on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, which may be implemented by:
s41, acquiring a head data fragment F1 in the first audio data;
s42, acquiring a head data fragment F2 in the specified audio data;
s43, respectively matching the head data fragment F1 and the head data fragment F2 to obtain a head data time difference;
s44, regarding the header data time difference as the time difference between the first audio data and the specified audio data.
In the embodiment of the present application, optionally, the second time period ranges from 5s to 10s.
When the durations of the first audio data and the specified audio data both fall within the second duration range, namely 5s-10s, the head data segment F1 of the first audio data and the head data segment F2 of the specified audio data are acquired. This preserves the accuracy of the time-difference determination while keeping matching fast. The acquired head data segment may be 5s, 6s or another duration. In one embodiment of the present application, when the durations of the first audio data and the specified audio data are both less than 5s, the matching operation to obtain the time difference between the first audio data and the specified audio data cannot be completed: audio that is too short further compresses the duration of the data segments that can be acquired, making the determined time difference less accurate. Therefore, when the first audio data and the specified audio data are both shorter than 5s, the user may be prompted: the recording is too short, please record again.
In this embodiment of the application, optionally, offsetting the specified audio data based on the time difference may be implemented by adding a mute frame to the head of the specified audio data, wherein the duration of the mute frame is determined according to the time difference.
The mute frame has the same audio parameters as the specified audio data. Audio parameters include, but are not limited to, sampling frequency, number of channels, and number of quantization bits. A mute frame whose duration equals the time difference is added to the head of the specified audio data to obtain the target audio data, so that the target audio data and the rhythm in the first video data are synchronized and in time. Aligning the target audio with the first video data by adding a mute frame to the specified audio data has the advantages of a simple processing method and a short processing time.
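A minimal sketch of this offset step, assuming mono float PCM and a known sampling rate; the silence simply consists of zero-valued samples with the same audio parameters as the specified audio data:

import numpy as np

def prepend_silence(specified_pcm, delay_s, sample_rate):
    # Add a mute frame of duration delay_s to the head of the specified audio data
    pad = np.zeros(int(round(delay_s * sample_rate)), dtype=specified_pcm.dtype)
    return np.concatenate([pad, specified_pcm])

# e.g. a measured time difference of 120 ms at 44.1 kHz prepends 5292 zero samples
target_audio = prepend_silence(np.ones(44100, dtype=np.float32), 0.120, 44100)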
In this embodiment of the present application, optionally, the specified audio data includes audio data of a song; the specified target may be any person in the current environment, such as a user of the terminal device, a target person or group photographed by the terminal device; the electronic device and the terminal device may be the same device, or the electronic device and the terminal device may not be the same device.
The specified audio data may be the audio data of a song, and the specified target may be the end user. In a scene where a user records a dance video with the terminal device, the user dances to the rhythm of the song upon hearing it played out loud by the electronic device; meanwhile, the camera of the terminal device records the user's dance moves, obtaining the first video data.
If the electronic device and the terminal device are the same device, the terminal device (for example, a smartphone) records the audio and video and at the same time plays the specified audio data out loud. Tests show that when the same terminal device plays the specified audio data out loud while recording the audio and video, the resulting audio delay is roughly in the range of 30ms-300ms.
If the electronic device and the terminal device are not the same device, the electronic device may for example be a dedicated audio playing device (e.g., an ordinary speaker, a powered speaker, or a music player with a speaker), the audio playing device is communicatively connected with the terminal device (e.g., via Bluetooth or another protocol), and the terminal device may send the specified audio data to the audio playing device for playback. Tests show that when the specified audio data are played out loud by such an audio playing device while a terminal device such as a smartphone records the audio and video, the resulting audio delay is roughly in the range of 200ms-2000ms; that is, if the specified audio data and the video pictures recorded by the phone were synthesized without processing, the picture rhythm would lag behind the audio rhythm by about 200ms-2000ms. With the scheme of the application, the exact delay can be determined and the audio and video in the synthesized result can be brought into sync through offset processing, such as adding a mute frame of equal length, greatly improving the user experience.
In this embodiment of the present application, optionally, performing matching processing on the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data includes: transforming the first audio data and the specified audio data into first frequency domain data and second frequency domain data, respectively, using a short-time fourier transform; comparing the characteristics of the first frequency domain data and the second frequency domain data by using a sliding time window, and when the difference between the first frequency domain data and the second frequency domain data is minimum, acquiring the offset of the first frequency domain data relative to the second frequency domain data; a time difference between the first audio data and the specified audio data is determined based on the offset.
The first audio data and the specified audio data are transformed into first frequency domain data and second frequency domain data, respectively, using a short-time fourier transform. By comparing the features in the first frequency domain data and the second frequency domain data, the time difference between the first audio data and the specified audio data can be obtained quickly. Compared with the direct comparison of the first audio data and the specified audio data, the type of the audio data is converted from the time domain to the frequency domain by using the short-time Fourier transform, so that the comparison difficulty is reduced, the calculated amount is reduced, and the comparison speed is increased. Further, the time difference between the first audio data and the specified audio data can be accurately determined by the offset amount of the first frequency domain data with respect to the second frequency domain data.
In this embodiment of the application, optionally, determining a time difference between the first audio data and the specified audio data based on the offset includes: the time difference is calculated using the following equation:
delay=window_offset×window_length/sample_rate;
wherein delay is a time difference between the first audio data and the specified audio data; window_offset is an offset of the first frequency domain data relative to the second frequency domain data; sample_rate is the audio sample rate within the sliding time window and window_length is the number of sample points within the sliding time window.
Based on the offset of the first frequency domain data relative to the second frequency domain data, the audio sampling rate in the sliding time window and the number of sampling points in the sliding time window, the time difference between the first audio data and the specified audio data can be quickly and accurately obtained.
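A simplified sketch of the frequency-domain matching and of the delay formula above. The 1024-sample window, the use of non-overlapping windows, the 2 s search range and the mean absolute difference of magnitude spectra are all assumptions standing in for the feature comparison described in the embodiment:

import numpy as np

def spectrogram(pcm, window_length):
    # Magnitude spectra of consecutive non-overlapping windows (a simple short-time Fourier transform)
    n_frames = len(pcm) // window_length
    frames = pcm[:n_frames * window_length].reshape(n_frames, window_length)
    return np.abs(np.fft.rfft(frames * np.hanning(window_length), axis=1))

def estimate_delay(recorded, specified, sample_rate, window_length=1024, max_delay_s=2.0):
    # Assumes the recorded audio is at least as long as the specified audio plus the search range
    spec_rec = spectrogram(recorded, window_length)    # first frequency domain data
    spec_ref = spectrogram(specified, window_length)   # second frequency domain data
    max_offset = int(max_delay_s * sample_rate / window_length)
    n = min(len(spec_ref), len(spec_rec) - max_offset)  # frames compared at every offset
    best_offset, best_diff = 0, float("inf")
    for window_offset in range(max_offset + 1):          # slide the time window
        diff = np.mean(np.abs(spec_rec[window_offset:window_offset + n] - spec_ref[:n]))
        if diff < best_diff:                              # smallest difference wins
            best_offset, best_diff = window_offset, diff
    return best_offset * window_length / sample_rate      # delay = window_offset × window_length / sample_rate

For instance, window_offset = 5 with 1024-sample windows at a 44100 Hz sampling rate gives a delay of 5 × 1024 / 44100, roughly 0.116 s.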
In this embodiment of the application, optionally, before the control terminal device starts to record the audio and the video and simultaneously controls the electronic device to start playing out the specified audio data, a predetermined time interval may be first set, and then recording may be started.
The predetermined time interval may cover the time needed to initialize the audio acquisition module, the video acquisition module and the audio playing module on the terminal device. After the terminal device receives the start instruction and the initialization of the audio acquisition module, the video acquisition module and the audio playing module is complete, the terminal device starts to record audio and video and to play the specified audio data out loud. The three modules take different amounts of time to initialize; if each module started working as soon as its own initialization finished, the recorded audio, the recorded video and the played specified audio data would not be synchronized. Therefore, recording of the audio and video and playback of the specified audio data are started simultaneously only after the predetermined time interval, once the audio acquisition module, the video acquisition module and the audio playing module have all been initialized, which keeps the recorded audio and video synchronized. The predetermined duration is longer than the initialization time of each of the audio acquisition module, the video acquisition module and the audio playing module, and may be in the range of 8s-12s; preferably, the predetermined duration is 10s.
In this embodiment of the application, optionally, before performing matching processing on the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data, the method further includes: if the audio parameters of the first audio data and the specified audio data are inconsistent, performing audio resampling on the specified audio data to unify the audio parameters of the first audio data and the specified audio data; the audio parameters include sampling frequency, channel number and quantization bit number.
The specified audio data may be audio data stored locally on the terminal device or audio data obtained from a server. Because the audio data may come from different sources, the audio parameters of the first audio data and the specified audio data may differ. Performing an audio resampling operation on the specified audio data to unify the audio parameters of the first audio data and the specified audio data helps ensure that the time difference between them is determined more accurately.
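A sketch of this parameter-unification step, assuming SciPy is available and that the audio is held as float PCM; channel count is unified by down-mixing to mono and sampling frequency by polyphase resampling (quantization bit depth becomes moot once samples are floats):

from math import gcd
import numpy as np
from scipy.signal import resample_poly

def unify_audio(pcm, src_rate, dst_rate):
    # Bring the specified audio data to the recorded audio's parameters
    if pcm.ndim == 2:                       # (samples, channels) -> mono
        pcm = pcm.mean(axis=1)
    pcm = pcm.astype(np.float32)            # common sample format for comparison
    if src_rate != dst_rate:
        g = gcd(src_rate, dst_rate)
        pcm = resample_poly(pcm, dst_rate // g, src_rate // g)
    return pcm

# e.g. a 48 kHz stereo song resampled to match a 44.1 kHz mono recording
specified_unified = unify_audio(np.zeros((48000 * 3, 2)), 48000, 44100)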
Fig. 2 is a timing diagram of the operation of a terminal device according to an embodiment of the application. Fig. 3 is a schematic diagram of the operation of the terminal device modules according to an embodiment of the present application. The processing procedure of the audio and video co-shooting processing method of the embodiment of the present application is described below with reference to fig. 2 and fig. 3, taking as an example a scene in which a user uses a terminal device to play dance music and record a performer's dance video.
Before preparing to dance, the user may select the name of the dance music on the terminal device. Generally, different dance songs correspond to different dance movements, and the beats of the dance correspond to the rhythm of the song. The user can also use functions such as skin whitening, body slimming, dance background replacement and filter changes on the terminal device.
When the user is ready to start dancing, a start instruction is issued on the terminal device. And the terminal equipment receives the starting instruction, counts down for 10 seconds and controls the audio playing module, the audio acquisition module and the video acquisition module to carry out initialization work. When the countdown is finished, the audio acquisition module starts to acquire audio data, and the acquired audio data comprises the sound of the environment when the user dances; the video acquisition module starts to acquire video data, and the acquired video data comprises action pictures of dancing of a user; the Audio playing module starts to play designated Audio data Audio2, a user starts to dance according to the rhythm after hearing the sound of the Audio2, and the dancing action corresponds to the heard rhythm of the Audio 2. Wherein collecting the sound of the environment when the user dances comprises the Audio playing module playing the sound of Audio 2.
When the user taps to stop shooting, or the dance music finishes playing and shooting stops automatically, an end instruction is issued. The terminal device receives the end instruction and controls the Audio acquisition module to stop, obtaining the first Audio data Audio1; it also controls the Video acquisition module to stop, and the first Video data Video1 are obtained after the Video encoding module encodes the captured video data.
Audio1 and Audio2 are sent to the Audio feature extraction module. Firstly, judging whether the duration of Audio1 and Audio2 is more than 15 seconds, if so, respectively intercepting the first 5 seconds data, the middle 5 seconds data and the last 5 seconds data of Audio1 and Audio2, and obtaining 6 sections of data as follows: audioFront1, audioCenter1, audioBack1, audioFront2, audioCenter2, and AudioBack2;
in contrast to Audio1 and Audio2 Audio parameters, the Audio parameters include: sampling frequency, channel number, quantization bit number, and Audio resampling for Audio2 data segments (AudioFront 2, audioCenter2, audioBack 2) if different data segments are needed to unify Audio parameters of Audio1 and Audio 2.
AudioFront1 and AudioFront2, audioCenter1 and AudioCenter2, audioback1 and Audioback2 were divided into three sets of comparative data; and finally, averaging the three groups of Audio time differences to obtain the final average time difference, namely the Audio delay (hereinafter referred to as delay) of Audio1 to Audio 2.
A mute frame whose duration equals the delay is added to the head of Audio2 to obtain the target Audio data Audio3, and Audio3 and Video1 are then synthesized, with their data heads aligned, into an audio and video file in which sound and picture are synchronized. By comparing Audio1 with Audio2, the application obtains the delay of the audio playing module in playing the specified audio data, so that the rhythm of the audio in the synthesized video matches the rhythm of the user's dance moves in the video, improving user satisfaction.
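As one possible, non-limiting way to carry out the final head-aligned synthesis, the offset audio Audio3 can be written to a file and muxed with Video1 using ffmpeg, discarding the noisy recorded track; the filenames are hypothetical and ffmpeg is assumed to be installed:

import subprocess

# video1.mp4: Video1 from the video acquisition/encoding modules (hypothetical filename)
# audio3.wav: Audio3, i.e. Audio2 with the delay-length mute frame prepended (hypothetical filename)
subprocess.run([
    "ffmpeg", "-y",
    "-i", "video1.mp4",
    "-i", "audio3.wav",
    "-map", "0:v", "-map", "1:a",   # keep the recorded video stream, use Audio3 as the audio track
    "-c:v", "copy", "-c:a", "aac",
    "-shortest",
    "dance_output.mp4",
], check=True)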
Those skilled in the art should understand that the scheme of the present application is not limited to the scene of a user recording a dance video; it may also be applied to a scene in which a user records themselves singing along to an accompaniment, as well as to other scenes, which are not limited herein.
Fig. 4 is a schematic structural diagram of an audio and video co-shooting processing apparatus according to an embodiment of the application. As shown in fig. 4, the processing apparatus 100 includes:
the first control module 110 is configured to, in response to receiving the start instruction, control the terminal device to start recording audio and video, and at the same time control the electronic device to start playing specified audio data out loud; recording the audio comprises recording the sound while the specified audio data are played out loud by the electronic device, and recording the video comprises recording a specified target while the specified audio data are played out loud by the electronic device.
And the second control module 120 is configured to, in response to receiving the end instruction, control the terminal device to terminate recording of the audio and the video, so as to obtain the first audio data and the first video data.
A matching processing module 130, configured to perform matching processing on the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data.
And an offset processing module 140, configured to perform offset processing on the specified audio data based on the time difference to obtain target audio data.
And a synthesis processing module 150, configured to synthesize the target audio data and the first video data to obtain an audio and video file in which the audio and the video are in sync.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the above-described systems, modules and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those skilled in the art will appreciate that the embodiments described herein are illustrative of preferred embodiments and that the acts, steps, modules, or elements described herein are not necessarily required to practice the embodiments of the present application. In the foregoing embodiments, the descriptions of the embodiments of the present application have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. The terminal device 10 comprises a processor 11, a memory 12 and a communication bus for connecting the processor 11 and the memory 12, wherein a computer program that can be run on the processor 11 is stored in the memory 12, and when the computer program is run by the processor 11, steps in the method of the embodiments of the present application can be executed or implemented. The terminal device 10 may be a server in the embodiment of the present application, and the terminal device 10 may also be a cloud server. The terminal device 10 may also be an AR device in the embodiment of the present application. A terminal device may also be referred to as a computing device, where appropriate. The terminal device 10 further comprises a communication interface for receiving and transmitting data.
In some embodiments, the processor 11 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Processor (AP), a modem processor, an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, a neural-network processing unit (NPU), etc.; the processor 11 may also be another general-purpose processor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, any conventional processor, etc. The NPU processes input information quickly by drawing on the structure of biological neural networks and can continually learn on its own. Applications involving intelligent recognition, such as image recognition, face recognition, semantic recognition, speech recognition and text understanding, can be implemented on the terminal device 10 through the NPU.
In some embodiments, the memory 12 may be an internal storage unit of the terminal device 10, such as a hard disk or memory of the terminal device 10; the memory 12 may also be an external storage device of the terminal device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash memory card (Flash Card) provided on the terminal device 10. The memory 12 may also include both an internal storage unit and an external storage device of the terminal device 10. The memory 12 may be used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of a computer program. The memory 12 includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), or a portable Read-Only Memory (CD-ROM). The memory 12 is used for storing program code executed by the terminal device 10 and data to be transmitted. The memory 12 may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 10 and does not constitute a limitation to the terminal device 10, and that the terminal device 10 may include more or less components than those shown, or combine some components, or include different components, such as input and output devices, network access devices, and the like.
Fig. 6 is a schematic diagram of a software structure of a terminal device according to an embodiment of the present application. Taking the mobile phone operating system as an Android system as an example, in some embodiments, the Android system is divided into four layers, which are: the system comprises an application program layer, an application program Framework (FWK), a system layer and a hardware abstraction layer, wherein the layers communicate with each other through a software interface.
First, the application layer may include a plurality of application packages, which may be various application apps such as call, camera, video, navigation, weather, instant messenger, education, and may also be an application app based on AR technology.
Second, the application framework layer FWK provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.
The application framework layer may include a window manager, a resource manager, and a notification manager, among others.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, capture screenshots, and the like. The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
Among other things, the resource manager provides various resources, such as localized strings, icons, pictures, layout files, video files, and the like, to the application.
The notification manager enables an application to display notification information in the status bar and can be used to convey notification-type messages, which disappear automatically after a short stay without requiring user interaction, for example notifications of download completion or message alerts. The notification manager may also present notifications as a chart or scrolling text in the status bar at the top of the system screen, such as notifications of applications running in the background, or as a dialog window on the screen, for example prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, or flashing an indicator light.
In addition, the application framework layer may also include a view system that includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views, for example, a text display view and a picture display view may be included on the display interface of the sms notification icon.
Third, the system layer may include a plurality of functional modules, such as a sensor service module, a physical state recognition module, a three-dimensional graphics processing library (e.g., openGLES), and the like.
The sensor service module is used for monitoring sensor data uploaded by various sensors in a hardware layer and determining the physical state of the mobile phone; the physical state recognition module is used for analyzing and recognizing user gestures, human faces and the like; the three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
In addition, the system layer may also include a surface manager and a media library. The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications. The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others.
Finally, the hardware abstraction layer is a layer between hardware and software. The hardware abstraction layer may include a display driver, a camera driver, a sensor driver, and the like, and is used for driving related hardware of the hardware layer, such as a display screen, a camera, a sensor, and the like.
Embodiments of the present application also provide a computer-readable storage medium that stores a computer program or instructions which, when executed, implement the steps of the methods described in the above embodiments.
Embodiments of the present application also provide a computer program product, which includes a computer program or instructions that, when executed, implement the steps of the methods described in the above embodiments. Illustratively, the computer program product may be a software installation package.
With regard to each apparatus/product described in the above embodiments, the modules/units it includes may be software modules/units, hardware modules/units, or partly software modules/units and partly hardware modules/units. For example, for an apparatus/product applied to or integrated in a chip, each of its modules/units may be implemented in hardware such as a circuit; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated within the chip, while the remaining modules/units are implemented in hardware such as a circuit. Likewise, for an apparatus/product applied to or integrated in a terminal, each of its modules/units may be implemented in hardware such as a circuit, or at least some of the modules/units may be implemented by a software program running on a processor integrated in the terminal, while the remaining modules/units are implemented in hardware such as a circuit.
The foregoing describes only specific embodiments of the present application. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. It should be understood that the scope of the present application is not limited thereto; any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present application, and such modifications or substitutions shall fall within the scope of the present application.

Claims (18)

1. An audio and video co-shooting processing method, applied to a terminal device, the method comprising:
in response to receiving a start instruction, controlling the terminal device to start recording audio and video, and simultaneously controlling an electronic device to start playing out specified audio data, wherein recording the audio comprises recording the sound produced when the electronic device plays out the specified audio data, and recording the video comprises recording a specified target while the electronic device plays out the specified audio data;
in response to receiving an end instruction, controlling the terminal device to stop recording the audio and the video, to obtain first audio data and first video data;
matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data;
performing offset processing on the specified audio data based on the time difference to obtain target audio data; and
synthesizing the target audio data and the first video data to obtain a co-shot audio and video file.
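For orientation only, the following Python sketch outlines the flow of claim 1; estimate_delay, prepend_silence, and mux are hypothetical stand-ins for the matching, offsetting, and synthesis steps detailed in the dependent claims, and the sketch is not the claimed implementation itself.

    def co_shoot(first_audio, first_video, specified_audio, sample_rate,
                 estimate_delay, prepend_silence, mux):
        """first_audio/first_video: data recorded by the terminal device;
        specified_audio: the audio played out by the electronic device.
        The three injected helpers are hypothetical placeholders."""
        # match the first audio data against the specified audio data
        time_difference = estimate_delay(first_audio, specified_audio, sample_rate)
        # offset the specified audio data to obtain the target audio data
        target_audio = prepend_silence(specified_audio, time_difference, sample_rate)
        # synthesize the target audio data with the first video data
        return mux(target_audio, first_video)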
2. The method of claim 1, wherein the matching the first audio data and the specified audio data to determine the time difference between the first audio data and the specified audio data comprises:
extracting data segments of a specified time period in the first audio data;
extracting data segments of a specified time period in the specified audio data;
matching the data segment of the first audio data and the data segment of the specified audio data to determine the time difference between the first audio data and the specified audio data.
3. The method of claim 1, wherein the first audio data and the specified audio data each have a duration greater than a first duration threshold,
the matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data includes:
acquiring a head data segment A1, a middle data segment B1, and a tail data segment C1 in the first audio data; acquiring a head data segment A2, a middle data segment B2, and a tail data segment C2 in the specified audio data;
respectively matching the head data segment A1, the middle data segment B1 and the tail data segment C1 with the head data segment A2, the middle data segment B2 and the tail data segment C2 to obtain a head data time difference, a middle data time difference and a tail data time difference; and
taking the average value of the head data time difference, the middle data time difference, and the tail data time difference as the time difference between the first audio data and the specified audio data.
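Purely as an illustration of the segment-based matching of claim 3 (the 5-second segment length and the Python helper names below are assumptions, not values taken from the claims), head, middle, and tail segments can be cut from both signals, matched independently with any per-segment matcher, and the three resulting time differences averaged:

    import numpy as np

    def cut_segments(audio, sample_rate, segment_s=5.0):
        """Return head, middle, and tail segments of segment_s seconds each."""
        n = int(segment_s * sample_rate)
        mid = len(audio) // 2
        return audio[:n], audio[mid - n // 2: mid + n // 2], audio[-n:]

    def averaged_time_difference(first_audio, specified_audio, sample_rate, match_segments):
        """match_segments(seg1, seg2, sample_rate) returns one per-segment time difference,
        for example the STFT-based estimator sketched later under claims 11 and 12."""
        pairs = zip(cut_segments(first_audio, sample_rate),
                    cut_segments(specified_audio, sample_rate))
        diffs = [match_segments(a, b, sample_rate) for a, b in pairs]
        return float(np.mean(diffs))  # average of head, middle, and tail time differences

The sketch presumes that each segment is longer than the actual time difference; otherwise a segment pair would not overlap enough to be matched.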
4. The method of claim 3, wherein the first duration threshold is 15s.
5. The method according to claim 1, wherein the first audio data and the specified audio data each have a duration falling within a first duration range,
the matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data includes:
acquiring a head data segment D1 and a tail data segment E1 in the first audio data; acquiring a head data segment D2 and a tail data segment E2 in the specified audio data;
respectively matching the head data segment D1 and the tail data segment E1 with the head data segment D2 and the tail data segment E2 to obtain a head data time difference and a tail data time difference;
and taking the average value of the head data time difference and the tail data time difference as the time difference between the first audio data and the specified audio data.
6. The method of claim 5, wherein the first duration range is 10s-15s.
7. The method according to claim 1, wherein the durations of the first audio data and the specified audio data each fall within a second duration range,
the matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data includes:
acquiring a head data segment F1 in the first audio data; acquiring a head data segment F2 in the specified audio data;
matching the head data segment F1 with the head data segment F2 to obtain a head data time difference; and
taking the head data time difference as the time difference between the first audio data and the specified audio data.
8. The method of claim 7, wherein the second duration range is 5s-10s.
9. The method of claim 1, wherein the offsetting the specified audio data based on the time difference comprises: adding a mute frame at the head of the specified audio data, wherein the duration of the mute frame is determined according to the time difference.
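A minimal sketch of the offset processing of claim 9, assuming PCM samples held in a NumPy array (the function and parameter names are hypothetical):

    import numpy as np

    def prepend_silence(specified_audio, time_difference_s, sample_rate, channels=1):
        """Target audio data = mute frames of time_difference_s seconds + specified audio data."""
        n_silent = int(round(time_difference_s * sample_rate)) * channels
        silence = np.zeros(n_silent, dtype=specified_audio.dtype)  # the mute frames
        return np.concatenate([silence, specified_audio])

For interleaved multi-channel PCM, the number of silent samples must be a multiple of the channel count, which the channels parameter accounts for.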
10. The method of claim 1, wherein,
the specified audio data comprises audio data of a song;
the specified target comprises a user;
the electronic device and the terminal device are the same device, or the electronic device and the terminal device are not the same device.
11. The method of claim 1, wherein the matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data comprises:
transforming the first audio data and the specified audio data into first frequency domain data and second frequency domain data, respectively, by using a short-time Fourier transform;
performing feature comparison between the first frequency domain data and the second frequency domain data by using a sliding time window, and acquiring the offset of the first frequency domain data relative to the second frequency domain data when the difference between the first frequency domain data and the second frequency domain data is minimum; and
determining a time difference between the first audio data and the specified audio data based on the offset.
12. The method of claim 11, wherein said determining a time difference between the first audio data and the specified audio data based on the offset comprises:
calculating the time difference using the following equation:
delay = window_offset × window_length / sample_rate;
wherein delay is the time difference between the first audio data and the specified audio data, window_offset is the offset of the first frequency domain data relative to the second frequency domain data, sample_rate is the audio sampling rate within the sliding time window, and window_length is the number of sample points within the sliding time window.
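The following Python sketch illustrates one way the matching of claims 11 and 12 could be realized; the frame length of 1024 samples, the non-overlapping Hann-windowed frames, and the mean-absolute-difference comparison are assumptions chosen for brevity, not requirements of the claims.

    import numpy as np

    def stft_magnitude(signal, window_length=1024):
        """Magnitude spectrogram: non-overlapping Hann-windowed frames, one FFT per frame."""
        n_frames = len(signal) // window_length
        frames = signal[:n_frames * window_length].reshape(n_frames, window_length)
        frames = frames * np.hanning(window_length)
        return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, n_bins)

    def estimate_delay(first_audio, specified_audio, sample_rate,
                       window_length=1024, max_offset_frames=200):
        """Estimate how far the first audio data lags behind the specified audio data."""
        s1 = stft_magnitude(first_audio, window_length)      # first frequency domain data
        s2 = stft_magnitude(specified_audio, window_length)  # second frequency domain data
        n_compare = min(len(s1), len(s2)) - max_offset_frames
        best_offset, best_diff = 0, np.inf
        for window_offset in range(max_offset_frames):        # slide the time window
            diff = np.mean(np.abs(s1[window_offset:window_offset + n_compare] - s2[:n_compare]))
            if diff < best_diff:                               # keep the offset with the smallest difference
                best_diff, best_offset = diff, window_offset
        # delay = window_offset * window_length / sample_rate (the equation of claim 12)
        return best_offset * window_length / sample_rate

The sketch searches only non-negative offsets, i.e., it assumes the recording captured some lead-in before the specified audio data started playing out; a symmetric search over negative offsets could be added if the opposite case must be handled.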
13. The method according to claim 1, wherein before the controlling the terminal device to start recording audio and video and simultaneously controlling the electronic device to start playing out specified audio data, the method further comprises: waiting for a predetermined time interval.
14. The method of claim 1, further comprising, prior to said matching the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data:
if the audio parameters of the first audio data and the specified audio data are inconsistent, performing audio resampling on the specified audio data to unify the audio parameters of the first audio data and the specified audio data, wherein the audio parameters include sampling frequency, number of channels, and quantization bit depth.
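By way of illustration, unifying the sampling frequency mentioned in claim 14 could be sketched as a simple linear-interpolation resampler (assuming mono, floating-point samples; a real product would more likely use a dedicated resampling library, and the names below are hypothetical):

    import numpy as np

    def resample_linear(audio, src_rate, dst_rate):
        """Resample mono audio from src_rate to dst_rate by linear interpolation so that
        the first audio data and the specified audio data share one sampling frequency."""
        if src_rate == dst_rate:
            return audio
        dst_len = int(round(len(audio) * dst_rate / src_rate))
        src_times = np.arange(len(audio)) / src_rate
        dst_times = np.arange(dst_len) / dst_rate
        return np.interp(dst_times, src_times, audio)

Unifying the number of channels and the quantization bit depth (for example, downmixing stereo to mono and converting the sample type) is omitted from the sketch.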
15. An audio and video co-shooting processing apparatus, comprising:
a first control module, configured to, in response to receiving a start instruction, control the terminal device to start recording audio and video and simultaneously control an electronic device to start playing out specified audio data, wherein recording the audio comprises recording the sound produced when the electronic device plays out the specified audio data, and recording the video comprises recording a specified target while the electronic device plays out the specified audio data;
a second control module, configured to, in response to receiving an end instruction, control the terminal device to stop recording the audio and the video, to obtain first audio data and first video data;
a matching processing module, configured to match the first audio data and the specified audio data to determine a time difference between the first audio data and the specified audio data;
an offset processing module, configured to perform offset processing on the specified audio data based on the time difference to obtain target audio data; and
a synthesis processing module, configured to synthesize the target audio data and the first video data to obtain a co-shot audio and video file.
16. A terminal device, comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the method of any of claims 1-14.
17. A computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-14.
18. A computer program product, characterized in that it comprises computer program instructions which, when executed by a processor, implement the method according to any one of claims 1-14.
CN202210786617.5A 2022-07-04 2022-07-04 Audio and video co-shooting processing method and device, terminal equipment and storage medium Pending CN115243087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210786617.5A CN115243087A (en) 2022-07-04 2022-07-04 Audio and video co-shooting processing method and device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210786617.5A CN115243087A (en) 2022-07-04 2022-07-04 Audio and video co-shooting processing method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115243087A true CN115243087A (en) 2022-10-25

Family

ID=83671635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210786617.5A Pending CN115243087A (en) 2022-07-04 2022-07-04 Audio and video co-shooting processing method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115243087A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109379613A (en) * 2018-12-21 2019-02-22 深圳Tcl新技术有限公司 Audio-visual synchronization method of adjustment, TV, computer readable storage medium and system
CN110225279A (en) * 2019-07-15 2019-09-10 北京小糖科技有限责任公司 A kind of video production system and video creating method of mobile terminal
CN110830832A (en) * 2019-10-31 2020-02-21 广州市百果园信息技术有限公司 Audio playing parameter configuration method of mobile terminal and related equipment
CN112669884A (en) * 2020-12-31 2021-04-16 广州酷狗计算机科技有限公司 Audio data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107027050B (en) Audio and video processing method and device for assisting live broadcast
CN109089154B (en) Video extraction method, device, equipment and medium
CN109089127B (en) Video splicing method, device, equipment and medium
US20200402285A1 (en) Digital media editing
US10825480B2 (en) Automatic processing of double-system recording
US11490033B2 (en) Video generating method, apparatus, electronic device and computer storage medium
EP3817395A1 (en) Video recording method and apparatus, device, and readable storage medium
CN109729420B (en) Picture processing method and device, mobile terminal and computer readable storage medium
CN106998494B (en) Video recording method and related device
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
CN112822563A (en) Method, device, electronic equipment and computer readable medium for generating video
US20230217081A1 (en) Screen Casting Method and Terminal
WO2019114330A1 (en) Video playback method and apparatus, and terminal device
CN110741435A (en) Mixed audio signal synchronization based on correlation and attack analysis
WO2018045703A1 (en) Voice processing method, apparatus and terminal device
US20230307004A1 (en) Audio data processing method and apparatus, and device and storage medium
CN111046226B (en) Tuning method and device for music
WO2023082830A1 (en) Video editing method and apparatus, computer device, and storage medium
WO2021052130A1 (en) Video processing method, apparatus and device, and computer-readable storage medium
AU2018432003B2 (en) Video processing method and device, and terminal and storage medium
US10468029B2 (en) Communication terminal, communication method, and computer program product
CN107707985B (en) Bullet screen control method, mobile terminal and server
CN110149528B (en) Process recording method, device, system, electronic equipment and storage medium
CN115243087A (en) Audio and video co-shooting processing method and device, terminal equipment and storage medium
CN114466145B (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination