CN109859730B - Audio processing method and device - Google Patents

Audio processing method and device

Publication number: CN109859730B
Application number: CN201910227745.4A
Authority: CN (China)
Legal status: Active (granted)
Inventors: 张晨, 郭亮, 范威
Assignee: Beijing Dajia Internet Information Technology Co., Ltd.
Other versions: CN109859730A (Chinese)


Abstract

The present application relates to an audio processing method and apparatus in the field of computer technology. The method is applied to a first user terminal of a first user and includes: collecting first voice data of the first user while locally stored accompaniment music is played; when second voice data of a second user sent by a second user terminal is received, judging whether the current chorus state is a preset chorus state in which the first user is not singing and the second user is singing; if so, calculating the time difference between the far-end acquisition time at which the second user terminal collected the second voice data and the near-end playing time at which the first user terminal plays the second voice data; and delaying the playing time of the locally stored accompaniment music by the time difference and playing the delayed accompaniment music and the second voice data, so that the first user can sing along with them. With the method and apparatus, the alignment of the chorus singing voices can be improved.

Description

Audio processing method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an audio processing method and apparatus.
Background
An online karaoke application provides a chorus function with which a first user and a second user can sing a song together in link-mic (connected-microphone) mode. The first user terminal of the first user and the second user terminal of the second user both store the audio data of the accompaniment music of the song to be chorused locally.
While the first user and the second user sing, the first user terminal can collect the first voice data of the first user while playing the locally stored accompaniment music, and then send the collected first voice data to the second user terminal. Meanwhile, the first user terminal can receive the second voice data of the second user, collected by the second user terminal, and play it. Correspondingly, the second user terminal may receive the first voice data sent by the first user terminal, collect the second voice data of the second user while playing the first voice data together with the locally stored accompaniment music, and then send the second voice data to the first user terminal.
However, because of network transmission delay, data processing delay, and the like, a period of time passes between the generation of the second voice data and its reception by the first user terminal. The singing voice in the second voice data played by the first user terminal may therefore be off-beat relative to the accompaniment music, that is, the rhythm of the singing voice heard by the first user may lag the rhythm of the accompaniment music, resulting in poor alignment of the chorus singing voices.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an audio processing method and apparatus that improve the alignment of the chorus singing voices. The specific technical solution is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an audio processing method, which is applied to a first user terminal of a first user, and includes:
collecting first voice data of the first user in the process of playing locally stored accompaniment music;
when second voice data of a second user sent by a second user terminal is received, judging whether the current chorus state is a preset chorus state, wherein the preset chorus state is that the first user does not sing and the second user sings;
if the current chorus state is the preset chorus state, calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data, wherein the far-end acquisition time is the acquisition time of the second voice data acquired by the second user terminal, and the near-end playing time is the playing time of the second voice data played by the first user terminal;
and delaying the playing time of the locally stored accompaniment music by the time difference, and playing the delayed accompaniment music and the second voice data so that the first user sings according to the delayed accompaniment music and the second voice data.
Optionally, the calculating a time difference between the far-end acquisition time and the near-end playing time of the second voice data includes:
acquiring a sending time stamp carried by the second voice data and a receiving time stamp of the second voice data;
and calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the sending time stamp and the receiving time stamp.
Optionally, the calculating a time difference between the far-end acquisition time and the near-end playing time of the second voice data includes:
and calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to a preset correlation analysis algorithm, the second voice data and the locally stored audio data of the accompaniment music.
Optionally, the calculating a time difference between the far-end acquisition time and the near-end playing time of the second voice data includes:
acquiring data generation time required for generating the second voice data, data transmission time required for transmitting the second voice data to the first user terminal, and data processing time required for the first user terminal to perform data processing on the second voice data;
and calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the data generation time, the data transmission time and the data processing time.
Optionally, delaying the playing time of the locally stored accompaniment music by the time difference includes:
and determining the audio data of the accompaniment music after time delay according to a preset variable-speed and non-tonal modification algorithm, the locally stored audio data of the accompaniment music and the time difference.
Optionally, after delaying the playing time of the locally stored accompaniment music by the time difference, the method further includes:
And synthesizing the first voice data, the second voice data and the delayed audio data of the accompaniment music to determine target audio data.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus, which is applied to a first user terminal of a first user, including:
the audio acquisition unit is configured to acquire first voice data of the first user in the process of playing locally stored accompaniment music;
the judging unit is configured to judge whether the current chorus state is a preset chorus state when second voice data of a second user, sent by a second user terminal, is received, wherein the preset chorus state is that the first user does not sing and the second user sings;
the calculation unit is configured to calculate a time difference between a far-end acquisition time and a near-end playing time of the second voice data when a current chorus state is a preset chorus state, wherein the far-end acquisition time is an acquisition time of the second voice data acquired by the second user terminal, and the near-end playing time is a playing time of the second voice data played by the first user terminal;
and the playing unit is configured to delay the playing time of the locally stored accompaniment music by the time difference and play the delayed accompaniment music and the second voice data so that the first user sings according to the delayed accompaniment music and the second voice data.
Optionally, the computing unit includes:
a first obtaining subunit, configured to obtain a sending timestamp carried by the second voice data and a receiving timestamp of the second voice data;
a first calculating subunit configured to calculate a time difference between a far-end acquisition time and a near-end play time of the second voice data according to the sending time stamp and the receiving time stamp.
Optionally, the computing unit includes:
and the second calculating subunit is configured to calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to a preset correlation analysis algorithm, the second voice data and the locally stored audio data of the accompaniment music.
Optionally, the computing unit includes:
a second obtaining subunit, configured to obtain data generation time required for generating the second voice data, data transmission time required for transmitting the second voice data to the first user terminal, and data processing time required for the first user terminal to perform data processing on the second voice data;
and the third calculation subunit is configured to calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the data generation time, the data transmission time and the data processing time.
Optionally, the playing unit includes:
and the determining subunit is configured to determine the audio data of the accompaniment music after the time delay according to a preset variable-speed and invariable-key algorithm, the locally stored audio data of the accompaniment music and the time difference.
Optionally, the apparatus further comprises:
and the synthesizing unit is configured to synthesize the first audio data, the second audio data and the delayed audio data of the accompaniment music and determine target audio data.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to carry out the method steps of any of the first aspects when executing the program stored on the memory.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the method steps of any of the first aspects.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: first voice data of a first user can be collected while locally stored accompaniment music is played; when second voice data of a second user sent by a second user terminal is received, it is judged whether the current chorus state is a preset chorus state in which the first user is not singing and the second user is singing; if so, the time difference between the far-end acquisition time at which the second user terminal collected the second voice data and the near-end playing time at which the first user terminal plays it is calculated; and the playing time of the locally stored accompaniment music is delayed by the time difference, and the delayed accompaniment music and the second voice data are played, so that the first user can sing along with them. Because the time difference between the far-end acquisition time and the near-end playing time of the second voice data is calculated and the playing time of the accompaniment music is delayed by that time difference, the singing voice of the second user is aligned with the rhythm of the delayed accompaniment music, which makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data and improves the alignment of the chorus singing voices.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;
FIG. 4 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating an audio processing device according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device for audio processing in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as recited in the appended claims.
An embodiment of the present application provides an audio processing method applied to a user terminal. The user terminal may be an electronic device with audio data collection, playback and data processing capabilities, such as a mobile phone or a computer.
Through the online karaoke application, a first user and a second user in different geographic locations can chorus songs or chat with each other in link-mic mode. In one possible implementation, the first user may be the initiator of the link-mic request and the second user may be its recipient. The first user can perform a preset operation to trigger the first user terminal to generate a link-mic request; the preset operation may be clicking the icon of the second user on a preset display page of the first user terminal, or entering the user name of the second user. The link-mic request may carry the user identifier of the first user, the user identifier of the second user and the song identifier of the song to be chorused. The first user terminal may then send the link-mic request to a server of the online karaoke application.
After receiving the link-mic request, the server can forward it to the second user terminal. After receiving a response message from the second user terminal indicating that the link-mic is accepted, the server can send the audio data of the accompaniment music of the song to be chorused to the first user terminal and the second user terminal so that both can store the audio data locally.
In this embodiment of the application, there may be one second user or multiple second users; this is not specifically limited. While singing a song with the other party, the first user terminal of the first user and the second user terminal of the second user can each process the locally stored audio data of the accompaniment music through the audio processing method provided by this embodiment of the application, so that the other party's singing voice is aligned with the locally stored accompaniment music. This embodiment takes the application of the method to the first user terminal of the first user as an example to describe the specific processing procedure of the audio processing method. Fig. 1 is a flow diagram illustrating an audio processing method according to an exemplary embodiment; as shown in Fig. 1, the method includes the following steps.
Step 101, in the process of playing locally stored accompaniment music, collecting first voice data of a first user.
The first user terminal and the second user terminal may each be provided with an audio collection component, such as a microphone or sound pickup, and an audio playback component, such as a speaker or loudspeaker box.
In implementation, the first user terminal may collect the first voice data of the first user through the audio collection component while playing the locally stored accompaniment music through the audio playback component, and then send the first voice data to the second user terminal.
Correspondingly, the second user terminal may collect the second voice data of the second user through its audio collection component while playing the locally stored accompaniment music through its audio playback component, and then send the second voice data to the first user terminal.
Step 102, when second voice data of a second user sent by a second user terminal is received, judging whether the current chorus state is a preset chorus state.
The preset chorus state is that the first user is not singing and the second user is singing. For example, in a male-female duet, when the first user, who sings the male part, is listening to the second user, who sings the female part, sing her part, the current chorus state is the preset chorus state.
In implementation, when receiving second voice data of a second user sent by a second user terminal, the first user terminal may determine whether the current chorus state is a preset chorus state according to the first voice data and the second voice data.
In a possible implementation, a VAD (Voice Activity Detection) algorithm may be preset in the first user terminal. Through the VAD algorithm and the first voice data, the first user terminal can determine whether the first voice data contains the singing voice of the first user; correspondingly, through the VAD algorithm and the second voice data, it can determine whether the second voice data contains the singing voice of the second user. When the first voice data does not contain the singing voice of the first user and the second voice data contains the singing voice of the second user, the first user terminal can determine that the current chorus state is the preset chorus state.
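As a minimal Python sketch of this decision logic (the energy-threshold VAD below is a toy stand-in chosen for illustration; the patent does not prescribe a particular VAD implementation, and the frame format and threshold are assumptions):

    import numpy as np

    def is_singing(frame: np.ndarray, energy_threshold: float = 1e-3) -> bool:
        # Toy energy-based VAD: a frame counts as voiced when its mean power
        # exceeds a fixed threshold. Production VADs are considerably more robust.
        return float(np.mean(frame ** 2)) > energy_threshold

    def in_preset_chorus_state(first_frame: np.ndarray, second_frame: np.ndarray) -> bool:
        # Preset chorus state: the first user is NOT singing while the second user IS.
        return (not is_singing(first_frame)) and is_singing(second_frame)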
Step 103, if the current chorus state is the preset chorus state, calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data.
The far-end acquisition time is the time at which the second user terminal collected the second voice data, and the near-end playing time is the time at which the first user terminal plays the second voice data after receiving it.
In implementation, if the current chorus state is the preset chorus state, the first user terminal may obtain the far-end acquisition time and the near-end playing time of the second voice data, and calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data.
In one possible implementation, the server may send third voice data, sung by the original singer of the chorus song, to the first user terminal and the second user terminal together with the audio data of the accompaniment music. For third voice data and second voice data representing the same lyrics, the first user terminal can treat the third voice data as voice data already matched to the accompaniment music, and then use the playing time of the third voice data as the far-end acquisition time of the second voice data.
For example, suppose the first user and the second user start to chorus the song "Where We Seem to Have Gone" in link-mic mode at 9:10:30. Based on the playing timestamp contained in the third voice data and the accompaniment music's starting playing time of 9:10:30, the first user terminal determines that the playing time of the third voice data for the lyric "as if that is a spring", as sung by the original singer, is 9:10:39, and takes 9:10:39 as the far-end acquisition time of the second voice data. The first user terminal then plays, at 9:10:40, the received second voice data representing the lyric "as if that is a spring", so that the first user hears the second user sing it; the near-end playing time of the second voice data is therefore 9:10:40. The first user terminal then calculates the difference between the far-end acquisition time 9:10:39 and the near-end playing time 9:10:40, obtaining a time difference of 1 second.
In this embodiment of the application, the specific process by which the first user terminal determines the lyrics represented by the second voice data and the third voice data is known in the art and is not described here.
In a possible implementation, the first user terminal may calculate the time difference between the far-end acquisition time and the near-end playing time of the second voice data while receiving the second voice data. In this way, once the current chorus state is determined to be the preset chorus state, the first user terminal can immediately delay the playing time of the accompaniment music, which reduces the time the first user terminal needs to adjust the playing time of the accompaniment music and further improves the alignment of the chorus singing voices.
Step 104, delaying the playing time of the locally stored accompaniment music by the time difference, and playing the delayed accompaniment music and the second voice data so that the first user can sing according to the delayed accompaniment music and the second voice data.
In implementation, the first user terminal may delay the playing time of the accompaniment music according to the locally stored audio data of the accompaniment music and the time difference to obtain the delayed accompaniment music. Then, the first user terminal may play the delayed accompaniment music and the second voice data at the same time, so that the singing voice of the second user matches the accompaniment music, which makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data.
In one possible implementation, the first user terminal may obtain the signal expression Bgm(t) of the locally stored audio data of the accompaniment music, where t is the playing time of the accompaniment music. The first user terminal can delay the playing time of the accompaniment music by replacing t in Bgm(t) with t - T, obtaining the signal expression of the delayed audio data of the accompaniment music, Bgm'(t) = Bgm(t - T), where T is the time difference.
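For audio held as a sample buffer, this time shift amounts to prepending T seconds of silence. A minimal Python sketch (a mono float buffer and a 44.1 kHz sample rate are assumptions for illustration):

    import numpy as np

    def delay_accompaniment(bgm: np.ndarray, time_diff_s: float, sample_rate: int = 44100) -> np.ndarray:
        # Bgm'(t) = Bgm(t - T): shift the accompaniment later by T seconds
        # by prepending T seconds of silence to the buffer.
        pad = np.zeros(int(round(time_diff_s * sample_rate)), dtype=bgm.dtype)
        return np.concatenate([pad, bgm])

Applied in the middle of playback, such a hard shift would be heard as a gap; the time-stretching variant described later avoids that artifact.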
In this embodiment of the application, the first user terminal collects the first voice data of the first user while playing the locally stored accompaniment music, and judges, when receiving the second voice data of the second user sent by the second user terminal, whether the current chorus state is the preset chorus state. If the current chorus state is the preset chorus state, it calculates the time difference between the far-end acquisition time and the near-end playing time of the second voice data, delays the playing time of the locally stored accompaniment music by the time difference, and plays the delayed accompaniment music and the second voice data. Because the time difference between the far-end acquisition time and the near-end playing time of the second voice data is calculated, and the playing time of the accompaniment music is delayed by that time difference, the singing voice of the second user is aligned with the rhythm of the delayed accompaniment music, and the alignment of the chorus singing voices is improved.
Optionally, after obtaining the delayed accompaniment music, the first user terminal may further generate target audio data of the chorus song of the first user and the second user. Then, the first user terminal may send the target audio data to the server, so that the user terminals of other users can obtain the target audio data and play the chorus song of the first user and the second user based on it.
The process by which the first user terminal generates the target audio data may be: synthesizing the first voice data, the second voice data and the delayed audio data of the accompaniment music to determine the target audio data.
A mixing algorithm, such as a normalization-based mixing algorithm or another audio mixing algorithm, may be preset in the first user terminal.
The first user terminal may determine the target audio data through the mixing algorithm from the first voice data, the second voice data and the delayed audio data of the accompaniment music.
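A minimal Python sketch of such a mix (assuming three already-aligned mono float tracks at the same sample rate; the peak normalization is one simple choice, as the patent does not specify a particular mixing algorithm):

    import numpy as np

    def mix_tracks(first_voice: np.ndarray, second_voice: np.ndarray, delayed_bgm: np.ndarray) -> np.ndarray:
        # Sum the aligned tracks, then normalize the peak to avoid clipping.
        n = max(len(first_voice), len(second_voice), len(delayed_bgm))
        mix = np.zeros(n, dtype=np.float64)
        for track in (first_voice, second_voice, delayed_bgm):
            mix[: len(track)] += track
        peak = np.max(np.abs(mix))
        return (mix / peak if peak > 1.0 else mix).astype(np.float32)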
In a possible implementation, the target audio data may be generated by the user terminal of the initiator of the link-mic request (i.e., the first user terminal) and sent to the server, or generated by the user terminal of the recipient of the link-mic request (i.e., the second user terminal) and sent to the server; this embodiment of the application does not specifically limit this.
In this embodiment of the application, the first user terminal synthesizes the first voice data, the second voice data and the delayed audio data of the accompaniment music to determine the target audio data, and can then send the target audio data to the server so that the user terminals of other users can obtain it and play the chorus song of the first user and the second user based on it. In this way, other users can conveniently listen to the chorus song of the first user and the second user.
Optionally, the first user terminal may calculate the time difference between the far-end acquisition time and the near-end playing time of the second voice data in various ways. In one feasible implementation, the first user terminal may calculate the time difference from timestamps. As shown in Fig. 2, the specific processing procedure includes:
step 201, obtaining a sending time stamp carried by the second voice data and a receiving time stamp of the second voice data.
In implementation, the second user terminal may record the current collection time when collecting the second voice data through the audio collection component, obtaining the sending timestamp. The second user terminal may then generate second voice data carrying the sending timestamp and send it to the first user terminal.
The first user terminal may obtain the sending timestamp carried by the second voice data after receiving the second voice data, and record the time of receiving the second voice data to obtain the receiving timestamp.
Step 202, calculating a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the sending time stamp and the receiving time stamp.
In an implementation, the first user terminal may use the sending timestamp as a far-end collecting time of the second voice data and the receiving timestamp as a near-end playing time of the second voice data, and then, the first user terminal may use a time difference between the sending timestamp and the receiving timestamp as a time difference between the far-end collecting time and the near-end playing time of the second voice data.
In this embodiment of the application, the first user terminal obtains the sending timestamp carried by the second voice data and the receiving timestamp of the second voice data, and then calculates the time difference between the far-end acquisition time and the near-end playing time of the second voice data from the two timestamps. Because the time difference is computed from the actual sending and receiving timestamps of the second voice data, it is relatively accurate, which improves the alignment between the accompaniment music delayed by this time difference and the second user's singing voice, makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data, and improves the alignment of the chorus singing voices.
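A minimal Python sketch of this timestamp calculation (the packet layout and millisecond clock are assumptions for illustration; the approach also presumes the two terminals' clocks are reasonably synchronized, e.g. via NTP, which the patent does not address):

    from dataclasses import dataclass

    @dataclass
    class VoicePacket:
        samples: bytes
        send_timestamp_ms: int  # stamped by the second user terminal at collection time

    def time_difference_ms(packet: VoicePacket, receive_timestamp_ms: int) -> int:
        # Far-end acquisition time ~ sending timestamp; near-end playing time
        # ~ receiving timestamp (local buffering before playback is ignored here).
        return receive_timestamp_ms - packet.send_timestamp_ms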
Optionally, in another possible implementation manner, the first user terminal may calculate the time difference according to the time required for performing the steps of generating, transmitting, and processing the second voice data, and the specific processing procedure includes:
step one, acquiring data generation time required for generating second voice data, data transmission time required for transmitting the second voice data to a first user terminal, and data processing time required for the first user terminal to perform data processing on the second voice data.
The first user terminal may prestore a preset data generation time required to generate voice data, a preset data transmission time required to transmit the second voice data from the second user terminal to the first user terminal, and a preset data processing time required for the first user terminal to process voice data. The data processing includes decoding and noise reduction, and the preset data processing time may also include the buffering time required by the data processing. The preset data generation time, the preset data transmission time and the preset data processing time may be empirical values determined by a skilled person.
In implementation, the first user terminal may use a preset data generation time as a data generation time required for generating the second voice data, use a preset data transmission time as a data transmission time required for transmitting the second voice data to the first user terminal, and use a preset data processing time as a data processing time required for performing data processing on the second voice data.
In a possible implementation manner, the first user terminal may determine a data transmission time required for transmitting the second voice data to the first user terminal according to the sending time stamp and the receiving time stamp of the second voice data.
Step two, calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the data generation time, the data transmission time and the data processing time.
In implementation, the first user terminal may use a sum of the preset data generation time, the preset data transmission time, and the preset data processing time as a time difference between the far-end acquisition time and the near-end playing time of the second voice data.
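A minimal sketch of this delay-budget sum (the example stage durations are hypothetical, not values from the patent):

    def time_difference_from_budget_ms(generation_ms: int, transmission_ms: int, processing_ms: int) -> int:
        # The far-end/near-end time difference is taken as the sum of the
        # preset (empirically determined) per-stage durations.
        return generation_ms + transmission_ms + processing_ms

    # e.g. 20 ms encoding + 80 ms network + 30 ms decoding/denoising/buffering
    delay_ms = time_difference_from_budget_ms(20, 80, 30)  # -> 130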
In this embodiment of the application, the first user terminal calculates the time difference between the far-end acquisition time and the near-end playing time of the second voice data from the data generation time required to generate the second voice data, the data transmission time required to transmit it to the first user terminal, and the data processing time required for the first user terminal to process it. Because the time difference is based on the time consumed by the generation, transmission and processing of the second voice data, it is relatively accurate, which improves the alignment between the accompaniment music delayed by this time difference and the second user's singing voice, makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data, and improves the alignment of the chorus singing voices.
Optionally, in another feasible implementation, a correlation analysis algorithm, such as the Generalized Cross-Correlation (GCC) function, may be preset in the first user terminal, and the first user terminal may calculate the time difference based on this algorithm. The specific processing procedure includes: calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data through the preset correlation analysis algorithm, the second voice data and the audio data of the accompaniment music.
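A minimal Python sketch of such a correlation analysis (using the common PHAT-weighted GCC variant; the weighting and parameters are assumptions, since the patent names only GCC):

    import numpy as np

    def gcc_lag_seconds(second_voice: np.ndarray, bgm: np.ndarray, sample_rate: int = 44100) -> float:
        # Estimate how far the received voice lags the locally played accompaniment
        # with generalized cross-correlation.
        n = len(second_voice) + len(bgm)
        X = np.fft.rfft(second_voice, n=n)
        Y = np.fft.rfft(bgm, n=n)
        cross = X * np.conj(Y)
        cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
        lag_samples = int(np.argmax(np.abs(cc))) - max_shift
        return lag_samples / sample_rate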
In this embodiment of the application, the first user terminal calculates the time difference between the far-end acquisition time and the near-end playing time of the second voice data through the correlation analysis algorithm, the second voice data and the audio data of the accompaniment music. Because the playing time of the audio data of the accompaniment music serves as the far-end acquisition time, the time difference calculated from the second voice data and the audio data of the accompaniment music is relatively accurate, which improves the alignment between the accompaniment music delayed by this time difference and the second user's singing voice, makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data, and improves the alignment of the chorus singing voices.
Optionally, a time-stretching algorithm (variable speed without pitch change), such as Waveform Similarity based Overlap-Add (WSOLA), may be preset in the first user terminal, and the first user terminal may delay the playing time of the locally stored accompaniment music through this algorithm. The specific processing procedure includes: determining the delayed audio data of the accompaniment music through the preset time-stretching algorithm, the locally stored audio data of the accompaniment music and the time difference.
In implementation, the first user terminal may obtain the time difference between the far-end acquisition time and the near-end playing time of the second voice data, the signal expression of the locally stored audio data of the accompaniment music, and a preset speed-change factor. Then, through the time-stretching algorithm, the signal expression, the speed-change factor and the time difference, the first user terminal may determine the signal expression of the delayed audio data of the accompaniment music, obtaining the delayed audio data of the accompaniment music.
For example, let Bgm(t) be the signal expression of the locally stored audio data of the accompaniment music, and let T be the time difference between the far-end acquisition time and the near-end playing time of the second voice data. When the playing time of the locally stored accompaniment music needs to be delayed by the time difference T, the speed-change factor obtained by the first user terminal may be 2. Through the time-stretching algorithm, the first user terminal can slow the locally stored audio data of the accompaniment music to half speed and restore the normal playing speed after the processing has lasted for a duration of 2T, so that the playing time of the locally stored accompaniment music is delayed by T. The signal expression of the delayed audio data of the accompaniment music is Bgm'(t) = WSOLA(Bgm(t), 0.5, 2T), where 0.5 denotes playback at half speed and 2T denotes the duration of the delay processing.
When the delay of the playing time of the delayed accompaniment music needs to be restored from T to 0, the speed-change factor obtained by the first user terminal may be 0.5. Through the time-stretching algorithm, the first user terminal can then speed the delayed audio data of the accompaniment music up to double speed and restore the normal playing speed after the processing has lasted for a duration of 2T, so that the delay of the playing time of the accompaniment music returns to zero. The signal expression of the restored audio data of the accompaniment music is Bgm'(t) = WSOLA(Bgm(t), 2, 2T), where 2 denotes playback at double speed and 2T denotes the duration of the processing.
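A heavily simplified Python sketch of the idea (plain overlap-add without WSOLA's waveform-similarity search, so it is only a stand-in for the named algorithm; the frame and hop sizes are arbitrary choices):

    import numpy as np

    def ola_stretch(x: np.ndarray, speed: float, frame: int = 1024, hop_out: int = 256) -> np.ndarray:
        # Naive overlap-add time stretch: speed < 1 slows playback while roughly
        # preserving pitch. Real WSOLA additionally searches for the best-matching
        # frame offset around each analysis position to avoid phase artifacts.
        hop_in = max(1, int(round(hop_out * speed)))
        window = np.hanning(frame)
        n_frames = max(1, (len(x) - frame) // hop_in + 1)
        out = np.zeros(n_frames * hop_out + frame)
        norm = np.zeros_like(out)
        for i in range(n_frames):
            seg = x[i * hop_in : i * hop_in + frame]
            if len(seg) < frame:
                seg = np.pad(seg, (0, frame - len(seg)))
            out[i * hop_out : i * hop_out + frame] += seg * window
            norm[i * hop_out : i * hop_out + frame] += window
        return out / np.maximum(norm, 1e-8)

    def delay_via_slowdown(bgm: np.ndarray, start: int, T_samples: int) -> np.ndarray:
        # Stretch bgm[start : start + T_samples] to twice its length (half speed),
        # so everything after that segment plays T_samples later.
        slowed = ola_stretch(bgm[start : start + T_samples], speed=0.5)
        return np.concatenate([bgm[:start], slowed, bgm[start + T_samples:]])

Slowing a T-second stretch of the accompaniment to half speed makes it occupy 2T seconds of output, which is exactly the gradual, click-free delay the embodiment describes.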
In this embodiment of the application, the first user terminal determines the delayed audio data of the accompaniment music through the time-stretching algorithm, the locally stored audio data of the accompaniment music and the time difference. This avoids the noise that might otherwise be produced in the accompaniment music while it is being delayed by a time difference that changes over time, ensures the playing quality of the delayed accompaniment music, makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data, and improves the alignment of the chorus singing voices.
Fig. 3 is a flowchart of an audio processing method provided by an embodiment of the present application. The server sends the audio data of the accompaniment music to the first user terminal and the second user terminal; the first user terminal sends the collected first voice data of the first user to the second user terminal, and correspondingly the second user terminal sends the collected second voice data of the second user to the first user terminal. The first user terminal may generate the target audio data of the chorus song and send it to the server. The user terminals of other users can obtain the target audio data of the chorus song from the server and play the chorus song of the first user and the second user based on it.
Corresponding to the above processing flow, an embodiment of the present application further provides an audio processing method, as shown in fig. 4, the specific processing procedure includes:
step 401, in the process of playing locally stored accompaniment music, collecting first voice data of a first user.
In the implementation, the specific processing procedure of this step is the same as that of step 101, and is not described herein again.
Step 402, when second voice data of a second user sent by a second user terminal is received, judging whether the current chorus state is a preset chorus state, wherein the preset chorus state is that the first user does not sing and the second user sings.
In an implementation, the specific processing procedure of this step may refer to step 102, and if the current chorus state is the preset chorus state, the first user terminal may perform step 403. If the current chorus state is not the preset chorus state, the first user terminal may perform step 406.
Step 403, if the current chorus state is the preset chorus state, calculating a time difference between the far-end acquisition time and the near-end playing time of the second voice data.
The far-end acquisition time is the acquisition time of the second voice data acquired by the second user terminal, and the near-end playing time is the playing time of the second voice data played by the first user terminal.
In the implementation, the specific processing procedure of this step is the same as that of step 103, and is not described herein again.
Step 404, delaying the playing time of the locally stored accompaniment music by the time difference, and playing the delayed accompaniment music and the second voice data so that the first user sings according to the delayed accompaniment music and the second voice data.
In the implementation, the specific processing procedure of this step is the same as that of step 104, and is not described here again.
Step 405, synthesizing the first voice data, the second voice data and the delayed audio data of the accompaniment music to determine target audio data.
In the implementation, the specific processing procedure of this step is known in the art and is not described here again.
Step 406, synthesizing the first voice data, the second voice data and the locally stored audio data of the accompaniment music to determine the target audio data.
In the implementation, the specific processing procedure of this step is similar to that of step 405, and is not described here again.
Step 407, sending the target audio data to the server.
The audio processing method provided by this embodiment of the application is applied to a first user terminal of a first user. First voice data of the first user can be collected while locally stored accompaniment music is played; when second voice data of a second user sent by a second user terminal is received, it is judged whether the current chorus state is a preset chorus state in which the first user is not singing and the second user is singing; if so, the time difference between the far-end acquisition time at which the second user terminal collected the second voice data and the near-end playing time at which the first user terminal plays it is calculated; and the playing time of the locally stored accompaniment music is delayed by the time difference, and the delayed accompaniment music and the second voice data are played, so that the first user can sing along with them. Because the time difference between the far-end acquisition time and the near-end playing time of the second voice data is calculated and the playing time of the accompaniment music is delayed by that time difference, the singing voice of the second user is aligned with the rhythm of the delayed accompaniment music, which makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data and improves the alignment of the chorus singing voices.
Fig. 5 is a block diagram illustrating an audio processing device according to an example embodiment. The apparatus is applied to a first user terminal of a first user; referring to Fig. 5, the apparatus includes an audio collecting unit 510, a judging unit 520, a calculating unit 530 and a playing unit 540.
An audio collecting unit 510 configured to collect first voice data of the first user during playing of locally stored accompaniment music;
a determining unit 520, configured to determine whether a current chorus state is a preset chorus state when second voice data of a second user sent by a second user terminal is received, where the preset chorus state is that the first user does not sing and the second user sings;
a calculating unit 530 configured to calculate a time difference between a far-end collecting time and a near-end playing time of the second voice data when the current chorus state is a preset chorus state, where the far-end collecting time is a collecting time of the second voice data collected by the second user terminal, and the near-end playing time is a playing time of the second voice data played by the first user terminal;
a playing unit 540 configured to delay a playing time of the locally stored accompaniment music by the time difference, and play the delayed accompaniment music and the second voice data, so that the first user sings according to the delayed accompaniment music and the second voice data.
Optionally, the computing unit includes:
a first obtaining subunit, configured to obtain a sending timestamp carried by the second voice data and a receiving timestamp of the second voice data;
a first calculating subunit configured to calculate a time difference between a far-end acquisition time and a near-end play time of the second voice data according to the sending time stamp and the receiving time stamp.
Optionally, the computing unit includes:
and the second calculating subunit is configured to calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to a preset correlation analysis algorithm, the second voice data and the locally stored audio data of the accompaniment music.
Optionally, the computing unit includes:
a second obtaining subunit, configured to obtain data generation time required for generating the second voice data, data transmission time required for transmitting the second voice data to the first user terminal, and data processing time required for the first user terminal to perform data processing on the second voice data;
and the third calculation subunit is configured to calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the data generation time, the data transmission time and the data processing time.
Optionally, the playing unit includes:
and the determining subunit is configured to determine the audio data of the accompaniment music after the time delay according to a preset variable-speed and invariable-key algorithm, the locally stored audio data of the accompaniment music and the time difference.
Optionally, the apparatus further comprises:
and the synthesizing unit is configured to synthesize the first audio data, the second audio data and the delayed audio data of the accompaniment music and determine target audio data.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The audio processing device provided by this embodiment of the application is applied to a first user terminal of a first user. First voice data of the first user can be collected while locally stored accompaniment music is played; when second voice data of a second user sent by a second user terminal is received, it is judged whether the current chorus state is a preset chorus state in which the first user is not singing and the second user is singing; if so, the time difference between the far-end acquisition time at which the second user terminal collected the second voice data and the near-end playing time at which the first user terminal plays it is calculated; and the playing time of the locally stored accompaniment music is delayed by the time difference, and the delayed accompaniment music and the second voice data are played, so that the first user can sing along with them. Because the time difference between the far-end acquisition time and the near-end playing time of the second voice data is calculated and the playing time of the accompaniment music is delayed by that time difference, the singing voice of the second user is aligned with the rhythm of the delayed accompaniment music, which makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data and improves the alignment of the chorus singing voices.
Fig. 6 is a block diagram illustrating an electronic device 600 for audio processing in accordance with an example embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more units that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia unit to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation at the device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply component 606 provides power to the various components of electronic device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 600.
The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface units, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessment of various aspects of the electronic device 600. For example, the sensor component 614 may detect an open/closed state of the device 600, the relative positioning of components, such as a display and keypad of the electronic device 600, the sensor component 614 may also detect a change in the position of the electronic device 600 or a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, orientation or acceleration/deceleration of the electronic device 600, and a change in the temperature of the electronic device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the electronic device 600 and other devices in a wired or wireless manner. The electronic device 600 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) unit to facilitate short-range communications. For example, the NFC unit may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the electronic device 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the audio processing methods of the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application occur in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The technical solutions provided by the embodiments of the disclosure can have the following beneficial effects: first voice data of a first user can be collected while locally stored accompaniment music is played; when second voice data of a second user sent by a second user terminal is received, it is judged whether the current chorus state is a preset chorus state in which the first user does not sing and the second user sings; if the current chorus state is the preset chorus state, the time difference between the far-end acquisition time of the second voice data acquired by the second user terminal and the near-end playing time of the second voice data played by the first user terminal is calculated; and the playing time of the locally stored accompaniment music is delayed by the time difference, and the delayed accompaniment music and the second voice data are played, so that the first user can sing according to the delayed accompaniment music and the second voice data. Because the time difference between the far-end acquisition time and the near-end playing time of the second voice data is calculated and the playing time of the accompaniment music is delayed by this time difference, the singing voice of the second user is aligned with the rhythm of the delayed accompaniment music. This makes it convenient for the first user to sing along with the delayed accompaniment music and the second voice data, and the alignment effect of the chorus singing voices can be improved.
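To make the flow described above concrete, the following is a minimal Python sketch of the delay decision; the function name, parameters, and the use of wall-clock seconds are our own illustrative assumptions, not details given by the disclosure:

    def accompaniment_delay(first_user_singing: bool,
                            second_user_singing: bool,
                            remote_capture_time: float,
                            local_play_time: float) -> float:
        """Return how many seconds to delay local accompaniment playback.

        A non-zero delay is applied only in the preset chorus state
        (first user not singing, second user singing), so realignment
        never cuts into the first user's own phrase.
        """
        if first_user_singing or not second_user_singing:
            return 0.0
        # Time difference between the far-end acquisition time and the
        # near-end playing time of the second voice data.
        return local_play_time - remote_capture_time

Each time the first user terminal receives a frame of second voice data, it could run this check and, on a non-zero result, shift the accompaniment playback later by the returned amount.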
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (14)

1. An audio processing method applied to a first user terminal of a first user, comprising:
collecting first voice data of the first user in the process of playing locally stored accompaniment music;
when second voice data of a second user sent by a second user terminal is received, judging whether the current chorus state is a preset chorus state, wherein the preset chorus state is that the first user does not sing and the second user sings;
if the current chorus state is the preset chorus state, calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data, wherein the far-end acquisition time is the acquisition time of the second voice data acquired by the second user terminal, and the near-end playing time is the playing time of the second voice data played by the first user terminal;
and delaying the playing time of the locally stored accompaniment music by the time difference, and playing the delayed accompaniment music and the second voice data so that the first user sings according to the delayed accompaniment music and the second voice data.
2. The method of claim 1, wherein calculating the time difference between the far-end acquisition time and the near-end playing time of the second speech data comprises:
acquiring a sending time stamp carried by the second voice data and a receiving time stamp of the second voice data;
and calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the sending time stamp and the receiving time stamp.
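(As a rough illustration only, not part of the claims.) One way the timestamp-based calculation of claim 2 could look, assuming both terminals share a synchronized clock and allowing a hypothetical buffering term for local reception-to-playback latency:

    def time_difference_from_timestamps(send_timestamp_ms: int,
                                        receive_timestamp_ms: int,
                                        buffering_ms: int = 0) -> float:
        # Network transit time from far-end capture to near-end reception.
        network_ms = receive_timestamp_ms - send_timestamp_ms
        # Add local buffering/playback latency, then convert to seconds.
        return (network_ms + buffering_ms) / 1000.0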
3. The method of claim 1, wherein calculating the time difference between the far-end acquisition time and the near-end playing time of the second speech data comprises:
and calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to a preset correlation analysis algorithm, the second voice data and the locally stored audio data of the accompaniment music.
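(Illustrative only.) A plausible reading of the correlation analysis in claim 3 is a cross-correlation between the received second voice data and the local accompaniment audio; a sketch under that assumption, with both signals taken as mono float arrays at the same sample rate:

    import numpy as np

    def estimate_lag_by_correlation(received_voice: np.ndarray,
                                    local_accompaniment: np.ndarray,
                                    sample_rate: int) -> float:
        # Full cross-correlation; the peak indicates the best alignment.
        correlation = np.correlate(received_voice, local_accompaniment,
                                   mode="full")
        # Re-centre so that a lag of 0 means the signals already align.
        lag = int(np.argmax(np.abs(correlation))) - (len(local_accompaniment) - 1)
        return lag / sample_rate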
4. The method of claim 1, wherein calculating the time difference between the far-end acquisition time and the near-end playing time of the second speech data comprises:
acquiring data generation time required for generating the second voice data, data transmission time required for transmitting the second voice data to the first user terminal, and data processing time required for the first user terminal to perform data processing on the second voice data;
and calculating the time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the data generation time, the data transmission time and the data processing time.
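(Illustrative only.) Claim 4 amounts to summing the stages of the pipeline; a trivial sketch with our own stage names:

    def time_difference_from_stages(generation_s: float,
                                    transmission_s: float,
                                    processing_s: float) -> float:
        # Far-end capture to near-end playback, modelled as the time to
        # generate the voice packet, transmit it, and process it locally.
        return generation_s + transmission_s + processing_s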
5. The method of claim 1, wherein delaying the playing of the locally stored accompaniment music by the time difference comprises:
and determining the delayed audio data of the accompaniment music according to a preset variable-speed, constant-pitch algorithm, the locally stored audio data of the accompaniment music, and the time difference.
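(Illustrative only.) The claim names a preset variable-speed, constant-pitch algorithm without fixing one; a sketch of how such a delay could be realized with librosa's phase-vocoder time stretch (choosing librosa, the two-second window, and the function name are our assumptions):

    import librosa
    import numpy as np

    def delay_by_time_stretch(accompaniment: np.ndarray,
                              sample_rate: int,
                              time_difference_s: float,
                              window_s: float = 2.0) -> np.ndarray:
        # Stretch the next window_s seconds so they last
        # window_s + time_difference_s, pushing everything after them
        # later in time without altering pitch.
        window = int(window_s * sample_rate)
        head, tail = accompaniment[:window], accompaniment[window:]
        rate = window_s / (window_s + time_difference_s)  # rate < 1 slows playback
        stretched = librosa.effects.time_stretch(head, rate=rate)
        return np.concatenate([stretched, tail])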
6. The method of claim 1, wherein after delaying the playing time of the locally stored accompaniment music by the time difference, the method further comprises:
and synthesizing the first voice data, the second voice data and the delayed audio data of the accompaniment music to determine target audio data.
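(Illustrative only.) The synthesis in claim 6 is, at its simplest, mixing the three tracks; equal-weight summing with peak normalization is our choice, not something the claim specifies:

    import numpy as np

    def synthesize_target_audio(first_voice: np.ndarray,
                                second_voice: np.ndarray,
                                delayed_accompaniment: np.ndarray) -> np.ndarray:
        # Pad all tracks to a common length, sum them, and normalize
        # only if the mix would clip; mono float arrays assumed.
        length = max(len(first_voice), len(second_voice),
                     len(delayed_accompaniment))
        mix = np.zeros(length, dtype=np.float32)
        for track in (first_voice, second_voice, delayed_accompaniment):
            mix[:len(track)] += track.astype(np.float32)
        peak = float(np.max(np.abs(mix)))
        return mix / peak if peak > 1.0 else mix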
7. An audio processing apparatus, applied to a first user terminal of a first user, comprising:
an audio acquisition unit configured to collect first voice data of the first user in the process of playing locally stored accompaniment music;
a judging unit configured to judge, when second voice data of a second user sent by a second user terminal is received, whether the current chorus state is a preset chorus state, wherein the preset chorus state is that the first user does not sing and the second user sings;
a calculating unit configured to calculate, when the current chorus state is the preset chorus state, a time difference between a far-end acquisition time and a near-end playing time of the second voice data, wherein the far-end acquisition time is the acquisition time of the second voice data acquired by the second user terminal, and the near-end playing time is the playing time of the second voice data played by the first user terminal;
and a playing unit configured to delay the playing time of the locally stored accompaniment music by the time difference and play the delayed accompaniment music and the second voice data, so that the first user sings according to the delayed accompaniment music and the second voice data.
8. The apparatus of claim 7, wherein the calculating unit comprises:
a first obtaining subunit, configured to obtain a sending timestamp carried by the second voice data and a receiving timestamp of the second voice data;
a first calculating subunit configured to calculate a time difference between a far-end acquisition time and a near-end play time of the second voice data according to the sending time stamp and the receiving time stamp.
9. The apparatus of claim 7, wherein the calculating unit comprises:
and a second calculating subunit configured to calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to a preset correlation analysis algorithm, the second voice data, and the locally stored audio data of the accompaniment music.
10. The apparatus of claim 7, wherein the calculating unit comprises:
a second obtaining subunit, configured to obtain data generation time required for generating the second voice data, data transmission time required for transmitting the second voice data to the first user terminal, and data processing time required for the first user terminal to perform data processing on the second voice data;
and a third calculating subunit configured to calculate a time difference between the far-end acquisition time and the near-end playing time of the second voice data according to the data generation time, the data transmission time, and the data processing time.
11. The apparatus of claim 7, wherein the playing unit comprises:
and a determining subunit configured to determine the delayed audio data of the accompaniment music according to a preset variable-speed, constant-pitch algorithm, the locally stored audio data of the accompaniment music, and the time difference.
12. The apparatus of claim 7, further comprising:
a synthesizing unit configured to synthesize the first voice data, the second voice data, and the delayed audio data of the accompaniment music, and determine target audio data.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to implement the method steps of any of claims 1-6 when executing the program stored on the memory.
14. A non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the method steps of any of claims 1-6.
CN201910227745.4A 2019-03-25 2019-03-25 Audio processing method and device Active CN109859730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910227745.4A CN109859730B (en) 2019-03-25 2019-03-25 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910227745.4A CN109859730B (en) 2019-03-25 2019-03-25 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN109859730A CN109859730A (en) 2019-06-07
CN109859730B true CN109859730B (en) 2021-03-26

Family

ID=66901908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910227745.4A Active CN109859730B (en) 2019-03-25 2019-03-25 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN109859730B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028818B (en) * 2019-11-14 2022-11-22 北京达佳互联信息技术有限公司 Chorus method, apparatus, electronic device and storage medium
CN110992920B (en) * 2019-11-29 2022-04-29 北京达佳互联信息技术有限公司 Live broadcasting chorus method and device, electronic equipment and storage medium
CN111402844B (en) * 2020-03-26 2024-04-09 广州酷狗计算机科技有限公司 Song chorus method, device and system
CN111524494B (en) * 2020-04-27 2023-08-18 腾讯音乐娱乐科技(深圳)有限公司 Remote real-time chorus method and device and storage medium
CN111640411B (en) * 2020-05-29 2023-04-18 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method, device and computer readable storage medium
CN112492338B (en) * 2020-11-27 2023-10-13 腾讯音乐娱乐科技(深圳)有限公司 Online song house implementation method, electronic equipment and computer readable storage medium
CN112489611A (en) * 2020-11-27 2021-03-12 腾讯音乐娱乐科技(深圳)有限公司 Online song room implementation method, electronic device and computer readable storage medium
CN113345446B (en) * 2021-06-01 2024-02-27 广州虎牙科技有限公司 Audio processing method, device, electronic equipment and computer readable storage medium
CN115174981B (en) * 2022-08-03 2024-02-23 湖南广播电视台 Remote joint singing method, device, equipment and storage medium based on micro-service

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1020872A (en) * 1996-07-08 1998-01-23 Ekushingu:Kk Karaoke device
CN103377649A (en) * 2012-04-20 2013-10-30 上海渐华科技发展有限公司 Method for achieving network karaoke antiphonal singing in real time

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3551000B2 (en) * 1997-01-24 2004-08-04 ヤマハ株式会社 Automatic performance device, automatic performance method, and medium recording program
JP6127476B2 (en) * 2012-11-30 2017-05-17 ヤマハ株式会社 Method and apparatus for measuring delay in network music session

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1020872A (en) * 1996-07-08 1998-01-23 Ekushingu:Kk Karaoke device
CN103377649A (en) * 2012-04-20 2013-10-30 上海渐华科技发展有限公司 Method for achieving network karaoke antiphonal singing in real time

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a Content-Based Polyphonic Music Retrieval System; Yang Bo; China Masters' Theses Full-text Database, Information Science and Technology Series; 2012-10-15 (No. 1138); page 1, paragraph 1 *

Also Published As

Publication number Publication date
CN109859730A (en) 2019-06-07

Similar Documents

Publication Publication Date Title
CN109859730B (en) Audio processing method and device
CN104618218B (en) Message prompt method and device
CN106921560A (en) Voice communication method, apparatus and system
CN109033335B (en) Audio recording method, device, terminal and storage medium
US20220076688A1 (en) Method and apparatus for optimizing sound quality for instant messaging
CN107172497A (en) Live broadcasting method, apparatus and system
CN107994879B (en) Loudness control method and device
CN110503935B (en) Audio data processing method and device, electronic equipment and storage medium
JP2016517253A (en) Voice call method, voice playback method, apparatus, program, and recording medium
CN104159139A (en) Method and device of multimedia synchronization
CN110134362A (en) Audio frequency playing method, device, playback equipment and storage medium
CN110718239A (en) Audio processing method and device, electronic equipment and storage medium
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
CN110619873A (en) Audio processing method, device and storage medium
CN106128440A (en) A kind of lyrics display processing method, device, terminal unit and system
CN112788359A (en) Live broadcast processing method and device, electronic equipment and storage medium
CN110610720B (en) Data processing method and device and data processing device
CN106060707B (en) Reverberation processing method and device
CN111081277A (en) Audio evaluation method, device, equipment and storage medium
CN110660403B (en) Audio data processing method, device, equipment and readable storage medium
CN110798327A (en) Message processing method, device and storage medium
CN109729382A (en) Volume processing method, device and server, the electronic equipment of audio, video data
CN113707113A (en) Method and device for modifying singing voice of user and electronic equipment
CN107872620A (en) video recording method and device
CN114390304B (en) Live broadcast sound changing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant