CN110491358B

CN110491358B - Method, device, equipment, system and storage medium for audio recording

Info

Publication number: CN110491358B
Application number: CN201910755717.XA
Authority: CN
Inventors: 刘梓谦; 刘东平
Original assignee: Guangzhou Kugou Computer Technology Co Ltd
Current assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2023-06-27
Anticipated expiration: 2039-08-15
Also published as: CN110491358A

Abstract

The application discloses a method, apparatus, system, device and storage medium for audio recording, belonging to the field of the Internet. The method comprises: receiving a lead-singing audio recording instruction corresponding to a target song; starting to record the vocal audio of a first segment of the target song when a first preset duration has elapsed after the lead-singing audio recording instruction is received; starting to play the accompaniment audio of the target song from a first preset play time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received; and ending the recording when a third preset duration has elapsed after the accompaniment audio is played to a second preset play time point, and generating lead-singing audio based on the recorded vocal audio and the accompaniment audio.

Description

Method, device, equipment, system and storage medium for audio recording

Technical Field

The present invention relates to the internet field, and in particular, to a method, apparatus, device, system and storage medium for performing audio recording.

Background

With the development of mobile networks, people have more and more entertainment options, and singing songs together with friends through a singing application has become a common form of entertainment.

The current technology for follow-up singing in a singing application is as follows. A first user selects the song to be sung, triggers the recording instruction on a first terminal, and starts to sing a first segment of the song. On receiving the recording instruction, the first terminal plays the accompaniment audio from the start time point of the first segment and simultaneously starts recording the first user's vocal audio; when the accompaniment audio is played to the end time point of the first segment, the terminal stops playing the accompaniment audio and simultaneously stops recording. After the recording is finished, the first terminal synthesizes the vocal audio and the accompaniment audio into lead-singing audio and publishes it to the Internet. A second user can find the lead-singing audio published by the first user on the Internet, trigger the recording instruction on a second terminal, and then sing the second segment of the song, continuing from the first user's lead-singing audio. On receiving the recording instruction, the second terminal plays the accompaniment audio from the start time point of the second segment, that is, the end time point of the first segment, and simultaneously starts recording the second user's vocal audio; when the accompaniment audio is played to the end time point of the second segment, it stops playing the accompaniment audio and stops recording. The second terminal then splices the second user's vocal audio with the first user's singing audio to generate the follow-up singing audio.

In carrying out the present application, the inventors have found that the prior art has at least the following problem: in the above scheme, the start time point of a segment is the time point at which the user needs to start singing, and the end time point of the segment is the time point at which the user needs to stop singing. The moments at which the user starts and stops singing are therefore very close to the moments at which the terminal starts and stops recording, so the first word or the last word of the voice may be recorded incompletely in the recorded vocal audio, which degrades the user's experience of follow-up singing with the singing application.

Disclosure of Invention

The embodiment of the application provides an audio recording method, which can solve the problem that the first word or the last word in recorded chorus audio is not recorded completely. The technical scheme is as follows:

In one aspect, a method of audio recording is provided, the method comprising:

receiving a lead-singing audio recording instruction corresponding to a target song;

starting to record the vocal audio of a first segment of the target song when a first preset duration has elapsed after the lead-singing audio recording instruction is received;

starting to play the accompaniment audio of the target song from a first preset play time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received;

ending the recording when a third preset duration has elapsed after the accompaniment audio is played to a second preset play time point, and generating lead-singing audio based on the recorded vocal audio and the accompaniment audio, wherein the first preset play time point and the second preset play time point are respectively the start time point and the end time point of the first segment of the target song.

Optionally, the generating lead-singing audio based on the recorded vocal audio and the accompaniment audio includes:

aligning the first preset play time point of the accompaniment audio with the start time point of the vocal audio, and synthesizing the vocal audio and the accompaniment audio to generate the lead-singing audio.

Optionally, the method further comprises:

displaying countdown information when the lead-singing audio recording instruction corresponding to the target song is received, wherein the countdown duration corresponding to the countdown information is the second preset duration.

Optionally, the generating lead-singing audio based on the recorded vocal audio and the accompaniment audio further includes:

if a voice start time point is detected within a target duration range from the start position of the recorded vocal audio of the first segment, cutting off the part of that vocal audio before the voice start time point, and if no voice start time point is detected, cutting off the part of the target duration at the start of that vocal audio, wherein the target duration is the difference between the second preset duration and the first preset duration;

if a voice end time point is detected within the third preset duration range before the end position of the recorded vocal audio of the first segment, cutting off the part of that vocal audio after the voice end time point, and if no voice end time point is detected, cutting off the part of the third preset duration at the end of that vocal audio;

and generating the lead-singing audio based on the trimmed vocal audio and the accompaniment audio.

In yet another aspect, a method of audio recording is provided, the method comprising:

receiving a follow-up singing audio recording instruction for lead-singing audio corresponding to a target song, wherein the lead-singing audio comprises the accompaniment audio of the target song and the vocal audio of a first segment of the target song;

starting to record the vocal audio of a second segment of the target song when a first preset duration has elapsed after the follow-up singing audio recording instruction is received;

starting to play the lead-singing audio of the target song from a second preset play time point when a second preset duration has elapsed after the follow-up singing audio recording instruction is received;

ending the recording when a third preset duration has elapsed after the lead-singing audio is played to a third preset play time point, and generating follow-up singing audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio, wherein the second preset play time point is both the end time point of the first segment and the start time point of the second segment, and the third preset play time point is the end time point of the second segment.

Optionally, the generating follow-up singing audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio includes:

determining the duration between a first preset play time point of the lead-singing audio and the second preset play time point, wherein the first preset play time point is the start time point of the first segment;

determining, in the vocal audio of the first segment, a target time point separated from the start time point of the first segment by that duration;

aligning the target time point in the vocal audio of the first segment with the start time point of the vocal audio of the second segment, and synthesizing the vocal audio of the first segment and the vocal audio of the second segment to generate synthesized vocal audio;

and aligning the first preset play time point of the accompaniment audio with the start time point of the synthesized vocal audio, and synthesizing the synthesized vocal audio and the accompaniment audio to generate the follow-up singing audio of the target song.

Optionally, the aligning the target time point in the vocal audio of the first segment with the start time point of the vocal audio of the second segment, and synthesizing the vocal audio of the first segment and the vocal audio of the second segment to generate synthesized vocal audio, includes:

processing the part of the vocal audio of the first segment after the target time point so that its volume is gradually reduced, to obtain adjusted vocal audio;

and aligning the target time point in the adjusted vocal audio with the start time point of the vocal audio of the second segment, and synthesizing the adjusted vocal audio and the vocal audio of the second segment to generate the synthesized vocal audio.

Optionally, the method further comprises:

displaying countdown information when the follow-up singing audio recording instruction for the lead-singing audio corresponding to the target song is received, wherein the countdown duration corresponding to the countdown information is the second preset duration.

Optionally, the generating follow-up singing audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio further includes:

if a voice start time point is detected within a target duration range from the start position of the recorded vocal audio of the second segment, cutting off the part of that vocal audio before the voice start time point, and if no voice start time point is detected, cutting off the part of the target duration at the start of that vocal audio, wherein the target duration is the difference between the second preset duration and the first preset duration;

if a voice end time point is detected within the third preset duration range before the end position of the recorded vocal audio of the second segment, cutting off the part of that vocal audio after the voice end time point, and if no voice end time point is detected, cutting off the part of the third preset duration at the end of that vocal audio;

and generating the follow-up singing audio of the target song based on the trimmed vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio.
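The alignment and fade-out of the two vocal recordings described in this aspect can be illustrated with a minimal Python sketch. It is not the implementation of the application: the function names `target_sample_index` and `splice_vocals`, the use of plain lists of samples, and the linear fade curve are all assumptions made for illustration, since the text only states that the volume after the target time point is gradually reduced.

```python
from typing import List, Sequence


def target_sample_index(first_point_ms: int, second_point_ms: int,
                        sample_rate_hz: int) -> int:
    """Sample index, inside the first-segment vocal, that corresponds to the
    second preset play time point (hypothetical helper)."""
    return (second_point_ms - first_point_ms) * sample_rate_hz // 1000


def splice_vocals(first_vocal: Sequence[float], second_vocal: Sequence[float],
                  target_index: int, fade_len: int) -> List[float]:
    """Fade out the first vocal after target_index and overlay the second
    vocal so that its first sample lands on target_index.

    The linear fade over fade_len samples is an assumption for illustration.
    """
    if fade_len <= 0:
        raise ValueError("fade_len must be positive")
    out = list(first_vocal[:target_index])   # untouched part of the lead vocal
    tail = first_vocal[target_index:]         # overhang after the target point
    for i in range(max(len(tail), len(second_vocal))):
        gain = max(0.0, 1.0 - i / fade_len)   # 1.0 at the target point, then down to 0
        a = tail[i] * gain if i < len(tail) else 0.0
        b = second_vocal[i] if i < len(second_vocal) else 0.0
        out.append(a + b)
    return out


# Toy data: the lead vocal overhangs the target point by two samples.
spliced = splice_vocals([1.0] * 10, [0.5] * 4, target_index=8, fade_len=2)
```

The synthesized vocal produced this way would then be aligned with the first preset play time point of the accompaniment audio, as the preceding clause describes.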

In another aspect, an apparatus for performing audio recording is provided, the apparatus comprising:

a receiving module configured to receive a lead-singing audio recording instruction corresponding to a target song;

a recording module configured to start recording the vocal audio of a first segment of the target song when a first preset duration has elapsed after the lead-singing audio recording instruction is received;

a playing module configured to start playing the accompaniment audio of the target song from a first preset play time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received;

and a synthesis module configured to end the recording when a third preset duration has elapsed after the accompaniment audio is played to a second preset play time point, and to generate lead-singing audio based on the recorded vocal audio and the accompaniment audio, wherein the first preset play time point and the second preset play time point are respectively the start time point and the end time point of the first segment of the target song.

Optionally, the synthesis module is configured to:

Optionally, the apparatus further comprises a display module configured to:

Optionally, the apparatus further comprises a processing module configured to:

generating lead-singing audio based on the trimmed vocal audio of the first segment and the accompaniment audio.

In yet another aspect, an apparatus for performing audio recording is provided, the apparatus comprising:

a receiving module configured to receive a follow-up singing audio recording instruction for lead-singing audio corresponding to a target song, wherein the lead-singing audio comprises the accompaniment audio of the target song and the vocal audio of a first segment of the target song;

a recording module configured to start recording the vocal audio of a second segment of the target song when a first preset duration has elapsed after the follow-up singing audio recording instruction is received;

a playing module configured to start playing the lead-singing audio of the target song from a second preset play time point when a second preset duration has elapsed after the follow-up singing audio recording instruction is received;

and a synthesis module configured to end the recording when a third preset duration has elapsed after the lead-singing audio is played to a third preset play time point, and to generate follow-up singing audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio, wherein the second preset play time point is both the end time point of the first segment and the start time point of the second segment, and the third preset play time point is the end time point of the second segment.

Optionally, the synthesis module is configured to:

determining, in the vocal audio of the first segment, a target time point separated from the start time point by the duration;

Optionally, the synthesis module is further configured to:

Optionally, the apparatus further comprises a display module configured to:

Optionally, the apparatus further comprises a processing module configured to:

if a voice start time point is detected within a target duration range from the start position of the recorded vocal audio of the second segment, cutting off the part of that vocal audio before the voice start time point, and if no voice start time point is detected, cutting off the part of the target duration at the start of that vocal audio, wherein the target duration is the difference between the second preset duration and the first preset duration;

In yet another aspect, a system for performing audio recording is provided, the system including a first terminal, a second terminal, and a server, wherein:

the first terminal is configured to: send a lead-singing request to the server; start recording the vocal audio of a first segment of the target song when a first preset duration has elapsed after a lead-singing audio recording instruction is received; start playing the accompaniment audio of the target song from a first preset play time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received; end the recording when a third preset duration has elapsed after the accompaniment audio is played to a second preset play time point, and generate lead-singing audio based on the recorded vocal audio and the accompaniment audio, wherein the first preset play time point and the second preset play time point are respectively the start time point and the end time point of the first segment of the target song; and send the lead-singing audio to the server;

the second terminal is configured to: send a follow-up singing request to the server; receive a follow-up singing audio recording instruction for the lead-singing audio corresponding to the target song, wherein the lead-singing audio comprises the accompaniment audio of the target song and the vocal audio of the first segment of the target song; start recording the vocal audio of a second segment of the target song when the first preset duration has elapsed after the follow-up singing audio recording instruction is received; start playing the lead-singing audio of the target song from the second preset play time point when the second preset duration has elapsed after the follow-up singing audio recording instruction is received; end the recording when the third preset duration has elapsed after the lead-singing audio is played to a third preset play time point, and generate follow-up singing audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio, wherein the second preset play time point is both the end time point of the first segment and the start time point of the second segment, and the third preset play time point is the end time point of the second segment; and send the follow-up singing audio to the server;

the server is configured to send the accompaniment audio of the target song to the first terminal according to the lead-singing request, and to send the lead-singing audio and the accompaniment audio to the second terminal according to the follow-up singing request.

In yet another aspect, a computer device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to perform the operations performed by the method of audio recording as described above.

In yet another aspect, a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the method of audio recording as described above is provided.

The technical solutions provided in the embodiments of the present application bring the following beneficial effects:

according to the embodiments of the present application, when a user performs follow-up singing with the singing application, the terminal starts the recording function in advance and closes it with a delay, so that the audio of the user's singing is recorded completely. The embodiments of the present application therefore solve the problem that the first word or the last word is recorded incompletely in the recorded audio, and improve the user's follow-up singing experience.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic illustration of an implementation environment provided by embodiments of the present application;

fig. 2 is a flowchart of a method for performing audio recording according to an embodiment of the present application;

fig. 3 is a flowchart of a method for performing audio recording according to an embodiment of the present application;

fig. 4 is a flowchart of a method for performing audio recording according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of an apparatus for performing audio recording according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of an apparatus for performing audio recording according to an embodiment of the present application;

fig. 7 is a schematic diagram of a terminal structure provided in an embodiment of the present application;

fig. 8 is a schematic diagram of a server structure according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to FIG. 1, the audio recording method provided in the present application may be implemented jointly by a terminal and a server. The terminal may run an application with a singing function, such as a singing application; may be provided with components such as a microphone, an earphone and a loudspeaker; has a recording function and a communication function; and can connect to the Internet. The terminal may be a mobile phone, a tablet computer, a smart wearable device, a desktop computer, a notebook computer, or the like. The server may be a background server of the application and can establish communication with the terminal. The server may be a single server or a server group. A single server is responsible for all of the server-side processing in the schemes below; in a server group, different servers may be responsible for different parts of that processing, and the specific allocation may be set by technicians according to actual requirements and is not described here.

In the audio recording method provided by the present application, the recording function is started in advance and closed with a delay, and the recorded audio is then subjected to voice detection, so that the audio is recorded completely. The embodiments of the present application describe the scheme in detail using a singing application as an example; other cases are similar and are not repeated. A singing application is installed on the terminal. The singing application can play the accompaniment audio of a song and display its lyrics while recording, through the microphone, the vocal audio of the user singing the song; it then synthesizes the recorded vocal audio with the accompaniment audio of the song to generate singing audio. The singing application may also have a follow-up singing function: a first user sings the first segment of a song clip and the lead vocal audio is recorded; other users can then continue from the audio sung by the first user, singing the second segment of the same song clip, and the follow-up vocal audio is recorded. The start position of the follow-up vocal audio is spliced to the end position of the lead vocal audio to obtain spliced audio, which is then synthesized with the accompaniment to generate the follow-up singing audio.

When using the singing application, a user can register an account. An account can follow other accounts and be followed by them. Each account has its own activity page, on which the singing audio of the account and similar content can be displayed, and different accounts can visit and browse one another's activity pages. The singing application is also provided with a singing tab; the user can click the singing tab to enter a singing page, select the song clip to be sung, and then click the start-recording option in the singing page to sing the selected clip. The user can visit the activity pages of other accounts and browse the lead-singing audio they have published. Each lead-singing audio has a corresponding follow-up singing tab; the user can click it to enter the follow-up singing page and then click the start-recording option there to continue the lead-singing audio. The singing application is also provided with a follow-up singing function page, which lists lead-singing audio published by other accounts, with a follow-up singing tab below each item that the user can click to continue that lead-singing audio.

However, during lead singing and follow-up singing, the moments at which the terminal starts and ends recording may coincide with the moments at which the user starts and stops singing. As a result, the first word or the last word may be recorded incompletely in the recorded lead vocal audio or follow-up vocal audio, and the user may also start singing early, causing the vocal audio and the accompaniment audio to be misaligned. In the embodiments of the present application, the recording function is started in advance and closed with a delay, and the recorded vocal audio is processed, which avoids incomplete voice recording and misalignment between the singing audio and the accompaniment audio, thereby improving the user's experience of singing with the singing application.

FIG. 2 is a flowchart of a method for audio recording according to an embodiment of the present application; the method may be applied to the terminal of a lead-singing user. Referring to FIG. 2, this embodiment includes:

Step 201, receiving a lead-singing audio recording instruction corresponding to a target song.

In implementation, a user may operate the terminal to launch the singing application and log in to his or her own account. The singing application is provided with a lead-singing tab, which the user can click to enter a lead-singing page. After the user enters the lead-singing page, the server can automatically recommend song clips for lead singing; the user can choose a recommended clip or search for a desired clip through a search box provided in the lead-singing page. After the user selects the target song clip, the server sends the accompaniment audio and lyric information of that clip to the terminal. The accompaniment audio has a corresponding marker file, which records three preset play time points of the accompaniment audio: a first, a second and a third preset play time point. The terminal plays or stops playing the accompaniment audio according to the marker file. During lead singing, playback of the accompaniment audio starts at the first preset play time point and stops at the second preset play time point. During follow-up singing, playback starts at the second preset play time point and stops at the third preset play time point. The lead-singing page is further provided with a start-recording option; clicking it triggers the lead-singing audio recording instruction.
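The role of the three preset play time points can be sketched in Python as follows. The application does not specify the format of the marker file, so the class `SegmentMarkers`, its field names, and the mode strings `"lead"` (the first singer) and `"follow"` (the singer who continues) are hypothetical; the values 65 s and 85 s anticipate the worked example in steps 203 and 204, and 105 s is an invented placeholder for the third point.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SegmentMarkers:
    """Hypothetical in-memory form of the accompaniment's marker file.

    All positions are seconds from the start of the accompaniment audio.
    """
    first_play_point: float   # start of the first segment
    second_play_point: float  # end of first segment, start of second segment
    third_play_point: float   # end of the second segment

    def playback_range(self, mode: str) -> Tuple[float, float]:
        """Return (start, stop) playback positions for the given mode."""
        if mode == "lead":
            return (self.first_play_point, self.second_play_point)
        if mode == "follow":
            return (self.second_play_point, self.third_play_point)
        raise ValueError(f"unknown mode: {mode!r}")


# 105.0 is an illustrative value only; the source gives no third time point.
markers = SegmentMarkers(65.0, 85.0, 105.0)
lead_range = markers.playback_range("lead")
follow_range = markers.playback_range("follow")
```

Sharing the second play time point between the two modes is what lets the follow-up recording begin exactly where the lead segment ends.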

Step 202, starting to record the vocal audio of the first segment of the target song when a first preset duration has elapsed after the lead-singing audio recording instruction is received.

In implementation, a first preset duration is set in the singing application. When the time elapsed since the user clicked the start-recording option in the lead-singing page reaches the first preset duration, the terminal starts the recording function and begins recording the vocal audio of the first segment of the target song selected by the user (that is, the lead vocal audio).

Step 203, starting to play the accompaniment audio of the target song from the first preset play time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received.

In implementation, a countdown function is set in the singing application; the countdown duration equals the second preset duration and is greater than the first preset duration. The countdown starts when the user clicks the start-recording option. When the countdown ends, the terminal starts playing the accompaniment audio from its first preset play time point, that is, it starts playing the accompaniment of the first segment of the target song. In addition, the terminal may display countdown information in the lead-singing page, either as a countdown progress bar or as a number of remaining seconds, to remind the user to get ready to sing.

For example, the user selects a song clip on the lead-singing page and clicks the start-recording option. A countdown progress bar is then displayed on the lead-singing page; suppose the countdown duration, that is, the second preset duration, is 3 seconds. When the first preset duration, for example 2.5 seconds, has elapsed after the start-recording option is clicked, the terminal starts the recording function. After a further 0.5 seconds the 3-second countdown is complete and the terminal starts playing the accompaniment audio. If the first preset play time point set in the accompaniment audio is 1 minute 5 seconds, the terminal starts playing the accompaniment audio from that position, and the user can sing the selected song clip following the lyric prompts.
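The timing in this example can be written out as a short Python sketch. The function name `lead_in_events` and the event descriptions are hypothetical; only the durations (2.5 seconds, 3 seconds) and the play position (1 minute 5 seconds) come from the example above. Times are kept in integer milliseconds to avoid floating-point rounding.

```python
from typing import List, Tuple


def lead_in_events(first_preset_ms: int, second_preset_ms: int,
                   play_start_ms: int) -> List[Tuple[int, str]]:
    """List (time after the click, event) pairs for the start of a take."""
    if not 0 <= first_preset_ms < second_preset_ms:
        # The text requires the countdown to be longer than the first duration.
        raise ValueError("first preset duration must be shorter than the second")
    pre_roll_ms = second_preset_ms - first_preset_ms
    return [
        (0, "start-recording option clicked, countdown begins"),
        (first_preset_ms,
         f"recording starts ({pre_roll_ms} ms before playback)"),
        (second_preset_ms,
         f"accompaniment playback starts at position {play_start_ms} ms"),
    ]


events = lead_in_events(2_500, 3_000, 65_000)
for at_ms, what in events:
    print(f"{at_ms:>5} ms  {what}")
```

The 500 ms gap between the second and third events is the head room in which an early first word can still be captured.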

Step 204, ending the recording when a third preset duration has elapsed after the accompaniment audio is played to the second preset play time point, and generating lead-singing audio based on the recorded vocal audio and the accompaniment audio, wherein the first preset play time point and the second preset play time point are respectively the start time point and the end time point of the first segment of the target song.

In implementation, the terminal is further provided with a third preset duration, which may or may not be equal to the first preset duration. The terminal plays the accompaniment audio from the first preset play time point; when playback reaches the second preset play time point, the terminal stops playing the accompaniment audio, that is, it finishes playing the accompaniment of the first segment of the target song. When the time elapsed after playback stops reaches the third preset duration, the terminal ends the recording function and obtains the recorded audio.

Voice detection is then performed on the recorded audio to find the start position and the end position of the voice. For the start position, the detection range is a fixed duration after the start of the recorded audio, where the fixed duration may be the difference between the second preset duration and the first preset duration. When the volume of the audio is detected to exceed a decibel threshold, the user is considered to have started singing; the time point at which the volume exceeds the threshold is taken as the start position of the voice, and the audio before that position is cut off. For the end position, the detection range is the third preset duration before the end of the recorded audio. When the volume of the audio is detected to fall below the decibel threshold, the user is considered to have finished singing; the time point at which the volume falls below the threshold is taken as the end position of the voice, and the audio after that position is cut off.

The trimmed audio is taken as the recorded vocal audio. At this point the user has finished singing the first segment of the target song; the start time point of the first segment is the same as the first preset play time point, and the end time point of the first segment is the same as the second preset play time point. Finally, the start time point of the vocal audio is aligned with the first preset play time point of the accompaniment audio, and the vocal audio and the accompaniment audio are synthesized to generate the lead-singing audio.

For example, the third preset duration is set to 0.5 seconds, and the second preset playing time point is set to 1 minute 25 seconds. When playback reaches 1 minute 25 seconds of the accompaniment audio, the terminal stops playing the accompaniment, and then stops the recording function 0.5 seconds later. According to step 203, the first preset playing time point is 1 minute 5 seconds and the terminal starts recording 0.5 seconds in advance, so the accompaniment audio is played for 20 seconds and the terminal records 21 seconds of audio. Then, the start position and the end position of the voice are detected in the recorded 21 seconds of audio. The start position is detected within the first 0.5 seconds: if the decibel threshold is set to 60 decibels and the volume detected at the 0.2-second position of the recorded audio exceeds 60 decibels, the 0.2-second position is taken as the start position of the voice, and the audio before 0.2 seconds is cut off. The end position is detected within the last 0.5 seconds: if the volume detected at the 20.7-second position of the recorded audio is lower than 60 decibels, the 20.7-second position is taken as the end position of the voice, and the audio after 20.7 seconds is cut off, which finally yields 20.5 seconds of vocal audio. The start time point of the obtained vocal audio is aligned with the 1 minute 5 seconds position of the accompaniment audio, and audio synthesis is then performed to generate the lead-singing audio.
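The threshold-based trimming described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: it assumes the recording has already been reduced to one volume level in decibels per fixed-length frame, and the function name `find_voice_bounds`, the 100 ms frame size, and the fallback behaviour when no threshold crossing is found are all hypothetical.

```python
from typing import List, Tuple


def find_voice_bounds(
    levels_db: List[float],
    frame_ms: int,
    head_window_ms: int,
    tail_window_ms: int,
    threshold_db: float,
) -> Tuple[int, int]:
    """Return (start_frame, end_frame) of the voice; end_frame is exclusive."""
    n = len(levels_db)
    head_frames = min(n, head_window_ms // frame_ms)
    tail_frames = min(n, tail_window_ms // frame_ms)

    # Start: first frame in the head window whose level exceeds the threshold.
    # Assumption: if no frame does, the whole head window is cut off.
    start = head_frames
    for i in range(head_frames):
        if levels_db[i] > threshold_db:
            start = i
            break

    # End: first frame in the tail window whose level is below the threshold.
    # Assumption: if no frame is, the recording is kept up to its last frame.
    end = n
    for i in range(n - tail_frames, n):
        if levels_db[i] < threshold_db:
            end = i
            break

    return start, end


# Worked example from the text: 21 s recording, 0.5 s windows, 60 dB threshold.
frame_ms = 100
levels = [40.0] * 2 + [70.0] * 205 + [40.0] * 3  # 210 frames = 21 s
start, end = find_voice_bounds(levels, frame_ms, 500, 500, 60.0)
trimmed_ms = (end - start) * frame_ms
print(start * frame_ms, end * frame_ms, trimmed_ms)  # prints: 200 20700 20500
```

Restricting each search to a short window at the head or tail of the recording reflects the text: only the extra audio recorded before and after the accompaniment is examined, so pauses in the middle of the segment are never mistaken for the start or end of the voice.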

Fig. 3 is a flowchart of audio recording according to an embodiment of the present application, where the method may be applied to a terminal for a user to record. Referring to fig. 3, this embodiment includes:

Step 301, receiving a chorus audio recording instruction for the lead-singing audio corresponding to a target song, wherein the lead-singing audio comprises the accompaniment audio of the target song and the vocal audio of a first segment of the target song.

In implementation, a user may operate the terminal to launch a singing application and log in to the user's own account. The user may click the chorus tab in the singing application, and the server then sends the lead-singing audio, accompaniment audio, and lyric information corresponding to the target song to the terminal. The terminal then enters a chorus page, in which the lead-singing audio can be played. A start-recording option is also provided in the chorus page, and the user can click it to trigger the chorus audio recording instruction for the lead-singing audio of the target song. After detecting that the chorus audio recording instruction has been triggered, the terminal starts to play the accompaniment audio and prepares to record the vocal audio of the chorus.

Step 302, starting to record the vocal audio of the second segment of the target song when a first preset duration has elapsed after the chorus audio recording instruction is received.

In an implementation, when the elapsed time reaches the first preset time after the user clicks the record starting option in the chorus page, the terminal starts the recording function to start recording the voice audio of the second segment of the target song (i.e., the chorus voice audio).

Step 303, playing the lead-singing audio of the target song from the second preset playing time point when a second preset duration has elapsed after the chorus audio recording instruction is received.

In implementation, a countdown function is provided in the singing application, and the countdown duration is equal to the second preset duration and greater than the first preset duration. After the user clicks the start-recording option, the countdown starts; when the countdown ends, the terminal starts playing the lead-singing audio from its second preset playing time point. In addition, the terminal displays countdown information in the chorus page, for example as a countdown progress bar or countdown text, to remind the user to get ready to sing.

For example, the user selects a piece of lead-singing audio in the chorus function page and clicks the chorus tag corresponding to it to enter the chorus page. The lead-singing audio is played in the chorus page first, and the user can directly click the start-recording option. When the first preset duration, for example 2.5 seconds, has elapsed after the start-recording option is clicked, the terminal starts the recording function. After another 0.5 seconds, the 3-second countdown is completed and, according to the second preset playing time point in step 204, playback starts from the position in the lead-singing audio corresponding to 1 minute 25 seconds of the accompaniment audio.
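The timing relationship between the three preset durations can be made concrete with a small calculation. The sketch below is illustrative only: `RecordingTimeline` and `plan_recording` are hypothetical names, all times are offsets in milliseconds from the moment the recording instruction is received, and the segment end of 1 minute 45 seconds is taken from the example given for step 304.

```python
from dataclasses import dataclass


@dataclass
class RecordingTimeline:
    record_start_ms: int  # recording begins (first preset duration)
    play_start_ms: int    # playback begins (second preset duration)
    play_stop_ms: int     # playback reaches the end of the segment
    record_stop_ms: int   # recording ends (third preset duration later)

    @property
    def recorded_ms(self) -> int:
        return self.record_stop_ms - self.record_start_ms


def plan_recording(
    first_preset_ms: int,
    second_preset_ms: int,
    third_preset_ms: int,
    segment_start_ms: int,
    segment_end_ms: int,
) -> RecordingTimeline:
    """Compute when recording and playback start and stop."""
    # The text requires the countdown (second preset duration) to be
    # greater than the first preset duration.
    if first_preset_ms >= second_preset_ms:
        raise ValueError("recording must start before playback")
    play_stop = second_preset_ms + (segment_end_ms - segment_start_ms)
    return RecordingTimeline(
        record_start_ms=first_preset_ms,
        play_start_ms=second_preset_ms,
        play_stop_ms=play_stop,
        record_stop_ms=play_stop + third_preset_ms,
    )


# Chorus example: record after 2.5 s, play after 3 s, segment 1:25 to 1:45,
# stop recording 0.5 s after playback stops.
timeline = plan_recording(2500, 3000, 500, 85_000, 105_000)
print(timeline.recorded_ms)  # prints: 21000
```

The recorded length is the 0.5 seconds before playback, the 20 seconds of the segment, and the 0.5 seconds after playback, which matches the 21 seconds of audio stated in the examples.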

Step 304, ending recording when a third preset duration has elapsed after the lead-singing audio is played to a third preset playing time point, and generating the chorus audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio, wherein the second preset playing time point is the end time point of the first segment and the start time point of the second segment, and the third preset playing time point is the end time point of the second segment.

In implementation, the terminal is further provided with a third preset duration. When the duration elapsed after the lead-singing audio is played to the third preset playing time point reaches the third preset duration, the terminal ends recording and obtains the recorded audio. Voice detection is then performed on the recorded audio; the detection method is the same as in step 204 and is not repeated here. The recorded vocal audio of the second segment is thus obtained.

In implementation, the duration between the first preset playing time point and the second preset playing time point of the lead-singing audio is determined first, and the target time point in the vocal audio of the first segment that lies this duration after its start time point is then determined. The volume of the audio after the target time point in the vocal audio of the first segment is gradually reduced, giving the processed vocal audio of the first segment. The target time point in the processed vocal audio of the first segment is aligned with the start time point of the vocal audio of the second segment, and the two are synthesized to generate the synthesized vocal audio. Next, the first preset playing time point of the accompaniment audio of the target song is aligned with the start time point of the synthesized vocal audio, and the synthesized vocal audio and the accompaniment audio are synthesized. Finally, the position of the first preset playing time point is found in the synthesized audio, and audio with the duration of the synthesized vocal audio is intercepted from that position to generate the chorus audio of the target song.

For example, the third preset duration is set to 0.5 seconds, and the third preset playing time point is set to 1 minute 45 seconds. When playback reaches the position corresponding to 1 minute 45 seconds of the accompaniment audio, the terminal stops playing the lead-singing audio, and then stops the recording function 0.5 seconds later. According to step 204, the second preset playing time point is 1 minute 25 seconds and the terminal starts recording 0.5 seconds in advance, so the accompaniment audio is played for 20 seconds and the terminal records 21 seconds of audio. Then, the start position and the end position of the voice are detected in the recorded 21 seconds of audio, and the vocal audio of the second segment is obtained, assumed here to be 20.3 seconds long.

According to steps 203 and 204, the duration between the first preset playing time point and the second preset playing time point is 20 seconds, so the target time point is the 20-second position of the vocal audio of the first segment. The volume of the 0.5 seconds of vocal audio after the 20-second position of the first segment is gradually reduced, and the 20-second position of the adjusted vocal audio of the first segment is then aligned with the start time point of the vocal audio of the second segment and synthesized, generating 40.3 seconds of synthesized vocal audio. Then, the 1 minute 5 seconds position of the accompaniment audio is aligned with the start time point of the synthesized vocal audio and the two are synthesized, and the 40.3 seconds of audio starting from the position corresponding to 1 minute 5 seconds of the accompaniment audio is intercepted from the synthesized audio as the chorus audio of the target song.
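The splicing procedure above can be sketched in a few lines. This is a simplified illustration under stated assumptions: audio is modelled as a list of float samples at a toy rate of 10 samples per second, the fade is linear, mixing is plain sample addition, and the helper names `fade_out_tail`, `overlay` and `build_chorus` are hypothetical rather than taken from this application.

```python
from typing import List


def fade_out_tail(samples: List[float], start: int) -> List[float]:
    """Gradually reduce the volume from `start` to silence at the end."""
    out = list(samples)
    tail = len(out) - start
    for i in range(tail):
        out[start + i] *= 1.0 - (i + 1) / tail
    return out


def overlay(base: List[float], other: List[float], offset: int) -> List[float]:
    """Add `other` onto `base` starting at `offset`, extending if needed."""
    out = [0.0] * max(len(base), offset + len(other))
    for i, s in enumerate(base):
        out[i] += s
    for i, s in enumerate(other):
        out[offset + i] += s
    return out


def build_chorus(
    first_vocal: List[float],
    second_vocal: List[float],
    accompaniment: List[float],
    segment_len: int,
    accomp_start: int,
) -> List[float]:
    """Splice two vocal segments and mix them with the accompaniment."""
    # Fade the part of the first vocal that runs past the segment boundary.
    faded = fade_out_tail(first_vocal, segment_len)
    # Align the target time point with the start of the second vocal.
    vocals = overlay(faded, second_vocal, segment_len)
    # Take accompaniment of the same length from the first segment's start.
    backing = accompaniment[accomp_start:accomp_start + len(vocals)]
    return overlay(vocals, backing, 0)


RATE = 10                          # toy sample rate (assumption)
first = [1.0] * 205                # 20.5 s first-segment vocal
second = [0.5] * 203               # 20.3 s second-segment vocal
accomp = [0.0] * 1200              # silent placeholder accompaniment
chorus = build_chorus(first, second, accomp, 20 * RATE, 65 * RATE)
print(len(chorus) / RATE)          # prints: 40.3
```

Fading the 0.5 seconds of the first vocal that extends past the segment boundary avoids an abrupt cut where the second singer comes in, and the result has the 40.3-second length given in the example.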

Fig. 4 is a flowchart of audio recording according to an embodiment of the present application, where the method may be applied to a server, a first terminal, and a second terminal. Referring to fig. 4, this embodiment includes:

Step 401, the first terminal sends a lead-singing request for a target song to the server.

Step 402, the server sends the accompaniment audio and lyric information of the target song to the first terminal according to the lead-singing request.

In implementation, a lead-singing tag is provided in the singing application, and the user can click it to enter a lead-singing page. The server may randomly recommend a song to the user according to the current song ranking list or the user's personal preferences, and send the accompaniment audio and lyric information of that song to the terminal. The singing application is also provided with a search bar, in which the user can search for the song segment to lead; when the user clicks the name of a song segment in the search result list, the server sends the accompaniment audio and lyric information of that song segment to the terminal.

Step 403, the first terminal receives a lead-singing audio recording instruction corresponding to the target song.

Step 404, the first terminal starts recording the vocal audio of the first segment of the target song when a first preset duration has elapsed after the lead-singing audio recording instruction is received.

Step 405, the first terminal starts playing the accompaniment audio of the target song from the first preset playing time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received.

Step 406, the first terminal ends recording when a third preset duration has elapsed after the accompaniment audio is played to the second preset playing time point, and generates the lead-singing audio of the target song based on the recorded vocal audio and the accompaniment audio.

Step 407, the first terminal sends the lead-singing audio of the target song to the server.

Step 408, the second terminal sends a chorus request for the target song to the server.

Step 409, the server sends the lead-singing audio, accompaniment audio and lyric information of the target song to the second terminal.

In implementation, the singing application is also provided with a chorus function page, which lists lead-singing audio published by other accounts. A chorus tag is provided below each piece of lead-singing audio, and the user can click the tag to sing a chorus to that lead-singing audio. After the chorus tag is clicked, the server sends the lead-singing audio, accompaniment audio and lyric information of the corresponding song to the second terminal.

Step 410, the second terminal receives a chorus audio recording instruction for the lead-singing audio corresponding to the target song.

The lead-singing audio comprises the accompaniment audio of the target song and the vocal audio of the first segment of the target song.

Step 411, the second terminal starts recording the vocal audio of the second segment of the target song when the first preset duration has elapsed after the chorus audio recording instruction is received.

Step 412, the second terminal starts playing the lead-singing audio of the target song from the second preset playing time point when the second preset duration has elapsed after the chorus audio recording instruction is received.

Step 413, the second terminal ends recording when the third preset duration has elapsed after the lead-singing audio is played to the third preset playing time point, and generates the chorus audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio.

Step 414, the second terminal sends the chorus audio to the server.
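The exchange between the two terminals and the server in steps 401 to 414 can be outlined with an in-memory stand-in for the server. Everything here is a hypothetical sketch: the class `SongServer`, its method names, the use of byte strings for audio, and the restriction to one piece of lead-singing audio per song are assumptions made for illustration, and the recording itself is replaced by placeholder data.

```python
from typing import Dict, Tuple


class SongServer:
    """In-memory stand-in for the server of Fig. 4 (hypothetical API)."""

    def __init__(self, catalog: Dict[str, Tuple[bytes, str]]) -> None:
        self._catalog = catalog  # song id -> (accompaniment, lyrics)
        self._lead_audio: Dict[str, bytes] = {}
        self._chorus_audio: Dict[str, bytes] = {}

    def request_lead(self, song_id: str) -> Tuple[bytes, str]:
        # Steps 401-402: return accompaniment audio and lyric information.
        return self._catalog[song_id]

    def upload_lead(self, song_id: str, audio: bytes) -> None:
        # Step 407: store the lead-singing audio from the first terminal.
        self._lead_audio[song_id] = audio

    def request_chorus(self, song_id: str) -> Tuple[bytes, bytes, str]:
        # Steps 408-409: return lead-singing audio, accompaniment, lyrics.
        accompaniment, lyrics = self._catalog[song_id]
        return self._lead_audio[song_id], accompaniment, lyrics

    def upload_chorus(self, song_id: str, audio: bytes) -> None:
        # Step 414: store the chorus audio from the second terminal.
        self._chorus_audio[song_id] = audio

    def chorus_for(self, song_id: str) -> bytes:
        return self._chorus_audio[song_id]


server = SongServer({"song-1": (b"ACCOMP", "la la la")})

# First terminal: steps 401-407 (recording replaced by placeholder bytes).
accomp, lyrics = server.request_lead("song-1")
server.upload_lead("song-1", accomp + b"+VOCAL1")

# Second terminal: steps 408-414 (recording replaced by placeholder bytes).
lead, accomp2, lyrics2 = server.request_chorus("song-1")
server.upload_chorus("song-1", lead + b"+VOCAL2")

print(server.chorus_for("song-1"))  # prints: b'ACCOMP+VOCAL1+VOCAL2'
```

The second terminal receives both the lead-singing audio and the separate accompaniment audio, which is what allows it to rebuild the chorus audio from the individual parts instead of simply appending to the lead-singing audio.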

Fig. 5 is a schematic diagram of an apparatus for audio recording according to an embodiment of the present application. The apparatus may be the terminal of the lead-singing user in the foregoing embodiments. As shown in fig. 5, the apparatus includes:

a receiving module 510 configured to receive a lead-singing audio recording instruction corresponding to a target song;

a recording module 520 configured to start recording the vocal audio of a first segment of the target song when a first preset duration has elapsed after the lead-singing audio recording instruction is received;

a playing module 530 configured to play the accompaniment audio of the target song from a first preset playing time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received;

and a synthesizing module 540 configured to end recording when a third preset duration has elapsed after the accompaniment audio is played to a second preset playing time point, and generate the lead-singing audio based on the recorded vocal audio and the accompaniment audio, wherein the first preset playing time point and the second preset playing time point are respectively the start time point and the end time point of the first segment of the target song.

Optionally, the synthesizing module 540 is configured to:

Optionally, the apparatus further comprises a display module configured to:

Optionally, the apparatus further comprises a processing module configured to:

Fig. 6 is a schematic diagram of an apparatus for audio recording according to an embodiment of the present application. The apparatus may be the terminal of the chorus user in the foregoing embodiments. As shown in fig. 6, the apparatus includes:

a receiving module 610 configured to receive a chorus audio recording instruction for the lead-singing audio corresponding to a target song, wherein the lead-singing audio includes the accompaniment audio of the target song and the vocal audio of a first segment of the target song;

a recording module 620 configured to start recording the vocal audio of a second segment of the target song when a first preset duration has elapsed after the chorus audio recording instruction is received;

a playing module 630 configured to play the lead-singing audio of the target song from a second preset playing time point when a second preset duration has elapsed after the chorus audio recording instruction is received;

and a synthesizing module 640 configured to end recording when the lead-singing audio is played to a third preset playing time point, and generate the chorus audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio, wherein the second preset playing time point is the end time point of the first segment and the start time point of the second segment, and the third preset playing time point is the end time point of the second segment.

Optionally, the synthesizing module 640 is configured to:

Optionally, the synthesizing module 640 is further configured to:

Optionally, the apparatus further comprises a display module configured to:

Optionally, the apparatus further comprises a processing module configured to:

The embodiment of the application also provides a system for recording audio, which comprises a first terminal, a second terminal and a server, wherein:

the first terminal is configured to: send a lead-singing request to the server; start recording the vocal audio of a first segment of a target song when a first preset duration has elapsed after a lead-singing audio recording instruction is received; start playing the accompaniment audio of the target song from a first preset playing time point when a second preset duration has elapsed after the lead-singing audio recording instruction is received; end recording when a third preset duration has elapsed after the accompaniment audio is played to a second preset playing time point, and generate the lead-singing audio based on the recorded vocal audio and the accompaniment audio, wherein the first preset playing time point and the second preset playing time point are respectively the start time point and the end time point of the first segment of the target song; and send the lead-singing audio to the server;

the second terminal is configured to: send a chorus request to the server; receive a chorus audio recording instruction for the lead-singing audio corresponding to the target song, wherein the lead-singing audio comprises the accompaniment audio of the target song and the vocal audio of the first segment of the target song; start recording the vocal audio of a second segment of the target song when a first preset duration has elapsed after the chorus audio recording instruction is received; play the lead-singing audio of the target song from the second preset playing time point when a second preset duration has elapsed after the chorus audio recording instruction is received; end recording when the lead-singing audio is played to a third preset playing time point, and generate the chorus audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment and the accompaniment audio, wherein the second preset playing time point is the end time point of the first segment and the start time point of the second segment, and the third preset playing time point is the end time point of the second segment; and send the chorus audio to the server;

the server is configured to: send the accompaniment audio of the target song to the first terminal according to the lead-singing request; and send the lead-singing audio and the accompaniment audio to the second terminal according to the chorus request.

It should be noted that: in the apparatus for performing audio recording provided in the foregoing embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the apparatus for performing audio recording provided in the above embodiment and the method embodiment for performing audio recording belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.

Fig. 7 shows a block diagram of a terminal 700 according to an exemplary embodiment of the present application, which may be the lead-singing terminal or the chorus terminal in the above-described embodiments. The terminal 700 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the terminal 700 includes: a processor 701 and a memory 702.

The processor 701 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 701 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor. The main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 701 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 701 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement the method for audio recording provided by the method embodiments herein.

In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 703 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch display 705, camera 706, audio circuitry 707, positioning component 708, and power supply 709.

A peripheral interface 703 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 701 and memory 702. In some embodiments, the processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.

The radio frequency circuit 704 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 704 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.

The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, it also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 701 as a control signal for processing. In this case, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the terminal 700; in other embodiments, there may be at least two display screens 705, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 700. The display screen 705 may even be arranged in a non-rectangular irregular shape, that is, a shaped screen. The display screen 705 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single color temperature flash or a dual color temperature flash. A dual color temperature flash is a combination of a warm light flash and a cold light flash, and can be used for light compensation under different color temperatures.

The audio circuit 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing, or inputting the electric signals to the radio frequency circuit 704 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 700. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 707 may also include a headphone jack.

The positioning component 708 is used to locate the current geographic location of the terminal 700 for navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

A power supply 709 is used to power the various components in the terminal 700. The power supply 709 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the terminal 700 further includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyroscope sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.

The acceleration sensor 711 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 700. For example, the acceleration sensor 711 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 701 may control the touch display screen 705 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 711. The acceleration sensor 711 may also be used for the acquisition of motion data of a game or a user.

The gyro sensor 712 may detect a body direction and a rotation angle of the terminal 700, and the gyro sensor 712 may collect a 3D motion of the user to the terminal 700 in cooperation with the acceleration sensor 711. The processor 701 may implement the following functions based on the data collected by the gyro sensor 712: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.

The pressure sensor 713 may be disposed at a side frame of the terminal 700 and/or at a lower layer of the touch display screen 705. When the pressure sensor 713 is disposed at a side frame of the terminal 700, a grip signal of the user to the terminal 700 may be detected, and the processor 701 performs left-right hand recognition or quick operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at the lower layer of the touch display screen 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 705. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 714 is used to collect a fingerprint of the user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 714 may be provided on the front, back or side of the terminal 700. When a physical key or vendor Logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical key or vendor Logo.

The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 705 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 705 is turned down. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.

The proximity sensor 716, also referred to as a distance sensor, is typically provided on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front face of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front face of the terminal 700 gradually decreases, the processor 701 controls the touch display screen 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front face of the terminal 700 gradually increases, the processor 701 controls the touch display screen 705 to switch from the screen-off state to the screen-on state.
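By way of non-limiting illustration, the screen-state switching described above may be sketched as a small transition function over two consecutive distance readings. The name `next_screen_state` and the string state labels are assumptions for the example; the present application does not prescribe how "gradually decreases" or "gradually increases" is measured.

```python
def next_screen_state(current: str, prev_distance: float, distance: float) -> str:
    """Return the screen state after a new proximity reading.

    Hypothetical helper: a decreasing distance switches the screen off,
    an increasing distance switches it on, and an unchanged distance
    keeps the current state.
    """
    if current not in ("on", "off"):
        raise ValueError("current must be 'on' or 'off'")
    if distance < prev_distance:
        return "off"   # user is approaching the front face
    if distance > prev_distance:
        return "on"    # user is moving away from the front face
    return current
```

A practical implementation would likely smooth the readings or apply a threshold so that sensor noise does not toggle the screen.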

Those skilled in the art will appreciate that the structure shown in Fig. 7 does not limit the terminal 700, which may include more or fewer components than shown, may combine certain components, or may employ a different arrangement of components.

Fig. 8 is a schematic structural diagram of a server provided in an embodiment of the present application. The server 800 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 801 and one or more memories 802, where the memories 802 store at least one instruction, and the at least one instruction is loaded and executed by the processors 801 to implement the methods provided in the foregoing method embodiments. The server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.

In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory comprising instructions that are executable by a processor in a terminal to perform the method of audio recording in the above embodiments. The computer readable storage medium may be non-transitory. For example, the computer readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
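By way of non-limiting illustration, the timing relationship that such instructions implement for the first terminal may be sketched as follows: recording starts a first preset time length after the recording instruction, accompaniment playback starts a second (longer) preset time length after the instruction from the first preset playing time point, and recording ends once playback reaches a third preset time length after the second preset playing time point. The names `RecordingSchedule` and `plan_recording` and the numeric values in the usage note are assumptions for the example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingSchedule:
    """Times in seconds, measured from receipt of the recording instruction."""
    record_start: float    # vocal recording begins
    playback_start: float  # accompaniment playback begins
    record_end: float      # recording ends
    countdown: float       # countdown shown to the user


def plan_recording(first_delay: float, second_delay: float,
                   third_duration: float,
                   first_play_point: float,
                   second_play_point: float) -> RecordingSchedule:
    """Hypothetical planner for the first terminal's recording timeline."""
    if not 0 <= first_delay < second_delay:
        raise ValueError("second preset time length must exceed the first")
    if second_play_point <= first_play_point or third_duration < 0:
        raise ValueError("invalid segment boundaries or third duration")
    # Playback runs from the first play point to the second play point,
    # plus the third preset time length of extra recording.
    playback_span = (second_play_point - first_play_point) + third_duration
    return RecordingSchedule(
        record_start=first_delay,
        playback_start=second_delay,
        record_end=second_delay + playback_span,
        countdown=second_delay,  # countdown equals the second preset time length
    )
```

For example, with assumed values `plan_recording(1.0, 3.0, 2.0, 30.0, 60.0)`, recording starts 1 second after the instruction, playback starts after the 3-second countdown, and recording ends 35 seconds after the instruction, so the recording begins 2 seconds before playback and continues 2 seconds past the end of the first segment.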

It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims

1. A method of audio recording, the method comprising:

displaying countdown information when receiving a collusion audio recording instruction corresponding to a target song;

when a second preset time length has elapsed after the collusion audio recording instruction is received, starting to play the accompaniment audio of the target song from a first preset playing time point, wherein the countdown time length corresponding to the countdown information is the second preset time length, and the second preset time length is longer than the first preset time length;

2. The method of claim 1, wherein the generating the collusion audio based on the recorded voice audio and the accompaniment audio comprises:

3. The method of claim 1, wherein the generating the collusion audio based on the recorded voice audio and the accompaniment audio comprises:

4. A method of audio recording, the method comprising:

displaying countdown information when receiving a singing audio recording instruction of the collusion audio corresponding to the target song;

when a second preset time length has elapsed after the singing audio recording instruction is received, starting to play the collusion audio of the target song from a second preset playing time point, wherein the countdown time length corresponding to the countdown information is the second preset time length, and the second preset time length is longer than the first preset time length;

5. The method of claim 4, wherein generating the chorus audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment, and the accompaniment audio comprises:

6. The method of claim 5, wherein aligning the target point in time in the first piece of human voice audio with the start point in time of the second piece of human voice audio, synthesizing the first piece of human voice audio with the second piece of human voice audio, generating a synthesized human voice audio, comprises:

7. The method of claim 4, wherein generating the chorus audio of the target song based on the recorded vocal audio of the second segment, the vocal audio of the first segment, and the accompaniment audio comprises:

8. An apparatus for recording audio, comprising:

a display module configured to display countdown information when receiving a collusion audio recording instruction corresponding to a target song;

a playing module configured to start playing the accompaniment audio of the target song from a first preset playing time point when a second preset time length has elapsed after the collusion audio recording instruction is received, wherein the countdown time length corresponding to the countdown information is the second preset time length, and the second preset time length is longer than the first preset time length;

9. An apparatus for recording audio, comprising:

a display module configured to display countdown information when receiving a singing audio recording instruction of the collusion audio corresponding to the target song;

a playing module configured to start playing the collusion audio of the target song from a second preset playing time point when a second preset time length has elapsed after the singing audio recording instruction is received, wherein the countdown time length corresponding to the countdown information is the second preset time length, and the second preset time length is longer than the first preset time length;

10. A system for audio recording, the system comprising a first terminal, a second terminal and a server, wherein:

the first terminal is configured to: send a collusion request to the server; display countdown information when receiving a collusion audio recording instruction corresponding to a target song; start recording the voice audio of a first segment of the target song when a first preset time length has elapsed after the collusion audio recording instruction is received; start playing the accompaniment audio of the target song from a first preset playing time point when a second preset time length has elapsed after the collusion audio recording instruction is received; end recording when the accompaniment audio has been played to a third preset time length after a second preset playing time point, and generate the collusion audio based on the recorded voice audio and the accompaniment audio, wherein the first preset playing time point and the second preset playing time point are respectively the starting time point and the ending time point of the first segment of the target song; and send the collusion audio to the server;

the second terminal is configured to: send a singing request to the server; receive a singing audio recording instruction of the collusion audio corresponding to the target song, wherein the collusion audio comprises the accompaniment audio of the target song and the voice audio of the first segment of the target song; display countdown information when receiving the singing audio recording instruction of the collusion audio corresponding to the target song; start recording the voice audio of a second segment of the target song when the first preset time length has elapsed after the singing audio recording instruction is received; start playing the collusion audio of the target song from the second preset playing time point when the second preset time length has elapsed after the singing audio recording instruction is received; end recording when the collusion audio has been played to the third preset time length after a third preset playing time point, and generate a chorus audio of the target song based on the recorded voice audio of the second segment, the voice audio of the first segment and the accompaniment audio, wherein the second preset playing time point is the ending time point of the first segment and the starting time point of the second segment, and the third preset playing time point is the ending time point of the second segment; and send the chorus audio to the server;

the server is configured to: send the accompaniment audio of the target song to the first terminal according to the collusion request; and send the collusion audio and the accompaniment audio to the second terminal according to the singing request;

the countdown time length corresponding to the countdown information is the second preset time length, and the second preset time length is longer than the first preset time length.

11. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to perform the operations of the method of audio recording according to any one of claims 1 to 7.

12. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to perform the operations of the method of audio recording according to any one of claims 1 to 7.