CN110767204A - Sound processing method, device and storage medium

Sound processing method, device and storage medium

Info

Publication number
CN110767204A
CN110767204A (application CN201810848306.0A)
Authority
CN
China
Prior art keywords
signal
sound signal
sound
accompaniment
time point
Prior art date
Legal status
Granted
Application number
CN201810848306.0A
Other languages
Chinese (zh)
Other versions
CN110767204B (en)
Inventor
赵朋
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201810848306.0A
Publication of CN110767204A
Application granted
Publication of CN110767204B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The application provides a sound processing method, device and storage medium. When it is judged that the sound signal produced by a user is not synchronized with the accompaniment signal in the original audio/video data, the playing time of the sound signal or the accompaniment signal is adjusted according to the time difference between the sound signal and the accompaniment signal, so that the accompaniment signal is synchronized with the sound signal produced by the user. The sound signal produced by the user thus reaches its best state, overcoming the problem of poor sound quality of the user when the voice and the accompaniment are not synchronized.

Description

Sound processing method, device and storage medium
Technical Field
The present application relates to the field of audio and video technologies, and in particular, to a sound processing method and apparatus, and a storage medium.
Background
In daily life, users engage in activities such as giving speeches, singing, playing instruments, and dubbing. To achieve the best result, the user needs to practice and becomes familiar with the background music during practice. Then, in the formal activity, the speech, singing and so on are performed along with the background music, so that the best effect is presented.
During practice or performance, the sound produced by the user is collected through a voice acquisition apparatus such as a headset or a microphone, and the collected sound is then played back in real time through a device such as a smart speaker. In this process, the time from the moment the user produces the sound to the moment the played sound is heard by the user is about 30-50 milliseconds (ms).
During playback, the headset or microphone collects the user's sound, which is played in real time. When the user's voice and the background accompaniment are not synchronized, the quality of the user's sound is poor and the best effect cannot be achieved.
Disclosure of Invention
The application provides a sound processing method, a sound processing apparatus and a storage medium, which are used to overcome the problem of poor sound quality of the user when the voice and the accompaniment are not synchronized.
In a first aspect, an embodiment of the present application provides a sound processing method, which may be applied to an intelligent device and may also be applied to a chip in the intelligent device. The method is described below by taking the application to the intelligent device as an example, and the method comprises the following steps:
judging whether a sound signal sent by a user is synchronous with an accompaniment signal, wherein the accompaniment signal is extracted from original audio and video data;
if the sound signal and the accompaniment signal are not synchronous, the sound signal or the accompaniment signal is processed according to a time difference so that the accompaniment signal is synchronous with the sound signal, and the time difference is the time difference between the sound signal and the accompaniment signal.
With the sound processing method provided by the first aspect, when the sound signal produced by the user is judged to be out of sync with the accompaniment signal in the original audio/video data, the playing time of the sound signal or the accompaniment signal is adjusted according to the time difference between the sound signal and the accompaniment signal, so that the accompaniment signal is synchronized with the sound signal produced by the user. The sound signal produced by the user thus reaches its best state, and the problem of poor sound quality of the user when the voice and the accompaniment are not synchronized is overcome.
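Purely as an illustration, the overall flow of the first aspect can be sketched as follows in Python. All names (delay, synchronize, SAMPLE_RATE) are hypothetical and not taken from the application; the sketch assumes offline mono signals and models "buffer now, play later" simply as prepending silence.

```python
import numpy as np

SAMPLE_RATE = 48_000  # assumed sample rate (samples per second)


def delay(signal: np.ndarray, delay_ms: float) -> np.ndarray:
    """Delay a mono signal by prepending silence ("buffer now, play later")."""
    pad = int(round(SAMPLE_RATE * delay_ms / 1000.0))
    return np.concatenate([np.zeros(pad, dtype=signal.dtype), signal])


def synchronize(sound: np.ndarray, accompaniment: np.ndarray,
                time_difference_ms: float):
    """Return (sound, accompaniment) adjusted so that both start together.

    time_difference_ms > 0 means the user's sound is ahead of the
    accompaniment, so the sound is buffered (delayed); a negative value
    means the sound is behind, so the accompaniment is delayed instead.
    """
    if time_difference_ms > 0:
        sound = delay(sound, time_difference_ms)
    elif time_difference_ms < 0:
        accompaniment = delay(accompaniment, -time_difference_ms)
    return sound, accompaniment
```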
In one possible implementation, the processing the sound signal or the accompaniment signal according to a time difference if the sound signal is not synchronized with the accompaniment signal includes:
if the sound signal is faster than the accompaniment signal, caching the sound signal from a first time point, and playing the cached sound signal at a second time point;
or,
if the sound signal is slower than the accompaniment signal, caching the accompaniment signal from the first time point, and playing the cached accompaniment signal at a second time point;
The first time point is determined according to the lyric file in the original audio/video data, and the second time point is the time point reached when the time difference has elapsed, counting from the first time point.
With the sound processing method provided by this possible implementation, if the sound signal is faster than the accompaniment signal, the sound signal is buffered from the first time point and the buffered sound signal is played at the second time point; or, if the sound signal is slower than the accompaniment signal, the accompaniment signal is buffered from the first time point and the buffered accompaniment signal is played at the second time point. In this process, the sound signal and the accompaniment signal are aligned by processing one of the two signals.
In one possible implementation, if the sound signal is not synchronized with the accompaniment signal, the processing the sound signal or the accompaniment signal according to the time difference includes:
and if the sound signal is slower than the accompaniment signal, deleting the sound signal before the first time point, wherein the first time point is determined according to the lyric file in the original singing audio and video data.
By the sound processing method provided by the possible implementation mode, if the sound signal is slower than the accompaniment signal, the sound signal before the first time point is discarded, and the aim of aligning the sound signal and the accompaniment signal is fulfilled.
In a possible implementation manner, before determining whether the sound signal emitted by the user and the accompaniment signal are synchronized, the method further includes:
extracting characters from the original singing audio and video data according to the first time point, wherein the first time point is determined according to a lyric file in the original singing audio and video data;
judging whether the sound production similarity of the characters corresponding to the sound signals and the extracted characters is larger than a first threshold value or not;
and if the sound production similarity between the characters corresponding to the sound signals and the extracted characters is greater than the first threshold value, determining that the sound signals emitted by the user are the signals corresponding to the extracted characters in the original audio and video data.
With the sound processing method provided by this possible implementation, whether the user's sound signal corresponds to the lyrics at that time point is judged by comparing text similarity at a specific time point, which provides a precondition for the subsequent judgment of whether the sound signal and the accompaniment signal are synchronized.
In a possible implementation manner, before determining whether the sound signal emitted by the user and the accompaniment signal are synchronized, the method further includes:
extracting characters from the original singing audio and video data according to the first time point, wherein the first time point is determined according to a lyric file in the original singing audio and video data;
determining a first acoustic model corresponding to the characters;
determining a second acoustic model corresponding to the sound signal according to the sound signal;
judging whether the similarity of the first acoustic model and the second acoustic model is larger than a second threshold value or not;
and if the similarity between the first acoustic model and the second acoustic model is greater than the second threshold, determining that the acoustic signal sent by the user is a signal corresponding to the extracted characters in the original audio and video data.
With the sound processing method provided by this possible implementation, whether the user's sound signal corresponds to the lyrics at that time point is judged by comparing model similarity at a specific time point, which provides a precondition for the subsequent judgment of whether the sound signal and the accompaniment signal are synchronized. In addition, because the lyric file of a song is very small and the models invoked by the intelligent device are only those in the recognition library for one sentence or one passage of lyrics, the amount of data to be compared is very small, so the comparison is fast and accurate.
In one possible implementation manner, if the sound signal is not synchronized with the accompaniment signal, the method further includes, after processing the sound signal or the accompaniment signal according to a time difference:
extracting an original singing signal from the original singing audio and video data;
and adjusting the sound signal according to the original singing signal so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold value.
With the sound processing method provided by this possible implementation, the parts of the sound signal that differ greatly from the original singing signal are filled in and smoothed, so that the adjusted sound signal played by the intelligent device is close to the original singing signal without being distorted, achieving the purpose of beautifying the sound signal produced by the user in real time.
In one possible implementation manner, if the sound signal is not synchronized with the accompaniment signal, the method further includes, after processing the sound signal or the accompaniment signal according to a time difference:
extracting a first characteristic from the original audio and video data, wherein the first characteristic is used for describing the opening and closing degree and the opening and closing frequency of a target object mouth in the original audio and video data;
acquiring a target sound signal according to the first characteristic, wherein the target sound signal is a sound signal emitted by the target object;
fusing the target sound signal and the sound signal.
With the sound processing method provided by this possible implementation, the target sound signal of the target object is fused with the user's current sound signal, so that the intelligent device plays the fused sound signal, which is closer to the target sound signal without being distorted, achieving the purpose of beautifying the user's sound signal.
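The application does not specify how the fusion is performed; one simple possibility, shown here only as an assumption, is a weighted mix of the time-aligned target sound signal and the user's sound signal (fuse_signals and target_weight are hypothetical names):

```python
import numpy as np


def fuse_signals(target_sound: np.ndarray, user_sound: np.ndarray,
                 target_weight: float = 0.3) -> np.ndarray:
    """Fuse the target object's sound signal with the user's sound signal.

    Both inputs are assumed to be mono float arrays at the same sample rate
    and already time-aligned; target_weight is a free tuning parameter.
    """
    n = min(len(target_sound), len(user_sound))
    fused = (1.0 - target_weight) * user_sound[:n] + target_weight * target_sound[:n]
    # Keep the result in range so playback on the smart device does not clip.
    return np.clip(fused, -1.0, 1.0)
```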
In one possible implementation manner, if the sound signal is not synchronized with the accompaniment signal, the method further includes, after processing the sound signal or the accompaniment signal according to a time difference:
acquiring second characteristics of the user, wherein the second characteristics are used for describing the opening and closing degree and the opening and closing frequency of the mouth of the user;
and according to the second characteristic, adjusting the sound signal so as to enable the adjusted sound signal to be matched with the second characteristic.
With the sound processing method provided by this possible implementation, the sound signal is adjusted according to the second feature of the user and the intelligent device plays the adjusted sound signal, which matches the second feature of the user, achieving the purpose of beautifying the user's sound signal.
In a second aspect, an embodiment of the present application provides a sound processing apparatus, including:
the device comprises a judging unit, a processing unit and a processing unit, wherein the judging unit is used for judging whether a sound signal sent by a user is synchronous with an accompaniment signal, and the accompaniment signal is extracted from original audio and video data;
and the processing unit is used for processing the sound signal or the accompaniment signal according to the time difference when the judgment unit judges that the sound signal is not synchronous with the accompaniment signal, so that the accompaniment signal is synchronous with the sound signal, and the time difference is the time difference between the sound signal and the accompaniment signal.
In a feasible implementation manner, if the determining unit determines that the sound signal is faster than the accompaniment signal, the processing unit is configured to buffer the sound signal from a first time point and play the buffered sound signal at a second time point;
or,
if the judgment unit judges that the sound signal is slower than the accompaniment signal, the processing unit is used for caching the accompaniment signal from the first time point and playing the cached accompaniment signal at a second time point;
The first time point is determined according to the lyric file in the original audio/video data, and the second time point is the time point reached when the time difference has elapsed, counting from the first time point.
In a feasible implementation manner, if the determining unit determines that the sound signal is slower than the accompaniment signal, the processing unit is configured to delete the sound signal before the first time point, and the first time point is determined according to a lyric file in the original singing audio/video data.
In a feasible implementation manner, before the judging unit judges whether the sound signal sent by the user and the accompaniment signal are synchronous, the processing unit is further configured to extract characters from the original singing audio/video data according to the first time point, and the first time point is determined according to a lyric file in the original singing audio/video data;
the judging unit is further configured to judge whether the sound production similarity between the text corresponding to the sound signal and the extracted text is greater than a first threshold;
if the judging unit judges that the sound production similarity between the characters corresponding to the sound signals and the extracted characters is larger than the first threshold value, the processing unit is further used for determining that the sound signals emitted by the user are the signals corresponding to the extracted characters in the original audio and video data.
In a feasible implementation manner, before the judging unit judges whether the sound signal sent by the user and the accompaniment signal are synchronous, the processing unit is further configured to extract characters from the original singing audio/video data according to the first time point, and determine a first acoustic model corresponding to the characters; and determine a second acoustic model corresponding to the sound signal according to the sound signal, wherein the first time point is determined according to a lyric file in the original singing audio/video data;
the judging unit is further configured to judge whether the similarity between the first acoustic model and the second acoustic model is greater than a second threshold;
if the judging unit judges that the similarity between the first acoustic model and the second acoustic model is greater than the second threshold value, the processing unit is further configured to determine that the acoustic signal emitted by the user is a signal corresponding to the extracted characters in the original audio/video data.
In a feasible implementation manner, after the determining unit determines that the sound signal and the accompaniment signal are not synchronous and the sound signal or the accompaniment signal has been processed according to the time difference, the processing unit is further configured to extract an original singing signal from the original singing audio/video data, and adjust the sound signal according to the original singing signal so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold value.
In a feasible implementation manner, after the determining unit determines that the sound signal and the accompaniment signal are not synchronous and the sound signal or the accompaniment signal has been processed according to the time difference, the processing unit is further configured to extract a first feature from the original audio/video data, where the first feature is used to describe the opening and closing degree and the opening and closing frequency of the mouth of a target object in the original audio/video data; acquire a target sound signal according to the first feature, where the target sound signal is a sound signal emitted by the target object; and fuse the target sound signal and the sound signal.
In a feasible implementation manner, after the determining unit determines that the sound signal and the accompaniment signal are not synchronous and the sound signal or the accompaniment signal has been processed according to the time difference, the processing unit is further configured to acquire a second feature of the user, where the second feature is used to describe the opening and closing degree and the opening and closing frequency of the mouth of the user;
and according to the second characteristic, adjusting the sound signal so as to enable the adjusted sound signal to be matched with the second characteristic.
For the beneficial effects of the sound processing apparatus provided by the second aspect and each possible implementation manner of the second aspect, reference may be made to the beneficial effects brought by the first aspect and the corresponding possible implementation manners of the first aspect, which are not repeated herein.
In a third aspect, embodiments of the present application provide a computer program product containing instructions, which when executed on a computer, cause the computer to perform the method of the first aspect or the various possible implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method of the first aspect or the various possible implementations of the first aspect.
When it is judged that the sound signal produced by the user is not synchronized with the accompaniment signal in the original audio/video data, the sound processing method, apparatus and storage medium provided by the embodiments of the application adjust the playing time of the sound signal or the accompaniment signal according to the time difference between the sound signal and the accompaniment signal, thereby synchronizing the accompaniment signal with the sound signal produced by the user, enabling the sound signal produced by the user to reach its best state, and overcoming the problem of poor sound quality of the user when the voice and the accompaniment are not synchronized.
Drawings
Fig. 1 is a schematic diagram of an architecture to which a sound processing method according to an embodiment of the present application is applied;
fig. 2 is a flowchart of a sound processing method according to an embodiment of the present application;
fig. 3 is a flowchart of a sound processing method according to another embodiment of the present application;
fig. 4 is a flowchart of a sound processing method according to another embodiment of the present application;
fig. 5A is a schematic diagram of the unprocessed waveform of a sound signal in a sound processing method according to an embodiment of the present application;
fig. 5B is a schematic diagram of the processed waveform of a sound signal in a sound processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a sound processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a communication device according to another embodiment of the present application.
Detailed Description
Fig. 1 is a schematic diagram of an architecture to which a sound processing method according to an embodiment of the present application is applied. As shown in fig. 1, the devices involved in the present application include a voice acquisition apparatus and an intelligent device. The voice acquisition apparatus collects the sound signal produced by the user and sends it to the intelligent device; the intelligent device synchronizes the sound signal with the accompaniment signal and plays the synchronized sound signal and accompaniment signal. The voice acquisition apparatus includes a microphone, a headset, and the like, and the intelligent device includes a smart speaker and the like. The embodiments of the application are applicable to scenarios such as simulated speeches, singing, instrument playing or dubbing by the user. For example, when the user sings a song, the voice acquisition apparatus collects the user's sound signal and sends it to the intelligent device, and the intelligent device synchronizes the sound signal with the accompaniment signal; for another example, in a simulated speech, the voice acquisition apparatus collects the user's sound signal and sends it to the intelligent device, and the intelligent device synchronizes the sound signal with the accompaniment signal in the original audio/video data.
In the following, based on the architecture shown in fig. 1 and taking a singing scenario as an example, the method of the embodiments of the present application is described in detail through some embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described again in every embodiment.
Fig. 2 is a flowchart of a sound processing method according to an embodiment of the present application. The present embodiment is to explain a sound processing method according to the present embodiment from the perspective of an intelligent device. As shown in fig. 2, the method includes:
101. judging whether a sound signal sent by a user is synchronous with an accompaniment signal or not, and if not, executing 102; if so, 103 is performed.
Wherein the accompaniment signal is extracted from the original audio and video data.
Typically, the original audio/video data includes an original singing signal, an accompaniment signal and a lyric file. The clocks of the original singing signal and the accompaniment signal are synchronized, that is, when the original audio/video data is played, the original singing signal and the accompaniment signal will not run ahead of or behind each other after playing for a period of time; meanwhile, through scrolling or page turning, the lyrics currently displayed remain synchronized with the original singing signal or the accompaniment signal.
In a user's singing scenario, the user often rushes or drags the beat. When the user rushes the beat, the sound signal produced by the user is ahead of the accompaniment signal; when the user drags the beat, the sound signal produced by the user is behind the accompaniment signal. In these cases the sound signal is not synchronized with the accompaniment signal. In the embodiments of the application, the sound signal being out of sync with the accompaniment signal may also be referred to as misalignment between the sound signal and the accompaniment signal.
In general, the lyrics of a song have the following characteristics: first, the text contained in the lyrics is fixed; second, when the original audio/video data is played, the lyrics at any given time point are fixed. In this step, the intelligent device judges, according to these two characteristics, whether the sound signal produced by the user is synchronized with the accompaniment signal in the original audio/video data. For example, after a sound signal is detected at a time point at which the user is likely to start singing, the text corresponding to the sound signal produced by the user is extracted, and at the same time the lyrics at that time point are extracted; the text corresponding to the user's sound signal is compared with the text in the lyrics, and according to the comparison result it is determined whether the sound signal produced by the user is the signal corresponding to the text extracted from the original audio/video data, that is, whether the user is singing the song; if the user is singing the song, it is then further judged whether the sound signal and the accompaniment signal are synchronized. For another example, after a sound signal is detected at a time point at which singing is likely to start, an acoustic model of the text contained in the sound signal is determined, and at the same time an acoustic model of the lyrics corresponding to that time point is determined; by comparing the acoustic models, it is determined whether the sound signal produced by the user is the signal corresponding to the text extracted from the original audio/video data, that is, whether the user is singing the song; if so, whether the sound signal and the accompaniment signal are synchronized is then judged.
102. And if the sound signal is not synchronous with the accompaniment signal, processing the sound signal or the accompaniment signal according to a time difference so as to synchronize the accompaniment signal and the sound signal.
Wherein the time difference is a time difference between the sound signal and the accompaniment signal.
When the sound signal is asynchronous with the accompaniment signal, the intelligent device adjusts the playing time of the sound signal or the accompaniment signal according to the time difference between the sound signal and the accompaniment signal, so that the accompaniment signal is synchronous with the sound signal sent by the user.
In a specific implementation, whether to process the sound signal or the accompaniment signal can be decided according to the time difference, so that the accompaniment signal and the sound signal become synchronized. For example, when the time difference is greater than 10 ms, the sound signal or the accompaniment signal is processed so that the sound signal and the accompaniment signal are synchronized; when the time difference is less than 10 ms, the difference is small enough to be ignored, so neither the sound signal nor the accompaniment signal is processed and the two are regarded as synchronized by default.
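A minimal sketch of this decision, using the 10 ms figure from the example above; the function name and the use of an absolute value are assumptions:

```python
SYNC_TOLERANCE_MS = 10.0  # example threshold from the text; a tuning parameter


def needs_adjustment(time_difference_ms: float) -> bool:
    """Decide whether the sound signal or the accompaniment signal should be
    processed; differences within the tolerance are treated as synchronized."""
    return abs(time_difference_ms) > SYNC_TOLERANCE_MS
```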
103. Playing the sound signal and the accompaniment signal.
With the sound processing method provided by this embodiment of the application, when the sound signal produced by the user is judged to be out of sync with the accompaniment signal in the original audio/video data, the playing time of the sound signal or the accompaniment signal is adjusted according to the time difference between the sound signal and the accompaniment signal, so that the accompaniment signal is synchronized with the sound signal produced by the user, the sound signal produced by the user reaches its best state, and the problem of poor sound quality of the user when the voice and the accompaniment are not synchronized is solved.
In the following, a detailed description is given of how the intelligent device processes the sound signal or the accompaniment signal according to the time difference between them when the sound signal produced by the user and the accompaniment signal are not synchronized. Specifically, referring to fig. 3, fig. 3 is a flowchart of a sound processing method according to another embodiment of the present application. As shown in fig. 3, the present embodiment includes:
201. and determining a first time point according to the lyric file in the original audio and video data.
In the embodiments of the application, the original singing audio/video data is stored in the intelligent device in advance. The original audio/video data includes an original singing signal, an accompaniment signal, a lyric file and the like. The content of the lyric file is the lyrics, which are usually a combination of short sentences and have a segmented structure. Take the lyrics of the song "Chengdu" as an example:
[00:28:56]
[00:29:62] more than your gentle
[00:33:78] how long the rest of the way should be
[00:37:64] you grasp my head
[00:41:63] making I feel awkward
[00:45:47] is struggling free
[00:49:20]
[00:51:32] always in September
[00:55:38] Recall that is a worry of thinking
[00:59:46] Weeping willow of tender green in deep autumn
[01:03:32] kiss my forehead
[01:07:31] in a small city in that rainy day
[01:11:44] I never forget you
[01:15:56] Only you take away
[01:21:76]
From these lyrics, the following can be seen: the lyrics are made up of short sentences and are divided into two sections; moreover, the lyrics corresponding to any given time point are fixed.
In addition, the intelligent device also stores a recognition library, which contains models of the lyrics in the lyric file. Since the amount of lyrics per song is small, each character or each word in the lyrics may be modeled in advance. The models are used to identify whether the sound signal produced by the user corresponds to the lyrics of the song. For example, a model may be a text model, and the intelligent device determines whether the sound signal corresponds to the lyrics of the song by comparing the similarity between the text corresponding to the user's sound signal and the lyric text; for another example, a model may be an acoustic model of a character or word in the lyrics, and the intelligent device identifies whether the sound signal corresponds to the lyrics of the song by comparing the waveform corresponding to the user's sound signal with the waveform of the lyric text.
It should be noted that, when each character or word in the lyrics is modeled in advance, the modeling method may be a pattern-matching Dynamic Time Warping (DTW) method, a Hidden Markov Model (HMM), or another acoustic modeling method; the modeling method is not limited in the embodiments of the present application.
It should be noted that the method has been described above in detail by taking the case in which the intelligent device stores the original audio/video data and the lyric models locally as an example. However, the embodiments of the present application are not limited to this: in other feasible implementations, the original singing audio/video data and the lyric models may also be stored on a server, and when the user sings a song, the intelligent device obtains the original singing audio/video data and the lyric models from the server according to the name of the song, and so on.
In this step, the intelligent device determines, according to the lyric file, the time points at which singing may start, such as 29 seconds 62 milliseconds, 33 seconds 78 milliseconds, 37 seconds 64 milliseconds, and so on, and takes these time points as first time points. In this case, the first time point can be understood as: ideally, the time point at which the user is expected to produce a sound signal. Alternatively, the intelligent device determines, according to the lyric file, the time points at which singing may start, such as 29 seconds 62 milliseconds, 33 seconds 78 milliseconds, 37 seconds 64 milliseconds, and so on, and then takes as the first time point a time point that is a preset duration earlier than the time point at which singing may start. For example, if the time point at which singing may start is 29 seconds 62 milliseconds and the preset duration is 5 milliseconds, the first time point is 29 seconds 57 milliseconds; as another example, if the time point at which singing may start is 37 seconds 64 milliseconds and the preset duration is 5 milliseconds, the first time point is 37 seconds 59 milliseconds. In this case, the first time point can be understood as: ideally, the time point that is a preset duration earlier than the time point at which the user is expected to produce a sound signal.
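As an illustration only, assuming the lyric file uses timestamps in the [mm:ss:xx] form quoted above and reading the last field as milliseconds (as the description does), the first time points could be derived roughly as follows; all names are hypothetical:

```python
import re

# Timestamps such as [00:29:62] are read here as minutes:seconds:milliseconds,
# following the way the description reads them (29 seconds 62 milliseconds).
_TS = re.compile(r"\[(\d+):(\d+):(\d+)\]")


def first_time_points(lyric_file_text: str, preset_offset_ms: int = 0):
    """Return candidate first time points (in ms) from an LRC-style lyric file.

    Lines carrying no lyric text (pure timing marks) are still returned, since
    they also mark positions where singing may start.  When preset_offset_ms > 0,
    each point is moved earlier by that amount, as in the second interpretation
    of the first time point.
    """
    points = []
    for line in lyric_file_text.splitlines():
        m = _TS.match(line.strip())
        if not m:
            continue
        minutes, seconds, millis = (int(g) for g in m.groups())
        t = (minutes * 60 + seconds) * 1000 + millis
        points.append(max(0, t - preset_offset_ms))
    return points


# Example with the lyric excerpt quoted above:
# first_time_points("[00:29:62] more than your gentle", preset_offset_ms=5)
# -> [29057]   (29 seconds 57 milliseconds)
```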
202. Judging whether a sound signal is detected or not from the first time point, if the sound signal is detected from the first time point, executing 203; if no sound signal is detected from the first point in time, the process returns to 201.
Starting from the first time point, the intelligent device judges whether a sound signal sent by a voice acquisition apparatus such as a microphone is received. Continuing with the song "Chengdu" above: for example, when the first time point is a time point at which singing may start, the device starts detecting whether a sound signal is received at 29 seconds 62 milliseconds, 33 seconds 78 milliseconds, 37 seconds 64 milliseconds, and so on; for another example, if the first time point is earlier than the time point at which singing may start, for example 20 ms earlier, the device starts detecting whether a sound signal is received at 29 seconds 42 milliseconds, 33 seconds 58 milliseconds, 37 seconds 44 milliseconds, and so on.
In the process of detecting the sound signal, a detection duration may be predefined. The detection duration is the period of time, starting from the first time point, during which the sound signal is continuously detected. In general, for two adjacent lyric lines, the detection for the previous line cannot continue past the time point at which the next line starts; that is, the detection for the previous line must end before the next line starts. For example, if the previous lyric line starts at 29 seconds 57 milliseconds and the next lyric line starts at 37 seconds 59 milliseconds, the interval between them is 8 seconds 2 milliseconds. Assuming the first time point is 29 seconds 57 milliseconds, the detection duration is at most 8 seconds 2 milliseconds. In a specific implementation, the detection duration can be chosen according to the characteristics of the specific song.
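Following the example above, the upper bound on the detection duration is simply the gap to the next first time point; a minimal sketch with hypothetical names:

```python
def max_detection_durations(first_time_points_ms):
    """Longest time, for each first time point, that the device may keep
    detecting before the next lyric line starts (no bound for the last one)."""
    return [nxt - cur for cur, nxt in
            zip(first_time_points_ms, first_time_points_ms[1:])]


# With first time points 29 s 57 ms and 37 s 59 ms (preset offset 5 ms):
# max_detection_durations([29057, 37059]) -> [8002]   i.e. 8 seconds 2 milliseconds
```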
There are a plurality of possible singing time points, i.e. more than one first time point. However, the plurality of first time points are sequential, so that when the smart device does not detect the sound signal at a specific first time point, the execution returns to 201, determines the next first time point, and continuously determines whether the sound signal is detected.
203. Extracting text from the original singing audio/video data according to the first time point, and extracting the lyric model from the recognition library.
In this step, the intelligent device extracts text from the lyric file of the original audio/video data. For example, when the first time point is a time point at which singing may start, the lyric text "more than your gentle" is extracted starting at 29 seconds 62 milliseconds; for another example, when the first time point is earlier than the time point at which singing may start, detection of the sound signal starts at 29 seconds 57 milliseconds, and the lyric text "more than your gentle" is then extracted starting at 29 seconds 62 milliseconds.
In this step, the lyric text is extracted at the time points at which singing may start, and at the same time the lyric model corresponding to those time points is also extracted from the recognition library.
204. Judging whether the sound production similarity of the characters corresponding to the sound signals and the characters extracted according to the first time point is larger than a first threshold, if so, executing 205; if the similarity between the text corresponding to the sound signal and the text extracted according to the first time point is not greater than the first threshold, then 209 is executed.
In one feasible implementation, the intelligent device judges whether the sound signal produced by the user contains the lyrics, that is, whether the user is singing the lyrics, by comparing text similarity. In this case, the intelligent device extracts text from the original audio/video data according to the first time point, and then judges whether the sounding similarity between the text corresponding to the sound signal and the extracted text is greater than a first threshold. If the sounding similarity between the text corresponding to the sound signal and the extracted text is greater than the first threshold, it is determined that the sound signal produced by the user is the signal corresponding to the text extracted from the original audio/video data, that is, the user is singing the song. If the sounding similarity is not greater than the first threshold, it is determined that the user's sound signal has nothing to do with the lyrics of the song, that is, the user is not singing the song at this moment. Taking the first time point of 29 seconds 62 milliseconds as an example, the text extracted from the original audio/video data by the intelligent device is each character or word in "more than your gentle". The intelligent device also extracts text from the sound signal; for example, if the first character extracted from the sound signal is "not", "step" or "cloth" (Chinese characters with the same or similar pronunciation, such as 不, 步 and 布), its pronunciation is the same as or similar to the pronunciation of the character "not" (不) in "more than your gentle", and the similarity is greater than the first threshold, so the first character of the user's sound signal can be regarded as the lyrics of the song at that time point. The first threshold may be set according to requirements, for example, 0.7.
In this process, whether the user's sound signal corresponds to the lyrics at that time point is judged by comparing text similarity at a specific time point, which provides a precondition for the subsequent judgment of whether the sound signal and the accompaniment signal are synchronized.
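The application only requires that the sounding similarity exceed a first threshold; one way to realize this, sketched here under the assumption that text is first mapped to a pronunciation (for example pinyin) and then compared, is shown below. The pronunciation table and all function names are hypothetical stand-ins:

```python
from difflib import SequenceMatcher

FIRST_THRESHOLD = 0.7  # example value given in the description

# Hypothetical pronunciation lookup; a real system would use a full
# grapheme-to-phoneme (e.g. pinyin) conversion instead of this tiny stub.
_PRONUNCIATION = {"不": "bu4", "步": "bu4", "布": "bu4"}


def pronunciation(text: str) -> str:
    return " ".join(_PRONUNCIATION.get(ch, ch) for ch in text)


def sound_matches_lyric(recognized_text: str, lyric_text: str,
                        threshold: float = FIRST_THRESHOLD) -> bool:
    """True when the sounding similarity between the text recognized from the
    user's sound signal and the lyric text extracted at the first time point
    is greater than the first threshold."""
    similarity = SequenceMatcher(None, pronunciation(recognized_text),
                                 pronunciation(lyric_text)).ratio()
    return similarity > threshold
```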
In one feasible implementation, the intelligent device judges whether the sound signal produced by the user contains the lyrics by means of model comparison. In this case, the intelligent device extracts text from the original audio/video data according to the first time point, and then determines a first acoustic model corresponding to the text; determines a second acoustic model corresponding to the sound signal according to the sound signal; and judges whether the similarity between the first acoustic model and the second acoustic model is greater than a second threshold. If the similarity between the first acoustic model and the second acoustic model is greater than the second threshold, it is determined that the sound signal produced by the user is the signal corresponding to the text extracted from the original audio/video data, that is, the user is singing the song; if the similarity is not greater than the second threshold, it is determined that the user's sound signal has nothing to do with the lyrics of the song, that is, the user is not singing the song at this moment.
Specifically, as described in step 201, the intelligent device stores a recognition library, which contains models of the lyrics in the lyric file, such as acoustic models. When a sound signal produced by the user is detected at the first time point, in order to judge whether that sound signal corresponds to the lyrics of the song, the intelligent device determines from the recognition library, according to the first time point, the first acoustic models of the text corresponding to that time point; there may be several first acoustic models. At the same time, the intelligent device processes the sound signal produced by the user to obtain a second acoustic model, judges the similarity between the two kinds of acoustic models, and determines, according to the result, whether the sound signal produced by the user is the signal corresponding to the text extracted from the original audio/video data, that is, whether the user is singing the song. Taking the first time point of 29 seconds 62 milliseconds as an example, the text extracted from the original audio/video data by the intelligent device is each character or word in "more than your gentle", and the acoustic model of each extracted character or word is taken from the recognition library (for ease of distinction, the acoustic model of the text extracted from the original audio/video data is called the first acoustic model, and the acoustic model obtained from the sound signal produced by the user is called the second acoustic model). For example, there may be several first acoustic models for the character "not": a first acoustic model obtained from a male voice, one obtained from a female voice, one obtained from a child's voice, one obtained from a southern accent, one obtained from Mandarin, one obtained from Cantonese, and so on. Meanwhile, the intelligent device processes the sound signal to obtain the second acoustic model and compares it with the several first acoustic models; as long as the similarity between the second acoustic model and any one of the first acoustic models is greater than the second threshold, the sound signal produced by the user is determined to be the signal corresponding to the text extracted from the original audio/video data, that is, the user is singing the song.
In this process, whether the user's sound signal corresponds to the lyrics at that time point is judged by comparing model similarity at a specific time point, which provides a precondition for the subsequent judgment of whether the sound signal and the accompaniment signal are synchronized. In addition, because the lyric file of a song is very small and the models invoked by the intelligent device are only those in the recognition library for one sentence or one passage of lyrics, the amount of data to be compared is very small, so the comparison is fast and accurate.
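Step 201 mentions DTW-based modelling; purely as a sketch, the comparison against the several first acoustic models could look like the following, where the acoustic models are assumed to be stored as feature sequences and the similarity measure is a hypothetical choice:

```python
import numpy as np


def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Dynamic time warping distance between two feature sequences
    (frames x dimensions); a smaller distance means a higher similarity."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(seq_a[i - 1]) - np.asarray(seq_b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def matches_any_first_model(user_features, first_models, second_threshold: float) -> bool:
    """The user is taken to be singing the lyric as soon as the similarity to
    any one of the stored first acoustic models exceeds the second threshold.
    Expressing similarity as 1 / (1 + DTW distance) is an assumption."""
    return any(1.0 / (1.0 + dtw_distance(user_features, model)) > second_threshold
               for model in first_models)
```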
205. Judging whether the sound signal sent by the user is synchronous with the accompaniment signal, if the sound signal is faster than the accompaniment signal, executing 206; if the sound signal is slower than the accompaniment signal, execute 207; if the sound signal is synchronized with the accompaniment signal, 208 is executed.
The specific determination process may be performed in step 101, which is not described herein again.
206. And buffering the sound signal from the first time point, and playing the buffered sound signal at the second time point.
In this step, the intelligent device stores the sound signal in a buffer starting from the first time point. The sound signal produced by the user is thus delayed for a period of time from the first time point, and the length of this period is the time difference between the sound signal and the accompaniment signal. When this time difference has elapsed and the second time point is reached, the sound signal in the buffer is played. Taking the first time point of 29 seconds 50 milliseconds as an example, the intelligent device starts detecting the sound signal at 29 seconds 50 milliseconds. The sound signal should be detected at 29 seconds 62 milliseconds; however, because the user rushes the beat, the user actually produces the sound signal at 29 seconds 50 milliseconds, and the produced sound signal is the sound of the character "not" in "more than your gentle". The sound signal is therefore 12 milliseconds ahead of the accompaniment signal, whereas the accompaniment signal needs to be synchronized with the sound signal produced by the user. In this case, the intelligent device stores the sound signal in the buffer starting from 29 seconds 50 milliseconds, and then plays the buffered sound signal at the second time point, that is, 29 seconds 62 milliseconds.
It should be noted that, because of the segmented structure of the lyrics, there is a pause after each lyric line is sung, during which only the accompaniment signal is present. Therefore, by default, the sound signal remaining in the buffer is played out after each lyric line ends. In other words, not all of the sound signal is delayed for a period of time.
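A minimal sketch of this buffering step for a streaming capture, using hypothetical names and an assumed frame length; the real time difference (12 ms in the example) is rounded to whole frames here:

```python
from collections import deque

import numpy as np

SAMPLE_RATE = 48_000          # assumed sample rate
FRAME_MS = 10                 # assumed frame length of the capture device
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000


class VocalDelayBuffer:
    """Holds captured vocal frames from the first time point and releases them
    time_difference_ms later (the second time point).  flush() models the note
    above: at the end of a lyric line, the remaining buffered sound is played."""

    def __init__(self, time_difference_ms: float):
        self.delay_frames = int(round(time_difference_ms / FRAME_MS))
        self.buffer = deque()

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Store a captured frame; return the frame to play now (silence while
        the delay has not yet elapsed)."""
        self.buffer.append(frame)
        if len(self.buffer) <= self.delay_frames:
            return np.zeros(FRAME_SAMPLES, dtype=frame.dtype)
        return self.buffer.popleft()

    def flush(self):
        """Play out whatever is still buffered once the lyric line has ended."""
        frames = list(self.buffer)
        self.buffer.clear()
        return frames
```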
207. And starting to buffer the accompaniment signals at the first time point, and playing the buffered accompaniment signals at the second time point.
In this step, the intelligent device stores the accompaniment signal in a buffer starting from the first time point. In this case, the accompaniment signal is delayed from the first time point by the time difference between the sound signal and the accompaniment signal. When this time difference has elapsed and the second time point is reached, the accompaniment signal in the buffer is played. Taking the first time point of 29 seconds 62 milliseconds as an example, the first character of the sound signal produced by the user should be "not" in "more than your gentle"; however, because the user drags the beat, the user actually produces the sound of the character "not" at 29 seconds 75 milliseconds, so the sound signal is 13 milliseconds behind the accompaniment signal. In this case, the intelligent device stores the accompaniment signal in the buffer starting from 29 seconds 62 milliseconds, and then plays the buffered accompaniment signal at the second time point, that is, 29 seconds 75 milliseconds.
It should be noted that the lyrics have a segmented structure: after each lyric line is sung there is a pause, during which only the accompaniment signal is present. The accompaniment signal is therefore a continuous signal with no interruption; that is, the accompaniment signal remains in the buffer until the whole song has been played.
In addition, to avoid delaying the entire accompaniment signal, in this step the accompaniment signal may not be buffered; instead, the delayed part of the user's sound signal may be discarded directly, that is, if the sound signal is slower than the accompaniment signal, the sound signal before the first time point is deleted. Taking the first time point of 29 seconds 62 milliseconds as an example, the first character of the sound signal produced by the user should be "not" in "more than your gentle"; however, because the user drags the beat, the user actually produces the sound of the character "not" at 29 seconds 75 milliseconds, so the sound signal is 13 milliseconds behind the accompaniment signal. In this case, at 29 seconds 75 milliseconds the intelligent device discards the sound signal between 29 seconds 62 milliseconds and 29 seconds 75 milliseconds, that is, it discards the first 13 milliseconds of the current sound signal in time order, and then plays the remaining sound signal. In this process, the sound signal that lags behind the beat is simply deleted, so that the accompaniment signal is synchronized with the sound signal produced by the user, the user's sound reaches its best state, and the problem of poor sound quality of the user when the voice and the accompaniment are not synchronized is overcome.
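The discarding variant can be sketched in a single operation over the captured samples; the function name and sample rate are assumptions, not taken from the application:

```python
import numpy as np

SAMPLE_RATE = 48_000  # assumed sample rate


def drop_late_vocal_head(sound: np.ndarray, time_difference_ms: float) -> np.ndarray:
    """Alternative to buffering the accompaniment: when the vocal lags by
    time_difference_ms (13 ms in the example above), discard that first part of
    the captured sound signal and play only the remainder."""
    drop = int(round(SAMPLE_RATE * time_difference_ms / 1000.0))
    return sound[drop:]
```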
In addition, when the sound signal is slower than the accompaniment signal, the sound signal and the accompaniment signal may be directly played without processing the sound signal or the accompaniment signal.
208. Playing the sound signal and the accompaniment signal.
The sound signal and the accompaniment signal are synchronous, which indicates that the sound signal and the accompaniment signal are aligned, therefore, the sound signal or the accompaniment signal is not processed, but is directly played.
209. And playing the sound signal.
If the similarity between the text corresponding to the sound signal and the extracted text is not greater than the first threshold, the sound signal produced by the user is not the lyrics. For example, while singing a song with the original vocal turned off, the user may produce a sound signal unrelated to the lyrics, such as brief remarks to liven up the atmosphere. In this case, the sound signal is played directly.
In the above embodiment, if the sound signal is faster than the accompaniment signal, the sound signal is buffered from the first time point, and the buffered sound signal is played at the second time point; or, if the sound signal is slower than the accompaniment signal, the accompaniment signal is buffered from the first time point, and the buffered accompaniment signal is played at the second time point. In the process, the aim of aligning the sound signal and the accompaniment signal is achieved by processing the sound signal or the accompaniment signal.
In the following, how to further process the sound signal after the smart device synchronizes the sound signal and the accompaniment signal in the above embodiment will be described in detail.
In a feasible implementation manner, if the sound signal is not synchronous with the accompaniment signal, the smart device further extracts the original singing signal from the original singing audio/video data after processing the sound signal or the accompaniment signal according to the time difference, and adjusts the sound signal according to the original singing signal, so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold. Specifically, referring to fig. 4, fig. 4 is a flowchart of a sound processing method according to another embodiment of the present application. As shown in fig. 4, the present embodiment includes:
301. and determining a first time point according to the lyric file in the original audio and video data.
In particular, refer to step 201 of the embodiment of fig. 3.
302. Judging whether a sound signal is detected or not from the first time point, if the sound signal is detected from the first time point, executing 303; if no sound signal is detected from the first point in time, return is made to 301.
In particular, see step 202 of the embodiment of fig. 3.
303. And extracting characters from the original singing audio and video data according to the first time point.
In particular, see step 203 of the embodiment of fig. 3.
304. Judging whether the text corresponding to the sound signal matches the text extracted at the first time point; if so, execute 305; if not, execute 307.
In particular, reference may be made to step 204 of the embodiment of fig. 3 described above.
305. Synchronizing the sound signal and the accompaniment signal.
In particular, see steps 205, 206, 207 above.
306. And adjusting the sound signal according to the original singing signal so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold value.
307. And playing the sound signal.
At this time, the accompaniment signal may be played or not played according to the demand.
The embodiment of the application is suitable for scenes in which the sound signal needs to be beautified in real time. For example, a northerner sings a Cantonese song, and the current sound signal is beautified by using the original singing signal and the historical sound signal, so that the user's current sound signal is closer to the original singing but not distorted. For another example, a Chinese speaker sings an English or Japanese song. As another example, a scene in which the cry of an animal is imitated.
In a feasible implementation manner, if the sound signal and the accompaniment signal are not synchronous, then after processing the sound signal or the accompaniment signal according to the time difference, the smart device further extracts the original singing signal from the original singing audio/video data and adjusts the sound signal according to the original singing signal, so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds the third threshold. For example, in the adjustment process, the user's sound signal and the original singing signal are each sampled at the same time point to obtain a sampling value of the sound signal and a sampling value of the original singing signal; the two sampling values are compared, and if the difference between them is large, the average of the two sampling values is taken and used to replace the sampling value of the sound signal, thereby smoothing the sound signal. In a specific implementation, different processing methods may be used according to actual requirements, and the embodiment of the present application is not limited in this respect.
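A minimal sketch of the sampling-and-averaging example above, assuming both signals are available as NumPy arrays sampled at the same time points; the deviation threshold is an assumed placeholder rather than a value given in the application.

```python
import numpy as np

def smooth_towards_original(user, original, diff_threshold=0.2):
    """Replace user samples that deviate strongly from the original-singing
    samples taken at the same time points by the average of the two values."""
    user = np.asarray(user, dtype=float)
    original = np.asarray(original, dtype=float)
    out = user.copy()
    large_diff = np.abs(user - original) > diff_threshold
    out[large_diff] = (user[large_diff] + original[large_diff]) / 2.0
    return out
```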
Fig. 5A is a schematic diagram of an unprocessed waveform of a sound signal in a sound processing method according to an embodiment of the present application, and fig. 5B is a schematic diagram of a processed waveform of a sound signal in a sound processing method according to an embodiment of the present application. As shown in fig. 5A, the sound signal and the accompaniment signal are aligned, but the sound signal differs greatly from the original singing signal, so the waveform of the sound signal needs to be adjusted according to the waveform of the original singing signal. Specifically, the adjustment may include: first, processing the treble (wave crests) and bass (wave troughs) by using energy values; second, smoothing the sound; third, when the sound signal is insufficient, supplementing the current sound signal with the historical data and the original singing signal. In the adjustment process, the original singing signal is extracted from the original singing audio and video data and treated as a template (comparable to the negative of a photograph), and the sound signal is beautified against this template. Specifically, the waveform of the original singing signal is compared with the waveform of the sound signal, and the waveform of the sound signal is filled and smoothed according to the waveform of the original singing signal.
The historical data is the recorded sound signal of the user singing the same song on previous occasions. Each time the user sings the song, the current sound signal is scored; after the user has sung the song multiple times, multiple scores are obtained and the sound signal with the highest score is stored. When the user sings the same song again, the track of the stored sound signal, such as its waveform, is used to calibrate the current sound signal in real time, thereby beautifying it. In this way, even users who do not sing well can achieve a better effect, which improves the entertainment value. In addition, the song may be shared through social tools after each performance.
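The following sketch illustrates how such historical data could be maintained, assuming each take is scored against the original singing waveform; the normalized correlation score is an assumption made for the sketch, since the application does not fix a particular scoring rule.

```python
import numpy as np

def score_take(user, original):
    """Assumed scoring rule: normalized correlation between the user's take
    and the original singing waveform (higher means closer to the original)."""
    user = np.asarray(user, dtype=float)
    original = np.asarray(original, dtype=float)
    n = min(len(user), len(original))
    u, o = user[:n], original[:n]
    denom = np.linalg.norm(u) * np.linalg.norm(o)
    return float(np.dot(u, o) / denom) if denom else 0.0

class TakeHistory:
    """Keeps only the highest-scoring take of a song as the historical data
    used to calibrate later performances of the same song."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_take = None

    def update(self, user, original):
        score = score_take(user, original)
        if score > self.best_score:
            self.best_score = score
            self.best_take = np.asarray(user, dtype=float).copy()
        return score
```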
The waveform of the adjusted sound signal is shown in fig. 5B. As can be seen from fig. 5A and fig. 5B, the waveform of the adjusted sound signal lies between the waveform of the original singing signal and the waveform of the unadjusted sound signal. Therefore, although the waveform of the sound signal actually emitted by the user is as shown in fig. 5A, the waveform of the sound signal played by the smart device is as shown in fig. 5B. In this process, the parts of the sound signal that differ greatly from the original singing signal are filled and smoothed, so that the adjusted sound signal played by the smart device is close to the original singing signal but not distorted, achieving the purpose of beautifying the sound signal emitted by the user in real time.
In a feasible implementation manner, if the sound signal and the accompaniment signal are not synchronous, then after processing the sound signal or the accompaniment signal according to the time difference, the smart device further extracts a first feature from the original audio and video data, where the first feature describes the opening and closing degree and the opening and closing frequency of the mouth of a target object in the original audio and video data; acquires a target sound signal according to the first feature, the target sound signal being a sound signal emitted by the target object; and fuses the target sound signal with the sound signal.
Specifically, when the original audio and video is played, a picture is displayed on the display screen, and the picture contains a target object, for example an animal or a person. After the smart device has aligned the accompaniment signal and the sound signal, it extracts, from the on-screen picture, the first feature describing the opening degree and the opening and closing frequency of the mouth of the target object. The mouth opening degree represents how widely the mouth of the target object opens, and the opening and closing frequency represents how quickly it opens and closes. The smart device then obtains the target sound signal, that is, the sound signal emitted by the target object, from a local database or a remote database according to the first feature. Finally, the target sound signal is fused with the user's current sound signal, so that the user's sound signal becomes closer to the target sound signal.
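As one possible reading of the fusion step, the sketch below mixes the user's sound signal with the target sound signal using a fixed weight; the weight value and the simple linear mix are assumptions for illustration, not a fusion method prescribed by the application.

```python
import numpy as np

def fuse_signals(user, target, target_weight=0.4):
    """Weighted mix of the user's sound signal and the target sound signal,
    moving the result towards the target while keeping the user's own voice."""
    user = np.asarray(user, dtype=float)
    target = np.asarray(target, dtype=float)
    n = min(len(user), len(target))  # align lengths before mixing
    return (1.0 - target_weight) * user[:n] + target_weight * target[:n]
```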
For example, when an original audio/video is played, the picture displayed on the display screen contains a tiger, and the tiger roars. A child imitates the roar of the tiger; that is, the current sound signal of the user is the child imitating the tiger's roar. The smart device extracts from the picture a first feature representing the opening degree and the opening and closing frequency of the tiger's mouth, and acquires a target sound signal, namely the roar of a real tiger stored on the server, based on the first feature. The target sound signal is then fused with the user's current sound signal, so that the imitated sound is closer to the roar of a real tiger while the child's own voice is retained; that is, the synthesized sound signal is not distorted.
For another example, when the original singing audio and video is played, the picture displayed on the display screen contains the original singer, and the user imitates the original singer. The smart device extracts from the picture a first feature representing the opening degree and the opening and closing frequency of the original singer's mouth, and obtains a target sound signal, namely the original singing signal, based on the first feature. The target sound signal is then fused with the user's current sound signal, so that the fused sound signal is closer to the original singing signal while the user's own voice is retained; that is, the synthesized sound signal is not distorted.
In the above embodiment, the target sound signal of the target object is fused with the user's current sound signal, so that the smart device plays the fused sound signal, which is closer to the target sound signal but not distorted, thereby achieving the purpose of beautifying the user's sound signal.
In a feasible implementation manner, if the sound signal and the accompaniment signal are not synchronous, then after processing the sound signal or the accompaniment signal according to the time difference, the smart device further acquires a second feature of the user, where the second feature describes the opening and closing degree and the opening and closing frequency of the user's mouth, and adjusts the sound signal according to the second feature so that the adjusted sound signal matches the second feature.
Specifically, a camera is integrated in the smart device. When the user sings a song, the smart device photographs the user through the camera and extracts, from the captured picture or video, a second feature representing the opening degree and the opening and closing frequency of the user's mouth. The sound signal is then adjusted according to the second feature so that the adjusted sound signal matches the second feature. For example, if the user's mouth is opened to its maximum extent but the intensity of the user's sound signal is relatively low, the treble of the sound signal is raised.
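A rough sketch of the adjustment in the example above, assuming the second feature has been reduced to a mouth-openness value between 0 and 1; the openness and loudness thresholds and the gain factor are illustrative assumptions only.

```python
import numpy as np

def adjust_by_mouth_feature(signal, mouth_openness, gain=1.5):
    """If the mouth is opened wide but the captured signal is weak, boost it.
    A treble-specific boost would apply a high-shelf filter instead of a
    plain gain; a plain gain keeps the sketch short."""
    signal = np.asarray(signal, dtype=float)
    rms = np.sqrt(np.mean(signal ** 2)) if signal.size else 0.0
    if mouth_openness > 0.8 and rms < 0.1:
        return signal * gain
    return signal
```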
In the above embodiment, the smart device adjusts the sound signal according to the second feature of the user and plays the adjusted sound signal, which matches the second feature of the user, thereby achieving the purpose of beautifying the user's sound signal.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 6 is a schematic structural diagram of a sound processing apparatus according to an embodiment of the present application. The sound processing apparatus of this embodiment may be the above-described smart device, or may be a chip applied to a smart device, for example a smart speaker. The sound processing apparatus can be used to perform the actions of the smart device in the above method embodiments. As shown in fig. 6, the smart device 10 may include a judging unit 11 and a processing unit 12, described as follows:
a judging unit 11, configured to judge whether a sound signal generated by a user is synchronous with an accompaniment signal, where the accompaniment signal is extracted from original audio/video data;
a processing unit 12, configured to process the sound signal or the accompaniment signal according to a time difference when the judging unit 11 judges that the sound signal is not synchronized with the accompaniment signal, so that the accompaniment signal is synchronized with the sound signal, where the time difference is the time difference between the sound signal and the accompaniment signal.
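For orientation only, the structure of fig. 6 can be pictured as two cooperating components; the Python class names below are illustrative, and the real units may be implemented as hardware, as integrated circuits, or as program code scheduled by a processing element, as described later in this embodiment.

```python
class JudgingUnit:
    def is_synchronized(self, sound_signal, accompaniment_signal) -> bool:
        """Decide whether the user's sound signal and the accompaniment signal
        (extracted from the original audio/video data) are aligned."""
        raise NotImplementedError

class ProcessingUnit:
    def align(self, sound_signal, accompaniment_signal, time_diff_s: float):
        """Buffer the leading signal according to the time difference so that
        the two signals become synchronous (see the buffering sketch above)."""
        raise NotImplementedError

class SoundProcessingApparatus:
    """Minimal counterpart of the smart device 10 in fig. 6."""

    def __init__(self):
        self.judging_unit = JudgingUnit()
        self.processing_unit = ProcessingUnit()
```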
In a possible implementation manner, if the judging unit 11 judges that the sound signal is faster than the accompaniment signal, the processing unit 12 is configured to buffer the sound signal from a first time point and play the buffered sound signal at a second time point;
or,
if the judging unit 11 judges that the sound signal is slower than the accompaniment signal, the processing unit 12 is configured to buffer the accompaniment signal from the first time point and play the buffered accompaniment signal at a second time point;
where the first time point is determined according to the lyric file in the original audio and video data, and the second time point is the time point reached after the time difference has elapsed from the first time point.
In a feasible implementation manner, if the judging unit 11 judges that the sound signal is slower than the accompaniment signal, the processing unit 12 is configured to delete the sound signal before the first time point, where the first time point is determined according to a lyric file in the original audio/video data.
In a feasible implementation manner, before the judging unit 11 judges whether the sound signal emitted by the user and the accompaniment signal are synchronous, the processing unit 12 is further configured to extract characters from the original audio/video data according to the first time point, where the first time point is determined according to a lyric file in the original audio/video data;
the judging unit 11 is further configured to judge whether the sound production similarity between the characters corresponding to the sound signal and the extracted characters is greater than a first threshold;
and if the judging unit 11 judges that the sound production similarity between the characters corresponding to the sound signal and the extracted characters is greater than the first threshold, the processing unit 12 is further configured to determine that the sound signal emitted by the user is the signal corresponding to the extracted characters in the original audio/video data.
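As a purely illustrative sketch of this text-similarity check, the function below compares the characters recognized from the sound signal with the characters extracted from the lyric file; the use of difflib and the threshold value are assumptions, since the application only requires that the similarity exceed a first threshold.

```python
import difflib

def text_matches_lyrics(recognized_text, lyric_text, first_threshold=0.6):
    """Return True when the recognized characters are similar enough to the
    characters extracted from the original audio/video data."""
    similarity = difflib.SequenceMatcher(None, recognized_text, lyric_text).ratio()
    return similarity > first_threshold
```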
In a feasible implementation manner, before the judging unit 11 judges whether the sound signal emitted by the user and the accompaniment signal are synchronous, the processing unit 12 is further configured to extract characters from the original singing audio/video data according to the first time point and determine a first acoustic model corresponding to the characters, and to determine a second acoustic model corresponding to the sound signal according to the sound signal, where the first time point is determined according to the lyric file in the original singing audio and video data;
the judging unit 11 is further configured to judge whether the similarity between the first acoustic model and the second acoustic model is greater than a second threshold;
and if the judging unit 11 judges that the similarity between the first acoustic model and the second acoustic model is greater than the second threshold, the processing unit 12 is further configured to determine that the sound signal emitted by the user is the signal corresponding to the extracted characters in the original audio/video data.
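Similarly, a minimal sketch of the acoustic-model comparison, under the assumption that each acoustic model can be summarized as a fixed-length feature vector (for example, averaged spectral features); cosine similarity and the threshold value are assumptions rather than the method defined by the application.

```python
import numpy as np

def acoustic_models_match(model_a, model_b, second_threshold=0.8):
    """Compare two acoustic models represented as feature vectors and check
    whether their similarity exceeds the second threshold."""
    a = np.asarray(model_a, dtype=float).ravel()
    b = np.asarray(model_b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    similarity = float(np.dot(a, b) / denom) if denom else 0.0
    return similarity > second_threshold
```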
In a feasible implementation manner, after the judging unit 11 judges that the sound signal and the accompaniment signal are not synchronous and the sound signal or the accompaniment signal has been processed according to the time difference, the processing unit 12 is further configured to extract the original singing signal from the original singing audio/video data and adjust the sound signal according to the original singing signal, so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold.
In a feasible implementation manner, after the judging unit 11 judges that the sound signal and the accompaniment signal are not synchronous and the sound signal or the accompaniment signal has been processed according to the time difference, the processing unit 12 is further configured to extract a first feature from the original audio/video data, where the first feature describes the opening and closing degree and the opening and closing frequency of the mouth of a target object in the original audio/video data; acquire a target sound signal according to the first feature, the target sound signal being a sound signal emitted by the target object; and fuse the target sound signal with the sound signal.
In a possible implementation manner, after the judging unit 11 judges that the sound signal and the accompaniment signal are not synchronous and the sound signal or the accompaniment signal has been processed according to the time difference, the processing unit 12 is further configured to acquire a second feature of the user, where the second feature describes the opening and closing degree and the opening and closing frequency of the user's mouth, and to adjust the sound signal according to the second feature so that the adjusted sound signal matches the second feature.
The sound processing apparatus provided in the embodiment of the present application may perform the actions of the intelligent device in the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be noted that the judging unit and the processing unit may be implemented in the form of software called by a processing element, or in the form of hardware. For example, the processing unit may be a separately established processing element, or may be implemented by being integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code whose function is called and executed by a processing element of the apparatus. In addition, all or part of these units may be integrated together or implemented independently. The processing element described here may be an integrated circuit having signal processing capability. In implementation, the steps of the above method or the above units may be implemented by hardware integrated logic circuits in a processor element or by instructions in the form of software.
For example, the above units may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). For another example, when one of the above units is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call program code. As another example, these units may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Fig. 7 is a schematic structural diagram of a communication device according to another embodiment of the present application. As shown in fig. 7, the communication device 20 may include: a processor 21 (e.g., CPU), a memory 22; the memory 22 may include a random-access memory (RAM) and may further include a non-volatile memory (NVM), such as at least one disk memory, and the memory 22 may store various instructions for performing various processing functions and implementing the method steps of the present application. Optionally, the communication apparatus related to the present application may further include: a power supply 23, a communication bus 24, and a communication port 25. A communication bus 24 is used to enable communication connections between the elements. The communication port 25 is used for connection and communication between the sound processing apparatus and other peripheral devices.
In the embodiment of the present application, the memory 22 is used for storing computer-executable program code, and the program code comprises instructions; when the processor 21 executes the instructions, the instructions cause the processor 21 of the sound processing apparatus to perform the processing actions of the smart device in the above method embodiments; the implementation principle and technical effect are similar and are not described here again.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The term "plurality" herein means two or more. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship; in the formula, the character "/" indicates that the preceding and following related objects are in a relationship of "division".
It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application.
It should be understood that, in the embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.

Claims (16)

1. A sound processing method, comprising:
judging whether a sound signal sent by a user is synchronous with an accompaniment signal, wherein the accompaniment signal is extracted from original audio and video data;
if the sound signal and the accompaniment signal are not synchronous, the sound signal or the accompaniment signal is processed according to a time difference so that the accompaniment signal is synchronous with the sound signal, and the time difference is the time difference between the sound signal and the accompaniment signal.
2. The method of claim 1,
if the sound signal is not synchronous with the accompaniment signal, the sound signal or the accompaniment signal is processed according to time difference, and the method comprises the following steps:
if the sound signal is faster than the accompaniment signal, caching the sound signal from a first time point, and playing the cached sound signal at a second time point;
or,
if the sound signal is slower than the accompaniment signal, caching the accompaniment signal from the first time point, and playing the cached accompaniment signal at a second time point;
the first time point is determined according to the lyric file in the original audio and video data, and the second time point is the time point which starts from the first time point and is obtained after the time difference.
3. The method of claim 1, wherein processing the sound signal or the accompaniment signal according to the time difference if the sound signal is not synchronized with the accompaniment signal comprises:
and if the sound signal is slower than the accompaniment signal, deleting the sound signal before the first time point, wherein the first time point is determined according to the lyric file in the original singing audio and video data.
4. The method according to any one of claims 1 to 3, wherein before determining whether the sound signal emitted by the user and the accompaniment signal are synchronized, the method further comprises:
extracting characters from the original singing audio and video data according to the first time point, wherein the first time point is determined according to a lyric file in the original singing audio and video data;
judging whether the sound production similarity of the characters corresponding to the sound signals and the extracted characters is larger than a first threshold value or not;
and if the sound production similarity between the characters corresponding to the sound signals and the extracted characters is greater than the first threshold value, determining that the sound signals emitted by the user are the signals corresponding to the extracted characters in the original audio and video data.
5. The method according to any one of claims 1 to 3, wherein before determining whether the sound signal emitted by the user and the accompaniment signal are synchronized, the method further comprises:
extracting characters from the original singing audio and video data according to the first time point, wherein the first time point is determined according to a lyric file in the original singing audio and video data;
determining a first acoustic model corresponding to the characters;
determining a second acoustic model corresponding to the sound signal according to the sound signal;
judging whether the similarity of the first acoustic model and the second acoustic model is larger than a second threshold value or not;
and if the similarity between the first acoustic model and the second acoustic model is greater than the second threshold, determining that the acoustic signal sent by the user is a signal corresponding to the extracted characters in the original audio and video data.
6. The method according to any one of claims 1-5, wherein the processing the sound signal or the accompaniment signal according to a time difference if the sound signal is not synchronized with the accompaniment signal, further comprises:
extracting an original singing signal from the original singing audio and video data;
and adjusting the sound signal according to the original singing signal so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold value.
7. The method according to any one of claims 1-5, wherein the processing the sound signal or the accompaniment signal according to a time difference if the sound signal is not synchronized with the accompaniment signal, further comprises:
extracting a first characteristic from the original audio and video data, wherein the first characteristic is used for describing the opening and closing degree and the opening and closing frequency of a target object mouth in the original audio and video data;
acquiring a target sound signal according to the first characteristic, wherein the target sound signal is a sound signal emitted by the target object;
fusing the target sound signal and the sound signal.
8. The method according to any one of claims 1-5, wherein the processing the sound signal or the accompaniment signal according to a time difference if the sound signal is not synchronized with the accompaniment signal, further comprises:
acquiring second characteristics of the user, wherein the second characteristics are used for describing the opening and closing degree and the opening and closing frequency of the mouth of the user;
and according to the second characteristic, adjusting the sound signal so as to enable the adjusted sound signal to be matched with the second characteristic.
9. A sound processing apparatus, comprising:
the device comprises a judging unit, a processing unit and a processing unit, wherein the judging unit is used for judging whether a sound signal sent by a user is synchronous with an accompaniment signal, and the accompaniment signal is extracted from original audio and video data;
and the processing unit is used for processing the sound signal or the accompaniment signal according to the time difference when the judgment unit judges that the sound signal is not synchronous with the accompaniment signal, so that the accompaniment signal is synchronous with the sound signal, and the time difference is the time difference between the sound signal and the accompaniment signal.
10. The apparatus of claim 9,
if the judging unit judges that the sound signal is faster than the accompaniment signal, the processing unit is used for caching the sound signal from a first time point and playing the cached sound signal at a second time point;
or,
if the judgment unit judges that the sound signal is slower than the accompaniment signal, the processing unit is used for caching the accompaniment signal from the first time point and playing the cached accompaniment signal at a second time point;
the first time point is determined according to the lyric file in the original audio and video data, and the second time point is the time point which starts from the first time point and is obtained after the time difference.
11. The apparatus of claim 9,
if the judging unit judges that the sound signal is slower than the accompaniment signal, the processing unit is used for deleting the sound signal before the first time point, and the first time point is determined according to the lyric file in the original singing audio and video data.
12. The apparatus according to any one of claims 9 to 11,
the processing unit is used for extracting characters from the original singing audio and video data according to the first time point before the judging unit judges whether the sound signal sent by the user and the accompaniment signal are synchronous, and the first time point is determined according to a lyric file in the original singing audio and video data;
the judging unit is further configured to judge whether the sound production similarity between the text corresponding to the sound signal and the extracted text is greater than a first threshold;
if the judging unit judges that the sound production similarity between the characters corresponding to the sound signals and the extracted characters is larger than the first threshold value, the processing unit is further used for determining that the sound signals emitted by the user are the signals corresponding to the extracted characters in the original audio and video data.
13. The apparatus according to any one of claims 9 to 11,
the processing unit is used for extracting characters from the original singing audio and video data according to the first time point and determining a first acoustic model corresponding to the characters before the judging unit judges whether the sound signal sent by the user and the accompaniment signal are synchronous; determining a second acoustic model corresponding to the sound signal according to the sound signal, wherein the first time point is determined according to a lyric file in the original singing audio and video data;
the judging unit is further configured to judge whether the similarity between the first acoustic model and the second acoustic model is greater than a second threshold;
if the judging unit judges that the similarity between the first acoustic model and the second acoustic model is greater than the second threshold value, the processing unit is further configured to determine that the acoustic signal emitted by the user is a signal corresponding to the extracted characters in the original audio/video data.
14. The apparatus according to any one of claims 9 to 13,
the processing unit is used for extracting an original singing signal from the original singing audio and video data after the judging unit judges that the sound signal and the accompaniment signal are not synchronous and processes the sound signal or the accompaniment signal according to time difference; and adjusting the sound signal according to the original singing signal so that the similarity between the waveform of the original singing signal and the waveform of the sound signal exceeds a third threshold value.
15. The apparatus according to any one of claims 9 to 13, wherein the processing unit is further configured to extract a first feature from the original audio/video data after the judging unit judges that the sound signal and the accompaniment signal are not synchronous, and the sound signal or the accompaniment signal is processed according to a time difference, where the first feature is used to describe an opening and closing degree and an opening and closing frequency of a mouth of a target object in the original audio/video data; acquire a target sound signal according to the first feature, wherein the target sound signal is a sound signal emitted by the target object; and fuse the target sound signal and the sound signal.
16. The apparatus according to any one of claims 9 to 13, wherein the processing unit is further configured to collect a second feature of the user after the judging unit judges that the sound signal and the accompaniment signal are not synchronized, and the sound signal or the accompaniment signal is processed according to a time difference, wherein the second feature is used for describing an opening and closing degree and an opening and closing frequency of the mouth of the user;
and according to the second characteristic, adjusting the sound signal so as to enable the adjusted sound signal to be matched with the second characteristic.
CN201810848306.0A 2018-07-27 2018-07-27 Sound processing method, device and storage medium Active CN110767204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810848306.0A CN110767204B (en) 2018-07-27 2018-07-27 Sound processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110767204A true CN110767204A (en) 2020-02-07
CN110767204B CN110767204B (en) 2022-06-14

Family

ID=69328521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810848306.0A Active CN110767204B (en) 2018-07-27 2018-07-27 Sound processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110767204B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN105989823A (en) * 2015-02-03 2016-10-05 中国移动通信集团四川有限公司 Automatic meter-following accompanying method and device
US20170018202A1 (en) * 2015-07-17 2017-01-19 Giovanni Technologies, Inc. Musical notation, system, and methods
CN106898340A (en) * 2017-03-30 2017-06-27 腾讯音乐娱乐(深圳)有限公司 The synthetic method and terminal of a kind of song
CN107591149A (en) * 2017-09-18 2018-01-16 腾讯音乐娱乐科技(深圳)有限公司 Audio synthetic method, device and storage medium
CN107680571A (en) * 2017-10-19 2018-02-09 百度在线网络技术(北京)有限公司 A kind of accompanying song method, apparatus, equipment and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409817A (en) * 2021-06-24 2021-09-17 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology
CN113409817B (en) * 2021-06-24 2022-05-13 浙江松会科技有限公司 Audio signal real-time tracking comparison method based on voiceprint technology

Also Published As

Publication number Publication date
CN110767204B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
US11232808B2 (en) Adjusting speed of human speech playback
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN110675886B (en) Audio signal processing method, device, electronic equipment and storage medium
US20200211565A1 (en) System and method for simultaneous multilingual dubbing of video-audio programs
CN108242238B (en) Audio file generation method and device and terminal equipment
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
CN104252872B (en) Lyric generating method and intelligent terminal
CN114999441B (en) Avatar generation method, apparatus, device, storage medium, and program product
CN110853615A (en) Data processing method, device and storage medium
KR20100120917A (en) Apparatus for generating avatar image message and method thereof
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
CN114845160B (en) Voice-driven video processing method, related device and storage medium
JP2023007405A (en) Voice conversion device, voice conversion method, program, and storage medium
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN110767204B (en) Sound processing method, device and storage medium
CN113345407B (en) Style speech synthesis method and device, electronic equipment and storage medium
CN114125506B (en) Voice auditing method and device
JP2020184100A (en) Information processing program, information processing apparatus, information processing method and learned model generation method
CN111402919B (en) Method for identifying style of playing cavity based on multi-scale and multi-view
CN112885318A (en) Multimedia data generation method and device, electronic equipment and computer storage medium
Schuller et al. Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm
CN113409809B (en) Voice noise reduction method, device and equipment
CN112235183B (en) Communication message processing method and device and instant communication client
CN113450783B (en) System and method for progressive natural language understanding
KR101920653B1 (en) Method and program for edcating language by making comparison sound

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant