CN113470699A - Audio processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN113470699A
Authority
CN
China
Prior art keywords
pitch
target
audio frame
audio
frame set
Prior art date
Legal status
Granted
Application number
CN202111032567.3A
Other languages
Chinese (zh)
Other versions
CN113470699B (en)
Inventor
周勇 (Zhou Yong)
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202111032567.3A
Publication of CN113470699A
Application granted
Publication of CN113470699B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90: Pitch determination of speech signals
    • G10L2025/906: Pitch tracking
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335: Pitch control

Abstract

The application provides an audio processing method and apparatus, an electronic device, and a readable storage medium, belonging to the technical field of data processing. The method comprises: acquiring a first audio frame set of a target audio and a second audio frame set of a reference audio; performing alignment processing in the time domain dimension on the semantic features in the first audio frame set according to the semantic features in the second audio frame set, to obtain a target audio frame set corresponding to the first audio frame set; determining a first pitch set corresponding to the target audio frame set and a second pitch set corresponding to the second audio frame set; determining an adjustment strategy based on the first pitch set and the second pitch set; and adjusting the pitch of the target audio using the adjustment strategy. Because the adjustment strategy takes the user's own pitch into account, the distortion that results from ignoring it is avoided.

Description

Audio processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to an audio processing method and apparatus, an electronic device, and a readable storage medium.
Background
With the continuous development of mobile-internet karaoke applications, users have increasingly high expectations of the karaoke experience. However, different people perceive music and melody differently, and singing off-key or off-beat seriously affects the user's karaoke experience. At present, most pitch-correction functions in karaoke applications are implemented through template matching, i.e., the pitch and rhythm of the song sung by the user are adjusted directly to match those of the template. However, because template matching does not consider the user's own pitch, the corrected audio is prone to distortion and no longer sounds like the user's own voice.
Disclosure of Invention
To solve the technical problem that pitch correction by template matching easily distorts the corrected audio because the user's own pitch is not considered, the present application provides an audio processing method and apparatus, an electronic device, and a readable storage medium.
In a first aspect, an audio processing method is provided, the method including:
acquiring a first audio frame set of a target audio and a second audio frame set of a reference audio;
performing alignment processing on the semantic features in the first audio frame set in a time domain dimension according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set;
determining a first pitch set corresponding to the target audio frame set, and determining a second pitch set corresponding to the second audio frame set;
determining an adjustment strategy based on the first pitch set and the second pitch set;
and adjusting the pitch of the target audio by utilizing the adjusting strategy.
Optionally, the determining an adjustment policy based on the first pitch set and the second pitch set comprises:
determining a first mean value corresponding to the first pitch set and a second mean value corresponding to the second pitch set;
taking the absolute value of the difference value of the first average value and the second average value as a pitch difference value;
determining the adjustment policy based on the pitch-difference value and the second pitch set.
Optionally, the determining the adjustment policy based on the pitch-difference value and the second pitch set comprises:
judging whether the pitch difference value is larger than a preset pitch threshold value or not;
if the pitch difference value is larger than the preset pitch threshold value, acquiring a target parameter, and determining a first target pitch set based on the second pitch set and the target parameter, wherein the first target pitch set is used for adjusting the pitch of the target audio;
and if the pitch difference value is smaller than or equal to the preset pitch threshold value, taking the second pitch set as the first target pitch set.
Optionally, said determining a first target pitch set based on said second pitch set and said target parameters comprises:
taking the sum of the second pitch and the target parameter as a first target pitch for each second pitch in the second pitch set if the first mean is greater than the second mean, resulting in the first target pitch set;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the target parameter from the second pitch as the first target pitch, resulting in the first target pitch set.
Optionally, the method further comprises:
taking a product result of an adjustment value and a preset parameter as the target parameter under the condition of receiving the adjustment value input by the object;
and taking the preset parameter as the target parameter under the condition that an object input adjustment value is not received.
Optionally, the determining the adjustment policy based on the pitch-difference value and the second pitch set comprises:
if the first average value is greater than the second average value, taking the sum of the second pitch and the pitch difference value as a second target pitch for each second pitch in the second pitch set, resulting in a second target pitch set, which is used for adjusting the pitch of the target audio;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the pitch difference value from the second pitch as the second target pitch, resulting in the second target pitch set.
Optionally, the performing, according to the semantic features in the second audio frame set, alignment processing in a time domain dimension on the semantic features in the first audio frame set to obtain a target audio frame set corresponding to the first audio frame set includes:
extracting a first semantic feature from the first audio frame set and extracting a second semantic feature from the second audio frame set;
inputting the first semantic features and the second semantic features into a sequence matching model so that the sequence matching model outputs an alignment result;
inputting the alignment result and the first audio frame set to a time domain adjustment model, so that the time domain adjustment model outputs the target audio frame set.
In a second aspect, there is provided an audio processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a first audio frame set of target audio and a second audio frame set of reference audio;
the alignment module is used for performing alignment processing on a time domain dimension on the semantic features in the first audio frame set according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set;
a first determining module, configured to determine a first pitch set corresponding to the target audio frame set, and determine a second pitch set corresponding to the second audio frame set;
a second determination module to determine an adjustment strategy based on the first pitch set and the second pitch set;
and the adjusting module is used for adjusting the pitch of the target audio by utilizing the adjusting strategy.
Optionally, the second determining module is specifically configured to:
determining a first mean value corresponding to the first pitch set and a second mean value corresponding to the second pitch set;
taking the absolute value of the difference value of the first average value and the second average value as a pitch difference value;
determining the adjustment policy based on the pitch-difference value and the second pitch set.
Optionally, the second determining module is further configured to:
judging whether the pitch difference value is larger than a preset pitch threshold value or not;
if the pitch difference value is larger than the preset pitch threshold value, acquiring a target parameter, and determining a first target pitch set based on the second pitch set and the target parameter, wherein the first target pitch set is used for adjusting the pitch of the target audio;
and if the pitch difference value is smaller than or equal to the preset pitch threshold value, taking the second pitch set as the first target pitch set.
Optionally, the second determining module is further configured to:
taking the sum of the second pitch and the target parameter as a first target pitch for each second pitch in the second pitch set if the first mean is greater than the second mean, resulting in the first target pitch set;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the target parameter from the second pitch as the first target pitch, resulting in the first target pitch set.
Optionally, the apparatus further comprises an input module, configured to:
taking a product result of an adjustment value and a preset parameter as the target parameter under the condition of receiving the adjustment value input by the object;
and taking the preset parameter as the target parameter under the condition that an object input adjustment value is not received.
Optionally, the second determining module is further configured to:
if the first average value is greater than the second average value, taking the sum of the second pitch and the pitch difference value as a second target pitch for each second pitch in the second pitch set, resulting in a second target pitch set, which is used for adjusting the pitch of the target audio;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the pitch difference value from the second pitch as the second target pitch, resulting in the second target pitch set.
Optionally, the alignment module is specifically configured to:
extracting a first semantic feature from the first audio frame set and extracting a second semantic feature from the second audio frame set;
inputting the first semantic features and the second semantic features into a sequence matching model so that the sequence matching model outputs an alignment result;
inputting the alignment result and the first audio frame set to a time domain adjustment model, so that the time domain adjustment model outputs the target audio frame set.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the first aspect when executing a program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, wherein a computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program implements the method steps of any of the first aspects.
In a fifth aspect, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the audio processing methods described above.
The embodiment of the application has the following beneficial effects:
The embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a readable storage medium. First, a first audio frame set of a target audio and a second audio frame set of a reference audio are obtained; then, according to the semantic features in the second audio frame set, the semantic features in the first audio frame set are aligned in the time domain dimension to obtain a target audio frame set corresponding to the first audio frame set; a first pitch set corresponding to the target audio frame set and a second pitch set corresponding to the second audio frame set are determined; finally, an adjustment strategy is determined based on the first pitch set and the second pitch set, and the pitch of the target audio is adjusted using that strategy. Because the adjustment strategy considers the pitch of the target audio as well as the pitch of the reference audio, adjusting the pitch with it avoids the distortion caused by ignoring the user's own pitch.
Of course, not all advantages described above need to be achieved at the same time in the practice of any one product or method of the present application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, other drawings can be derived from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an audio processing method according to an embodiment of the present application;
fig. 2 is a flowchart of an audio processing method according to another embodiment of the present application;
fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Most pitch-correction functions in current karaoke applications are implemented through template matching, i.e., the pitch and rhythm of the song sung by the user are adjusted directly to match the pitch and rhythm of the template. However, because template matching does not consider the user's own pitch, the corrected audio is prone to distortion and no longer sounds like the user's own voice. Therefore, the embodiments of the present application provide an audio processing method.
An audio processing method provided in the embodiments of the present application will be described in detail below with reference to specific embodiments, as shown in fig. 1, the specific steps are as follows:
s101, a first audio frame set of the target audio and a second audio frame set of the reference audio are obtained.
In the embodiment of the present application, the target audio is the audio to be corrected (for example, a song the user sings in the karaoke application or uploads), and the reference audio is the audio used as the correction reference (for example, the original singer's recording of the corresponding song in a song library).
Further, the reference audio may be obtained in a variety of ways, including but not limited to automatic matching in a song library based on lyrics and/or melody or uploading by the user.
Further, the audio is composed of a plurality of audio frames, and thus a first audio frame set corresponding to the target audio and a second audio frame set corresponding to the reference audio can be extracted.
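As a concrete illustration of the framing step above, the pure-Python sketch below splits a sample sequence into overlapping frames. The frame length and hop size are illustrative assumptions; the application does not fix these values.

```python
def split_into_frames(samples, frame_len=1024, hop=512):
    """Split an audio sample sequence into overlapping frames.
    frame_len and hop are illustrative; real applications pick them
    from the sample rate and the desired analysis resolution."""
    if not samples:
        return []
    frames = []
    # Slide a frame_len-sample window forward by hop samples each step.
    for start in range(0, max(1, len(samples) - frame_len + 1), hop):
        frames.append(samples[start:start + frame_len])
    return frames

# A 4096-sample signal yields 7 half-overlapping 1024-sample frames:
frames = split_into_frames(list(range(4096)))
```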
S102, performing alignment processing on the semantic features in the first audio frame set in a time domain dimension according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set.
In the embodiment of the present application, in order to adjust the rhythm of the target audio to be consistent with the rhythm of the reference audio, alignment processing in the time domain dimension is performed on the first audio frame set according to the second audio frame set, and the obtained rhythm of the target audio frame set is consistent with the rhythm of the reference audio.
In an implementation manner of the embodiment of the present application, S102 may include the following steps:
s201, extracting a first semantic feature from the first audio frame set, and extracting a second semantic feature from the second audio frame set.
In this embodiment, the first semantic feature characterizes the semantics of the first audio frame set, and the second semantic feature characterizes the semantics of the second audio frame set. The first and second semantic features are preferably PPGs (Phonetic PosteriorGrams): for example, an Automatic Speech Recognition (ASR) model is used to extract the posterior probability distribution corresponding to the first audio frame set as the first semantic feature, and the same model extracts the posterior probability distribution corresponding to the second audio frame set as the second semantic feature. Because PPGs operate at the sub-phoneme level and have many classes, performing the subsequent alignment with PPGs yields a more accurate alignment result.
S202, inputting the first semantic features and the second semantic features into a sequence matching model so that the sequence matching model outputs an alignment result.
In this embodiment, the sequence matching model is used to measure the similarity between two time sequences with different lengths, such as a DTW (dynamic time warping) model.
In this embodiment, the first audio frame set and the second audio frame set correspond to the same audio content (e.g., the same song). Different people often sing the same song at different tempos, i.e., they hold the same sound for different lengths of time. For example, the first "a" sound in the first audio frame set is held briefly and spans frames 0-100, while the same "a" sound in the second audio frame set is held longer and spans frames 0-200. The sequence matching model maps the same sounds with different durations onto each other, i.e., it establishes the correspondence between frames 0-100 of the first audio frame set and frames 0-200 of the second audio frame set.
Therefore, after the first semantic features and the second semantic features are input into the sequence matching model, the sequence matching model may correspond the audio frames belonging to the same tone in the first audio frame set and the second audio frame set to obtain the alignment result of the first audio frame set and the second audio frame set.
In this embodiment, only semantic features are used for alignment, and no pitch-related features are needed, so even severely off-key singing does not degrade the alignment accuracy.
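The patent does not disclose the internals of the sequence matching model; as a hedged illustration, the pure-Python sketch below implements classic DTW over scalar per-frame features and backtracks the frame-to-frame alignment path. The distance function and the toy feature values are assumptions for illustration only.

```python
def dtw_align(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping: return the accumulated cost and the
    frame-to-frame alignment path between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j]: best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in seq_a only
                                 cost[i][j - 1],      # step in seq_b only
                                 cost[i - 1][j - 1])  # step in both
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], path[::-1]

# A sound held briefly in the target vs. longer in the reference still
# maps frame-to-frame with zero cost:
target_feats = [1.0, 1.0, 5.0]
reference_feats = [1.0, 1.0, 1.0, 1.0, 5.0]
total_cost, path = dtw_align(target_feats, reference_feats)
```

Real systems would apply this over PPG vectors with a vector distance (e.g., cosine), but the warping logic is the same.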
S203, inputting the alignment result and the first audio frame set to a time domain adjustment model, so that the time domain adjustment model outputs the target audio frame set.
In this embodiment, the time domain adjustment model is used to change the "speech rate" without changing the "intonation", such as a WSOLA (Waveform Similarity Overlap-Add) model.
For example, in the alignment result, the audio frames 0 to 100 in the first audio frame set correspond to the audio frames 0 to 200 in the second audio frame set, that is, the audio frames correspond to the same sound but the sounding durations are not consistent, and the audio frames 0 to 100 in the first audio frame set can be adjusted to be the audio frames 0 to 200 by the time domain adjustment model, so that the sounding durations of the audio frames are consistent.
Therefore, after the alignment result and the first audio frame set are input to the time domain adjustment model, the sounding duration of each sound in the target audio frame set output by the time domain adjustment model is consistent with the sounding duration of the corresponding sound in the second audio frame set, and therefore, the rhythms of the target audio frame set and the second audio frame set are consistent.
In the embodiment, the rhythm of the target audio can be adjusted to be consistent with that of the reference audio, and on the basis, the subsequent pitch adjustment is performed with higher accuracy and better tone modification effect.
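A faithful WSOLA implementation overlap-adds waveform segments at similarity-matched offsets and is beyond a short sketch. The simplified stand-in below only illustrates the bookkeeping of mapping a run of frames to the reference's length by index resampling; it is an assumption-laden simplification, not the patent's time domain adjustment model.

```python
def stretch_frames(frames, target_len):
    """Map a run of frames to a new length by nearest-index
    resampling -- a crude stand-in for WSOLA-style time stretching
    (which would overlap-add waveform segments instead)."""
    if target_len <= 0 or not frames:
        return []
    src_len = len(frames)
    return [frames[min(src_len - 1, int(i * src_len / target_len))]
            for i in range(target_len)]

# The 100-frame "a" sound in the target is stretched to match the
# 200-frame "a" sound in the reference:
stretched = stretch_frames(list(range(100)), 200)
```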
S103, determining a first pitch set corresponding to the target audio frame set, and determining a second pitch set corresponding to the second audio frame set.
S104, determining an adjustment strategy based on the first pitch set and the second pitch set.
And S105, adjusting the pitch of the target audio by utilizing the adjusting strategy.
In the embodiment of the application, the first pitch set contains the pitch of each audio frame in the target audio frame set, so it characterizes the pitch of the target audio; the second pitch set contains the pitch of each audio frame in the second audio frame set, so it characterizes the pitch of the reference audio. Adjusting the pitch of the target audio with a strategy determined from the first and second pitch sets therefore takes both the pitch of the target audio and the pitch of the reference audio into account.
In the embodiment of the application, first, a first audio frame set of a target audio and a second audio frame set of a reference audio are obtained; then, according to the semantic features in the second audio frame set, the semantic features in the first audio frame set are aligned in the time domain dimension to obtain a target audio frame set corresponding to the first audio frame set; a first pitch set corresponding to the target audio frame set and a second pitch set corresponding to the second audio frame set are determined; finally, an adjustment strategy is determined based on the first pitch set and the second pitch set, and the pitch of the target audio is adjusted using that strategy. Because the adjustment strategy considers the pitch of the target audio as well as the pitch of the reference audio, adjusting the pitch with it avoids the distortion caused by ignoring the user's own pitch.
In another embodiment of the present application, the S104 may include the following steps:
step one, determining a first mean value corresponding to the first pitch set and a second mean value corresponding to the second pitch set;
step two, taking the absolute value of the difference value of the first average value and the second average value as a pitch difference value;
step three, determining the adjustment strategy based on the pitch difference value and the second pitch set.
In an embodiment of the present application, the first pitch set includes the pitch of each audio frame in the target audio frame set, and the second pitch set includes the pitch of each audio frame in the second audio frame set. The first mean is the average of all pitches in the first pitch set, i.e., the mean pitch of the target audio frame set; the second mean is the average of all pitches in the second pitch set, i.e., the mean pitch of the second audio frame set. The absolute value of the difference between the two means therefore characterizes the pitch gap between the target audio and the reference audio: the larger the absolute value, the larger the gap.
In the embodiment of the application, the pitch difference condition of the target audio and the reference audio is considered when the adjustment strategy is determined, so that the situation that the pitch of the user is not considered to cause distortion can be avoided when the pitch is adjusted by using the adjustment strategy.
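The mean-and-difference computation of steps one and two can be stated directly in code. The pitch values below are hypothetical, and the Hz unit is an assumption; the application does not fix a unit.

```python
def pitch_difference(first_pitch_set, second_pitch_set):
    """Absolute difference between the mean pitches of the two sets
    (steps one and two of the adjustment-strategy derivation)."""
    first_mean = sum(first_pitch_set) / len(first_pitch_set)
    second_mean = sum(second_pitch_set) / len(second_pitch_set)
    return abs(first_mean - second_mean)

# e.g. a user singing around 220 Hz against a 196 Hz reference:
diff = pitch_difference([218.0, 220.0, 222.0], [194.0, 196.0, 198.0])
```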
In another embodiment of the present application, the S104 may further include the following steps:
step one, judging whether the pitch difference value is larger than a preset pitch threshold value or not;
step two, if the pitch difference value is larger than the preset pitch threshold value, acquiring a target parameter, and determining a first target pitch set based on the second pitch set and the target parameter, wherein the first target pitch set is used for adjusting the pitch of the target audio;
and step three, if the pitch difference value is smaller than or equal to the preset pitch threshold value, taking the second pitch set as the first target pitch set.
In the embodiment of the present application, the target parameter is a positive number used to control the magnitude of the pitch correction. The first target pitch set contains a plurality of first target pitches; when the pitch of the target audio is adjusted, the pitch of each audio frame in the target audio is adjusted to the corresponding first target pitch. The preset pitch threshold gauges how far apart the target audio and the reference audio are in pitch. When the pitch difference value exceeds the threshold, the gap is large, so the first target pitch set is derived from the second pitch set and the target parameter, i.e., the correction magnitude is tempered relative to the second pitch set, avoiding the distortion caused by an overly large adjustment. When the pitch difference value is at most the threshold, the gap is small, so the second pitch set can be used directly as the first target pitch set without recalculation, saving computing resources.
In one implementation of the embodiments of the present application, the first target pitch set may be determined by:
step one, under the condition that the first mean value is larger than the second mean value, regarding the sum of the second pitch and the target parameter as a first target pitch for each second pitch in the second pitch set, and obtaining the first target pitch set;
step two, under the condition that the first mean value is smaller than the second mean value, regarding each second pitch in the second pitch set, taking a difference value obtained by subtracting the target parameter from the second pitch as the first target pitch, and obtaining the first target pitch set.
In this embodiment, when the first mean is greater than the second mean, the overall pitch of the target audio is higher than that of the reference audio; for example, when singing the same song, a female voice is typically higher-pitched than a male voice. In that case, for each second pitch in the second pitch set, the sum of the second pitch and the target parameter is taken as the first target pitch, so the first target pitch is higher than the second pitch.
In a case where the first average value is smaller than the second average value, it indicates that the overall pitch of the target audio is smaller than the overall pitch of the reference audio, and at this time, for each second pitch in the second pitch set, a difference obtained by subtracting the target parameter from the second pitch is taken as the first target pitch, and thus, in a case where the first average value is smaller than the second average value, the first target pitch is lower than the second pitch.
In this embodiment, compared with directly adjusting the first pitch to the corresponding second pitch, adjusting the first pitch to the first target pitch reduces the adjustment amplitude, avoiding distortion caused by an overly large adjustment.
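The two steps above can be sketched as follows (a minimal illustration; the function and argument names are assumptions):

```python
def first_target_pitches(second_pitch_set, target_param,
                         first_mean, second_mean):
    """Offset each second pitch by the target parameter.

    Direction follows the mean comparison: shift upward when the
    target audio's overall pitch (first mean) is higher than the
    reference audio's (second mean), downward otherwise.
    """
    if first_mean > second_mean:
        return [p + target_param for p in second_pitch_set]
    return [p - target_param for p in second_pitch_set]
```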
In yet another embodiment of the present application, the method further comprises the steps of:
step one, in the case where an adjustment value input by the object is received, taking the product of the adjustment value and a preset parameter as the target parameter;
step two, in the case where no adjustment value input by the object is received, taking the preset parameter as the target parameter.
In the embodiment of the application, the object refers to the user who adjusts the pitch, the adjustment value is an adjustment coefficient, and the preset parameter is a positive number. When an adjustment value input by the object is received, the product of the adjustment value and the preset parameter is taken as the target parameter; when no adjustment value is received, the preset parameter is taken directly as the target parameter. Through this scheme, the target parameter can be determined based on the user's selection, that is, the adjustment amplitude is determined according to the user's instruction, so that the adjusted pitch better meets the user's requirements and the user experience is improved.
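A minimal sketch of this rule (names are assumptions; `adjustment_value=None` models "no adjustment value received"):

```python
def resolve_target_parameter(preset_param, adjustment_value=None):
    """Return the target parameter from an optional user adjustment.

    If the user supplied an adjustment coefficient, scale the preset
    parameter by it; otherwise fall back to the preset parameter.
    """
    if adjustment_value is not None:
        return adjustment_value * preset_param
    return preset_param
```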
In another embodiment of the present application, the S104 may further include the following steps:
step one, in the case where the first mean is greater than the second mean, taking, for each second pitch in the second pitch set, the sum of the second pitch and the pitch difference value as a second target pitch, to obtain a second target pitch set, where the second target pitch set is used to adjust the pitch of the target audio;
step two, in the case where the first mean is smaller than the second mean, taking, for each second pitch in the second pitch set, the difference obtained by subtracting the pitch difference value from the second pitch as the second target pitch, to obtain the second target pitch set.
In the embodiment of the present application, the second target pitch set includes a plurality of second target pitches, and when the pitch of the target audio is adjusted, the pitch of each audio frame in the target audio is adjusted to the corresponding second target pitch. In this embodiment, the adjustment amplitude may be set directly according to the pitch difference value: in the case where the first mean is greater than the second mean, for each second pitch in the second pitch set, the sum of the second pitch and the pitch difference value is taken as the second target pitch; in the case where the first mean is smaller than the second mean, the difference obtained by subtracting the pitch difference value from the second pitch is taken as the second target pitch. Since pitch in hertz is a linear quantity while the human ear perceives pitch nonlinearly (approximately logarithmically), the first pitch and the second pitch used in the calculation need to be converted into corresponding logarithms in advance, so that the calculation result conforms to the auditory characteristics of the human ear.
Specifically, the second target pitch may be calculated by the following equations (1) to (4):
Figure 501593DEST_PATH_IMAGE001
Figure 706309DEST_PATH_IMAGE002
Figure 905209DEST_PATH_IMAGE003
Figure 511771DEST_PATH_IMAGE004
wherein the content of the first and second substances,
Figure 505135DEST_PATH_IMAGE005
for the second pitch of the sound, is,
Figure 259464DEST_PATH_IMAGE006
for the first pitch of the sound, is,
Figure 996476DEST_PATH_IMAGE007
is the first average value of the first average value,
Figure 956079DEST_PATH_IMAGE008
is the second average value of the first average value,
Figure 385924DEST_PATH_IMAGE009
is the second target pitch.
That is, the first pitch and the second pitch are first converted into corresponding logarithms, converting them from linear to nonlinear values; the nonlinear second target pitch is then obtained through formula (3); finally, the second target pitch is converted back from nonlinear to linear, and the linear second target pitch is used to adjust the pitch of the target audio.
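A minimal sketch of this log-domain computation (argument names are assumptions; the two means are taken to be log-domain values, consistent with the conversion described above):

```python
import math

def second_target_pitch(f2, first_mean_log, second_mean_log):
    """Compute the second target pitch per the log-domain scheme.

    Convert the second pitch to log2 (linear -> nonlinear), shift it
    by the absolute difference of the two means toward the target
    audio's range, then convert back to hertz (nonlinear -> linear).
    """
    p2 = math.log2(f2)                            # linear -> log domain
    diff = abs(first_mean_log - second_mean_log)  # pitch difference value
    if first_mean_log > second_mean_log:
        pt = p2 + diff                            # shift upward
    else:
        pt = p2 - diff                            # shift downward
    return 2.0 ** pt                              # log -> linear domain
```

For example, a mean gap of 1.0 in the log2 domain corresponds to a full octave, so a 220 Hz reference pitch maps to 440 Hz when the target audio sits an octave higher.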
In the embodiment of the application, a first audio frame set of a target audio and a second audio frame set of a reference audio are first obtained; then, according to the semantic features in the second audio frame set, the semantic features in the first audio frame set are aligned in the time domain dimension to obtain a target audio frame set corresponding to the first audio frame set; a first pitch set corresponding to the target audio frame set and a second pitch set corresponding to the second audio frame set are determined; finally, an adjustment strategy is determined based on the first pitch set and the second pitch set, and the pitch of the target audio is adjusted using that strategy. Because the adjustment strategy considers the pitch of the target audio as well as the pitch of the reference audio, it avoids the distortion that arises when the user's own pitch is ignored during adjustment.
Based on the same technical concept, an embodiment of the present application further provides an audio processing apparatus, as shown in fig. 3, the apparatus including:
an obtaining module 301, configured to obtain a first audio frame set of a target audio and a second audio frame set of a reference audio;
an alignment module 302, configured to perform alignment processing in a time domain dimension on semantic features in the first audio frame set according to the semantic features in the second audio frame set, to obtain a target audio frame set corresponding to the first audio frame set;
a first determining module 303, configured to determine a first pitch set corresponding to the target audio frame set, and determine a second pitch set corresponding to the second audio frame set;
a second determination module 304 to determine an adjustment strategy based on the first pitch set and the second pitch set;
an adjusting module 305, configured to adjust the pitch of the target audio by using the adjusting policy.
Optionally, the second determining module is specifically configured to:
determining a first mean value corresponding to the first pitch set and a second mean value corresponding to the second pitch set;
taking the absolute value of the difference value of the first average value and the second average value as a pitch difference value;
determining the adjustment policy based on the pitch-difference value and the second pitch set.
Optionally, the second determining module is further configured to:
judging whether the pitch difference value is larger than a preset pitch threshold value or not;
if the pitch difference value is larger than the preset pitch threshold value, acquiring a target parameter, and determining a first target pitch set based on the second pitch set and the target parameter, wherein the first target pitch set is used for adjusting the pitch of the target audio;
and if the pitch difference value is smaller than or equal to the preset pitch threshold value, taking the second pitch set as the first target pitch set.
Optionally, the second determining module is further configured to:
taking the sum of the second pitch and the target parameter as a first target pitch for each second pitch in the second pitch set if the first mean is greater than the second mean, resulting in the first target pitch set;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the target parameter from the second pitch as the first target pitch, resulting in the first target pitch set.
Optionally, the apparatus further comprises an input module, configured to:
taking a product result of an adjustment value and a preset parameter as the target parameter under the condition of receiving the adjustment value input by the object;
and taking the preset parameter as the target parameter under the condition that an object input adjustment value is not received.
Optionally, the second determining module is further configured to:
if the first average value is greater than the second average value, taking the sum of the second pitch and the pitch difference value as a second target pitch for each second pitch in the second pitch set, resulting in a second target pitch set, which is used for adjusting the pitch of the target audio;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the pitch difference value from the second pitch as the second target pitch, resulting in the second target pitch set.
Optionally, the alignment module is specifically configured to:
extracting a first semantic feature from the first audio frame set and extracting a second semantic feature from the second audio frame set;
inputting the first semantic features and the second semantic features into a sequence matching model so that the sequence matching model outputs an alignment result;
inputting the alignment result and the first audio frame set to a time domain adjustment model, so that the time domain adjustment model outputs the target audio frame set.
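The internals of the sequence matching model are not disclosed in the text. As an illustrative stand-in only, classic dynamic time warping (DTW) can align two per-frame feature sequences; the sketch below (all names assumed, features simplified to one value per frame) returns (target frame, reference frame) index pairs:

```python
def dtw_align(seq_a, seq_b):
    """Align two 1-D per-frame feature sequences with dynamic time warping.

    Returns the warping path as (i, j) pairs mapping frames of seq_a
    onto frames of seq_b. Hypothetical stand-in for the sequence
    matching model; a learned model could replace this step.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i frames
    # of seq_a with the first j frames of seq_b
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```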
In the embodiment of the application, a first audio frame set of a target audio and a second audio frame set of a reference audio are first obtained; then, according to the semantic features in the second audio frame set, the semantic features in the first audio frame set are aligned in the time domain dimension to obtain a target audio frame set corresponding to the first audio frame set; a first pitch set corresponding to the target audio frame set and a second pitch set corresponding to the second audio frame set are determined; finally, an adjustment strategy is determined based on the first pitch set and the second pitch set, and the pitch of the target audio is adjusted using that strategy. Because the adjustment strategy considers the pitch of the target audio as well as the pitch of the reference audio, it avoids the distortion that arises when the user's own pitch is ignored during adjustment.
Based on the same technical concept, an embodiment of the present application further provides an electronic device, as shown in fig. 4, including a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 communicate with one another through the communication bus 114,
a memory 113 for storing a computer program;
the processor 111, when executing the program stored in the memory 113, implements the following steps:
acquiring a first audio frame set of a target audio and a second audio frame set of a reference audio;
performing alignment processing on the semantic features in the first audio frame set in a time domain dimension according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set;
determining a first pitch set corresponding to the target audio frame set, and determining a second pitch set corresponding to the second audio frame set;
determining an adjustment strategy based on the first pitch set and the second pitch set;
and adjusting the pitch of the target audio by utilizing the adjusting strategy.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment provided by the present application, there is also provided a computer-readable storage medium having a computer program stored therein, the computer program, when executed by a processor, implementing the steps of any of the audio processing methods described above.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the audio processing methods of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of audio processing, the method comprising:
acquiring a first audio frame set of a target audio and a second audio frame set of a reference audio;
performing alignment processing on the semantic features in the first audio frame set in a time domain dimension according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set;
determining a first pitch set corresponding to the target audio frame set, and determining a second pitch set corresponding to the second audio frame set;
determining an adjustment strategy based on the first pitch set and the second pitch set;
and adjusting the pitch of the target audio by utilizing the adjusting strategy.
2. The method of claim 1, wherein determining an adjustment policy based on the first pitch set and the second pitch set comprises:
determining a first mean value corresponding to the first pitch set and a second mean value corresponding to the second pitch set;
taking the absolute value of the difference value of the first average value and the second average value as a pitch difference value;
determining the adjustment policy based on the pitch-difference value and the second pitch set.
3. The method of claim 2, wherein the determining the adjustment policy based on the pitch-difference value and the second pitch set comprises:
judging whether the pitch difference value is larger than a preset pitch threshold value or not;
if the pitch difference value is larger than the preset pitch threshold value, acquiring a target parameter, and determining a first target pitch set based on the second pitch set and the target parameter, wherein the first target pitch set is used for adjusting the pitch of the target audio;
and if the pitch difference value is smaller than or equal to the preset pitch threshold value, taking the second pitch set as the first target pitch set.
4. The method of claim 3, wherein determining a first target pitch set based on the second pitch set and the target parameters comprises:
taking the sum of the second pitch and the target parameter as a first target pitch for each second pitch in the second pitch set if the first mean is greater than the second mean, resulting in the first target pitch set;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the target parameter from the second pitch as the first target pitch, resulting in the first target pitch set.
5. The method of claim 3, further comprising:
taking a product result of an adjustment value and a preset parameter as the target parameter under the condition of receiving the adjustment value input by the object;
and taking the preset parameter as the target parameter under the condition that an object input adjustment value is not received.
6. The method of claim 2, wherein the determining the adjustment policy based on the pitch-difference value and the second pitch set comprises:
if the first average value is greater than the second average value, taking the sum of the second pitch and the pitch difference value as a second target pitch for each second pitch in the second pitch set, resulting in a second target pitch set, which is used for adjusting the pitch of the target audio;
in a case where the first mean is smaller than the second mean, for each second pitch in the second pitch set, taking a difference obtained by subtracting the pitch difference value from the second pitch as the second target pitch, resulting in the second target pitch set.
7. The method according to claim 1, wherein the performing alignment processing in a time domain dimension on the semantic features in the first audio frame set according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set comprises:
extracting a first semantic feature from the first audio frame set and extracting a second semantic feature from the second audio frame set;
inputting the first semantic features and the second semantic features into a sequence matching model so that the sequence matching model outputs an alignment result;
inputting the alignment result and the first audio frame set to a time domain adjustment model, so that the time domain adjustment model outputs the target audio frame set.
8. An audio processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a first audio frame set of target audio and a second audio frame set of reference audio;
the alignment module is used for performing alignment processing on a time domain dimension on the semantic features in the first audio frame set according to the semantic features in the second audio frame set to obtain a target audio frame set corresponding to the first audio frame set;
a first determining module, configured to determine a first pitch set corresponding to the target audio frame set, and determine a second pitch set corresponding to the second audio frame set;
a second determination module to determine an adjustment strategy based on the first pitch set and the second pitch set;
and the adjusting module is used for adjusting the pitch of the target audio by utilizing the adjusting strategy.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202111032567.3A 2021-09-03 2021-09-03 Audio processing method and device, electronic equipment and readable storage medium Active CN113470699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111032567.3A CN113470699B (en) 2021-09-03 2021-09-03 Audio processing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113470699A true CN113470699A (en) 2021-10-01
CN113470699B CN113470699B (en) 2022-01-11

Family

ID=77867395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111032567.3A Active CN113470699B (en) 2021-09-03 2021-09-03 Audio processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113470699B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150348566A1 (en) * 2012-12-20 2015-12-03 Seoul National University R&Db Foundation Audio correction apparatus, and audio correction method thereof
CN106057208A (en) * 2016-06-14 2016-10-26 科大讯飞股份有限公司 Audio correction method and device
CN110675886A (en) * 2019-10-09 2020-01-10 腾讯科技(深圳)有限公司 Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN112185342A (en) * 2020-09-29 2021-01-05 标贝(北京)科技有限公司 Voice conversion and model training method, device and system and storage medium
CN112447182A (en) * 2020-10-20 2021-03-05 开放智能机器(上海)有限公司 Automatic sound modification system and sound modification method
CN112992110A (en) * 2021-05-13 2021-06-18 杭州网易云音乐科技有限公司 Audio processing method, device, computing equipment and medium

Also Published As

Publication number Publication date
CN113470699B (en) 2022-01-11

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
JP6325640B2 (en) Equalizer controller and control method
CN110364140B (en) Singing voice synthesis model training method, singing voice synthesis model training device, computer equipment and storage medium
WO2020237769A1 (en) Accompaniment purity evaluation method and related device
BR122020006972B1 (en) Volume normalization method based on a target volume value, audio processing apparatus configured to normalize volume based on a target volume value, and machine-readable computer-implemented method storage device
US10971125B2 (en) Music synthesis method, system, terminal and computer-readable storage medium
US10854182B1 (en) Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
CN106898339B (en) Song chorusing method and terminal
CN112382257B (en) Audio processing method, device, equipment and medium
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
WO2023207472A1 (en) Audio synthesis method, electronic device and readable storage medium
CN112420015A (en) Audio synthesis method, device, equipment and computer readable storage medium
CN114491140A (en) Audio matching detection method and device, electronic equipment and storage medium
WO2020098107A1 (en) Detection model-based emotions analysis method, apparatus and terminal device
CN113470699B (en) Audio processing method and device, electronic equipment and readable storage medium
CN110739006A (en) Audio processing method and device, storage medium and electronic equipment
CN114302301B (en) Frequency response correction method and related product
CN112992110B (en) Audio processing method, device, computing equipment and medium
CN112309351A (en) Song generation method and device, intelligent terminal and storage medium
JP2006178334A (en) Language learning system
CN115273826A (en) Singing voice recognition model training method, singing voice recognition method and related device
CN115394317A (en) Audio evaluation method and device
TW202213152A (en) Model constructing method for audio recognition
CN111627413B (en) Audio generation method and device and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant