CN114464151A - Sound modification method and device

Sound modification method and device

Info

Publication number
CN114464151A
CN114464151A (application CN202210377923.3A)
Authority
CN
China
Prior art keywords
voice data
tone
user
data
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210377923.3A
Other languages
Chinese (zh)
Other versions
CN114464151B (en)
Inventor
高欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202210377923.3A
Publication of CN114464151A
Application granted
Publication of CN114464151B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0008 - Associated control or indicating means
    • G10H1/0091 - Means for obtaining special acoustic effects
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/325 - Musical pitch modification
    • G10H2210/375 - Tempo or beat alterations; Music timing control
    • G10H2210/385 - Speed change, i.e. variations from preestablished tempo, tempo change, e.g. faster or slower, accelerando or ritardando, without change in pitch

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The embodiment of the application provides a sound modification method and a sound modification device, wherein the sound modification method comprises the following steps: acquiring first voice data, wherein the first voice data has the tone of a first object; dividing the first voice data to obtain at least two pieces of second voice data; determining a second object corresponding to the second voice data; wherein the type of at least one second object in the second objects is the first type, and the type of at least one second object is the second type; performing tone conversion on the second voice data to obtain third voice data, wherein the third voice data has the tone of a second object corresponding to the second voice data; fourth voice data is obtained from the third voice data, the fourth voice data having at least a tone of the second object of the first type and a tone of the second object of the second type.

Description

Sound modification method and device
Technical Field
The present application relates to the field of audio technologies, and in particular, to a method and an apparatus for modifying audio.
Background
A user who likes music may install a music application (APP) on an electronic device. The music APP may provide modes (also referred to as functions) such as recording and karaoke, and the user may sing a song through the recording mode or the karaoke mode of the music APP to obtain audio data of the song. However, during singing, the user's intonation and rhythm affect the quality of the song, so how to improve the effect of the recorded song is an urgent problem to be solved.
Disclosure of Invention
The application provides a sound modification method and device, and aims to improve the effect of recording songs by a user. In order to achieve the above object, the present application provides the following technical solutions:
in a first aspect, the present application provides a method for modifying a sound, the method comprising: acquiring first voice data, wherein the first voice data has the tone of a first object; dividing the first voice data to obtain at least two pieces of second voice data; determining a second object corresponding to the second voice data; wherein the type of at least one second object in the second objects is the first type, and the type of at least one second object is the second type; performing tone conversion on the second voice data to obtain third voice data, wherein the third voice data has the tone of a second object corresponding to the second voice data; and at least fusing the third voice data to obtain fourth voice data, wherein the fourth voice data at least has the tone of the second object of the first type and the tone of the second object of the second type. In this embodiment, the electronic device may divide first voice data having a tone of a first object to obtain at least two pieces of second voice data, then determine a second object corresponding to the second voice data, perform tone conversion on the second voice data to obtain third voice data having a tone of the second object, and obtain fourth voice data according to the third voice data, where the fourth voice data at least has a tone of a first type of the second object and a tone of a second type of the second object, that is, the fourth voice data at least has tones of two types of the second object, and an effect of synthesizing one piece of voice data by multiple persons is achieved through tone conversion. The fusion of the third voice data may be a splicing/combining of the third voice data, for example, after the electronic device performs a tone conversion on each piece of the second voice data, all the third voice data are spliced according to the sequence of the second voice data in the first voice data to obtain fourth voice data, where the voice content of the fourth voice data is the same as the voice content of the first voice data, but the tone of the fourth voice data is different from the tone of the first voice data.
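For illustration only, a minimal Python sketch of the flow of the first aspect is given below: divide, convert each piece, then splice. The callables divide_voice, assign_target and convert_timbre are hypothetical placeholders for the division, object-determination and tone-conversion steps described above, not an implementation of the claims.

```python
import numpy as np

def modify_voice(first_voice: np.ndarray, sr: int,
                 divide_voice, assign_target, convert_timbre) -> np.ndarray:
    """Sketch of the claimed flow; the three callables are hypothetical."""
    # Step 1: divide the first voice data into at least two pieces of second voice data.
    segments = divide_voice(first_voice, sr)          # list of (start_sample, end_sample)

    converted = []
    for start, end in segments:
        second_voice = first_voice[start:end]
        # Step 2: determine the second object (target timbre) for this segment.
        second_object = assign_target(second_voice, sr)
        # Step 3: tone conversion -> third voice data with the second object's timbre.
        third_voice = convert_timbre(second_voice, sr, second_object)
        converted.append(third_voice)

    # Step 4: splice all third voice data in the original order -> fourth voice data.
    fourth_voice = np.concatenate(converted)
    return fourth_voice
```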
The first type and the second type can be in various forms, and in one example, the first type and the second type can refer to the tone color of the second object, and the second object is distinguished by the tone color. In another example, the first type and the second type may refer to a gender and/or an age of the second subject, the second subject being distinguished by gender and/or age. For example, an adult male and a juvenile male are different second subjects. In other examples, the first type and the second type may refer to at least one of timbre, gender, and age of the second subject; in other examples, the first type and the second type may refer to an object category of the second object, such as a real person, a virtual person, and so on.
In some examples, one scenario of the sound modification method may be a scenario without the original singing (e.g., a user recording scenario); in other examples, a scenario of the sound modification method may be a scenario with the original singing (e.g., a user karaoke scenario). In both scenarios, the electronic device may obtain voice data (corresponding to the first voice data) of the user, the voice data having the tone of the user (corresponding to the first object). The electronic device may divide the voice data and perform tone conversion on each of the divided parts (corresponding to the second voice data) to obtain voice data (corresponding to the fourth voice data) having the tones of at least two target users, the target users being the second objects. The first voice data may be extracted from one piece of audio data; for example, the audio data is subjected to separation of background music and voice to obtain the first voice data, and the separation is optional. After the tone conversion is completed, the electronic device may synthesize the fourth voice data and the background music data to obtain target audio data, and the target audio data has the tones of at least two types of second objects, so that the purpose of synthesizing one piece of audio data by multiple persons is achieved. If there is no background music in the audio data, the audio data may be regarded as the first voice data.
In one possible implementation, the method further includes: acquiring fifth voice data, wherein the fifth voice data has the tone of a third object, and the fifth voice data and the first voice data correspond to the same content; extracting content parameters from the fifth voice data; based on the content parameters, carrying out sound beautifying processing on the first voice data to obtain sixth voice data; and obtaining seventh voice data based on the fourth voice data and the sixth voice data, wherein the seventh voice data at least has the tone of the second object of the first type and the tone of the second object of the second type, and the content parameters of the seventh voice data are matched with the content parameters extracted from the fifth voice data. The electronic equipment can acquire first voice data and fifth voice data corresponding to the same content, and performs sound beautifying processing on the first voice data by using content parameters of the fifth voice data to obtain sixth voice data, wherein the content parameters of the sixth voice data are matched with content parameters extracted from the fifth voice data; and then, based on the fourth voice data and the sixth voice data, obtaining seventh voice data, so that the seventh voice data can have the timbres of at least two types of second objects, and the content parameters of the seventh voice data are matched with the content parameters extracted from the fifth voice data, wherein the matching means that the content parameters of the two pieces of voice data are similar or close to each other.
For example, the fourth voice data and the sixth voice data are fused, so that the seventh voice data retains the tone of the fourth voice data and the content parameters of the sixth voice data; the content parameters of the seventh voice data can thus be matched with the content parameters of other voice data while the effect of synthesizing one piece of voice data by multiple persons is achieved. For example, in the scenario with the original singing, the electronic device can acquire the original singing voice data (corresponding to the fifth voice data) and perform sound beautifying processing on the voice data of the user by using the content parameters of the original singing voice data, so that the aim of having a song chorused by multiple persons is fulfilled while the probability of going off-key is reduced. In this scenario, the seventh voice data can be synthesized with the background music data to obtain target audio data, which achieves the purpose of having the song sung by multiple persons while reducing the probability of going off-key.
In one possible implementation, the content parameters include: a start position and an end position of each sentence, a start position and an end position of each word, a pronunciation of each word, and a pitch of each word; based on the content parameters, the sound beautifying processing of the first voice data comprises the following steps: obtaining a fundamental frequency and an envelope of each word based on the start position and the end position of the word; obtaining consonant information of each word based on the pronunciation of the word; and adjusting the pitch and the speech rate of the first voice data by using the fundamental frequency of each word, the envelope of each word, the consonant information of each word, the pitch of each word, the start position and the end position of each word, and the start position and the end position of each sentence, so that the pitch of the sixth voice data is matched with the pitch of the third voice data, and the speech rate of the sixth voice data is matched with the speech rate of the third voice data.
Wherein the adjustment of the pitch and the speech rate of the first voice data is performed by adjusting each word and each sentence in the first voice data. For example, by using the fundamental frequency and the envelope of each word, the fundamental frequency and the envelope of the word in the first voice data are adjusted, reducing the difference between the fundamental frequencies of the word in the two pieces of voice data (the first voice data and the third voice data); the consonant information of each word is used to adjust the consonant information of the corresponding word in the first voice data, so that the consonant information of the word is the same in the two pieces of voice data; and the pitch of each word is used to adjust the pitch of the word in the first voice data, reducing the difference between the pitches of the word in the two pieces of voice data. The tone of the first voice data is thus adjusted through the adjustment of the fundamental frequency, the envelope, the consonant information and the pitch. The duration of each word in the first voice data is adjusted by using the start position and the end position of the word, and after the durations of the words in a sentence are adjusted, the duration of the sentence in the first voice data can be adjusted by using the start position and the end position of the sentence, so that the speech rate of the first voice data is adjusted. Since the parameters of each word in the first voice data after the sound-beautifying processing (i.e., the sixth voice data) are consistent with the parameters of the corresponding word in the third voice data, and the parameters of each sentence after the sound-beautifying processing are consistent with the parameters of the corresponding sentence in the third voice data, the first voice data after the sound-beautifying processing can retain the characteristics of the third voice data. In the original singing scene, the first voice data after the sound-beautifying processing can include the characteristics of the original singing voice data and can therefore be close to the original singing voice data, reducing the probability of going off-key.
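For illustration, the per-word pitch and speech-rate adjustment described above can be sketched as follows. This assumes the word boundaries, a per-word fundamental-frequency estimate and the reference word duration are already available from the content parameters, and it uses librosa's pitch_shift and time_stretch as generic stand-ins for the envelope- and consonant-preserving adjustment described in the text.

```python
import numpy as np
import librosa

def beautify_word(user_word: np.ndarray, sr: int,
                  user_f0_hz: float, ref_f0_hz: float,
                  ref_duration_s: float) -> np.ndarray:
    """Nudge one word of the user's voice toward the reference pitch and duration."""
    # Pitch: shift the user's word by the semitone difference to the reference pitch.
    n_steps = 12.0 * np.log2(ref_f0_hz / max(user_f0_hz, 1e-6))
    shifted = librosa.effects.pitch_shift(user_word, sr=sr, n_steps=n_steps)

    # Speech rate: time-stretch the word so its duration matches the reference word.
    user_duration_s = len(shifted) / sr
    rate = user_duration_s / max(ref_duration_s, 1e-6)   # >1 speeds up, <1 slows down
    return librosa.effects.time_stretch(shifted, rate=rate)
```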
In a possible implementation manner, the dividing the first voice data to obtain at least two pieces of second voice data includes: performing voiceprint recognition on the first voice data to determine a recognition result of at least one part of the first voice data; and dividing the first voice data based on at least one part of recognition results to obtain at least two pieces of second voice data, and realizing division of the first voice data through voiceprint recognition. In some examples, the recognition result may be a history object (e.g., a history user) to which each part belongs, and the first voice data is divided based on the history object to which each part belongs. For example, the electronic device may divide a part belonging to one history object into one piece of second voice data, or may divide at least two connected parts belonging to two different history objects into one piece of second voice data, thereby achieving the purpose of dividing one piece of first voice data into a plurality of pieces of second voice data.
In some examples, the recognition result may be gender of the subject to which each portion belongs, and a portion corresponding to one gender of the subject is divided into one piece of the second voice data. For example, the recognition result may be a gender of the user to which each part belongs, such as male or female, and the electronic device may divide the part belonging to the male into one piece of the second voice data. If a portion belonging to a woman is inserted between the plurality of portions belonging to a man, the electronic apparatus may be divided with the portion inserted with a woman as a division point. For example, 0 seconds(s) to 3s belong to male, 3s to 10s belong to male, 10s to 20s belong to female, and 20s to 35s belong to male, the electronic device may divide 0s to 10s into one piece of second voice data, 10s to 20s into one piece of second voice data, and 20s to 35s into one piece of second voice data.
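A minimal sketch of this gender-based division, using the 0 s to 35 s example above; the tuple format (start, end, gender) is an assumption.

```python
from itertools import groupby

# Each part of the first voice data carries the voiceprint recognition result
# for that part, here the gender of the object it belongs to.
parts = [(0, 3, "male"), (3, 10, "male"), (10, 20, "female"), (20, 35, "male")]

def divide_by_recognition(parts):
    """Merge adjacent parts with the same recognition result into one piece
    of second voice data, splitting wherever the result changes."""
    segments = []
    for label, group in groupby(parts, key=lambda p: p[2]):
        group = list(group)
        segments.append((group[0][0], group[-1][1], label))
    return segments

print(divide_by_recognition(parts))
# [(0, 10, 'male'), (10, 20, 'female'), (20, 35, 'male')]  -- matches the example above
```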
In a possible implementation manner, the dividing the first voice data based on at least a part of the recognition result to obtain at least two pieces of second voice data includes: dividing the first voice data based on at least one part of the recognition result to obtain a division result; receiving an adjusting instruction aiming at a dividing result; and responding to the adjusting instruction, and adjusting the division result based on the adjusting parameters in the adjusting instruction to obtain at least two pieces of second voice data. That is, after the electronic device divides the first voice data once according to the recognition result of the voiceprint recognition, the electronic device may divide the division result again and perform the fine division based on the coarse division. For example, the user may divide the division result again, and the user may adjust the duration of the division result so that the second voice data meets the user requirement.
In one possible implementation manner, performing voiceprint recognition on the first voice data to determine a recognition result of at least one part of the first voice data includes: extracting first feature data from the first voice data; calling a voiceprint discrimination model to process the first characteristic data to obtain a voiceprint discrimination result output by the voiceprint discrimination model, wherein the voiceprint discrimination result comprises a recognition result of each part in the first voice data; the voiceprint discrimination model is obtained by training voice data of a plurality of historical objects. The recognition result of each part can be each part of historical objects, or the recognition result of each part is the gender of the object, and the voiceprint recognition is automatically completed through a voiceprint recognition model.
In a possible implementation manner, the dividing the first voice data to obtain at least two pieces of second voice data includes: receiving a dividing instruction for first voice data; and responding to the dividing instruction, and dividing the first voice data based on the dividing parameters in the dividing instruction to obtain at least two pieces of second voice data. For example, the user may divide the first voice data, and the division parameter is a parameter given by the user, so that the user may manually divide the first voice data.
In one possible implementation, the division parameter includes a time parameter and/or a lyric parameter. The time parameter can be a duration manually input by the user, or a duration selected through a time control manually operated by the user, where the time control can be a progress bar; the lyric parameter may be the number of lyric sentences contained in one piece of second voice data, manually input by the user, so the first voice data may be divided by duration and/or lyrics. For example, when the division is performed based on the lyric parameter, X sentences of lyrics may be specified to form one section, where X is a natural number greater than or equal to 1, and the number of lyric sentences contained in each piece of second voice data may be the same or different. Of course, the electronic device may also divide automatically according to the lyrics, such as dividing every two sentences of lyrics into one piece of second voice data, or making the number of lyric sentences in each piece of second voice data the same or close.
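A minimal sketch of dividing by the lyric parameter, assuming the lyric sentences are available as a list; mapping each section back to a time range of the first voice data (via the start and end positions of each sentence) is omitted here.

```python
def divide_by_lyrics(lyric_lines, x=2):
    """Divide the lyrics into sections of x sentences each; the last section
    may contain fewer sentences. Each section maps to one piece of second voice data."""
    if x < 1:
        raise ValueError("x must be a natural number >= 1")
    return [lyric_lines[i:i + x] for i in range(0, len(lyric_lines), x)]

lyrics = ["line 1", "line 2", "line 3", "line 4", "line 5"]
print(divide_by_lyrics(lyrics, x=2))
# [['line 1', 'line 2'], ['line 3', 'line 4'], ['line 5']]
```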
In one possible implementation, the time parameter is a time length manually input by a user or a time length selected by a time control manually controlled by the user; the method further comprises the following steps: if the user manually controls the time control to select the duration, outputting prompt information after detecting that one duration is selected; based on the time parameter in the dividing instruction, dividing the first voice data comprises: in response to a confirmation instruction for the prompt information, the first voice data is divided based on the selected duration. The prompt message is a division prompt for prompting whether to divide at the time point. The confirmation instruction indicates that the division is performed at the time point, so that after receiving the confirmation instruction, the electronic device can divide the first voice data based on the time point, thereby achieving the purpose of prompting when dividing.
In a possible implementation manner, the dividing the first voice data to obtain at least two pieces of second voice data includes: performing voiceprint recognition on the fifth voice data to determine a recognition result of at least one part of the fifth voice data, wherein the fifth voice data has the tone of a third object, and the fifth voice data and the first voice data correspond to the same content; and dividing the first voice data based on at least one part of the recognition result to obtain at least two pieces of second voice data, and realizing the division of the first voice data by using the recognition result of the fifth voice data. For example, in an original singing scene, the electronic device may acquire original singing voice data and voice data of a user, perform voiceprint recognition on the original singing voice data to obtain a recognition result of each part in the original singing voice data, and divide the voice data of the user by using the recognition result of each part in the original singing voice data.
In one possible implementation, determining the second object corresponding to the second voice data includes: acquiring a second object determined by the user for the second voice data; or obtaining scores of the timbres of different objects, and selecting the second object corresponding to the second voice data from the objects whose scores meet a preset condition; or acquiring a similarity between the timbre feature of the first object and the timbre feature of a candidate object, and selecting the second object corresponding to the second voice data based on the similarity; or determining the type to which the tone of the first object belongs, and selecting the second object corresponding to the second voice data based on that type. Determination by the user is a manual selection mode, while selection based on the scores, on the similarity, or on the type of the tone is an automatic selection mode.
If the user determines the second object for the second voice data, the user may determine the second object for each piece of the second voice data, or determine the second object according to the gender corresponding to the second voice data, such as the same gender corresponding to one second object. If the second object is determined based on the scores of the timbres of different objects, the electronic device may select from the timbres with higher scores, and may select the second object with the same gender as the first object by considering gender in the selection process, wherein the score may be obtained according to the number of uses, and the score is higher when the number of uses is larger. If based on the timbre feature selection, the electronic device can select a second object with close timbre or larger timbre difference based on the similarity, the close timbre can reduce the abrupt feeling during playing, and the larger timbre difference can better attract the attention of the user during playing. If the selection is based on the type to which the tone belongs, the electronic device may select a second object having a tone of the same type as the tone of the first object.
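For illustration, selecting a second object by timbre similarity could look like the following sketch, assuming a timbre embedding vector is available for the first object and for each candidate second object; cosine similarity is an assumption, since the text only speaks of "similarity".

```python
import numpy as np

def pick_second_object(first_embedding: np.ndarray,
                       candidates: dict[str, np.ndarray],
                       prefer_similar: bool = True) -> str:
    """Select a second object whose timbre is closest to (or farthest from)
    the first object's timbre, using cosine similarity of timbre embeddings."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    scores = {name: cosine(first_embedding, emb) for name, emb in candidates.items()}
    # A similar timbre reduces abruptness during playback; a very different
    # timbre draws more attention, as discussed above.
    key = max if prefer_similar else min
    return key(scores, key=scores.get)
```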
In a possible implementation manner, performing tone conversion on the second voice data to obtain third voice data includes: acquiring the tone characteristic of the second voice data; calling a tone representation model of a second object, and processing tone features to obtain third voice data output by the tone representation model, wherein the tone representation model is obtained by training a plurality of pieces of voice data of the second object, the second object and the tone representation model are in a one-to-one relationship, and because the tone of each object is different, one tone representation model is trained for each second object, so that the tone representation model can learn the tone characteristics of the second object, and the accuracy is improved.
In one possible implementation, obtaining the timbre characteristic of the second speech data comprises: extracting second feature data from the second voice data; and calling a tone extraction model to process the second characteristic data to obtain tone characteristics output by the tone extraction model.
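A minimal sketch of this tone-conversion chain; the three callables stand in for the feature extraction, the tone extraction model and the per-object tone characterization model described above, and are hypothetical.

```python
import numpy as np

def convert_timbre(second_voice: np.ndarray, sr: int,
                   feature_extractor, tone_extraction_model,
                   tone_characterization_model) -> np.ndarray:
    """Sketch of the tone-conversion chain: feature data -> timbre feature ->
    voice data with the second object's timbre."""
    # Extract second feature data (e.g., MFCC features) from the second voice data.
    second_features = feature_extractor(second_voice, sr)
    # The tone extraction model maps the feature data to a timbre feature.
    timbre_feature = tone_extraction_model(second_features)
    # The tone characterization model of the chosen second object (one model per
    # second object) outputs the third voice data with that object's timbre.
    third_voice = tone_characterization_model(timbre_feature)
    return third_voice
```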
In a possible implementation manner, at least fusing the third voice data, and obtaining the fourth voice data includes: and if part of the second voice data in all the second voice data is not subjected to tone conversion, fusing the second voice data which is not subjected to tone conversion in all the second voice data and the third voice data to obtain fourth voice data, wherein the fourth voice data has the tone of the second object and the tone of the first object. The fourth voice data can reserve the tone of the first object, so that the tone of the fourth voice data is more diversified, and the tone characteristic of the first object can be reserved.
In a second aspect, the present application provides an electronic device comprising: a processor and a memory; wherein the memory is configured to store one or more computer program codes, the computer program codes comprising computer instructions, and when the processor executes the computer instructions, the processor executes the above-mentioned sound modifying method.
In a third aspect, the present application provides a computer storage medium comprising computer instructions that, when run on an electronic device, cause the electronic device to perform the above-mentioned sound correction method.
Drawings
FIG. 1 is a hardware block diagram of an electronic device provided by the present application;
FIG. 2 is a software architecture diagram of an electronic device provided by the present application;
FIG. 3 is a schematic diagram of a sound modification method provided by the present application;
FIG. 4 is a flowchart of training a voiceprint discrimination model and a tone characterization model provided by the present application;
FIG. 5 is a flowchart of a sound modification method provided by the present application;
FIG. 6 is a schematic diagram of another sound modification method provided by the present application;
FIG. 7 is a flowchart of the sound modification processing in the sound modification method provided by the present application;
FIGS. 8 to 10 are UI diagrams corresponding to the sound modification method provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes the association relationship of the associated objects, indicating that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
In the embodiments of the present application, "a plurality of" means two or more. It should be noted that, in the description of the embodiments of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
During recording (for example, recording a song), the electronic device may obtain the audio data of the user and may perform post-processing on the audio data, such as sound modification processing. The sound modification processing of the audio data may be adjusting at least one of the timbre and the pitch. The sound modification processing may be performed in the following two ways:
one way is to adjust the timbre of the audio data. For example, the electronic device may receive a tone conversion instruction, load a tone conversion model of the target character, convert first audio data output by the original character into second audio data output by the target character through the tone conversion model, and convert the audio data into the target character. And the electronic equipment plays the voice according to the tone of the target role after obtaining the second audio data of the target role. When audio data conversion is carried out, a user can select a favorite target role, the audio data is converted through a tone conversion model of the favorite target role of the user, the audio played by the electronic equipment is the audio repeated by the target role, the auditory demand of the user is met, and the user experience is improved.
The other way is to adjust the fundamental frequency of the audio data while preserving the timbre of the user. The pitch is determined by how high the fundamental frequency is, so the pitch can be adjusted by adjusting the fundamental frequency.
For example, after the electronic device acquires audio data of a user, extracting fundamental frequency, envelope and consonant information of each word in the audio data of the user through a feature extraction algorithm, wherein each word extracts a preset number of fundamental frequencies, the preset number is determined according to the extraction frequency, and the audio data of the user can be audio data of lyrics of a song pronounced by the user; for each word, adjusting the fundamental frequency of the preset number of words to the pitch frequency of the words in the song, wherein the pitch frequency of each word in the song is the frequency corresponding to the pitch of each word in the song; synthesizing the adjusted fundamental frequency, the envelope of each word and the consonant information to obtain a synthesized audio frequency; according to the duration of each word in the song, the duration of each word in the synthesized audio is adjusted to obtain the synthesized singing voice, so that for a user who does not sing well, the user can pronounce the lyrics of the song, the electronic equipment can obtain the audio data of the song, and then the song close to the voice of the user is synthesized by adjusting the fundamental frequency and the duration of each word.
Because the envelope and the auxiliary information of the voice of the user are reserved in the synthesis process of the electronic equipment, the tone of the user is reserved, the synthesized singing voice is close to the voice of the user, and the voice of the user can be reserved while the voice is beautified. In the synthesis process, the electronic equipment adjusts the fundamental frequency in the audio data to realize the adjustment of the tone. In addition, the electronic equipment can adjust the time length of each word in the audio data.
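One possible way to decompose and resynthesize a word in this manner is a WORLD-style vocoder, sketched below with the pyworld package; mapping the spectral envelope to the "envelope" and the aperiodicity to the consonant-related information is an assumption made here for illustration, not the patent's exact procedure.

```python
import numpy as np
import pyworld as pw

def resynthesize_with_target_pitch(word: np.ndarray, sr: int,
                                   target_f0_hz: float) -> np.ndarray:
    """Analyze one word, replace its voiced fundamental frequency with the
    song's pitch frequency for that word, and resynthesize."""
    x = word.astype(np.float64)
    f0, t = pw.harvest(x, sr)                 # fundamental frequency trajectory
    sp = pw.cheaptrick(x, f0, t, sr)          # spectral envelope (the "envelope")
    ap = pw.d4c(x, f0, t, sr)                 # aperiodicity (rough consonant/noise info)

    # Move voiced frames to the target pitch while keeping unvoiced frames (f0 == 0).
    new_f0 = np.where(f0 > 0, target_f0_hz, 0.0)
    return pw.synthesize(new_f0, sp, ap, sr)
```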
One scenario is that a music APP can be installed on the electronic device, and the music APP can provide modes such as a recording mode and a karaoke mode. Both modes can provide the user with a function of recording songs, through which the audio data of the user can be acquired, and a song can be reproduced from the audio data of the user. After the audio data of the user is obtained, the electronic device can call a sound modification mode in the music APP to modify the audio data of the user. In one example, the electronic device invokes the sound modification mode to adjust the tone of the audio data to the tone of a single target character; the electronic device can adjust the tone of the audio data, but without personalized adjustment the effect of a multi-person chorus cannot be achieved. In another example, the electronic device can invoke a sound beautification mode to beautify the audio data of the user. For example, the electronic device may adjust the fundamental frequency of the audio data and the duration of each word in the audio data via the sound beautification mode. In the process of adjusting the fundamental frequency and the word durations, the electronic device does not change the timbre of the audio data, so the timbre of the user is retained.
In order to solve the technical problem, the application provides a sound modifying method, after audio data of a user are obtained, the audio data of the user are divided into at least two parts, and tone conversion is performed on the at least two parts, so that the audio data of the user have at least two target tones, and the effect of synthesizing songs by multiple persons is achieved through tone conversion. Besides performing tone conversion on at least two parts, the sound modifying method can perform sound modifying processing on the audio data of the user, such as adjusting the pitch of the audio data by adjusting the fundamental frequency of the audio data of the user, and adjusting the speed of speech of the audio data by adjusting the duration of each word.
The sound modifying method can be applied to electronic equipment. In some embodiments, the electronic device may be a cell phone, a tablet, a desktop, a laptop, a notebook, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a Personal Digital Assistant (PDA), a wearable electronic device, a smart watch, or the like. The specific form of the electronic device is not particularly limited in the present application.
As shown in fig. 1, the electronic device may include: a processor, an external memory interface, an internal memory, a Universal Serial Bus (USB) interface, a charging management module, a power management module, a battery, an antenna 1, an antenna 2, a mobile communication module, a wireless communication module, an audio module, a sensor module, a key, a motor, an indicator, a camera, a display screen, a Subscriber Identity Module (SIM) card interface, and the like. The audio module may include a speaker, a receiver, a microphone, an earphone interface, etc., and the sensor module may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic device. In other embodiments, an electronic device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor may include one or more processing units, such as: the processor may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors. The processor is a nerve center and a command center of the electronic equipment, and the controller can generate an operation control signal according to the instruction operation code and the time sequence signal to finish the control of instruction fetching and instruction execution.
The display screen is used for displaying images, videos, a series of graphical user interfaces (GUIs) and the like, such as the interface of the karaoke APP, the sound modification mode in the karaoke APP, the audio data of the user, and so on.
The external memory interface can be used for connecting an external memory card, such as a Micro SD card, so as to expand the storage capability of the electronic device. The external memory card communicates with the processor through the external memory interface to realize the data storage function. For example, a model used in the sound correction method is stored in the external memory card. The internal memory may be used to store computer-executable program code, which includes instructions. The processor executes various functional applications of the electronic device and data processing by executing instructions stored in the internal memory. For example, in the present application, the processor causes the electronic device to execute the sound modifying method provided in the present application by executing the instructions stored in the internal memory.
The electronic device may implement audio functions through an audio module, a speaker, a receiver, a microphone, an earphone interface, an application processor, and the like. Such as music playing, recording, etc.
The audio module is used for converting digital audio information into analog audio signals to be output and converting the analog audio input into digital audio signals. The audio module may also be used to encode and decode audio signals. In some embodiments, the audio module may be disposed in the processor, or a portion of the functional modules of the audio module may be disposed in the processor.
Loudspeakers, also known as "horns," are used to convert electrical audio signals into sound signals. The electronic device can listen to music through a loudspeaker, or listen to a hands-free call, or play a song, etc.
Receivers, also called "earpieces", are used to convert electrical audio signals into sound signals. When the electronic device answers a call or plays voice audio data, the audio can be heard by placing the receiver close to the ear.
Microphones, also called "microphones" or "microphones", are used to convert sound signals into electrical signals, where the sound signals and the electrical signals carry audio data of a user. For example, when making a call or transmitting audio data, a user may input a sound signal into the microphone by speaking the user's mouth near the microphone. The electronic device may be provided with at least one microphone. In other embodiments, the electronic device may be provided with two microphones to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device may further include three, four, or more microphones to collect sound signals and reduce noise, and may further identify sound sources and implement directional recording functions.
The earphone interface is used for connecting a wired earphone. The earphone interface may be a USB interface, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The wireless communication function of the electronic device can be realized by the antenna 1, the antenna 2, the mobile communication module, the wireless communication module, the modem processor, the baseband processor and the like. The electronic device may download audio data using a wireless communication function, and the processor may train a voiceprint discrimination model and a tone characterization model based on the downloaded audio data. The electronic equipment can call the voiceprint discrimination model and the tone representation model to implement the sound modification method.
In addition, an operating system runs on the above components, for example, the iOS operating system developed by Apple, the open-source Android operating system developed by Google, or the Windows operating system developed by Microsoft. Application programs may be installed and run on the operating system.
The operating system of the electronic device may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of an electronic device. Fig. 2 is a block diagram of the hardware and software architecture of the electronic device. The software structure adopts a layered architecture, the layered architecture divides the software into a plurality of layers, and each layer has clear roles and division of labor. The layers communicate with each other through a software interface. Taking the Android system as an example, in some embodiments, the Android system is divided into four layers, which are an application layer, an application Framework layer (Framework), a Hardware Abstraction Layer (HAL), and a system Kernel layer (Kernel) from top to bottom.
Wherein the application layer may include a series of application packages. The application packages may include APPs such as camera, gallery, calendar, call, map, WLAN, music, video, recording and karaoke. The application framework layer provides an application programming interface (API) and a programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions. For example, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The HAL may contain a plurality of library modules and a plurality of models, where the library modules and models can be invoked. For example, the HAL includes a voiceprint discrimination model and a timbre characterization model. The recording APP and the karaoke APP in the application layer can call the voiceprint discrimination model and the tone characterization model while running. For example, the recording APP and the karaoke APP can acquire the audio data of a user through a microphone; the voiceprint discrimination model can identify the voiceprint of the audio data, and the voiceprint is used for dividing the audio data into at least two parts; the tone characterization model can then perform tone conversion on the divided parts. In this way, the sound modification method is implemented by calling the voiceprint discrimination model and the tone characterization model. The system kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The voiceprint discrimination model can use a basic network model such as a convolutional neural network (CNN), a long short-term memory network (LSTM), or a convolutional recurrent neural network (CRNN). The tone characterization model may use a generative adversarial network (GAN). For example, in some examples, the electronic device may select a CRNN model as the voiceprint discrimination model and train it using training samples; the trained CRNN model can learn the voiceprint characteristics of different users and identify the user identity to which a piece of audio data belongs based on those characteristics, so the trained CRNN model can be used as the voiceprint discrimination model. For the tone characterization model, the electronic device may select a GAN model and train it using training samples; the trained GAN model can learn the timbre characteristics of a user to complete the conversion from one timbre to another, so the trained GAN model can be used as the tone characterization model. The training samples of the voiceprint discrimination model and the training samples of the tone characterization model may be the same or different.
The sound modification method, the voiceprint discrimination model and the tone representation model provided by the application are explained below by combining scenes. Referring to fig. 3, a schematic diagram of a sound modifying method provided in an embodiment of the present application is shown, where the sound modifying method shown in fig. 3 is directed to a scene without original singing, and the scene without original singing may be a user recording scene. For example, the user records a song scene, the original song is not played during the recording process of the user, or the user records a new song, etc. The voice modification method can comprise a preparation stage and a use stage, and in the preparation stage, the electronic equipment completes training of a voiceprint discrimination model and a tone characterization model. In the using stage, the electronic device may call the tone representation model to perform tone conversion on the audio data of the user, and may call the voiceprint discrimination model to divide the audio data of the user into at least two parts.
The voiceprint discrimination model and the tone characterization model can use the audio data of different historical users as training samples. The audio data of the historical users can be collected while the historical users sing, and the historical users can include stars (such as original singers, cover singers and actors), internet singers, livestreamers, virtual characters, ordinary users, and the like. The electronic device can acquire a plurality of pieces of audio data of different historical users, and can also acquire a plurality of pieces of audio data of the same historical user. The electronic device trains the voiceprint discrimination model by using a plurality of pieces of audio data of different historical users, so that the voiceprint discrimination model can learn the voiceprint characteristics of each historical user, obtaining a voiceprint discrimination model matched with the voiceprint characteristics of a plurality of historical users; and it trains the tone characterization model by using a plurality of pieces of audio data of any one historical user, so that the tone characterization model can learn the timbre characteristics of that historical user, obtaining a tone characterization model matched with the timbre characteristics of that historical user.
That is, the relationship between the voiceprint recognition model and the historical users is that one voiceprint recognition model corresponds to a plurality of historical users, and the electronic device can train the voiceprint recognition model by using the audio data of the plurality of historical users. The relationship between the tone characterization model and the historical user is that one tone characterization model corresponds to one historical user, that is, the tone characterization model and the historical user are in a one-to-one relationship, and the electronic device can train the tone characterization model matched with the tone characteristics of the historical user by using the audio data of any historical user as a training sample. The process of the electronic device training the voiceprint recognition model and the tone characterization model with the audio data of different historical users can be seen in fig. 4, and includes the following steps:
s101, separating background music and human voice of the audio data of each historical user to obtain voice data of the historical user. The voice data of the historical user can restore the voice of the historical user.
And S102, carrying out identity labeling on the voice data so as to mark the historical user identity corresponding to each part in the voice data. The historical user identity label may be the name of the historical user, or it may be the gender of the historical user.
S103, extracting mel-frequency cepstral coefficient (MFCC) features from the voice data, and extracting Fbank (filter bank) features from the voice data. Both the MFCC features and the Fbank features can reflect the voiceprint characteristics of historical users. The amount of information in the Fbank features is larger than that in the MFCC features, because a discrete cosine transform (DCT) is applied in the process of extracting the MFCC features, which loses some of the information in the voice data; the accuracy of the MFCC features is therefore lower than that of the Fbank features. The Fbank features are used in training the voiceprint discrimination model, and the MFCC features are used in training the tone characterization model.
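For illustration, both feature types can be extracted, for example, with librosa as sketched below; the library choice and the parameter values (n_mels, n_mfcc) are assumptions, not specified by the patent.

```python
import numpy as np
import librosa

def extract_features(voice: np.ndarray, sr: int):
    """Extract Fbank (log-mel filter bank) features and MFCC features from voice data."""
    # Fbank: mel filter-bank energies in dB; no DCT is applied, so more
    # information is kept than in the MFCCs (as noted above).
    mel = librosa.feature.melspectrogram(y=voice, sr=sr, n_mels=80)
    fbank = librosa.power_to_db(mel)

    # MFCC: a DCT is applied on top of the log-mel energies, discarding some detail.
    mfcc = librosa.feature.mfcc(y=voice, sr=sr, n_mfcc=20)
    return fbank, mfcc
```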
S104, inputting the Fbank characteristics into a voiceprint discrimination model, outputting a voiceprint discrimination result by the voiceprint discrimination model, wherein the voiceprint discrimination result can indicate historical users to which each part in the voice data belongs.
S105, adjusting model parameters of the voiceprint discrimination model based on the voiceprint discrimination result (such as the predicted historical user identity), the labeled historical user identity and the loss function.
The electronic equipment completes training of the voiceprint recognition model through multiple times of adjustment of model parameters of the voiceprint recognition model. After the training of the voiceprint recognition model is completed, the voiceprint recognition model can learn the voiceprint characteristics of each historical user, so that when the voiceprint recognition model is used for voiceprint recognition, the voiceprint recognition model can recognize whether the identity of the user to which the voice data belongs is the historical user during training, and if so, the voiceprint recognition model can output the identity of the historical user to which the voice data belongs; if not, the output of the voiceprint recognition model is null. Therefore, after the electronic equipment acquires the audio data of the user, the electronic equipment can separate background music from human voice of the audio data of the user to obtain the voice data of the user, extract Fbank characteristics from the voice data, and then call a voiceprint discrimination model to identify the user identities of all parts in the voice data to obtain the user identity identifications of all parts in the voice data.
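A toy sketch of one training step of such a voiceprint model in PyTorch is shown below; the GRU-based architecture, the layer sizes and the dummy batch are assumptions standing in for the CRNN and the real Fbank features.

```python
import torch
import torch.nn as nn

class VoiceprintModel(nn.Module):
    """Toy stand-in for the voiceprint discrimination model: it maps a sequence
    of Fbank frames to logits over the historical user identities."""
    def __init__(self, n_mels: int = 80, hidden: int = 128, n_users: int = 10):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_users)

    def forward(self, fbank):                 # fbank: (batch, frames, n_mels)
        _, h = self.rnn(fbank)
        return self.head(h[-1])               # logits over historical users

model = VoiceprintModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step (S104/S105): predict the labelled historical user identity
# from the Fbank features and adjust the model parameters with the loss.
fbank_batch = torch.randn(4, 200, 80)         # dummy Fbank features
labels = torch.tensor([0, 3, 1, 7])           # labelled historical user identities
loss = criterion(model(fbank_batch), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```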
S106, inputting the MFCC features into the tone extraction model, and outputting the tone features by the tone extraction model.
And S107, inputting the tone characteristics into a generator of the tone characterization model, and outputting audio data by the generator.
And S108, inputting the audio data output by the generator into a discriminator of the tone characterization model, wherein the discriminator outputs a discrimination result, and the discrimination result can indicate the difference between the audio data output by the generator and the audio data of the historical user to which the MFCC features belong. In one example, the determination result indicates whether the audio data output by the generator is real audio data indicating that the audio data output by the generator is the same as/similar to the audio data of the historical user to which the MFCC feature belongs or fake audio data indicating that the audio data output by the generator is different from/dissimilar to the audio data of the historical user to which the MFCC feature belongs.
And S109, adjusting the model parameters of the generator and the discriminator based on the discrimination result.
The loss function of the generator may be:
Figure DEST_PATH_IMAGE002
(ii) a The penalty function of the arbiter may be:
Figure DEST_PATH_IMAGE004
. WhereinDThe presence of the discriminator is indicated by the expression,
Figure DEST_PATH_IMAGE006
indicating the discrimination result outputted by the discriminator when the input of the discriminator is real audio data,Ga representation generator for generating a representation of the object,
Figure DEST_PATH_IMAGE008
representing the audio data output by the generator,
Figure DEST_PATH_IMAGE010
indicating the discrimination result outputted by the discriminator when the input of the discriminator is the counterfeit audio data,abandcis a constant number of times that the number of the first,abandcmay be arranged in a manner ofb=c=1,aAnd (4) = -1. In the loss function of the discriminator,
Figure 628323DEST_PATH_IMAGE006
the value of (a) is close to 1,
Figure 951988DEST_PATH_IMAGE010
the value of (a) is close to-1. In the loss function of the generator it is,
Figure 262883DEST_PATH_IMAGE010
is close to 1. The real audio data refers to that the audio data generated by the generator is matched with/close to the audio data of the historical user; the forged audio data means that the audio data generated by the generator does not match/approach the audio data of the historical user.
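Reading these as least-squares losses (an assumption consistent with the targets b = c = 1 and a = -1), a minimal PyTorch sketch is:

```python
import torch

a, b, c = -1.0, 1.0, 1.0   # constants as set above

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Push D's output toward b for real audio data and toward a for generated audio.
    return 0.5 * torch.mean((d_real - b) ** 2) + 0.5 * torch.mean((d_fake - a) ** 2)

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Push D's output on the generated audio toward c, i.e. toward "real".
    return 0.5 * torch.mean((d_fake - c) ** 2)
```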
The electronic device completes the training of the tone characterization model through multiple adjustments of the model parameters of the generator and the discriminator. After the training of the tone characterization model is completed based on a plurality of pieces of audio data of a historical user, the discriminator has improved its ability to distinguish real audio data from forged audio data, and the generator has learned the timbre characteristics of the historical user, so that the generator outputs audio data matching the timbre of the historical user. Such audio data is equivalent to the audio data the electronic device would acquire when the historical user speaks, so the generated audio can pass for the real thing.
The electronic device completes training of the voiceprint discrimination model and the tone representation model using the flow shown in fig. 4. After acquiring a user's audio data, the electronic device may modify the currently acquired audio data, calling the voiceprint discrimination model and the tone representation model during the modification. A flow chart of this sound modifying method is shown in fig. 5 and may include the following steps:
S201, the electronic device acquires the audio data of a user. In a scene without an original singing, the electronic device can run a recording APP, which can record audio data such as songs. While the recording APP is running, the electronic device can call a microphone to collect the user's audio data.
S202, separating the background music and the human voice in the currently acquired audio data of the user to obtain the user's voice data. In this embodiment, step S202 is optional: if the user's audio data contains no background music, i.e., the audio data is already the user's voice data, the electronic device may skip step S202 and execute step S203.
S203, extracting Fbank features from the voice data of the user.
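As an illustration of this step, the following is a minimal Fbank (log-mel filterbank) extraction sketch using librosa; the 16 kHz sampling rate, 25 ms window, 10 ms hop and 80 mel bands are assumptions, since the patent does not fix these parameters.

```python
import librosa
import numpy as np

def fbank_features(voice: np.ndarray, sr: int = 16000, n_mels: int = 80,
                   frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Log-mel filterbank (Fbank) features with shape (frames, n_mels)."""
    n_fft = int(sr * frame_ms / 1000)   # 25 ms analysis window (assumed)
    hop = int(sr * hop_ms / 1000)       # 10 ms frame shift (assumed)
    mel = librosa.feature.melspectrogram(y=voice, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-6).T
```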
S204, inputting the Fbank features into the voiceprint discrimination model to obtain the voiceprint discrimination result output by the model, where the voiceprint discrimination result can indicate the historical user to whom each part of the voice data belongs. In other words, each part of the voice data can be mapped to a historical user through the voiceprint discrimination model, so that the user identities of all parts of the voice data are labeled automatically.
In one example, the voiceprint discrimination model records the voiceprint feature vectors of the historical users, and the voiceprint discrimination result is obtained using these vectors. For example, the electronic device may extract Fbank features from each part of the voice data and input them into the voiceprint discrimination model. The model generates a voiceprint feature vector for the part to which the Fbank features belong and performs a distance calculation between this vector and the recorded voiceprint feature vectors of all historical users to determine whether there is a match. If there is a match, the user identity of that part is the historical user corresponding to the matched voiceprint feature vector, so the part is labeled with that historical user, completing the voiceprint recognition of the part.
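A minimal sketch of this distance-matching step is given below; cosine similarity and the 0.75 threshold are assumptions (the text only says a distance calculation is performed), and `enrolled` stands for the recorded voiceprint feature vectors of the historical users.

```python
import numpy as np

def match_voiceprint(embedding: np.ndarray, enrolled: dict, threshold: float = 0.75):
    """Return the historical user whose recorded voiceprint vector best matches, or None."""
    best_user, best_score = None, -1.0
    for user, vec in enrolled.items():
        score = float(np.dot(embedding, vec) /
                      (np.linalg.norm(embedding) * np.linalg.norm(vec)))
        if score > best_score:
            best_user, best_score = user, score
    return best_user if best_score >= threshold else None
```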
In another example, the voiceprint discrimination result can be the historical user identity predicted by the voiceprint discrimination model, such as the name of the predicted historical user; for each part of the voice data, the model identifies the name of the historical user to whom the part corresponds. For example, for any part of the voice data, the voiceprint discrimination model outputs the probability that the part belongs to each historical user, and the historical user with the highest probability is taken as the user identity of that part.
S205, dividing the user's voice data into at least two parts according to the voiceprint discrimination result output by the voiceprint discrimination model.
In one example, the voiceprint discrimination result indicates the historical user to whom each part of the voice data belongs. When the voice data is divided, it is divided according to these historical users: for example, the parts belonging to one historical user are grouped into an independent paragraph, so that parts belonging to different historical users are separated and the automatic division of the voice data is completed.
In another example, the voiceprint discrimination result serves as a reference for the division: the electronic device outputs the voiceprint discrimination result, and the user divides the voice data with reference to it. For example, if 0 seconds (s) to 3 s of the voice data belong to historical user A, 3 s to 10 s to historical user B, 10 s to 20 s to historical user A, and 20 s to 35 s to historical user C, the electronic device may divide the voice data in this manner, or the division may be adjusted on this basis, for example by dividing 0 s to 20 s into one part and 20 s to 35 s into another.
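One possible realization of the automatic division described above is sketched here: consecutive parts attributed to the same historical user are merged into a single segment. The per-frame labels and the 10 ms hop are illustrative assumptions, not details from the patent.

```python
def split_by_speaker(frame_labels, hop_s: float = 0.01):
    """Group consecutive frames with the same predicted user into (start_s, end_s, user) segments."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start * hop_s, i * hop_s, frame_labels[start]))
            start = i
    return segments
```

For the example above, frame labels covering 0 s to 3 s, 3 s to 10 s, 10 s to 20 s and 20 s to 35 s would yield four segments attributed to users A, B, A and C.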
S206, acquiring the target user selected by the user for each divided part. The target user is one of the historical users; the user can select one target user for each divided part of the voice data, i.e., each divided part corresponds to one target user, and the tone of that target user is the target tone of the part. The target users corresponding to different divided parts of the voice data may be the same or different.
S207, calling the tone representation model of the target user, taking the part of the voice data corresponding to that target user as input, and obtaining the target voice data output by the model; the tone of the target voice data is the tone of the target user, completing the conversion of that part of the voice data from one user's tone to the target user's tone. For example, for each part divided from the voice data, the electronic device extracts the MFCC features of the part, inputs them into the tone extraction model, which outputs the tone features of the part, then inputs the tone features together with the part itself into the target user's tone representation model and obtains the target voice data it outputs, completing the tone conversion of that part of the voice data.
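The per-part pipeline of step S207 might look like the sketch below, where `timbre_extractor` and `timbre_model` are hypothetical callables standing in for the tone extraction model and the target user's tone representation model; the 20 MFCC coefficients are likewise an assumption.

```python
import librosa
import numpy as np

def convert_segment(segment: np.ndarray, sr: int, timbre_extractor, timbre_model) -> np.ndarray:
    """Convert one divided part of the voice data to the target user's tone (illustrative only)."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=20)  # MFCC features of the part
    timbre = timbre_extractor(mfcc)                           # tone features of the part
    return timbre_model(segment, timbre)                      # target voice data in the target tone
```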
And S208, combining the pieces of target voice data to obtain the target audio data. The target audio data contains each piece of target voice data, and the tone of each piece is the tone of its target user, so the target audio data has at least the tones of the target users. The target audio data can be output as the user's audio data, so that the electronic device outputs audio data having at least the target users' tones, completing the tone conversion of the audio data.
For example, in a scenario where a user records a song, the electronic device collects the user's audio data, which has the user's own tone; if the user records the song alone, the audio data has a single tone. The electronic device can perform tone conversion on this audio data using the sound modifying method shown in fig. 5 to obtain target audio data having at least the tone of the target user; the target audio data may have the tones of at least two target users, so that after processing with the method of fig. 5 the electronic device obtains one piece of target audio data carrying the tones of at least two target users, achieving the effect of a multi-user chorus in a single-user song recording scene.
In the above sound modifying method, after the electronic device collects a user's audio data, it separates the background music from the human voice in the currently collected audio data to obtain the user's voice data. The electronic device calls the voiceprint discrimination model to perform voiceprint discrimination on the voice data, and divides the voice data into at least two parts using the voiceprint discrimination result. After target users are selected for the divided parts, the tone representation model of each target user is called to perform tone conversion on the part corresponding to that target user, producing the target voice data output by each model; the pieces of target voice data are then combined into the target audio data, which the electronic device can output. Since the target audio data can carry the tones of at least two target users, the electronic device can output audio in at least two target users' tones, converting the audio data from a single tone to multiple tones and achieving the effect of a multi-person recording.
Referring to fig. 6, which shows a schematic diagram of another sound modifying method provided in an embodiment of the present application, the method shown in fig. 6 is directed to a scene with an original singing. The original-singing scene may be a user recording scene in which the original singing is played while the user records, or a user karaoke scene, and the like. The sound modifying method shown in fig. 6 includes a preparation stage and a use stage. In the preparation stage, the electronic device completes training of the voiceprint discrimination model and the tone representation model; refer to the flow shown in fig. 4, which is not repeated here. In the use stage, the electronic device calls the tone representation model to perform tone conversion on the user's audio data, and calls the voiceprint discrimination model to divide the user's audio data into at least two parts whose tone conversion is completed by the tone representation model; for the use of these models, refer to the flow illustrated in fig. 5, which is not repeated here. In addition to the tone conversion, the electronic device may perform sound beautifying processing on the user's audio data during the use stage. The electronic device then synthesizes the tone-converted audio data and the sound-beautified audio data and outputs the synthesized audio data.
When the sound modifying method shown in fig. 6 is implemented, the electronic device may start two threads: one thread performs the tone conversion, and the other performs the sound beautifying processing. The tone conversion can be carried out after the audio data has been fully collected, while the sound beautifying processing can run during collection, so that it proceeds while the audio is being captured and is finished once the complete audio data has been acquired. The electronic device can then devote more resources to the tone conversion, improving the processing efficiency of the audio data. In the original-singing scene, the electronic device acquires two pieces of audio data: the user's audio data and the original singing's audio data. The tone conversion and the sound beautifying processing are applied to the user's audio data.
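The two-thread arrangement could be organized as in the following sketch; `capture_stream`, `beautify_fn` and `convert_fn` are placeholders rather than APIs described in the patent.

```python
import threading

def process_recording(capture_stream, convert_fn, beautify_fn):
    """Beautify frames while recording; run tone conversion once the full take is captured."""
    frames, beautified = [], []

    def beautify_worker():
        for frame in capture_stream:            # runs while audio is still being captured
            frames.append(frame)
            beautified.append(beautify_fn(frame))

    worker = threading.Thread(target=beautify_worker)
    worker.start()
    worker.join()                               # recording (and beautifying) finished
    converted = convert_fn(frames)              # tone conversion on the complete audio data
    return converted, beautified
```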
For example, in a user karaoke scene, when the user sings a song with karaoke software, the electronic device starts two threads. One thread starts after the electronic device has collected the user's complete audio data; it divides the user's audio data into at least two parts and then performs tone conversion on those parts. The other thread performs sound beautifying processing on the user's audio data while it is being collected. After the tone conversion and the sound beautifying processing are completed, the electronic device synthesizes the tone-converted audio data and the sound-beautified audio data and outputs the result, which serves as the audio data of the song sung by the user. The tone conversion gives the audio data multiple tones, and the sound beautifying processing makes the audio features other than tone match those of the original singing, so the song's audio data can have multiple tones while its other audio features match the original. When the electronic device plays this audio data, the song is rendered with multiple tones and the original singing's audio features, reducing the chance of singing off-key and achieving a multi-person chorus of the song, where the original singing refers to the original singer of the song.
In this embodiment, a process of the electronic device performing sound beautifying processing on the audio data of the user is shown in fig. 7, and may include the following steps:
S301, extracting song information from the original singing's audio data, where the song information includes the start position and end position of each sentence, the start position and end position of each word, the pronunciation of each word, the pitch of each word, and the like. The song information is mainly extracted from the voice data within the original singing's audio data.
The electronic device may mark the start position and end position of each sentence and each word in the original singing's audio data using a voice alignment technique or a Voice Activity Detection (VAD) technique; the positions may be expressed in units of milliseconds (ms) or of frame lengths. The pronunciation is obtained from the pinyin of each word in the original singing's audio data; for example, if the original singing's audio data is the audio data of a song, the lyrics are converted into pinyin to obtain the pronunciations.
The pitch may be extracted from the original singing's audio data by a polyphonic music pitch extraction algorithm; for example, the pitch data of the original singing's audio data may be extracted with a polyphonic pitch extraction algorithm such as the Melodia algorithm. The pitch data of the original singing's audio data may be denoted as X = [x(1), x(2), …, x(N)], where N is a positive integer and x(n) is the pitch value at time point n in the audio data. The original singing's audio data may include the voice data of multiple singers; the electronic device may extract each singer's voice pitch data separately using a monophonic music pitch extraction algorithm such as the pYIN algorithm. The pitch data of a single singer's voice may be denoted as Yk = [yk(1), yk(2), …, yk(N)], where N is a positive integer, k = 1, 2, …, K, and yk(n) is the pitch value of that singer's voice at time point n in the audio data. In addition, the electronic device may obtain the pitch from a Musical Instrument Digital Interface (MIDI) file.
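For the single-singer case, a monophonic pitch track can be obtained with librosa's pYIN implementation, as sketched below; the C2–C6 search range is an assumed vocal range, not a value from the patent.

```python
import librosa
import numpy as np

def vocal_pitch_track(y: np.ndarray, sr: int) -> np.ndarray:
    """Per-frame pitch values for a single singer's voice (0 where unvoiced)."""
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C6'), sr=sr)
    return np.nan_to_num(f0)
```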
S302, obtaining a fundamental frequency and an envelope of each word based on the start position and end position of each word; and obtaining the consonant information of each word based on the pronunciation of each word.
The envelope is the curve formed by connecting the amplitude peaks at different frequencies, i.e., it tracks the highest amplitude at each frequency, and the fundamental frequency is the lowest of all the frequency components. The electronic device can extract the fundamental frequency of each word using a time-domain or a frequency-domain extraction method, and can obtain the linear predictive coding coefficients of the audio data; the envelope of each word is derived from these linear predictive coding coefficients.
When the time-domain extraction method is used, the fundamental frequency of a word is estimated by analyzing the periodic variation of its waveform, which can be obtained from the word's start position and end position. Because the autocorrelation function of the word is also periodic, it can likewise be used to estimate the fundamental frequency. Each integer multiple of the pitch period corresponds to a large peak of the autocorrelation function, so the fundamental frequency is estimated from the distance between the first large peak and the point k = 0, where k denotes the delay; at k = 0 the autocorrelation function has its largest peak. When the frequency-domain extraction method is used, the word's time-domain waveform is Fourier-transformed and the logarithm is taken; the processed result is a quasi-periodic signal in the frequency domain, and the period of that signal gives the fundamental frequency of the word.
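A minimal sketch of the time-domain (autocorrelation) estimate for one analysis frame follows; the 80–400 Hz search range is an assumption chosen to cover typical singing voices.

```python
import numpy as np

def f0_autocorrelation(frame: np.ndarray, sr: int, fmin: float = 80.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of one frame from its autocorrelation peaks."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # non-negative lags only
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))            # first large peak after k = 0
    return sr / lag
```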
The pronunciation of a word is composed of vowels, consonants and the like, so the electronic device can extract the consonant information directly from the pronunciation of the word.
S303, adjusting the pitch and the speech rate of the user's audio data using the fundamental frequency of each word, the envelope of each word, the consonant information of each word, the pitch of each word, the start position and end position of each word, and the start position and end position of each sentence, thereby completing the sound beautifying processing of the user's audio data. Adjusting the pitch and speech rate of the user's audio data means adjusting the pitch and speech rate of the voice data within the user's audio data.
In this embodiment, the adjustment of the pitch and the speech rate of the audio data of the user may be achieved by adjusting information of each word in the audio data of the user, such as the fundamental frequency of each word in the audio data of the user, the envelope of each word, the consonant information of each word, and the start position and the end position of each sentence in the audio data of the user.
In one implementation, the fundamental frequency and envelope of any word obtained from the original singing's audio data in step S302 are used to adjust the fundamental frequency and envelope of that word in the user's audio data, reducing the difference between the word's fundamental frequency and the one extracted in step S302; the consonant information of the word obtained in step S302 is used to adjust the consonant information of that word in the user's audio data so that it matches the extracted consonant information; and the pitch of the word obtained from the original singing's audio data in step S301 is used to adjust the pitch of that word, reducing the difference between them. Through the adjustment of fundamental frequency, envelope, consonant information and per-word pitch, the pitch of the user's audio data is adjusted. The adjusted position of the word in the user's audio data may be the same as its position in the original singing's audio data.
The start position and end position of any word obtained from the original singing's audio data in step S301 are used to adjust the duration of that word in the user's audio data; after the duration adjustment of the words in a sentence is completed, the start position and end position of the sentence extracted in step S301 can be used to adjust the duration of the sentence, thereby adjusting the speech rate of the user's audio data. The adjusted position of the word in the user's audio data may be the same as its position in the original singing's audio data, and likewise for the adjusted position of the sentence.
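One way to realize these per-word duration and pitch adjustments is sketched below with librosa's off-the-shelf time stretching and pitch shifting; the target duration and semitone shift are assumed to have been derived by comparing the user's word with the original singer's word, and are not values given in the patent.

```python
import librosa
import numpy as np

def align_word(word_audio: np.ndarray, sr: int, target_duration_s: float,
               semitone_shift: float) -> np.ndarray:
    """Stretch a word to the original's duration and shift its pitch toward the original."""
    rate = librosa.get_duration(y=word_audio, sr=sr) / target_duration_s
    stretched = librosa.effects.time_stretch(word_audio, rate=rate)
    return librosa.effects.pitch_shift(stretched, sr=sr, n_steps=semitone_shift)
```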
After the sound beautifying processing of the user's audio data is completed, the electronic device may synthesize the sound-beautified audio data and the tone-converted audio data, i.e., fuse the two into one piece of audio data; this synthesis may be a synthesis of the frequency spectra of the two. In one example, the spectrum of one of the two is adjusted based on the spectrum of the other: for instance, taking the spectrum of the tone-converted audio data as the reference, the difference between the spectrum of the sound-beautified audio data and that of the tone-converted audio data is narrowed, e.g., the spectrum of the sound-beautified audio data is adjusted in proportion to the spectrum of the tone-converted audio data. In another example, the spectra of the two are added, for example as a weighted sum. The spectrum synthesis may be applied to the spectra of the voice data within the two pieces of audio data.
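A sketch of the weighted-summation variant in the STFT domain is shown below; the equal 0.5/0.5 weighting is an assumption, since the text leaves the weights open.

```python
import numpy as np
import librosa

def fuse_spectra(beautified: np.ndarray, converted: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Fuse the sound-beautified and tone-converted signals by a weighted sum of their spectra."""
    n = min(len(beautified), len(converted))
    s_beauty = librosa.stft(beautified[:n])
    s_convert = librosa.stft(converted[:n])
    fused = w * s_beauty + (1.0 - w) * s_convert
    return librosa.istft(fused, length=n)
```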
Fig. 3 to fig. 6 above illustrate performing tone conversion on the user's audio data: the tone conversion may divide the user's voice data into at least two parts, the user designates a target user for each part, and the target user's tone representation model is then called to convert the corresponding part. The division of the user's voice data may be based on the voiceprint discrimination result output by the voiceprint discrimination model; a corresponding User Interface (UI) is shown in fig. 8.
In the UI shown in fig. 8, the voiceprint discrimination model performs voiceprint discrimination on the user's voice data and determines that the voice data corresponds to two voiceprints, star 1 and star 2. The UI displays a prompt asking whether to segment automatically according to the human voice; if the electronic device receives a confirmation, it automatically segments the voice data according to the two voiceprints determined by the voiceprint discrimination model. For example, if the model determines that 0 s to 10 s of the voice data corresponds to star 1, 10 s to 25 s to star 2, and 25 s to 40 s to star 1, the electronic device segments the data into 0 s to 10 s, 10 s to 25 s, and 25 s to 40 s.
In the UI shown in fig. 8, the user may specify a target tone, for example, a target user whose tone is used as the target tone, such as specifying that the tone of star 1 is converted to the tone of star 3 and the tone of star 2 is converted to the tone of star 4 in fig. 8. After the user clicks and confirms, the electronic equipment calls the tone representation model of the star 3 to perform tone conversion on the audio data from 0s to 10s and from 25s to 40s, and calls the tone representation model of the star 4 to perform tone conversion on the audio data from 10s to 25 s.
In a scene without original singing, the electronic equipment can acquire audio data, call a voiceprint discrimination model to perform voiceprint discrimination on voice data in the acquired audio data, and then automatically divide the voice data by using a voiceprint discrimination result. In an original singing scene, the electronic equipment can acquire two audio data, one is the original singing audio data, the other is the audio data of a user (also called a singer), the electronic equipment can call a voiceprint discrimination model to perform voiceprint discrimination on voice data in the two audio data respectively, and the voiceprint discrimination result of the original singing or the voiceprint discrimination result of the user is used for automatically dividing the voice data in the audio data of the user.
Besides the identity of the user to whom voice data belongs, the voiceprint discrimination model can recognize the gender of that user, so the electronic device can divide the voice data by gender. Fig. 9 shows voice data automatically divided by gender; the user can then designate a target user of matching gender, for example a male target user for a male part and a female target user for a female part. The electronic device then calls the target user's tone representation model to perform the tone conversion.
Figs. 8 and 9 illustrate the electronic device automatically dividing the voice data using the voiceprint discrimination result of the voiceprint discrimination model, and the reference symbols in figs. 8 and 9 indicate that the automatic division uses the voiceprint discrimination result. Besides automatic division, the user may divide the voice data manually. The user may divide by time, either at intervals of M seconds or by manually selecting division time points, where M is a natural number; when selecting a division time point manually, the user can drag a progress bar, and a division prompt pops up at the dragged position asking whether to divide at that time point; if confirmed, the division is made there, otherwise it is not. The user may also divide the voice data by lyrics, for example grouping every X sentences of lyrics into one section, where X is a natural number greater than or equal to 1, and the number of lyric sentences in each section may be the same or different; of course, the electronic device may also divide by lyrics automatically. Fig. 10 shows the user dividing the voice data by time: the user may manually input the start time point and end time point of each section, and may also manually input the target user corresponding to each section. The electronic device then calls the target user's tone representation model to perform the tone conversion.
The target tone may be specified by the user, or it may be determined in other ways. In one example, the electronic device counts scores for different tones and takes at least one tone with a higher score as the target tone; the scores may be based on the number of times each tone has been used, with more uses giving a higher score. In another example, the similarity between the user's tone features and the tone features of the historical users is computed, and the target tone is determined from this similarity, for instance by selecting a tone that is close to the user's or one that differs greatly from it. In another example, the electronic device classifies tones into types in advance; after determining the type of the tone of the currently acquired audio data, it selects one tone as the target tone based on that type, for example choosing a tone from within the same type.
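The similarity-based selection could, for example, compare tone feature vectors as in the sketch below; representing tones as vectors and using cosine similarity are assumptions made for illustration.

```python
import numpy as np

def pick_target_timbre(user_vec: np.ndarray, candidates: dict, mode: str = 'similar'):
    """Pick a target tone: the closest voice ('similar') or the most different one ('contrast')."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(candidates.items(), key=lambda kv: cos(user_vec, kv[1]))
    return ranked[-1][0] if mode == 'similar' else ranked[0][0]
```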
In addition, when the electronic device performs tone conversion on the audio data, it may convert only part of the audio data. For example, the user may specify the portion to be converted using the UI shown in fig. 10; or the electronic device may pre-designate the data to be divided within the audio data and call the voiceprint discrimination model to divide it, or the user may divide it manually, or adjust the division produced by the voiceprint discrimination model, and so on. In this way the audio data can keep the user's original tone while the target tone is added. In other examples, after the electronic device performs tone conversion on the audio data, the tone-converted audio data is fused with the user's audio data collected by the electronic device to obtain one piece of audio data; the fusion may be a fusion of the two spectra, which is not described in detail here.
An embodiment of the present application further provides an electronic device, where the electronic device includes: a processor and a memory; wherein the memory is used for storing one or more computer program codes, the computer program codes comprise computer instructions, and when the processor executes the computer instructions, the processor executes the sound correcting method.
The embodiment of the present application further provides a computer storage medium, where the computer storage medium includes computer instructions, and when the computer instructions are run on an electronic device, the electronic device is enabled to execute the above-mentioned sound repairing method.

Claims (16)

1. A method of modifying sound, the method comprising:
acquiring first voice data, wherein the first voice data has the tone of a first object;
dividing the first voice data to obtain at least two pieces of second voice data;
determining a second object corresponding to the second voice data; wherein, the type of at least one second object in the second objects is a first type, and the type of at least one second object is a second type;
performing tone conversion on the second voice data to obtain third voice data, wherein the third voice data has the tone of the second object corresponding to the second voice data;
and at least fusing the third voice data to obtain fourth voice data, wherein the fourth voice data at least has the tone of the second object of the first type and the tone of the second object of the second type.
2. The method of claim 1, further comprising: acquiring fifth voice data, wherein the fifth voice data has the tone of a third object, and the fifth voice data and the first voice data correspond to the same content;
extracting content parameters from the fifth voice data;
based on the content parameters, performing sound beautifying processing on the first voice data to obtain sixth voice data;
and obtaining seventh voice data based on the fourth voice data and the sixth voice data, wherein the seventh voice data at least has the tone of the second object of the first type and the tone of the second object of the second type, and the content parameters of the seventh voice data are matched with the content parameters extracted from the fifth voice data.
3. The method of claim 2, wherein the content parameters comprise: a start position and an end position of each sentence, a start position and an end position of each word, a pronunciation of each word, and a pitch of each word; the performing, based on the content parameters, a sound-beautifying process on the first voice data includes:
obtaining a fundamental frequency and an envelope of each word based on a start position and an end position of each word;
obtaining consonant information of each character based on pronunciation of each character;
the pitch and the speech rate of the first voice data are adjusted using the fundamental frequency of each word, the envelope of each word, the consonant information of each word, the pitch of each word, the start position and the end position of each word, and the start position and the end position of each sentence.
4. The method according to any one of claims 1 to 3, wherein the dividing the first voice data into at least two pieces of second voice data comprises:
performing voiceprint recognition on the first voice data to determine a recognition result of at least one part of the first voice data;
and dividing the first voice data based on the at least one part of recognition result to obtain at least two pieces of second voice data.
5. The method according to claim 4, wherein the dividing the first speech data based on the at least one part of the recognition result to obtain at least two pieces of the second speech data comprises:
dividing the first voice data based on the at least one part of recognition result to obtain a division result;
receiving an adjusting instruction aiming at the dividing result;
and responding to the adjusting instruction, and adjusting the division result based on an adjusting parameter in the adjusting instruction to obtain at least two pieces of second voice data.
6. The method of claim 4, wherein the performing voiceprint recognition on the first speech data to determine a recognition result of at least a portion of the first speech data comprises:
extracting first feature data from the first voice data;
calling a voiceprint discrimination model to process the first characteristic data to obtain a voiceprint discrimination result output by the voiceprint discrimination model, wherein the voiceprint discrimination result comprises a recognition result of each part in the first voice data; the voiceprint discrimination model is obtained by training voice data of a plurality of historical objects.
7. The method according to any one of claims 1 to 3, wherein the dividing the first voice data into at least two pieces of second voice data comprises:
receiving a dividing instruction for the first voice data;
responding to the dividing instruction, and dividing the first voice data based on the dividing parameters in the dividing instruction to obtain at least two pieces of second voice data.
8. The method according to claim 7, wherein the partitioning parameter comprises a time parameter and/or a lyric parameter.
9. The method of claim 8, wherein the time parameter is a user manually entered time duration or a user manually controlled time control selection time duration; the method further comprises the following steps: if the user manually controls the time control to select the duration, outputting prompt information after detecting that one duration is selected;
the dividing the first voice data based on the time parameter in the dividing instruction comprises: and responding to a confirmation instruction aiming at the prompt message, and dividing the first voice data based on the selected duration.
10. The method according to any one of claims 1 to 3, wherein the dividing the first voice data into at least two pieces of second voice data comprises:
performing voiceprint recognition on fifth voice data to determine a recognition result of at least one part of the fifth voice data, wherein the fifth voice data has a tone of a third object, and the fifth voice data and the first voice data correspond to the same content;
and dividing the first voice data based on the at least one part of recognition result to obtain at least two pieces of second voice data.
11. The method according to any one of claims 1 to 3, wherein the determining of the second object corresponding to the second voice data comprises:
acquiring a second object determined by the user for the second voice data;
or
Obtaining scores of timbres of different objects, and selecting a second object corresponding to the second voice data from the objects with the scores meeting preset conditions;
or
Acquiring similarity between the tone color feature of the first object and the tone color feature of the second object, and selecting the second object corresponding to the second voice data based on the similarity;
or
And determining the type of the tone of the first object, and selecting a second object corresponding to the second voice data based on the type of the tone of the first object.
12. The method of claim 11, wherein performing the timbre conversion on the second speech data to obtain third speech data comprises:
acquiring the tone characteristic of the second voice data;
and calling a tone representation model of the second object, and processing the tone features to obtain the third voice data output by the tone representation model, wherein the tone representation model is obtained by training a plurality of pieces of voice data of the second object, and the second object and the tone representation model are in one-to-one relationship.
13. The method of claim 12, wherein the obtaining the timbre characteristic of the second speech data comprises:
extracting second feature data from the second voice data;
and calling a tone extraction model to process the second characteristic data to obtain tone characteristics output by the tone extraction model.
14. The method according to any one of claims 1 to 3, wherein the at least fusing the third voice data to obtain fourth voice data comprises: and if part of the second voice data in all the second voice data is not subjected to tone conversion, fusing the second voice data which is not subjected to tone conversion in all the second voice data and the third voice data to obtain fourth voice data, wherein the fourth voice data has the tone of the second object and the tone of the first object.
15. An electronic device, characterized in that the electronic device comprises: a processor and a memory; wherein the memory is configured to store one or more computer program codes comprising computer instructions which, when executed by the processor, cause the processor to perform the method of sound modification according to any one of claims 1 to 14.
16. A computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 14.
CN202210377923.3A 2022-04-12 2022-04-12 Sound repairing method and device Active CN114464151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210377923.3A CN114464151B (en) 2022-04-12 2022-04-12 Sound repairing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210377923.3A CN114464151B (en) 2022-04-12 2022-04-12 Sound repairing method and device

Publications (2)

Publication Number Publication Date
CN114464151A true CN114464151A (en) 2022-05-10
CN114464151B CN114464151B (en) 2022-08-23

Family

ID=81417688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210377923.3A Active CN114464151B (en) 2022-04-12 2022-04-12 Sound repairing method and device

Country Status (1)

Country Link
CN (1) CN114464151B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010095622A1 (en) * 2009-02-17 2010-08-26 国立大学法人京都大学 Music acoustic signal generating system
CN103514883A (en) * 2013-09-26 2014-01-15 华南理工大学 Method for achieving self-adaptive switching of male voice and female voice
CN104464725A (en) * 2014-12-30 2015-03-25 福建星网视易信息系统有限公司 Method and device for singing imitation
CN108305636A (en) * 2017-11-06 2018-07-20 腾讯科技(深圳)有限公司 A kind of audio file processing method and processing device
CN112331222A (en) * 2020-09-23 2021-02-05 北京捷通华声科技股份有限公司 Method, system, equipment and storage medium for converting song tone
CN113836344A (en) * 2021-09-30 2021-12-24 广州艾美网络科技有限公司 Personalized song file generation method and device and music singing equipment

Also Published As

Publication number Publication date
CN114464151B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN106898340B (en) Song synthesis method and terminal
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
CN111508511A (en) Real-time sound changing method and device
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
JP2006517037A (en) Prosodic simulated word synthesis method and apparatus
CN111402842A (en) Method, apparatus, device and medium for generating audio
CN111370024B (en) Audio adjustment method, device and computer readable storage medium
CN112992109B (en) Auxiliary singing system, auxiliary singing method and non-transient computer readable recording medium
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN109346057A (en) A kind of speech processing system of intelligence toy for children
CN111105779A (en) Text playing method and device for mobile client
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP2011186143A (en) Speech synthesizer, speech synthesis method for learning user's behavior, and program
CN114464151B (en) Sound repairing method and device
CN107025902B (en) Data processing method and device
CN115938340A (en) Voice data processing method based on vehicle-mounted voice AI and related equipment
CN114783408A (en) Audio data processing method and device, computer equipment and medium
EP1271469A1 (en) Method for generating personality patterns and for synthesizing speech
JP6003352B2 (en) Data generation apparatus and data generation method
JP6044490B2 (en) Information processing apparatus, speech speed data generation method, and program
JP2013210501A (en) Synthesis unit registration device, voice synthesis device, and program
CN113345416A (en) Voice synthesis method and device and electronic equipment
CN117854478B (en) Speech synthesis method, device and system based on controllable text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220608

Address after: 100095 floors 2-14, building 3, yard 5, honeysuckle Road, Haidian District, Beijing

Applicant after: Beijing Honor Device Co.,Ltd.

Address before: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040

Applicant before: Honor Device Co.,Ltd.

GR01 Patent grant