
CN114078464A - Audio processing method, device and equipment

Info

Publication number
CN114078464A
CN114078464A (application number CN202210061154.6A)
Authority
CN
China
Prior art keywords
target
time point
pitch
singing
audio data
Prior art date
Legal status
Granted
Application number
CN202210061154.6A
Other languages
Chinese (zh)
Other versions
CN114078464B
Inventor
李婧如
田思达
袁微
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210061154.6A
Publication of CN114078464A
Application granted
Publication of CN114078464B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0008 Associated control or indicating means
    • G10H 1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H 1/36 Accompaniment arrangements
    • G10H 1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/051 Musical analysis for extraction or detection of onsets of musical sounds or notes, i.e. note attack timings
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/26 Speech to text systems
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Embodiments of the present application disclose an audio processing method, apparatus, and device. The audio processing method includes: acquiring a target pitch curve of target audio data and determining a target silent region in the target audio data; detecting the start time point location of each phoneme in the target audio data to obtain a target start time sequence; determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region; performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating a singing intonation indication file corresponding to the target audio data using the one or more notes. Embodiments of the present application can save labor cost and improve the generation efficiency of singing intonation indication files.

Description

Audio processing method, device and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio processing method, apparatus, and device.
Background
At present, a singing intonation indication file (for example, a Musical Instrument Digital Interface (MIDI) file) corresponding to a song (also called audio data) can be widely applied in various scenarios. For example, when a user sings a song, notes with pitch values may be displayed for the user according to the corresponding singing intonation indication file, so that the user can improve their intonation while singing by referring to the displayed notes. At present, for any song, the corresponding singing intonation indication file is generated by manual annotation; this approach consumes considerable labor cost, and the generation efficiency of the singing intonation indication file is low.
Disclosure of Invention
Embodiments of the present application provide an audio processing method, apparatus, and device, which can save labor cost and improve the generation efficiency of a singing intonation indication file.
In one aspect, an embodiment of the present application provides an audio processing method, where the method includes:
acquiring a target pitch curve of target audio data, and determining a target silent region in the target audio data, wherein the target audio data comprises one or more phonemes sung by a target object, and one phoneme comprises one or more time point locations;
detecting the start time point location of each phoneme in the target audio data to obtain a target start time sequence;
determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, wherein one singing segment comprises at least one time point location and a corresponding pitch value, and the start time point location of one singing segment corresponds to the start time point location of one phoneme;
performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating a singing intonation indication file corresponding to the target audio data using the one or more notes.
In another aspect, an embodiment of the present application provides an audio processing apparatus, where the apparatus includes:
a processing unit, configured to acquire a target pitch curve of target audio data and determine a target silent region in the target audio data, wherein the target audio data comprises one or more phonemes sung by a target object, and one phoneme comprises one or more time point locations;
the processing unit is further configured to detect the start time point location of each phoneme in the target audio data to obtain a target start time sequence;
the processing unit is further configured to determine one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, wherein one singing segment comprises at least one time point location and a corresponding pitch value, and the start time point location of one singing segment corresponds to the start time point location of one phoneme;
the processing unit is further configured to perform note transcription processing on each of the one or more singing segments to obtain one or more notes;
and a generating unit, configured to generate a singing intonation indication file corresponding to the target audio data using the one or more notes.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where the memory is used to store a computer program, and when the computer program is executed by the processor, the computer program implements the following steps:
acquiring a target pitch curve of target audio data, and determining a target silent region in the target audio data, wherein the target audio data comprises one or more phonemes sung by a target object, and one phoneme comprises one or more time point locations;
detecting the start time point location of each phoneme in the target audio data to obtain a target start time sequence;
determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, wherein one singing segment comprises at least one time point location and a corresponding pitch value, and the start time point location of one singing segment corresponds to the start time point location of one phoneme;
performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating a singing intonation indication file corresponding to the target audio data using the one or more notes.
In yet another aspect, an embodiment of the present application provides a computer storage medium, where a computer program is stored, the computer program being adapted to be loaded by a processor and execute the following steps:
acquiring a target pitch curve of target audio data, and determining a target silent region in the target audio data, wherein the target audio data comprises one or more phonemes sung by a target object, and one phoneme comprises one or more time point locations;
detecting the start time point location of each phoneme in the target audio data to obtain a target start time sequence;
determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, wherein one singing segment comprises at least one time point location and a corresponding pitch value, and the start time point location of one singing segment corresponds to the start time point location of one phoneme;
performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating a singing intonation indication file corresponding to the target audio data using the one or more notes.
In yet another aspect, an embodiment of the present application provides a computer program product, which includes a computer program; when the computer program is executed by a processor, it implements the above-mentioned audio processing method.
According to the embodiments of the present application, the start time point location of each phoneme in the target audio data can be detected, and one or more singing segments corresponding to the target audio data can be accurately determined based on the detected target start time sequence, the target pitch curve of the target audio data, and the target silent region in the target audio data. In this way, each singing segment includes at least one time point location and a corresponding pitch value, no silent region appears inside a singing segment, and the accuracy of the singing segments is improved. Moreover, because the start time point location of a singing segment corresponds to the start time point location of a phoneme, one phoneme corresponds to one or more notes when note transcription processing is performed on each singing segment; that is, the notes are divided according to phonemes, which guarantees accurate note segmentation, and an accurate singing intonation indication file is generated from the accurate notes. Since the singing intonation indication file corresponding to the target audio data is generated automatically after a series of processing steps on the target audio data, the generation efficiency of the file can be effectively improved and labor cost can be effectively saved. In addition, because the one or more singing segments are obtained from the target pitch curve of the target audio data, the target pitch curve can be quantized according to the pitch values in each singing segment, so that the singing intonation indication file is obtained by outputting the quantized pitch curve.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1a is a schematic flowchart of an audio processing scheme provided by an embodiment of the present application;
FIG. 1b is a schematic diagram of interaction between a terminal and a server provided by an embodiment of the present application;
FIG. 1c is a schematic diagram of a note provided by an embodiment of the present application;
FIG. 1d is a schematic diagram of an intonation detection result provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of start time point locations provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of further notes provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of another audio processing method provided by an embodiment of the present application;
FIG. 7a is a schematic diagram of determining a target start time sequence provided by an embodiment of the present application;
FIG. 7b is another schematic diagram of determining a target start time sequence provided by an embodiment of the present application;
FIG. 8a is a schematic diagram of a segmentation process provided by an embodiment of the present application;
FIG. 8b is a schematic diagram of another segmentation process provided by an embodiment of the present application;
FIG. 9a is a schematic flowchart of another audio processing method provided by an embodiment of the present application;
FIG. 9b is a schematic flowchart of a post-processing procedure provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In the embodiments of the present application, audio data may be understood as digitized sound data. Digitizing sound refers to sampling a continuous analog audio signal at a certain frequency and performing analog-to-digital conversion on each sampling result; converting one sampling result yields the audio amplitude value of one time point location. Audio data may therefore include a plurality of time point locations and an audio amplitude value for each time point location. Furthermore, because the audio amplitude values differ across time point locations, one or more consecutive time point locations and their corresponding audio amplitude values can form different audio elements; the audio elements here may include, but are not limited to: phonemes sung by the object singing the song, accompaniment, ambient sounds, and so on. A phoneme here may refer to data that makes up a Chinese character, an English word, or the like; for example, the Chinese word "想" ("want") is composed of the three phonemes "x", "i", and "ang". Accompaniment refers to the instrumental performance that accompanies the singing. That is, audio data may include one or more audio elements (such as phonemes and accompaniment), the audio elements may be the same as or different from one another, and one audio element (such as a phoneme) may include one or more time point locations and corresponding audio amplitude values.
In order to improve the generation efficiency of the singing intonation indication file corresponding to audio data and save labor cost, the embodiments of the present application provide an audio processing scheme. Referring to FIG. 1a, its general principle is as follows. First, target audio data is obtained, which includes one or more phonemes sung by a target object; the target object may be the original singer of a song, a singer who holds the singing rights to the song, a user whose intonation needs to be detected, and so on. After the target audio data is obtained, pitch detection on the target audio data yields its target pitch curve; silence detection on the target audio data yields the target silent region; and detecting the start time point location of each phoneme in the target audio data yields a target start time sequence. A singing intonation indication file can then be generated based on the target start time sequence, the target pitch curve, and the target silent region: specifically, one or more singing segments corresponding to the target audio data are determined from these three, note transcription processing is performed on each singing segment to obtain one or more notes, and the one or more notes are used to generate the singing intonation indication file corresponding to the target audio data.
Practice shows that the audio processing scheme provided by the embodiments of the present application has at least the following beneficial effects. The start time point location of each phoneme in the target audio data is detected, and the singing segments are determined based on the detected target start time sequence, the target pitch curve, and the target silent region; this prevents silent regions from appearing inside a singing segment, makes the start time point location of a singing segment correspond to the start time point location of a phoneme, and improves the accuracy of the singing segments. Consequently, when note transcription processing is performed on each singing segment, one phoneme corresponds to one or more notes; that is, the notes are divided according to phonemes, which guarantees accurate note segmentation and, in turn, the accuracy of the singing intonation indication file generated from the notes. Moreover, the whole process requires no human participation and the singing intonation indication file is generated automatically, so the generation efficiency is effectively improved and labor cost is effectively saved. In addition, the one or more singing segments are obtained using the target pitch curve, so the target pitch curve can be quantized according to the pitch values in each singing segment, and the singing intonation indication file is obtained by outputting the quantized pitch curve.
In a specific implementation, the above-mentioned audio processing scheme may be executed by a computer device, which may be a terminal or a server. The terminals mentioned here may include, but are not limited to: smart phones, tablet computers, notebook computers, desktop computers, smart watches, smart voice interaction devices, smart household appliances, vehicle-mounted terminals, aircraft, and so on; various clients (APPs) can run in the terminal, such as a video playing client, a social client, a browser client, an information flow client, an education client, and the like. The server mentioned here may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. So-called cloud computing is a computing model that distributes computing tasks over a resource pool formed by large numbers of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. Moreover, the computer device mentioned in the embodiments of the present application may be located outside or inside a blockchain network, which is not limited here; a blockchain network is a network formed by a peer-to-peer (P2P) network and a blockchain, and a blockchain is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms; it is essentially a decentralized database, a chain of data blocks associated by cryptography.
Alternatively, in other embodiments, the above-mentioned audio processing scheme may be executed jointly by the server and the terminal; the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application. For example, the terminal may be responsible for acquiring the target audio data and sending it to the server; the server determines the target pitch curve, the target silent region, the target start time sequence, and so on, and returns them to the terminal; the terminal then generates the singing intonation indication file from the target pitch curve, the target silent region, and the target start time sequence, as shown in FIG. 1b. For another example, the server is responsible for acquiring the target audio data, determining its target pitch curve, target silent region, and target start time sequence, determining the one or more singing segments based on these three, and performing note transcription processing on each singing segment to obtain one or more notes, which are then sent to the terminal so that the terminal can generate the singing intonation indication file from the received notes. It should be understood that these two divisions of labor between terminal and server are only examples and are not exhaustive.
Further, the above audio processing scheme may be applied to various application scenarios, such as an online karaoke (a singing style) scenario, a singing intonation detection scenario, and so on.
For example, when the audio processing scheme is applied to the online karaoke scenario, the target object may be the original singer of the song (or a singer holding the singing rights to the song). In this scenario, the computer device may record the singing voice of the target object while the target object sings any song to obtain the target audio data, and generate the corresponding singing intonation indication file using the above-mentioned audio processing scheme; then, when any user selects that song for karaoke, the computer device may display one or more notes from the singing intonation indication file on that user's device screen, so that the user can improve their intonation while singing by referring to the displayed notes, as shown in FIG. 1c.
For another example, when the audio processing scheme is applied to the singing intonation detection scenario, the target object may be a user whose intonation needs to be detected. In this scenario, the computer device may record the singing voice of the target object while the target object sings any song to obtain the target audio data, and generate the corresponding singing intonation indication file using the audio processing scheme. The computer device may further obtain a singing intonation guide file for that song, generated from a recording of the original singer (or a rights-holding singer) singing it; the generation follows the same procedure as for the singing intonation indication file of the target audio data and is not repeated here. The computer device can then perform singing intonation detection on the target object according to the difference between the singing intonation indication file of the target audio data and the singing intonation guide file of the song, and output the detection result on the target object's device screen. It can be understood that the singing intonation detection result may include an intonation grade, such as excellent or good, or an intonation score, which is not limited in this application; taking a detection result that includes an intonation score as an example, a schematic diagram of the output is shown in FIG. 1d.
Based on the above description of the audio processing scheme, an embodiment of the present application proposes an audio processing method, which can be executed by the above-mentioned computer device (a terminal or a server) or jointly by the terminal and the server. For convenience of explanation, the following description takes the computer device executing the audio processing method as an example. Referring to FIG. 2, the audio processing method may include the following steps S201 to S204:
S201, acquiring a target pitch curve of the target audio data, and determining a target silent region in the target audio data.
In a specific implementation, the computer device may first obtain the target audio data; the target audio data includes at least one or more phonemes sung by the target object, and a phoneme includes one or more time point locations and an audio amplitude value for each time point location. It can be understood that when the computer device plays the target audio data, the sound of the target object singing those phonemes is reproduced; for example, if at least one phoneme sung by the target object constitutes the word "I", playing the target audio data reproduces the sound of the target object singing the word "I". The target audio data may be obtained in ways including, but not limited to, the following:
the first acquisition mode is as follows: the computer device may obtain a Uniform Resource Locator (URL) of the initial audio data, so as to obtain corresponding initial audio data based on the URL, and determine target audio data based on the obtained initial audio data; the url is understood to be an audio link, that is, the computer device may download the initial audio data according to the audio link of the initial audio data, so as to obtain the target audio data based on the initial audio data. The initial audio data may be audio data of a single track, that is, the initial audio data only includes one or more phonemes sung by the target object; alternatively, the initial audio data may be multi-track audio data, where one track corresponds to one audio element, that is, the initial audio data may include one or more phonemes sung by the target object and other audio elements (such as accompaniment, environmental sounds, etc.). Accordingly, an embodiment of obtaining the target audio data based on the initial audio data may be: whether the initial audio data is audio data of a single track or audio data of multiple tracks, the initial audio data can be directly used as target audio data. Or, when the initial audio data is multi-track audio data, in order to avoid interference of audio data of other tracks with subsequent processing (such as note transfer), sound source separation (a process of separating input audio data into audio data of a plurality of music tracks (tracks for short)) may be performed on the initial audio data first, and after the audio data of each track is obtained through sound source separation, the audio data of the corresponding track where the phoneme sung by the target object is located may be used as the target audio data, that is, the target audio data in this case is single-track audio data.
It should be noted that, when performing sound source separation on the initial audio data, the computer device may use the open-source Spleeter algorithm (a track separation tool), which models the initial audio data with the encoder-decoder structure of a U-Net (a convolutional neural network) to achieve efficient and accurate sound source separation; sound source separation may also be performed by non-negative matrix factorization, which is not limited in this application. It should be noted that the initial audio data may be monaural or multi-channel audio data; in either case, the computer device may perform sound source separation on it directly. Optionally, if the initial audio data is multi-channel, the computer device may also convert it into monaural audio data first and perform sound source separation on the converted data, which is not limited in this application.
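As an illustration of this separation step, the following is a minimal sketch using the open-source Spleeter tool named above; the file paths and the two-stem (vocals plus accompaniment) configuration are illustrative assumptions, not requirements of the patented method.

```python
from spleeter.separator import Separator

# Split a mixed recording into vocals + accompaniment ("2stems" model);
# the separated vocal stem would then serve as the single-track target
# audio data for the later processing steps.
separator = Separator('spleeter:2stems')
separator.separate_to_file('initial_audio.mp3', 'separated/')
```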
The second acquisition mode is as follows: the computer device may receive audio data sent by the target object and use the received audio data directly as the target audio data; or, when the received audio data is multi-track, obtain the target audio data by performing sound source separation on the received audio data. The target object sends the audio data to the computer device when, for example, the target object wants singing intonation detection to be performed on audio data obtained by singing a certain song. Alternatively, if audio data is stored in the computer device's own storage space, the computer device may scan the stored audio data and determine the currently scanned audio data as the target audio data; optionally, if the currently scanned audio data is multi-track, the target audio data may likewise be obtained by performing sound source separation on it.
The third acquisition mode is as follows: the computer device may also record the sound of the target object singing any song in advance and use the recorded audio data as the target audio data; or, if the recorded audio data is multi-track, obtain the target audio data by performing sound source separation on the recording.
It should be noted that when the embodiments of the present application are applied to specific products or technologies, acquiring any data related to the target object (such as the target audio data) requires the target object's permission or consent, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, if the computer device obtains the target audio data by recording the target object's voice, it may first display a prompt interface or prompt pop-up asking whether recording the target object's voice to obtain target audio data is allowed, the prompt including a confirmation option; only if the target object selects the confirmation option does the computer device proceed with recording to acquire the target audio data; otherwise, the process ends.
After the target audio data is obtained through any of the above acquisition modes, the computer device may perform pitch detection on the target audio data to obtain its target pitch curve. In one embodiment, the computer device may invoke a pitch value recognition model to perform pitch detection on the target audio data and obtain the pitch value of each time point location, and then construct the target pitch curve from the pitch values of the time point locations; the pitch value recognition model may be trained in advance on training data based on machine learning / deep learning techniques.
In another embodiment, the computer device may perform fundamental frequency extraction on the target audio data and determine the target pitch curve based on the fundamental frequency extraction result. The extraction result may include the fundamental frequency value of each time point location in the target audio data; when determining the target pitch curve from it, the pitch value of each time point location may be calculated based on its fundamental frequency value, and the target pitch curve is constructed from those pitch values. The pitch value for any time point location may be obtained either by using its fundamental frequency value directly as the pitch value, or by applying a linear transformation or similar processing to the fundamental frequency value. It should be noted that this application does not limit the specific implementation of fundamental frequency extraction; for example, the computer device may use the open-source pYIN algorithm (a time-domain fundamental frequency extraction algorithm), which builds on the YIN algorithm (also a time-domain fundamental frequency extraction algorithm) and uses a hidden Markov model to obtain more accurate fundamental frequency values so that the detected pitch is consistent with the actual pitch; alternatively, the computer device may adopt the SWIPE algorithm (a frequency-domain fundamental frequency extraction algorithm), and so on.
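As a hedged sketch of this step, the open-source librosa library exposes a pYIN implementation; the file path, the frequency range, and the choice of MIDI semitones as the pitch unit are illustrative assumptions, not details prescribed by the patent.

```python
import librosa

# Load the (separated) vocal track; the path is a placeholder.
y, sr = librosa.load('separated/vocals.wav', sr=None, mono=True)

# pYIN fundamental frequency extraction over a typical singing range.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C6'), sr=sr)

# One simple mapping from F0 (Hz) to a pitch value: the MIDI semitone
# scale; unvoiced frames come back as NaN, i.e. pitch-free points.
pitch_curve = librosa.hz_to_midi(f0)
```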
After the target audio data is obtained through any of the above acquisition modes, the computer device can also perform silence detection on the target audio data to determine the target silent region. A silent region may refer to a pitch-free region, i.e., a region in which the pitch value of every time point location is zero. This application does not limit the specific implementation of silence detection; for example, the computer device may use the open-source Tony algorithm (a speech endpoint detection algorithm); or an endpoint detection algorithm based on spectral entropy weighted by short-time energy; or an endpoint detection algorithm based on an artificial neural network, and so on.
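A minimal sketch of locating pitch-free regions follows, assuming frames with zero or undefined pitch count as silent; this is an illustrative stand-in for the endpoint detection algorithms named above, not the patented procedure.

```python
import numpy as np

def find_silent_regions(pitch_curve):
    """Return (start, end) frame-index pairs of runs where the pitch
    value is zero or undefined (NaN), i.e. candidate silent regions."""
    silent = ~np.isfinite(pitch_curve) | (pitch_curve == 0)
    regions, start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and start is None:
            start = i
        elif not is_silent and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(silent)))
    return regions
```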
S202, detecting the start time point location of each phoneme in the target audio data to obtain a target start time sequence.
The target start time sequence may include the start time point location of each of the at least one phoneme. To obtain it, the computer device may detect the start time point locations of the phonemes in the target audio data to obtain an initial start time sequence, and then derive the target start time sequence from the initial start time sequence; the initial start time sequence contains the start time point location of each phoneme. Accordingly, the target start time sequence may be derived in several ways. The initial start time sequence may be used directly as the target start time sequence. Alternatively, considering that breathing or other actions during singing may cause a single phoneme to produce two detected start time point locations, which are usually adjacent and separated by a small time interval, the initial sequence may be filtered to remove start time point locations separated by small time intervals, and the filtered sequence is used as the target start time sequence; this improves the accuracy of the phoneme start times. Similarly, if the target object sings a phoneme very briefly, the interval between that phoneme's start time point location and that of the next phoneme is small, and the two phonemes can be treated as one; again, the initial sequence is filtered to remove closely spaced start time point locations. Finally, a sung word may consist of several phonemes whose singing duration is concentrated on a key phoneme (such as a vowel) while the other phonemes are very short, so their start time point locations are closely spaced; filtering out the closely spaced start time point locations then leaves a target start time sequence whose entries correspond to the key phonemes. (A simple interval-filtering sketch is shown below.)
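The interval-based filtering described above can be sketched as follows; the 50 ms minimum interval is an assumed illustrative value, not one prescribed by the patent.

```python
def filter_onsets(onset_times, min_interval=0.05):
    """Keep an onset only if it occurs at least min_interval seconds
    after the previously kept onset, merging closely spaced starts."""
    filtered = []
    for t in sorted(onset_times):
        if not filtered or t - filtered[-1] >= min_interval:
            filtered.append(t)
    return filtered
```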
It should be noted that, when detecting the start time point locations of the phonemes in the target audio data to obtain the initial start time sequence, the computer device may use an open-source deep-learning algorithm for extracting music onset times: features are first extracted, the model shown in FIG. 3 then predicts the start time point location of each phoneme in the target audio data, and finally peak picking selects the start time point locations that satisfy the conditions; the selected start time point locations form the initial start time sequence. Each module in FIG. 3 is a neural network with a different structure, such as a five-layer Convolutional Neural Network (CNN). Illustratively, as shown in FIG. 3, during training, after the input representation is provided, two front-ends A (Front-end A) may be used, where a front-end A is a convolutional neural network: one front-end A is pre-trained with fixed weights (Front-end A pre-trained and weights-fixed), and another front-end A to be trained (Front-end A ready to train) together with the pre-trained one performs feature extraction on the same training data, producing two feature extraction results; the two results are then concatenated (Concatenate), i.e., the outputs of the two convolutional neural networks are connected, and a back-end D to be trained (Back-end D ready to train) is trained on the concatenated feature extraction results.
In this case, when the computer device predicts start time point locations using the model shown in FIG. 3, the input representation may be the Mel spectrogram of the target audio data; the two front-ends A (i.e., the two convolutional neural networks) each perform feature extraction, producing two feature extraction results, which are then concatenated; the start time point location of each phoneme is output through the back-end network D and an activation unit, yielding the initial start time sequence.
For example, as shown in FIG. 4, assume the first graph represents the Mel spectrogram of the target audio data, with the ordinate showing the Mel bands, and the second graph shows each detected start time point location, with the ordinate showing the onset detection function baseline (ODF Baseline). Illustratively, the Mel spectrogram shown in FIG. 4 contains the start time point locations of the phonemes, i.e., start time point locations A, B, C, D, E, F, G, H and I, with one start time point location per phoneme; when the computer device predicts start time point locations with the model shown in FIG. 3, it may use the Mel spectrogram of the target audio data to detect the phonemes' start time point locations, thereby detecting start time point locations A, B, C, D, E, G, H and I.
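For comparison, a classical (non-learned) onset detector over a Mel spectrogram is available in librosa; this is a hedged stand-in for the trained model of FIG. 3, not the patented detector itself, and the file path is a placeholder.

```python
import librosa

y, sr = librosa.load('separated/vocals.wav', sr=None, mono=True)

# Onset strength is computed from a Mel spectrogram, then peak picking
# selects the start time points, mirroring the extract-then-pick
# pipeline described above.
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, backtrack=True)
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
```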
S203, determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region.
Here, a singing segment comprises at least one time point location and a corresponding pitch value, and the start time point location of the singing segment corresponds to the start time point location of a phoneme.
It should be noted that the computer device may segment the target pitch curve using the target start time sequence and the target silent region to obtain the one or more singing segments corresponding to the target audio data; alternatively, the target pitch curve may first be adjusted using the target silent region, and the adjusted target pitch curve then segmented to obtain the one or more singing segments, and so on.
Correspondingly, because the target pitch curve is derived by the computer device from the fundamental frequency extraction result, the computer device may first segment the fundamental frequency extraction result of the target audio data to obtain one or more fundamental frequency segmentation results, and then determine a corresponding singing segment from each fundamental frequency segmentation result, yielding the one or more singing segments corresponding to the target audio data. It can be understood that, because the fundamental frequency fluctuates considerably, it needs to be divided into intervals to obtain the interval corresponding to each phoneme; that is, the pitch values of the time point locations in the target pitch curve are relatively volatile, so they need to be divided into intervals to obtain the one or more singing segments. (A segmentation sketch is given below.)
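A minimal sketch of one plausible segmentation follows, reusing the onset frames and silent regions from the earlier sketches: the pitch curve is cut at phoneme onsets and silent-region boundaries, and purely silent spans are dropped. The boundary-handling details are assumptions, since the patent describes the idea rather than a fixed algorithm.

```python
def segment_pitch_curve(pitch_curve, onset_frames, silent_regions):
    """Split the pitch curve into candidate singing segments
    (start_frame, end_frame, pitch_values), cutting at every phoneme
    onset and at every silent-region boundary."""
    boundaries = set(onset_frames)
    for start, end in silent_regions:
        boundaries.update((start, end))
    cuts = [0] + sorted(b for b in boundaries if 0 < b < len(pitch_curve))
    cuts.append(len(pitch_curve))
    segments = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        if any(s <= lo and hi <= e for s, e in silent_regions):
            continue  # the span lies entirely inside a silent region
        segments.append((lo, hi, pitch_curve[lo:hi]))
    return segments
```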
S204, performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating a singing intonation indication file corresponding to the target audio data using the one or more notes.
It can be understood that the singing intonation indication file corresponding to the target audio data may include the one or more notes described above. The singing intonation indication file may be a MIDI file, or a text file containing the one or more notes, and so on, which is not limited in this application. It should be noted that the process by which the computer device generates the singing intonation indication file may also be called singing transcription, which refers to the process of converting the target audio data into a representation of a sequence of notes and pitches (i.e., the singing intonation indication file).
Note that when the computer device performs note transcription processing on each of the one or more singing segments to obtain the one or more notes, it may determine attribute information for each note; optionally, the attribute information of a note may include, but is not limited to, a pitch attribute, a time period attribute, and so on. The pitch attribute of a note represents the note's pitch value, and the time period attribute represents the time period the note occupies. In this case, the singing intonation indication file generated by the computer device from the one or more notes may include the attribute information of each note. (A sketch of transcribing segments into notes and writing them to a MIDI file follows.)
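To make the transcription and file-generation steps concrete, here is a hedged sketch: each segment is quantized to a single note at the median of its pitch values (one simple choice; the patent does not fix the quantization rule), and the notes are serialized with the pretty_midi library (one possible MIDI writer, not one mandated by the method). The 20 ms frame duration is an illustrative assumption.

```python
import numpy as np
import pretty_midi

def transcribe_segment(segment, frame_sec=0.02):
    """Collapse one singing segment into a note: the pitch attribute is
    the median of the segment's pitch values rounded to the nearest
    semitone; the time period attribute comes from its frame span."""
    lo, hi, pitches = segment
    pitch = int(round(float(np.nanmedian(pitches))))
    return pitch, lo * frame_sec, hi * frame_sec

def notes_to_midi(notes, out_path='intonation.mid'):
    """Write (pitch, start_sec, end_sec) tuples out as a MIDI file."""
    midi = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=0)  # instrument choice is arbitrary
    for pitch, start, end in notes:
        inst.notes.append(
            pretty_midi.Note(velocity=100, pitch=pitch, start=start, end=end))
    midi.instruments.append(inst)
    midi.write(out_path)
```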
In order to further explain the beneficial effects of the embodiments of the present application, a manually labeled singing intonation indication file and a singing intonation indication file generated by the Tony software (software that performs note transcription on audio data) are each compared with the singing intonation indication file obtained by the embodiment of the present application, to highlight the accuracy of the audio processing method provided herein. For example, as shown in FIG. 5, display interface 501 shows each note and its pitch value in the manually labeled file, display interface 502 shows each note and its pitch value in the file obtained by the embodiment of the present application, and display interface 503 shows each note and its pitch value in the file generated by the Tony software. The notes in display interface 502 are closer to those in display interface 501; that is, the singing intonation indication file obtained by the embodiment of the present application is closer to the manually labeled file, so the embodiment of the present application yields a singing intonation indication file with higher accuracy.
According to the embodiments of the present application, the start time point location of each phoneme in the target audio data can be detected, and one or more singing segments corresponding to the target audio data can be accurately determined based on the detected target start time sequence, the target pitch curve of the target audio data, and the target silent region in the target audio data. In this way, each singing segment includes at least one time point location and a corresponding pitch value, no silent region appears inside a singing segment, and the accuracy of the singing segments is improved. Moreover, because the start time point location of a singing segment corresponds to the start time point location of a phoneme, one phoneme corresponds to one or more notes when note transcription processing is performed on each singing segment; that is, the notes are divided according to phonemes, which guarantees accurate note segmentation, and an accurate singing intonation indication file is generated from the accurate notes. Since the singing intonation indication file corresponding to the target audio data is generated automatically after a series of processing steps on the target audio data, the generation efficiency of the file can be effectively improved and labor cost can be effectively saved. In addition, because the one or more singing segments are obtained from the target pitch curve of the target audio data, the target pitch curve can be quantized according to the pitch values in each singing segment, so that the singing intonation indication file is obtained by outputting the quantized pitch curve.
Please refer to FIG. 6, which is a schematic flowchart of another audio processing method according to an embodiment of the present application. This audio processing method may be executed by the above-mentioned computer device (a terminal or a server) or jointly by the terminal and the server. For convenience of explanation, the following description takes the computer device executing the audio processing method as an example. Referring to FIG. 6, the audio processing method may include the following steps S601 to S606:
S601, acquiring a target pitch curve of the target audio data, and determining a target silent region in the target audio data.
In a specific implementation, the computer device may acquire the target pitch curve of the target audio data as follows: perform fundamental frequency extraction at each time point location in the target audio data and determine an initial pitch curve of the target audio data from the fundamental frequency extraction result; then perform minimum detection on the initial pitch curve and correct each detected minimum to obtain the target pitch curve of the target audio data. A minimum in the initial pitch curve refers to a pitch value, within a preset duration range, whose difference from a reference pitch value is greater than a pitch difference threshold; the reference pitch value is determined from the pitch value of at least one time point location within the preset duration range. It should be noted that fundamental frequency extraction yields the fundamental frequency value of each time point location in the target audio data; determining the initial pitch curve from the extraction result may then mean calculating the pitch value of each time point location from its fundamental frequency value, or using the fundamental frequency value of each time point location directly as its pitch value (so that every pitch value in the initial pitch curve equals the fundamental frequency value of the corresponding time point location), and so on.
Optionally, the reference pitch value may be the average of the pitch values of at least one time point location within the preset duration range, or the maximum of those pitch values, and so on, which is not limited in this application. Correspondingly, for minimum detection the computer device may slide a window along the initial pitch curve by a sliding step, the window representing the preset duration range. After each slide, the pitch values of the time point locations currently inside the window are taken as target pitch values, and a current reference pitch value is determined from at least one target pitch value; if the difference between the smallest target pitch value and the current reference pitch value is greater than the pitch difference threshold, that smallest target pitch value is identified as a minimum in the initial pitch curve. Optionally, the pitch difference threshold may be set empirically or computed from the target pitch values, e.g., 15 or 20 semitones; a semitone is the pitch difference between two adjacent scale steps. (A sliding-window sketch follows.)
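A minimal sketch of this sliding-window detection, assuming the window maximum serves both as the reference pitch value and as the correction value (one of the options discussed in the surrounding paragraphs); the window size, step, and threshold are illustrative.

```python
import numpy as np

def detect_minima(initial_pitch_curve, win=10, step=1, pitch_diff=15):
    """Return (index, correction_value) pairs: a window's minimum is
    flagged when it lies more than pitch_diff semitones below the
    window's reference (here: maximum) pitch value."""
    minima = []
    for start in range(0, len(initial_pitch_curve) - win + 1, step):
        window = initial_pitch_curve[start:start + win]
        reference = float(np.max(window))  # current reference pitch value
        idx = int(np.argmin(window))
        if reference - window[idx] > pitch_diff:
            minima.append((start + idx, reference))
    return minima
```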
It should be noted that the computer device may use the average of the at least one target pitch value as the current reference pitch value, or the maximum of the at least one target pitch value, which is not limited in this application. Here, the at least one target pitch value may be the pitch values of all time point locations currently inside the sliding window (i.e., all target pitch values), or all target pitch values except the minimum, and so on.
In this case, each time a minimum is determined from the initial pitch curve, the computer device may determine a pitch correction value for the current minimum from the target pitch values in the sliding window in which the current minimum was found, and record that pitch correction value. Correspondingly, when correcting the detected minima in the initial pitch curve to obtain the target pitch curve of the target audio data, the computer device corrects each detected minimum to its recorded pitch correction value. Optionally, each time a minimum is determined from the initial pitch curve, the computer device may instead correct the current minimum with the determined pitch correction value immediately.
Optionally, when determining the pitch correction value of the current minimum value from the target pitch values in the sliding window in which the current minimum value was determined, the computer device may use the maximum of the target pitch values as the pitch correction value, or the average of the target pitch values, and so on.
For example, assume the target pitch values determined by the computer device are 17, 25, 66, 20 and 16 semitones, that the at least one target pitch value comprises all of them, and that the current reference pitch value is their maximum; the computer device may then determine the current reference pitch value to be 66. Assuming the pitch difference threshold is 15 semitones, the difference between the minimum target pitch value (16) and the current reference pitch value is 50 semitones, which is greater than the pitch difference threshold, so the computer device may determine the minimum target pitch value to be a minimum value in the initial pitch curve, i.e. the current minimum value. Further, assuming the computer device uses the maximum target pitch value as the pitch correction value of the current minimum value, it may record that pitch correction value as 66 and correct the current minimum value to 66.
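To make the sliding-window procedure above concrete, the following is a minimal sketch in Python. The window size, step, 15-semitone threshold, and the choices of the window maximum as both reference pitch value and pitch correction value are illustrative assumptions drawn from the example, not mandated values.

```python
def correct_minima(pitch_values, window_size=5, step=1, diff_threshold=15.0):
    # pitch_values: one pitch value (in semitones) per time point
    corrected = list(pitch_values)
    for start in range(0, max(1, len(corrected) - window_size + 1), step):
        window = corrected[start:start + window_size]
        reference = max(window)  # one permitted choice of reference pitch value
        low = min(window)
        if reference - low > diff_threshold:
            # correct the detected minimum immediately to the correction value
            # (here the window maximum, one of the permitted choices)
            corrected[start + window.index(low)] = reference
    return corrected

# The worked example above: minimum 16 deviates from reference 66 by 50 > 15,
# so it is corrected to 66.
print(correct_minima([17, 25, 66, 20, 16]))  # [17, 25, 66, 20, 66]
```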
It should be noted that the computer device may also correct the pitch value of every time point within the preset duration range in which each minimum value lies. In this case, each time a minimum value is determined from the initial pitch curve, the computer device may determine a pitch correction value for each target pitch value in the sliding window in which the current minimum value was determined, and record those pitch correction values; correspondingly, the computer device may correct each target pitch value to its recorded pitch correction value. Optionally, each time a minimum value is determined from the initial pitch curve, the computer device may immediately correct each target pitch value to its corresponding pitch correction value.
Further, the implementation by which the computer device determines the target silent region in the target audio data may include: performing silent zone detection on the target audio data to obtain one or more initial silent regions in the target audio data; identifying the valid silent regions among the one or more initial silent regions according to the region length of each initial silent region; and taking each identified valid silent region as a target silent region in the target audio data. Here, a valid silent region is a silent region whose region length is greater than a length threshold. Optionally, the length threshold may be set by the computer device, or as desired by the target object, such as 50 milliseconds or 10 seconds (s). It is understood that, in units of frames, the length threshold may also be, for example, 20 or 25 frames; a frame may include the audio data of one or more time points, that is, a frame may refer to a time period, for example 20 milliseconds (ms) or 23 ms. Accordingly, assuming a frame is 20 ms, a length threshold of 20 frames corresponds to a length threshold of 0.4 seconds.
For example, assume the one or more initial silent regions include initial silent region A, initial silent region B and initial silent region C, with region lengths of 12 seconds, 5 seconds and 20 seconds respectively, and that the length threshold is 7 seconds. Since the region lengths of initial silent regions A and C are greater than the length threshold, the computer device may identify the valid silent regions among the one or more initial silent regions as initial silent region A and initial silent region C, and may therefore take initial silent region A and initial silent region C as target silent regions in the target audio data.
As another example, assuming the one or more initial silent regions include initial silent region A, initial silent region B and initial silent region C, with region lengths of 25 frames, 5 frames and 30 frames respectively, and that the length threshold is 20 frames, the computer device may identify initial silent region A and initial silent region C as valid silent regions, thereby obtaining the target silent regions.
It is to be understood that the process by which the computer device determines the target silent region in the target audio data may also include: performing silent zone detection on the target audio data to obtain one or more initial silent regions in the target audio data; identifying the invalid silent regions among the one or more initial silent regions according to the region length of each initial silent region; and taking each initial silent region other than the identified invalid silent regions as a target silent region in the target audio data. Here, an invalid silent region is a silent region whose region length is less than or equal to the length threshold. In other words, the computer device may obtain the target silent regions by excluding the shorter silent regions among the one or more initial silent regions.
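Filtering the initial silent regions then reduces to a length comparison. Here is a minimal sketch, assuming each region is an (onset, offset) pair in seconds, with the 7-second threshold taken from the example above.

```python
def select_target_silent_regions(initial_regions, length_threshold=7.0):
    # keep only the valid silent regions, i.e. those longer than the threshold
    return [(start, end) for (start, end) in initial_regions
            if end - start > length_threshold]

# Regions A, B and C with lengths 12 s, 5 s and 20 s: only A and C survive.
print(select_target_silent_regions([(0, 12), (20, 25), (30, 50)]))
```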
S602, detecting the start time point of each phoneme in the target audio data to obtain an initial start time sequence, wherein the initial start time sequence comprises the start time point of each phoneme.
For example, assuming the target audio data includes phoneme A, phoneme B and phoneme C, when the computer device detects the start time points of the phonemes in the target audio data, it may obtain the start time point of phoneme A, the start time point of phoneme B and the start time point of phoneme C, thereby obtaining the initial start time sequence; it is understood that the initial start time sequence then includes the start time points of phonemes A, B and C.
S603, according to the time interval between adjacent starting time points in the initial starting time sequence, one or more invalid starting time points are determined from the initial starting time sequence.
It should be noted that an invalid start time point is a start time point whose time interval from the previous start time point is less than an interval threshold. The interval threshold may be set empirically, such as 10 seconds or 6 seconds, or may be calculated from the time intervals between the start time points. Optionally, the interval threshold may also be expressed in frames, such as 20 or 12 frames; in this case, assuming the interval threshold is 20 frames and one frame is 20 milliseconds, the interval threshold corresponds to 0.4 seconds.
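As a quick sanity check on the frame-to-seconds conversion, a minimal sketch assuming 20 ms frames:

```python
frames, frame_ms = 20, 20
print(frames * frame_ms / 1000.0)  # 0.4 seconds for a 20-frame threshold
```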
In one embodiment, in determining one or more invalid start time points from the initial start time sequence, the computer device may traverse the initial start time sequence starting from its second start time point, taking the currently traversed start time point as the current start time point; that is, the computer device may traverse, in order, every start time point in the initial start time sequence except the first. Further, the computer device may determine the start time point preceding the current start time point in the initial start time sequence; if the time interval between the current start time point and the previous start time point is less than the interval threshold, the current start time point is taken as an invalid start time point. It is to be understood that if the time interval between the current start time point and the previous start time point is greater than or equal to the interval threshold, the computer device may take the current start time point as a valid start time point.
For example, as shown in fig. 7a, when the computer device determines the previous start time point within the initial start time sequence, assume the initial start time sequence includes start time points A, B, C, D and E; the computer device may determine that the start time point preceding B is A, the one preceding C is B, the one preceding D is C, and the one preceding E is D. Correspondingly, assume the time interval between B and A is 5 seconds, between C and B is 6 seconds, between D and C is 10 seconds, between E and D is 7 seconds, and that the interval threshold is 8 seconds. In this case, since the intervals between B and A, between C and B, and between E and D are each smaller than the interval threshold, the computer device may determine that start time points B, C and E are all invalid start time points in the initial start time sequence; that is, the one or more invalid start time points determined by the computer device include start time points B, C and E. It should be noted that fig. 7a only shows an initial start time sequence and a target start time sequence by way of example, which is not limiting; for instance, the number of start time points in the target start time sequence may also equal the number in the initial start time sequence, i.e. the time interval between every start time point and its predecessor is greater than or equal to the interval threshold, in which case the computer device may use the initial start time sequence as the target start time sequence.
In another embodiment, in determining one or more invalid start time points from the initial start time sequence, the computer device may take the first start time point in the initial start time sequence as a valid start time point; then traverse the initial start time sequence starting from its second start time point, taking the currently traversed start time point as the current start time point, and determine the start time point preceding the current start time point among the valid start time points determined so far. Further, if the time interval between the current start time point and that previous start time point is less than the interval threshold, the computer device takes the current start time point as an invalid start time point; if the interval is greater than or equal to the interval threshold, the computer device takes the current start time point as a valid start time point.
As another example, as shown in fig. 7b, when the computer device determines the previous start time point among the valid start time points, assume the initial start time sequence includes start time points A, B, C, D and E, with a time interval of 5 seconds between B and A, 6 seconds between C and B, 10 seconds between D and C, and 7 seconds between E and D. Correspondingly, assuming the interval threshold is 8 seconds: the start time point preceding B is A, and since the interval between B and A is less than the interval threshold, the computer device may determine B to be an invalid start time point. Next, since the valid start time points determined so far include only A, the start time point preceding C is A; the interval between C and A (11 seconds) is greater than the interval threshold, so the computer device may take C as a valid start time point. The valid start time points now include A and C, so the start time point preceding D is C; the interval between D and C is greater than the interval threshold, so the computer device may take D as a valid start time point. The valid start time points now include A, C and D, so the start time point preceding E is D; the interval between E and D is less than the interval threshold, so the computer device may determine E to be an invalid start time point. It will be appreciated that the one or more invalid start time points determined by the computer device from the initial start time sequence include start time points B and E.
Similarly, fig. 7b only shows the initial start time sequence and the target start time sequence by way of example, which is not limited in this application; for example, the target start time sequence may be the same as the initial start time sequence.
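Both traversal variants can be sketched in a few lines. In the sketch below, the onsets are placed so that the gaps are the 5, 6, 10 and 7 seconds of the figures, and the 8-second interval threshold is the one assumed above; the function names are illustrative.

```python
def filter_onsets_v1(onsets, interval_threshold=8.0):
    # variant 1: compare each onset with its predecessor in the initial sequence
    kept = [onsets[0]]
    for prev, cur in zip(onsets, onsets[1:]):
        if cur - prev >= interval_threshold:
            kept.append(cur)
    return kept

def filter_onsets_v2(onsets, interval_threshold=8.0):
    # variant 2: compare each onset with the most recent valid onset
    kept = [onsets[0]]  # the first start time point is always valid
    for cur in onsets[1:]:
        if cur - kept[-1] >= interval_threshold:
            kept.append(cur)
    return kept

onsets = [0.0, 5.0, 11.0, 21.0, 28.0]  # start time points A, B, C, D, E
print(filter_onsets_v1(onsets))  # [0.0, 21.0]: B, C and E invalid (fig. 7a)
print(filter_onsets_v2(onsets))  # [0.0, 11.0, 21.0]: B and E invalid (fig. 7b)
```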
S604, deleting each invalid starting time point from the initial starting time sequence to obtain a target starting time sequence.
It is understood that the target start time sequence may include: all start time points in the initial start time sequence except for each invalid start time point.
For example, as shown in fig. 7a, when the computer device determines the previous start time point within the initial start time sequence, it may determine that the invalid start time points in the initial start time sequence include start time points B, C and E, and may then delete start time points B, C and E from the initial start time sequence to obtain the target start time sequence. It is understood that the computer device may use all start time points in the initial start time sequence other than B, C and E (i.e. start time points A and D) as the start time points of the target start time sequence, thereby obtaining the target start time sequence.
For another example, as shown in fig. 7b, when the computer device determines the previous start time point among the valid start time points, it may determine that the invalid start time points in the initial start time sequence include start time points B and E, and may then delete start time points B and E from the initial start time sequence to obtain the target start time sequence. It is understood that the computer device may use all start time points in the initial start time sequence other than B and E (i.e. start time points A, C and D) as the start time points of the target start time sequence, thereby obtaining the target start time sequence.
And S605, determining one or more singing segments corresponding to the target audio data based on the target starting time sequence, the target pitch curve and the target silent region.
A singing segment comprises at least one time point and the corresponding pitch value, and the start time point of a singing segment corresponds to the start time point of a phoneme. It can be understood that the target object usually applies various vocal techniques while singing, such as pitch inflections, vibolo-style vibrato and slides (portamento), which cause the fundamental frequency to fluctuate considerably; that is, the fundamental frequency values at the time points in the target audio data may fluctuate widely, so the computer device needs to perform segmentation processing to obtain the one or more singing segments corresponding to the target audio data.
In one embodiment, the number of target silent regions is M, M being a positive integer. The process by which the computer device determines the one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve and the target silent regions may include: determining the region start point of each of the M target silent regions, and the time distance between each start time point in the target start time sequence and each determined region start point; then, according to the determined time distances, determining the adjacent silent region corresponding to each start time point among the M target silent regions, where the adjacent silent region corresponding to any start time point is the target silent region whose region start point lies after that start time point and has the smallest time distance to it; and segmenting the target pitch curve based on each start time point and the region start point of its adjacent silent region to obtain Q curve segments, each of which is determined to be one singing segment corresponding to the target audio data, yielding Q singing segments. The start point of the i-th curve segment is the i-th start time point in the target start time sequence, and its end point is the region start point corresponding to the i-th start time point; i ∈ [1, Q], and Q equals the number of start time points in the target start time sequence.
For example, as shown in fig. 8a, assume M is 4 and the M target silent regions are target silent region 1 (the 2-6 second region), target silent region 2 (10-15 seconds), target silent region 3 (24-28 seconds) and target silent region 4 (33-40 seconds), and that the target start time sequence includes start time points A, B and C. The computer device may determine the time distance between each of start time points A, B and C and the region start point of each target silent region. Since the region start point lying after start time point A with the smallest time distance belongs to target silent region 1, the adjacent silent region corresponding to A is target silent region 1; likewise, the adjacent silent region corresponding to B is target silent region 2, and the adjacent silent region corresponding to C is target silent region 3. Further, the computer device may segment the target pitch curve based on each start time point and the region start point of its adjacent silent region, obtaining 3 curve segments (i.e. Q is 3): curve segment 1 runs from start time point A to the region start point of target silent region 1; curve segment 2 from start time point B to the region start point of target silent region 2; and curve segment 3 from start time point C to the region start point of target silent region 3. The computer device may then determine each of the 3 curve segments to be one singing segment corresponding to the target audio data, obtaining 3 singing segments.
It should be noted that fig. 8a only shows the start time points in the target start time sequence and the target silent regions by way of example, which is not limiting. For example, start time point B may coincide with the region end point of target silent region 1, i.e. start time point B may be the region end point of target silent region 1; the number of start time points in the target start time sequence may also equal M; and the dashed lines in fig. 8a are only used to highlight the region extents and need not appear in an actual display, and so on.
It can be understood that, by merging the boundaries of the target silent regions with the start time sequence, the computer device can obtain the target start time sequence together with an end time sequence of the phonemes, the end time points in the end time sequence corresponding one-to-one to the start time points in the target start time sequence; that is, the computer device may obtain a start time point and an end time point for each of the Q curve segments, each end time point being the region start point corresponding to the respective start time point.
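A minimal sketch of this first segmentation variant, assuming onsets and regions are given in seconds and using the fig. 8a values; the helper name is illustrative.

```python
def segment_by_adjacent_silence(onsets, silent_regions):
    segments = []
    for onset in onsets:
        # adjacent silent region: the region start after the onset
        # with the smallest time distance, i.e. the earliest such start
        later_starts = [start for (start, end) in silent_regions if start > onset]
        if later_starts:
            segments.append((onset, min(later_starts)))
    return segments

silent_regions = [(2, 6), (10, 15), (24, 28), (33, 40)]
print(segment_by_adjacent_silence([1, 7, 20], silent_regions))
# [(1, 2), (7, 10), (20, 24)] -> curve segments 1, 2 and 3
```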
In another embodiment, the number of target silent regions is again M, M being a positive integer. The process by which the computer device determines the one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve and the target silent regions may include: adjusting the pitch value of every time point lying within the M target silent regions in the target pitch curve to zero, obtaining an adjusted target pitch curve; then segmenting the adjusted target pitch curve based on each start time point in the target start time sequence to obtain Q curve segments; and determining each of the Q curve segments to be one singing segment corresponding to the target audio data, yielding Q singing segments. The start point of the i-th curve segment is the i-th start time point in the target start time sequence, and its end point is the zero-adjusted time point that lies after the i-th start time point and is closest to it; i ∈ [1, Q], and Q equals the number of start time points in the target start time sequence.
It is understood that the curve of the adjusted target pitch curve in any target silence region is: a line segment overlapping the abscissa.
For example, assuming that the value of M is 3, and the M target silent regions include target silent region 1, target silent region 2, and target silent region 3, the computer device may adjust the pitch value of each time point in the target pitch curve in target silent region 1 to zero, adjust the pitch value of each time point in the target pitch curve in target silent region 2 to zero, and adjust the pitch value of each time point in the target pitch curve in target silent region 3 to zero, thereby obtaining an adjusted target pitch curve, that is, the pitch values of the adjusted target pitch curve at any time point in target silent region 1, target silent region 2, and target silent region 3 are all 0.
Furthermore, a target silent region may be represented by a zero-value sequence, i.e. a sequence of zero values in which each zero corresponds to one time point. The process by which the computer device adjusts the pitch values of the time points within the M target silent regions to zero to obtain the adjusted target pitch curve may then include: multiplying the M target silent regions with the target pitch curve, the multiplication setting the pitch value of each time point within the M target silent regions to zero, thereby obtaining the adjusted target pitch curve.
It is understood that the number of time points in any target silent region equals the number of zeros in the zero-value sequence representing it; that is, the computer device multiplying the M target silent regions with the target pitch curve means multiplying the pitch value of each time point within the M target silent regions by a zero value.
For example, as shown in fig. 8b, assume M is 3 and the M target silent regions are target silent region 1 (the 2-6 second region), target silent region 2 (10-15 seconds) and target silent region 3 (24-28 seconds); a zero-value sequence may then be used to represent each of them. Assuming the zero-value sequence representing target silent region 1 is zero-value sequence 1 and target silent region 1 includes 3 time points, zero-value sequence 1 includes 3 zeros; assuming zero-value sequence 2 represents target silent region 2, which includes 4 time points, zero-value sequence 2 includes 4 zeros; and assuming zero-value sequence 3 represents target silent region 3, which includes 2 time points, zero-value sequence 3 includes 2 zeros. Further, the computer device may multiply each target silent region with the target pitch curve, i.e. multiply the corresponding zero-value sequences (zero-value sequences 1, 2 and 3) with the target pitch curve, to obtain the adjusted target pitch curve; the pitch value of the adjusted target pitch curve at any time point within target silent regions 1, 2 and 3 is 0.
Further, as shown in fig. 8b, assume the target start time sequence includes start time points A, B and C. When segmenting the adjusted target pitch curve based on each start time point in the target start time sequence, the computer device may determine that the zero-adjusted time point lying after start time point A and closest to it is time point 1 (the time point at the 2nd second), the one lying after start time point B and closest to it is time point 2 (the 10th second), and the one lying after start time point C and closest to it is time point 3 (the 24th second); the adjusted target pitch curve is thus divided into 3 curve segments based on each start time point and the corresponding adjusted time point. The start point of curve segment 1 is start time point A and its end point is time point 1; the start point of curve segment 2 is start time point B and its end point is time point 2; the start point of curve segment 3 is start time point C and its end point is time point 3. The computer device may then determine each of the 3 curve segments to be one singing segment corresponding to the target audio data, obtaining 3 singing segments.
It should be noted that fig. 8b only shows the target start time sequence, the target pitch curve, the adjusted target pitch curve and so on by way of example, which is not limiting. For example, the adjusted target pitch curve may also be a continuous curve; any start time point other than the first (start time point A) may coincide with the region end point of the target silent region corresponding to the previous start time point, i.e. that start time point may be the region end point of the target silent region corresponding to the previous start time point; and the dashed lines in fig. 8b are only used to highlight the region extents and need not appear in an actual display, and so on.
It should be noted that the computer device may also use a target adjustment sequence to set the pitch values of the time points within the M target silent regions in the target pitch curve to zero, obtaining the adjusted target pitch curve. The target adjustment sequence is a sequence of 0s and 1s in which each 0 corresponds to a time point within some target silent region and each 1 corresponds to a time point of the target pitch curve outside the M target silent regions; that is, the sequence values of the target adjustment sequence correspond one-to-one to the time points of the target pitch curve, the sequence value for any time point within a target silent region being 0 and the sequence value for any other time point being 1. In this case, adjusting the pitch values with the target adjustment sequence proceeds as follows: for any time point, if it lies within a target silent region, its pitch value is set to zero, i.e. multiplied by 0; otherwise its pitch value is left unchanged, i.e. multiplied by 1.
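The mask-based variant amounts to an element-wise multiplication; a minimal sketch follows, using frame indices as time points. The use of numpy here is an implementation convenience, not part of the method.

```python
import numpy as np

def zero_silent_regions(target_pitch_curve, silent_index_ranges):
    # build the target adjustment sequence: 0 inside silent regions, 1 elsewhere
    mask = np.ones(len(target_pitch_curve))
    for start, end in silent_index_ranges:
        mask[start:end] = 0.0
    return target_pitch_curve * mask  # the element-wise multiplication

curve = np.array([20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0])
print(zero_silent_regions(curve, [(3, 5)]))
# [20. 21. 22.  0.  0. 25. 26. 27.] -> pitch is zero inside the silent region
```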
S606, conducting note transcription processing on each singing segment in the one or more singing segments to obtain one or more notes; and generating a singing intonation indication file corresponding to the target audio data by adopting one or more musical notes.
Specifically, the process of subjecting each of the one or more singing segments to the note transcription processing by the computer device to obtain one or more notes may include the following steps s11-s13:
s11, for any singing segment, the computer device may perform inflection detection on the singing segment according to its pitch values; here, an inflection refers to a time point at which an abrupt pitch change is required during singing.
Optionally, an inflection may also be understood as a singing technique by which the target object performs an abrupt pitch change while singing; it can be understood that if an inflection exists in a singing segment, there exists in that segment a time point at which an abrupt pitch change is required during singing. It is to be noted that even after the fundamental frequency values (i.e. pitch values) have been divided according to the target start time sequence and the target silent regions, the fundamental frequency within one region (i.e. one singing segment) can still fluctuate greatly, especially where an inflection occurs. The fundamental frequency curve (i.e. pitch curve) therefore cannot simply be averaged region by region; that is, the note corresponding to a singing segment cannot be determined directly from the mean of the pitch values of its time points, and the segment should instead be assigned at least one note according to whether an inflection exists within it.
It should be noted that, in performing inflection detection on any singing segment according to its pitch values, the computer device may select one time point from the singing segment as a reference time point and divide the segment into a first segment and a second segment based on that reference time point; the mean of the pitch values of the time points in the first segment is then taken as a first average pitch value, and the mean of those in the second segment as a second average pitch value. In this case, if the difference between the first and second average pitch values is greater than an inflection decision threshold, the reference time point is determined to be a time point at which an abrupt pitch change is required during singing, and the singing segment is determined to contain an inflection; if the difference is less than or equal to the inflection decision threshold, the singing segment is determined to contain no inflection.
Optionally, the computer device may use the midpoint of the time points in the singing segment as the reference time point, or any time point in the segment, or any time point other than the first and last, which is not limited in this application. Specifically, when using the midpoint of the time points as the reference time point, the computer device may determine the number of time points included in the singing segment, then compute the middle value from that number and use the corresponding time point as the reference time point. Optionally, if the computed middle value is a floating-point value, i.e. not an integer, the computer device may round it up or down, and so on, and use the rounded middle value as the reference time point.
It is noted that the computer device may assign the reference time point to the first segment, i.e. the first segment may include the reference time point, so that the end point of the first segment is the reference time point and the start point of the second segment is the time point following it; or the computer device may assign the reference time point to the second segment, i.e. the second segment may include the reference time point, so that the end point of the first segment is the time point preceding the reference time point and the start point of the second segment is the reference time point, and so on.
It should be noted that the inflection decision threshold may be set empirically; since an inflection generally occurs within a phoneme whose two pitch values differ by more than 1.5 semitones, the inflection decision threshold may be 1.5 semitones or 1 semitone, etc. Optionally, the inflection decision threshold may also be calculated from the pitch values of the time points in the singing segment, for example as the mean of those pitch values, which is not limited in this application.
For example, assume a singing segment includes, in order, time points A, B, C, D and E, and that the reference time point is time point C; the computer device may divide the segment into a first segment and a second segment based on time point C. Further, assuming the reference time point is assigned to the first segment, the first segment includes time points A, B and C and the second segment includes time points D and E; the computer device may then take the mean of the pitch values at A, B and C as the first average pitch value, and the mean of those at D and E as the second average pitch value. Correspondingly, if the difference between the first and second average pitch values is 3 semitones and the inflection decision threshold is 1.5 semitones, the difference exceeds the threshold, so the computer device may determine time point C to be a time point at which an abrupt pitch change is required during singing, and determine that the singing segment contains an inflection; if the difference is 1 semitone with the same threshold, the difference is below the threshold and the computer device may determine that the segment contains no inflection.
It can be understood that if the reference time point is the midpoint of the time points in the singing segment, and the computer device determined the initial pitch curve by taking the fundamental frequency value of each time point in the target audio data as its pitch value, then the computer device can determine, interval by interval, whether the difference between the mean pitch values (i.e. fundamental frequency values) of the first and second halves of the interval exceeds the inflection decision threshold (e.g. 1.5 semitones); that is, it can determine, singing segment by singing segment, whether the difference between the mean pitch values of the first and second halves exceeds the threshold. If it does, the singing segment is considered to contain an inflection; if not, it is considered to contain none.
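A minimal sketch of this inflection test, assuming the reference time point is the midpoint (rounded up), assigned to the first segment, and that a segment has at least two time points; the 1.5-semitone decision threshold follows the text above.

```python
def has_inflection(segment_pitches, decision_threshold=1.5):
    if len(segment_pitches) < 2:
        return False                            # nothing to split
    mid = (len(segment_pitches) + 1) // 2       # midpoint index, rounded up
    first, second = segment_pitches[:mid], segment_pitches[mid:]
    first_mean = sum(first) / len(first)        # first average pitch value
    second_mean = sum(second) / len(second)     # second average pitch value
    return abs(first_mean - second_mean) > decision_threshold

# The example above: halves averaging 60 and 63 semitones differ by 3 > 1.5,
# so an inflection is detected.
print(has_inflection([60.0, 60.0, 60.0, 63.0, 63.0]))  # True
```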
s12, if an inflection exists in the singing segment, dividing the segment into two sub-segments based on that inflection, and performing note transcription processing on each sub-segment to obtain two notes.
It is understood that the computer device may divide the singing segment into two sub-segments based on the time point at which the abrupt pitch change is required during singing. It should be noted that the computer device may assign the inflection (i.e. that time point) to the first sub-segment, in which case the end point of the first sub-segment is the inflection; or to the second sub-segment, in which case the start point of the second sub-segment is the inflection, which is not limited in this application.
Specifically, the computer device may perform the note transcription processing on each sub-segment as follows: averaging the pitch values of the time points in each sub-segment to obtain the average pitch value of that sub-segment, and assigning each sub-segment a note based on its average pitch value and its singing time period. It is to be understood that when performing note transcription on each sub-segment, the computer device may assign each sub-segment a note and assign attributes to each note, that is, determine the attribute information of each note.
In this case, if the two sub-segments are the first segment and the second segment respectively, the computer device may assign the first segment a note whose pitch value is the average pitch value of the first segment and whose time period is the singing time period of the first segment; that is, the pitch attribute of the note corresponding to the first segment may be the first segment's average pitch value, and its time period attribute the first segment's singing time period. Correspondingly, the computer device may assign the second segment a note whose pitch value is the average pitch value of the second segment and whose time period is the singing time period of the second segment; that is, the pitch attribute of the note corresponding to the second segment may be the second segment's average pitch value, and its time period attribute the second segment's singing time period.
For example, assume the computer device divides a singing segment into sub-segment A and sub-segment B based on the inflection, i.e. the first segment is sub-segment A and the second segment is sub-segment B. Assuming the first segment includes time points A, B and C and the second segment includes time points D and E, the computer device may take the mean of the pitch values at time points A, B and C as the average pitch value of the first segment and assign the first segment a note based on that value and the first segment's singing time period, so that the pitch value of the note corresponding to the first segment is the mean of the pitch values at time points A, B and C; correspondingly, the computer device may take the mean of the pitch values at time points D and E as the average pitch value of the second segment and assign the second segment a note accordingly, so that the pitch value of the note corresponding to the second segment is the mean of the pitch values at time points D and E.
It is understood that when an inflection exists in a singing segment, the computer device may add two notes to the singing intonation indication file for that segment, that is, the generated singing intonation indication file may include the two notes corresponding to that segment.
s13, if no inflection exists in the singing segment, performing note transcription processing on the singing segment to obtain one note.
It should be noted that the computer device may perform note transcription processing on such a singing segment as follows: selecting a target number of time points from the segment's time points in descending time order; averaging the pitch values of the selected time points to obtain a mean result; and assigning the segment a note based on the mean result and the segment's singing time period. In this case, the computer device may add one note to the singing intonation indication file for the segment, i.e. the generated singing intonation indication file may contain the one note corresponding to that segment.
Optionally, the target number may be set by the computer device, or as desired by the user, such as 60 or 78. It can be understood that the computer device may determine the target number before selecting the time points; it may obtain the value of the target number directly, or it may first determine a target percentage and multiply the number of time points in the singing segment by that percentage to obtain the target number, and so on.
For example, assuming the target percentage is 80% and the singing segment includes 60 time points, the computer device may determine the target number to be 48. In this case, selecting the target number of time points in descending time order and averaging their pitch values means: selecting the last 80% of the time points in the segment and averaging their pitch values. Correspondingly, if the computer device obtained the initial pitch curve by taking the fundamental frequency value of each time point in the initial audio data as its pitch value, then for such a singing segment, i.e. a segment in which no inflection was detected, the computer device may average the last 80% of the available pitch values (fundamental frequency values) to obtain the pitch value of the segment, i.e. the pitch value of the corresponding note.
It should be noted that, when performing note transcription on a singing segment to obtain a note, the computer device may also select the target number of time points in ascending time order, or select them at random from the segment's time points; the specific way of selecting the target number of time points is not limited in this application.
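Steps s11-s13 can be combined into one small routine: a segment with an inflection yields two notes pitched at the sub-segment means, and a segment without one yields a single note averaged over the last 80% of its pitch values. The (pitch, start, end) note triple, the 80% target percentage, the midpoint split, and the assumption of at least two time points per segment are all illustrative choices, not mandated ones.

```python
def transcribe_segment(pitches, times, decision_threshold=1.5, keep_ratio=0.8):
    mid = (len(pitches) + 1) // 2
    first, second = pitches[:mid], pitches[mid:]
    first_mean = sum(first) / len(first)
    second_mean = sum(second) / len(second)
    if abs(first_mean - second_mean) > decision_threshold:
        # inflection present: one note per sub-segment (s12)
        return [(first_mean, times[0], times[mid - 1]),
                (second_mean, times[mid], times[-1])]
    # no inflection: average the last keep_ratio share of pitch values (s13)
    target_number = max(1, int(len(pitches) * keep_ratio))
    tail = pitches[-target_number:]
    return [(sum(tail) / len(tail), times[0], times[-1])]

print(transcribe_segment([60.0, 60.0, 60.0, 63.0, 63.0],
                         [0.0, 0.1, 0.2, 0.3, 0.4]))
# [(60.0, 0.0, 0.2), (63.0, 0.3, 0.4)] -> two notes for the inflected segment
```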
The embodiment of the present application can detect the start time point of each phoneme in the target audio data to obtain an initial start time sequence, and delete each invalid start time point (i.e. each start time point whose time interval from the previous start time point is less than the interval threshold) from it to obtain a target start time sequence, which effectively reduces the influence of closely spaced start time points on the determination of the singing segments. In addition, each minimum value in the initial pitch curve of the target audio data can be corrected so as to smooth the pitch values within the preset duration range of each minimum value; the shorter silent regions among the initial silent regions can be discarded to obtain the target silent regions, i.e. only the valid silent regions whose region length exceeds the length threshold are retained, which effectively reduces the occurrence of overly short singing segments and avoids overly short notes; moreover, since the embodiment takes inflections into account, the accuracy of the notes, and thereby of the singing intonation indication file, can be improved.
In order to better understand the audio processing method shown in fig. 6 in the embodiment of the present application, the following further describes the audio processing method proposed in the embodiment of the present application in detail:
referring to fig. 9a, a computer device may first acquire initial audio data, which may include one or more phonemes sung by a target object, one or more accompaniments, and so on; that is, the initial audio data may include audio data in a plurality of tracks, where the plurality of tracks includes a target track corresponding to a target object. Then, the computer device may perform sound source separation on the initial audio data to obtain target audio data, where the target audio data refers to audio data in a target track corresponding to the target object, and the target audio data includes one or more phonemes sung by the target object.
Further, the computer device may perform pitch detection on each time point location in the target audio data to obtain an initial pitch curve, where the initial pitch curve includes pitch values of each time point location; performing silent zone detection on the target audio data to obtain an initial silent zone of the target audio data; and detecting the initial time point of each phoneme in the target audio data to obtain an initial time sequence, wherein the initial time sequence comprises the initial time point of each phoneme.
In this case, the computer device may post-process the initial pitch curve, the initial silent region and the initial start time sequence to obtain a singing intonation indication file, and output it; here, outputting the singing intonation indication file may mean displaying each note in the file together with its pitch value and so on, in ascending order of time point. The specific post-processing flow applied to the initial pitch curve, the initial silent region and the initial start time sequence is shown in fig. 9b.
Specifically, when post-processing the initial pitch curve, the initial silent region and the initial start time sequence, the computer device may do so through a preprocessing module and a note-adding module, which may also be referred to as a note quantization module. In this case, the computer device may first preprocess the initial pitch curve, the initial start time sequence and the initial silent region through the preprocessing module; then, through the note-adding module, determine one or more singing segments based on the preprocessing result and assign a note to each singing segment, thereby adding the notes corresponding to the singing segments to the singing intonation indication file; that is, the singing intonation indication file is generated from the notes and includes each of them.
In preprocessing the initial pitch curve, the initial start time sequence and the initial silent region through the preprocessing module, the computer device may correct, i.e. remove, the minima in the initial pitch curve to obtain the target pitch curve; delete from the initial silent regions those whose region length is less than or equal to the length threshold and take the remaining silent regions as target silent regions, that is, take the valid silent regions whose region length exceeds the length threshold as target silent regions; and delete from the initial start time sequence the start time points whose time interval from the previous start time point is less than the interval threshold, i.e. delete each invalid start time point, to obtain the target start time sequence. Further, the computer device may, through the preprocessing module, adjust the target pitch curve using the target silent regions, each target silent region being a zero-value sequence; that is, it may multiply the target pitch curve with the target silent regions to obtain the adjusted target pitch curve, in which the pitch value of any time point within any target silent region is zero.
Correspondingly, in determining one or more singing segments based on the preprocessing result and assigning notes to them through the note-adding module, the computer device may segment the adjusted target pitch curve based on the target start time sequence so as to extract the singing segments whose mean pitch is non-zero (i.e. the one or more singing segments corresponding to the target audio data), perform inflection detection on each singing segment, and assign each singing segment a note according to the inflection detection result.
It should be noted that fig. 9a only shows an exemplary flow of the audio processing method, and the present application is not limited thereto. For example, in the above flow, after acquiring the initial audio data, the computer device may separate the target audio data from the initial audio data; in other embodiments, however, the computer device may use the initial audio data as the target audio data. For another example, after the computer device generates the singing intonation instruction file, the computer device may perform singing intonation detection on the target object according to the obtained singing intonation instruction file without outputting the singing intonation instruction file. For another example, when performing pitch detection on each time point in the target audio data, the computer device may obtain a target pitch curve of the target audio data, perform post-processing on the target pitch curve, the initial silent region, and the initial start time sequence, and so on.
Accordingly, Fig. 9b merely shows an exemplary way of post-processing the initial pitch curve, the initial start time sequence, and the initial silent regions and outputting the singing intonation indication file, which does not limit the present application. For example, the computer device may instead determine, based on each target silent region, the region start point corresponding to each start time point in the target start time sequence, and segment the target pitch curve according to each start time point and its corresponding region start point, so as to obtain the one or more singing segments corresponding to the target audio data. For another example, the computer device may segment the adjusted target pitch curve based on the target start time sequence within the preprocessing module itself, so that the preprocessing result may also refer to the one or more singing segments corresponding to the target audio data. For another example, when pitch detection at each time point in the target audio data directly yields the target pitch curve, the inputs of the preprocessing module are the target pitch curve, the initial start time sequence, and the initial silent regions, and so on.
Practice shows that the audio processing method provided by the embodiments of the present application offers at least the following beneficial effects. First, each minimum value in the initial pitch curve can be corrected, smoothing the pitch values within the preset duration range around each minimum. Second, deleting the silent regions whose region length is less than or equal to the length threshold and the start time points whose time interval is smaller than the interval threshold yields the target silent regions and the target start time sequence, which effectively reduces the occurrence of overly short notes. Third, based on the detected target start time sequence, the target pitch curve of the target audio data, and the target silent regions in the target audio data, silent regions can be kept out of the singing segments, improving their accuracy and thereby producing a relatively accurate singing intonation indication file. Fourth, the singing intonation indication file corresponding to the target audio data can be generated automatically, improving its generation efficiency.
Based on the description of the foregoing embodiments of the audio processing method, an embodiment of the present application further provides an audio processing apparatus, which may be a computer program (including program code) running in a computer device. The audio processing apparatus may perform the audio processing method shown in Fig. 2 or Fig. 6; referring to Fig. 10, the audio processing apparatus may operate as follows:
a processing unit 1001, configured to acquire a target pitch curve of target audio data and determine a target silent region in the target audio data, where the target audio data includes one or more phonemes sung by a target object, and one phoneme includes one or more time points;
the processing unit 1001 is further configured to detect the start time point of each phoneme in the target audio data to obtain a target start time sequence;
the processing unit 1001 is further configured to determine one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, where one singing segment includes at least one time point and its corresponding pitch value, and the start time point of one singing segment corresponds to the start time point of one phoneme;
the processing unit 1001 is further configured to perform note transcription processing on each of the one or more singing segments to obtain one or more notes;
a generating unit 1002, configured to generate, using the one or more notes, a singing intonation indication file corresponding to the target audio data.
In one embodiment, the number of target silent regions is M, where M is a positive integer; when determining the one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, the processing unit 1001 may be specifically configured to:
determine the region start point of each of the M target silent regions and the time distance between each start time point in the target start time sequence and each determined region start point;
determine, according to the determined time distances, the adjacent silent region corresponding to each start time point among the M target silent regions, where the adjacent silent region corresponding to any start time point is the target silent region whose region start point lies after that start time point with the minimum time distance to it;
segment the target pitch curve based on each start time point and the region start point of its adjacent silent region to obtain Q curve segments, and determine each of the Q curve segments as a singing segment corresponding to the target audio data, thereby obtaining Q singing segments;
wherein the start point of the i-th curve segment is the i-th start time point in the target start time sequence, and its end point is the region start point corresponding to the i-th start time point; i ∈ [1, Q], and the value of Q equals the number of start time points in the target start time sequence.
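A minimal sketch of this segmentation, assuming start time points and region start points are given as frame indices and the pitch curve supports slicing; the function name and the fallback to the curve end are assumptions:

```python
def segment_by_adjacent_regions(pitch, onsets, region_starts):
    """For the i-th start time point, end the i-th segment at the region start
    point of its adjacent silent region: the nearest region start lying after
    that start time point."""
    segments = []
    for t in onsets:
        later = [s for s in region_starts if s > t]   # region starts after the onset
        end = min(later) if later else len(pitch)     # assumption: fall back to curve end
        segments.append(pitch[t:end])
    return segments
```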
In another embodiment, the number of target silent regions is M, where M is a positive integer; when determining the one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, the processing unit 1001 may be specifically configured to:
adjust the pitch value at each time point within the M target silent regions in the target pitch curve to zero, obtaining an adjusted target pitch curve;
segment the adjusted target pitch curve based on each start time point in the target start time sequence to obtain Q curve segments, and determine each of the Q curve segments as a singing segment corresponding to the target audio data, thereby obtaining Q singing segments;
wherein the start point of the i-th curve segment is the i-th start time point in the target start time sequence, and its end point is the adjusted (zero-valued) time point that lies after the i-th start time point and is closest to it; i ∈ [1, Q], and the value of Q equals the number of start time points in the target start time sequence.
In another embodiment, a target silent region is represented by a zero-value sequence, i.e., a sequence composed of a plurality of zero values, each zero value corresponding to one time point; when adjusting the pitch value at each time point within the M target silent regions in the target pitch curve to zero to obtain the adjusted target pitch curve, the processing unit 1001 may be specifically configured to:
multiply the M target silent regions with the target pitch curve, the multiplication setting the pitch value at each time point within the M target silent regions in the target pitch curve to zero, thereby obtaining the adjusted target pitch curve.
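A sketch of this multiplication-based adjustment and the subsequent segmentation, assuming the target pitch curve is a NumPy array and each target silent region is given as a (start, end) pair of frame indices; in this representation the zero-value sequences are expanded into a single frame-wise mask:

```python
import numpy as np

def mask_and_segment(pitch, silent_regions, onsets):
    """Zero the pitch curve inside the target silent regions via multiplication,
    then cut one segment per start time point, ending at the nearest zero-valued
    time point after it."""
    mask = np.ones(len(pitch))
    for start, end in silent_regions:                  # each region as (start, end) frames
        mask[start:end] = 0.0                          # the region's zero-value sequence
    adjusted = np.asarray(pitch, dtype=float) * mask   # the multiplication operation
    segments = []
    for t in onsets:
        end = t
        while end < len(adjusted) and adjusted[end] != 0:
            end += 1                                   # stop at the first zeroed time point
        segments.append(adjusted[t:end])
    return adjusted, segments
```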
In another embodiment, when detecting the start time point of each phoneme in the target audio data to obtain the target start time sequence, the processing unit 1001 may be specifically configured to:
detect the start time point of each phoneme in the target audio data to obtain an initial start time sequence, where the initial start time sequence includes the start time point of each phoneme;
determine one or more invalid start time points from the initial start time sequence according to the time intervals between adjacent start time points in the initial start time sequence, where an invalid start time point is a start time point whose time interval from the previous start time point is smaller than the interval threshold;
delete each invalid start time point from the initial start time sequence to obtain the target start time sequence.
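A minimal sketch of this filtering; the interval threshold value is an assumption, and comparing each start time point against the previously retained one (rather than the previous raw one) is a design choice:

```python
def filter_onsets(onsets, interval_threshold=0.05):
    """Delete invalid start time points: those closer than interval_threshold
    (seconds, assumed value) to the previously retained start time point."""
    kept = []
    for t in onsets:                                  # ascending start time points
        if not kept or t - kept[-1] >= interval_threshold:
            kept.append(t)                            # valid start time point
    return kept                                       # the target start time sequence
```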
In another embodiment, when determining the target silent regions in the target audio data, the processing unit 1001 may be specifically configured to:
perform silent region detection on the target audio data to obtain one or more initial silent regions in the target audio data;
identify the valid silent regions among the one or more initial silent regions according to the region length of each initial silent region, where a valid silent region is a silent region whose region length is greater than the length threshold;
take each identified valid silent region as a target silent region in the target audio data.
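A minimal sketch, assuming silent regions are given as (start, end) pairs in seconds and a length threshold of 0.1 s (an assumed value):

```python
def filter_silent_regions(regions, length_threshold=0.1):
    """Keep only the valid silent regions: those whose region length exceeds
    the length threshold (seconds, assumed value)."""
    return [(s, e) for (s, e) in regions if (e - s) > length_threshold]
```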
In another embodiment, when acquiring the target pitch curve of the target audio data, the processing unit 1001 may be specifically configured to:
perform fundamental frequency extraction at each time point in the target audio data, and determine an initial pitch curve of the target audio data according to the fundamental frequency extraction result;
perform minimum detection on the initial pitch curve, and correct each detected minimum in the initial pitch curve to obtain the target pitch curve of the target audio data;
wherein a minimum in the initial pitch curve is a pitch value within a preset duration range whose difference from the reference pitch value is greater than the pitch difference threshold, the reference pitch value being determined from the pitch value of at least one time point within that preset duration range.
In another embodiment, when performing minimum detection on the initial pitch curve, the processing unit 1001 may be specifically configured to:
slide a sliding window over the initial pitch curve according to a sliding step, where the sliding window represents the preset duration range;
after each slide of the window, take the pitch values of all time points of the initial pitch curve currently inside the sliding window as target pitch values, and determine a current reference pitch value from at least one of those target pitch values;
if the difference between the smallest of the target pitch values and the current reference pitch value is greater than the pitch difference threshold, determine that smallest target pitch value as a minimum in the initial pitch curve.
In another embodiment, the processing unit 1001 may be further configured to:
each time a minimum is determined in the initial pitch curve, determine a pitch correction value for the current minimum from the target pitch values in the sliding window in which it was detected, and record the pitch correction value of the current minimum;
correspondingly, correcting each detected minimum in the initial pitch curve to obtain the target pitch curve of the target audio data includes:
correcting each detected minimum in the initial pitch curve to its recorded corresponding pitch correction value, obtaining the target pitch curve of the target audio data.
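Taken together, the sliding-window detection and correction might look like the following sketch. The window length, sliding step, and pitch difference threshold are assumed values, and using the median of the voiced pitch values in the window as both the reference pitch value and the pitch correction value is one possible instantiation of the computations the embodiments leave open:

```python
import numpy as np

def correct_minima(pitch, win=25, step=5, pitch_diff_threshold=30.0):
    """Detect minima in the initial pitch curve with a sliding window and
    correct each one to a recorded pitch correction value (here: the window
    median of voiced frames, an assumed choice). Values are Hz; window and
    step are frame counts; all parameter values are assumptions."""
    out = np.asarray(pitch, dtype=float).copy()
    for start in range(0, max(len(out) - win + 1, 1), step):
        window = out[start:start + win]
        voiced = window[window > 0]              # ignore silent (zero-pitch) frames
        if voiced.size == 0:
            continue
        ref = float(np.median(voiced))           # current reference pitch value
        idx = int(np.argmin(np.where(window > 0, window, np.inf)))
        if ref - window[idx] > pitch_diff_threshold:
            out[start + idx] = ref               # correct the minimum in place
    return out
```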
In another embodiment, when performing note transcription processing on each of the one or more singing segments to obtain one or more notes, the processing unit 1001 may be specifically configured to:
for any singing segment, perform pitch-transition detection on the singing segment according to each pitch value in it, where a pitch transition is a time point at which the pitch value changes abruptly during singing;
if a pitch transition exists in the singing segment, divide the singing segment into two sub-segments at the transition, and perform note transcription processing on each sub-segment to obtain two notes;
if no pitch transition exists in the singing segment, perform note transcription processing on the whole singing segment to obtain one note.
In another embodiment, when performing note transcription processing on each sub-segment to obtain two notes, the processing unit 1001 may be specifically configured to:
average the pitch values of all time points in each sub-segment to obtain the average pitch value of each sub-segment;
assign a note to each sub-segment based on the average pitch value of each sub-segment and the singing time period of each sub-segment.
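One concrete way to assign a note from an average pitch value and a singing time period is the standard Hz-to-MIDI mapping (A4 = 440 Hz = MIDI note 69); the returned note layout is illustrative only:

```python
import math

def assign_note(avg_pitch_hz, start_s, end_s):
    """Map an average pitch value to the nearest MIDI note number; the note
    keeps the sub-segment's singing time period as its duration."""
    midi = round(69 + 12 * math.log2(avg_pitch_hz / 440.0))   # A4 = 440 Hz = MIDI 69
    return {"pitch": midi, "onset": start_s, "duration": end_s - start_s}

# e.g. a sub-segment averaging 261.6 Hz from 1.0 s to 1.5 s -> middle C (MIDI 60)
print(assign_note(261.6, 1.0, 1.5))
```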
In another embodiment, when performing note transcription processing on any singing segment to obtain one note, the processing unit 1001 may be specifically configured to:
select a target number of time points from the singing segment in descending order of time;
average the pitch values of the selected target number of time points to obtain a mean result, and assign a note to the singing segment based on the mean result and the singing time period of the singing segment.
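A sketch of this variant, reading "descending order of time" as taking the latest time points in the segment; the target number is an assumed value, and the Hz-to-MIDI mapping is the same as in the previous sketch:

```python
import math

def transcribe_whole_segment(pitches_hz, times_s, target_number=10):
    """Average the pitch values of the last `target_number` time points (the
    latest ones), then assign a single note spanning the whole segment's
    singing time period."""
    tail = pitches_hz[-target_number:]                 # latest time points first
    avg = sum(tail) / len(tail)                        # mean operation result
    midi = round(69 + 12 * math.log2(avg / 440.0))     # same Hz->MIDI mapping as above
    return {"pitch": midi, "onset": times_s[0], "duration": times_s[-1] - times_s[0]}
```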
In another embodiment, when performing pitch-transition detection on any singing segment according to each pitch value in it, the processing unit 1001 may be specifically configured to:
select a time point from the singing segment as a reference time point, and divide the singing segment into a first segment and a second segment based on the reference time point;
take the average of the pitch values of the time points in the first segment as the first average pitch value, and the average of the pitch values of the time points in the second segment as the second average pitch value;
if the difference between the first average pitch value and the second average pitch value is greater than the transition decision threshold, determine the reference time point as a time point at which the pitch value changes abruptly during singing, and determine that a pitch transition exists in the singing segment;
if the difference between the first average pitch value and the second average pitch value is less than or equal to the transition decision threshold, determine that no pitch transition exists in the singing segment.
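A sketch of this detection. The embodiments only require selecting one reference time point, so scanning every candidate split, using the absolute difference, and the threshold value are all assumptions here:

```python
def detect_pitch_transition(pitches_hz, decision_threshold=25.0):
    """Return the index of a detected pitch transition, or None if the two
    average pitch values never differ by more than the threshold (Hz, assumed)."""
    for i in range(1, len(pitches_hz)):                # each candidate reference time point
        first, second = pitches_hz[:i], pitches_hz[i:]
        avg1 = sum(first) / len(first)                 # first average pitch value
        avg2 = sum(second) / len(second)               # second average pitch value
        if abs(avg1 - avg2) > decision_threshold:
            return i                                   # abrupt pitch-value change point
    return None
```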
According to an embodiment of the present application, the steps involved in the method shown in Fig. 2 or Fig. 6 may be performed by the units of the audio processing apparatus shown in Fig. 10. For example, steps S201 to S203 shown in Fig. 2 may all be performed by the processing unit 1001 shown in Fig. 10, and step S204 may be performed jointly by the processing unit 1001 and the generating unit 1002. As another example, steps S601 to S605 shown in Fig. 6 may all be performed by the processing unit 1001, step S606 may be performed jointly by the processing unit 1001 and the generating unit 1002, and so on.
According to another embodiment of the present application, the units of the audio processing apparatus shown in Fig. 10 may be individually or wholly combined into one or several other units, or one (or more) of them may be further split into multiple functionally smaller units, without affecting the achievement of the technical effects of the embodiments of the present application. The units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the audio processing apparatus may likewise include other units, and in practical applications these functions may be realized with the assistance of, or through the cooperation of, multiple other units.
According to another embodiment of the present application, the audio processing apparatus shown in Fig. 10 may be constructed, and the audio processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the method shown in Fig. 2 or Fig. 6 on a general-purpose computing device, such as a computer, that includes a processing element such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer storage medium, and loaded into and run on the computing device described above via that computer storage medium.
In the embodiments of the present application, the start time point of each phoneme in the target audio data can be detected, and one or more singing segments corresponding to the target audio data can be accurately determined based on the detected target start time sequence, the target pitch curve of the target audio data, and the target silent regions in the target audio data. In this way, each singing segment includes at least one time point and its corresponding pitch value, silent regions are kept out of the singing segments, and the accuracy of the singing segments is improved. Moreover, since the start time point of one singing segment corresponds to the start time point of one phoneme, note transcription maps each phoneme to one or more notes; notes are thus divided along phoneme boundaries, ensuring precise note division, and an accurate singing intonation indication file is generated from accurate notes. Because the singing intonation indication file corresponding to the target audio data is generated automatically after this series of processing, the generation efficiency of the file is effectively improved and labor cost is saved. In addition, because the one or more singing segments are obtained from the target pitch curve of the target audio data, the target pitch curve can be quantized according to the pitch values within each singing segment, and the singing intonation indication file can then be output based on the quantized pitch curve.
Based on the descriptions of the method and apparatus embodiments above, an embodiment of the present application further provides a computer device. Referring to Fig. 11, the computer device comprises at least a processor 1101, an input interface 1102, an output interface 1103, and a computer storage medium 1104, which may be connected by a bus or other means.
The computer storage medium 1104 may be stored in a memory of the computer device and is used to store a computer program comprising program instructions; the processor 1101 is used to execute the program instructions stored in the computer storage medium 1104. The processor 1101 (or CPU) is the computing and control core of the computer device, adapted to load and execute one or more instructions so as to implement the corresponding method flows or functions. In one embodiment, the processor 1101 of the embodiment of the present application may be configured to perform a series of audio processing operations, specifically: acquiring a target pitch curve of target audio data, and determining a target silent region in the target audio data, where the target audio data includes one or more phonemes sung by a target object and one phoneme includes one or more time points; detecting the start time point of each phoneme in the target audio data to obtain a target start time sequence; determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve, and the target silent region, where one singing segment includes at least one time point and its corresponding pitch value, and the start time point of one singing segment corresponds to the start time point of one phoneme; performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating, using the one or more notes, a singing intonation indication file corresponding to the target audio data.
An embodiment of the present application further provides a computer storage medium, which is a memory device in a computer device used to store programs and data. The computer storage medium here may include a built-in storage medium of the computer device and an extended storage medium supported by the computer device. The computer storage medium provides storage space that stores the operating system of the computer device, as well as one or more instructions suitable for loading and execution by the processor, which may be one or more computer programs (including program code). The computer storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory; optionally, it may also be at least one computer storage medium located remotely from the processor. In one embodiment, the processor may load and execute a computer program in the computer storage medium to perform the method steps of the audio processing method embodiments shown in Fig. 2 or Fig. 6.
According to an aspect of the present application, a computer program product or computer program is also provided, comprising computer instructions stored in a computer storage medium. A processor of a computer device reads the computer instructions from the computer storage medium and executes them, causing the computer device to perform the method provided in the various optional implementations of the audio processing method embodiments shown in Fig. 2 or Fig. 6.
It should be understood that the above-described embodiments are merely preferred embodiments of the present invention and do not limit its scope; the scope of the invention is defined by the appended claims.

Claims (15)

1. An audio processing method, comprising:
acquiring a target pitch curve of target audio data, and determining a target silent region in the target audio data, wherein the target audio data comprises one or more phonemes sung by a target object, and one phoneme comprises one or more time points;
detecting the start time point of each phoneme in the target audio data to obtain a target start time sequence;
determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve and the target silent region, wherein one singing segment comprises at least one time point and a corresponding pitch value, and the start time point of one singing segment corresponds to the start time point of one phoneme;
performing note transcription processing on each of the one or more singing segments to obtain one or more notes; and generating, using the one or more notes, a singing intonation indication file corresponding to the target audio data.
2. The method according to claim 1, wherein the number of target silent regions is M, M being a positive integer, and the determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve and the target silent region comprises:
determining the region start point of each of the M target silent regions and the time distance between each start time point in the target start time sequence and each determined region start point;
determining, according to the determined time distances, the adjacent silent region corresponding to each start time point among the M target silent regions, wherein the adjacent silent region corresponding to any start time point is the target silent region whose region start point lies after that start time point with the minimum time distance to it;
segmenting the target pitch curve based on each start time point and the region start point of its adjacent silent region to obtain Q curve segments, and determining each of the Q curve segments as a singing segment corresponding to the target audio data, thereby obtaining Q singing segments;
wherein the start point of the i-th curve segment is the i-th start time point in the target start time sequence, and its end point is the region start point corresponding to the i-th start time point; i ∈ [1, Q], and the value of Q equals the number of start time points in the target start time sequence.
3. The method according to claim 1, wherein the number of target silent regions is M, M being a positive integer, and the determining one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve and the target silent region comprises:
adjusting the pitch value at each time point within the M target silent regions in the target pitch curve to zero to obtain an adjusted target pitch curve;
segmenting the adjusted target pitch curve based on each start time point in the target start time sequence to obtain Q curve segments, and determining each of the Q curve segments as a singing segment corresponding to the target audio data, thereby obtaining Q singing segments;
wherein the start point of the i-th curve segment is the i-th start time point in the target start time sequence, and its end point is the adjusted (zero-valued) time point that lies after the i-th start time point and is closest to it; i ∈ [1, Q], and the value of Q equals the number of start time points in the target start time sequence.
4. The method of claim 3, wherein a target silent region is represented by a zero-value sequence, the zero-value sequence being a sequence composed of a plurality of zero values, each zero value corresponding to one time point;
and the adjusting the pitch value at each time point within the M target silent regions in the target pitch curve to zero to obtain an adjusted target pitch curve comprises:
multiplying the M target silent regions with the target pitch curve, the multiplication setting the pitch value at each time point within the M target silent regions in the target pitch curve to zero, thereby obtaining the adjusted target pitch curve.
5. The method according to any one of claims 1 to 4, wherein the detecting the start time point of each phoneme in the target audio data to obtain a target start time sequence comprises:
detecting the start time point of each phoneme in the target audio data to obtain an initial start time sequence, wherein the initial start time sequence comprises the start time point of each phoneme;
determining one or more invalid start time points from the initial start time sequence according to the time intervals between adjacent start time points in the initial start time sequence, wherein an invalid start time point is a start time point whose time interval from the previous start time point is smaller than the interval threshold;
deleting each invalid start time point from the initial start time sequence to obtain the target start time sequence.
6. The method of any of claims 1-4, wherein the determining a target silent region in the target audio data comprises:
performing silent region detection on the target audio data to obtain one or more initial silent regions in the target audio data;
identifying the valid silent regions among the one or more initial silent regions according to the region length of each initial silent region, wherein a valid silent region is a silent region whose region length is greater than the length threshold;
taking each identified valid silent region as a target silent region in the target audio data.
7. The method of any of claims 1-4, wherein the acquiring a target pitch curve of target audio data comprises:
performing fundamental frequency extraction at each time point in the target audio data, and determining an initial pitch curve of the target audio data according to the fundamental frequency extraction result;
performing minimum detection on the initial pitch curve, and correcting each detected minimum in the initial pitch curve to obtain the target pitch curve of the target audio data;
wherein a minimum in the initial pitch curve is a pitch value within a preset duration range whose difference from the reference pitch value is greater than the pitch difference threshold, the reference pitch value being determined from the pitch value of at least one time point within the preset duration range.
8. The method of claim 7, wherein the performing minimum detection on the initial pitch curve comprises:
sliding a sliding window over the initial pitch curve according to a sliding step, wherein the sliding window represents the preset duration range;
after each slide of the sliding window, taking the pitch values of all time points of the initial pitch curve currently inside the sliding window as target pitch values, and determining a current reference pitch value from at least one of the target pitch values;
if the difference between the smallest of the target pitch values and the current reference pitch value is greater than the pitch difference threshold, determining that smallest target pitch value as a minimum in the initial pitch curve.
9. The method of claim 8, further comprising:
each time a minimum is determined in the initial pitch curve, determining a pitch correction value for the current minimum from the target pitch values in the sliding window in which it was detected, and recording the pitch correction value of the current minimum;
wherein the correcting each detected minimum in the initial pitch curve to obtain the target pitch curve of the target audio data comprises:
correcting each detected minimum in the initial pitch curve to its recorded corresponding pitch correction value, obtaining the target pitch curve of the target audio data.
10. The method according to any one of claims 1-4, wherein the performing note transcription processing on each of the one or more singing segments to obtain one or more notes comprises:
for any singing segment, performing pitch-transition detection on the singing segment according to each pitch value in the singing segment, wherein a pitch transition is a time point at which the pitch value changes abruptly during singing;
if a pitch transition exists in the singing segment, dividing the singing segment into two sub-segments at the pitch transition, and performing note transcription processing on each sub-segment to obtain two notes;
if no pitch transition exists in the singing segment, performing note transcription processing on the whole singing segment to obtain one note.
11. The method of claim 10, wherein the performing note transcription processing on each sub-segment to obtain two notes comprises:
averaging the pitch values of all time points in each sub-segment to obtain the average pitch value of each sub-segment;
assigning a note to each sub-segment based on the average pitch value of each sub-segment and the singing time period of each sub-segment.
12. The method of claim 10, wherein the performing note transcription processing on any singing segment to obtain one note comprises:
selecting a target number of time points from the singing segment in descending order of time;
averaging the pitch values of the selected target number of time points to obtain a mean result, and assigning a note to the singing segment based on the mean result and the singing time period of the singing segment.
13. The method of claim 10, wherein the performing pitch-transition detection on any singing segment according to each pitch value in the singing segment comprises:
selecting a time point from the singing segment as a reference time point, and dividing the singing segment into a first segment and a second segment based on the reference time point;
taking the average of the pitch values of the time points in the first segment as a first average pitch value, and the average of the pitch values of the time points in the second segment as a second average pitch value;
if the difference between the first average pitch value and the second average pitch value is greater than the transition decision threshold, determining the reference time point as a time point at which the pitch value changes abruptly during singing, and determining that a pitch transition exists in the singing segment;
if the difference between the first average pitch value and the second average pitch value is less than or equal to the transition decision threshold, determining that no pitch transition exists in the singing segment.
14. An audio processing apparatus, comprising:
a processing unit, configured to acquire a target pitch curve of target audio data and determine a target silent region in the target audio data, wherein the target audio data comprises one or more phonemes sung by a target object, and one phoneme comprises one or more time points;
the processing unit is further configured to detect the start time point of each phoneme in the target audio data to obtain a target start time sequence;
the processing unit is further configured to determine one or more singing segments corresponding to the target audio data based on the target start time sequence, the target pitch curve and the target silent region, wherein one singing segment comprises at least one time point and a corresponding pitch value, and the start time point of one singing segment corresponds to the start time point of one phoneme;
the processing unit is further configured to perform note transcription processing on each of the one or more singing segments to obtain one or more notes;
and a generating unit, configured to generate, using the one or more notes, a singing intonation indication file corresponding to the target audio data.
15. A computer device comprising a processor and a memory, wherein the memory is configured to store a computer program which, when executed by the processor, causes the computer device to perform the method of any one of claims 1-13.
CN202210061154.6A 2022-01-19 2022-01-19 Audio processing method, device and equipment Active CN114078464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061154.6A CN114078464B (en) 2022-01-19 2022-01-19 Audio processing method, device and equipment

Publications (2)

Publication Number Publication Date
CN114078464A true CN114078464A (en) 2022-02-22
CN114078464B CN114078464B (en) 2022-03-22

Family

ID=80284730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061154.6A Active CN114078464B (en) 2022-01-19 2022-01-19 Audio processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN114078464B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1703734A (en) * 2002-10-11 2005-11-30 松下电器产业株式会社 Method and apparatus for determining musical notes from sounds
CN102568456A (en) * 2011-12-23 2012-07-11 深圳市万兴软件有限公司 Notation recording method and a notation recording device based on humming input
JP2014095856A (en) * 2012-11-12 2014-05-22 Yamaha Corp Speech processing device
CN104992712A (en) * 2015-07-06 2015-10-21 成都云创新科技有限公司 Music reorganization-based music score automatic formation method
CN106205571A (en) * 2016-06-24 2016-12-07 腾讯科技(深圳)有限公司 A kind for the treatment of method and apparatus of singing voice
US20190392807A1 (en) * 2018-06-21 2019-12-26 Casio Computer Co., Ltd. Electronic musical instrument, electronic musical instrument control method, and storage medium
CN113140230A (en) * 2021-04-23 2021-07-20 广州酷狗计算机科技有限公司 Method, device and equipment for determining pitch value of note and storage medium

Also Published As

Publication number Publication date
CN114078464B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
Stein et al. Automatic detection of audio effects in guitar and bass recordings
US9892758B2 (en) Audio information processing
Clarisse et al. An Auditory Model Based Transcriber of Singing Sequences.
CN103597543A (en) Semantic audio track mixer
WO2015114216A2 (en) Audio signal analysis
CN111161695B (en) Song generation method and device
CN102456342A (en) Audio processing apparatus and method, and program
CN106302987A (en) A kind of audio frequency recommends method and apparatus
US20240087558A1 (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
KR20190108027A (en) Method, system and non-transitory computer-readable recording medium for generating music associated with a video
CN112420015A (en) Audio synthesis method, device, equipment and computer readable storage medium
US8612031B2 (en) Audio player and audio fast-forward playback method capable of high-speed fast-forward playback and allowing recognition of music pieces
CN112669811B (en) Song processing method and device, electronic equipment and readable storage medium
KR102072627B1 (en) Speech synthesis apparatus and method thereof
CN114078464B (en) Audio processing method, device and equipment
JP2006178334A (en) Language learning system
CN115331648A (en) Audio data processing method, device, equipment, storage medium and product
CN111429878B (en) Self-adaptive voice synthesis method and device
CN113781989A (en) Audio animation playing and rhythm stuck point identification method and related device
CN110232911B (en) Singing following recognition method and device, storage medium and electronic equipment
CN112164387A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
JP3803301B2 (en) Summary section determination method, summary information providing method, apparatus using these methods, and program
US11609948B2 (en) Music streaming, playlist creation and streaming architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant