CN105895079B - Voice data processing method and device - Google Patents

Voice data processing method and device

Info

Publication number
CN105895079B
Authority
CN
China
Prior art keywords
acoustic feature
voice data
feature information
processed
music score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510926346.9A
Other languages
Chinese (zh)
Other versions
CN105895079A (en)
Inventor
刘方宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhirong Innovation Technology Development Co., Ltd.
Original Assignee
Tianjin Zhirong Innovation Technology Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhirong Innovation Technology Development Co., Ltd.
Priority to CN201510926346.9A
Publication of CN105895079A
Application granted
Publication of CN105895079B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters

Abstract

An embodiment of the invention provides a method and a device for processing voice data. The processing method includes the following steps: acquiring voice data to be processed; extracting corresponding acoustic feature information from the voice data to be processed; and searching a pre-stored reference acoustic feature score table according to the acoustic feature information to obtain the music score corresponding to the voice data to be processed. With the embodiments of the invention, the music score of the voice data can be acquired quickly, the transmissibility of the music score is enhanced, and the user experience is improved.

Description

Voice data processing method and device
Technical Field
The present invention relates to computer technologies, and in particular, to a method and an apparatus for processing voice data.
Background
With the popularization of the internet and advances in audio and video technology, people's daily entertainment has become increasingly rich; for example, people can sing at KTV, or sing for online audiences via live video streaming, and so on.
Music can be enjoyed and can cultivate a person's temperament, so many people love it. Music consists not only of lyrics but also of a music score; the music score is the carrier that accurately records music, a regular combination of written symbols recording the pitch and rhythm of a piece. The music score is an essential component of music.
However, people who have not studied music know only the lyrics; they cannot read or write a music score. A new musical idea that flashes through a user's mind is quickly forgotten, so the user can only record a few phrases with a recording device. This approach spreads the music poorly and gives a poor user experience.
Disclosure of Invention
The invention aims to provide a method of setting voice data to a music score, and a device implementing the method, which obtain the music score corresponding to voice data to be processed based on acoustic feature information extracted from that voice data, thereby quickly acquiring the music score of the voice data, enhancing the transmissibility of the music score, and improving the user experience.
According to one aspect of the present invention, a method for processing voice data is provided. The processing method comprises: acquiring voice data to be processed; extracting corresponding acoustic feature information from the voice data to be processed; and searching a pre-stored reference acoustic feature score table according to the acoustic feature information to obtain the music score corresponding to the voice data to be processed.
Preferably, searching the pre-stored reference acoustic feature score table according to the acoustic feature information and obtaining the music score corresponding to the voice data to be processed includes: searching the pre-stored reference acoustic feature score table for a reference acoustic feature information range value according to the acoustic feature information; and taking the music score corresponding to the found reference acoustic feature information range value as the music score corresponding to the voice data to be processed.
Preferably, the processing method further comprises: outputting the voice data to be processed and the acquired music score.
Preferably, extracting corresponding acoustic feature information from the voice data to be processed includes: dividing the voice data to be processed into a plurality of data segments of preset duration according to the sampling time of the voice data to be processed, and extracting corresponding acoustic feature information from each data segment.
Preferably, the reference acoustic feature score table includes scale, pitch, chromatic-scale and/or long-note information.
According to another aspect of the present invention, an apparatus for processing voice data is provided. The processing apparatus comprises: a voice data acquisition module, configured to acquire voice data to be processed; an acoustic feature acquisition module, configured to extract corresponding acoustic feature information from the voice data to be processed acquired by the voice data acquisition module; and a music score acquisition module, configured to search a pre-stored reference acoustic feature score table according to the acoustic feature information acquired by the acoustic feature acquisition module and obtain the music score corresponding to the voice data to be processed.
Preferably, the music score acquisition module includes: an information search unit, configured to search the pre-stored reference acoustic feature score table for a reference acoustic feature information range value according to the acoustic feature information acquired by the acoustic feature acquisition module; and a music score acquisition unit, configured to take the music score corresponding to the reference acoustic feature information range value found by the information search unit as the music score corresponding to the voice data to be processed.
Preferably, the processing apparatus further comprises a music score output module, configured to output the voice data to be processed and the acquired music score.
Preferably, the acoustic feature acquisition module is configured to divide the voice data to be processed into a plurality of data segments of preset duration according to the sampling time of the voice data to be processed acquired by the voice data acquisition module, and to extract corresponding acoustic feature information from each data segment.
Preferably, the reference acoustic feature score table includes scale, pitch, chromatic-scale and/or long-note information.
According to the voice data processing method and device provided by the embodiments of the invention, corresponding acoustic feature information is extracted from the acquired voice data to be processed, a pre-stored reference acoustic feature score table is searched according to the acoustic feature information, and the music score corresponding to the voice data to be processed is obtained. The music score of the voice data can thus be acquired quickly, the transmissibility of the music score is enhanced, and the user experience is improved.
Drawings
Fig. 1 is a flowchart illustrating a method of processing voice data according to a first embodiment of the present invention;
Fig. 2 is an exemplary diagram illustrating the home-page display interface of an application for voice data processing;
Fig. 3 is a flowchart illustrating a method of processing voice data according to a second embodiment of the present invention;
Fig. 4 is an exemplary diagram illustrating the home-page display interface of the voice data processing application containing a music score;
Fig. 5 is a logic block diagram of a voice data processing apparatus according to a third embodiment of the present invention;
Fig. 6 is another logic block diagram of the voice data processing apparatus according to the third embodiment of the present invention;
Fig. 7 is a further logic block diagram of the voice data processing apparatus according to the third embodiment of the present invention.
Detailed Description
The technical scheme can be applied to voice data processing scenarios such as a recording studio or live online video. Corresponding acoustic feature information is extracted from the acquired voice data to be processed, a pre-stored reference acoustic feature score table is searched according to the acoustic feature information, and the music score corresponding to the voice data to be processed is obtained; the music score of the voice data can thus be acquired quickly, the transmissibility of the music score is enhanced, and the user experience is improved.
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a flowchart illustrating the method of processing voice data according to the first embodiment of the present invention. The method is performed by a computer system that includes the processing apparatus shown in Fig. 5.
Referring to Fig. 1, voice data to be processed is acquired in step S110.
A terminal device may have an application for processing voice data installed. When a user wants to set a song or tune sung by himself or by another user to a music score, he can click the application's shortcut icon; the terminal device then starts the application and displays its home page. As shown in Fig. 2, the home page may include a microphone icon, a voice input box, an output box, a help icon, and the like. The microphone icon has an activated state and an inactivated state: when the user clicks the icon, the terminal device turns on the microphone and collects the voice data the user inputs through it, and the icon is in the activated state; if the user inputs no voice data within a preset time, the terminal device may turn off the microphone, and the icon is in the inactivated state. The voice input box may display an icon for the voice data input by the user, or its text, so that the user can check whether the voice data collected by the terminal device is accurate; the output box may display the data obtained by processing the voice data. After the terminal device displays the home page of the application, the microphone can be turned on (the icon is then in the activated state); the user can point the terminal device's microphone toward the person singing the song or tune, and the terminal device collects the voice data input by the user, i.e., the voice data to be processed. The home page may further include a confirm key: the user can click it after input is complete, and the terminal device takes the voice data collected by the microphone as the voice data to be processed. Alternatively, a receiving-duration threshold may be preset; when the time since the user stopped inputting reaches that threshold, the voice data input before the stop is taken as the voice data to be processed.
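For illustration only (this sketch is not part of the original disclosure), the two ways of finalizing the voice data to be processed described above, a confirm key and a preset receiving-duration threshold, could be combined roughly as follows; the mic interface, chunk format, and threshold value are assumptions:

```python
# Hypothetical capture loop: stop on an explicit confirm, or when silence
# lasts as long as the preset receiving-duration threshold.
import time

RECEIVE_TIMEOUT_S = 2.0  # assumed receiving-duration threshold

def capture_voice(mic, confirm_pressed):
    """Collect audio chunks until the user confirms or stays silent too long."""
    chunks = []
    last_input = time.monotonic()
    while True:
        chunk = mic.read()  # assumed interface: a bytes chunk, or None if silent
        now = time.monotonic()
        if chunk is not None:
            chunks.append(chunk)
            last_input = now
        if confirm_pressed():                      # user clicked the confirm key
            break
        if now - last_input >= RECEIVE_TIMEOUT_S:  # silence reached the threshold
            break
    if not chunks:
        raise RuntimeError("voice data reception failed")  # prompt re-input
    return b"".join(chunks)

class FakeMic:
    """Stand-in for a real microphone driver, for demonstration only."""
    def __init__(self, chunks):
        self._chunks = iter(chunks)
    def read(self):
        time.sleep(0.01)
        return next(self._chunks, None)

print(capture_voice(FakeMic([b"la", b"la"]), lambda: False))  # b'lala'
```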
It should be noted that if the user's voice is too quiet for the terminal device to receive the voice data, the terminal device may issue a prompt signal indicating that reception of the voice data failed, prompting the user to input the voice data again.
In step S120, corresponding acoustic feature information is obtained from the voice data to be processed.
Specifically, the terminal device may first preprocess the voice data to be processed, for example by sampling (the sampling frequency may be 10 kHz or 16 kHz, etc.), anti-aliasing filtering, and removing the influence of glottal excitation and noise. It may then perform feature extraction on the processed voice data, that is, extract from the waveform one or more sets of parameters that describe the acoustic attributes of the voice data, such as average energy, zero-crossing count, formants, cepstrum, and linear prediction coefficients, for subsequent voice training and acquisition of acoustic feature information; the choice of parameters directly determines how accurate the acoustic feature information of the voice data will be. By analyzing these parameters, the acoustic feature information of the voice data, such as pitch information, timbre information, loudness information and/or scale information, can be obtained.
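As a rough illustration of the per-frame parameter extraction described above (not taken from the disclosure), the following NumPy sketch computes two of the named parameters, average energy and zero-crossing count, per short frame; the frame length and sample rate are assumptions:

```python
import numpy as np

def frame_features(signal: np.ndarray, sample_rate: int = 16000,
                   frame_ms: int = 20) -> np.ndarray:
    """Return one (average_energy, zero_crossing_count) row per frame."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(signal) // frame_len
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        avg_energy = float(np.mean(frame ** 2))
        # Count sign changes between consecutive samples.
        signs = np.signbit(frame).astype(np.int8)
        zero_crossings = int(np.abs(np.diff(signs)).sum())
        feats.append((avg_energy, zero_crossings))
    return np.array(feats)

# Toy usage: a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
print(frame_features(np.sin(2 * np.pi * 440 * t))[:3])
```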
In step S130, a pre-stored reference acoustic feature score table is searched according to the acoustic feature information, and the music score corresponding to the voice data to be processed is obtained.
Specifically, the terminal device may store a reference acoustic feature score table in advance. The table may include many pieces of reference acoustic feature information; it can be obtained through extensive training on processed voice data, or composed of common standard acoustic feature information. The terminal device may compare each piece of reference acoustic feature information in the table with the extracted acoustic feature information and calculate the matching degree between them, determine the piece of reference acoustic feature information with the highest matching degree as the acoustic feature information corresponding to the voice data, analyze that piece, and set a corresponding music score based on its pitch information, timbre information, loudness information and/or scale information, thereby obtaining the music score corresponding to the voice data.
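A minimal sketch of the matching-degree lookup just described, under the assumption that features are numeric vectors and the matching degree is a simple inverse-distance score (the table contents are invented for illustration):

```python
import numpy as np

# Assumed example data: (reference feature vector, score fragment) pairs.
REFERENCE_TABLE = [
    (np.array([0.8, 2.0]), "do"),
    (np.array([0.5, 4.0]), "re"),
    (np.array([0.2, 8.0]), "mi"),
]

def best_score(features: np.ndarray) -> str:
    """Return the score attached to the reference entry with the highest matching degree."""
    def matching_degree(ref: np.ndarray) -> float:
        return 1.0 / (1.0 + float(np.linalg.norm(features - ref)))  # closer => higher
    _, score = max(REFERENCE_TABLE, key=lambda entry: matching_degree(entry[0]))
    return score

print(best_score(np.array([0.55, 3.5])))  # -> "re"
```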
According to the voice data processing method provided by this embodiment of the invention, corresponding acoustic feature information is extracted from the acquired voice data to be processed, the pre-stored reference acoustic feature score table is searched according to that information, and the music score corresponding to the voice data to be processed is obtained; the music score of the voice data can thus be acquired quickly, its transmissibility is enhanced, and the user experience is improved.
Embodiment 2
Fig. 3 is a flowchart illustrating a method of processing voice data according to the second embodiment of the present invention, which can be regarded as another specific implementation of the method of Fig. 1.
Referring to fig. 3, in step S310, voice data to be processed is acquired.
The content of step S310 is the same as that of step S110 in the first embodiment and is not repeated here.
In step S320, the voice data to be processed is divided into a plurality of data segments of preset duration according to the sampling time of the voice data to be processed, and corresponding acoustic feature information is acquired from each data segment.
Specifically, the speech signal corresponding to the voice data can generally be regarded as a short-time stationary signal: within a short interval (e.g., 10-20 ms) its spectral characteristics and some physical parameters can be treated as approximately constant, so the voice data to be processed can be analyzed with methods for stationary processes. Concretely, the voice data to be processed can be divided, according to sampling time, into a number of data segments of preset duration (e.g., 10-20 ms), and endpoint detection can be performed on each segment; endpoint detection means determining the start point and end point of speech within a stretch of data containing speech. Feature extraction can then be performed on each data segment, extracting from it one or more sets of parameters that describe its acoustic attributes, and the acoustic feature information of each segment can be obtained by analyzing those parameters.
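The segmentation and endpoint detection described above might look roughly like this; the energy threshold and durations are illustrative assumptions, and real endpoint detectors are considerably more elaborate:

```python
import numpy as np

def split_segments(signal: np.ndarray, sample_rate: int = 16000,
                   seg_ms: int = 20) -> list:
    """Cut the samples into fixed-duration segments (e.g., 10-20 ms)."""
    seg_len = sample_rate * seg_ms // 1000
    n = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n)]

def detect_endpoints(segments: list, energy_threshold: float = 1e-4):
    """Return (first, last) indices of segments whose energy clears the threshold."""
    voiced = [i for i, seg in enumerate(segments)
              if float(np.mean(seg ** 2)) > energy_threshold]
    return (voiced[0], voiced[-1]) if voiced else None
```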
In step S330, a reference acoustic feature information range value in the pre-stored reference acoustic feature profile table is searched according to the acoustic feature information.
Here, the reference acoustic feature score table includes scale, pitch, chromatic-scale and/or long-note information.
Specifically, the terminal device may store the reference acoustic feature score table in advance. The table may include many pieces of reference acoustic feature information such as scale, pitch, chromatic-scale and/or long-note information; different recognition ranges can be divided for the scale, pitch, chromatic scale and/or long notes according to a predetermined division standard, with corresponding range values set for each. The table can be obtained through extensive training on voice data, or composed of common standard acoustic feature information. The acoustic feature information of each data segment may be assigned a feature value according to a predetermined standard. For the acoustic feature information of a given data segment in the voice data, the terminal device may compare each piece of reference acoustic feature information in the table with that segment's acoustic feature information and find, in the table, the reference acoustic feature information range value within which the segment's feature value falls. The other data segments in the voice data can be processed in the same way, finding for each segment the reference acoustic feature information range value in which the feature value of its acoustic feature information lies.
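One plausible realization of the range-value search (boundaries and labels invented for illustration) is a bisection over sorted range boundaries:

```python
import bisect

# Assumed recognition ranges for a single feature, e.g. pitch in Hz:
# below 100 -> "low", 100-200 -> "mid-low", 200-300 -> "mid-high", above -> "high".
BOUNDARIES = [100.0, 200.0, 300.0]
RANGE_LABELS = ["low", "mid-low", "mid-high", "high"]

def find_range(feature_value: float) -> str:
    """Map a segment's feature value to the reference range it falls into."""
    return RANGE_LABELS[bisect.bisect_right(BOUNDARIES, feature_value)]

print(find_range(150.0))  # -> "mid-low"
```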
In step S340, the music score corresponding to the found reference acoustic feature information range value is taken as the music score corresponding to the voice data to be processed.
Specifically, the matching degree between the acoustic feature information of a data segment and each piece of reference acoustic feature information is obtained by calculation, and the piece of reference acoustic feature information with the highest matching degree can be determined as the acoustic feature information corresponding to the segment. The terminal device can analyze the reference acoustic feature information corresponding to each found reference acoustic feature information range value and set a corresponding music score based on the scale, pitch, chromatic-scale and/or long-note information it contains, thereby obtaining the music score corresponding to that data segment. The other data segments in the voice data can be processed in the same way, yielding the music score corresponding to each segment. The position of each data segment in the voice data can then be determined from the segment's start point and end point, and the corresponding music scores can be ordered according to the position of each segment to obtain the music score corresponding to the voice data.
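A sketch of the final assembly step, assuming each segment carries its start and end points and a score fragment (the fragment notation is invented):

```python
from dataclasses import dataclass

@dataclass
class SegmentScore:
    start: float    # segment start point within the voice data (seconds)
    end: float      # segment end point
    fragment: str   # score fragment set for this segment

def assemble_score(segment_scores: list) -> str:
    """Order the fragments by segment position and join them into one score."""
    ordered = sorted(segment_scores, key=lambda s: s.start)
    return " ".join(s.fragment for s in ordered)

print(assemble_score([SegmentScore(0.04, 0.06, "mi"),
                      SegmentScore(0.00, 0.02, "do"),
                      SegmentScore(0.02, 0.04, "re")]))  # -> "do re mi"
```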
In addition to the above, the voice data can be set to music in various other ways, for example through a voice-to-score model trained before the voice data is processed. A technician may obtain voice data through various channels (such as purchasing it from users) before developing the voice-to-score mechanism and then train the model with it. Specifically, the parameters of several voice-to-score models may be set; after the voice data is obtained, the relevant parameters are extracted from it and the acoustic feature information of the voice data is derived from those parameters, after which each frame of voice data can be state-labeled. Concretely, a neural network model can be set up and the speech data divided into three layers; using a neural network model over contextual acoustic features, the acoustic feature information of the head, middle, and tail layers is extracted from the speech data. The acoustic feature information of the three layers can serve as the sample feature space, the acoustic feature information corresponding to the sample feature space is obtained from it, and the acoustic feature information corresponding to the middle layer can be used as the label. An artificial neural network topology can serve as the core of the recognition model; it may comprise three layers, such as an input layer, a hidden layer, and an output layer. First, the network is initialized: the connection weight between every two directly connected neurons is initialized to a very small random number (for example, between -1.0 and 1.0), and each neuron's bias is likewise initialized to a random number. The output of each neuron is then computed from the network input layer fed with the input speech data; every neuron is computed in the same way, as a linear combination of its inputs. Finally, the actual output, i.e., the corresponding score spectrum, is compared with the expected output to obtain the error of each output unit. The obtained error is propagated back from the output layer toward the input layer; the error of a unit in a given layer is computed from the errors of all units in the next layer connected to it, and the network weights and neuron biases are adjusted accordingly. For each piece of voice data, if the final output error is within a preset acceptable range or a preset iteration threshold is reached, processing continues with the next piece of voice data; training continuously in this way yields the voice-to-score model. After the terminal device obtains the voice data to be processed, it can input the voice data into this model and obtain the voice-to-score result.
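To make the training procedure above concrete, here is a minimal one-hidden-layer network trained by backpropagation in NumPy; the layer sizes, learning rate, activation, and toy data are all illustrative assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in: int, n_out: int):
    # Weights and biases start as small random numbers, as described above.
    w = rng.uniform(-1.0, 1.0, (n_in, n_out)) * 0.1
    b = rng.uniform(-1.0, 1.0, n_out) * 0.1
    return w, b

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative shapes: 13 acoustic features in, 8 hidden units, 5 note classes out.
W1, b1 = init_layer(13, 8)
W2, b2 = init_layer(8, 5)
LR = 0.5

def train_step(x, target):
    global W1, b1, W2, b2
    # Forward pass: each neuron is a linear combination of its inputs,
    # followed by a squashing nonlinearity.
    h = sigmoid(x @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Output-unit errors, propagated back toward the input layer.
    err_out = (target - y) * y * (1 - y)
    err_hid = (err_out @ W2.T) * h * (1 - h)
    # Adjust weights and biases from the propagated errors.
    W2 += LR * np.outer(h, err_out); b2 += LR * err_out
    W1 += LR * np.outer(x, err_hid); b1 += LR * err_hid
    return float(np.mean((target - y) ** 2))

# Toy usage: random "frame features" mapped to a one-hot "note" label.
x, t = rng.random(13), np.eye(5)[2]
for _ in range(200):
    loss = train_step(x, t)
print(f"final squared error: {loss:.4f}")
```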
In step S350, the voice data to be processed and the acquired music score are output.
Specifically, as shown in Fig. 4, the terminal device may display the text of the voice data to be processed and the acquired music score at a preset position in the output box on the home page of the voice data processing application; in Fig. 4, "XXXX" represents the text of the voice data and "a a …" represents the music score.
It should be noted that the text of the voice data to be processed and the acquired music score may be displayed in correspondence; for example, the first character of the text may correspond to the first note of the music score, the second character of the text to the second and third notes, and so on.
In addition, the home page of the voice data processing application may also include a key for playing the music score; when the user wants to listen to it, he can click the key and the terminal device plays the music score. To improve the user experience, the voice data to be processed that the user input may be played while the music score is played, so that the user can judge from the playback how well the voice data matches the music score.
According to the voice data processing method provided by this embodiment of the invention, on the one hand, the acquired voice data to be processed is divided into a plurality of data segments of preset duration, corresponding acoustic feature information is extracted from each segment, and the pre-stored reference acoustic feature score table is searched according to that information to obtain the music score corresponding to the voice data to be processed; the music score of the voice data can thus be acquired quickly, its transmissibility is enhanced, and the user experience is improved. On the other hand, the voice data to be processed and the acquired music score are output and displayed, and the music score can be played, so the user can judge how well the voice data matches the music score, further improving the user experience.
Embodiment 3
Based on the same technical concept, Fig. 5 is a logic block diagram of a voice data processing apparatus according to the third embodiment of the present invention. Referring to Fig. 5, the processing apparatus includes a voice data acquisition module 510, an acoustic feature acquisition module 520, and a music score acquisition module 530; the voice data acquisition module 510 is connected to the acoustic feature acquisition module 520, and the acoustic feature acquisition module 520 is connected to the music score acquisition module 530.
The voice data acquisition module 510 is configured to acquire the voice data to be processed.
The acoustic feature acquisition module 520 is configured to extract corresponding acoustic feature information from the voice data to be processed acquired by the voice data acquisition module 510.
The music score acquisition module 530 is configured to search the pre-stored reference acoustic feature score table according to the acoustic feature information acquired by the acoustic feature acquisition module 520 and obtain the music score corresponding to the voice data to be processed.
According to the voice data processing apparatus provided by this embodiment of the invention, corresponding acoustic feature information is extracted from the acquired voice data to be processed, the pre-stored reference acoustic feature score table is searched according to that information, and the music score corresponding to the voice data to be processed is obtained; the music score of the voice data can thus be acquired quickly, its transmissibility is enhanced, and the user experience is improved.
Further, building on the embodiment shown in Fig. 5, the music score acquisition module 530 shown in Fig. 6 includes: an information search unit 531, configured to search the pre-stored reference acoustic feature score table for a reference acoustic feature information range value according to the acoustic feature information acquired by the acoustic feature acquisition module 520; and a music score acquisition unit 532, configured to take the music score corresponding to the reference acoustic feature information range value found by the information search unit 531 as the music score corresponding to the voice data to be processed.
Further, building on the embodiment shown in Fig. 6, the processing apparatus shown in Fig. 7 further includes a music score output module 540, configured to output the voice data to be processed and the acquired music score.
Preferably, the acoustic feature acquisition module 520 is configured to divide the voice data to be processed into a plurality of data segments of preset duration according to the sampling time of the voice data to be processed acquired by the voice data acquisition module 510, and to extract corresponding acoustic feature information from each data segment.
Preferably, the reference acoustic feature score table includes scale, pitch, chromatic-scale and/or long-note information.
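For illustration, the module structure of Figs. 5-7 could be wired up as below; the method bodies are placeholders and only the composition follows the text:

```python
class VoiceDataAcquisitionModule:            # module 510
    def acquire(self) -> bytes:
        raise NotImplementedError            # e.g., read from the microphone

class AcousticFeatureAcquisitionModule:      # module 520
    def extract(self, voice_data: bytes) -> list:
        raise NotImplementedError            # segment the data, extract features

class ScoreAcquisitionModule:                # module 530
    def lookup(self, features: list) -> str:
        raise NotImplementedError            # search the reference score table

class ScoreOutputModule:                     # module 540
    def output(self, voice_data: bytes, score: str) -> None:
        raise NotImplementedError            # display/play voice data and score

class ProcessingApparatus:
    """Modules connected in sequence, mirroring Figs. 5 and 7."""
    def __init__(self):
        self.voice = VoiceDataAcquisitionModule()
        self.features = AcousticFeatureAcquisitionModule()
        self.score = ScoreAcquisitionModule()
        self.out = ScoreOutputModule()

    def run(self) -> None:
        data = self.voice.acquire()
        score = self.score.lookup(self.features.extract(data))
        self.out.output(data, score)
```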
Further, according to the voice data processing apparatus provided by this embodiment of the invention, on the one hand, the acquired voice data to be processed is divided into a plurality of data segments of preset duration, corresponding acoustic feature information is extracted from each segment, and the pre-stored reference acoustic feature score table is searched according to that information to obtain the music score corresponding to the voice data to be processed; the music score of the voice data can thus be acquired quickly, its transmissibility is enhanced, and the user experience is improved. On the other hand, the voice data to be processed and the acquired music score are output and displayed, and the music score can be played, so the user can judge how well the voice data matches the music score, further improving the user experience.
It should be noted that, depending on implementation requirements, each step/component described in this application can be split into more steps/components, and two or more steps/components or partial operations of steps/components can be combined into a new step/component to achieve the purpose of the present invention.
The above method according to the present invention can be implemented in hardware or firmware, or as software or computer code storable on a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored on a remote recording medium or a non-transitory machine-readable medium and downloaded over a network for storage on a local recording medium, so that the method described herein can be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory) that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processing shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing that processing.
The above description covers only specific embodiments of the present invention, but the scope of the invention is not limited to them; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall be covered by its scope of protection. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (2)

1. A method for processing voice data, the method comprising:
acquiring voice data to be processed;
acquiring corresponding acoustic feature information from the voice data to be processed, which includes: dividing the voice data to be processed into a plurality of data segments of preset duration according to the sampling time of the voice data to be processed, and acquiring the corresponding acoustic feature information from each data segment;
searching a pre-stored reference acoustic feature score table according to the acoustic feature information to obtain a music score corresponding to the voice data to be processed; and
outputting the voice data to be processed and the acquired music score;
wherein searching the pre-stored reference acoustic feature score table according to the acoustic feature information and obtaining the music score corresponding to the voice data to be processed includes:
searching the pre-stored reference acoustic feature score table for a reference acoustic feature information range value according to the acoustic feature information, which includes: comparing each piece of reference acoustic feature information in the reference acoustic feature score table with the acoustic feature information of each data segment in the voice data to be processed, and finding in the reference acoustic feature score table the reference acoustic feature information range value within which the feature value of the acoustic feature information of each data segment falls; the reference acoustic feature information includes scale, pitch, chromatic-scale and/or long-note information, and different recognition ranges are divided for the scale, pitch, chromatic scale and/or long notes according to a preset division standard; and
taking the music score corresponding to the found reference acoustic feature information range value as the music score corresponding to the voice data to be processed, which includes: calculating the matching degree between the acoustic feature information of each data segment and each piece of reference acoustic feature information, determining the piece of reference acoustic feature information with the highest matching degree as the acoustic feature information corresponding to the data segment, analyzing the reference acoustic feature information corresponding to each found reference acoustic feature information range value, and setting a corresponding music score based on the scale, pitch, chromatic-scale and/or long-note information in that reference acoustic feature information, thereby obtaining the music score corresponding to each data segment; and determining the position of each data segment in the voice data according to the start point and end point of the segment, and ordering the corresponding music scores according to the position of each data segment to obtain the music score corresponding to the voice data.
2. A processing apparatus of voice data, the processing apparatus comprising:
a voice data acquisition module, configured to acquire voice data to be processed;
an acoustic feature acquisition module, configured to acquire corresponding acoustic feature information from the voice data to be processed acquired by the voice data acquisition module, which includes: dividing the voice data to be processed into a plurality of data segments of preset duration according to the sampling time of the voice data to be processed, and acquiring the corresponding acoustic feature information from each data segment;
a music score acquisition module, configured to search a pre-stored reference acoustic feature score table according to the acoustic feature information acquired by the acoustic feature acquisition module and obtain a music score corresponding to the voice data to be processed; and
a music score output module, configured to output the voice data to be processed and the acquired music score;
wherein the music score acquisition module includes:
an information search unit, configured to search the pre-stored reference acoustic feature score table for a reference acoustic feature information range value according to the acoustic feature information acquired by the acoustic feature acquisition module, which includes: comparing each piece of reference acoustic feature information in the reference acoustic feature score table with the acoustic feature information of each data segment in the voice data to be processed, and finding in the reference acoustic feature score table the reference acoustic feature information range value within which the feature value of the acoustic feature information of each data segment falls; the reference acoustic feature information includes scale, pitch, chromatic-scale and/or long-note information, and different recognition ranges are divided for the scale, pitch, chromatic scale and/or long notes according to a preset division standard; and
a music score acquisition unit, configured to take the music score corresponding to the reference acoustic feature information range value found by the information search unit as the music score corresponding to the voice data to be processed, which includes: calculating the matching degree between the acoustic feature information of each data segment and each piece of reference acoustic feature information, determining the piece of reference acoustic feature information with the highest matching degree as the acoustic feature information corresponding to the data segment, analyzing the reference acoustic feature information corresponding to each found reference acoustic feature information range value, and setting a corresponding music score based on the scale, pitch, chromatic-scale and/or long-note information in that reference acoustic feature information, thereby obtaining the music score corresponding to each data segment; and determining the position of each data segment in the voice data according to the start point and end point of the segment, and ordering the corresponding music scores according to the position of each data segment to obtain the music score corresponding to the voice data.
CN201510926346.9A 2015-12-14 2015-12-14 Voice data processing method and device Active CN105895079B (en)

Priority Applications (1)

Application Number: CN201510926346.9A
Priority Date: 2015-12-14
Filing Date: 2015-12-14
Title: Voice data processing method and device

Publications (2)

CN105895079A, published 2016-08-24
CN105895079B, published 2022-07-29

Family

ID=57002399

Family Applications (1)

Application Number: CN201510926346.9A
Title: Voice data processing method and device
Status: Active

Country Status (1)

CN: CN105895079B

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986841B (en) * 2018-08-08 2023-07-11 百度在线网络技术(北京)有限公司 Audio information processing method, device and storage medium
CN109920449B (en) * 2019-03-18 2022-03-04 广州市百果园网络科技有限公司 Beat analysis method, audio processing method, device, equipment and medium
CN111081248A (en) * 2019-12-27 2020-04-28 安徽仁昊智能科技有限公司 Artificial intelligence speech recognition device
CN113823281B (en) * 2020-11-24 2024-04-05 北京沃东天骏信息技术有限公司 Voice signal processing method, device, medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005242A1 (en) * 2001-03-23 2003-01-16 Kent Ridge Digital Labs Method and system of representing musical information in a digital representation for use in content-based multimedia information retrieval
TWI342009B (en) * 2007-12-31 2011-05-11 Inventec Appliances Corp Method of converting voice into music score
CN104978962B (en) * 2014-04-14 2019-01-18 科大讯飞股份有限公司 Singing search method and system
CN104992712B (en) * 2015-07-06 2019-02-12 成都云创新科技有限公司 It can identify music automatically at the method for spectrum

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271457A (en) * 2007-03-21 2008-09-24 中国科学院自动化研究所 Music retrieval method and device based on rhythm
CN101930732A (en) * 2010-06-29 2010-12-29 中兴通讯股份有限公司 Music producing method and device based on user input voice and intelligent terminal

Also Published As

Publication number Publication date
CN105895079A (en) 2016-08-24

Similar Documents

Publication Publication Date Title
CN102664016B (en) Singing evaluation method and system
US8535236B2 (en) Apparatus and method for analyzing a sound signal using a physiological ear model
Mion et al. Score-independent audio features for description of music expression
CN105895079B (en) Voice data processing method and device
JP6060867B2 (en) Information processing apparatus, data generation method, and program
CN106898339B (en) Song chorusing method and terminal
CN106971743B (en) User singing data processing method and device
KR101325722B1 (en) Apparatus for generating musical note fit in user's song and method for the same
CN113782032A (en) Voiceprint recognition method and related device
Pikrakis et al. Tracking melodic patterns in flamenco singing by analyzing polyphonic music recordings
TWI299855B (en) Detection method for voice activity endpoint
Pendekar et al. Harmonium raga recognition
CN105244021B (en) Conversion method of the humming melody to MIDI melody
Rao Audio signal processing
JP2010060846A (en) Synthesized speech evaluation system and synthesized speech evaluation method
JP6098422B2 (en) Information processing apparatus and program
Tsai et al. Automatic Singing Performance Evaluation Using Accompanied Vocals as Reference Bases.
JP6252420B2 (en) Speech synthesis apparatus and speech synthesis system
CN110299049B (en) Intelligent display method of electronic music score
CN108182946B (en) Vocal music mode selection method and device based on voiceprint recognition
CN112837698A (en) Singing or playing evaluation method and device and computer readable storage medium
JP2008040258A (en) Musical piece practice assisting device, dynamic time warping module, and program
JP6365483B2 (en) Karaoke device, karaoke system, and program
KR20110076314A (en) Apparatus and method for estimating a musical performance
KR101236435B1 (en) Recognizable karaoke player of voice of words and method of controlling the same

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220714

Address after: 300467 917-2, Chuangzhi building, 482 Zhongxin eco city, Binhai New Area, Tianjin

Applicant after: Tianjin Zhirong Innovation Technology Development Co.,Ltd.

Address before: 100025 LETV building, 105 yaojiayuan Road, Chaoyang District, Beijing

Applicant before: LE SHI INTERNET INFORMATION & TECHNOLOGY CORP., BEIJING

GR01 Patent grant