WO2020206975A1 - Method for calculating the number of syllables per unit time and related apparatus - Google Patents
Method for calculating the number of syllables per unit time and related apparatus
- Publication number
- WO2020206975A1 (PCT/CN2019/112242, CN2019112242W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- audio segment
- syllables
- feature vector
- neural network
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
Description
- This application relates to the technical field of audio processing, and in particular to a method for calculating the number of syllables per unit time and related devices.
- At present, the number of syllables per unit time is determined by counting the syllables and the singing time of a song that has lyrics text, and then computing the number of syllables per unit time for that song. Because this approach requires time-stamped lyrics text, it cannot be applied to arbitrary audio segments and adapts poorly; a more widely applicable method for calculating the number of syllables per unit time is therefore needed.
- The embodiments of the present application provide a method for calculating the number of syllables per unit time and related devices, which are used to calculate the number of syllables per unit time of songs without lyrics text.
- In a first aspect, an embodiment of the present application provides a method for calculating the number of syllables per unit time, the method including:
- obtaining a first audio segment including a human voice and background music, and performing human voice separation on the first audio segment to obtain a second audio segment including only the human voice; inputting the second audio segment into a trained neural network model for processing and outputting a first feature vector, the trained neural network model being used to extract feature vectors of human-voice audio segments;
- determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment, and determining the target singing time corresponding to the second audio segment;
- determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment.
- In a second aspect, an embodiment of the present application provides a device for calculating the number of syllables per unit time, the device including:
- an acquiring unit configured to acquire a first audio segment including a human voice and background music;
- an execution unit configured to perform human voice separation on the first audio segment to obtain a second audio segment including only the human voice;
- a processing unit configured to input the second audio segment into a trained neural network model for processing and output a first feature vector, the trained neural network model being used to extract feature vectors of human-voice audio segments;
- a first determining unit configured to determine, based on the first feature vector, the target number of syllables corresponding to the second audio segment;
- a second determining unit configured to determine the target singing time corresponding to the second audio segment;
- a third determining unit configured to determine, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment.
- In a third aspect, an embodiment of the present application provides an electronic device including a processor, a memory, a communication interface, and one or more programs.
- The one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing part or all of the steps of the method described in the first aspect of the embodiments of the present application.
- In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium.
- The computer-readable storage medium is used to store a computer program, and the computer program is executed by a processor to implement part or all of the steps of the method described in the first aspect of the embodiments of the present application.
- In a fifth aspect, an embodiment of the present application provides a computer program product.
- The computer program product includes a non-transitory computer-readable storage medium storing a computer program.
- The computer program is operable to cause a computer to execute part or all of the steps of the method described in the first aspect.
- It can be seen that, in the embodiments of the present application, the electronic device obtains a first audio segment including a human voice and background music, performs human voice separation on the first audio segment, and obtains a second audio segment including only the human voice.
- The second audio segment is input into the trained neural network model for processing, and the first feature vector is output.
- Based on the first feature vector, the target number of syllables corresponding to the second audio segment is determined, the target singing time corresponding to the second audio segment is determined, and the target number of syllables per unit time corresponding to the second audio segment is determined based on the target number of syllables and the target singing time.
- Because the second audio segment includes no lyrics text, this makes it possible to calculate the number of syllables per unit time of songs without lyrics text.
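- For illustration only, the overall flow can be summarized as the short Python sketch below. Every helper it calls (separate_vocals, syllable_model, count_syllables, singing_time) is a hypothetical placeholder for a stage described in this application, not an actual API.

```python
# Hedged sketch of the end-to-end method described above; each helper is a
# hypothetical stand-in for one stage (separation, model, post-processing,
# silence detection), under the naming assumptions stated in the lead-in.
def syllables_per_unit_time(first_audio, sample_rate):
    # Separate the human voice from the background music (existing technology).
    second_audio = separate_vocals(first_audio, sample_rate)
    # The trained model maps the vocal-only segment to the per-frame
    # syllable-probability vector (the "first feature vector").
    first_feature_vector = syllable_model(second_audio, sample_rate)
    # Post-process the probabilities into the target number of syllables.
    target_syllables = count_syllables(first_feature_vector)
    # Target singing time: total duration of the non-silent segments.
    target_singing_time = singing_time(second_audio, sample_rate)
    # Number of syllables per unit time = syllables / singing time.
    return target_syllables / target_singing_time
```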
- FIG. 1 is a schematic flowchart of the first method for calculating the number of syllables per unit time provided by an embodiment of the present application;
- FIG. 2 is a schematic flowchart of a second method for calculating the number of syllables per unit time provided by an embodiment of the present application;
- FIG. 3 is a schematic flowchart of a third method for calculating the number of syllables per unit time provided by an embodiment of the present application;
- FIG. 4 is a block diagram of functional units of a device for calculating the number of syllables per unit time provided by an embodiment of the present application;
- FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- The device for calculating the number of syllables per unit time involved in the embodiments of the present application can be integrated in an electronic device.
- The electronic device can include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem, as well as various forms of user equipment (User Equipment, UE), mobile stations (Mobile Station, MS), terminal devices (Terminal Device, TD), and so on.
- Referring to FIG. 1, FIG. 1 is a schematic flowchart of the first method for calculating the number of syllables per unit time provided by an embodiment of the present application.
- The method is applied to a device for calculating the number of syllables per unit time and includes steps 101-104, which are specifically as follows:
- 101: The device for calculating the number of syllables per unit time obtains a first audio segment including a human voice and background music, and performs human voice separation on the first audio segment to obtain a second audio segment including only the human voice; the second audio segment is input into the trained neural network model for processing, and a first feature vector is output.
- The trained neural network model is used to extract feature vectors of human-voice audio segments.
- Here, a syllable is the smallest unit of speech structure formed by a combination of phonemes, and the duration of the second audio segment is less than the duration of the first audio segment.
- Performing human voice separation on the first audio segment to obtain the second audio segment including only the human voice uses existing technology and is not described here.
- In a possible example, before the device for calculating the number of syllables per unit time inputs the second audio segment into the trained neural network model for processing and outputs the first feature vector, the method further includes:
- the device determines the duration of the second audio segment and judges whether the duration of the second audio segment is greater than or equal to a target duration;
- if so, the device triggers the operation of inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector.
- The target duration can be user-defined; for example, the target duration is 10 s.
- In a possible example, before the device for calculating the number of syllables per unit time inputs the second audio segment into the trained neural network model for processing and outputs the first feature vector, the method further includes:
- the device obtains N training sample data, where N is an integer greater than 1;
- the device inputs the i-th training sample data into the initial neural network model for forward training, outputs a prediction result, constructs a neural network loss function based on the prediction result, and performs reverse training on the initial neural network model based on the neural network loss function to obtain a once-trained neural network model, the i-th training sample data being any one of the N training sample data;
- the device performs the same operation on the (N-1) training sample data other than the i-th training sample data among the N training sample data, obtaining a neural network model after N trainings;
- the device uses the neural network model after the N trainings as the trained neural network model.
- The training sample data are songs without lyrics text; each word in such a song corresponds to one syllable, and each syllable corresponds to one moment.
- the initial neural network model is an untrained neural network model.
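- As a rough illustration of the training procedure above, the sketch below performs one forward pass, one loss construction, and one reverse (backward) pass per training sample, for N samples in total. The use of PyTorch, the Adam optimizer, the binary cross-entropy loss, and the learning rate are assumptions, not details given by the application.

```python
import torch

def train(initial_model, samples, labels, lr=1e-3):
    """Hedged sketch: one forward/backward pass per sample, N passes total.

    initial_model is the untrained network; samples/labels stand for the N
    training songs without lyrics text and their per-moment syllable targets.
    Optimizer and loss choices are illustrative assumptions.
    """
    optimizer = torch.optim.Adam(initial_model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()  # assumed: per-frame syllable probabilities
    for x, y in zip(samples, labels):
        prediction = initial_model(x)  # forward training -> prediction result
        loss = loss_fn(prediction, y)  # construct the loss from the prediction
        optimizer.zero_grad()
        loss.backward()                # reverse training
        optimizer.step()
    return initial_model               # the model after N trainings
```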
- In a possible example, the trained neural network model includes M network layers, the M network layers include a fully connected layer, and M is an integer greater than 1. The device for calculating the number of syllables per unit time inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector includes:
- the device performs audio feature extraction on the second audio segment to obtain a target audio feature;
- the device inputs the target audio feature into the i-th network layer for processing and outputs the output data set corresponding to the i-th network layer;
- the device inputs the output data set corresponding to the i-th network layer into the (i+1)-th network layer for processing and outputs the output data set corresponding to the (i+1)-th network layer;
- until i = (M-1), the device obtains the output data set corresponding to the (M-1)-th network layer, where i is an increasing integer with an initial value of 1 and a step of 1;
- the device inputs the output data set corresponding to the (M-1)-th network layer into the fully connected layer for processing and outputs the first feature vector.
- The first network layer through the (M-1)-th network layer among the (M-1) network layers are identical.
- Specifically, the device for calculating the number of syllables per unit time may perform audio feature extraction on the second audio segment to obtain the target audio feature as follows:
- the device down-samples the second audio segment to obtain a down-sampled second audio segment, the down-sampled second audio segment corresponding to a set sampling rate;
- the device performs a discrete-time short-time Fourier transform on the down-sampled second audio segment based on discrete-time Fourier transform parameters to obtain multiple first discrete spectrograms corresponding to the down-sampled second audio segment, each first discrete spectrogram corresponding to one frame;
- the device performs mel-spectrum conversion on each first discrete spectrogram to obtain multiple second discrete spectrograms corresponding to the multiple first discrete spectrograms;
- the device generates a target spectrogram based on the multiple second discrete spectrograms;
- the device determines a first matrix corresponding to the target spectrogram and generates a second matrix based on the first matrix, where the j-th column of the second matrix equals the difference between the (j+1)-th column and the j-th column of the first matrix;
- the device superimposes the first matrix and the second matrix to obtain a third matrix, and uses the third matrix as the target audio feature.
- The set sampling rate can be 8000 Hz; the discrete-time short-time Fourier transform parameters include a frame length and a step length; the frame length can be 256 sampling points and the step length 80 sampling points, which are not limited here.
- The target spectrogram is a spectrogram that changes over time.
- The last column of the first matrix is the same as the last column of the second matrix.
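- The extraction steps above might look roughly like the following sketch using librosa and NumPy; the number of mel bands and the log compression are assumptions, while the 8000 Hz rate, 256-point frame length, and 80-point step length come from the example values above.

```python
import numpy as np
import librosa

def extract_target_audio_feature(audio, sr, target_sr=8000,
                                 frame_length=256, step_length=80):
    # Down-sample the vocal-only segment to the set sampling rate (8000 Hz).
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    # Short-time Fourier transform: one spectrum column per frame.
    spec = np.abs(librosa.stft(audio, n_fft=frame_length, hop_length=step_length))
    # Mel conversion of each frame's spectrum; n_mels=40 is an assumption.
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=target_sr, n_mels=40)
    first = np.log(mel + 1e-6)  # "first matrix": the target spectrogram over time
    # "Second matrix": column j equals column (j+1) minus column j of the
    # first matrix; the last column is copied so the shapes match.
    second = np.empty_like(first)
    second[:, :-1] = first[:, 1:] - first[:, :-1]
    second[:, -1] = first[:, -1]
    # Superimpose (stack) both matrices to form the target audio feature.
    return np.concatenate([first, second], axis=0)
```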
- Specifically, the device for calculating the number of syllables per unit time may input the output data set corresponding to the i-th network layer into the (i+1)-th network layer for processing and output the output data set corresponding to the (i+1)-th network layer as follows:
- the device inputs the output data set corresponding to the i-th network layer into the (i+1)-th network layer, the (i+1)-th network layer including convolution matrix (i+1)-1, convolution matrix (i+1)-2, and activation matrix (i+1)-3;
- the device multiplies the output data set corresponding to the i-th network layer with convolution matrix (i+1)-1 to obtain first output matrix (i+1)-4;
- the device multiplies the output data set corresponding to the i-th network layer with convolution matrix (i+1)-2 to obtain second output matrix (i+1)-5, and multiplies second output matrix (i+1)-5 with activation matrix (i+1)-3 to obtain third output matrix (i+1)-6;
- the device multiplies first output matrix (i+1)-4 with third output matrix (i+1)-6 to obtain fourth output matrix (i+1)-7;
- the device superimposes fourth output matrix (i+1)-7 and the output data set corresponding to the i-th network layer to obtain the output data set corresponding to the (i+1)-th network layer.
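- The per-layer computation described above resembles a gated layer with a residual connection; a minimal NumPy reading is sketched below. Treating each "multiply" as a plain matrix product, except the final elementwise gate, is an interpretive assumption.

```python
import numpy as np

def gated_residual_layer(x, conv_a, conv_b, act):
    """Hedged sketch of one of the identical network layers described above.

    x is the output data set of the i-th layer with shape (features, frames);
    conv_a and conv_b stand for the two "convolution matrices" and act for
    the "activation matrix" of the (i+1)-th layer.
    """
    first_out = conv_a @ x               # first output matrix (i+1)-4
    second_out = conv_b @ x              # second output matrix (i+1)-5
    third_out = act @ second_out         # third output matrix (i+1)-6
    fourth_out = first_out * third_out   # fourth output matrix (i+1)-7 (gating)
    return x + fourth_out                # superimposed with the i-th layer output
```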
- 102: The device for calculating the number of syllables per unit time determines, based on the first feature vector, the target number of syllables corresponding to the second audio segment, and determines the target singing time corresponding to the second audio segment.
- In a possible example, the device for calculating the number of syllables per unit time determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment includes:
- the device performs binarization processing on the first feature vector to obtain a second feature vector, where the magnitude of each value in the second feature vector is either a first threshold or a second threshold, and the first threshold is less than the second threshold;
- if at least one first target value exists in the second feature vector, the device sets the magnitude of each of the at least one first target value to the first threshold to obtain a third feature vector, where the number of first values between each first target value and its nearest second target value is greater than or equal to a third threshold, and the magnitudes of the first target value and the second target value are both the second threshold;
- if at least one target value group exists in the third feature vector, where each target value group includes two adjacent third target values, the magnitude of each third target value is the second threshold, and each third target value corresponds to a moment, the device determines the time difference corresponding to each target value group;
- if the time difference corresponding to a target value group is less than or equal to a set duration, the device sets the magnitude of either third target value in that target value group to the first threshold to obtain a fourth feature vector;
- the device determines the second value quantity, namely the number of values in the fourth feature vector whose magnitude is the second threshold, and uses the second value quantity as the target number of syllables corresponding to the second audio segment.
- The first feature vector includes multiple values; each value lies between 0 and 1, and the magnitude of each value represents the probability of a syllable.
- Specifically, the device for calculating the number of syllables per unit time may perform binarization processing on the first feature vector to obtain the second feature vector as follows: the device judges whether the magnitude of each value in the first feature vector is greater than or equal to a fixed value; if the magnitude of a value is less than the fixed value, the device sets that value to the first threshold; or, if the magnitude of a value is greater than or equal to the fixed value, the device sets that value to the second threshold.
- The fixed value can be user-defined; for example, the fixed value is 0.5.
- the first threshold may be 0, and the second threshold may be 1.
- the third threshold and the set duration may be user-defined, and are not limited here.
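- Putting the binarization and clean-up rules above together, one possible NumPy reading is sketched below; the concrete third threshold, set duration, frame period, and the handling of edge cases are illustrative assumptions.

```python
import numpy as np

def count_syllables(probs, fixed=0.5, gap_thresh=50,
                    set_duration_ms=60.0, frame_ms=10.0):
    """Hedged sketch of the syllable-counting post-processing above.

    probs is the first feature vector of per-frame syllable probabilities
    in [0, 1]; gap_thresh plays the role of the "third threshold" and
    set_duration_ms that of the "set duration" (both assumptions).
    """
    # Binarization: values >= fixed become 1 (second threshold), else 0.
    v = (np.asarray(probs) >= fixed).astype(int)
    ones = np.flatnonzero(v)
    # Drop "first target values": a 1 separated from its nearest other 1
    # by at least gap_thresh intervening values.
    kept = []
    for k, idx in enumerate(ones):
        gaps = [abs(int(idx) - int(ones[j])) - 1
                for j in (k - 1, k + 1) if 0 <= j < len(ones)]
        if gaps and min(gaps) < gap_thresh:
            kept.append(idx)
    # Merge adjacent 1s whose time difference is within the set duration,
    # keeping only one value of each such pair.
    merged = []
    for idx in kept:
        if merged and (idx - merged[-1]) * frame_ms <= set_duration_ms:
            continue  # the other value of the pair is set back to 0
        merged.append(idx)
    # The number of remaining 1s is the target number of syllables.
    return len(merged)
```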
- In a possible example, the device for calculating the number of syllables per unit time determining the target singing time corresponding to the second audio segment includes:
- the device performs silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment;
- the device determines the target duration corresponding to the at least one non-silent segment;
- the device uses the target duration as the target singing time corresponding to the second audio segment.
- Performing silence detection on the second audio segment to obtain the at least one silent segment and the at least one non-silent segment included in the second audio segment uses existing technology and is not described here.
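- Since the application defers to existing technology for silence detection, the singing-time computation can be approximated with a simple energy-based sketch; the frame size and energy threshold below are assumptions.

```python
import numpy as np

def singing_time(audio, sr, frame=1024, energy_thresh=1e-4):
    """Hedged sketch: total duration in seconds of the non-silent frames,
    used as the target singing time; threshold and frame size are assumed."""
    n_frames = len(audio) // frame
    frames = np.reshape(audio[:n_frames * frame], (n_frames, frame))
    energy = np.mean(frames ** 2, axis=1)  # mean energy per frame
    non_silent = energy >= energy_thresh   # simple silence detection
    return float(non_silent.sum()) * frame / sr
```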
- 103: The device for calculating the number of syllables per unit time determines, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment.
- It can be seen that, in the embodiments of the present application, the electronic device obtains a first audio segment including a human voice and background music, performs human voice separation on the first audio segment, and obtains a second audio segment including only the human voice; the second audio segment is input into the trained neural network model for processing, and the first feature vector is output.
- Based on the first feature vector, the target number of syllables corresponding to the second audio segment is determined, the target singing time corresponding to the second audio segment is determined, and the target number of syllables per unit time corresponding to the second audio segment is determined based on the target number of syllables and the target singing time.
- Compared with counting the syllables and singing time of songs that have lyrics text, the target number of syllables and the target singing time here are determined from the second audio segment, which includes only the human voice; because the second audio segment includes no lyrics text, this makes it possible to calculate the number of syllables per unit time of songs without lyrics text.
- In a possible example, the device for calculating the number of syllables per unit time determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment includes:
- the device determines the target ratio of the target number of syllables to the target singing time;
- the device judges whether the target ratio is within a set range;
- if so, the device uses the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- The set range can be user-defined, which is not limited here.
- Further, the method also includes:
- if the target ratio is not within the set range, the device judges whether the target ratio is greater than the maximum value of the set range;
- if so, the device uses the maximum value of the set range as the target number of syllables per unit time corresponding to the second audio segment;
- if not, the device uses the minimum value of the set range as the target number of syllables per unit time corresponding to the second audio segment.
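- Taken together, the range handling above amounts to clamping the syllables-to-time ratio into the set range; a one-function sketch follows, where the range bounds are assumed user-defined values.

```python
def target_syllables_per_unit_time(target_syllables, target_singing_time,
                                   range_min=0.5, range_max=10.0):
    """Hedged sketch: [range_min, range_max] is the user-defined set range."""
    ratio = target_syllables / target_singing_time
    if ratio > range_max:   # above the set range: use its maximum value
        return range_max
    if ratio < range_min:   # below the set range: use its minimum value
        return range_min
    return ratio            # within the set range: use the target ratio
```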
- Consistent with the embodiment shown in FIG. 1 above, please refer to FIG. 2, which is a schematic flowchart of the second method for calculating the number of syllables per unit time provided by an embodiment of the present application. The method is applied to a device for calculating the number of syllables per unit time and includes steps 201-210, which are specifically as follows:
- 201: The device for calculating the number of syllables per unit time obtains a first audio segment including a human voice and background music, and performs human voice separation on the first audio segment to obtain a second audio segment including only the human voice; the second audio segment is input into the trained neural network model for processing, and a first feature vector is output, the trained neural network model being used to extract feature vectors of human-voice audio segments.
- 202: The device performs binarization processing on the first feature vector to obtain a second feature vector, where the magnitude of each value in the second feature vector is either a first threshold or a second threshold, and the first threshold is less than the second threshold.
- 203: If at least one first target value exists in the second feature vector, the device sets the magnitude of each of the at least one first target value to the first threshold to obtain a third feature vector, where the number of first values between each first target value and its nearest second target value is greater than or equal to a third threshold, and the magnitudes of the first target value and the second target value are both the second threshold.
- 204: If at least one target value group exists in the third feature vector, where each target value group includes two adjacent third target values, the magnitude of each third target value is the second threshold, and each third target value corresponds to a moment, the device determines the time difference corresponding to each target value group.
- 205: If the time difference corresponding to a target value group is less than or equal to a set duration, the device sets the magnitude of either third target value in that target value group to the first threshold to obtain a fourth feature vector.
- 206: The device determines the second value quantity, namely the number of values in the fourth feature vector whose magnitude is the second threshold, and uses the second value quantity as the target number of syllables corresponding to the second audio segment.
- 207: The device determines the target singing time corresponding to the second audio segment.
- 208: The device determines the target ratio of the target number of syllables to the target singing time.
- 209: The device judges whether the target ratio is within a set range.
- 210: If so, the device uses the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- It should be noted that, for the specific implementation of each step of the method shown in FIG. 2, reference may be made to the implementations described above, and details are not repeated here.
- Consistent with the embodiments shown in FIG. 1 and FIG. 2 above, please refer to FIG. 3, which is a schematic flowchart of the third method for calculating the number of syllables per unit time provided by an embodiment of the present application. The method is applied to a device for calculating the number of syllables per unit time and includes steps 301-313, which are specifically as follows:
- 301: The device for calculating the number of syllables per unit time obtains a first audio segment including a human voice and background music, and performs human voice separation on the first audio segment to obtain a second audio segment including only the human voice.
- 302: The device obtains N training sample data, where N is an integer greater than 1.
- 303: The device inputs the i-th training sample data into the initial neural network model for forward training, outputs a prediction result, constructs a neural network loss function based on the prediction result, and performs reverse training on the initial neural network model based on the neural network loss function to obtain a once-trained neural network model, the i-th training sample data being any one of the N training sample data.
- 304: The device performs the same operation on the (N-1) training sample data other than the i-th training sample data among the N training sample data, obtaining a neural network model after N trainings.
- 305: The device uses the neural network model after the N trainings as the trained neural network model.
- 306: The device inputs the second audio segment into the trained neural network model for processing and outputs a first feature vector, the trained neural network model being used to extract feature vectors of human-voice audio segments.
- 307: The device determines, based on the first feature vector, the target number of syllables corresponding to the second audio segment.
- 308: The device performs silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment.
- 309: The device determines the target duration corresponding to the at least one non-silent segment.
- 310: The device uses the target duration as the target singing time corresponding to the second audio segment.
- 311: The device determines the target ratio of the target number of syllables to the target singing time.
- 312: The device judges whether the target ratio is within a set range.
- 313: If so, the device uses the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- It should be noted that, for the specific implementation of each step of the method shown in FIG. 3, reference may be made to the implementations described above, and details are not repeated here.
- Please refer to FIG. 4, which is a block diagram of the functional units of a device for calculating the number of syllables per unit time provided by an embodiment of the present application.
- The device 400 for calculating the number of syllables per unit time includes:
- an acquiring unit 401 configured to acquire a first audio segment including a human voice and background music;
- an execution unit 402 configured to perform human voice separation on the first audio segment to obtain a second audio segment including only the human voice;
- a processing unit 403 configured to input the second audio segment into a trained neural network model for processing and output a first feature vector, the trained neural network model being used to extract feature vectors of human-voice audio segments;
- a first determining unit 404 configured to determine, based on the first feature vector, the target number of syllables corresponding to the second audio segment;
- a second determining unit 405 configured to determine the target singing time corresponding to the second audio segment;
- a third determining unit 406 configured to determine, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment.
- It can be seen that, in the embodiments of the present application, a first audio segment including a human voice and background music is obtained, human voice separation is performed on the first audio segment to obtain a second audio segment including only the human voice, and the second audio segment is input into the trained neural network model for processing, outputting a first feature vector.
- Based on the first feature vector, the target number of syllables corresponding to the second audio segment is determined, the target singing time corresponding to the second audio segment is determined, and the target number of syllables per unit time corresponding to the second audio segment is determined based on the target number of syllables and the target singing time.
- Because the second audio segment includes no lyrics text, this makes it possible to calculate the number of syllables per unit time of songs without lyrics text.
- In a possible example, the device 400 for calculating the number of syllables per unit time further includes a training unit 407.
- The training unit 407 is configured to: obtain N training sample data, where N is an integer greater than 1; input the i-th training sample data into the initial neural network model for forward training, output a prediction result, construct a neural network loss function based on the prediction result, and perform reverse training on the initial neural network model based on the neural network loss function to obtain a once-trained neural network model, the i-th training sample data being any one of the N training sample data; perform the same operation on the (N-1) training sample data other than the i-th training sample data among the N training sample data to obtain a neural network model after N trainings; and use the neural network model after the N trainings as the trained neural network model.
- In a possible example, the trained neural network model includes M network layers, the M network layers include a fully connected layer, and M is an integer greater than 1. In terms of inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector, the processing unit 403 is specifically configured to:
- perform audio feature extraction on the second audio segment to obtain a target audio feature;
- input the target audio feature into the i-th network layer for processing and output the output data set corresponding to the i-th network layer;
- input the output data set corresponding to the i-th network layer into the (i+1)-th network layer for processing and output the output data set corresponding to the (i+1)-th network layer;
- until i = (M-1), obtain the output data set corresponding to the (M-1)-th network layer, where i is an increasing integer with an initial value of 1 and a step of 1;
- input the output data set corresponding to the (M-1)-th network layer into the fully connected layer for processing and output the first feature vector.
- In a possible example, in terms of determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment, the first determining unit 404 is specifically configured to:
- perform binarization processing on the first feature vector to obtain a second feature vector, where the magnitude of each value in the second feature vector is either a first threshold or a second threshold, and the first threshold is less than the second threshold;
- if at least one first target value exists in the second feature vector, set the magnitude of each of the at least one first target value to the first threshold to obtain a third feature vector, where the number of first values between each first target value and its nearest second target value is greater than or equal to a third threshold, and the magnitudes of the first target value and the second target value are both the second threshold;
- if at least one target value group exists in the third feature vector, where each target value group includes two adjacent third target values, the magnitude of each third target value is the second threshold, and each third target value corresponds to a moment, determine the time difference corresponding to each target value group;
- if the time difference corresponding to a target value group is less than or equal to a set duration, set the magnitude of either third target value in that target value group to the first threshold to obtain a fourth feature vector;
- determine the second value quantity, namely the number of values in the fourth feature vector whose magnitude is the second threshold, and use the second value quantity as the target number of syllables corresponding to the second audio segment.
- In a possible example, in terms of determining the target singing time corresponding to the second audio segment, the second determining unit 405 is specifically configured to: perform silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment; determine the target duration corresponding to the at least one non-silent segment; and use the target duration as the target singing time corresponding to the second audio segment.
- In a possible example, in terms of determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment, the third determining unit 406 is specifically configured to: determine the target ratio of the target number of syllables to the target singing time; judge whether the target ratio is within a set range; and if so, use the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- Consistent with the embodiments shown in FIG. 1, FIG. 2, and FIG. 3 above, please refer to FIG. 5, which is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- The electronic device 500 includes a processor, a memory, a communication interface, and one or more programs.
- The one or more programs are stored in the memory and are configured to be executed by the processor.
- The programs include instructions for executing the following steps:
- obtaining a first audio segment including a human voice and background music, and performing human voice separation on the first audio segment to obtain a second audio segment including only the human voice; inputting the second audio segment into a trained neural network model for processing and outputting a first feature vector, the trained neural network model being used to extract feature vectors of human-voice audio segments;
- determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment, and determining the target singing time corresponding to the second audio segment;
- determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment.
- It can be seen that, in the embodiments of the present application, a first audio segment including a human voice and background music is obtained, human voice separation is performed on the first audio segment to obtain a second audio segment including only the human voice, and the second audio segment is input into the trained neural network model for processing, outputting a first feature vector.
- Based on the first feature vector, the target number of syllables corresponding to the second audio segment is determined, the target singing time corresponding to the second audio segment is determined, and the target number of syllables per unit time corresponding to the second audio segment is determined based on the target number of syllables and the target singing time.
- Because the target number of syllables and the target singing time are determined from the second audio segment, which includes only the human voice and no lyrics text, this makes it possible to calculate the number of syllables per unit time of songs without lyrics text.
- In a possible example, the above program also includes instructions for performing the following steps:
- obtaining N training sample data, where N is an integer greater than 1;
- inputting the i-th training sample data into the initial neural network model for forward training, outputting a prediction result, constructing a neural network loss function based on the prediction result, and performing reverse training on the initial neural network model based on the neural network loss function to obtain a once-trained neural network model, the i-th training sample data being any one of the N training sample data;
- performing the same operation on the (N-1) training sample data other than the i-th training sample data among the N training sample data to obtain a neural network model after N trainings;
- using the neural network model after the N trainings as the trained neural network model.
- In a possible example, the trained neural network model includes M network layers, the M network layers include a fully connected layer, and M is an integer greater than 1. In terms of inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector, the above program includes instructions specifically for executing the following steps:
- performing audio feature extraction on the second audio segment to obtain a target audio feature;
- inputting the target audio feature into the i-th network layer for processing and outputting the output data set corresponding to the i-th network layer;
- inputting the output data set corresponding to the i-th network layer into the (i+1)-th network layer for processing and outputting the output data set corresponding to the (i+1)-th network layer;
- until i = (M-1), obtaining the output data set corresponding to the (M-1)-th network layer, where i is an increasing integer with an initial value of 1 and a step of 1;
- inputting the output data set corresponding to the (M-1)-th network layer into the fully connected layer for processing and outputting the first feature vector.
- In a possible example, in terms of determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment, the foregoing program includes instructions specifically for executing the following steps:
- performing binarization processing on the first feature vector to obtain a second feature vector, where the magnitude of each value in the second feature vector is either a first threshold or a second threshold, and the first threshold is less than the second threshold;
- if at least one first target value exists in the second feature vector, setting the magnitude of each of the at least one first target value to the first threshold to obtain a third feature vector, where the number of first values between each first target value and its nearest second target value is greater than or equal to a third threshold, and the magnitudes of the first target value and the second target value are both the second threshold;
- if at least one target value group exists in the third feature vector, where each target value group includes two adjacent third target values, the magnitude of each third target value is the second threshold, and each third target value corresponds to a moment, determining the time difference corresponding to each target value group;
- if the time difference corresponding to a target value group is less than or equal to a set duration, setting the magnitude of either third target value in that target value group to the first threshold to obtain a fourth feature vector;
- determining the second value quantity, namely the number of values in the fourth feature vector whose magnitude is the second threshold, and using the second value quantity as the target number of syllables corresponding to the second audio segment.
- In a possible example, in terms of determining the target singing time corresponding to the second audio segment, the above program includes instructions specifically for executing the following steps: performing silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment; determining the target duration corresponding to the at least one non-silent segment; and using the target duration as the target singing time corresponding to the second audio segment.
- In a possible example, in terms of determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment, the foregoing program includes instructions specifically for executing the following steps: determining the target ratio of the target number of syllables to the target singing time; judging whether the target ratio is within a set range; and if so, using the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- The embodiments of the present application also provide a computer storage medium for storing a computer program.
- The computer program is executed by a processor to implement part or all of the steps of any method described in the above method embodiments.
- The computer includes an electronic device.
- The embodiments of the present application also provide a computer program product.
- The computer program product includes a non-transitory computer-readable storage medium storing a computer program.
- The computer program is operable to cause a computer to execute part or all of the steps of any method described in the above method embodiments.
- The computer program product may be a software installation package, and the computer includes an electronic device.
- It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations; however, those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application, some steps may be performed in other orders or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
- In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
- In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways.
- For example, the device embodiments described above are only illustrative.
- For example, the division of the units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
- the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
- If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory.
- Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and includes a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of the present application.
- The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), removable hard disk, magnetic disk, or optical disc.
- Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable memory, and the memory can include a flash disk, read-only memory (English: Read-Only Memory, abbreviation: ROM), random access memory (English: Random Access Memory, abbreviation: RAM), magnetic disk, optical disc, or the like.
- The embodiments of the present application have been introduced in detail above; specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those of ordinary skill in the art, following the ideas of the present application, will find changes in the specific implementations and the application scope; in summary, the content of this specification should not be construed as limiting the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Auxiliary Devices For Music (AREA)
- Telephonic Communication Services (AREA)
Abstract
A method for calculating the number of syllables per unit time and a related device. The method includes: obtaining a first audio segment including a human voice and background music, and performing human voice separation on the first audio segment to obtain a second audio segment including only the human voice; inputting the second audio segment into a trained neural network model for processing and outputting a first feature vector, the trained neural network model being used to extract feature vectors of human-voice audio segments (101); determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment, and determining the target singing time corresponding to the second audio segment (102); and determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment (103). This makes it possible to calculate the number of syllables per unit time of songs without lyrics text.
Description
本申请涉及音频处理技术领域,具体涉及一种单位时间内音节数量的计算方法及相关装置。
目前,确定单位时间内音节数量的方式为:对具有歌词文本的歌曲进行音节数量和唱歌时间的统计,进而计算具有歌词文本的歌曲的单位时间内音节数量。由于该方式需要有时间戳的歌词文本,因此无法适用于各种音频段,适应性较差,因此需要一种提升适用性的单位时间内音节数量的计算方法。
发明内容
本申请实施例提供一种单位时间内音节数量的计算方法及相关装置,用于计算无歌词文本的歌曲的单位时间内音节数量。
第一方面,本申请实施例提供一种单位时间内音节数量的计算方法,所述方法包括:
获取包括人声和背景音乐的第一音频段,对所述第一音频段进行人声分离,得到只包括人声的第二音频段;将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量;
基于所述第一特征向量确定所述第二音频段对应的目标音节数量,以及确定所述第二音频段对应的目标唱歌时间;
基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量。
第二方面,本申请实施例提供一种单位时间内音节数量的计算装置,所述装置包括:
获取单元,用于获取包括人声和背景音乐的第一音频段;
执行单元,用于对所述第一音频段进行人声分离,得到只包括人声的第二音频段;
处理单元,用于将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特 征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量;
第一确定单元,用于基于所述第一特征向量确定所述第二音频段对应的目标音节数量;
第二确定单元,用于确定所述第二音频段对应的目标唱歌时间;
第三确定单元,用于基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量。
第三方面,本申请实施例提供一种电子设备,包括处理器、存储器、通信接口,以及一个或多个程序,上述一个或多个程序被存储在上述存储器中,并且被配置由上述处理器执行,上述程序包括用于执行本申请实施例第一方面所述的方法中的部分或全部步骤的指令。
第四方面,本申请实施例提供了一种计算机可读存储介质,上述计算机可读存储介质用于存储计算机程序,上述计算机程序被处理器执行,以实现如本申请实施例第一方面所述的方法中所描述的部分或全部步骤。
第五方面,本申请实施例提供了一种计算机程序产品,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如本申请实施例第一方面所述的方法中所描述的部分或全部步骤。
可以看出,在本申请实施例中,电子设备获取包括人声和背景音乐的第一音频段,对第一音频段进行人声分离,得到只包括人声的第二音频段,将第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,基于第一特征向量确定第二音频段对应的目标音节数量,确定第二音频段对应的目标唱歌时间,基于目标音节数量和目标唱歌时间确定第二音频段对应的目标单位时间内音节数量。相较于对具有歌词文本的歌曲进行音节数量和唱歌时间的统计,进而计算具有歌词文本的歌曲的单位时间内音节数量,在本申请实施例中,基于只包括人声的第二音频段确定第二音频段对应的目标音节数量和目标唱歌时间,进而计算第二音频段对应的目标单位时间内音节数量,由于第二音频段不包括歌词文本,这样实现了计算无歌词文本的歌曲的单位时间内音节数量。
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
图1是本申请实施例提供的第一种单位时间内音节数量的计算方法的流程示意图;
图2是本申请实施例提供的第二种单位时间内音节数量的计算方法的流程示意图;
图3是本申请实施例提供的第三种单位时间内音节数量的计算方法的流程示意图;
图4是本申请实施例提供的一种单位时间内音节数量的计算装置的功能单元组成框图;
图5是本申请实施例提供的一种电子设备的结构示意图。
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本申请保护的范围。
以下分别进行详细说明。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
本申请实施例所涉及到的单位时间内音节数量的计算装置可集成在电子设备中,电子设备可以包括各种具有无线通信功能的手持设备、车载设备、可穿戴设备、计算设备或连接到无线调制解调器的其他处理设备,以及各种形式的用户设备(User Equipment,UE),移动台(Mobile Station,MS),终端设备(Terminal Device,TD),等等。
下面对本申请实施例进行详细介绍。
请参阅图1,图1是本申请实施例提供的第一种单位时间内音节数量的计算方法的流程示意图,该单位时间内音节数量的计算方法应用于单位时间内音节数量的计算装置,该单位时间内音节数量的计算方法包括步骤101-104,具体如下:
101:单位时间内音节数量的计算装置获取包括人声和背景音乐的第一音频段,对所述第一音频段进行人声分离,得到只包括人声的第二音频段;将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量。
其中,音节是音位组合构成的最小的语音结构单位,第二音频段的时长小于第一音频段的时长。
其中,对第一音频段进行人声分离,得到只包括人声的第二音频段采用现有技术,在此不再叙述。
在一个可能的示例中,单位时间内音节数量的计算装置将第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量之前,所述方法还包括:
单位时间内音节数量的计算装置确定第二音频段的时长,以及判断第二音频段的时长是否大于或等于目标时长;
若是,则单位时间内音节数量的计算装置触发所述将第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量的操作。
其中,目标时长可以为用户自定义的,比如目标时长为10s。
在一个可能的示例中,单位时间内音节数量的计算装置将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量之前,所述方法还包括:
单位时间内音节数量的计算装置获取N个训练样本数据,所述N为大于1的整数;
单位时间内音节数量的计算装置将第i个训练样本数据输入初始的神经网络模型进行正向训练,输出预测结果,基于所述预测结果构造神经网络损失函数,基于所述神经网络损失函数对所述初始的神经网络模型进行反向训练,得到一次训练后的神经网络模型,所述第i个训练样本数据为所述N个训练样本数据中的任意一个;
单位时间内音节数量的计算装置对所述N个训练样本数据中除所述第i个训练样本数据之外的(N-1)个训练样本数据执行相同操作,得到N次训练后的神经网络模型;
单位时间内音节数量的计算装置将所述N次训练后的神经网络模型作为所述训练好的 神经网络模型。
其中,训练样本数据为无歌词文本的歌曲,无歌词文本的歌曲中的一个字对应一个音节,一个音节对应一个时刻。
其中,初始的神经网络模型为未训练的神经网络模型。
在一个可能的示例中,训练好的神经网络模型包括M个网络层,所述M个网络层包括全连接层,所述M为大于1的整数,单位时间内音节数量的计算装置将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,包括:
单位时间内音节数量的计算装置对所述第二音频段进行音频特征提取,得到目标音频特征;
单位时间内音节数量的计算装置将所述目标音频特征输入第i个网络层进行处理,输出所述第i个网络层对应的输出数据集合;
单位时间内音节数量的计算装置将所述第i个网络层对应的输出数据集合输入第(i+1)个网络层进行处理,输出所述第(i+1)个网络层对应的输出数据集合;
直到i=(M-1),单位时间内音节数量的计算装置得到第(M-1)个网络层对应的输出数据集合,所述i是初始值为1,以1为间隔的递增整数;
单位时间内音节数量的计算装置将所述第(M-1)个网络层对应的输出数据集合输入所述全连接层进行处理,输出所述第一特征向量。
其中,(M-1)个网络层中的第1个网络层至第(M-1)个网络层是相同的。
具体地,单位时间内音节数量的计算装置对第二音频段进行音频特征提取,得到目标音频特征的实施方式可以为:
单位时间内音节数量的计算装置对第二音频段进行降采样,得到降采样后的第二音频段,降采样后的第二音频段对应设定采样率;
单位时间内音节数量的计算装置基于离散时间傅里叶变换参数对降采样后的第二音频段进行离散时间短时傅里叶变换,得到降采样后的第二音频段对应的多个第一离散频谱图,每个第一离散频谱图对应一帧;
单位时间内音节数量的计算装置将每个第一离散频谱图进行梅尔频谱转换,得到多个第一离散频谱图对应的多个第二离散频谱图;
单位时间内音节数量的计算装置基于多个第二离散频谱图生成目标声谱图;
单位时间内音节数量的计算装置确定目标声谱图对应的第一矩阵,基于第一矩阵生成第二矩阵,第二矩阵中的第j列等于第一矩阵中的第(j+1)列与第j列的差值;
单位时间内音节数量的计算装置对第一矩阵和第二矩阵进行叠加,得到第三矩阵,以及将第三矩阵作为目标音频特征。
其中,设定采样率可以为8000Hz,离散时间短时傅里叶变换参数包括帧长和步长,帧长可以为256个采样点,步长可以为80个采样点,在此不作限定。
其中,目标声谱图为一个随时间变化的频谱图。
其中,第一矩阵的最后一列和第二矩阵的最后一列相同。
具体地,单位时间内音节数量的计算装置将第i个网络层对应的输出数据集合输入第(i+1)个网络层进行处理,输出第(i+1)个网络层对应的输出数据集合的实施方式可以为:
单位时间内音节数量的计算装置将第i个网络层对应的输出数据集合输入第(i+1)个网络层,第(i+1)个网络层包括卷积矩阵(i+1)-1、卷积矩阵(i+1)-2和激活矩阵(i+1)-3;
单位时间内音节数量的计算装置将第i个网络层对应的输出数据集合与卷积矩阵(i+1)-1进行相乘运算,得到第一输出矩阵(i+1)-4;
单位时间内音节数量的计算装置将第i个网络层对应的输出数据集合与卷积矩阵(i+1)-2进行相乘运算,得到第二输出矩阵(i+1)-5,以及将第二输出矩阵(i+1)-5与激活矩阵(i+1)-3进行相乘运算,得到第三输出矩阵(i+1)-6;
单位时间内音节数量的计算装置将第一输出矩阵(i+1)-4与第三输出矩阵(i+1)-6进行相乘运算,得到第四输出矩阵(i+1)-7;
单位时间内音节数量的计算装置对第四输出矩阵(i+1)-7和第i个网络层对应的输出数据集合进行叠加,得到第(i+1)个网络层对应的输出数据集合。
102:单位时间内音节数量的计算装置基于所述第一特征向量确定所述第二音频段对应的目标音节数量,以及确定所述第二音频段对应的目标唱歌时间。
在一个可能的示例中,单位时间内音节数量的计算装置基于所述第一特征向量确定所述第二音频段对应的目标音节数量,包括:
单位时间内音节数量的计算装置对所述第一特征向量进行二值化处理,得到第二特征向量,所述第二特征向量中各值的大小为第一阈值或第二阈值,所述第一阈值小于所述第 二阈值;
若所述第二特征向量中存在至少一个第一目标值,则单位时间内音节数量的计算装置将所述至少一个第一目标值的大小均设置为所述第一阈值,得到第三特征向量,每个第一目标值与其最近的第二目标值之间的第一值数量大于或等于第三阈值,所述第一目标值和所述第二目标值的大小均为所述第二阈值;
若所述第三特征向量中存在至少一个目标数值组,每个目标数值组包括相邻的两个第三目标值,每个第三目标值的大小为所述第二阈值,每个第三目标值对应一个时刻,则单位时间内音节数量的计算装置确定每个目标数值组对应的时差;
若目标数值组对应的时差小于或等于设定时长,则单位时间内音节数量的计算装置将所述目标数值组中的任意一个第三目标值的大小设置为所述第一阈值,得到第四特征向量;
单位时间内音节数量的计算装置确定所述第四特征向量中各值的大小为所述第二阈值的第二值数量,以及将所述第二值数量作为所述第二音频段对应的所述目标音节数量。
其中,第一特征向量包括多个值,每个值的大小介于0-1之间,每个值的大小表示音节的概率。
具体地,单位时间内音节数量的计算装置对第一特征向量进行二值化处理,得到第二特征向量的实施方式可以为:单位时间内音节数量的计算装置判断第一特征向量中各值的大小是否大于或等于固定值;若值的大小小于固定值,则单位时间内音节数量的计算装置将该值设置为第一阈值;或者,若值的大小大于或等于固定值,则单位时间内音节数量的计算装置将该值设置为第二阈值。
其中,固定值可以为用户自定义的,比如固定值为0.5。
其中,第一阈值可以为0,第二阈值可以为1。
其中,第三阈值和设定时长可以是用户自定义的,在此不作限定。
在一个可能的示例中,单位时间内音节数量的计算装置确定所述第二音频段对应的目标唱歌时间,包括:
单位时间内音节数量的计算装置对所述第二音频段进行静音检测,得到所述第二音频段包括的至少一个静音段和至少一个非静音段;
单位时间内音节数量的计算装置确定所述至少一个非静音段对应的目标时长;
单位时间内音节数量的计算装置将所述目标时长作为所述第二音频段对应的所述目标 唱歌时间。
其中,单位时间内音节数量的计算装置对第二音频段进行静音检测,得到第二音频段包括的至少一个静音段和至少一个非静音段采用现有技术,在此不再叙述。
103:单位时间内音节数量的计算装置基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量。
可以看出,在本申请实施例中,电子设备获取包括人声和背景音乐的第一音频段,对第一音频段进行人声分离,得到只包括人声的第二音频段,将第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,基于第一特征向量确定第二音频段对应的目标音节数量,确定第二音频段对应的目标唱歌时间,基于目标音节数量和目标唱歌时间确定第二音频段对应的目标单位时间内音节数量。相较于对具有歌词文本的歌曲进行音节数量和唱歌时间的统计,进而计算具有歌词文本的歌曲的单位时间内音节数量,在本申请实施例中,基于只包括人声的第二音频段确定第二音频段对应的目标音节数量和目标唱歌时间,进而计算第二音频段对应的目标单位时间内音节数量,由于第二音频段不包括歌词文本,这样实现了计算无歌词文本的歌曲的单位时间内音节数量。
在一个可能的示例中,单位时间内音节数量的计算装置基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量,包括:
单位时间内音节数量的计算装置确定所述目标音节数量与所述目标歌唱时间的目标比值;
单位时间内音节数量的计算装置判断所述目标比值是否处于设定范围;
若是,则单位时间内音节数量的计算装置将所述目标比值作为所述第二音频段对应的所述目标单位时间内音节数量。
其中,设定范围可以是用户自定义的,在此不作限定。
进一步地,所述方法还包括:
若目标比值未处于设定范围,则单位时间内音节数量的计算装置判断目标比值是否大于设定范围的最大值;
若是,则单位时间内音节数量的计算装置将设定范围的最大值作为第二音频段对应的目标单位时间内音节数量;
若否,则单位时间内音节数量的计算装置将设定范围的最小值作为第二音频段对应的 目标单位时间内音节数量。
与上述图1所示的实施例一致的,请参阅图2,图2是本申请实施例提供的第二种单位时间内音节数量的计算方法的流程示意图,该单位时间内音节数量的计算方法应用于单位时间内音节数量的计算装置,该单位时间内音节数量的计算方法包括步骤201-210,具体如下:
201:单位时间内音节数量的计算装置获取包括人声和背景音乐的第一音频段,对所述第一音频段进行人声分离,得到只包括人声的第二音频段;将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量。
202:单位时间内音节数量的计算装置对所述第一特征向量进行二值化处理,得到第二特征向量,所述第二特征向量中各值的大小为第一阈值或第二阈值,所述第一阈值小于所述第二阈值。
203:若所述第二特征向量中存在至少一个第一目标值,则单位时间内音节数量的计算装置将所述至少一个第一目标值的大小均设置为所述第一阈值,得到第三特征向量,每个第一目标值与其最近的第二目标值之间的第一值数量大于或等于第三阈值,所述第一目标值和所述第二目标值的大小均为所述第二阈值。
204:若所述第三特征向量中存在至少一个目标数值组,每个目标数值组包括相邻的两个第三目标值,每个第三目标值的大小为所述第二阈值,每个第三目标值对应一个时刻,则单位时间内音节数量的计算装置确定每个目标数值组对应的时差。
205:若目标数值组对应的时差小于或等于设定时长,则单位时间内音节数量的计算装置将所述目标数值组中的任意一个第三目标值的大小设置为所述第一阈值,得到第四特征向量。
206:单位时间内音节数量的计算装置确定所述第四特征向量中各值的大小为所述第二阈值的第二值数量,以及将所述第二值数量作为所述第二音频段对应的目标音节数量。
207:单位时间内音节数量的计算装置确定所述第二音频段对应的目标唱歌时间。
208:单位时间内音节数量的计算装置确定所述目标音节数量与所述目标歌唱时间的目标比值。
209:单位时间内音节数量的计算装置判断所述目标比值是否处于设定范围。
210:若是,则单位时间内音节数量的计算装置将所述目标比值作为所述第二音频段对应的目标单位时间内音节数量。
需要说明的是,图2所示的方法的各个步骤的具体实现过程可参见上述方法所述的具体实现过程,在此不再叙述。
与上述图1和图2所示的实施例一致的,请参阅图3,图3是本申请实施例提供的第三种单位时间内音节数量的计算方法的流程示意图,该单位时间内音节数量的计算方法应用于单位时间内音节数量的计算装置,该单位时间内音节数量的计算方法包括步骤301-313,具体如下:
301:单位时间内音节数量的计算装置获取包括人声和背景音乐的第一音频段,对所述第一音频段进行人声分离,得到只包括人声的第二音频段。
302:单位时间内音节数量的计算装置获取N个训练样本数据,所述N为大于1的整数;
303:单位时间内音节数量的计算装置将第i个训练样本数据输入初始的神经网络模型进行正向训练,输出预测结果,基于所述预测结果构造神经网络损失函数,基于所述神经网络损失函数对所述初始的神经网络模型进行反向训练,得到一次训练后的神经网络模型,所述第i个训练样本数据为所述N个训练样本数据中的任意一个。
304:单位时间内音节数量的计算装置对所述N个训练样本数据中除所述第i个训练样本数据之外的(N-1)个训练样本数据执行相同操作,得到N次训练后的神经网络模型。
305:单位时间内音节数量的计算装置将所述N次训练后的神经网络模型作为训练好的神经网络模型。
306:单位时间内音节数量的计算装置将所述第二音频段输入所述训练好的神经网络模型进行处理,输出第一特征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量。
307:单位时间内音节数量的计算装置基于所述第一特征向量确定所述第二音频段对应的目标音节数量。
308:单位时间内音节数量的计算装置对所述第二音频段进行静音检测,得到所述第二 音频段包括的至少一个静音段和至少一个非静音段。
309:单位时间内音节数量的计算装置确定所述至少一个非静音段对应的目标时长。
310:单位时间内音节数量的计算装置将所述目标时长作为所述第二音频段对应的目标唱歌时间。
311:单位时间内音节数量的计算装置确定所述目标音节数量与所述目标歌唱时间的目标比值。
312:单位时间内音节数量的计算装置判断所述目标比值是否处于设定范围。
313:若是,则单位时间内音节数量的计算装置将所述目标比值作为所述第二音频段对应的目标单位时间内音节数量。
需要说明的是,图3所示的方法的各个步骤的具体实现过程可参见上述方法所述的具体实现过程,在此不再叙述。
请参阅图4,图4是本申请实施例提供的一种单位时间内音节数量的计算装置的功能单元组成框图,该单位时间内音节数量的计算装置400包括:
获取单元401,用于获取包括人声和背景音乐的第一音频段;
执行单元402,用于对所述第一音频段进行人声分离,得到只包括人声的第二音频段;
处理单元403,用于将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量;
第一确定单元404,用于基于所述第一特征向量确定所述第二音频段对应的目标音节数量;
第二确定单元405,用于确定所述第二音频段对应的目标唱歌时间;
第三确定单元406,用于基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量。
可以看出,在本申请实施例中,获取包括人声和背景音乐的第一音频段,对第一音频段进行人声分离,得到只包括人声的第二音频段,将第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,基于第一特征向量确定第二音频段对应的目标音节数量,确定第二音频段对应的目标唱歌时间,基于目标音节数量和目标唱歌时间确定第二音频段对应的目标单位时间内音节数量。相较于对具有歌词文本的歌曲进行音节数量和唱歌时间 的统计,进而计算具有歌词文本的歌曲的单位时间内音节数量,在本申请实施例中,基于只包括人声的第二音频段确定第二音频段对应的目标音节数量和目标唱歌时间,进而计算第二音频段对应的目标单位时间内音节数量,由于第二音频段不包括歌词文本,这样实现了计算无歌词文本的歌曲的单位时间内音节数量。
在一个可能的示例中,上述单位时间内音节数量的计算装置400还包括训练单元407,
训练单元407,用于获取N个训练样本数据,所述N为大于1的整数;将第i个训练样本数据输入初始的神经网络模型进行正向训练,输出预测结果,基于所述预测结果构造神经网络损失函数,基于所述神经网络损失函数对所述初始的神经网络模型进行反向训练,得到一次训练后的神经网络模型,所述第i个训练样本数据为所述N个训练样本数据中的任意一个;对所述N个训练样本数据中除所述第i个训练样本数据之外的(N-1)个训练样本数据执行相同操作,得到N次训练后的神经网络模型;将所述N次训练后的神经网络模型作为所述训练好的神经网络模型。
在一个可能的示例中,训练好的神经网络模型包括M个网络层,M个网络层包括全连接层,M为大于1的整数,在将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量方面,上述处理单元403具体用于:
对所述第二音频段进行音频特征提取,得到目标音频特征;
将所述目标音频特征输入第i个网络层进行处理,输出所述第i个网络层对应的输出数据集合;
将所述第i个网络层对应的输出数据集合输入第(i+1)个网络层进行处理,输出所述第(i+1)个网络层对应的输出数据集合;
直到i=(M-1),得到第(M-1)个网络层对应的输出数据集合,所述i是初始值为1,以1为间隔的递增整数;
将所述第(M-1)个网络层对应的输出数据集合输入所述全连接层进行处理,输出所述第一特征向量。
在一个可能的示例中,在基于所述第一特征向量确定所述第二音频段对应的目标音节数量方面,上述第一确定单元404具体用于:
对所述第一特征向量进行二值化处理,得到第二特征向量,所述第二特征向量中各值的大小为第一阈值或第二阈值,所述第一阈值小于所述第二阈值;
若所述第二特征向量中存在至少一个第一目标值,则将所述至少一个第一目标值的大小设置均为所述第一阈值,得到第三特征向量,每个第一目标值与其最近的第二目标值之间的第一值数量大于或等于第三阈值,所述第一目标值和所述第二目标值的大小均为所述第二阈值;
若所述第三特征向量中存在至少一个目标数值组,每个目标数值组包括相邻的两个第三目标值,每个第三目标值的大小为所述第二阈值,每个第三目标值对应一个时刻,则确定每个目标数值组对应的时差;
若目标数值组对应的时差小于或等于设定时长,则将所述目标数值组中的任意一个第三目标值的大小设置为所述第一阈值,得到第四特征向量;
确定所述第四特征向量中各值的大小为所述第二阈值的第二值数量,以及将所述第二值数量作为所述第二音频段对应的所述目标音节数量。
在一个可能的示例中,在确定所述第二音频段对应的目标唱歌时间方面,上述第二确定单元405具体用于:
对所述第二音频段进行静音检测,得到所述第二音频段包括的至少一个静音段和至少一个非静音段;
确定所述至少一个非静音段对应的目标时长;
将所述目标时长作为所述第二音频段对应的所述目标唱歌时间。
在一个可能的示例中,在基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量方面,上述第三确定单元406具体用于:
确定所述目标音节数量与所述目标歌唱时间的目标比值;
判断所述目标比值是否处于设定范围;
若是,则将所述目标比值作为所述第二音频段对应的所述目标单位时间内音节数量。
与上述图1、图2和图3所示的实施例一致的,请参阅图5,图5是本申请实施例提供的一种电子设备的结构示意图,该电子设备500包括处理器、存储器、通信接口,以及一个或多个程序,上述一个或多个程序被存储在上述存储器中,并且被配置由上述处理器执行,上述程序包括用于执行以下步骤的指令:
获取包括人声和背景音乐的第一音频段,对所述第一音频段进行人声分离,得到只包 括人声的第二音频段;将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,所述训练好的神经网络模型用于提取人声的音频段的特征向量;
基于所述第一特征向量确定所述第二音频段对应的目标音节数量,以及确定所述第二音频段对应的目标唱歌时间;
基于所述目标音节数量和所述目标唱歌时间确定所述第二音频段对应的目标单位时间内音节数量。
可以看出,在本申请实施例中,获取包括人声和背景音乐的第一音频段,对第一音频段进行人声分离,得到只包括人声的第二音频段,将第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量,基于第一特征向量确定第二音频段对应的目标音节数量,确定第二音频段对应的目标唱歌时间,基于目标音节数量和目标唱歌时间确定第二音频段对应的目标单位时间内音节数量。相较于对具有歌词文本的歌曲进行音节数量和歌唱时间的统计,进而计算具有歌词文本的歌曲的单位时间内音节数量,在本申请实施例中,基于只包括人声的第二音频段确定第二音频段对应的目标音节数量和目标唱歌时间,进而计算第二音频段对应的目标单位时间内音节数量,由于第二音频段不包括歌词文本,这样实现了计算无歌词文本的歌曲的单位时间内音节数量。
在一个可能的示例中,上述程序还包括用于执行以下步骤的指令:
获取N个训练样本数据,所述N为大于1的整数;
将第i个训练样本数据输入初始的神经网络模型进行正向训练,输出预测结果,基于所述预测结果构造神经网络损失函数,基于所述神经网络损失函数对所述初始的神经网络模型进行反向训练,得到一次训练后的神经网络模型,所述第i个训练样本数据为所述N个训练样本数据中的任意一个;
对所述N个训练样本数据中除所述第i个训练样本数据之外的(N-1)个训练样本数据执行相同操作,得到N次训练后的神经网络模型;
将所述N次训练后的神经网络模型作为所述训练好的神经网络模型。
在一个可能的示例中,训练好的神经网络模型包括M个网络层,M个网络层包括全连接层,M为大于1的整数,在将所述第二音频段输入训练好的神经网络模型进行处理,输出第一特征向量方面,上述程序包括具体用于执行以下步骤的指令:
对所述第二音频段进行音频特征提取,得到目标音频特征;
将所述目标音频特征输入第i个网络层进行处理,输出所述第i个网络层对应的输出数据集合;
将所述第i个网络层对应的输出数据集合输入第(i+1)个网络层进行处理,输出所述第(i+1)个网络层对应的输出数据集合;
直到i=(M-1),得到第(M-1)个网络层对应的输出数据集合,所述i是初始值为1,以1为间隔的递增整数;
将所述第(M-1)个网络层对应的输出数据集合输入所述全连接层进行处理,输出所述第一特征向量。
在一个可能的示例中,在基于所述第一特征向量确定所述第二音频段对应的目标音节数量方面,上述程序包括具体用于执行以下步骤的指令:
对所述第一特征向量进行二值化处理,得到第二特征向量,所述第二特征向量中各值的大小为第一阈值或第二阈值,所述第一阈值小于所述第二阈值;
若所述第二特征向量中存在至少一个第一目标值,则将所述至少一个第一目标值的大小均设置为所述第一阈值,得到第三特征向量,每个第一目标值与其最近的第二目标值之间的第一值数量大于或等于第三阈值,所述第一目标值和所述第二目标值的大小均为所述第二阈值;
若所述第三特征向量中存在至少一个目标数值组,每个目标数值组包括相邻的两个第三目标值,每个第三目标值的大小为所述第二阈值,每个第三目标值对应一个时刻,则确定每个目标数值组对应的时差;
若目标数值组对应的时差小于或等于设定时长,则将所述目标数值组中的任意一个第三目标值的大小设置为所述第一阈值,得到第四特征向量;
确定所述第四特征向量中各值的大小为所述第二阈值的第二值数量,以及将所述第二值数量作为所述第二音频段对应的所述目标音节数量。
In a possible example, in terms of determining the target singing time corresponding to the second audio segment, the above programs include instructions specifically for performing the following steps:
performing silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment;
determining a target duration corresponding to the at least one non-silent segment;
using the target duration as the target singing time corresponding to the second audio segment.
In a possible example, in terms of determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment, the above programs include instructions specifically for performing the following steps:
determining a target ratio of the target number of syllables to the target singing time;
judging whether the target ratio falls within a set range;
if so, using the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium is used to store a computer program, and the computer program is executed by a processor to implement some or all of the steps of any method described in the above method embodiments. The above computer includes an electronic device.
An embodiment of the present application further provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps of any method described in the above method embodiments. The computer program product may be a software installation package, and the above computer includes an electronic device.
It should be noted that, for brevity of description, each of the foregoing method embodiments is expressed as a series of action combinations. However, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of the above units is only a logical functional division; in actual implementation there may be other ways of division, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or some of the steps in the various methods of the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and application scope based on the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.
Claims (20)
- A method for calculating the number of syllables per unit time, characterized in that the method includes: acquiring a first audio segment including vocals and background music, and performing vocal separation on the first audio segment to obtain a second audio segment including only vocals; inputting the second audio segment into a trained neural network model for processing and outputting a first feature vector, where the trained neural network model is used to extract feature vectors of vocal audio segments; determining, based on the first feature vector, a target number of syllables corresponding to the second audio segment, and determining a target singing time corresponding to the second audio segment; and determining, based on the target number of syllables and the target singing time, a target number of syllables per unit time corresponding to the second audio segment.
- The method according to claim 1, characterized in that, before inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector, the method further includes: acquiring N pieces of training sample data, where N is an integer greater than 1; inputting the i-th piece of training sample data into an initial neural network model for forward training and outputting a prediction result, constructing a neural network loss function based on the prediction result, and performing backward training on the initial neural network model based on the neural network loss function to obtain a once-trained neural network model, where the i-th piece of training sample data is any one of the N pieces of training sample data; performing the same operations on the (N-1) pieces of training sample data other than the i-th piece among the N pieces of training sample data to obtain an N-times-trained neural network model; and using the N-times-trained neural network model as the trained neural network model.
- The method according to claim 1, characterized in that, before inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector, the method further includes: determining the duration of the second audio segment, and judging whether the duration of the second audio segment is greater than or equal to a target duration; and if so, triggering the operation of inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector.
- The method according to claim 2 or 3, characterized in that the trained neural network model includes M network layers, the M network layers include a fully connected layer, and M is an integer greater than 1, and inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector includes: performing audio feature extraction on the second audio segment to obtain a target audio feature; inputting the target audio feature into the i-th network layer for processing and outputting an output data set corresponding to the i-th network layer; inputting the output data set corresponding to the i-th network layer into the (i+1)-th network layer for processing and outputting an output data set corresponding to the (i+1)-th network layer; repeating until i = (M-1) to obtain the output data set corresponding to the (M-1)-th network layer, where i is an integer that starts at 1 and increments in steps of 1; and inputting the output data set corresponding to the (M-1)-th network layer into the fully connected layer for processing and outputting the first feature vector.
- The method according to claim 4, characterized in that determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment includes: binarizing the first feature vector to obtain a second feature vector, where each value in the second feature vector is either a first threshold or a second threshold, and the first threshold is smaller than the second threshold; if at least one first target value exists in the second feature vector, setting each of the at least one first target value to the first threshold to obtain a third feature vector, where the number of first values between each first target value and its nearest second target value is greater than or equal to a third threshold, and the first target values and the second target values are all equal to the second threshold; if at least one target value group exists in the third feature vector, where each target value group includes two adjacent third target values, each third target value is equal to the second threshold, and each third target value corresponds to a time instant, determining the time difference corresponding to each target value group; if the time difference corresponding to a target value group is less than or equal to a set duration, setting either one of the third target values in that target value group to the first threshold to obtain a fourth feature vector; and determining a second value count, namely the number of values in the fourth feature vector that are equal to the second threshold, and using the second value count as the target number of syllables corresponding to the second audio segment.
- The method according to claim 5, characterized in that determining the target singing time corresponding to the second audio segment includes: performing silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment; determining a target duration corresponding to the at least one non-silent segment; and using the target duration as the target singing time corresponding to the second audio segment.
- The method according to claim 6, characterized in that determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment includes: determining a target ratio of the target number of syllables to the target singing time; judging whether the target ratio falls within a set range; and if so, using the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- The method according to claim 7, characterized in that the method further includes: if the target ratio does not fall within the set range, judging whether the target ratio is greater than the maximum value of the set range; and if so, using the maximum value of the set range as the target number of syllables per unit time corresponding to the second audio segment.
- The method according to claim 7, characterized in that the method further includes: if the target ratio does not fall within the set range, judging whether the target ratio is less than the minimum value of the set range; and if so, using the minimum value of the set range as the target number of syllables per unit time corresponding to the second audio segment.
- An apparatus for calculating the number of syllables per unit time, characterized in that the apparatus includes: an acquiring unit configured to acquire a first audio segment including vocals and background music; an executing unit configured to perform vocal separation on the first audio segment to obtain a second audio segment including only vocals; a processing unit configured to input the second audio segment into a trained neural network model for processing and output a first feature vector, where the trained neural network model is used to extract feature vectors of vocal audio segments; a first determining unit configured to determine, based on the first feature vector, a target number of syllables corresponding to the second audio segment; a second determining unit configured to determine a target singing time corresponding to the second audio segment; and a third determining unit configured to determine, based on the target number of syllables and the target singing time, a target number of syllables per unit time corresponding to the second audio segment.
- The apparatus according to claim 10, characterized in that the apparatus further includes: a training unit configured to: acquire N pieces of training sample data, where N is an integer greater than 1; input the i-th piece of training sample data into an initial neural network model for forward training and output a prediction result, construct a neural network loss function based on the prediction result, and perform backward training on the initial neural network model based on the neural network loss function to obtain a once-trained neural network model, where the i-th piece of training sample data is any one of the N pieces of training sample data; perform the same operations on the (N-1) pieces of training sample data other than the i-th piece among the N pieces of training sample data to obtain an N-times-trained neural network model; and use the N-times-trained neural network model as the trained neural network model.
- The apparatus according to claim 10, characterized in that the apparatus further includes: a fourth determining unit configured to determine the duration of the second audio segment; a judging unit configured to judge whether the duration of the second audio segment is greater than or equal to a target duration; and a triggering unit configured to, if the judging unit determines that the duration of the second audio segment is greater than or equal to the target duration, trigger the operation of inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector.
- The apparatus according to claim 11 or 12, characterized in that the trained neural network model includes M network layers, the M network layers include a fully connected layer, and M is an integer greater than 1, and in terms of inputting the second audio segment into the trained neural network model for processing and outputting the first feature vector, the processing unit is specifically configured to: perform audio feature extraction on the second audio segment to obtain a target audio feature; input the target audio feature into the i-th network layer for processing, and output an output data set corresponding to the i-th network layer; input the output data set corresponding to the i-th network layer into the (i+1)-th network layer for processing, and output an output data set corresponding to the (i+1)-th network layer; repeat until i = (M-1) to obtain the output data set corresponding to the (M-1)-th network layer, where i is an integer that starts at 1 and increments in steps of 1; and input the output data set corresponding to the (M-1)-th network layer into the fully connected layer for processing, and output the first feature vector.
- The apparatus according to claim 13, characterized in that, in terms of determining, based on the first feature vector, the target number of syllables corresponding to the second audio segment, the first determining unit is specifically configured to: binarize the first feature vector to obtain a second feature vector, where each value in the second feature vector is either a first threshold or a second threshold, and the first threshold is smaller than the second threshold; if at least one first target value exists in the second feature vector, set each of the at least one first target value to the first threshold to obtain a third feature vector, where the number of first values between each first target value and its nearest second target value is greater than or equal to a third threshold, and the first target values and the second target values are all equal to the second threshold; if at least one target value group exists in the third feature vector, where each target value group includes two adjacent third target values, each third target value is equal to the second threshold, and each third target value corresponds to a time instant, determine the time difference corresponding to each target value group; if the time difference corresponding to a target value group is less than or equal to a set duration, set either one of the third target values in that target value group to the first threshold to obtain a fourth feature vector; and determine a second value count, namely the number of values in the fourth feature vector that are equal to the second threshold, and use the second value count as the target number of syllables corresponding to the second audio segment.
- The apparatus according to claim 14, characterized in that, in terms of determining the target singing time corresponding to the second audio segment, the second determining unit is specifically configured to: perform silence detection on the second audio segment to obtain at least one silent segment and at least one non-silent segment included in the second audio segment; determine a target duration corresponding to the at least one non-silent segment; and use the target duration as the target singing time corresponding to the second audio segment.
- The apparatus according to claim 15, characterized in that, in terms of determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment, the third determining unit is specifically configured to: determine a target ratio of the target number of syllables to the target singing time; judge whether the target ratio falls within a set range; and if so, use the target ratio as the target number of syllables per unit time corresponding to the second audio segment.
- The apparatus according to claim 16, characterized in that, in terms of determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment, the third determining unit is specifically configured to: if the target ratio does not fall within the set range, judge whether the target ratio is greater than the maximum value of the set range; and if so, use the maximum value of the set range as the target number of syllables per unit time corresponding to the second audio segment.
- The apparatus according to claim 16, characterized in that, in terms of determining, based on the target number of syllables and the target singing time, the target number of syllables per unit time corresponding to the second audio segment, the third determining unit is specifically configured to: if the target ratio does not fall within the set range, judge whether the target ratio is less than the minimum value of the set range; and if so, use the minimum value of the set range as the target number of syllables per unit time corresponding to the second audio segment.
- An electronic device, characterized in that it includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing some or all of the steps of the method according to any one of claims 1-9.
- A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1-9.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910288833.5 | 2019-04-11 | ||
CN201910288833.5A CN110033782B (zh) | 2019-04-11 | 2019-04-11 | Method for calculating the number of syllables per unit time and related apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020206975A1 true WO2020206975A1 (zh) | 2020-10-15 |
Family
ID=67238051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/112242 WO2020206975A1 (zh) | Method for calculating the number of syllables per unit time and related apparatus | 2019-04-11 | 2019-10-21 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110033782B (zh) |
WO (1) | WO2020206975A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033782B (zh) * | 2019-04-11 | 2021-08-24 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Method for calculating the number of syllables per unit time and related apparatus |
CN113450823B (zh) * | 2020-03-24 | 2022-10-28 | Hisense Visual Technology Co., Ltd. | Audio-based scene recognition method, apparatus, device and storage medium |
CN113607269B (zh) * | 2021-02-02 | 2023-12-15 | Shenzhen Grandsun Electronic Co., Ltd. | Sound dose determination method, apparatus, electronic device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011180429A (ja) * | 2010-03-02 | 2011-09-15 | Brother Industries Ltd | Lyric syllable count presentation device and program |
JP2011180428A (ja) * | 2010-03-02 | 2011-09-15 | Brother Industries Ltd | Lyric syllable count presentation device and program |
CN105023573A (zh) * | 2011-04-01 | 2015-11-04 | Sony Computer Entertainment Inc. | Speech syllable/vowel/phoneme boundary detection using auditory attention cues |
JP2016050985A (ja) | 2014-08-29 | 2016-04-11 | Yamaha Corporation | Performance information editing device |
CN107785011A (zh) * | 2017-09-15 | 2018-03-09 | Beijing Institute of Technology | Training of a speech rate estimation model, and speech rate estimation method, apparatus, device and medium |
CN109584905A (zh) * | 2019-01-22 | 2019-04-05 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Method for measuring music tempo, terminal and computer-readable medium |
CN110033782A (zh) * | 2019-04-11 | 2019-07-19 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Method for calculating the number of syllables per unit time and related apparatus |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10134424B2 (en) * | 2015-06-25 | 2018-11-20 | VersaMe, Inc. | Wearable word counter |
US10433052B2 (en) * | 2016-07-16 | 2019-10-01 | Ron Zass | System and method for identifying speech prosody |
- 2019-04-11: CN application CN201910288833.5A filed; granted as CN110033782B (status: active)
- 2019-10-21: PCT application PCT/CN2019/112242 filed as WO2020206975A1 (status: application filing)
Also Published As
Publication number | Publication date |
---|---|
CN110033782B (zh) | 2021-08-24 |
CN110033782A (zh) | 2019-07-19 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19924386; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.02.2022)
| 122 | EP: PCT application non-entry in European phase | Ref document number: 19924386; Country of ref document: EP; Kind code of ref document: A1