US10410615B2 - Audio information processing method and apparatus - Google Patents


Info

Publication number
US10410615B2
US10410615B2 (application US15/762,841, also published as US201715762841A)
Authority
US
United States
Prior art keywords
audio
sound channel
energy value
attribute
subfile
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US15/762,841
Other versions
US20180293969A1 (en)
Inventor
Weifeng Zhao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED. Assignment of assignors interest (see document for details). Assignor: ZHAO, WEIFENG
Publication of US20180293969A1
Application granted
Publication of US10410615B2
Legal status: Active
Anticipated expiration


Classifications

    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L19/02 Speech or audio signal analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/087 Determination or coding of the excitation function or of the long-term prediction parameters using mixed excitation models, e.g. MELP, MBE, split band LPC or HVXC
    • G10L25/12 Speech or voice analysis characterised by the extracted parameters being prediction coefficients
    • G10L25/18 Speech or voice analysis characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis characterised by the extracted parameters being power information
    • G10L25/30 Speech or voice analysis using neural networks
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/125 Circuits for establishing the harmonic content of tones, or changing the tone colour, by filtering complex waveforms using a digital filter
    • G10H2210/005 Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody
    • G10H2210/041 Musical analysis based on MFCC (mel-frequency cepstral coefficients)
    • G10H2210/056 Musical analysis for extraction or identification of individual instrumental parts, e.g. melody, chords, bass
    • G10H2230/025 Computing or signal processing architecture features
    • G10H2250/071 All-pole filter, i.e. autoregressive (AR) filter
    • G10H2250/275 Gaussian window
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. musical recognition, automatic composition or improvisation

Definitions

  • the present application relates to information processing technology, and in particular to an audio information processing method and apparatus.
  • Audio files with an accompaniment function generally have two sound channels: an original sound channel (containing both accompaniment and human voice) and an accompanying sound channel, between which a user switches when singing karaoke. Since there is no fixed standard, audio files acquired from different sources come in different versions: in some files the first sound channel carries the accompaniment, while in others the second sound channel does. It is therefore not possible to tell which sound channel is the accompanying sound channel after such audio files are acquired. Generally, the audio files may be put into use only after being adjusted to a uniform format, either by manual recognition or by automatic resolution by equipment.
  • a method comprising decoding a first audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel; extracting first audio data from the first audio subfile; extracting second audio data from the second audio subfile; acquiring a first audio energy value of the first audio data; acquiring a second audio energy value of the second audio data; and determining an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
  • an apparatus comprising at least one memory configured to store computer program code; and at least one processor configured to access the at least one memory and operate according to the computer program code, said computer program code including decoding code configured to cause at least one of the at least one processor to decode an audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel; extracting code configured to cause at least one of the at least one processor to extract first audio data from the first audio subfile and second audio data from the second audio subfile; acquisition code configured to cause at least one of the at least one processor to acquire a first audio energy value of the first audio data and a second audio energy value of the second audio data; and processing code configured to cause at least one of the at least one processor to determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
  • a non-transitory computer-readable storage medium that stores computer program code that, when executed by a processor of a calculating apparatus, causes the calculating apparatus to execute a method comprising decoding an audio file to acquire a first audio subfile outputted corresponding to a first sound channel and a second audio subfile outputted corresponding to a second sound channel; extracting first audio data from the first audio subfile; extracting second audio data from the second audio subfile; acquiring a first audio energy value of the first audio data; acquiring a second audio energy value of the second audio data; and determining the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
  • FIG. 1 is a schematic diagram of dual-channel music to be distinguished.
  • FIG. 2 is a flow diagram of an audio information processing method according to an exemplary embodiment.
  • FIG. 3 is a flow diagram of a method to obtain a Deep Neural Networks (DNN) model through training according to an exemplary embodiment.
  • FIG. 4 is a schematic diagram of the DNN model according to an exemplary embodiment.
  • FIG. 5 is a flow diagram of an audio information processing method according to an exemplary embodiment.
  • FIG. 6 is a flow diagram of Perceptual Linear Predictive (PLP) parameter extraction according to an exemplary embodiment.
  • FIG. 7 is a flow diagram of an audio information processing method according to an exemplary embodiment.
  • FIG. 8 is a schematic diagram of an a cappella data extraction process according to an exemplary embodiment.
  • FIG. 9 is a flow diagram of an audio information processing method according to an exemplary embodiment.
  • FIG. 10 is a structural diagram of an audio information processing apparatus according to an exemplary embodiment.
  • FIG. 11 is a structural diagram of a hardware composition of an audio information processing apparatus according to an exemplary embodiment.
  • Exemplary embodiments acquire the corresponding first audio subfile and second audio subfile by dual-channel decoding of the audio file, then extract the first audio data and the second audio data (which may have the same attribute), and finally determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value, so as to determine a sound channel that meets particular attribute requirements.
  • In this way, the accompanying sound channel and the original sound channel of the audio file may be distinguished efficiently and accurately, solving the problems of high labor cost and low efficiency of manual resolution, and of low accuracy of automatic resolution by equipment.
  • An audio information processing method may be achieved through software, hardware, firmware or a combination thereof.
  • the software may be, for example, the WeSing software; that is, the audio information processing method provided by the present application may be used, for example, in the WeSing software.
  • Exemplary embodiments may be applied to distinguish the corresponding accompanying sound channel of the audio file automatically, quickly and accurately based on machine learning.
  • Exemplary embodiments decode an audio file to acquire a first audio subfile outputted corresponding to the first sound channel and a second audio subfile outputted corresponding to a second sound channel; extract first audio data from the first audio subfile and second audio data from the second audio subfile; acquire a first audio energy value of the first audio data and a second audio energy value of the second audio data; and determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value so as to determine a sound channel that meets particular attribute requirements.
  • FIG. 2 is a flow diagram of the audio information processing method according to an exemplary embodiment. As shown in FIG. 2, the audio information processing method according to an exemplary embodiment may include the following steps:
  • Step S201: Decode the audio file to acquire the first audio subfile outputted corresponding to the first sound channel and the second audio subfile outputted corresponding to the second sound channel.
  • the audio file herein may be any music file whose accompanying/original sound channels are to be distinguished.
  • the first sound channel and the second sound channel may be the left channel and the right channel respectively, and correspondingly, the first audio subfile and the second audio subfile may be the accompanying file and the original file corresponding to the first audio file respectively.
  • a song is decoded to acquire the accompanying file or original file representing the left channel output and the original file or accompanying file representing the right channel output.
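As an illustrative sketch of this decoding step (a hypothetical helper, not the patent's implementation): once a song has been decoded to interleaved stereo PCM, the two per-channel streams can be separated as follows. The function name and the use of NumPy are assumptions.

```python
import numpy as np

def split_channels(stereo_pcm: np.ndarray):
    """Split interleaved 16-bit stereo PCM samples into two mono channels.

    stereo_pcm: 1-D int16 array [L0, R0, L1, R1, ...] as produced by a decoder.
    Returns (first_channel, second_channel), e.g. left and right.
    """
    first = stereo_pcm[0::2]   # samples of the first sound channel
    second = stereo_pcm[1::2]  # samples of the second sound channel
    return first, second

# Tiny demonstration with synthetic samples.
pcm = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)
left, right = split_channels(pcm)
```

Applying such a split to the decoded song yields the two subfiles that the following steps operate on independently.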
  • Step S202: Extract the first audio data from the first audio subfile and the second audio data from the second audio subfile.
  • the first audio data and the second audio data may have, or represent, the same attribute. If both are human-voice audio, the human-voice audio is extracted from the first audio subfile and from the second audio subfile.
  • the specific human-voice extraction method may be any method that may be used to extract human-voice audios from the audio files.
  • a Deep Neural Networks (DNN) model may be trained to extract human-voice audio from the audio files. For example, when the first audio file is a song, if the first audio subfile is an accompanying audio file and the second audio subfile is an original audio file, the DNN model is used to extract the human-voice accompanying data from the accompanying audio file and the a cappella data from the original audio file.
  • Step S203: Acquire the first audio energy value of the first audio data and the second audio energy value of the second audio data.
  • the first audio energy value may be calculated from the first audio data and the second audio energy value may be calculated from the second audio data.
  • the first audio energy value may be the average audio energy value of the first audio data
  • the second audio energy value may be the average audio energy value of the second audio data.
  • different methods may be used to acquire the average audio energy value corresponding to the audio data.
  • the audio data may be composed of multiple sampling points, and each sampling point may generally correspond to a value between 0 and 32767, and the average value of all sampling point values may be taken as the average audio energy value corresponding to the audio data. In this way, the average value of all sampling points of the first audio data may be taken as the first audio energy value, and the average value of all sampling points of the second audio data may be taken as the second audio energy value.
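The averaging just described can be sketched as follows (a hypothetical illustration; the patent gives no code, and taking the mean of absolute sample magnitudes is an assumption consistent with the stated 0 to 32767 range):

```python
import numpy as np

def average_audio_energy(samples: np.ndarray) -> float:
    """Average audio energy of one channel: the mean of the absolute
    values of its 16-bit sampling points (each magnitude in 0..32767)."""
    return float(np.mean(np.abs(samples.astype(np.int64))))

# e.g. computing the first (or second) audio energy value of a channel:
energy = average_audio_energy(np.array([0, 100, -200], dtype=np.int16))
```

Running this over the first and second audio data yields the two average energy values compared in the next step.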
  • Step S204: Determine the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
  • the sound channel that meets the particular attribute requirements may be whichever of the first sound channel and the second sound channel outputs the accompanying audio of the first audio file.
  • in other words, it may be whichever of the left and right channels outputs the accompaniment corresponding to the song.
  • the difference between the first audio energy value and the second audio energy value may be computed. If the difference is greater than the threshold and the first audio energy value is less than the second audio energy value, the attribute of the first sound channel is determined as the first attribute and the attribute of the second sound channel as the second attribute; that is, the first sound channel is determined as the sound channel outputting the accompanying audio and the second sound channel as the sound channel outputting the original audio.
  • conversely, if the difference between the first audio energy value and the second audio energy value is greater than the threshold and the second audio energy value is less than the first audio energy value, the attribute of the second sound channel is determined as the first attribute and the attribute of the first sound channel as the second attribute; that is, the second sound channel is determined as the sound channel outputting the accompanying audio and the first sound channel as the sound channel outputting the original audio.
  • the audio subfile corresponding to the smaller audio energy value may thus be determined as the audio file that meets the particular attribute requirements (i.e. the accompanying file), and the sound channel corresponding to that audio subfile as the sound channel that meets the particular requirements (i.e. the sound channel that outputs the accompaniment).
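The decision rule above can be condensed into a small sketch (hypothetical; the function name is invented, and `None` stands for the undecided case that the text handles next via frequency-spectrum analysis):

```python
from typing import Optional

def accompanying_channel(e1: float, e2: float, threshold: float) -> Optional[str]:
    """Pick the accompanying sound channel from the two average energies.

    The lower-energy channel is taken as the accompaniment, but only when
    the energy gap exceeds the audio energy difference threshold.
    """
    if abs(e1 - e2) > threshold:
        return "first" if e1 < e2 else "second"
    return None  # gap too small: fall back to frequency-spectrum analysis

# With the example threshold of 486 mentioned later in the text:
choice = accompanying_channel(120.0, 900.0, threshold=486.0)
```

The design keeps the ambiguous case explicit, since an accompaniment with many backing vocals can have energy close to the original track.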
  • if the difference between the first audio energy value and the second audio energy value is not greater than the audio energy difference threshold, the accompanying audio file may in practice contain substantial human-voice accompaniment (e.g. backing vocals).
  • even so, the frequency spectrum characteristics of accompanying audio and a cappella audio still differ, so human-voice accompanying data may be distinguished from a cappella data according to their frequency spectrum characteristics.
  • the accompanying data may then be determined based on the principle that the average audio energy of accompanying data is less than that of a cappella data, yielding the result that the sound channel corresponding to the accompanying data is the sound channel that meets the particular attribute requirements.
  • FIG. 3 is a flow diagram of the method to obtain the DNN model through training according to an exemplary embodiment. As shown in FIG. 3, the method may include the following steps:
  • Step S301: Decode the audio in each of the multiple predetermined audio files to acquire the corresponding multiple Pulse Code Modulation (PCM) audio files.
  • the multiple predetermined audio files may be N original songs and their N corresponding a cappella versions selected from a song library of WeSing.
  • N may be a positive integer, and may be greater than 2,000 for the follow-up training.
  • There are tens of thousands of songs with both original and high-quality a cappella data (the a cappella data is mainly selected by a free scoring system, i.e. a cappella tracks with higher scores are chosen). All such songs may be collected, and 10,000 of them randomly selected for the follow-up operations (the selection mainly balances the complexity and accuracy of the follow-up training).
  • Step S302: Extract the frequency spectrum features from the obtained multiple PCM audio files.
  • Step S303: Train on the extracted frequency spectrum features using the error Back-Propagation (BP) algorithm to obtain the DNN model.
  • four matrices are obtained: a 2827×2048 matrix, two 2048×2048 matrices, and a 2048×257 matrix.
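For illustration, the four trained matrices can be applied as a plain feed-forward pass (a sketch under assumptions: random placeholder weights stand in for the trained values, and the sigmoid hidden-layer activation is an assumption, since the text does not specify one):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights with the stated dimensions; the real values come
# from BP training on the frequency spectrum features.
shapes = [(2827, 2048), (2048, 2048), (2048, 2048), (2048, 257)]
weights = [rng.standard_normal(s) * 0.01 for s in shapes]

def dnn_forward(x: np.ndarray) -> np.ndarray:
    """Map one 2827-dim input (11 stacked 257-dim spectra) to a 257-dim
    output spectrum by multiplying through the four matrices in turn."""
    for w in weights[:-1]:
        x = 1.0 / (1.0 + np.exp(-(x @ w)))  # assumed sigmoid hidden layers
    return x @ weights[-1]

out = dnn_forward(rng.standard_normal(2827))
```

The matrix chain fixes the input at 2827 dimensions and the output at 257 dimensions, matching the feature sizes described elsewhere in the text.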
  • FIG. 5 is a flow diagram of the audio information processing method according to an exemplary embodiment. As shown in FIG. 5, the audio information processing method may include the following steps:
  • Step S501: Decode the audio file to acquire the first audio subfile outputted corresponding to the first sound channel and the second audio subfile outputted corresponding to the second sound channel.
  • the audio file herein may be any music file whose accompanying/original sound channels are to be distinguished. If the audio file is a song whose accompanying/original sound channels are to be distinguished, then the first sound channel and the second sound channel may be the left channel and the right channel respectively, and correspondingly, the first audio subfile and the second audio subfile may be the accompanying file and the original file corresponding to the first audio file, respectively.
  • when the first audio file is a song, in Step S501 the song is decoded to acquire the accompanying file or original file of the song outputted by the left channel and the original file or accompanying file of the song outputted by the right channel.
  • Step S502: Extract the first audio data from the first audio subfile and the second audio data from the second audio subfile by using the predetermined DNN model.
  • the predetermined DNN model may be the DNN model obtained through training in advance with the BP algorithm in exemplary embodiment 2 described above, or a DNN model obtained through other methods;
  • the first audio data and the second audio data may have, or represent, the same attribute. If both are human-voice audio, the human-voice audio is extracted from the first audio subfile and from the second audio subfile by using the DNN model obtained through training in advance. For example, when the first audio file is a song, if the first audio subfile is an accompanying audio file and the second audio subfile is an original audio file, the DNN model is used to extract the human-voice accompanying data from the accompanying audio file and the a cappella data from the original audio file.
  • the process of extracting the a cappella data by using the DNN model obtained through training may include the following steps:
  • use the method provided in step S302 of exemplary embodiment 2 to extract the frequency spectrum features.
  • each frame feature is extended by 5 frames forward and 5 frames backward to obtain an 11×257 dimensional feature (this is not done for the first 5 and last 5 frames of the audio file). The input feature is multiplied by the matrix in each layer of the DNN model obtained through training in embodiment 2 to finally obtain a 257 dimensional output feature per frame, giving an m−10 frame output feature.
  • the first frame is then extended 5 frames forward and the last frame 5 frames backward to obtain an m frame output result;
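The context-extension step can be sketched as follows (a hypothetical helper; with m frames of 257-dim spectra and 5 frames of context on each side, m − 10 stacked 2827-dim inputs result, matching the text):

```python
import numpy as np

def stack_context(frames: np.ndarray, context: int = 5) -> np.ndarray:
    """Extend each 257-dim frame feature with `context` frames on each
    side, yielding (2*context+1)*257 = 2827-dim inputs; the first and
    last `context` frames are skipped, as described in the text."""
    m, dim = frames.shape
    return np.stack([
        frames[i - context : i + context + 1].reshape(-1)
        for i in range(context, m - context)
    ])

# 20 input frames of 257-dim spectra yield 20 - 10 stacked features.
stacked = stack_context(np.zeros((20, 257)))
```

Each stacked row is exactly the 2827-dim input that the four-matrix DNN described above consumes.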
  • in the corresponding formula, i denotes the 512 dimensions, j denotes the frequency band corresponding to i (257 bands in total), one j may correspond to one or two values of i, and the variables z and t correspond to z i and t i obtained in step 2) respectively;
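The context-window framing described above (11*257-dimensional inputs built from 5 frames of context on each side, with the first and last 5 output frames recovered by edge repetition) can be sketched as follows. This is an illustrative sketch, not the patented implementation; the function names and the use of NumPy are assumptions.

```python
import numpy as np

def window_features(frames, ctx=5):
    # frames: (m, d) per-frame spectral features (d = 257 in the text).
    # Returns (m - 2*ctx, (2*ctx + 1) * d) DNN input features; the first
    # and last ctx frames are skipped, as the description states.
    m = frames.shape[0]
    return np.stack([frames[i - ctx:i + ctx + 1].reshape(-1)
                     for i in range(ctx, m - ctx)])

def pad_outputs(outputs, ctx=5):
    # Extend the (m - 2*ctx)-frame DNN output back to m frames by
    # repeating the first and last output frames.
    return np.concatenate([np.repeat(outputs[:1], ctx, axis=0),
                           outputs,
                           np.repeat(outputs[-1:], ctx, axis=0)])
```

For an audio file of m frames, `window_features` produces the m−10 DNN inputs and `pad_outputs` restores an m-frame result, matching the two bullets above.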
  • Step S 503 Acquire the first audio energy value of the first audio data and the second audio energy value of the second audio data.
  • the first audio energy value may be calculated from the first audio data
  • the second audio energy value may be calculated from the second audio data.
  • the first audio energy value may be the average audio energy value of the first audio data
  • the second audio energy value may be the average audio energy value of the second audio data.
  • different methods may be used to acquire the average audio energy value corresponding to the audio data.
  • the audio data is composed of multiple sampling points, each generally corresponding to a value between 0 and 32767; the average of all sampling point values is taken as the average audio energy value corresponding to the audio data.
  • the average value of all sampling points of the first audio data may be taken as the first audio energy value
  • the average value of all sampling points of the second audio data may be taken as the second audio energy value.
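The averaging described above can be sketched briefly. The text treats each sampling point as a value between 0 and 32767; taking magnitudes of signed 16-bit PCM samples before averaging is an assumption of this sketch, as is the function name.

```python
import numpy as np

def average_energy(samples):
    # samples: array of 16-bit PCM sampling-point values; magnitudes are
    # taken so each point falls in the 0..32767 range mentioned above.
    return float(np.mean(np.abs(samples.astype(np.int64))))
```

Applying this to the first audio data and the second audio data yields the first and second audio energy values respectively.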
  • Step S 504 Determine whether the difference value between the first audio energy value and the second audio energy value is greater than the predetermined threshold or not. If yes, proceed to step S 505 ; otherwise, proceed to step S 506 .
  • a threshold (i.e. an audio energy difference threshold) may be predetermined. Specifically, the threshold may be set experimentally according to the actual use; for example, the threshold may be set as 486. If the difference value between the first audio energy value and the second audio energy value is greater than the audio energy difference threshold, the sound channel whose audio energy value is smaller is determined as the accompanying sound channel.
  • Step S 505 if the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute, and if the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute.
  • Compare the first audio energy value and the second audio energy value. If the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is, determine the first sound channel as the sound channel outputting accompanying audios and the second sound channel as the sound channel outputting original audios. If the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is, determine the second sound channel as the sound channel outputting accompanying audios and the first sound channel as the sound channel outputting original audios.
  • the audio subfile corresponding to the smaller audio energy value may be determined as the audio file that meets the particular attribute requirements, and the sound channel corresponding to that audio subfile as the sound channel that meets the particular requirements.
  • the audio file that meets the particular attribute requirements is the accompanying audio file corresponding to the first audio file
  • the sound channel that meets the particular requirements is the sound channel where the outputted audio of the first audio file is the accompanying audio in the first sound channel and the second sound channel.
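The decision in steps S504-S505 can be sketched as a small function. The attribute labels, the function name and the default threshold of 486 (taken from the example above) are illustrative assumptions.

```python
FIRST_ATTRIBUTE = "accompaniment"   # sound channel outputting accompanying audio
SECOND_ATTRIBUTE = "original"       # sound channel outputting original audio

def classify_by_energy(e1, e2, threshold=486):
    # e1, e2: vocal-energy values of the first and second sound channels.
    # Returns (first_channel_attr, second_channel_attr), or None when the
    # difference does not exceed the threshold and the GMM fallback of
    # step S506 is needed.
    if abs(e1 - e2) <= threshold:
        return None
    if e1 < e2:
        return (FIRST_ATTRIBUTE, SECOND_ATTRIBUTE)
    return (SECOND_ATTRIBUTE, FIRST_ATTRIBUTE)
```

The channel with the smaller vocal energy is labeled as the accompanying channel, mirroring the rule stated above.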
  • Step S 506 Assign an attribute to the first sound channel and/or the second sound channel by using the predetermined GMM.
  • the predetermined GMM is obtained through in-advance training; the training process includes extracting the Perceptual Linear Predictive (PLP) characteristic parameters of multiple audio files and obtaining the GMM through training by using the Expectation Maximization (EM) algorithm;
  • the determined sound channel is the sound channel that preliminarily meets the particular attribute requirements.
  • Step S 507 Determine the first audio energy value and the second audio energy value. If the first attribute is assigned to the first sound channel and the first audio energy value is less than the second audio energy value, or the first attribute is assigned to the second sound channel and the second audio energy value is less than the first audio energy value, proceed to step S 508 ; otherwise proceed to step S 509 .
  • step S 507 determines whether the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is less than the audio energy value corresponding to the other sound channel or not. If yes, proceed to step S 508 ; otherwise proceed to step S 509 .
  • the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is exactly the audio energy value of the audio file outputted by the sound channel.
  • Step S 508 If the first attribute is assigned to the first sound channel and the first audio energy value is less than the second audio energy value, determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is to determine the first sound channel as the sound channel outputting accompanying audio and the second sound channel as the sound channel outputting original audio. If the first attribute is assigned to the second sound channel and the second audio energy value is less than the first audio energy value, determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is to determine the second sound channel as the sound channel outputting accompanying audio and the first sound channel as the sound channel outputting original audio.
  • the sound channel that preliminarily meets the particular attribute requirements may be determined as the sound channel that meets the particular attribute requirements which is the sound channel outputting accompanying audio.
  • the method may further include the following steps after Step S 508 :
  • the sound channel that meets the particular attribute requirements may be the sound channel outputting accompanying audio.
  • the sound channel outputting accompanying audio such as the first sound channel
  • the sound channel is labeled as the accompanying audio sound channel.
  • a user may switch between accompaniments and originals based on the labeled sound channel when the user is singing karaoke;
  • Step S 509 Output the prompt message.
  • the prompt message may be used to prompt the user that the sound channel outputting accompanying audio of the first audio file cannot be distinguished, so that the user can manually confirm which sound channel outputs the accompanying audio.
  • the attributes of the first sound channel and the second sound channel need to be confirmed artificially.
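The cross-check of steps S507-S509 above can be sketched as follows; the function name and the channel-number convention are assumptions of this sketch.

```python
def confirm_gmm_pick(gmm_pick, e1, e2):
    # gmm_pick: 1 or 2, the channel the GMM preliminarily labels as the
    # accompanying channel; e1, e2: vocal-energy values of the channels.
    # Returns the confirmed channel number, or None when the two cues
    # disagree and manual confirmation must be prompted.
    if gmm_pick == 1 and e1 < e2:
        return 1  # step S508: GMM label agrees with the energy comparison
    if gmm_pick == 2 and e2 < e1:
        return 2
    return None   # step S509: output prompt message for manual confirmation
```

The GMM's preliminary label is accepted only when the picked channel also has the smaller vocal energy, as steps S507-S508 require.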
  • In this way, the human-voice component is extracted from the music by using the trained DNN model, and the final classification result is then obtained through comparison of dual-channel human-voice energy.
  • the accuracy of the final classification may reach 99% or above.
  • FIG. 7 is a flow diagram of an audio information processing method according to an exemplary embodiment. As shown in FIG. 7 , the audio information processing method according to an exemplary embodiment may include the following steps:
  • Step S 701 Extract the dual-channel a cappella data (and/or human-voice accompanying data) of the music to be detected by using the DNN model trained in advance.
  • FIG. 8 A specific process of extracting the a cappella data is shown in FIG. 8 .
  • Step S 702 Calculate the average audio energy value of the extracted dual-channel a cappella (and/or human-voice accompanying) data respectively.
  • Step S 703 Determine whether the audio energy difference value of the dual-channel a cappella (and/or human-voice accompanying) data is greater than the predetermined threshold or not. If yes, proceed to step S 704 ; otherwise, proceed to step S 705 .
  • Step S 704 Determine the sound channel corresponding to the a cappella (and/or human-voice accompanying) data with a smaller average audio energy value as the accompanying sound channel.
  • Step S 705 Classify the music to be detected with dual-channel output by using the GMM trained in advance.
  • Step S 706 Determine whether the audio energy value corresponding to the sound channel that is classified as accompanying audio is smaller or not. If yes, proceed to step S 707 ; otherwise, proceed to step S 708 .
  • Step S 707 Determine the sound channel with a smaller audio energy value as the accompanying sound channel.
  • Step S 708 Output the prompt message to request manual confirmation.
  • the dual-channel a cappella (and/or human-voice accompanying) data may be extracted while the accompanying audio sound channel is determined by using the GMM, and then a regression function is used to execute the above steps 703 - 708 .
  • the operations in step S 705 have been executed in advance, so such operations may be skipped when the regression function is used, as shown in FIG. 9 .
  • As shown in FIG. 9 , dual-channel decoding is conducted on the music to be classified (i.e. the music to be detected).
  • use the a cappella training data to obtain the DNN model through training and use the accompanying human-voice training data to obtain the GMM model through training.
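The flow of steps S701-S708 above can be sketched as one function. Here `extract_vocals` and `gmm_pick_channel` are hypothetical callables standing in for the trained DNN and GMM models, and the default threshold follows the 486 example given earlier; none of these names come from the patent itself.

```python
import numpy as np

def detect_accompaniment_channel(left, right, extract_vocals, gmm_pick_channel,
                                 threshold=486):
    # left/right: PCM sample arrays from dual-channel decoding.
    # Returns "left" or "right" for the accompanying channel, or None
    # when manual confirmation is required.
    v_left, v_right = extract_vocals(left), extract_vocals(right)   # S701
    e_left = float(np.mean(np.abs(v_left)))                          # S702
    e_right = float(np.mean(np.abs(v_right)))
    if abs(e_left - e_right) > threshold:                            # S703-S704
        return "left" if e_left < e_right else "right"
    pick = gmm_pick_channel(left, right)                             # S705
    if pick == "left" and e_left < e_right:                          # S706-S707
        return "left"
    if pick == "right" and e_right < e_left:
        return "right"
    return None                                                      # S708
```

The channel with less vocal energy wins outright when the gap is large; otherwise the GMM's pick is accepted only if the energy comparison agrees with it.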
  • FIG. 10 is a structural diagram of the composition of the audio information processing apparatus according to an exemplary embodiment.
  • the composition of the audio information processing apparatus according to an exemplary embodiment includes a decoding module 11 , an extracting module 12 , an acquisition module 13 and a processing module 14 ;
  • the decoding module 11 being configured to decode the audio file (i.e. the first audio file) to acquire the first audio subfile outputted corresponding to the first sound channel and the second audio subfile outputted corresponding to the second sound channel;
  • the extracting module 12 being configured to extract the first audio data from the first audio subfile and the second audio data from the second audio subfile;
  • the acquisition module 13 being configured to acquire the first audio energy value of the first audio data and the second audio energy value of the second audio data
  • the processing module 14 being configured to determine the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
  • the first audio data and the second audio data may have a same attribute.
  • the first audio data may correspond to the human-voice audio outputted by the first sound channel and the second audio data may correspond to the human-voice audio outputted by the second sound channel;
  • the processing module 14 may be configured to determine which one of the first sound channel and the second sound channel is the sound channel outputting accompanying audio based on the first audio energy value of the human-voice audio outputted by the first sound channel and the second audio energy value of the human-voice audio outputted by the second sound channel.
  • the apparatus may further comprise a first model training module 15 configured to extract the frequency spectrum features of the multiple predetermined audio files respectively;
  • the extracting module 12 may be further configured to extract the first audio data from the first audio subfile and the second audio data from the second audio subfile respectively by using the DNN model.
  • the processing module 14 may be configured to determine the difference value between the first audio energy value and the second audio energy value. If the difference value is greater than the threshold (e.g. an audio energy difference threshold) and the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is to determine the first sound channel as the sound channel outputting accompanying audio and the second sound channel as the sound channel outputting original audio.
  • Conversely, if the difference value between the first audio energy value and the second audio energy value is greater than the threshold and the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is to determine the second sound channel as the sound channel outputting accompanying audio and the first sound channel as the sound channel outputting original audio.
  • when the processing module 14 detects that the difference value between the first audio energy value and the second audio energy value is greater than the audio energy difference threshold, the first audio subfile or the second audio subfile corresponding to the smaller of the first audio energy value and the second audio energy value is determined as the audio file that meets the particular attribute requirements, and the sound channel corresponding to that audio subfile as the sound channel that meets the particular requirements;
  • otherwise, a classification method is used to assign an attribute to at least one of the first sound channel and the second sound channel, so as to preliminarily determine which one of the first sound channel and the second sound channel is the sound channel that meets the particular attribute requirements.
  • the apparatus may further comprise a second model training module 16 configured to extract the Perceptual Linear Predictive (PLP) characteristic parameters of multiple audio files and to obtain the Gaussian Mixture Model (GMM) through training by using the Expectation Maximization (EM) algorithm;
  • the processing module 14 may be further configured to assign an attribute to at least one of the first sound channel and the second sound channel by using the GMM obtained through training, so as to preliminarily determine the first sound channel or the second sound channel as the sound channel that preliminarily meets the particular attribute requirements.
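The EM training used for the GMM can be illustrated with a compact one-dimensional sketch. The patent trains on multi-dimensional PLP feature vectors; restricting to one dimension, and the function name and hyperparameters, are assumptions made to keep the example short.

```python
import numpy as np

def fit_gmm_em(x, k=2, iters=60, seed=0):
    # Fit a 1-D Gaussian Mixture Model with the EM algorithm.
    # x: 1-D array of feature values. Returns (weights, means, variances).
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        n = resp.sum(axis=0)
        w = n / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var
```

In practice a library implementation over full PLP vectors would replace this sketch, but the E-step/M-step alternation is the same.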
  • the processing module 14 may be configured to determine the first audio energy value and the second audio energy value, and to determine whether the first attribute is assigned to the first sound channel and the first audio energy value is less than the second audio energy value, or the first attribute is assigned to the second sound channel and the second audio energy value is less than the first audio energy value; this is also to preliminarily determine whether the audio energy value corresponding to the sound channel that meets the particular attribute requirements is less than the audio energy value corresponding to the other sound channel or not;
  • if the result shows that the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is less than the audio energy value corresponding to the other sound channel, determine the sound channel that preliminarily meets the particular attribute requirements as the sound channel that meets the particular attribute requirements.
  • the processing module 14 may be further configured to output a prompt message when the result shows that the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is not less than the audio energy value corresponding to the other sound channel.
  • the decoding module 11 , the extracting module 12 , the acquisition module 13 , the processing module 14 , the first model training module 15 and the second model training module 16 in the audio information processing apparatus may be achieved through a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC) in the apparatus.
  • FIG. 11 is a structural diagram of the hardware composition of the audio information processing apparatus according to an exemplary embodiment.
  • the apparatus S 11 is shown in FIG. 11 .
  • the apparatus S 11 may include a processor 111 , a storage medium 112 and at least one external communication interface 113 ; and the processor 111 , the storage medium 112 and the external communication interface 113 may be connected through a bus 114 .
  • the audio information processing apparatus may be a mobile phone, a desktop computer, a PC or an all-in-one machine.
  • the audio information processing method may also be achieved through the operations of a server.
  • the audio information processing apparatus may be a terminal or a server.
  • the audio information processing method according to an exemplary embodiment is not limited to being used in the terminal; instead, the audio information processing method may also be used in a server, such as a web server or a server corresponding to music application software (e.g. WeSing software).
  • the foregoing computer program code may be stored in a computer-readable storage medium, and a computer executes the steps of the above exemplary embodiments when executing the code; the foregoing storage medium may include a mobile storage device, a Random Access Memory (RAM), a Read-Only Memory (ROM), a disk, a disc or other media that can store program code.
  • the software functional module(s) may also be stored in a computer-readable storage medium.
  • the technical solution according to exemplary embodiments, essentially, or the part thereof contributing to the related technology, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to allow a computer device (which may be a personal computer, a server or a network device) to execute the whole or part of the method provided by each exemplary embodiment of the present application.
  • the foregoing storage medium includes a mobile storage device, a RAM, a ROM, a disk, a disc or other media that can store program code.


Abstract

An audio information processing method and apparatus are provided. The method includes decoding a first audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel; extracting first audio data from the first audio subfile; extracting second audio data from the second audio subfile; acquiring a first audio energy value of the first audio data; acquiring a second audio energy value of the second audio data; and determining an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.

Description

RELATED APPLICATION
This application is a National Stage entry of International Application No. PCT/CN2017/076939, filed on Mar. 16, 2017, which claims priority from Chinese Patent Application No. 201610157251.X, entitled “Audio Information Processing Method and Terminal” and filed with the Chinese Patent Office on Mar. 18, 2016, which is incorporated by reference in its entirety.
FIELD OF THE TECHNOLOGY
The present application relates to the information processing technology, and in particular to an audio information processing method and apparatus.
BACKGROUND OF THE DISCLOSURE
Audio files with an accompaniment function generally have two sound channels: an original sound channel (having accompaniments and human voices) and an accompanying sound channel, which are switched by a user when he or she is singing karaoke. Since there is no fixed standard, audio files acquired from different channels have different versions: the first sound channel of some audio files is the accompaniment, while in other audio files the second sound channel is the accompaniment. Thus, it is not possible to confirm which sound channel is the accompanying sound channel after these audio files are acquired. Generally, the audio files may be put into use only after being adjusted to a uniform format by artificial recognition or by being automatically resolved by equipment.
However, an artificial filtering method is low in efficiency and high in cost, and an equipment resolution method is low in accuracy because a large number of human-voice accompaniments exist in many accompanying audios. At present, there is no effective solution to the above problems.
SUMMARY
It may be an aspect to provide an audio information processing method and apparatus, which can distinguish the corresponding accompanying sound channel of an audio file efficiently and accurately.
According to an aspect of one or more exemplary embodiments, there is provided a method comprising decoding a first audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel; extracting first audio data from the first audio subfile; extracting second audio data from the second audio subfile; acquiring a first audio energy value of the first audio data; acquiring a second audio energy value of the second audio data; and determining an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
According to an aspect of one or more exemplary embodiments, there is provided an apparatus comprising at least one memory configured to store computer program code; and at least one processor configured to access the at least one memory and operate according to the computer program code, said computer program code including decoding code configured to cause at least one of the at least one processor to decode an audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel; extracting code configured to cause at least one of the at least one processor to extract first audio data from the first audio subfile and second audio data from the second audio subfile; acquisition code configured to cause at least one of the at least one processor to acquire a first audio energy value of the first audio data and a second audio energy value of the second audio data; and processing code configured to cause at least one of the at least one processor to determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
According to an aspect of one or more exemplary embodiments, there is provided a non-transitory computer-readable storage medium that stores computer program code that, when executed by a processor of a calculating apparatus, causes the calculating apparatus to execute a method comprising decoding an audio file to acquire a first audio subfile outputted corresponding to a first sound channel and a second audio subfile outputted corresponding to a second sound channel; extracting first audio data from the first audio subfile; extracting second audio data from the second audio subfile; acquiring a first audio energy value of the first audio data; acquiring a second audio energy value of the second audio data; and determining the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other aspects will become more apparent from the following description along with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of dual channel music to be distinguished;
FIG. 2 is a flow diagram of an audio information processing method according to an exemplary embodiment;
FIG. 3 is a flow diagram of a method to obtain a Deep Neural Networks (DNN) model through training according to an exemplary embodiment;
FIG. 4 is a schematic diagram of the DNN model according to an exemplary embodiment;
FIG. 5 is a flow diagram of an audio information processing method according to an exemplary embodiment;
FIG. 6 is a flow diagram of Perceptual Linear Predictive (PLP) parameter extraction according to an exemplary embodiment;
FIG. 7 is a flow diagram of an audio information processing method according to an exemplary embodiment;
FIG. 8 is a schematic diagram of an a cappella data extraction process according to an exemplary embodiment;
FIG. 9 is a flow diagram of an audio information processing method according to an exemplary embodiment;
FIG. 10 is a structural diagram of an audio information processing apparatus according to an exemplary embodiment; and
FIG. 11 is a structural diagram of a hardware composition of an audio information processing apparatus according to an exemplary embodiment.
DESCRIPTION OF EMBODIMENTS
In related art technology, automatically distinguishing a corresponding accompanying sound channel of an audio file by equipment is mainly realized through training a Support Vector Machine (SVM) model or a Gaussian Mixture Model (GMM). As shown in FIG. 1, the distribution gap of the dual-channel audio spectrum is small, and a large number of human-voice accompaniments exist in many accompanying audios; thus the resolution accuracy is not high.
Exemplary embodiments acquire the corresponding first audio subfile and second audio subfile by dual-channel decoding of the audio file, then extract the audio data including the first audio data and the second audio data (the first audio data and the second audio data may have a same attribute), and finally determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value, so as to determine a sound channel that meets particular attribute requirements. In this way, the corresponding accompanying sound channel and original sound channel of the audio file may be distinguished efficiently and accurately, thus solving the problems of the high cost and low efficiency of manual resolution and the low accuracy of automatic resolution by equipment.
An audio information processing method according to an exemplary embodiment may be achieved through software, hardware, firmware or a combination thereof. The software may be, for example, WeSing software; that is, the audio information processing method provided by the present application may be used, for example, in the WeSing software. Exemplary embodiments may be applied to distinguish the corresponding accompanying sound channel of the audio file automatically, quickly and accurately based on machine learning.
Exemplary embodiments decode an audio file to acquire a first audio subfile outputted corresponding to the first sound channel and a second audio subfile outputted corresponding to a second sound channel; extract first audio data from the first audio subfile and second audio data from the second audio subfile; acquire a first audio energy value of the first audio data and a second audio energy value of the second audio data; and determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value so as to determine a sound channel that meets particular attribute requirements.
The following further describes various exemplary embodiments in more detail with reference to the accompanying drawings.
Exemplary Embodiment 1
FIG. 2 is a flow diagram of the audio information processing method according to an exemplary embodiment. As shown in FIG. 2, the audio information processing method according to an exemplary embodiment may include the following steps:
Step S201: Decode the audio file to acquire the first audio subfile outputted corresponding to the first sound channel and the second audio subfile outputted corresponding to the second sound channel.
The audio file herein (also denoted as a first audio file) may be any music file whose accompanying/original sound channels are to be distinguished. The first sound channel and the second sound channel may be the left channel and the right channel respectively, and correspondingly, the first audio subfile and the second audio subfile may be the accompanying file and the original file corresponding to the first audio file respectively. For example, a song is decoded to acquire the accompanying file or original file representing the left channel output and the original file or accompanying file representing the right channel output.
Step S202: Extract the first audio data from the first audio subfile and the second audio data from the second audio subfile.
The first audio data and the second audio data may have the same attribute, or the two may represent the same attribute. If the two are both human-voice audios, then the human-voice audios are extracted from the first audio subfile and the second audio subfile. The specific human-voice extraction method may be any method that can extract human-voice audios from audio files. For example, during actual implementation, a Deep Neural Networks (DNN) model may be trained to extract human-voice audios from the audio files. When the first audio file is a song, if the first audio subfile is an accompanying audio file and the second audio subfile is an original audio file, then the DNN model is used to extract the human-voice accompanying data from the accompanying audio file and the a cappella data from the original audio file.
Step S203: Acquire the first audio energy value of the first audio data and the second audio energy value of the second audio data.
For example, the first audio energy value may be calculated from the first audio data and the second audio energy value may be calculated from the second audio data. The first audio energy value may be the average audio energy value of the first audio data, and the second audio energy value may be the average audio energy value of the second audio data. In practical application, different methods may be used to acquire the average audio energy value corresponding to the audio data. For example, the audio data may be composed of multiple sampling points, and each sampling point may generally correspond to a value between 0 and 32767, and the average value of all sampling point values may be taken as the average audio energy value corresponding to the audio data. In this way, the average value of all sampling points of the first audio data may be taken as the first audio energy value, and the average value of all sampling points of the second audio data may be taken as the second audio energy value.
Step S204: Determine the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
Determine the attribute of the first sound channel and/or the second sound channel based on the first audio energy value and the second audio energy value so as to determine a sound channel that meets particular attribute requirements, that is to determine which one of the first sound channel and the second sound channel is the sound channel that meets the particular attribute requirements. For example, determine that the first sound channel or the second sound channel is the sound channel that outputs accompanying audios based on the first audio energy value of the human-voice audio outputted by the first sound channel and the second audio energy value of the human-voice audio outputted by the second sound channel.
On the basis of the exemplary embodiment, in practical application, the sound channel that meets the particular attribute requirements may be the sound channel where the outputted audio of the first audio file is the accompanying audio in the first sound channel and the second sound channel. For example, for a song, the sound channel that meets the particular attribute requirements may be the sound channel outputting the accompaniment corresponding to the song in left and right channels.
In the process of determining the sound channel that meets the particular attribute requirements, specifically, for a song, if there are few human-voice accompaniments in the song, then correspondingly, the audio energy value corresponding to the accompanying file of the song will be small, while the audio energy value corresponding to the a cappella file of the song will be large. Therefore, a threshold (i.e. an audio energy difference threshold) may be used. The audio energy difference threshold may be predetermined; specifically, the threshold may be set experimentally according to the actual use. The difference value between the first audio energy value and the second audio energy value is determined. If the difference value is greater than the threshold and the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is to determine the first sound channel as the sound channel outputting accompanying audios and the second sound channel as the sound channel outputting original audios. On the contrary, if the difference value between the first audio energy value and the second audio energy value is greater than the threshold and the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is to determine the second sound channel as the sound channel outputting accompanying audios and the first sound channel as the sound channel outputting original audios.
In this way, if the difference value between the first audio energy value and the second audio energy value is greater than the audio energy difference threshold, then the first audio subfile or the second audio subfile corresponding to the first audio energy value or the second audio energy value (whichever is smaller) may be determined as the audio file (i.e. accompanying files) that meets the particular attribute requirements, and the sound channel corresponding to the audio subfile that meets the particular attribute requirements as the sound channel that meets the particular requirements (i.e. sound channel that outputs accompanying files).
If the difference value between the first audio energy value and the second audio energy value is not greater than the audio energy difference threshold, then there may be many human-voice accompaniments in the accompanying audio file in application. However, the frequency spectrum characteristics of accompanying audios and a cappella audios are still different, so human-voice accompanying data may be distinguished from a cappella data according to the frequency spectrum characteristics thereof. After the accompanying data is determined preliminarily, the accompanying data may be determined finally based on the principle that the average audio energy of the accompanying data is less than that of the a cappella data, and then the result that the sound channel corresponding to the accompanying data is the sound channel that meets the particular attribute requirements is obtained.
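The threshold comparison described above may be sketched as follows; the function and the use of `None` to signal the fallback case are illustrative assumptions, not part of the claimed method.

```python
def classify_by_energy(e_first: float, e_second: float, threshold: float):
    """Return which sound channel outputs the accompaniment when the
    energy gap is decisive, or None when the spectrum-based (GMM)
    fallback is needed because the gap is too small."""
    if abs(e_first - e_second) > threshold:
        # The smaller human-voice energy indicates the accompaniment channel.
        return "first" if e_first < e_second else "second"
    return None  # gap not greater than the threshold: fall back to GMM
```

For example, with a large gap the quieter channel is labelled directly, while a small gap defers the decision to the frequency-spectrum classification.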
Exemplary Embodiment 2
FIG. 3 is a flow diagram of the method to obtain the DNN model through training according to an exemplary embodiment. As shown in FIG. 3, the method to obtain the DNN model through training according to an exemplary embodiment may include the following steps:
Step S301: Decode the audios in the multiple predetermined audio files respectively to acquire the corresponding multiple Pulse Code Modulation (PCM) audio files.
Here the multiple predetermined audio files may be N original songs and corresponding N a cappella songs thereof selected from a song library of WeSing. N may be a positive integer and may be greater than 2,000 for the follow-up training. There have been tens of thousands of songs with both original and high-quality a cappella data (the a cappella data is mainly selected by a free scoring system, that is to select the a cappella data with a higher score), so all such songs may be collected, from which 10,000 songs may be randomly selected for follow-up operations (here the complexity and accuracy of the follow-up training are mainly considered for the selection).
All selected original files and corresponding a cappella files are decoded to acquire a pulse code modulation (PCM) audio file of 16 k/16 bit, that is to acquire 10,000 PCM original audios and corresponding 10,000 PCM a cappella audios. If xn1, n1∈(1˜10000) is used to represent the original audios and yn2, n2∈(1˜10000) represents the corresponding a cappella audios, then there may be a one-to-one correspondence between n1 and n2.
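As a hedged sketch of the decoding step, a command line for a decoder such as ffmpeg could be assembled as below; ffmpeg itself and the file names are assumptions, since the embodiment does not prescribe a particular decoder.

```python
def ffmpeg_decode_cmd(src: str, dst: str) -> list:
    """Assemble an ffmpeg command line that decodes `src` to a
    headerless 16 kHz / 16-bit little-endian PCM file `dst`."""
    return ["ffmpeg", "-i", src,
            "-ar", "16000",           # 16 kHz sampling rate
            "-f", "s16le",            # raw 16-bit little-endian PCM container
            "-acodec", "pcm_s16le",   # 16-bit PCM codec
            dst]
```

The resulting list could be passed to a process runner for each of the selected original and a cappella files.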
Step S302: Extract the frequency spectrum features from the obtained multiple PCM audio files.
Specifically, the following operations are included:
1) Frame the audios. Here, set the frame length as 512 sampling points and the frame shift as 128 sampling points;
2) Weight each frame of data with a Hamming window function and perform a fast Fourier transform to obtain a 257 dimensional real-domain spectral density and a 255 dimensional virtual-domain (imaginary-part) spectral density, totaling the 512 dimensional feature zi, i∈(1˜512);
3) Calculate the quadratic sum of each real-domain spectral density and the corresponding virtual-domain spectral density thereof;
in other words, it is to calculate |Sreal(f)|2+|Svirtual(f)|2, where f denotes frequency, Sreal(f) denotes the real-domain spectral density/energy value corresponding to the frequency f after the Fourier transform, and Svirtual(f) denotes the virtual-domain spectral density/energy value corresponding to the frequency f after the Fourier transform, so as to obtain the 257 dimensional feature ti, i∈(1˜257).
4) Take the natural logarithm (loge) of the above results to obtain the required 257 dimensional frequency spectrum feature ln|S(f)|2.
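Steps 1)-4) above can be sketched as follows; the small epsilon guarding the logarithm is an added assumption, not part of the described feature extraction.

```python
import numpy as np

def spectrum_features(pcm: np.ndarray,
                      frame_len: int = 512,
                      frame_shift: int = 128) -> np.ndarray:
    """Per-frame 257 dimensional log power spectrum ln|S(f)|^2:
    framing, Hamming window, FFT, quadratic sum of the real-domain
    and virtual-domain spectral densities, then the natural log."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(pcm) - frame_len) // frame_shift
    feats = np.empty((n_frames, frame_len // 2 + 1))
    for m in range(n_frames):
        frame = pcm[m * frame_shift: m * frame_shift + frame_len] * window
        spec = np.fft.rfft(frame)                # 257 complex bins
        power = spec.real ** 2 + spec.imag ** 2  # |S_real(f)|^2 + |S_virtual(f)|^2
        feats[m] = np.log(power + 1e-10)         # epsilon avoids log(0)
    return feats
```

With a 512-point frame and 128-point shift this yields one 257-dimensional feature vector per frame, matching the dimensions stated above.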
Step S303: Train the extracted frequency spectrum features by using the BP algorithm to obtain the DNN model.
Here, the Error Back Propagation (BP) algorithm is used to train a deep neural network with three hidden layers. As shown in FIG. 4, the number of nodes in each of the three hidden layers is 2048, the input layer is the original audio xi, where each frame of the 257 dimensional feature extends 5 frames forward and 5 frames backward to obtain 11 frames of data, totaling 11*257=2827 dimensions, i.e. a∈[1, 2827], and the output is the 257 dimensional feature of the frame corresponding to the a cappella audio yi, i.e. b∈[1, 257]. After being trained by the BP algorithm, 4 matrices are obtained: a 2827*2048 dimensional matrix, two 2048*2048 dimensional matrices and a 2048*257 dimensional matrix.
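The input stacking and the forward pass through the four trained matrices may be sketched as follows; the sigmoid activation in the hidden layers is an assumption, as the embodiment specifies only the layer sizes.

```python
import numpy as np

def stack_context(feats: np.ndarray) -> np.ndarray:
    """Stack each 257-dim frame with 5 frames of left context and
    5 frames of right context into an 11*257 = 2827-dim input
    (the 5 edge frames at each end produce no stacked row)."""
    m, _ = feats.shape
    return np.hstack([feats[i:m - 10 + i] for i in range(11)])

def dnn_forward(x: np.ndarray, weights: list) -> np.ndarray:
    """Forward pass through the 2827-2048-2048-2048-257 network:
    sigmoid hidden layers (an assumption) and a linear output,
    since the output is a log spectrum that may be negative."""
    h = x
    for w in weights[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ w)))  # hidden layers
    return h @ weights[-1]                  # linear output layer
```

The weight matrices correspond to the four matrices obtained by BP training; here they are stand-ins with the same role.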
Exemplary Embodiment 3
FIG. 5 is a flow diagram of the audio information processing method according to an exemplary embodiment. As shown in FIG. 5, the audio information processing method according to an exemplary embodiment may include the following steps:
Step S501: Decode the audio file to acquire the first audio subfile outputted corresponding to the first sound channel and the second audio subfile outputted corresponding to the second sound channel.
The audio file herein (also referred to as the first audio file) may be any music file whose accompanying/original sound channels are to be distinguished. If the audio file is a song whose accompanying/original sound channels are to be distinguished, then the first sound channel and the second sound channel may be the left channel and the right channel respectively, and correspondingly, the first audio subfile and the second audio subfile may be the accompanying file and the original file corresponding to the first audio file, respectively. In other words, if the first audio file is a song, then in Step S501, the song is decoded to acquire the accompanying file or original file of the song outputted by the left channel and the original file or accompanying file of the song outputted by the right channel.
Step S502: Extract the first audio data from the first audio subfile and the second audio data from the second audio subfile respectively by using the predetermined DNN model.
Here, the predetermined DNN model may be the DNN model obtained through in-advance training by using the BP algorithm in exemplary embodiment 2 described above or the DNN model obtained through other methods;
The first audio data and the second audio data may have the same attribute, or the two may represent the same attribute. If the two are both human-voice audios, then the human-voice audios are extracted from the first audio subfile and the second audio subfile by using the DNN model obtained through in-advance training. For example, when the first audio file is a song, if the first audio subfile is an accompanying audio file and the second audio subfile is an original audio file, then the DNN model is used to extract the human-voice accompanying data from the accompanying audio file and the a cappella data from the original audio file.
The process of extracting the a cappella data by using the DNN model obtained through training may include the following steps:
1) Decode the audio file of the a cappella data to be extracted to a PCM audio file of 16 k/16 bit;
2) Use the method provided in step S302 of exemplary embodiment 2 to extract the frequency spectrum features;
3) Suppose that the audio file has a total of m frames. Each frame feature extends 5 frames forward and backward respectively to obtain an 11*257 dimensional feature (the operation is not performed for the first 5 frames and the last 5 frames of the audio file). Multiply the input feature by the matrix in each layer of the DNN model obtained through training in exemplary embodiment 2 to finally obtain a 257 dimensional output feature, and then obtain m−10 frames of output features. The first frame extends 5 frames forward and the last frame extends 5 frames backward to obtain an m frame output result;
4) Calculate ex (the exponential) of each dimensional feature of each frame to obtain the 257 dimensional feature ki, i∈(1˜257);
5) Use the formula zi·kj/tj to obtain the 512 dimensional frequency spectrum feature, where i denotes the 512 dimensions, j denotes the frequency band corresponding to i (of which there are 257, and one j may correspond to one or two i), and the variables z and t correspond to the zi and ti obtained in step 2) respectively;
6) Perform inverse Fourier transform on the above 512 dimensional feature to obtain the time-domain feature, and connect the time-domain features of all frames together to obtain the required a cappella file.
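Steps 4)-6) can be sketched for a single frame as follows; operating on the 257 complex bins is equivalent to scaling the 512 real/imaginary components zi by kj/tj per band, and the small epsilon guarding the division is an added assumption.

```python
import numpy as np

def reconstruct_frame(spec: np.ndarray, t: np.ndarray, k: np.ndarray,
                      frame_len: int = 512) -> np.ndarray:
    """Rebuild one time-domain frame of the a cappella estimate.
    `spec` is the frame's 257-bin complex spectrum, `t` its power
    spectrum from step 2), and `k` = e^x of the DNN output (the
    predicted a cappella power). Each complex bin is scaled by
    k_j / t_j, as in the formula z_i * k_j / t_j, then an inverse
    Fourier transform yields the time-domain feature."""
    scaled = spec * (k / np.maximum(t, 1e-10))  # epsilon avoids division by 0
    return np.fft.irfft(scaled, n=frame_len)
```

Concatenating the reconstructed frames of all m frames then gives the required a cappella file.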
Step S503: Acquire the first audio energy value of the first audio data and the second audio energy value of the second audio data.
For example, the first audio energy value may be calculated from the first audio data, and the second audio energy value may be calculated from the second audio data. The first audio energy value may be the average audio energy value of the first audio data, and the second audio energy value may be the average audio energy value of the second audio data. In practical application, different methods may be used to acquire the average audio energy value corresponding to the audio data. For example, the audio data is composed of multiple sampling points, and each sampling point generally corresponds to a value between 0 and 32767, and the average value of all sampling point values is taken as the average audio energy value corresponding to the audio data. In this way, the average value of all sampling points of the first audio data may be taken as the first audio energy value, and the average value of all sampling points of the second audio data may be taken as the second audio energy value.
Step S504: Determine whether the difference value between the first audio energy value and the second audio energy value is greater than the predetermined threshold or not. If yes, proceed to step S505; otherwise, proceed to step S506.
In practical application, for a song, if there are few human-voice accompaniments in the song, then correspondingly, the audio energy value corresponding to the accompanying file of the song will be small, while the audio energy value corresponding to the a cappella file of the song will be large. Therefore, a threshold (i.e. an audio energy difference threshold) may be used. The audio energy difference threshold may be predetermined; specifically, the threshold may be set experimentally according to the actual use. For example, the threshold may be set as 486. If the difference value between the first audio energy value and the second audio energy value is greater than the audio energy difference threshold, the sound channel whose audio energy value is smaller is determined as the accompanying sound channel.
Step S505: if the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute, and if the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute.
Here, determining the first audio energy value and the second audio energy value. If the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is to determine the first sound channel as the sound channel outputting accompanying audios and the second sound channel as the sound channel outputting original audios. If the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is to determine the second sound channel as the sound channel outputting accompanying audios and the first sound channel as the sound channel outputting original audios.
In this way, whichever of the first audio subfile and the second audio subfile corresponds to the smaller audio energy value may be determined as the audio file that meets the particular attribute requirements, and the sound channel corresponding to that audio subfile as the sound channel that meets the particular requirements. The audio file that meets the particular attribute requirements is the accompanying audio file corresponding to the first audio file, and the sound channel that meets the particular requirements is the sound channel where the outputted audio of the first audio file is the accompanying audio in the first sound channel and the second sound channel.
Step S506: Assign an attribute to the first sound channel and/or the second sound channel by using the predetermined Gaussian Mixture Model (GMM).
Here, the predetermined GMM model is obtained through in-advance training, and the specific training process includes the following:
extract the 13 dimensional Perceptual Linear Predictive (PLP) characteristic parameters of the multiple predetermined audio files. The specific process of extracting the PLP parameters is shown in FIG. 6: perform front-end processing on an audio signal (i.e. audio file), then perform a discrete Fourier transform, then processing such as frequency band calculation, critical band analysis, equal-loudness pre-emphasis and intensity-loudness conversion, and then perform an inverse Fourier transform to generate an all-pole model, and calculate the cepstrum to obtain the PLP parameters.
Calculate the first order difference and the second order difference of the extracted PLP characteristic parameters to obtain 39 dimensional features in total. Use the Expectation Maximization (EM) algorithm to train, based on the extracted PLP characteristic parameters, a GMM model that can preliminarily distinguish accompanying audios from a cappella audios. In practical application, an accompanying GMM model may be trained, and a similarity calculation may be performed between the model and the audio data to be distinguished; the group of audio data with the higher similarity is exactly the accompanying audio data. In the present embodiment, by assigning an attribute to the first sound channel and/or the second sound channel by using the predetermined GMM, which one of the first sound channel and the second sound channel is the sound channel that meets the particular attribute requirements may be preliminarily determined. For example, by performing a similarity calculation between the predetermined GMM model and the first and second audio data, assign or determine the sound channel corresponding to the audio data with the higher similarity as the sound channel outputting accompanying audios.
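The similarity calculation with a trained accompanying GMM may be sketched with scikit-learn's `GaussianMixture` (whose `fit` runs the EM algorithm); the library choice, component count and function names are assumptions, and the PLP feature extraction itself is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_accompaniment_gmm(features: np.ndarray,
                            n_components: int = 8) -> GaussianMixture:
    """Fit a GMM to 39-dim PLP feature frames of known accompaniments;
    GaussianMixture.fit performs EM internally."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(features)
    return gmm

def more_similar_channel(gmm: GaussianMixture,
                         feats_first: np.ndarray,
                         feats_second: np.ndarray) -> str:
    """The channel whose frames score a higher average log-likelihood
    under the accompaniment GMM is preliminarily labelled the
    accompaniment channel."""
    return ("first" if gmm.score(feats_first) > gmm.score(feats_second)
            else "second")
```

In this sketch `gmm.score` plays the role of the similarity calculation between the model and each channel's audio data.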
In this way, after determining which one of the first sound channel and the second sound channel is the sound channel outputting accompanying audio by using the predetermined GMM model, the determined sound channel is the sound channel that preliminarily meets the particular attribute requirements.
Step S507: Determine the first audio energy value and the second audio energy value. If the first attribute is assigned to the first sound channel and the first audio energy value is less than the second audio energy value, or the first attribute is assigned to the second sound channel and the second audio energy value is less than the first audio energy value, proceed to step S508; otherwise proceed to step S509.
In other words, determine whether the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is less than the audio energy value corresponding to the other sound channel or not. If yes, proceed to step S508; otherwise proceed to step S509. The audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is exactly the audio energy value of the audio file outputted by the sound channel.
Step S508: If the first attribute is assigned to the first sound channel and the first audio energy value is less than the second audio energy value, determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is to determine the first sound channel as the sound channel outputting accompanying audio and the second sound channel as the sound channel outputting original audio. If the first attribute is assigned to the second sound channel and the second audio energy value is less than the first audio energy value, determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is to determine the second sound channel as the sound channel outputting accompanying audio and the first sound channel as the sound channel outputting original audio.
In this way, the sound channel that preliminarily meets the particular attribute requirements may be determined as the sound channel that meets the particular attribute requirements which is the sound channel outputting accompanying audio.
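The confirmation logic of steps S507-S509 may be sketched as follows; the string labels and the use of `None` to signal manual confirmation are illustrative assumptions.

```python
def confirm_channel(preliminary: str, e_first: float, e_second: float):
    """Keep the GMM's preliminary accompaniment channel only if it
    also has the smaller human-voice audio energy; otherwise return
    None, meaning a prompt message for manual confirmation (S509)."""
    if preliminary == "first" and e_first < e_second:
        return "first"
    if preliminary == "second" and e_second < e_first:
        return "second"
    return None  # energies disagree with the GMM result
```

The two checks mirror the two branches of step S508, and the fall-through corresponds to step S509.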
In some exemplary embodiments, the method may further include the following steps after Step S508:
label the sound channel that meets the particular attribute requirements;
switch between sound channels based on the labeling of the sound channel that meets the particular attribute requirements if it is determined to switch the sound channels;
for example, the sound channel that meets the particular attribute requirements may be the sound channel outputting accompanying audio. After the sound channel outputting accompanying audio (such as the first sound channel) is determined, the sound channel is labeled as the accompanying audio sound channel. In this way, it is possible to switch between accompaniments and originals based on the labeled sound channel. For example, a user may switch between accompaniments and originals based on the labeled sound channel when the user is singing karaoke;
alternatively, adjust the sound channel that meets the particular attribute requirements as the first sound channel or the second sound channel uniformly; in this way, all sound channels outputting accompanying audios/original audios may be unified for the convenience of unified management.
Step S509: Output the prompt message.
Here, the prompt message may be used to prompt the user that the corresponding sound channel outputting accompanying audio of the first audio file cannot be distinguished, so that the user can confirm that the corresponding sound channel outputs accompanying audio manually.
For example, if the first attribute is assigned to the first sound channel but the first audio energy value is not less than the second audio energy value, or the first attribute is assigned to the second sound channel but the second audio energy value is not less than the first audio energy value, then the attributes of the first sound channel and the second sound channel need to be confirmed artificially.
In applying the above exemplary embodiment, based on the features of music files, the human-voice component is first extracted from the music by using the trained DNN model, and the final classification result is then obtained through comparison of the dual-channel human-voice energy. The accuracy of the final classification may reach 99% or above.
Exemplary Embodiment 4
FIG. 7 is a flow diagram of an audio information processing method according to an exemplary embodiment. As shown in FIG. 7, the audio information processing method according to an exemplary embodiment may include the following steps:
Step S701: Extract the dual-channel a cappella data (and/or human-voice accompanying data) of the music to be detected by using the DNN model trained in advance.
A specific process of extracting the a cappella data is shown in FIG. 8. As shown in FIG. 8, first extract the features of the a cappella data for training and the music data for training, and then perform DNN training to obtain the DNN model. Extract the features of the music from which the a cappella data is to be extracted and perform DNN decoding based on the DNN model, then extract the features again, and finally obtain the a cappella data.
Step S702: Calculate the average audio energy value of the extracted dual-channel a cappella (and/or human-voice accompanying) data respectively.
Step S703: Determine whether the audio energy difference value of the dual-channel a cappella (and/or human-voice accompanying) data is greater than the predetermined threshold or not. If yes, proceed to step S704; otherwise, proceed to step S705.
Step S704: Determine the sound channel corresponding to the a cappella (and/or human-voice accompanying) data with a smaller average audio energy value as the accompanying sound channel.
Step S705: Classify the music to be detected with dual-channel output by using the GMM trained in advance.
Step S706: Determine whether the audio energy value corresponding to the sound channel that is classified as accompanying audio is smaller or not. If yes, proceed to step S707; otherwise, proceed to step S708.
Step S707: Determine the sound channel with a smaller audio energy value as the accompanying sound channel.
Step S708: Output the prompt message to use manual confirmation.
When the audio information processing method according to the exemplary embodiment is implemented practically, the dual-channel a cappella (and/or human-voice accompanying) data may be extracted while the accompanying audio sound channel is determined by using the GMM, and then a regression function is used to execute the above steps S703-S708. It should be noted that the operations in step S705 have been executed in advance, so such operations may be skipped when the regression function is used, as shown in FIG. 9. Referring to FIG. 9, conduct dual-channel decoding on the music to be classified (i.e. the music to be detected). At the same time, use the a cappella training data to obtain the DNN model through training and use the accompanying human-voice training data to obtain the GMM model through training. Then, conduct the similarity calculation by using the GMM model and extract the a cappella data by using the DNN model, and operate by using the regression function as mentioned above to finally obtain the classification results.
Exemplary Embodiment 5
FIG. 10 is a structural diagram of the composition of the audio information processing apparatus according to an exemplary embodiment. As shown in FIG. 10, the audio information processing apparatus according to an exemplary embodiment includes a decoding module 11, an extracting module 12, an acquisition module 13 and a processing module 14;
the decoding module 11 being configured to decode the audio file (i.e. the first audio file) to acquire the first audio subfile outputted corresponding to the first sound channel and the second audio subfile outputted corresponding to the second sound channel;
the extracting module 12 being configured to extract the first audio data from the first audio subfile and the second audio data from the second audio subfile;
the acquisition module 13 being configured to acquire the first audio energy value of the first audio data and the second audio energy value of the second audio data;
the processing module 14 being configured to determine the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value.
The first audio data and the second audio data may have the same attribute. For example, the first audio data may correspond to the human-voice audio outputted by the first sound channel and the second audio data may correspond to the human-voice audio outputted by the second sound channel;
further, the processing module 14 may be configured to determine which one of the first sound channel and the second sound channel is the sound channel outputting accompanying audio based on the first audio energy value of the human-voice audio outputted by the first sound channel and the second audio energy value of the human-voice audio outputted by the second sound channel.
In some exemplary embodiments, the apparatus may further comprise a first model training module 15 configured to extract the frequency spectrum features of the multiple predetermined audio files respectively;
train the extracted frequency spectrum features by using the error back propagation (BP) algorithm to obtain the DNN model;
correspondingly, the extracting module 12 may be further configured to extract the first audio data from the first audio subfile and the second audio data from the second audio subfile respectively by using the DNN model.
In some exemplary embodiments, the processing module 14 may be configured to determine the difference value between the first audio energy value and the second audio energy value. If the difference value is greater than the threshold (e.g. an audio energy difference threshold) and the first audio energy value is less than the second audio energy value, then determine the attribute of the first sound channel as the first attribute and the attribute of the second sound channel as the second attribute, that is to determine the first sound channel as the sound channel outputting accompanying audio and the second sound channel as the sound channel outputting original audio. On the contrary, if the difference value between the first audio energy value and the second audio energy value is greater than the threshold and the second audio energy value is less than the first audio energy value, then determine the attribute of the second sound channel as the first attribute and the attribute of the first sound channel as the second attribute, that is to determine the second sound channel as the sound channel outputting accompanying audio and the first sound channel as the sound channel outputting original audio.
In this way, when the processing module 14 detects that the difference value between the first audio energy value and the second audio energy value is greater than the audio energy difference threshold, the first audio subfile or the second audio subfile corresponding to the first audio energy value or the second audio energy value (whichever is smaller) is determined as the audio file that meets the particular attribute requirements, and the sound channel corresponding to the audio subfile that meets the particular attribute requirements as the sound channel that meets the particular requirements;
alternatively, when the processing module 14 detects that the difference value between the first audio energy value and the second audio energy value is not greater than the audio energy difference threshold, the classification method is used to assign attribute to at least one of the first sound channel and the second sound channel, so as to preliminarily determine which one of the first sound channel and the second sound channel is the sound channel that meets the particular attribute requirements.
In some exemplary embodiments, the apparatus may further comprise a second model training module 16 being configured to extract the Perceptual Linear Predictive (PLP) characteristic parameters of multiple audio files;
obtain the Gaussian Mixture Model (GMM) through training by using the Expectation Maximization (EM) algorithm based on the extracted PLP characteristic parameters;
correspondingly, the processing module 14 may be further configured to assign an attribute to at least one of the first sound channel and the second sound channel by using the GMM obtained through training, so as to preliminarily determine the first sound channel or the second sound channel as the sound channel that preliminarily meets the particular attribute requirements.
Further, the processing module 14 may be configured to determine the first audio energy value and the second audio energy value, and to determine whether the first attribute is assigned to the first sound channel and the first audio energy value is less than the second audio energy value, or the first attribute is assigned to the second sound channel and the second audio energy value is less than the first audio energy value; this is also to determine whether the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is less than the audio energy value corresponding to the other sound channel or not;
if the result shows that the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is less than the audio energy value corresponding to the other sound channel, determine the sound channel that preliminarily meets the particular attribute requirements as the sound channel that meets the particular attribute requirements.
In some exemplary embodiments, the processing module 14 may be further configured to output a prompt message when the result shows that the audio energy value corresponding to the sound channel that preliminarily meets the particular attribute requirements is not less than the audio energy value corresponding to the other sound channel.
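The cross-check and prompt behavior just described can be sketched as a small function. The function name and return convention are hypothetical, and the "prompt message" is reduced here to a printed warning.

```python
def confirm_accompaniment(preliminary, first_energy, second_energy):
    """Cross-check a classifier's preliminary accompaniment pick against energies.

    `preliminary` is 'first' or 'second', the channel the model flagged as
    meeting the particular attribute requirements. The pick is confirmed only
    if that channel's energy is lower than the other channel's; otherwise a
    prompt message is emitted and no channel is confirmed.
    """
    if preliminary == 'first':
        picked, other = first_energy, second_energy
    else:
        picked, other = second_energy, first_energy
    if picked < other:
        return preliminary
    print('prompt: energy check disagrees with the preliminary classification')
    return None
```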
The decoding module 11, the extracting module 12, the acquisition module 13, the processing module 14, the first model training module 15 and the second model training module 16 in the audio information processing apparatus may be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC) in the apparatus.
FIG. 11 is a structural diagram of the hardware composition of the audio information processing apparatus according to an exemplary embodiment. As an example of a hardware implementation, the apparatus S11 is shown in FIG. 11. The apparatus S11 may include a processor 111, a storage medium 112 and at least one external communication interface 113; the processor 111, the storage medium 112 and the external communication interface 113 may be connected through a bus 114.
It should be noted that the audio information processing apparatus according to an exemplary embodiment may be a mobile phone, a desktop computer, a PC or an all-in-one machine. The audio information processing method may also be performed by a server.
It should be noted that the above descriptions related to the apparatus are similar to those related to the method, so the descriptions of the advantageous effects of the same method are omitted herein. Please refer to the descriptions of the exemplary embodiments of the method discussed above for the technical details that are not disclosed in the exemplary embodiments of the apparatus.
The audio information processing apparatus according to an exemplary embodiment may be a terminal or a server. Similarly, the audio information processing method according to an exemplary embodiment is not limited to being used in a terminal; instead, the method may also be used in a server, such as a web server or a server corresponding to music application software (e.g., WeSing software). Please refer to the above descriptions of the exemplary embodiments for specific processing procedures; details are omitted herein.
A person skilled in the art may understand that some or all of the steps of the above exemplary embodiments of the method may be implemented by related hardware executing computer program code. The foregoing computer program code may be stored in a computer-readable storage medium, and when the code is executed by a computer, the steps of the above exemplary embodiments are performed; the foregoing storage medium may include a mobile storage device, a Random Access Memory (RAM), a Read-Only Memory (ROM), a magnetic disk, an optical disc or other media that can store program code.
Alternatively, if the above integrated unit of the present application is achieved in the form of software functional module(s) and is sold or used as an independent product, the software functional module(s) may also be stored in a computer-readable storage medium. On this basis, the technical solutions according to the exemplary embodiments, essentially, or the part contributing to the related technology, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server or a network device) to execute all or part of the method provided by each exemplary embodiment of the present application. The foregoing storage medium includes a mobile storage device, a RAM, a ROM, a magnetic disk, an optical disc or other media that can store program code.
The foregoing descriptions are merely specific exemplary embodiments, but the protection scope of the present application is not limited thereto. Any changes or replacements within the technical scope disclosed in the present application made by those skilled in the art should fall within the scope of protection of the present application. Therefore, the protection scope of the present application is defined by the appended claims.

Claims (20)

What is claimed is:
1. A method comprising:
decoding a first audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel, where one of the first sound channel and the second sound channel includes original audio, and the other one of the first sound channel and the second sound channel includes accompanying audio;
extracting first audio data from the first audio subfile;
extracting second audio data from the second audio subfile;
acquiring a first audio energy value of the first audio data;
acquiring a second audio energy value of the second audio data;
determining an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value; and
determining which one of the first and second sound channels includes the accompanying audio based on the attribute that is determined.
2. The method according to claim 1, further comprising:
extracting frequency spectrum features of a plurality of second audio files, respectively; and
training the frequency spectrum features by using an error back propagation (BP) algorithm to obtain a deep neural networks (DNN) model,
wherein the first audio data is extracted from the first audio subfile by using the DNN model,
wherein the second audio data is extracted from the second audio subfile by using the DNN model.
3. The method according to claim 1, wherein the determining the attribute includes:
determining a difference value between the first audio energy value and the second audio energy value;
determining the attribute of the first sound channel as a first attribute in response to the difference value being greater than a threshold and the first audio energy value being less than the second audio energy value.
4. The method according to claim 1, wherein the determining the attribute includes:
determining a difference value between the first audio energy value and the second audio energy value; and
assigning an attribute to at least one of the first sound channel and the second sound channel by using a classification method in response to the difference value being less than or equal to a threshold value.
5. The method according to claim 4, further comprising:
extracting Perceptual Linear Predictive (PLP) characteristic parameters from a plurality of second audio files; and
obtaining a Gaussian Mixture Model (GMM) through training by using an EM algorithm based on the PLP characteristic parameters,
wherein the attribute may be assigned by using the GMM obtained through training.
6. The method according to claim 4, wherein the method further comprises, in response to the attribute being assigned to the first sound channel:
determining whether the first audio energy value is less than the second audio energy value;
determining the attribute of the first sound channel as a first attribute in response to the first audio energy value being less than the second audio energy value.
7. The method according to claim 3, wherein, the first audio data is human-voice audio corresponding to the first sound channel, and the second audio data is human-voice audio corresponding to the second sound channel, and
wherein the determining the attribute of the first sound channel as the first attribute includes:
determining the first sound channel as a sound channel outputting accompanying audio.
8. The method according to claim 1, further comprising:
labeling the attribute;
determining whether to switch between the first sound channel and the second sound channel; and
switching between the first sound channel and the second sound channel based on the labeling in response to determining to switch between the first sound channel and the second sound channel.
9. The method according to claim 1, wherein the first audio data has a same attribute as an attribute of the second audio data.
10. The method according to claim 1, wherein the attribute indicates that the sound channel is an accompaniment audio or an original audio.
11. An apparatus comprising:
at least one memory configured to store computer program code; and
at least one processor configured to access the at least one memory and operate according to the computer program code, said computer program code including:
decoding code configured to cause the at least one processor to decode an audio file to acquire a first audio subfile corresponding to a first sound channel and a second audio subfile corresponding to a second sound channel, where one of the first sound channel and the second sound channel includes original audio, and the other one of the first sound channel and the second sound channel includes accompanying audio;
extracting code configured to cause the at least one processor to extract first audio data from the first audio subfile and second audio data from the second audio subfile;
acquisition code configured to cause the at least one processor to acquire a first audio energy value of the first audio data and a second audio energy value of the second audio data;
processing code configured to cause the at least one processor to determine an attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value; and
determining code configured to cause the at least one processor to determine which one of the first and second sound channels includes the accompanying audio based on the attribute that is determined.
12. The apparatus according to claim 11, wherein the computer program code further comprises first model training code configured to cause the at least one processor to:
extract frequency spectrum features of multiple other audio files respectively;
train the extracted frequency spectrum features by using an error back propagation (BP) algorithm to obtain a deep neural networks (DNN) model,
wherein the extracting code is configured to cause the at least one processor to extract the first audio data from the first audio subfile and the second audio data from the second audio subfile respectively by using the DNN model.
13. The apparatus according to claim 11, wherein the at least one processor is further configured to:
determine a difference value between the first audio energy value and the second audio energy value; and
determine the attribute of the first sound channel as a first attribute in response to the difference value being greater than a threshold value and the first audio energy value being less than the second audio energy value.
14. The apparatus according to claim 11, wherein the at least one processor is configured to:
determine a difference value between the first audio energy value and the second audio energy value; and
assign an attribute to at least one of the first sound channel and the second sound channel by using a classification method in response to the difference value being not greater than a threshold.
15. The apparatus according to claim 14, wherein the computer program code further comprises second model training code configured to cause the at least one processor to:
extract Perceptual Linear Predictive (PLP) characteristic parameters of multiple other audio files; and
obtain a Gaussian Mixture Model (GMM) through training by using an Expectation Maximization (EM) algorithm based on the extracted PLP characteristic parameters,
wherein the processing code is further configured to cause at least one of the at least one processor to:
assign the attribute to at least one of the first sound channel and the second sound channel by using the GMM obtained through training.
16. The apparatus according to claim 14, wherein, in response to the first attribute being assigned to the first sound channel, the at least one processor is configured to:
determine whether the first audio energy value is less than the second audio energy value; and
determine the attribute of the first sound channel as the first attribute in response to the first audio energy value being determined to be less than the second audio energy value.
17. The apparatus according to claim 13,
wherein, the first audio data is a first human-voice audio corresponding to the first sound channel, and the second audio data is a second human-voice audio corresponding to the second sound channel,
wherein, to determine the attribute of the first sound channel as the first attribute, the processing code is configured to cause at least one of the at least one processor to determine the first sound channel as the sound channel outputting accompanying audio.
18. The apparatus according to claim 11, wherein the at least one processor is further configured to:
label the attribute;
determine whether to switch between the first sound channel and the second sound channel; and
switch between the first sound channel and the second sound channel based on the labeling in response to determining to switch between the first sound channel and the second sound channel.
19. The apparatus according to claim 11, wherein the first audio data has the same attribute as the attribute of the second audio data.
20. A non-transitory computer-readable storage medium that stores computer program code that, when executed by a processor of a calculating apparatus, causes the calculating apparatus to perform:
decoding an audio file to acquire a first audio subfile outputted corresponding to a first sound channel and a second audio subfile outputted corresponding to a second sound channel where one of the first sound channel and the second sound channel includes original audio, and the other one of the first sound channel and the second sound channel includes accompanying audio;
extracting first audio data from the first audio subfile;
extracting second audio data from the second audio subfile;
acquiring a first audio energy value of the first audio data;
acquiring a second audio energy value of the second audio data;
determining the attribute of at least one of the first sound channel and the second sound channel based on the first audio energy value and the second audio energy value; and
determining which one of the first and second sound channels includes the accompanying audio based on the attribute that is determined.
US15/762,841 2016-03-18 2017-03-16 Audio information processing method and apparatus Active US10410615B2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201610157251.XA CN105741835B (en) 2016-03-18 2016-03-18 A kind of audio-frequency information processing method and terminal
CN201610157251.X 2016-03-18
CN201610157251 2016-03-18
PCT/CN2017/076939 WO2017157319A1 (en) 2016-03-18 2017-03-16 Audio information processing method and device

Publications (2)

Publication Number Publication Date
US20180293969A1 US20180293969A1 (en) 2018-10-11
US10410615B2 true US10410615B2 (en) 2019-09-10

Family

ID=56251827

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/762,841 Active US10410615B2 (en) 2016-03-18 2017-03-16 Audio information processing method and apparatus

Country Status (6)

Country Link
US (1) US10410615B2 (en)
JP (1) JP6732296B2 (en)
KR (1) KR102128926B1 (en)
CN (1) CN105741835B (en)
MY (1) MY185366A (en)
WO (1) WO2017157319A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180350392A1 (en) * 2016-06-01 2018-12-06 Tencent Technology (Shenzhen) Company Limited Sound file sound quality identification method and apparatus

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105741835B (en) * 2016-03-18 2019-04-16 腾讯科技(深圳)有限公司 A kind of audio-frequency information processing method and terminal
CN106448630B (en) * 2016-09-09 2020-08-04 腾讯科技(深圳)有限公司 Method and device for generating digital music score file of song
CN106375780B (en) * 2016-10-20 2019-06-04 腾讯音乐娱乐(深圳)有限公司 A kind of multimedia file producting method and its equipment
CN108461086B (en) * 2016-12-13 2020-05-15 北京唱吧科技股份有限公司 Real-time audio switching method and device
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN108231091B (en) * 2018-01-24 2021-05-25 广州酷狗计算机科技有限公司 Method and device for detecting whether left and right sound channels of audio are consistent
US10522167B1 (en) * 2018-02-13 2019-12-31 Amazon Techonlogies, Inc. Multichannel noise cancellation using deep neural network masking
CN109102800A (en) * 2018-07-26 2018-12-28 广州酷狗计算机科技有限公司 A kind of method and apparatus that the determining lyrics show data
CN111061909B (en) * 2019-11-22 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 Accompaniment classification method and accompaniment classification device
CN113420771B (en) * 2021-06-30 2024-04-19 扬州明晟新能源科技有限公司 Colored glass detection method based on feature fusion
CN113744708B (en) * 2021-09-07 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Model training method, audio evaluation method, device and readable storage medium
CN114615534A (en) * 2022-01-27 2022-06-10 海信视像科技股份有限公司 Display device and audio processing method

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0916189A (en) 1995-04-18 1997-01-17 Texas Instr Inc <Ti> Karaoke marking method and karaoke device
US5736943A (en) * 1993-09-15 1998-04-07 Fraunhofer-Gesellschaft Zur Forderung Der Angewandten Forschung E.V. Method for determining the type of coding to be selected for coding at least two signals
JP2003330497A (en) 2002-05-15 2003-11-19 Matsushita Electric Ind Co Ltd Method and device for encoding audio signal, encoding and decoding system, program for executing encoding, and recording medium with the program recorded thereon
US20040074378A1 (en) * 2001-02-28 2004-04-22 Eric Allamanche Method and device for characterising a signal and method and device for producing an indexed signal
US20040094019A1 (en) * 2001-05-14 2004-05-20 Jurgen Herre Apparatus for analyzing an audio signal with regard to rhythm information of the audio signal by using an autocorrelation function
US20040125961A1 (en) * 2001-05-11 2004-07-01 Stella Alessio Silence detection
JP2005201966A (en) 2004-01-13 2005-07-28 Daiichikosho Co Ltd Karaoke machine for automatically controlling background chorus sound volume
US20070131095A1 (en) * 2005-12-10 2007-06-14 Samsung Electronics Co., Ltd. Method of classifying music file and system therefor
US20070180980A1 (en) * 2006-02-07 2007-08-09 Lg Electronics Inc. Method and apparatus for estimating tempo based on inter-onset interval count
US20080187153A1 (en) * 2005-06-17 2008-08-07 Han Lin Restoring Corrupted Audio Signals
CN101577117A (en) 2009-03-12 2009-11-11 北京中星微电子有限公司 Extracting method of accompaniment music and device
US7630500B1 (en) * 1994-04-15 2009-12-08 Bose Corporation Spatial disassembly processor
CN101894559A (en) 2010-08-05 2010-11-24 展讯通信(上海)有限公司 Audio processing method and device thereof
US20110081024A1 (en) * 2009-10-05 2011-04-07 Harman International Industries, Incorporated System for spatial extraction of audio signals
US8378964B2 (en) * 2006-04-13 2013-02-19 Immersion Corporation System and method for automatically producing haptic events from a digital audio signal
US20130121511A1 (en) * 2009-03-31 2013-05-16 Paris Smaragdis User-Guided Audio Selection from Complex Sound Mixtures
US8489403B1 (en) * 2010-08-25 2013-07-16 Foundation For Research and Technology—Institute of Computer Science ‘FORTH-ICS’ Apparatuses, methods and systems for sparse sinusoidal audio processing and transmission
US20160049162A1 (en) * 2013-03-21 2016-02-18 Intellectual Discovery Co., Ltd. Audio signal size control method and device
CN105741835A (en) 2016-03-18 2016-07-06 腾讯科技(深圳)有限公司 Audio information processing method and terminal
US20160254001A1 (en) * 2013-11-27 2016-09-01 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Decoder, encoder, and method for informed loudness estimation in object-based audio coding systems

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5736943A (en) * 1993-09-15 1998-04-07 Fraunhofer-Gesellschaft Zur Forderung Der Angewandten Forschung E.V. Method for determining the type of coding to be selected for coding at least two signals
US7630500B1 (en) * 1994-04-15 2009-12-08 Bose Corporation Spatial disassembly processor
JPH0916189A (en) 1995-04-18 1997-01-17 Texas Instr Inc <Ti> Karaoke marking method and karaoke device
US5719344A (en) * 1995-04-18 1998-02-17 Texas Instruments Incorporated Method and system for karaoke scoring
US20040074378A1 (en) * 2001-02-28 2004-04-22 Eric Allamanche Method and device for characterising a signal and method and device for producing an indexed signal
US20040125961A1 (en) * 2001-05-11 2004-07-01 Stella Alessio Silence detection
US20040094019A1 (en) * 2001-05-14 2004-05-20 Jurgen Herre Apparatus for analyzing an audio signal with regard to rhythm information of the audio signal by using an autocorrelation function
JP2003330497A (en) 2002-05-15 2003-11-19 Matsushita Electric Ind Co Ltd Method and device for encoding audio signal, encoding and decoding system, program for executing encoding, and recording medium with the program recorded thereon
JP2005201966A (en) 2004-01-13 2005-07-28 Daiichikosho Co Ltd Karaoke machine for automatically controlling background chorus sound volume
US20080187153A1 (en) * 2005-06-17 2008-08-07 Han Lin Restoring Corrupted Audio Signals
US20070131095A1 (en) * 2005-12-10 2007-06-14 Samsung Electronics Co., Ltd. Method of classifying music file and system therefor
US20070180980A1 (en) * 2006-02-07 2007-08-09 Lg Electronics Inc. Method and apparatus for estimating tempo based on inter-onset interval count
US8378964B2 (en) * 2006-04-13 2013-02-19 Immersion Corporation System and method for automatically producing haptic events from a digital audio signal
CN101577117A (en) 2009-03-12 2009-11-11 北京中星微电子有限公司 Extracting method of accompaniment music and device
US20130121511A1 (en) * 2009-03-31 2013-05-16 Paris Smaragdis User-Guided Audio Selection from Complex Sound Mixtures
US20110081024A1 (en) * 2009-10-05 2011-04-07 Harman International Industries, Incorporated System for spatial extraction of audio signals
CN101894559A (en) 2010-08-05 2010-11-24 展讯通信(上海)有限公司 Audio processing method and device thereof
US8489403B1 (en) * 2010-08-25 2013-07-16 Foundation For Research and Technology—Institute of Computer Science ‘FORTH-ICS’ Apparatuses, methods and systems for sparse sinusoidal audio processing and transmission
US20160049162A1 (en) * 2013-03-21 2016-02-18 Intellectual Discovery Co., Ltd. Audio signal size control method and device
US20160254001A1 (en) * 2013-11-27 2016-09-01 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Decoder, encoder, and method for informed loudness estimation in object-based audio coding systems
CN105741835A (en) 2016-03-18 2016-07-06 腾讯科技(深圳)有限公司 Audio information processing method and terminal

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Communication dated Jun. 17, 2019, from the Japanese Patent Office in counterpart application No. 2018-521411.
Eric's Memo Pad, "KTV Automatic Sound Channel Judgment", http://ericpeng1968.blogspot.com/2015/08/ktv_5.html, Aug. 5, 2015.
International Search Report for PCT/CN2017/076939 dated Jun. 20, 2017.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180350392A1 (en) * 2016-06-01 2018-12-06 Tencent Technology (Shenzhen) Company Limited Sound file sound quality identification method and apparatus
US10832700B2 (en) * 2016-06-01 2020-11-10 Tencent Technology (Shenzhen) Company Limited Sound file sound quality identification method and apparatus

Also Published As

Publication number Publication date
CN105741835B (en) 2019-04-16
JP6732296B2 (en) 2020-07-29
US20180293969A1 (en) 2018-10-11
KR102128926B1 (en) 2020-07-01
JP2019502144A (en) 2019-01-24
MY185366A (en) 2021-05-11
CN105741835A (en) 2016-07-06
KR20180053714A (en) 2018-05-23
WO2017157319A1 (en) 2017-09-21

Similar Documents

Publication Publication Date Title
US10410615B2 (en) Audio information processing method and apparatus
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN105244026B (en) A kind of method of speech processing and device
CN106486128B (en) Method and device for processing double-sound-source audio data
US9368116B2 (en) Speaker separation in diarization
US20150356967A1 (en) Generating Narrative Audio Works Using Differentiable Text-to-Speech Voices
WO2022203699A1 (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
CN112037764B (en) Method, device, equipment and medium for determining music structure
CN107680584B (en) Method and device for segmenting audio
EP4425482A2 (en) Model training and tone conversion method and apparatus, device, and medium
CN108764114B (en) Signal identification method and device, storage medium and terminal thereof
WO2024055752A9 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
US12093314B2 (en) Accompaniment classification method and apparatus
Petermann et al. Tackling the cocktail fork problem for separation and transcription of real-world soundtracks
Mandel et al. Audio super-resolution using concatenative resynthesis
CN112712793A (en) ASR (error correction) method based on pre-training model under voice interaction and related equipment
JP6220733B2 (en) Voice classification device, voice classification method, and program
CN114783410A (en) Speech synthesis method, system, electronic device and storage medium
CN113825009B (en) Audio and video playing method and device, electronic equipment and storage medium
Reddy et al. MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium
Ramona et al. Comparison of different strategies for a SVM-based audio segmentation
US20240071367A1 (en) Automatic Speech Generation and Intelligent and Robust Bias Detection in Automatic Speech Recognition Model
CN116229989A (en) Method, device and equipment for analyzing voice and readable storage medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, CHI

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ZHAO, WEIFENG;REEL/FRAME:045332/0653

Effective date: 20180313

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4