WO2020078120A1 - Audio recognition method, device and storage medium - Google Patents
Audio recognition method, device and storage medium
- Publication number
- WO2020078120A1 (PCT/CN2019/103883)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- time
- target word
- pitch
- probability
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/091—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for performance evaluation, i.e. judging, grading or scoring the musical qualities or faithfulness of a performance, e.g. with respect to pitch, tempo or other timings of a reference performance
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
- G10L2025/906—Pitch tracking
Definitions
- the invention relates to the field of information technology, in particular to an audio recognition method, device and storage medium.
- the terminal can also score the user's singing audio for the user's reference.
- the start time and end time of the lyrics are generally regarded as the times when the person starts and ends singing. However, in the actual singing process, some people may start singing earlier than the start time of the lyrics, and some people may start singing later than the start time of the lyrics.
- Embodiments of the present invention provide an audio recognition method, device, and storage medium, which can improve the accuracy of audio recognition.
- An embodiment of the present invention provides an audio recognition method, which includes:
- each word in the text information is sequentially set as a target word, and time information corresponding to the target word is obtained, the time information including the start time of the target word and the end time of the target word;
- the audio file is identified according to multiple start adjustment times of the target word and multiple end adjustment times of the target word to obtain pitch information of the target word.
- the step of determining the multiple start adjustment times corresponding to the target word according to the start time of the target word, and determining the multiple end adjustment times corresponding to the target word according to the end time of the target word, includes:
- the step of recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain the pitch information of the target word includes:
- the pitch information of the target word is generated according to the highest-score pitch probability set.
- the step of scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score includes:
- the multiple error deduction values are set as the target error deduction value in turn, and a first probability and a second probability are obtained from the pitch probability set corresponding to the target error deduction value, where the first probability is the largest probability and the second probability is the second-largest probability;
- the pitch probability set corresponding to the target error deduction value is scored.
- the step of determining the pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets, where the pitch probability set includes the pitch, the probability, and the association between the two, includes:
- the pitch, the probability, and the association relationship between the two are stored to generate a pitch probability set corresponding to the target adjustment time group.
- the time information corresponding to the target word further includes the duration of the target word; the step of sequentially setting each word in the text information as the target word and obtaining the time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word, further includes:
- the split target word continues to be split until the duration of each word in the text information is not greater than the preset duration.
- An embodiment of the present invention also provides an audio recognition device, including:
- An acquisition module for acquiring an audio file and text information corresponding to the audio file, the text information including multiple words;
- a setting module for sequentially setting each word in the text information as a target word, and acquiring time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word;
- the first determining module is configured to determine a plurality of start adjustment times corresponding to the target word according to the start time of the target word, and determine a plurality of end adjustments corresponding to the target word according to the end time of the target word time;
- the recognition module is configured to recognize the audio file according to a plurality of start adjustment times of the target word and a plurality of end adjustment times of the target word to obtain pitch information of the target word.
- the first determining module includes:
- an acquisition submodule for acquiring a preset time step and a preset maximum error value;
- the identification module includes:
- a selection submodule for selecting a target start adjustment time from the multiple start adjustment times of the target word, and selecting a target end adjustment time corresponding to the target start adjustment time from the multiple end adjustment times of the target word, to obtain multiple target adjustment time groups;
- a scoring submodule for scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score;
- a generation submodule is used to generate pitch information of the target word according to the highest-score pitch probability set.
- the scoring sub-module is specifically used to:
- the pitch probability set corresponding to the target error deduction value is scored.
- the obtaining submodule is specifically used for:
- the pitch, the probability, and the association relationship between the two are stored to generate a pitch probability set corresponding to the target adjustment time group.
- the audio recognition device further includes:
- a second determination module configured to determine whether the duration of the target word is greater than a preset duration
- a splitting module configured to split the target word when it is greater than a preset duration, and determine the duration of the split target word
- a determining module used to re-determine whether the duration of the split target word is greater than a preset duration
- a continuation splitting module, which is used to continue splitting the split target word when its duration is greater than the preset duration, until the duration of each word in the text information is not greater than the preset duration.
- an embodiment of the present invention also provides a storage medium in which processor-executable instructions are stored, and a processor performs any of the above audio recognition methods by executing the instructions.
- in the audio recognition method, device and storage medium of the embodiments of the present invention, multiple start adjustment times and multiple end adjustment times are first determined according to the start time and end time corresponding to the target word, and the audio file is then recognized according to these start and end adjustment times, which improves the accuracy of audio recognition.
- FIG. 1 is a schematic diagram of a first scenario of an audio recognition method according to an embodiment of the present invention.
- FIG. 2 is a schematic flowchart of an audio recognition method provided by an embodiment of the present invention.
- FIG. 3 is another schematic diagram of an audio recognition method provided by an embodiment of the present invention.
- FIG. 4 is another schematic flowchart of an audio recognition method according to an embodiment of the present invention.
- FIG. 5 is another schematic diagram of an audio recognition method according to an embodiment of the present invention.
- FIG. 6 is another schematic diagram of an audio recognition method according to an embodiment of the present invention.
- FIG. 7 is a schematic structural diagram of an audio recognition device according to an embodiment of the present invention.
- FIG. 8 is a schematic structural diagram of a first determining module provided by an embodiment of the present invention.
- FIG. 9 is a schematic structural diagram of an identification module provided by an embodiment of the present invention.
- FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
- FIG. 1 is a schematic diagram of a scene of an audio recognition method provided by an embodiment of the present invention.
- an audio recognition apparatus may be implemented as an independent entity, or may be integrated into an electronic device such as a terminal or a server.
- this scenario may include a terminal a and a server b.
- User A can record songs and generate audio files through the singing application H integrated in the terminal a.
- the terminal a may obtain text information corresponding to the audio file from the server b, specifically including lyrics text information, and the text information includes multiple words.
- each word in the text information has time information, specifically including the start time and end time of each word.
- the beginning and end of a word correspond to the beginning and end of the human voice.
- the terminal a sets each word in the text information as the target word, and further obtains time information corresponding to the target word from the server b.
- the time information includes the start time of the target word and the end time of the target word. Because in the audio file recorded by the user, the start and end of the human voice is not necessarily completely synchronized with the start and end of the corresponding word. Therefore, multiple start adjustment times corresponding to the target word can be determined according to the start time of the target word, and multiple end adjustment times corresponding to the target word can be determined according to the end time of the target word. Finally, the terminal a then recognizes the audio file according to the multiple start adjustment times and the multiple end adjustment times to obtain pitch information of the target word.
- Embodiments of the present invention provide an audio recognition method, device, and storage medium, which will be described in detail below.
- An audio recognition method includes: acquiring an audio file and text information corresponding to the audio file, the text information including multiple words; sequentially setting each word in the text information as a target word and acquiring time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word; determining multiple start adjustment times corresponding to the target word according to the start time of the target word, and determining multiple end adjustment times corresponding to the target word according to the end time of the target word; and recognizing the audio file according to the multiple start adjustment times and multiple end adjustment times of the target word to obtain the pitch information of the target word.
- FIG. 2 is a flowchart of an audio recognition method according to an embodiment of the present invention.
- the method may include:
- Step S101: Acquire an audio file and text information corresponding to the audio file.
- the text information includes multiple words.
- When a user uses a singing application to record a song, the accompaniment, the vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice in the audio file, the start time and end time of the human voice in the audio file need to be known.
- When a user uses a singing application to record a song, the application displays lyrics text information to prompt the user to sing. It can be roughly considered that the time when the lyrics start is the time when the user starts singing, and the time when the lyrics end is the time when the user ends singing. Therefore, after the audio file is obtained, the text information corresponding to the audio file may be further obtained to assist in identifying the human voice in the audio file. The text information includes multiple words, which correspond to the human voice.
- Step S102: Sequentially set each word in the text information as the target word, and obtain time information corresponding to the target word.
- the time information includes the start time of the target word and the end time of the target word.
- the time when the user starts and ends the singing is not necessarily completely synchronized with the time corresponding to the text information provided by the singing application.
- the start time of the word "dang” is 43000 milliseconds and the end time is 43300 milliseconds
- while the start time of the word "dang" as sung by the user is 42000 milliseconds and the end time is 42300 milliseconds.
- if the human voice is still detected according to the start time and end time corresponding to the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
- therefore, each word in the text information can be set as the target word in turn, the time information corresponding to the target word obtained, and the time information adjusted, so as to improve the accuracy of human voice recognition in the audio file.
- the time information includes the start time of the target word and the end time of the target word.
- Step S103: Determine multiple start adjustment times corresponding to the target word according to the start time of the target word, and determine multiple end adjustment times corresponding to the target word according to the end time of the target word.
- multiple time points can be selected as the start adjustment time within a period before and after the start time of the target word.
- multiple time points can be selected as the end adjustment time within a period before and after the end time of the target word.
- the start time of the target word is 10000 milliseconds and the end time is 10500 milliseconds
- the 9900th, 9950th, 10000th, 10050th, and 10100th milliseconds, around the 10000 ms start time, are used as the start adjustment times.
- the 10400th, 10450th, 10500th, 10550th, and 10600th milliseconds, around the 10500 ms end time, are used as the end adjustment times.
- Step S104: Recognize the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word.
- the target start adjustment time and the target end adjustment time satisfying the preset conditions may be selected from the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to form multiple target adjustment time groups.
- then, pitch recognition is performed on the audio file according to each target adjustment time group, and the recognized pitch is scored; the higher the quality of the pitch recognition under a target adjustment time group, the higher the score. That is, the pitch information of the target word can be obtained according to the highest-scoring target adjustment time group.
- here, the pitch of the human voice refers to how high or low the human voice sounds.
- the audio recognition method provided by the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to these start and end adjustment times, which improves the accuracy of audio recognition.
- FIG. 4 is another flowchart of an audio recognition method according to an embodiment of the present invention.
- the method may include:
- Step S201: Acquire an audio file and text information corresponding to the audio file.
- the text information includes multiple words.
- When a user uses a singing application to record a song, the accompaniment, the vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice in the audio file, the start time and end time of the human voice in the audio file need to be known.
- When a user uses a singing application to record a song, the application displays lyrics text information to prompt the user to sing. It can be roughly considered that the time when the lyrics start is the time when the user starts singing, and the time when the lyrics end is the time when the user ends singing. Therefore, after the audio file is obtained, the text information corresponding to the audio file may be further obtained to assist in identifying the human voice in the audio file.
- the text information includes multiple words, which correspond to the human voice.
- Step S202: Sequentially set each word in the text information as the target word, and obtain time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word.
- the time when the user starts and ends the singing is not necessarily completely synchronized with the time corresponding to the text information provided by the singing application.
- the start time of the word "dang” is 43000 milliseconds and the end time is 43300 milliseconds
- while the start time of the word "dang" as sung by the user is 42000 milliseconds and the end time is 42300 milliseconds.
- if the human voice is still detected according to the start time and end time corresponding to the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
- the time information includes the start time, the end time, and the duration of the target word.
- the lyrics include 15 words, which can be set as the target word in sequence.
- the word “dang” is first set as the target word, and the start time of the word “dang” can be obtained as 43000 milliseconds, the end time is 43300 milliseconds, and the duration is 300 milliseconds.
- in general, the duration corresponding to one word is about 100 milliseconds. If the duration corresponding to the target word is detected to be greater than 100 milliseconds, it can be considered that the target word spans multiple pitches, that is, one target word may correspond to multiple pitches, where pitch refers to how high the sound is. This one-word, multiple-pitch situation can be handled with the following steps:
- the duration of the target word can be calculated from the end time and start time of the target word. Specifically, assume that the start time of the target word is E and the end time is F; then the duration of the target word is (F - E).
- the duration corresponding to a single pitch can be obtained, so the preset duration can be set according to the duration corresponding to a single pitch, for example by setting the preset duration equal to the duration corresponding to a single pitch; the value of the preset duration is not specifically limited here.
- the target word needs to be split until each word in the text information corresponds to only one pitch.
- the target word can be split into a first target word and a second target word: the start time of the first target word is set to E, the end time of the first target word is set to (E + V), the start time of the second target word is set to (E + V), and the end time of the second target word is set to F.
- the duration of the first target word is then V, and the duration of the second target word is (F - E - V). The duration of the first target word is thus not greater than the preset duration V, so only the duration of the second target word needs to be checked again to see whether it is greater than the preset duration V.
- if the duration of the second target word is not greater than the preset duration V, the splitting stops; if the duration of the second target word is greater than the preset duration V, the second target word is split according to the above method of splitting the target word, which will not be repeated here, until the duration of each word in the text information is not greater than the preset duration V.
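The splitting rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the scheme (peel off a first piece of exactly the preset duration V, then re-check the remainder) is reconstructed from the statement that only the second target word needs to be checked again.

```python
def split_word(start_ms, end_ms, preset_ms):
    """Split the word interval [start_ms, end_ms] into pieces whose
    durations are not greater than preset_ms (the preset duration V)."""
    pieces = []
    # While the remaining duration exceeds V, split off a first piece of
    # exactly V milliseconds and keep checking the remainder.
    while end_ms - start_ms > preset_ms:
        pieces.append((start_ms, start_ms + preset_ms))
        start_ms += preset_ms
    pieces.append((start_ms, end_ms))
    return pieces
```

With the earlier example word "dang" (43000 ms to 43300 ms) and a 100 ms preset duration, this yields three pieces, each no longer than 100 ms.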
- Step S203: Obtain a preset time step and a preset maximum error value.
- the preset time step refers to the difference between two preset time points.
- the preset maximum error value refers to the maximum allowed deviation between an adjustment time and the corresponding preset time point. The larger the preset maximum error value, the more accurately the actual start time and actual end time of the target word can be determined, but the amount of calculation also becomes excessive, so the preset maximum error value can be set according to the actual situation.
- Step S204: Determine multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and determine multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
- the multiple start adjustment times include the 100th, 200th, 300th, 400th, 500th, 600th, and 700th milliseconds, and the multiple end adjustment times include the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
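Step S204 can be sketched as enumerating candidates at step-sized offsets on either side of the nominal time, up to the maximum error. The concrete values below (nominal start 400 ms, nominal end 800 ms, step 100 ms, maximum error 300 ms) are inferred from the example lists above, not stated explicitly in the patent.

```python
def adjustment_times(nominal_ms, step_ms, max_error_ms):
    """Candidate adjustment times from (nominal - max_error) to
    (nominal + max_error), spaced by the preset time step."""
    return list(range(nominal_ms - max_error_ms,
                      nominal_ms + max_error_ms + 1,
                      step_ms))

starts = adjustment_times(400, 100, 300)  # 100 ms .. 700 ms
ends = adjustment_times(800, 100, 300)    # 500 ms .. 1100 ms
```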
- Step S205: Select a target start adjustment time from the multiple start adjustment times of the target word, and select a target end adjustment time corresponding to the target start adjustment time from the multiple end adjustment times of the target word, to obtain multiple target adjustment time groups.
- a start adjustment time may be arbitrarily selected as the target start adjustment time from the multiple start adjustment times
- an end adjustment time may be arbitrarily selected as the target end adjustment time from the multiple end adjustment times
- for example, the 200th millisecond is selected as the target start adjustment time and the 800th millisecond is selected as the target end adjustment time.
- the target start adjustment time of 200 milliseconds and the target end adjustment time of 800 milliseconds can be used as a target adjustment time group.
- however, if the selected target start adjustment time is the 700th millisecond and the target end adjustment time is the 500th millisecond, the unreasonable situation arises in which the target start adjustment time of the target word is later than the target end adjustment time.
- to avoid this, the range of the multiple start adjustment times can be compared with the range of the multiple end adjustment times. If the two ranges overlap, the overlap region can be divided. As shown in FIG. 5, the overlap region runs from the 500th to the 700th millisecond, so its midpoint, the 600th millisecond, can be taken as the dividing line between the start adjustment times and the end adjustment times.
- after division, the start adjustment times include the 100th, 200th, 300th, 400th, 500th, and 600th milliseconds, and the end adjustment times include the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
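The overlap-division rule above can be sketched as follows; this is an illustrative reading of FIG. 5, assuming the midpoint of the overlap becomes the dividing line.

```python
def divide_overlap(start_times, end_times):
    """If the start-candidate and end-candidate ranges overlap, cut both
    lists at the midpoint of the overlap region."""
    overlap_lo = min(end_times)    # overlap begins at the earliest end candidate
    overlap_hi = max(start_times)  # overlap ends at the latest start candidate
    if overlap_hi <= overlap_lo:   # no overlap; nothing to divide
        return start_times, end_times
    boundary = (overlap_lo + overlap_hi) // 2
    return ([t for t in start_times if t <= boundary],
            [t for t in end_times if t >= boundary])
```

With start candidates at 100-700 ms and end candidates at 500-1100 ms, the overlap is 500-700 ms and the 600 ms midpoint becomes the dividing line, reproducing the lists above.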
- alternatively, a start adjustment time may be selected from the multiple start adjustment times of the target word as the target start adjustment time, and then all end adjustment times that are not less than the target start adjustment time may be selected from the multiple end adjustment times as the target end adjustment times corresponding to that target start adjustment time.
- for example, when the target start adjustment time is not later than the 500th millisecond, the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds can all be selected from the end adjustment times as the target end adjustment times.
- when the 600th millisecond is selected as the target start adjustment time, the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds can be selected from the end adjustment times as the target end adjustment times. This also effectively avoids the unreasonable situation in which the target start adjustment time of the target word is later than the target end adjustment time.
- each pair consisting of a target start adjustment time and its corresponding target end adjustment time is regarded as a target adjustment time group.
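Forming the target adjustment time groups under the "end not earlier than start" rule can be sketched as a simple pairing; this is an illustration of the selection rule, not the patent's exact procedure.

```python
def target_groups(start_times, end_times):
    """Pair each target start adjustment time only with end adjustment
    times that are not earlier than it, ruling out unreasonable groups
    such as (700 ms, 500 ms)."""
    return [(s, e) for s in start_times for e in end_times if e >= s]
```

For start candidates 100-700 ms and end candidates 500-1100 ms, every produced group has its start no later than its end.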
- Step S206: Determine the pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets.
- the pitch probability set includes pitch, probability, and the association between the two.
- the audio file may be identified according to the target start adjustment time and the target end adjustment time in the target adjustment time group to obtain a pitch probability set.
- the steps of establishing the pitch probability set are as follows:
- the audio file between the 100th and 300th milliseconds is divided into 4 sampling intervals, in which the pitch measured in the 100 ms - 150 ms sampling interval is m2, the pitch measured in the 150 ms - 200 ms sampling interval is m4, the pitch measured in the 200 ms - 250 ms sampling interval is m3, and the pitch measured in the 250 ms - 300 ms sampling interval is m1.
- the pitch in each sampling interval may be measured by processing the audio file with a neural network algorithm to obtain the pitch corresponding to that sampling interval.
- the pitch probability set corresponding to the target adjustment time group can then be obtained.
- the pitch probability set can also be stored in the form of Table 1 below.
- in this way, the pitch probability set corresponding to each target adjustment time group can be obtained, that is, multiple pitch probability sets can be obtained, for example as shown in Table 2 below:
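A hedged sketch of building one pitch probability set. The patent does not spell out how the probabilities are computed (they might equally come from the neural network's per-interval confidence); here each pitch's probability is assumed to be the fraction of sampling intervals in which it was measured.

```python
from collections import Counter

def pitch_probability_set(interval_pitches):
    """Map each measured pitch to the fraction of sampling intervals in
    which it occurs, giving the pitch-probability association."""
    counts = Counter(interval_pitches)
    total = len(interval_pitches)
    return {pitch: n / total for pitch, n in counts.items()}

# The four sampling intervals of the example measure m2, m4, m3, m1 once each.
probs = pitch_probability_set(["m2", "m4", "m3", "m1"])
```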
- Step S207: Score the multiple pitch probability sets, and select the pitch probability set with the highest score.
- U i represents the target start adjustment time in the i-th target adjustment time group
- V i represents the target end adjustment time in the i-th target adjustment time group
- i is a positive integer
- Y represents the start time of the target word
- Z represents the end time of the target word
- Q represents the error deduction coefficient
- T i represents the first probability corresponding to the i-th error deduction value R i
- O i represents the second probability corresponding to the i-th error deduction value R i . It should be noted that the more the first probability exceeds the second probability, the more accurate the pitch recognition of the audio under that target adjustment time group, that is, the greater the score S i .
- as shown in the correspondence between the target adjustment time groups and the pitch probability sets in Table 2, suppose the error deduction coefficient Q is 0.0001, the end time Z of the target word is 300 milliseconds, and the start time Y of the target word is 100 milliseconds.
- then the error deduction value R 1 corresponding to target adjustment time group 1 is 0, the error deduction value R 2 corresponding to target adjustment time group 2 is 0.01, and the error deduction value R 3 corresponding to target adjustment time group 3 is 0.01.
- first, the error deduction value R 1 is set as the target error deduction value, and the first probability T 1 and the second probability O 1 are obtained from the pitch probability set corresponding to R 1 , where T 1 is the largest probability in the set and O 1 is the second-largest. Finally, the pitch probability set corresponding to R 1 is scored according to the first probability T 1 , the second probability O 1 , and the target error deduction value R 1 , and the score S 1 is obtained.
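The scoring step can be sketched as follows. The patent lists the symbols but the formula text itself did not survive extraction, so both expressions below are assumptions: the deduction R_i is chosen to match the worked numbers (with Q = 0.0001, Y = 100 ms, Z = 300 ms, a group at exactly (100, 300) gets R = 0 and a group deviating by 100 ms in total gets R = 0.01), and the score simply rewards a dominant first probability minus the deduction.

```python
def error_deduction(u_i, v_i, y, z, q):
    """R_i: grows with the total deviation of the group (U_i, V_i) from
    the nominal start/end times (Y, Z), scaled by the coefficient Q."""
    return q * (abs(u_i - y) + abs(v_i - z))

def score(t_i, o_i, r_i):
    """S_i: larger when the first probability T_i far exceeds the second
    probability O_i, reduced by the error deduction R_i."""
    return (t_i - o_i) - r_i
```

Under these assumptions, a set whose top pitch clearly dominates the runner-up scores higher than one where the two are close, matching the remark on S_i above.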
- Step S208: Generate the pitch information of the target word according to the pitch probability set with the highest score.
- assuming that pitch probability set 3 has the highest score, the pitch information of the target word is generated according to pitch probability set 3.
- from pitch probability set 3, the pitch with the highest probability is selected as the pitch of the target word, that is, m2 is used as the pitch of the target word.
- the audio recognition method provided by the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to these start and end adjustment times, which improves the accuracy of audio recognition.
- FIG. 7 is a structural diagram of an audio recognition device according to an embodiment of the present invention.
- the device 30 includes an acquisition module 301, a setting module 302, a first determination module 303, and an identification module 304.
- the obtaining module 301 is used to obtain audio files and text information corresponding to the audio files.
- the text information includes multiple words.
- When a user uses a singing application to record a song, accompaniment, vocals, and other sounds together form an audio file, in which these sounds exist as digital signals. To accurately identify the human voice in the audio file, the start time and end time of the human voice in the audio file must be known.
- When a user uses a singing application to record a song, the application displays lyric text to prompt the user to sing, so the time when a lyric starts is the time when the user starts singing it, and the time when the lyric ends is the time when the user stops singing it. Therefore, after obtaining the audio file, the obtaining module 301 may further obtain the text information corresponding to the audio file to assist in identifying the human voice in the audio file.
- the text information includes multiple words, which correspond to the human voice.
- the setting module 302 sequentially sets each word in the text information as the target word, and obtains time information corresponding to the target word.
- the time information includes the start time of the target word and the end time of the target word.
- However, the times at which the user actually starts and ends singing are not necessarily synchronized with the times given by the text information provided by the singing application.
- For example, the singing application may give the word "dang" a start time of 43000 milliseconds and an end time of 43300 milliseconds,
- while the user actually sings the word "dang" from 42000 milliseconds to 42300 milliseconds.
- If the human voice were still detected according to the start time and end time that the singing application provides for the lyric "dang", the accuracy of audio recognition would be reduced.
- Therefore, the setting module 302 sets each word in the text information as the target word in turn, obtains the time information corresponding to the target word, and adjusts that time information to improve the accuracy of human voice recognition in the audio file.
- The time information includes the start time, the end time, and the duration of the target word.
- the lyrics include 15 words, and the setting module 302 may sequentially set these 15 words as the target word. Specifically, the setting module 302 first sets the word "dang" as the target word, and can obtain that the start time of the word "dang" is 43000 milliseconds, the end time is 43300 milliseconds and the duration is 300 milliseconds.
- The duration corresponding to a word is normally about 100 milliseconds. If the setting module 302 detects that the duration corresponding to the target word is greater than 100 milliseconds, the target word may be considered polyphonic, that is, one target word may correspond to multiple pitches, where pitch refers to how high or low a tone is.
- the audio recognition device 30 is further provided with a second determination module 305, a split module 306, a determination module 307, and a continue split module 308.
- The second determination module 305 is used to determine whether the duration of the target word is greater than the preset duration; the splitting module 306 is used to split the target word when it is, and to determine the duration of the split target word; the determination module 307 is used to re-determine whether the duration of the split target word is greater than the preset duration; the continue-splitting module 308 is used to keep splitting the split target word while it remains greater than the preset duration, until the duration of each word in the text information is not greater than the preset duration.
- The duration of the target word can be calculated from its end time and start time. Specifically, if the start time of the target word is E and its end time is F, then the duration of the target word is (F - E).
- Since the duration corresponding to a single pitch can be obtained, the preset duration can be set according to it, for example by setting the preset duration equal to the duration corresponding to a single pitch; the value of the preset duration is not specifically limited here.
- If the second determination module 305 determines that the duration of the target word is greater than the preset duration, it indicates that the target word may correspond to multiple pitches. Therefore, the target word needs to be split until each word in the text information corresponds to only one pitch.
- The splitting module 306 can split the target word into a first target word and a second target word: the start time of the first target word is set to E, the end time of the first target word and the start time of the second target word are both set to the same split point, and the end time of the second target word is set to F.
- The split point is chosen so that the duration of the first target word does not exceed the preset duration V; therefore only the duration of the second target word needs to be re-checked against the preset duration V by the determination module 307.
- If it is still greater, the continue-splitting module 308 splits the second target word in the same way as the target word was split, which is not repeated here, until the duration of each word in the text information is not greater than the preset duration V.
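The splitting loop of modules 306-308 can be sketched as follows. The text does not state exactly where the split point lies, so cutting off a first piece of exactly the preset duration is an assumption made here for illustration.

```python
def split_word(start, end, preset_duration):
    """Split a word's [start, end] span into pieces no longer than preset_duration.

    Mirrors modules 306-308: split off a first target word, then re-check the
    remainder and keep splitting. Taking each first piece to be exactly
    preset_duration long is an illustrative assumption.
    """
    pieces = []
    while end - start > preset_duration:
        pieces.append((start, start + preset_duration))
        start += preset_duration
    pieces.append((start, end))
    return pieces

# The word "dang" from the example: 43000-43300 ms, with a 100 ms preset duration.
print(split_word(43000, 43300, 100))  # [(43000, 43100), (43100, 43200), (43200, 43300)]
```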
- the first determining module 303 is configured to determine a plurality of start adjustment times corresponding to the target word according to the start time of the target word, and determine a plurality of end adjustment times corresponding to the target word according to the end time of the target word.
- the first determination module 303 includes: an acquisition submodule 3031 and a determination submodule 3032.
- the obtaining submodule 3031 is used to obtain a preset time step and a preset maximum error value.
- The preset time step refers to the interval between two adjacent candidate time points. The smaller the preset time step, the more accurately the actual start time and actual end time of the target word can be determined, but the greater the amount of calculation, so the preset time step can be adjusted according to the actual situation.
- The preset maximum error value refers to the largest allowed deviation from the annotated time point. The larger the preset maximum error value, the more likely the actual start time and actual end time of the target word can be covered, but the greater the amount of calculation, so the preset maximum error value can be adjusted according to the actual situation.
- The determination submodule 3032 is used to determine the multiple start adjustment times corresponding to the target word according to the start time, the preset time step, and the preset maximum error value of the target word, and to determine the multiple end adjustment times corresponding to the target word according to the end time, the preset time step, and the preset maximum error value of the target word.
- The multiple start adjustment times include the 100th, 200th, 300th, 400th, 500th, 600th, and 700th milliseconds, and the multiple end adjustment times include the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
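The determination submodule's candidate times can be sketched as a simple arithmetic range. The base time of 400 ms, the 100 ms step, and the 300 ms maximum error below are assumptions chosen only to reproduce the start adjustment times of the example.

```python
def adjustment_times(base_time_ms, step_ms, max_error_ms):
    # Candidate times spaced step_ms apart within +/- max_error_ms of the base time.
    return list(range(base_time_ms - max_error_ms,
                      base_time_ms + max_error_ms + 1,
                      step_ms))

# Assumed start time of 400 ms, step 100 ms, maximum error 300 ms:
print(adjustment_times(400, 100, 300))  # [100, 200, 300, 400, 500, 600, 700]
```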
- the recognition module 304 is used to recognize the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word.
- the identification module 304 includes: a selection submodule 3041, a obtaining submodule 3042, a scoring submodule 3043, and a generation submodule 3044.
- The selection submodule 3041 is used to select a target start adjustment time from the multiple start adjustment times of the target word, and to select, from the multiple end adjustment times of the target word, the target end adjustment time corresponding to the target start adjustment time, obtaining multiple target adjustment time groups.
- the selection submodule 3041 may arbitrarily select a start adjustment time from the plurality of start adjustment times as the target start adjustment time, and select an end adjustment time from the plurality of end adjustment times as the target end adjustment time .
- For example, the selection submodule 3041 can select the 200th millisecond as the target start adjustment time from start adjustment times such as the 100th, 200th, and 300th milliseconds, and select the 800th millisecond as the target end adjustment time from end adjustment times such as the 700th, 800th, and 900th milliseconds; the target start adjustment time of 200 milliseconds and the target end adjustment time of 800 milliseconds then form one target adjustment time group.
- However, if the target start adjustment time selected by the selection submodule 3041 is the 700th millisecond and the target end adjustment time is the 500th millisecond, the unreasonable situation arises in which the target start adjustment time of the target word is greater than the target end adjustment time.
- To avoid this, the selection submodule 3041 can compare the value range of the multiple start adjustment times with the value range of the multiple target end adjustment times; if the two overlap, the overlapping area can be divided, as shown in FIG.
- In this example, the overlapping area runs from the 500th millisecond to the 700th millisecond,
- so the selection submodule 3041 can take the middle of the overlapping area, the 600th millisecond, as the boundary between the target start adjustment times and the target end adjustment times, that is, a compromise division. After division, the multiple target start adjustment times include the 100th, 200th, 300th, 400th, 500th, and 600th milliseconds, and the multiple target end adjustment times include the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
- Alternatively, the selection submodule 3041 may select a start adjustment time from the multiple start adjustment times of the target word as the target start adjustment time,
- and then take every end adjustment time that is not less than the target start adjustment time as a target end adjustment time corresponding to that target start adjustment time.
- For example, for a target start adjustment time of 500 milliseconds, the selection submodule 3041 can take the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds from the end adjustment times as the target end adjustment times.
- For a target start adjustment time of 600 milliseconds, the selection submodule 3041 can take the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds from the end adjustment times as the target end adjustment times. This also effectively avoids the unreasonable situation in which the target start adjustment time of the target word is greater than the target end adjustment time.
- the target start adjustment time and target end adjustment time are regarded as the target adjustment time group.
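The second selection strategy, pairing each target start adjustment time only with end adjustment times that are not smaller than it, can be sketched as:

```python
def target_adjustment_time_groups(start_times, end_times):
    # Keep only (start, end) pairs in which the end adjustment time is not
    # smaller than the start adjustment time, avoiding groups where the word
    # would end before it starts.
    return [(s, e) for s in start_times for e in end_times if e >= s]

starts = [100, 200, 300, 400, 500, 600, 700]
ends = [500, 600, 700, 800, 900, 1000, 1100]
groups = target_adjustment_time_groups(starts, ends)
print(len(groups))  # 46
```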
- the obtaining submodule 3042 is used to determine the pitch probability set corresponding to each target adjustment time group, and obtain multiple pitch probability sets.
- the pitch probability set includes pitch, probability, and the relationship between the two.
- the obtaining submodule 3042 may identify the audio file according to the target start adjustment time and the target end adjustment time in the target adjustment time group to obtain a pitch probability set.
- The steps by which the obtaining submodule 3042 establishes the pitch probability set are as follows:
- divide the audio file into multiple sampling intervals according to the target adjustment time group; obtain the pitch corresponding to each sampling interval and the probability corresponding to that pitch; and store the pitch, the probability, and the relationship between the two to generate the pitch probability set corresponding to the target adjustment time group.
- For example, the obtaining submodule 3042 can divide the audio file between the 100th and 300th milliseconds into 4 sampling intervals, where the pitch measured in the 100-150 millisecond sampling interval is m2, and the pitch measured in the 150-200 millisecond sampling interval is m4,
- the pitch measured in the 200-250 millisecond sampling interval is m3, and the pitch measured in the 250-300 millisecond sampling interval is m1.
- The pitch in each sampling interval can be measured by processing the interval with a neural network algorithm to obtain the pitch corresponding to that sampling interval.
- In this way, the obtaining submodule 3042 can obtain the pitch probability set corresponding to the target adjustment time group; the pitch probability set can also be stored as shown in Table 1.
- Likewise, the obtaining submodule 3042 can obtain a pitch probability set corresponding to each target adjustment time group, that is, multiple pitch probability sets, as shown in Table 2.
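The obtaining submodule's steps can be sketched as follows. Treating each pitch's probability as its relative frequency over the sampling intervals is an assumption, since the text does not specify how the probabilities are computed.

```python
from collections import Counter

def pitch_probability_set(interval_pitches):
    # Count the pitch detected in each sampling interval, then store each
    # pitch with its probability (here assumed to be its relative frequency).
    counts = Counter(interval_pitches)
    total = len(interval_pitches)
    return {pitch: n / total for pitch, n in counts.items()}

# The four 50 ms sampling intervals between 100 ms and 300 ms from the example:
print(pitch_probability_set(["m2", "m4", "m3", "m1"]))
```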
- The scoring submodule 3043 is used to score the multiple pitch probability sets and select the pitch probability set with the highest score.
- the scoring submodule 3043 is specifically used to:
- obtain multiple error deduction values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word; sequentially set each of the multiple error deduction values as the target error deduction value, and obtain the first probability and the second probability from the pitch probability set corresponding to the target error deduction value, where the first probability is the largest probability and the second probability is the second-largest probability;
- according to the first probability, the second probability, and the target error deduction value, score the pitch probability set corresponding to the target error deduction value.
- U i represents the target start adjustment time in the i-th target adjustment time group
- V i represents the target end adjustment time in the i-th target adjustment time group
- i is a positive integer
- Y represents the start time of the target word
- Z represents the end time of the target word
- Q represents the error deduction coefficient
- T i represents the first probability corresponding to the i-th error deduction value R i
- O i represents the second probability corresponding to the i-th error deduction value R i . It should be noted that the more the first probability exceeds the second probability, the more accurately the human voice in the audio is recognized according to the target adjustment time group, that is, the greater the score S i .
- The scoring submodule 3043 may obtain that the error deduction value R 1 corresponding to target adjustment time group 1 is 0, the error deduction value R 2 corresponding to target adjustment time group 2 is 0.01, and the error deduction value R 3 corresponding to target adjustment time group 3 is 0.01.
- Taking the second error deduction value R 2 as the target error deduction value, the first probability T 2 and the second probability O 2 are obtained from pitch probability set 2; then, pitch probability set 2 corresponding to R 2 is scored according to the first probability T 2 , the second probability O 2 , and the target error deduction value R 2 , obtaining score S 2 .
- Similarly, the scoring submodule 3043 can score pitch probability set 3 corresponding to the target error deduction value R 3 , obtaining score S 3 ; the detailed calculation process is not repeated here.
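The text elides the scoring formula itself, so the sketch below is a reconstruction under stated assumptions: the error deduction value R is taken as the coefficient Q times the total deviation of the group's times (U, V) from the annotated start Y and end Z, which reproduces R = 0 for the group matching Y = 100 ms and Z = 300 ms, and the score is taken as the first probability minus the second probability minus R.

```python
def error_deduction(u, v, y, z, q=0.0001):
    # Assumed form of the error deduction value R_i: the error deduction
    # coefficient Q times how far the target adjustment time group (U_i, V_i)
    # deviates from the target word's annotated start Y and end Z.
    return q * (abs(u - y) + abs(v - z))

def score(first_probability, second_probability, deduction):
    # Assumed scoring rule S_i: the wider the margin of the first probability
    # over the second, and the smaller the deduction, the higher the score.
    return first_probability - second_probability - deduction

# A group matching the annotated times (Y = 100 ms, Z = 300 ms) is not penalized:
print(error_deduction(100, 300, 100, 300))  # 0.0
```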
- the generating sub-module 3044 is used for generating pitch information of the target word according to the pitch probability set with the highest score.
- The generation submodule 3044 compares the scores S 1 , S 2 , and S 3 , from which it can be seen that the score of pitch probability set 3 is the highest. Therefore, the generation submodule 3044 generates the pitch information of the target word according to pitch probability set 3. Specifically, the generation submodule 3044 selects, from pitch probability set 3, the pitch with the highest probability as the pitch of the target word, that is, m2 is used as the pitch of the target word.
- The audio recognition device of the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to those start and end adjustment times, improving the accuracy of audio recognition.
- An embodiment of the present invention also provides an electronic device. FIG. 10 shows a schematic structural diagram of the electronic device involved in the embodiment of the present invention. Specifically:
- the electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components.
- The processor 401 is the control center of the electronic device. It connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes its data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
- The processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like,
- and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.
- the memory 402 may be used to store software programs and modules.
- the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402.
- The memory 402 may mainly include a storage program area and a storage data area, where the storage program area may store the operating system, application programs required by at least one function (such as a sound playback function or an image playback function), and the like, and the storage data area may store data created according to the use of the electronic device, and the like.
- the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Accordingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the memory 402.
- the electronic device further includes a power supply 403 that supplies power to various components.
- the power supply 403 can be logically connected to the processor 401 through a power management system, so as to realize functions such as charging, discharging, and power management through the power management system.
- the power supply 403 may also include any component such as one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
- the electronic device may further include an input unit 404, which may be used to receive input digital or character information, and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
- the electronic device may further include a display unit and the like, which will not be repeated here.
- The processor 401 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing various functions as follows:
- obtain an audio file and the text information corresponding to the audio file, the text information including multiple words;
- sequentially set each word in the text information as the target word, and obtain the time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word;
- according to the start time of the target word, determine multiple start adjustment times corresponding to the target word, and according to the end time of the target word, determine multiple end adjustment times corresponding to the target word;
- according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word, recognize the audio file to obtain pitch information of the target word.
- The electronic device can achieve the beneficial effects achievable by any audio recognition apparatus provided in the embodiments of the present invention; for details, see the foregoing embodiments, which are not described here again.
- The electronic device first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to the multiple start adjustment times and the multiple end adjustment times, improving the accuracy of audio recognition.
- The one or more operations may constitute computer-readable instructions stored on one or more computer-readable media which, when executed by an electronic device, cause the electronic device to perform the operations.
- The order in which some or all of the operations are described should not be interpreted as implying that these operations must be performed in that order. Those skilled in the art will understand alternative orderings having the benefit of this specification. Moreover, it should be understood that not all operations are necessarily present in every embodiment provided herein.
- Each functional unit in the embodiment of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module.
- the above integrated modules can be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
- the storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disk.
- the above devices or systems may execute the methods in the corresponding method embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
Description
Claims (13)
- An audio recognition method, comprising: acquiring an audio file and text information corresponding to the audio file, the text information comprising multiple words; sequentially setting each word in the text information as a target word, and acquiring time information corresponding to the target word, the time information comprising a start time of the target word and an end time of the target word; determining, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determining, according to the end time of the target word, multiple end adjustment times corresponding to the target word; and recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word, to obtain pitch information of the target word.
- The audio recognition method according to claim 1, wherein the step of determining, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determining, according to the end time of the target word, multiple end adjustment times corresponding to the target word comprises: acquiring a preset time step and a preset maximum error value; and determining the multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and determining the multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
- The audio recognition method according to claim 1, wherein the step of recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word comprises: selecting a target start adjustment time from the multiple start adjustment times of the target word, and selecting, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups; determining a pitch probability set corresponding to each target adjustment time group, to obtain multiple pitch probability sets, each pitch probability set comprising pitches, probabilities, and the association between the two; scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score; and generating the pitch information of the target word according to the pitch probability set with the highest score.
- The audio recognition method according to claim 3, wherein the step of scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score comprises: obtaining multiple error deduction values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word; sequentially setting each of the multiple error deduction values as a target error deduction value, and acquiring a first probability and a second probability from the pitch probability set corresponding to the target error deduction value, wherein the first probability is the largest probability and the second probability is the second-largest probability; and scoring the pitch probability set corresponding to the target error deduction value according to the first probability, the second probability, and the target error deduction value.
- The audio recognition method according to claim 3, wherein the step of determining a pitch probability set corresponding to each target adjustment time group, to obtain multiple pitch probability sets, each pitch probability set comprising pitches, probabilities, and the association between the two, comprises: dividing the audio file into multiple sampling intervals according to the target adjustment time group; acquiring the pitch corresponding to each sampling interval and the probability corresponding to the pitch; and storing the pitch, the probability, and the association between the two, to generate the pitch probability set corresponding to the target adjustment time group.
- The audio recognition method according to claim 1, wherein the time information corresponding to the target word further comprises a duration of the target word; and after the step of sequentially setting each word in the text information as the target word and acquiring the time information corresponding to the target word, the time information comprising the start time of the target word and the end time of the target word, the method further comprises: determining whether the duration of the target word is greater than a preset duration; if it is greater than the preset duration, splitting the target word and determining the duration of the split target word; re-determining whether the duration of the split target word is greater than the preset duration; and if it is greater than the preset duration, continuing to split the split target word until the duration of each word in the text information is not greater than the preset duration.
- An audio recognition device, comprising: an acquisition module, configured to acquire an audio file and text information corresponding to the audio file, the text information comprising multiple words; a setting module, configured to sequentially set each word in the text information as a target word and acquire time information corresponding to the target word, the time information comprising a start time of the target word and an end time of the target word; a first determination module, configured to determine, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and to determine, according to the end time of the target word, multiple end adjustment times corresponding to the target word; and a recognition module, configured to recognize the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word, to obtain pitch information of the target word.
- The audio recognition device according to claim 7, wherein the first determination module comprises: an acquisition submodule, configured to acquire a preset time step and a preset maximum error value; and a determination submodule, configured to determine the multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and to determine the multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
- The audio recognition device according to claim 7, wherein the recognition module comprises: a selection submodule, configured to select a target start adjustment time from the multiple start adjustment times of the target word, and to select, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups; an obtaining submodule, configured to determine a pitch probability set corresponding to each target adjustment time group, to obtain multiple pitch probability sets, each pitch probability set comprising pitches, probabilities, and the association between the two; a scoring submodule, configured to score the multiple pitch probability sets and select the pitch probability set with the highest score; and a generation submodule, configured to generate the pitch information of the target word according to the pitch probability set with the highest score.
- The audio recognition device according to claim 9, wherein the scoring submodule is specifically configured to: obtain multiple error deduction values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word; sequentially set each of the multiple error deduction values as a target error deduction value, and acquire a first probability and a second probability from the pitch probability set corresponding to the target error deduction value; and score the pitch probability set corresponding to the target error deduction value according to the first probability, the second probability, and the target error deduction value.
- The audio recognition device according to claim 9, wherein the obtaining submodule is specifically configured to: divide the audio file into multiple sampling intervals according to the target adjustment time group; acquire the pitch corresponding to each sampling interval and the probability corresponding to the pitch; and store the pitch, the probability, and the association between the two, to generate the pitch probability set corresponding to the target adjustment time group.
- The audio recognition device according to claim 7, wherein the audio recognition device further comprises: a second determination module, configured to determine whether the duration of the target word is greater than a preset duration; a splitting module, configured to split the target word when it is greater than the preset duration and determine the duration of the split target word; a determination module, configured to re-determine whether the duration of the split target word is greater than the preset duration; and a continue-splitting module, configured to continue splitting the split target word when it is greater than the preset duration, until the duration of each word in the text information is not greater than the preset duration.
- A storage medium storing processor-executable instructions therein, wherein a processor, by executing the instructions, provides the audio recognition method according to any one of claims 1 to 6.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811198963.1A CN108962286B (zh) | 2018-10-15 | 2018-10-15 | 音频识别方法、装置及存储介质 |
CN201811198963.1 | 2018-10-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020078120A1 true WO2020078120A1 (zh) | 2020-04-23 |
Family
ID=64480972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/103883 WO2020078120A1 (zh) | 2018-10-15 | 2019-08-30 | 音频识别方法、装置及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108962286B (zh) |
WO (1) | WO2020078120A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108962286B (zh) * | 2018-10-15 | 2020-12-01 | 腾讯音乐娱乐科技(深圳)有限公司 | 音频识别方法、装置及存储介质 |
CN110335629B (zh) * | 2019-06-28 | 2021-08-03 | 腾讯音乐娱乐科技(深圳)有限公司 | 音频文件的音高识别方法、装置以及存储介质 |
CN111063372B (zh) * | 2019-12-30 | 2023-01-10 | 广州酷狗计算机科技有限公司 | 确定音高特征的方法、装置、设备及存储介质 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105788589A (zh) * | 2016-05-04 | 2016-07-20 | 腾讯科技(深圳)有限公司 | 一种音频数据的处理方法及装置 |
CN107507628A (zh) * | 2017-08-31 | 2017-12-22 | 广州酷狗计算机科技有限公司 | 唱歌评分方法、装置及终端 |
EP3316257A1 (en) * | 2016-10-28 | 2018-05-02 | Fujitsu Limited | Pitch extraction device and pitch extraction method |
CN108008930A (zh) * | 2017-11-30 | 2018-05-08 | 广州酷狗计算机科技有限公司 | 确定k歌分值的方法和装置 |
CN108206026A (zh) * | 2017-12-05 | 2018-06-26 | 北京小唱科技有限公司 | 确定音频内容音高偏差的方法及装置 |
CN108962286A (zh) * | 2018-10-15 | 2018-12-07 | 腾讯音乐娱乐科技(深圳)有限公司 | 音频识别方法、装置及存储介质 |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101149957B (zh) * | 2007-09-30 | 2010-06-23 | 炬力集成电路设计有限公司 | 一种音字同步播放的方法及播放器 |
CN102737685A (zh) * | 2011-04-15 | 2012-10-17 | 盛乐信息技术(上海)有限公司 | 歌词滚动播放系统及其实现方法 |
US20120290285A1 (en) * | 2011-05-09 | 2012-11-15 | Gao-Peng Wang | Language learning device for expanding vocaburary with lyrics |
CN102982832B (zh) * | 2012-11-24 | 2015-05-27 | 安徽科大讯飞信息科技股份有限公司 | 一种在线卡拉ok伴奏、人声与字幕的同步方法 |
CN104091595B (zh) * | 2013-10-15 | 2017-02-15 | 广州酷狗计算机科技有限公司 | 一种音频处理方法及装置 |
US9064484B1 (en) * | 2014-03-17 | 2015-06-23 | Singon Oy | Method of providing feedback on performance of karaoke song |
CN105702240B (zh) * | 2014-11-25 | 2019-09-03 | 广州酷狗计算机科技有限公司 | 智能终端调整歌曲伴奏音乐的方法和装置 |
CN104967900B (zh) * | 2015-05-04 | 2018-08-07 | 腾讯科技(深圳)有限公司 | 一种生成视频的方法和装置 |
-
2018
- 2018-10-15 CN CN201811198963.1A patent/CN108962286B/zh active Active
-
2019
- 2019-08-30 WO PCT/CN2019/103883 patent/WO2020078120A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108962286B (zh) | 2020-12-01 |
CN108962286A (zh) | 2018-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020078120A1 (zh) | 音频识别方法、装置及存储介质 | |
US10261965B2 (en) | Audio generation method, server, and storage medium | |
WO2020177190A1 (zh) | 一种处理方法、装置及设备 | |
US20220366880A1 (en) | Method and electronic device for recognizing song, and storage medium | |
CN106782601B (zh) | 一种多媒体数据处理方法及其装置 | |
WO2017157319A1 (zh) | 音频信息处理方法及装置 | |
US8892565B2 (en) | Method and apparatus for accessing an audio file from a collection of audio files using tonal matching | |
CN108766451B (zh) | 一种音频文件处理方法、装置和存储介质 | |
US11511200B2 (en) | Game playing method and system based on a multimedia file | |
US20180210952A1 (en) | Music track search method, music track search device, and computer readable recording medium | |
CN105825872B (zh) | 歌曲的难度确定方法和装置 | |
CN106887233B (zh) | 音频数据处理方法及系统 | |
WO2020199384A1 (zh) | 音频识别方法、装置、设备及存储介质 | |
US10964301B2 (en) | Method and apparatus for correcting delay between accompaniment audio and unaccompanied audio, and storage medium | |
CN110010159B (zh) | 声音相似度确定方法及装置 | |
CN111785238A (zh) | 音频校准方法、装置及存储介质 | |
US9940326B2 (en) | System and method for speech to speech translation using cores of a natural liquid architecture system | |
CN108170845B (zh) | 多媒体数据处理方法、装置及存储介质 | |
EP3979241B1 (en) | Audio clip matching method and apparatus, computer-readable medium and electronic device | |
US20120053937A1 (en) | Generalizing text content summary from speech content | |
CN108829370B (zh) | 有声资源播放方法、装置、计算机设备及存储介质 | |
CN110070891A (zh) | 一种歌曲识别方法、装置以及存储介质 | |
JPH0736478A (ja) | 音符列間類似度計算装置 | |
CN107025902B (zh) | 数据处理方法及装置 | |
CN113674725B (zh) | 音频混音方法、装置、设备及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19873560 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.06.2021) |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 120A DATED 15.12.2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19873560 Country of ref document: EP Kind code of ref document: A1 |