WO2020078120A1 - Audio recognition method, device and storage medium - Google Patents

Audio recognition method, device and storage medium

Info

Publication number
WO2020078120A1
WO2020078120A1, PCT/CN2019/103883, CN2019103883W
Authority
WO
WIPO (PCT)
Prior art keywords
target
time
target word
pitch
probability
Prior art date
Application number
PCT/CN2019/103883
Other languages
English (en)
French (fr)
Inventor
黄安麒
李深远
董治
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯音乐娱乐科技(深圳)有限公司
Publication of WO2020078120A1 publication Critical patent/WO2020078120A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/091 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for performance evaluation, i.e. judging, grading or scoring the musical qualities or faithfulness of a performance, e.g. with respect to pitch, tempo or other timings of a reference performance
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G10L2025/906 Pitch tracking

Definitions

  • the invention relates to the field of information technology, in particular to an audio recognition method, device and storage medium.
  • the terminal can also score the user's singing audio for the user's reference.
  • The start time and end time of the lyrics are generally regarded as the times when the person starts and ends singing. In the actual singing process, however, some people may start singing earlier than the start time of the lyrics, and some may start later.
  • Embodiments of the present invention provide an audio recognition method, device, and storage medium, which can improve the accuracy of audio recognition.
  • An embodiment of the present invention provides an audio recognition method, which includes:
  • each word in the text information is set as the target word in turn, and time information corresponding to the target word is obtained, the time information including the start time of the target word and the end time of the target word;
  • the audio file is identified according to multiple start adjustment times of the target word and multiple end adjustment times of the target word to obtain pitch information of the target word.
  • The step of determining the multiple start adjustment times corresponding to the target word according to the start time of the target word, and determining the multiple end adjustment times corresponding to the target word according to the end time of the target word, includes:
  • The step of recognizing the audio file according to a plurality of start adjustment times of the target word and a plurality of end adjustment times of the target word to obtain pitch information of the target word includes:
  • the pitch information of the target word is generated according to the highest-score pitch probability set.
  • The step of scoring the plurality of pitch probability sets and selecting the highest-score pitch probability set includes:
  • The plurality of error deduction values are set as the target error deduction value in turn, and a first probability and a second probability are obtained from the pitch probability set corresponding to the target error deduction value, wherein the first probability is the largest probability and the second probability is the second-largest probability;
  • the pitch probability set corresponding to the target error deduction value is scored.
  • The step of determining the pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets, the pitch probability set including pitches, probabilities, and the association between the two, includes:
  • the pitch, the probability, and the association relationship between the two are stored to generate a pitch probability set corresponding to the target adjustment time group.
  • The time information corresponding to the target word further includes the duration of the target word; the step of setting each word in the text information as the target word in turn and obtaining the time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word, further includes:
  • the split target word continues to be split until the duration of each word in the text information is not greater than the preset duration.
  • An embodiment of the present invention also provides an audio recognition device, including:
  • An acquisition module for acquiring an audio file and text information corresponding to the audio file, the text information including multiple words;
  • a setting module for sequentially setting each word in the text information as the target word, and acquiring time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word;
  • the first determining module is configured to determine a plurality of start adjustment times corresponding to the target word according to the start time of the target word, and determine a plurality of end adjustments corresponding to the target word according to the end time of the target word time;
  • the recognition module is configured to recognize the audio file according to a plurality of start adjustment times of the target word and a plurality of end adjustment times of the target word to obtain pitch information of the target word.
  • the first determining module includes:
  • Acquisition submodule used to acquire the preset time step and the preset maximum error value
  • the identification module includes:
  • a selection sub-module for selecting a target start adjustment time from a plurality of start adjustment times of the target word, and selecting a target end adjustment time corresponding to the target start adjustment time from a plurality of end adjustment times of the target word, to obtain multiple target adjustment time groups;
  • a scoring submodule for scoring the plurality of pitch probability sets, and selecting the highest-score pitch probability set;
  • a generation submodule is used to generate pitch information of the target word according to the highest-score pitch probability set.
  • the scoring sub-module is specifically used to:
  • the pitch probability set corresponding to the target error deduction value is scored.
  • the obtaining submodule is specifically used for:
  • the pitch, the probability, and the association relationship between the two are stored to generate a pitch probability set corresponding to the target adjustment time group.
  • the audio recognition device further includes:
  • a second determination module configured to determine whether the duration of the target word is greater than a preset duration
  • a splitting module configured to split the target word when it is greater than a preset duration, and determine the duration of the split target word
  • a determining module used to re-determine whether the duration of the split target word is greater than a preset duration
  • a continuation splitting module, which is used to continue splitting the split target word when its duration is greater than the preset duration, until the duration of each word in the text information is not greater than the preset duration.
  • An embodiment of the present invention also provides a storage medium in which processor-executable instructions are stored, and the processor performs any of the above audio recognition methods by executing the instructions.
  • The audio recognition method, device, and storage medium of the embodiments of the present invention first determine a plurality of start adjustment times and a plurality of end adjustment times according to the start time and end time corresponding to the target word, and then recognize the audio file according to the determined start and end adjustment times, which improves the accuracy of audio recognition.
  • FIG. 1 is a schematic diagram of a first scenario of an audio recognition method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of an audio recognition method provided by an embodiment of the present invention.
  • FIG. 3 is another schematic diagram of an audio recognition method provided by an embodiment of the present invention.
  • FIG. 4 is another schematic flowchart of an audio recognition method according to an embodiment of the present invention.
  • FIG. 5 is another schematic diagram of an audio recognition method according to an embodiment of the present invention.
  • FIG. 6 is another schematic diagram of an audio recognition method according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an audio recognition device according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a first determining module provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an identification module provided by an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of a scene of an audio recognition method provided by an embodiment of the present invention.
  • an audio recognition apparatus may be implemented as an entity, or may be implemented by being integrated in an electronic device such as a terminal or server.
  • this scenario may include a terminal a and a server b.
  • User A can record songs and generate audio files through the singing application H integrated in the terminal a.
  • the terminal a may obtain text information corresponding to the audio file from the server b, specifically including lyrics text information, and the text information includes multiple words.
  • each word in the text information has time information, specifically including the start time and end time of each word.
  • The start and end of a word correspond to the start and end of the vocal pitch of the person singing.
  • the terminal a sets each word in the text information as the target word, and further obtains time information corresponding to the target word from the server b.
  • the time information includes the start time of the target word and the end time of the target word. Because in the audio file recorded by the user, the start and end of the human voice is not necessarily completely synchronized with the start and end of the corresponding word. Therefore, multiple start adjustment times corresponding to the target word can be determined according to the start time of the target word, and multiple end adjustment times corresponding to the target word can be determined according to the end time of the target word. Finally, the terminal a then recognizes the audio file according to the multiple start adjustment times and the multiple end adjustment times to obtain pitch information of the target word.
  • Embodiments of the present invention provide an audio recognition method, device, and storage medium, which will be described in detail below.
  • An audio recognition method includes: acquiring an audio file and text information corresponding to the audio file, the text information includes a plurality of words; sequentially setting each word in the text information as a target word, and acquiring time information corresponding to the target word , The time information includes the start time of the target word and the end time of the target word; according to the start time of the target word, determine the multiple start adjustment time corresponding to the target word, and according to the end time of the target word, determine the multiple end corresponding to the target word Adjust the time; identify the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain the pitch information of the target word.
  • FIG. 2 is a flowchart of an audio recognition method according to an embodiment of the present invention.
  • the method may include:
  • Step S101 Acquire an audio file and text information corresponding to the audio file.
  • the text information includes multiple words.
  • When a user uses a singing application to record a song, the accompaniment, vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice in the audio file, the start time and end time of the human voice in the audio file need to be known.
  • When a user uses a singing application to record a song, the application displays lyrics text to prompt the user to sing. It can be roughly considered that the time when the lyrics start is the time when the user starts singing, and the time when the lyrics end is the time when the user ends singing. Therefore, after the audio file is obtained, the text information corresponding to the audio file may be further obtained to assist in identifying the human voice in the audio file. The text information includes multiple words, which correspond to the human voice.
  • Step S102 Each word in the text information is set as the target word in turn, and time information corresponding to the target word is obtained.
  • the time information includes the start time of the target word and the end time of the target word.
  • the time when the user starts and ends the singing is not necessarily completely synchronized with the time corresponding to the text information provided by the singing application.
  • For example, in the text information the word "dang" starts at 43000 milliseconds and ends at 43300 milliseconds,
  • while in the recording the user sings the word "dang" from 42000 milliseconds to 42300 milliseconds.
  • If the human voice is still detected according to the start and end times of the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
  • Therefore, each word in the text information can be set as the target word in turn, the time information corresponding to the target word can be obtained, and the time information can be adjusted to improve the accuracy of human voice recognition in the audio file.
  • the time information includes the start time of the target word and the end time of the target word.
  • Step S103 Multiple start adjustment times corresponding to the target word are determined according to the start time of the target word, and multiple end adjustment times corresponding to the target word are determined according to the end time of the target word.
  • multiple time points can be selected as the start adjustment time within a period before and after the start time of the target word.
  • multiple time points can be selected as the end adjustment time within a period before and after the end time of the target word.
  • For example, if the start time of the target word is 10000 milliseconds and the end time is 10500 milliseconds,
  • the 10050th and 10100th milliseconds can be used as start adjustment times,
  • and the 10400th, 10450th, 10500th, 10550th, and 10600th milliseconds, around the 10500th millisecond, can be used as end adjustment times.
  • Step S104 Identify the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word.
  • the target start adjustment time and the target end adjustment time satisfying the preset conditions may be selected from the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to form multiple target adjustment time groups.
  • Vocal pitch recognition is then performed on the audio file according to each target adjustment time group, and the recognized vocal pitch is scored: the higher the quality of the vocal pitch recognition for a target adjustment time group, the higher its score. The pitch information of the target word can thus be obtained according to the target adjustment time groups.
  • Here, the vocal pitch refers to the pitch of the human voice.
  • The audio recognition method provided by the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to those start and end adjustment times, improving the accuracy of audio recognition.
  • FIG. 4 is another flowchart of an audio recognition method according to an embodiment of the present invention.
  • the method may include:
  • Step S201 Acquire an audio file and text information corresponding to the audio file.
  • the text information includes multiple words.
  • When a user uses a singing application to record a song, the accompaniment, vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice in the audio file, the start time and end time of the human voice in the audio file need to be known.
  • When a user uses a singing application to record a song, the application displays lyrics text to prompt the user to sing. The time when the lyrics start can be taken as the time when the user starts singing, and the time when the lyrics end as the time when the user ends singing. Therefore, after the audio file is obtained, the text information corresponding to the audio file may be further obtained to assist in identifying the human voice in the audio file.
  • the text information includes multiple words, which correspond to the human voice.
  • Step S202 Each word in the text information is set as the target word in turn, and time information corresponding to the target word is obtained; the time information includes the start time of the target word and the end time of the target word.
  • the time when the user starts and ends the singing is not necessarily completely synchronized with the time corresponding to the text information provided by the singing application.
  • For example, in the text information the word "dang" starts at 43000 milliseconds and ends at 43300 milliseconds,
  • while in the recording the user sings the word "dang" from 42000 milliseconds to 42300 milliseconds.
  • If the human voice is still detected according to the start and end times of the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
  • The time information includes the start time, end time, and duration of the target word.
  • the lyrics include 15 words, which can be set as the target word in sequence.
  • the word “dang” is first set as the target word, and the start time of the word “dang” can be obtained as 43000 milliseconds, the end time is 43300 milliseconds, and the duration is 300 milliseconds.
  • Generally, the duration corresponding to a word is about 100 milliseconds. If the duration of the target word is detected to be greater than 100 milliseconds, the target word can be considered polyphonic, that is, one word may correspond to multiple pitches, where pitch refers to how high or low the sound is. The one-word-multiple-pitch situation can be handled with the following steps:
  • The duration of the target word can be calculated from its end time and start time. Specifically, assume the start time of the target word is E and the end time is F; the duration of the target word is then (F - E).
  • The duration corresponding to a single pitch can be obtained, so the preset duration V can be set according to it, for example by setting the preset duration equal to the duration corresponding to a single pitch. The value of the preset duration is not specifically limited here.
  • the target word needs to be split until each word in the text information corresponds to only one pitch.
  • If the duration of the target word is greater than the preset duration V, the target word can be split into a first target word and a second target word: the start time of the first target word is set to E, the end time of the first target word is set to E + V, the start time of the second target word is set to E + V, and the end time of the second target word is set to F.
  • The duration of the first target word is then V, and the duration of the second target word is (F - E - V). Since the duration of the first target word is not greater than the preset duration V, only the duration of the second target word needs to be re-checked against the preset duration V.
  • If the duration of the second target word is not greater than the preset duration V, the splitting stops; if it is greater than the preset duration V, the second target word is split in the same way as the target word above, which will not be repeated here, until the duration of each word in the text information is not greater than the preset duration V.
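The splitting procedure above can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the function name is invented, and it assumes each split gives the first segment exactly the preset duration V.

```python
def split_word(start_ms, end_ms, preset_ms):
    """Split a word's time span into segments no longer than preset_ms.

    Illustrative sketch: the first segment is assumed to take exactly
    preset_ms, and the remainder is split again recursively, mirroring
    the repeated re-check described in the text.
    """
    if end_ms - start_ms <= preset_ms:
        return [(start_ms, end_ms)]  # duration already within the limit
    return [(start_ms, start_ms + preset_ms)] + split_word(
        start_ms + preset_ms, end_ms, preset_ms)
```

For the word "dang" (43000 ms to 43300 ms) with a preset duration of 100 ms, this yields three segments of 100 ms each.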
  • Step S203 Obtain a preset time step and a preset maximum error value.
  • The preset time step refers to the interval between two adjacent adjustment times.
  • The preset maximum error value refers to the maximum allowed error between an adjustment time and the original time point. The larger the preset maximum error value, the more accurately the actual start time and actual end time of the target word can be determined, but also the greater the amount of calculation, so the preset maximum error value can be set according to the actual situation.
  • Step S204 Determine a plurality of start adjustment times corresponding to the target word according to the target word's start time, the preset time step, and the preset maximum error value, and determine a plurality of end adjustment times corresponding to the target word according to the target word's end time, the preset time step, and the preset maximum error value.
  • The multiple start adjustment times include the 100th, 200th, 300th, 400th, 500th, 600th, and 700th milliseconds, and the multiple end adjustment times include the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
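The candidate lists above are consistent with generating adjustment times symmetrically around the original time point. A sketch under that assumption; the concrete values (start time 400 ms, end time 800 ms, step 100 ms, maximum error 300 ms) are inferred from the example lists, not stated in the text:

```python
def adjustment_times(anchor_ms, step_ms, max_error_ms):
    # Candidate times from (anchor - max_error) to (anchor + max_error),
    # spaced one preset time step apart.
    return list(range(anchor_ms - max_error_ms,
                      anchor_ms + max_error_ms + 1,
                      step_ms))

starts = adjustment_times(400, 100, 300)  # 100, 200, ..., 700
ends = adjustment_times(800, 100, 300)    # 500, 600, ..., 1100
```

A smaller step or larger maximum error enlarges the candidate set, trading computation for time resolution, as the description of Step S203 notes.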
  • Step S205 Select a target start adjustment time from the multiple start adjustment times of the target word, and select a target end adjustment time corresponding to the target start adjustment time from the multiple end adjustment times of the target word, to obtain multiple target adjustment time groups.
  • a start adjustment time may be arbitrarily selected as the target start adjustment time from the multiple start adjustment times
  • an end adjustment time may be arbitrarily selected as the target end adjustment time from the multiple end adjustment times
  • the 800th millisecond is selected as the target end adjustment time
  • the target start adjustment time of 200 milliseconds and the target end adjustment time of 800 milliseconds can be used as a target adjustment time group.
  • If the selected target start adjustment time is the 700th millisecond and the target end adjustment time is the 500th millisecond, there will be an unreasonable situation where the target start adjustment time of the target word is later than the target end adjustment time.
  • To avoid this, the value range of the multiple start adjustment times can be compared with the value range of the multiple end adjustment times; if the two ranges overlap, the overlap area can be divided. As shown in Fig. 5, the overlap area runs from the 500th millisecond to the 700th millisecond, so the 600th millisecond, the middle value of the overlap area, can be taken as the dividing line between the start adjustment times and the end adjustment times.
  • After the division, the start adjustment times include the 100th, 200th, 300th, 400th, 500th, and 600th milliseconds, and the end adjustment times include the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
  • Alternatively, a start adjustment time may be selected from the multiple start adjustment times of the target word as the target start adjustment time, and then all end adjustment times not earlier than the target start adjustment time may be selected from the multiple end adjustment times as the target end adjustment times corresponding to that target start adjustment time.
  • The 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds can be selected from the end adjustment times as the target end adjustment times.
  • The 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds can be selected from the end adjustment times as the target end adjustment times. This can also effectively avoid the unreasonable situation where the target start adjustment time of the target word is later than the target end adjustment time.
  • Each selected target start adjustment time and its corresponding target end adjustment time are regarded as a target adjustment time group.
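Pairing each target start adjustment time only with end adjustment times that are not earlier than it, as described above, can be sketched as follows (illustrative only; the function name is an assumption):

```python
def target_groups(start_times, end_times):
    # Keep only (start, end) pairs where the end adjustment time is not
    # earlier than the start adjustment time, avoiding the unreasonable
    # case where the start of the word would follow its end.
    return [(s, e) for s in start_times for e in end_times if e >= s]
```

With start adjustment times 100-700 ms and end adjustment times 500-1100 ms, the pair (700, 500) is excluded while a pair such as (200, 800) is kept.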
  • Step S206 Determine a pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets.
  • the pitch probability set includes pitch, probability, and the association between the two.
  • the audio file may be identified according to the target start adjustment time and the target end adjustment time in the target adjustment time group to obtain a pitch probability set.
  • the steps of establishing the pitch probability set are as follows:
  • For example, the audio file between the 100th and 300th milliseconds is divided into 4 sampling intervals, in which the pitch measured in the 100 ms to 150 ms sampling interval is m2, the pitch measured in the 150 ms to 200 ms sampling interval is m4, the pitch measured in the 200 ms to 250 ms sampling interval is m3, and the pitch measured in the 250 ms to 300 ms sampling interval is m1.
  • the measurement of the pitch in each sampling interval may use a neural network algorithm to process the audio file to obtain the pitch corresponding to the sampling interval.
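The text attributes the per-interval pitch measurement to a neural network algorithm, which is not detailed here. As an illustrative stand-in only (not the patent's method), a plain autocorrelation estimator shows what "the pitch measured in a sampling interval" could look like; the function name and parameters below are assumptions:

```python
import math

def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the dominant pitch (Hz) of one sampling interval.

    Autocorrelation stand-in for the neural network mentioned in the
    description; fmin/fmax bound the plausible vocal pitch range.
    """
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]  # remove DC offset
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    best_lag, best_val = lo, float("-inf")
    # The lag that maximizes the signal's self-similarity approximates
    # the fundamental period of the voiced sound.
    for lag in range(lo, min(hi, len(x) - 1)):
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag
```

For a 440 Hz sine sampled at 8000 Hz this returns a value near 440 Hz, the resolution being limited by the integer lag.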
  • The pitch probability set corresponding to the target adjustment time group can thus be obtained: since each of the pitches m1, m2, m3, and m4 is measured in exactly one of the four sampling intervals, each corresponds to a probability of 0.25.
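Turning the per-interval pitch measurements into a pitch probability set, where each pitch's probability is its fraction of the sampling intervals, can be sketched as follows (illustrative; the function name is an assumption):

```python
from collections import Counter

def pitch_probability_set(interval_pitches):
    # Each pitch's probability is the fraction of sampling intervals
    # in which that pitch was measured.
    counts = Counter(interval_pitches)
    total = len(interval_pitches)
    return {pitch: n / total for pitch, n in counts.items()}
```

For the four intervals above, `pitch_probability_set(["m2", "m4", "m3", "m1"])` assigns each pitch a probability of 0.25.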
  • the pitch probability set can also be stored in the form of Table 1 below.
  • In this way, the pitch probability set corresponding to each target adjustment time group can be obtained, that is, multiple pitch probability sets can be obtained, for example, as shown in Table 2 below:
  • Step S207 The plurality of pitch probability sets are scored, and the pitch probability set with the highest score is selected.
  • U i represents the target start adjustment time in the i-th target adjustment time group
  • V i represents the target end adjustment time in the i-th target adjustment time group
  • i is a positive integer
  • Y represents the start time of the target word
  • Z represents the end time of the target word
  • Q represents the error deduction coefficient
  • T i represents the first probability corresponding to the i-th error deduction value R i
  • O i represents the second probability corresponding to the i-th error deduction value R i . It should be noted that if the first probability far exceeds the second probability, vocal pitch recognition according to that target adjustment time group is more accurate, that is, the score S i is greater.
  • As shown in the correspondence between the target adjustment time groups and the pitch probability sets in Table 2, suppose the error deduction coefficient Q is 0.0001, the end time Z of the target word is 300 milliseconds, and the start time Y of the target word is 100 milliseconds.
  • The error deduction value R 1 corresponding to target adjustment time group 1 is then 0, the error deduction value R 2 corresponding to target adjustment time group 2 is 0.01, and the error deduction value R 3 corresponding to target adjustment time group 3 is 0.01.
  • First, the error deduction value R 1 is set as the target error deduction value, and the first probability T 1 and the second probability O 1 are obtained from the pitch probability set corresponding to R 1 .
  • Finally, the pitch probability set corresponding to the target error deduction value R 1 is scored according to the first probability T 1 , the second probability O 1 , and the target error deduction value R 1 , and the score S 1 is obtained.
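The scoring formula itself is not reproduced in the text, but the variable definitions above are consistent with an error deduction proportional to the time offsets and a score that rewards a dominant first probability. A sketch under those assumptions (the exact formulas below are guesses, not the patent's):

```python
def error_deduction(u, v, y, z, q):
    # Assumed form: the deduction grows with the distance of the target
    # start/end adjustment times (u, v) from the word's original start
    # and end times (y, z), scaled by the error deduction coefficient q.
    return q * (abs(u - y) + abs(v - z))

def score_pitch_set(prob_set, r):
    # Assumed form: a set scores higher when its largest probability
    # dominates the second-largest, minus the error deduction r.
    probs = sorted(prob_set.values(), reverse=True)
    first = probs[0]
    second = probs[1] if len(probs) > 1 else 0.0
    return first - second - r
```

With Q = 0.0001, Y = 100, and Z = 300 as in the example, a group whose times exactly match the word, (100, 300), gets a deduction of 0, consistent with R 1 = 0 above.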
  • Step S208 The pitch information of the target word is generated according to the pitch probability set with the highest score.
  • Assume pitch probability set 3 has the highest score; the pitch information of the target word is then generated according to pitch probability set 3.
  • In pitch probability set 3, the pitch with the highest probability is selected as the pitch of the target word, that is, m2 is used as the pitch of the target word.
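Selecting the final pitch from the winning set is then a single lookup, which can be sketched as (illustrative):

```python
def best_pitch(prob_set):
    # Return the pitch associated with the highest probability
    # in the winning pitch probability set.
    return max(prob_set, key=prob_set.get)
```

For example, `best_pitch({"m1": 0.2, "m2": 0.5, "m3": 0.3})` returns "m2", mirroring the selection described above.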
  • The audio recognition method provided by the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to those start and end adjustment times, improving the accuracy of audio recognition.
  • FIG. 7 is a structural diagram of an audio recognition device according to an embodiment of the present invention.
  • the device 30 includes an acquisition module 301, a setting module 302, a first determination module 303, and an identification module 304.
  • the obtaining module 301 is used to obtain audio files and text information corresponding to the audio files.
  • the text information includes multiple words.
  • When a user uses a singing application to record a song, the accompaniment, vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice in the audio file, the start time and end time of the human voice in the audio file need to be known.
  • When a user uses a singing application to record a song, the singing application displays lyric text information to prompt the user to sing. The time when the lyrics start can therefore be roughly regarded as the time when the user starts singing, and the time when the lyrics end as the time when the user finishes singing. Therefore, after obtaining the audio file, the obtaining module 301 may further obtain the text information corresponding to the audio file to assist in identifying the human voice in the audio file.
  • the text information includes multiple words, which correspond to the human voice.
  • the setting module 302 sequentially sets each word in the text information as the target word, and obtains time information corresponding to the target word.
  • the time information includes the start time of the target word and the end time of the target word.
  • In actual singing, the times when the user starts and finishes singing are not necessarily completely synchronized with the times corresponding to the text information provided by the singing application. Assume that in the lyrics provided by the singing application, the start time of the word "dang" is the 43000th millisecond and its end time is the 43300th millisecond, while the start time of the word "dang" as sung by the user is the 42000th millisecond and its end time is the 42300th millisecond. If the human voice is still detected according to the start time and end time corresponding to the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
  • In summary, the setting module 302 can be used to set each word in the text information in turn as the target word and obtain the time information corresponding to the target word; by adjusting this time information, the accuracy of human voice recognition in the audio file is improved. The time information includes the start time, end time, and duration of the target word.
  • The lyrics include 15 words, and the setting module 302 may sequentially set these 15 words as the target word. Specifically, the setting module 302 first sets the word "dang" as the target word, and can obtain that the start time of the word "dang" is the 43000th millisecond, the end time is the 43300th millisecond, and the duration is 300 milliseconds.
  • In some embodiments, assume that the duration corresponding to one word is about 100 milliseconds. If the setting module 302 detects that the duration corresponding to the target word is greater than 100 milliseconds, the target word may be considered to have multiple pronunciations for one word, that is, one target word may correspond to multiple pitches, where pitch refers to the height of a sound.
  • the audio recognition device 30 is further provided with a second determination module 305, a split module 306, a determination module 307, and a continue split module 308.
  • The second determination module 305 is used to determine whether the duration of the target word is greater than a preset duration; the splitting module 306 is used to split the target word when the duration is greater than the preset duration and determine the duration of each split target word; the determination module 307 is used to re-determine whether the duration of the split target word is greater than the preset duration; the continue-splitting module 308 is used to continue splitting the split target word when its duration is still greater than the preset duration, until the duration of each word in the text information is not greater than the preset duration.
  • The duration of the target word can be calculated from the end time and start time of the target word. Specifically, assume that the start time of the target word is E and the end time is F; then the duration of the target word is (F-E).
  • Through statistical analysis of a large amount of data, the duration corresponding to a single pitch can be obtained, so the preset duration can be set according to the duration corresponding to a single pitch, for example by setting the preset duration to the duration corresponding to a single pitch; the value of the preset duration is not specifically limited here.
  • If the second determination module 305 determines that the duration of the target word is greater than the preset duration, it indicates that one word of the target word may correspond to multiple pitches. Therefore, the target word needs to be split until each word in the text information corresponds to only one pitch.
  • Specifically, the splitting module 306 can split the target word into a first target word and a second target word: the start time of the first target word is set to E, the end time of the first target word is set to a split time point between E and F, the start time of the second target word is set to the same split time point, and the end time of the second target word is set to F (the split-point and duration formulas are given only as images in the original). After this split, the duration of the first target word must be less than the preset duration V, so the determination module 307 only needs to re-check whether the duration of the second target word is greater than the preset duration V. If it is not greater than V, the splitting stops; if it is greater than V, the continue-splitting module 308 continues to split the second target word according to the above method for splitting the target word, which will not be repeated here, until the duration of each word in the text information is not greater than the preset duration V.
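  The splitting loop above can be sketched as a small recursive routine. The patent's exact split-point formula is given only as an image, so the midpoint split used here is an illustrative assumption (and, unlike the single re-check described above, it may require re-checking both halves):

  ```python
  def split_word(start_ms, end_ms, max_duration_ms):
      """Recursively split the time span [start_ms, end_ms] of a word until
      every resulting piece lasts no longer than max_duration_ms.

      Assumption: the split point is the midpoint of the span; the patent's
      own split-point formula is not recoverable from this copy.
      """
      if end_ms - start_ms <= max_duration_ms:
          return [(start_ms, end_ms)]
      mid = (start_ms + end_ms) / 2.0
      return (split_word(start_ms, mid, max_duration_ms)
              + split_word(mid, end_ms, max_duration_ms))
  ```

  For example, the 300-millisecond word "dang" checked against a 100-millisecond preset duration is split into adjacent pieces of at most 100 milliseconds each, covering the original span exactly.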
  • the first determining module 303 is configured to determine a plurality of start adjustment times corresponding to the target word according to the start time of the target word, and determine a plurality of end adjustment times corresponding to the target word according to the end time of the target word.
  • the first determination module 303 includes: an acquisition submodule 3031 and a determination submodule 3032.
  • the obtaining submodule 3031 is used to obtain a preset time step and a preset maximum error value.
  • The preset time step refers to a preset difference between two time points. The smaller the value of the preset time step, the more accurately the actual start time and actual end time of the target word can be determined, but this also brings the problem of excessive computation, so the value of the preset time step can be set according to the actual situation.
  • The preset maximum error value refers to a preset error value between two time points. The larger the value of the preset maximum error value, the more accurately the actual start time and actual end time of the target word can be determined, but this also brings the problem of excessive computation, so the preset maximum error value can be set according to the actual situation.
  • The determination submodule 3032 is used to determine multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and to determine multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
  • For example, assuming the start time E of the target word is the 400th millisecond, the end time F is the 800th millisecond, the preset time step I is 100 milliseconds, and the preset maximum error value J is 300 milliseconds, the multiple start adjustment times include the 100th, 200th, 300th, 400th, 500th, 600th, and 700th milliseconds, and the multiple end adjustment times include the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
  • the recognition module 304 is used to recognize the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word.
  • the identification module 304 includes: a selection submodule 3041, a obtaining submodule 3042, a scoring submodule 3043, and a generation submodule 3044.
  • The selection submodule 3041 is used to select a target start adjustment time from the multiple start adjustment times of the target word, and to select, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups.
  • the selection submodule 3041 may arbitrarily select a start adjustment time from the plurality of start adjustment times as the target start adjustment time, and select an end adjustment time from the plurality of end adjustment times as the target end adjustment time .
  • The selection submodule 3041 can select the 200th millisecond as the target start adjustment time from the multiple start adjustment times such as the 100th, 200th, and 300th milliseconds, and select the 800th millisecond as the target end adjustment time from the multiple end adjustment times such as the 700th, 800th, and 900th milliseconds; the target start adjustment time of the 200th millisecond and the target end adjustment time of the 800th millisecond can then be used as one target adjustment time group. However, if the target start adjustment time selected by the submodule 3041 is the 700th millisecond and the target end adjustment time is the 500th millisecond, the unreasonable situation arises in which the target start adjustment time of the target word is greater than the target end adjustment time.
  • To avoid this unreasonable situation, the selection submodule 3041 can compare the value range of the multiple target start adjustment times with the value range of the multiple target end adjustment times; if the two ranges overlap, the overlapping area can be divided in a compromise manner. As shown in FIG. 5, the overlapping area is from the 500th millisecond to the 700th millisecond, so the submodule 3041 can take the middle value of the overlapping area, the 600th millisecond, as the boundary between the target start adjustment times and the target end adjustment times. That is, after the compromise division, the multiple target start adjustment times include the 100th, 200th, 300th, 400th, 500th, and 600th milliseconds, and the multiple end adjustment times include the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
  • In some embodiments, the selection submodule 3041 may also select each start adjustment time in turn from the multiple start adjustment times of the target word as the target start adjustment time, and then select, from the multiple end adjustment times, all end adjustment times that are not less than the target start adjustment time as the target end adjustment times corresponding to that target start adjustment time.
  • For example, when the 100th millisecond is selected as the target start adjustment time, the selection submodule 3041 can select the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds from the end adjustment times as the target end adjustment times. When the 600th millisecond is selected as the target start adjustment time, the selection submodule 3041 can select the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds from the end adjustment times as the target end adjustment times. This can also effectively avoid the unreasonable situation in which the target start adjustment time of the target word is greater than the target end adjustment time.
  • Finally, each target start adjustment time and its corresponding target end adjustment time are regarded as a target adjustment time group.
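  The pairing rule described above (every candidate end time that is not earlier than the chosen start time) can be sketched as:

  ```python
  def adjustment_time_groups(start_times, end_times):
      """Pair each candidate start time with every candidate end time that is
      not less than it, so no group has a start time after its end time."""
      return [(s, e) for s in start_times for e in end_times if e >= s]
  ```

  With the candidates from FIG. 5, this keeps reasonable groups such as (200, 800) while excluding the unreasonable (700, 500) group discussed above.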
  • the obtaining submodule 3042 is used to determine the pitch probability set corresponding to each target adjustment time group, and obtain multiple pitch probability sets.
  • the pitch probability set includes pitch, probability, and the relationship between the two.
  • the obtaining submodule 3042 may identify the audio file according to the target start adjustment time and the target end adjustment time in the target adjustment time group to obtain a pitch probability set.
  • The steps by which the obtaining submodule 3042 establishes the pitch probability set are as follows: first, divide the audio file into multiple sampling intervals according to the target adjustment time group; next, obtain the pitch corresponding to each sampling interval and the probability corresponding to that pitch; finally, store the pitch, the probability, and the association between the two to generate the pitch probability set corresponding to the target adjustment time group.
  • Specifically, taking the 100th millisecond as the target start adjustment time and the 300th millisecond as the target end adjustment time as one target adjustment time group, and taking every 50 milliseconds as one sampling interval, the submodule 3042 can divide the audio file between the 100th and 300th millisecond into 4 sampling intervals, in which the pitch measured in the 100th-150th millisecond sampling interval is m2, the pitch measured in the 150th-200th millisecond sampling interval is m4, the pitch measured in the 200th-250th millisecond sampling interval is m3, and the pitch measured in the 250th-300th millisecond sampling interval is m1.
  • For the measurement of the pitch in each sampling interval, the audio file can be processed with a neural network algorithm to obtain the pitch corresponding to the sampling interval.
  • In summary, the obtaining submodule 3042 can obtain the pitch probability set corresponding to this target adjustment time group, which can also be stored in the form shown in Table 1. According to the above method, the obtaining submodule 3042 can obtain the pitch probability set corresponding to each target adjustment time group, that is, multiple pitch probability sets, as shown in Table 2.
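  A minimal sketch of the interval-division and set-building steps above, assuming the probability of a pitch is its relative frequency across the sampling intervals (the patent's Table 1, rendered as an image, may instead use per-interval model confidences):

  ```python
  from collections import Counter

  def sampling_intervals(start_ms, end_ms, width_ms=50):
      """Divide [start_ms, end_ms) into fixed-width sampling intervals."""
      return [(t, min(t + width_ms, end_ms)) for t in range(start_ms, end_ms, width_ms)]

  def pitch_probability_set(interval_pitches):
      """Build a {pitch: probability} mapping from the pitch label detected
      in each sampling interval.

      Assumption: probability = relative frequency of each detected pitch.
      """
      counts = Counter(interval_pitches)
      total = len(interval_pitches)
      return {pitch: count / total for pitch, count in counts.items()}
  ```

  For the 100th-300th millisecond group above, the four intervals yield the labels m2, m4, m3, and m1, so each pitch would receive probability 0.25 under this frequency assumption.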
  • The scoring submodule 3043 is used to score the multiple pitch probability sets and select the pitch probability set with the highest score.
  • the scoring submodule 3043 is specifically used to:
  • obtain multiple error deduction values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word; set each of the multiple error deduction values in turn as the target error deduction value, and obtain the first probability and the second probability from the pitch probability set corresponding to the target error deduction value;
  • score the pitch probability set corresponding to the target error deduction value according to the first probability, the second probability, and the target error deduction value.
  • U i represents the target start adjustment time in the i-th target adjustment time group
  • V i represents the target end adjustment time in the i-th target adjustment time group
  • i is a positive integer
  • Y represents the start time of the target word
  • Z represents the end time of the target word
  • Q represents the error deduction coefficient
  • T i represents the first probability corresponding to the i-th error deduction value R i
  • O i represents the second probability corresponding to the i-th error deduction value R i . It should be noted that the more the first probability exceeds the second probability, the higher the accuracy of human voice pitch recognition for the audio according to that target adjustment time group, that is, the greater the score S i .
  • For example, the scoring submodule 3043 may obtain that the error deduction value R 1 corresponding to target adjustment time group 1 is 0, the error deduction value R 2 corresponding to target adjustment time group 2 is 0.01, and the error deduction value R 3 corresponding to target adjustment time group 3 is 0.01.
  • Take the second error deduction value R 2 as the target error deduction value, and obtain the first probability T 2 and the second probability O 2 from pitch probability set 2 corresponding to R 2 . Finally, according to the first probability T 2 , the second probability O 2 , and the target error deduction value R 2 , score pitch probability set 2 corresponding to R 2 to obtain the score S 2 .
  • Similarly, the scoring submodule 3043 can score pitch probability set 3 corresponding to the target error deduction value R 3 to obtain the score S 3 ; the detailed calculation process is not repeated here.
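  The variables above suggest a score that grows with the margin between the first and second probabilities and shrinks with the error deduction value. The exact formulas for R i and S i are given only as images in the original, so the forms below (a drift-weighted deduction and a simple margin-minus-deduction score) are assumptions for illustration only:

  ```python
  def error_deduction(u, v, y, z, q):
      """Assumed form: the deduction R grows with how far the adjusted times
      (u, v) drift from the original lyric times (y, z), scaled by the
      error deduction coefficient q."""
      return q * (abs(u - y) + abs(v - z))

  def score(t_first, o_second, r_deduction):
      """Assumed form: reward the margin between the largest probability and
      the second-largest probability, minus the error deduction."""
      return (t_first - o_second) - r_deduction
  ```

  Under these assumed forms, an unadjusted group (u = y, v = z) gets a deduction of 0, matching R 1 = 0 in the example above, and a set whose top pitch clearly dominates can still score highest despite a nonzero deduction.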
  • the generating sub-module 3044 is used for generating pitch information of the target word according to the pitch probability set with the highest score.
  • The generation submodule 3044 compares the scores S 1 , S 2 , and S 3 , and it can be seen that the score S 3 is the highest. Therefore, the generation submodule 3044 generates the pitch information of the target word according to pitch probability set 3. Specifically, the generation submodule 3044 may select, from pitch probability set 3, the pitch with the highest probability as the pitch of the target word, that is, m2 is used as the pitch of the target word.
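  Selecting the pitch with the highest probability from the winning set is a straightforward argmax over the pitch probability set:

  ```python
  def select_pitch(pitch_probabilities):
      """Return the pitch with the highest probability from a pitch
      probability set represented as a {pitch: probability} dict."""
      return max(pitch_probabilities, key=pitch_probabilities.get)
  ```

  If m2 carries the largest probability in pitch probability set 3, m2 is returned as the pitch of the target word.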
  • The audio recognition device of the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to the multiple start adjustment times and the multiple end adjustment times, improving the accuracy of audio recognition.
  • an embodiment of the present invention also provides an electronic device, as shown in FIG. 10, which shows a schematic structural diagram of the electronic device involved in the embodiment of the present invention, specifically speaking:
  • the electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components.
  • The processor 401 is the control center of the electronic device; it uses various interfaces and lines to connect the various parts of the entire electronic device, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, so as to monitor the electronic device as a whole.
  • Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 401.
  • the memory 402 may be used to store software programs and modules.
  • the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402.
  • The memory 402 may mainly include a storage program area and a storage data area, where the storage program area may store the operating system, application programs required by at least one function (such as a sound playback function, an image playback function, etc.), and the like, and the storage data area may store data created through the use of the electronic device, and the like.
  • The memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the memory 402.
  • the electronic device further includes a power supply 403 that supplies power to various components.
  • the power supply 403 can be logically connected to the processor 401 through a power management system, so as to realize functions such as charging, discharging, and power management through the power management system.
  • the power supply 403 may also include any component such as one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
  • the electronic device may further include an input unit 404, which may be used to receive input digital or character information, and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
  • the electronic device may further include a display unit and the like, which will not be repeated here.
  • Specifically, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions, as follows:
  • obtain an audio file and text information corresponding to the audio file, the text information including multiple words;
  • set each word in the text information in turn as the target word, and obtain time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word;
  • determine multiple start adjustment times corresponding to the target word according to the start time of the target word, and determine multiple end adjustment times corresponding to the target word according to the end time of the target word;
  • recognize the audio file according to the multiple start adjustment times and the multiple end adjustment times of the target word to obtain the pitch information of the target word.
  • The electronic device can achieve the beneficial effects that can be achieved by any audio recognition apparatus provided in the embodiments of the present invention. For details, see the foregoing embodiments, which are not described herein again.
  • In summary, the electronic device first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to the multiple start adjustment times and the multiple end adjustment times, improving the accuracy of audio recognition.
  • the one or more operations may constitute computer-readable instructions stored on one or more computer-readable media, which when executed by an electronic device will cause the computing device to perform the operations.
  • The order in which some or all of the operations are described should not be interpreted as implying that these operations must be performed in that order. Those skilled in the art, having the benefit of this specification, will understand alternative orderings. Moreover, it should be understood that not all operations are necessarily present in every embodiment provided herein.
  • Each functional unit in the embodiment of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disk.
  • the above devices or systems may execute the methods in the corresponding method embodiments.


Abstract

Provided are an audio recognition method, device, and storage medium. The method includes: obtaining an audio file and text information corresponding to the audio file, the text information including multiple words (S101); setting each word in the text information in turn as a target word and obtaining time information corresponding to the target word, the time information including the start time and end time of the target word (S102); determining multiple corresponding start adjustment times according to the start time of the target word, and determining multiple corresponding end adjustment times according to the end time of the target word (S103); and recognizing the audio file according to the multiple start adjustment times and multiple end adjustment times of the target word to obtain pitch information of the target word (S104).

Description

Audio recognition method, device, and storage medium — Technical Field
The present invention relates to the field of information technology, and in particular to an audio recognition method, device, and storage medium.
Background
With the development of Internet technology and the continuous spread of terminals, more and more users sing songs along with accompaniments played by a singing application in a terminal. The terminal can also score the user's singing audio for the user's reference.
A piece of singing audio contains not only the human voice but also the sound of musical instruments and even noise. To score the singing audio accurately, the human voice pitch needs to be accurately identified from the singing audio. In existing human voice pitch recognition techniques, the start time and end time of the lyrics are generally taken as the times when the person starts and stops singing. In actual singing, however, some people may sing earlier than the start time of the lyrics and some may sing later, so directly determining the start and end of the human voice from the start time and end time of the lyrics has low accuracy.
It is therefore necessary to provide an audio recognition method to improve the accuracy of human voice pitch recognition.
Technical Problem
Embodiments of the present invention provide an audio recognition method, device, and storage medium that can improve the accuracy of audio recognition.
Technical Solution
An embodiment of the present invention provides an audio recognition method, which includes:
obtaining an audio file and text information corresponding to the audio file, the text information including multiple words;
setting each word in the text information in turn as a target word, and obtaining time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word;
determining multiple start adjustment times corresponding to the target word according to the start time of the target word, and determining multiple end adjustment times corresponding to the target word according to the end time of the target word;
recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word.
In an embodiment, the step of determining multiple start adjustment times corresponding to the target word according to the start time of the target word, and determining multiple end adjustment times corresponding to the target word according to the end time of the target word, includes:
obtaining a preset time step and a preset maximum error value;
determining the multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and determining the multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
In an embodiment, the step of recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word includes:
selecting a target start adjustment time from the multiple start adjustment times of the target word, and selecting, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups;
determining the pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets, each pitch probability set including pitches, probabilities, and the associations between the two;
scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score;
generating the pitch information of the target word according to the pitch probability set with the highest score.
In an embodiment, the step of scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score includes:
obtaining multiple error deduction values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word;
setting each of the multiple error deduction values in turn as a target error deduction value, and obtaining a first probability and a second probability from the pitch probability set corresponding to the target error deduction value, where the first probability is the largest probability and the second probability is the second-largest probability;
scoring the pitch probability set corresponding to the target error deduction value according to the first probability, the second probability, and the target error deduction value.
In an embodiment, the step of determining the pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets, each pitch probability set including pitches, probabilities, and the associations between the two, includes:
dividing the audio file into multiple sampling intervals according to the target adjustment time group;
obtaining the pitch corresponding to each sampling interval and the probability corresponding to the pitch;
storing the pitch, the probability, and the association between the two to generate the pitch probability set corresponding to the target adjustment time group.
In an embodiment, the time information corresponding to the target word further includes the duration of the target word; after the step of setting each word in the text information in turn as the target word and obtaining the time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word, the method further includes:
determining whether the duration of the target word is greater than a preset duration;
if it is greater than the preset duration, splitting the target word and determining the duration of each split target word;
re-determining whether the duration of the split target word is greater than the preset duration;
if it is greater than the preset duration, continuing to split the split target word until the duration of every word in the text information is not greater than the preset duration.
An embodiment of the present invention further provides an audio recognition device, which includes:
an obtaining module, used to obtain an audio file and text information corresponding to the audio file, the text information including multiple words;
a setting module, used to set each word in the text information in turn as a target word and obtain time information corresponding to the target word, the time information including the start time of the target word and the end time of the target word;
a first determination module, used to determine multiple start adjustment times corresponding to the target word according to the start time of the target word, and determine multiple end adjustment times corresponding to the target word according to the end time of the target word;
a recognition module, used to recognize the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word to obtain pitch information of the target word.
In an embodiment, the first determination module includes:
an obtaining submodule, used to obtain a preset time step and a preset maximum error value;
a determination submodule, used to determine the multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and determine the multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
In an embodiment, the recognition module includes:
a selection submodule, used to select a target start adjustment time from the multiple start adjustment times of the target word, and select, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups;
an obtaining submodule, used to determine the pitch probability set corresponding to each target adjustment time group to obtain multiple pitch probability sets, each pitch probability set including pitches, probabilities, and the associations between the two;
a scoring submodule, used to score the multiple pitch probability sets and select the pitch probability set with the highest score;
a generation submodule, used to generate the pitch information of the target word according to the pitch probability set with the highest score.
In an embodiment, the scoring submodule is specifically used to:
obtain multiple error deduction values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word;
set each of the multiple error deduction values in turn as a target error deduction value, and obtain a first probability and a second probability from the pitch probability set corresponding to the target error deduction value;
score the pitch probability set corresponding to the target error deduction value according to the first probability, the second probability, and the target error deduction value.
In an embodiment, the obtaining submodule is specifically used to:
divide the audio file into multiple sampling intervals according to the target adjustment time group;
obtain the pitch corresponding to each sampling interval and the probability corresponding to the pitch;
store the pitch, the probability, and the association between the two to generate the pitch probability set corresponding to the target adjustment time group.
In an embodiment, the audio recognition device further includes:
a second determination module, used to determine whether the duration of the target word is greater than a preset duration;
a splitting module, used to split the target word when the duration is greater than the preset duration, and determine the duration of each split target word;
a determination module, used to re-determine whether the duration of the split target word is greater than the preset duration;
a continue-splitting module, used to continue splitting the split target word when the duration is greater than the preset duration, until the duration of every word in the text information is not greater than the preset duration.
Correspondingly, an embodiment of the present invention further provides a storage medium storing processor-executable instructions, and a processor provides any of the above audio recognition methods by executing the instructions.
Beneficial Effects
The audio recognition method, device, and storage medium of the embodiments of the present invention first determine multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognize the audio file according to the multiple start adjustment times and the multiple end adjustment times, improving the accuracy of audio recognition.
Brief Description of the Drawings
The technical solution of the present invention and its other beneficial effects will be made apparent by the following detailed description of specific embodiments of the present invention in combination with the accompanying drawings.
FIG. 1 is a schematic diagram of a first scenario of the audio recognition method provided by an embodiment of the present invention.
FIG. 2 is a schematic flowchart of the audio recognition method provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of another scenario of the audio recognition method provided by an embodiment of the present invention.
FIG. 4 is another schematic flowchart of the audio recognition method provided by an embodiment of the present invention.
FIG. 5 is a schematic diagram of yet another scenario of the audio recognition method provided by an embodiment of the present invention.
FIG. 6 is a schematic diagram of still another scenario of the audio recognition method provided by an embodiment of the present invention.
FIG. 7 is a schematic structural diagram of the audio recognition device provided by an embodiment of the present invention.
FIG. 8 is a schematic structural diagram of the first determination module provided by an embodiment of the present invention.
FIG. 9 is a schematic structural diagram of the recognition module provided by an embodiment of the present invention.
FIG. 10 is a schematic structural diagram of the electronic device provided by an embodiment of the present invention.
Embodiments of the Invention
The technical solutions in the embodiments of the present invention will be described clearly and completely below in combination with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a schematic scenario diagram of the audio recognition method provided by an embodiment of the present invention. In this scenario, the audio recognition device can be implemented as an entity or integrated in an electronic device such as a terminal or a server; the electronic device may include a smartphone, a tablet computer, a personal computer, and the like.
As shown in FIG. 1, the scenario may include a terminal a and a server b. User A can record a song through a singing application H integrated in terminal a to generate an audio file. After obtaining the audio file, terminal a can obtain the text information corresponding to the audio file from server b, specifically including lyric text information, and the text information includes multiple words. It should be noted that each word in the text information has time information, specifically including the start time and end time of each word. Generally speaking, the start and end of a word correspond to the start and end of a human voice pitch. Next, terminal a sets each word in the text information as the target word and further obtains from server b the time information corresponding to the target word, which includes the start time and end time of the target word. In the audio file recorded by the user, the start and end of the human voice pitch are not necessarily completely synchronized with the start and end of the corresponding word. Therefore, multiple start adjustment times corresponding to the target word can be determined according to the start time of the target word, and multiple end adjustment times corresponding to the target word can be determined according to the end time of the target word. Finally, terminal a recognizes the audio file according to the multiple start adjustment times and multiple end adjustment times to obtain the pitch information of the target word.
Embodiments of the present invention provide an audio recognition method, device, and storage medium, which will be described in detail below.
In this embodiment of the present invention, the description will be given from the perspective of the audio recognition device, which can specifically be integrated in an electronic device.
An audio recognition method includes: obtaining an audio file and text information corresponding to the audio file, the text information including multiple words; setting each word in the text information in turn as a target word, and obtaining time information corresponding to the target word, the time information including the start time and end time of the target word; determining multiple start adjustment times corresponding to the target word according to the start time of the target word, and determining multiple end adjustment times corresponding to the target word according to the end time of the target word; and recognizing the audio file according to the multiple start adjustment times and the multiple end adjustment times of the target word to obtain pitch information of the target word.
Referring to FIG. 2, FIG. 2 is a flowchart of the audio recognition method provided by an embodiment of the present invention. The method may include:
Step S101: obtain an audio file and text information corresponding to the audio file, the text information including multiple words.
When a user uses a singing application to record a song, accompaniment sounds, vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice from the audio file, the start time and end time of the human voice in the audio file need to be known.
As shown in FIG. 3, when a user uses a singing application to record a song, the singing application displays lyric text information to prompt the user to sing. The time when the lyrics start can be roughly regarded as the time when the user starts singing, and the time when the lyrics end as the time when the user finishes singing. Therefore, after the audio file is obtained, the text information corresponding to the audio file can be further obtained to assist in identifying the human voice in the audio file. The text information includes multiple words, which correspond to the human voice.
Step S102: set each word in the text information in turn as the target word, and obtain time information corresponding to the target word, the time information including the start time and end time of the target word.
In actual singing, the times when the user starts and finishes singing are not necessarily completely synchronized with the times corresponding to the text information provided by the singing application. As shown in FIG. 3, assume that in the lyrics provided by the singing application, the start time of the word "dang" is the 43000th millisecond and its end time is the 43300th millisecond, while the start time of the word "dang" as sung by the user is the 42000th millisecond and its end time is the 42300th millisecond. If the human voice is still detected according to the start time and end time corresponding to the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
In summary, each word in the text information can be set in turn as the target word, the time information corresponding to the target word can be obtained, and this time information can be adjusted to improve the accuracy of human voice recognition in the audio file. The time information includes the start time and end time of the target word.
Step S103: determine multiple start adjustment times corresponding to the target word according to the start time of the target word, and determine multiple end adjustment times corresponding to the target word according to the end time of the target word.
Specifically, multiple time points within a period before and after the start time of the target word can be selected as start adjustment times. Similarly, multiple time points within a period before and after the end time of the target word can be selected as end adjustment times. Assume the start time of the target word is the 10000th millisecond and the end time is the 10500th millisecond; then, between the 9900th and 10100th milliseconds around the 10000th millisecond, the 9900th, 9950th, 10000th, 10050th, and 10100th milliseconds can be selected as start adjustment times, and around the 10500th millisecond, the 10400th, 10450th, 10500th, 10550th, and 10600th milliseconds can be selected as end adjustment times.
Step S104: recognize the audio file according to the multiple start adjustment times and the multiple end adjustment times of the target word to obtain pitch information of the target word.
Specifically, target start adjustment times and target end adjustment times satisfying a preset condition can be selected from the multiple start adjustment times and multiple end adjustment times of the target word to form multiple target adjustment time groups.
Then, human voice pitch recognition is performed on the audio file according to each target adjustment time group, and the recognized human voice pitch is scored; the higher the quality of the human voice pitch recognition in a target adjustment time group, the higher its score. The pitch information of the target word can thus be obtained according to that target adjustment time group. Here, human voice pitch refers to the height of the sound produced by a person.
As can be seen from the above, the audio recognition method provided by the embodiment of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time corresponding to the target word, and then recognizes the audio file according to the multiple start adjustment times and multiple end adjustment times, improving the accuracy of audio recognition.
Based on the audio recognition method described in the above embodiment, further examples are given below. In this embodiment of the present invention, the description will be given from the perspective of the audio recognition device, which can specifically be integrated in an electronic device.
Referring to FIG. 4, FIG. 4 is another flowchart of the audio recognition method provided by an embodiment of the present invention. The method may include:
Step S201: obtain an audio file and text information corresponding to the audio file, the text information including multiple words.
When a user uses a singing application to record a song, accompaniment sounds, vocals, and other sounds together form an audio file. These sounds exist in the audio file in the form of digital signals. To accurately identify the human voice from the audio file, the start time and end time of the human voice in the audio file need to be known.
As shown in FIG. 3, when a user uses a singing application to record a song, the singing application displays lyric text information to prompt the user to sing. The time when the lyrics start can therefore be roughly regarded as the time when the user starts singing, and the time when the lyrics end as the time when the user finishes singing. Therefore, after the audio file is obtained, the text information corresponding to the audio file can be further obtained to assist in identifying the human voice in the audio file. The text information includes multiple words, which correspond to the human voice.
Step S202: set each word in the text information in turn as the target word, and obtain time information corresponding to the target word, the time information including the start time and end time of the target word.
In actual singing, the times when the user starts and finishes singing are not necessarily completely synchronized with the times corresponding to the text information provided by the singing application. As shown in FIG. 3, assume that in the lyrics provided by the singing application, the start time of the word "dang" is the 43000th millisecond and its end time is the 43300th millisecond, while the start time of the word "dang" as sung by the user is the 42000th millisecond and its end time is the 42300th millisecond. If the human voice is still detected according to the start time and end time corresponding to the lyric "dang" provided by the singing application, the accuracy of audio recognition will be reduced.
In summary, each word in the text information can be set in turn as the target word, the time information corresponding to the target word can be obtained, and this time information can be adjusted to improve the accuracy of human voice recognition in the audio file. The time information includes the start time, end time, and duration of the target word.
As shown in FIG. 3, the lyrics include 15 words, and these 15 words can be set in turn as the target word. Specifically, the word "dang" is first set as the target word, and it can be obtained that the start time of the word "dang" is the 43000th millisecond, the end time is the 43300th millisecond, and the duration is 300 milliseconds.
In some embodiments, assume that the duration corresponding to one word is about 100 milliseconds. If it is detected that the duration corresponding to the target word is greater than 100 milliseconds, the target word can be considered to have multiple pronunciations for one word, that is, one target word may correspond to multiple pitches, where pitch refers to the height of a sound. This situation can be handled with the following steps:
1-1: determine whether the duration of the target word is greater than the preset duration.
1-2: if it is greater than the preset duration, split the target word and determine the duration of each split target word.
1-3: re-determine whether the duration of the split target word is greater than the preset duration.
1-4: if it is greater than the preset duration, continue to split the split target word until the duration of every word in the text information is not greater than the preset duration.
The duration of the target word can be calculated from the end time and start time of the target word. Specifically, assume that the start time of the target word is E and the end time is F; then the duration of the target word is (F-E).
Through statistical analysis of a large amount of data, the duration corresponding to a single pitch can be obtained, so the preset duration can be set according to the duration corresponding to a single pitch, for example by setting the preset duration to the duration corresponding to a single pitch; the value of the preset duration is not specifically limited here.
If the duration of the target word is greater than the preset duration, one word of the target word may correspond to multiple pitches. The target word therefore needs to be split until each word in the text information corresponds to only one pitch.
Specifically, the target word can be split into a first target word and a second target word: the start time of the first target word is set to E, the end time of the first target word is set to a split time point between E and F, the start time of the second target word is set to the same split time point, and the end time of the second target word is set to F (the split-point and duration formulas are given only as images in the original). After splitting in this way, the duration of the first target word must be less than the preset duration V, so it is subsequently only necessary to re-check whether the duration of the second target word is greater than the preset duration V.
If the duration of the second target word is not greater than the preset duration V, the splitting of the second target word stops; if the duration of the second target word is greater than the preset duration V, the second target word is split according to the above method for splitting the target word, which will not be repeated here, until the duration of every word in the text information is not greater than the preset duration V.
Step S203: obtain a preset time step and a preset maximum error value.
The preset time step refers to a preset difference between two time points. The smaller the value of the preset time step, the more accurately the actual start time and actual end time of the target word can be determined, but this also brings the problem of excessive computation, so the value of the preset time step can be set according to the actual situation.
The preset maximum error value refers to a preset error value between two time points. The larger the value of the preset maximum error value, the more accurately the actual start time and actual end time of the target word can be determined, but this also brings the problem of excessive computation, so the preset maximum error value can be set according to the actual situation.
Step S204: determine multiple start adjustment times corresponding to the target word according to the start time of the target word, the preset time step, and the preset maximum error value, and determine multiple end adjustment times corresponding to the target word according to the end time of the target word, the preset time step, and the preset maximum error value.
Specifically, assume the start time of the target word is E, the end time is F, the preset time step is I, and the preset maximum error value is J. The multiple start adjustment times of the target word can then be set as: K1 = E-J, K2 = E-J+I, K3 = E-J+2*I, ..., Kn = E+J. The multiple end adjustment times of the target word can be set as: L1 = F-J, L2 = F-J+I, L3 = F-J+2*I, ..., Ln = F+J.
As shown in FIG. 5, assume the start time E of the target word is the 400th millisecond, the end time F is the 800th millisecond, the preset time step I is 100 milliseconds, and the preset maximum error value J is 300 milliseconds; then the multiple start adjustment times of the target word include the 100th, 200th, 300th, 400th, 500th, 600th, and 700th milliseconds, and the multiple end adjustment times include the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
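The candidate times K1, ..., Kn and L1, ..., Ln above form an arithmetic progression from t-J to t+J in steps of I, which can be sketched as:

```python
def candidate_times(t_ms, step_ms, max_error_ms):
    """Generate adjustment-time candidates t-J, t-J+I, ..., t+J
    (all values in milliseconds), following K1 = E-J, ..., Kn = E+J."""
    return list(range(t_ms - max_error_ms, t_ms + max_error_ms + 1, step_ms))
```

With E = 400, F = 800, I = 100, and J = 300, this reproduces the start candidates (100th-700th millisecond) and end candidates (500th-1100th millisecond) of the FIG. 5 example.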
Step S205: select a target start adjustment time from the multiple start adjustment times of the target word, and select a corresponding target end adjustment time from the multiple end adjustment times of the target word, obtaining multiple target adjustment time groups.
In some embodiments, any one of the start adjustment times can be selected as the target start adjustment time, and any one of the end adjustment times as the target end adjustment time.
As shown in FIG. 5, the 200th millisecond can be selected as the target start adjustment time from the 100th, 200th, 300th, and other start adjustment times, and the 800th millisecond as the target end adjustment time from the 700th, 800th, 900th, and other end adjustment times; the pair (200th millisecond, 800th millisecond) then forms one target adjustment time group. However, if the 700th millisecond were selected as the target start adjustment time and the 500th millisecond as the target end adjustment time, the unreasonable case would arise in which the target start adjustment time of the target word is later than its target end adjustment time.
To avoid this unreasonable case, after the multiple target start adjustment times and target end adjustment times are determined in step S204, their value ranges can be compared; if the two ranges overlap, the overlap can be divided at a compromise point. As shown in FIG. 5, the overlap is the 500th-700th millisecond, so its midpoint, the 600th millisecond, can be taken as the boundary between start and end adjustment times. After this division, the multiple start adjustment times are the 100th, 200th, 300th, 400th, 500th, and 600th milliseconds, and the multiple end adjustment times are the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
In some embodiments, the start adjustment times of the target word can instead be taken as the target start adjustment time one by one, and for each, all end adjustment times that are not smaller than it are selected as its corresponding target end adjustment times.
As shown in FIG. 5, when the 100th millisecond is selected as the target start adjustment time, the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds can all be selected as target end adjustment times; when the 600th millisecond is selected as the target start adjustment time, the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds can be selected. This likewise prevents the unreasonable case in which the target start adjustment time is later than the target end adjustment time. Each such pair of target start adjustment time and target end adjustment time forms a target adjustment time group.
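The second pairing scheme, which keeps only end adjustment times not smaller than the chosen start adjustment time, can be sketched as follows (an illustrative sketch, not from the patent):

```python
def target_time_groups(start_times, end_times):
    """Form (start, end) target adjustment time groups, discarding any
    pair in which the start would be later than the end."""
    return [(s, e) for s in start_times for e in end_times if e >= s]
```

For the FIG. 5 example (starts 100-700 ms, ends 500-1100 ms), every retained group satisfies start <= end, and pairs such as (700, 500) are discarded.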
Step S206: determine the pitch probability set corresponding to each target adjustment time group, obtaining multiple pitch probability sets; a pitch probability set contains pitches, probabilities, and the association between them.
Following step S205, the audio file can be recognized according to the target start adjustment time and target end adjustment time of each target adjustment time group, obtaining a pitch probability set. The steps for building the pitch probability set are as follows:
2-1. Divide the audio file into multiple sampling intervals according to the target adjustment time group.
2-2. Obtain the pitch corresponding to each sampling interval, and the probability corresponding to each pitch.
2-3. Store the pitches, the probabilities, and the association between them, generating the pitch probability set corresponding to the target adjustment time group.
Specifically, take the pair (100th millisecond, 300th millisecond) as one target adjustment time group and 50 milliseconds as one sampling interval. As shown in FIG. 6, the audio file between the 100th and 300th milliseconds can then be divided into 4 sampling intervals: the pitch measured in the 100th-150th millisecond interval is m2, in the 150th-200th millisecond interval m4, in the 200th-250th millisecond interval m3, and in the 250th-300th millisecond interval m1. The pitch of each sampling interval can be measured by processing the audio file with a neural-network algorithm to obtain the pitch corresponding to that interval.
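Steps 2-1 to 2-3 amount to counting how often each pitch is measured across the sampling intervals. The sketch below is illustrative; the per-interval pitch detector itself (e.g., the neural network mentioned above) is outside its scope.

```python
from collections import Counter

def pitch_probability_set(interval_pitches):
    """Build {pitch: probability} from the pitch measured in each sampling
    interval; each probability is the fraction of intervals in which that
    pitch was measured."""
    counts = Counter(interval_pitches)
    total = len(interval_pitches)
    return {pitch: count / total for pitch, count in counts.items()}
```

For the four intervals above this gives a probability of 0.25 for each of m1, m2, m3, and m4.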
In summary, the pitch probability set corresponding to this target adjustment time group is {(m1, 0.25), (m2, 0.25), (m3, 0.25), (m4, 0.25)}, each probability being the fraction of sampling intervals in which that pitch was measured. The pitch probability set can also be stored in the form of Table 1 below.
Table 1
Pitch    Probability
m1       0.25
m2       0.25
m3       0.25
m4       0.25
According to the above method, a pitch probability set can be obtained for each target adjustment time group, i.e., multiple pitch probability sets are obtained, as shown in Table 2 below.
Table 2
[Table 2: the pitch probability set corresponding to each of target adjustment time groups 1-3]
Step S207: score the multiple pitch probability sets, and select the pitch probability set with the highest score.
The specific steps for scoring the multiple pitch probability sets are as follows:
3-1. Obtain multiple error penalty values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word.
3-2. Set the multiple error penalty values as the target error penalty value in turn, and obtain a first probability and a second probability from the pitch probability set corresponding to the target error penalty value, the first probability being the largest probability and the second probability the second-largest probability.
3-3. Score the pitch probability set corresponding to the target error penalty value according to the first probability, the second probability, and the target error penalty value.
The error penalty value R_i is computed as follows:
R_i = (abs(U_i - Y) + abs(V_i - Z)) * Q
where U_i is the target start adjustment time of the i-th target adjustment time group, V_i is the target end adjustment time of the i-th target adjustment time group, i is a positive integer, Y is the start time of the target word, Z is the end time of the target word, and Q is the error penalty coefficient.
Correspondingly, the formula for scoring a pitch probability set is:
S_i = T_i - O_i - R_i
where T_i is the first probability corresponding to the i-th error penalty value R_i, and O_i is the second probability corresponding to R_i. It should be noted that the further the first probability exceeds the second probability, the more reliable the vocal pitch recognition based on that target adjustment time group, and thus the higher the score S_i.
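The penalty R_i and the score S_i can be combined into one scoring routine. This is an illustrative sketch; the data layout (a list of (time group, pitch probability set) pairs) is an assumption, not part of the source.

```python
def score_sets(groups_with_sets, word_start, word_end, q):
    """Score every (target adjustment time group, pitch probability set)
    pair with S_i = T_i - O_i - R_i and return (best_score, best_set)."""
    best_score, best_set = None, None
    for (u, v), pitch_probs in groups_with_sets:
        r = (abs(u - word_start) + abs(v - word_end)) * q  # error penalty R_i
        probs = sorted(pitch_probs.values(), reverse=True)
        t = probs[0]                              # first (largest) probability T_i
        o = probs[1] if len(probs) > 1 else 0.0   # second-largest probability O_i
        s = t - o - r
        if best_score is None or s > best_score:
            best_score, best_set = s, pitch_probs
    return best_score, best_set
```

The pitch of the target word is then the most probable pitch in the winning set, e.g., `max(best_set, key=best_set.get)`.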
For the correspondence between target adjustment time groups and pitch probability sets shown in Table 2 above, suppose the error penalty coefficient Q is 0.0001, the end time Z of the target word is the 300th millisecond, and the start time Y of the target word is the 100th millisecond. The error penalty value R_1 corresponding to target adjustment time group 1 is then 0, the error penalty value R_2 corresponding to group 2 is 0.01, and the error penalty value R_3 corresponding to group 3 is 0.01.
Next, R_1 is first taken as the target error penalty value; the first probability T_1 and the second probability O_1 are read from pitch probability set 1 corresponding to R_1. Finally, pitch probability set 1 is scored according to T_1, O_1, and the target error penalty value, giving the score S_1 = T_1 - O_1 - R_1.
Similarly, R_2 is then taken as the target error penalty value; the first probability T_2 and the second probability O_2 are read from pitch probability set 2 corresponding to R_2, and set 2 is scored according to T_2, O_2, and R_2, giving the score S_2 = T_2 - O_2 - R_2.
By the same method, pitch probability set 3 corresponding to R_3 can be scored, giving the score S_3 = T_3 - O_3 - R_3; the detailed calculation is not repeated here.
Step S208: generate the pitch information of the target word according to the pitch probability set with the highest score.
Finally, comparing the scores S_1, S_2, and S_3 shows that the score S_3 is the highest. The pitch information of the target word is therefore generated from pitch probability set 3. Specifically, the pitch with the largest probability in pitch probability set 3 is selected as the pitch of the target word, i.e., m2 is taken as the pitch of the target word.
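The final selection in step S208 is a one-line arg-max over the winning set. The probability values below are hypothetical stand-ins, since the actual values belong to the unrecovered Table 2; the source only fixes m2 as the most probable pitch in set 3.

```python
# Hypothetical stand-in for pitch probability set 3 (actual values are in Table 2).
pitch_set_3 = {"m1": 0.1, "m2": 0.6, "m3": 0.2, "m4": 0.1}

# The pitch with the largest probability becomes the target word's pitch.
target_pitch = max(pitch_set_3, key=pitch_set_3.get)
```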
As can be seen from the above, the audio recognition method provided by the embodiments of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time of the target word, and then recognizes the audio file according to these adjustment times, improving the accuracy of audio recognition.
Based on the method described in the above embodiments, this embodiment is further described from the perspective of an audio recognition apparatus, which can be integrated in an electronic device.
Referring to FIG. 7, FIG. 7 is a structural diagram of an audio recognition apparatus provided by an embodiment of the present invention. The apparatus 30 includes an obtaining module 301, a setting module 302, a first determining module 303, and a recognition module 304.
(1) Obtaining module 301
The obtaining module 301 is configured to obtain an audio file and the text information corresponding to the audio file, the text information including multiple words.
When a user records a song with a singing application, the accompaniment, the vocals, and other sounds together form the audio file, in which all of these sounds exist as digital signals. To accurately recognize the vocals in the audio file, the start time and end time of the vocals in the audio file must be known.
As shown in FIG. 3, when the user records a song, the singing application displays the lyric text to prompt the user to sing. The time at which the lyrics start can therefore be taken roughly as the time the user starts singing, and the time the lyrics end as the time the user stops singing. Accordingly, after the obtaining module 301 obtains the audio file, it can further obtain the text information corresponding to the audio file to assist in recognizing the vocals; the text information contains multiple words, which correspond to the vocals.
(2) Setting module 302
The setting module 302 is configured to set each word in the text information as the target word in turn, and obtain the time information corresponding to the target word, the time information including the start time and the end time of the target word.
In actual singing, the times at which the user starts and ends are not necessarily fully synchronized with the times of the text information provided by the singing application. As shown in FIG. 3, suppose that in the lyrics provided by the application, the character "当" starts at the 43000th millisecond and ends at the 43300th millisecond, while the user actually sings "当" from the 42000th to the 42300th millisecond; detecting the vocals according to the application's original start and end times for "当" would then reduce the accuracy of audio recognition.
In summary, the setting module 302 can set each word in the text information as the target word in turn and obtain its time information, and adjusting this time information improves the accuracy of vocal recognition in the audio file. The time information includes the start time, the end time, and the duration of the target word.
As shown in FIG. 3, the lyrics contain 15 characters, which the setting module 302 can set as the target word one by one. Specifically, the setting module 302 first sets "当" as the target word, obtaining a start time of the 43000th millisecond, an end time of the 43300th millisecond, and a duration of 300 milliseconds.
In some embodiments, suppose the duration of a single word is roughly 100 milliseconds. If the setting module 302 detects that the duration of the target word is greater than 100 milliseconds, the target word may span multiple pitches, i.e., one target word may correspond to several pitches, where pitch refers to the perceived height of a tone.
To handle this one-word-multiple-pitch case, the audio recognition apparatus 30 is further provided with a second determining module 305, a splitting module 306, a determining module 307, and a continued-splitting module 308.
The second determining module 305 is configured to determine whether the duration of the target word is greater than a preset duration; the splitting module 306 is configured to split the target word when it is, and determine the duration of each split target word; the determining module 307 is configured to re-determine whether the duration of the split target word is greater than the preset duration; the continued-splitting module 308 is configured to continue splitting the split target word when it is, until the duration of every word in the text information is not greater than the preset duration.
The duration of the target word can be computed from its end time and start time. Specifically, if the target word starts at E and ends at F, its duration is (F - E).
Statistical analysis of a large amount of data yields the duration corresponding to a single pitch, so the preset duration can be set according to that single-pitch duration, for example set equal to it; the value of the preset duration is not specifically limited here.
If the second determining module 305 determines that the duration of the target word is greater than the preset duration, the target word may correspond to multiple pitches and therefore needs to be split until every word in the text information corresponds to only one pitch.
Specifically, the splitting module 306 can split the target word into a first target word and a second target word. Denote the preset duration by V. The start time of the first target word is set to E and its end time to (E + V); the start time of the second target word is set to (E + V) and its end time to F. After splitting in this manner, the duration of the first target word is V and the duration of the second target word is (F - E - V).
In summary, the duration V of the first target word is not greater than the preset duration V, so the determining module 307 only needs to re-examine whether the duration (F - E - V) of the second target word is greater than the preset duration V.
If the duration (F - E - V) of the second target word is not greater than the preset duration V, splitting of the second target word stops; if it is greater than the preset duration V, the continued-splitting module 308 splits the second target word according to the splitting method described above, which is not repeated here, until the duration of every word in the text information is not greater than the preset duration V.
(3) First determining module 303
The first determining module 303 is configured to determine, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determine, according to the end time of the target word, multiple end adjustment times corresponding to the target word.
In some embodiments, as shown in FIG. 8, the first determining module 303 includes an obtaining submodule 3031 and a determining submodule 3032.
The obtaining submodule 3031 is configured to obtain a preset time step and a preset maximum error value. The preset time step is a preset difference between two adjacent candidate time points; the smaller it is, the more accurately the actual start time and actual end time of the target word can be determined, but at a higher computation cost, so its value can be set according to the actual situation.
The preset maximum error value is a preset bound on the deviation between two time points; the larger it is, the more accurately the actual start time and actual end time of the target word can be determined, but again at a higher computation cost, so it can also be set according to the actual situation.
The determining submodule 3032 is configured to determine the multiple start adjustment times of the target word according to its start time, the preset time step, and the preset maximum error value, and the multiple end adjustment times according to its end time, the preset time step, and the preset maximum error value.
Specifically, if the target word starts at E and ends at F, the preset time step is I, and the preset maximum error value is J, the determining submodule 3032 can set the multiple start adjustment times of the target word as K1 = E - J, K2 = E - J + I, K3 = E - J + 2*I, ..., Kn = E + J, and similarly the multiple end adjustment times as L1 = F - J, L2 = F - J + I, L3 = F - J + 2*I, ..., Ln = F + J.
As shown in FIG. 5, if the start time E is the 400th millisecond, the end time F the 800th millisecond, the preset time step I 100 milliseconds, and the preset maximum error value J 300 milliseconds, the start adjustment times of the target word are the 100th, 200th, 300th, 400th, 500th, 600th, and 700th milliseconds, and the end adjustment times are the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
(4) Recognition module 304
The recognition module 304 is configured to recognize the audio file according to the multiple start adjustment times and the multiple end adjustment times of the target word, obtaining the pitch information of the target word.
In some embodiments, as shown in FIG. 9, the recognition module 304 includes a selecting submodule 3041, an obtaining submodule 3042, a scoring submodule 3043, and a generating submodule 3044.
The selecting submodule 3041 is configured to select a target start adjustment time from the multiple start adjustment times of the target word, and select a corresponding target end adjustment time from the multiple end adjustment times of the target word, obtaining multiple target adjustment time groups.
In some embodiments, the selecting submodule 3041 can select any one of the start adjustment times as the target start adjustment time, and any one of the end adjustment times as the target end adjustment time.
As shown in FIG. 5, the selecting submodule 3041 can select the 200th millisecond as the target start adjustment time from the 100th, 200th, 300th, and other start adjustment times, and the 800th millisecond as the target end adjustment time from the 700th, 800th, 900th, and other end adjustment times; the pair (200th millisecond, 800th millisecond) then forms one target adjustment time group. However, if the selecting submodule 3041 selected the 700th millisecond as the target start adjustment time and the 500th millisecond as the target end adjustment time, the unreasonable case would arise in which the target start adjustment time of the target word is later than its target end adjustment time.
To avoid this unreasonable case, after the determining submodule 3032 determines the multiple target start adjustment times and target end adjustment times, the selecting submodule 3041 can compare their value ranges; if the two ranges overlap, the overlap can be divided at a compromise point. As shown in FIG. 5, the overlap is the 500th-700th millisecond, so the selecting submodule 3041 can take its midpoint, the 600th millisecond, as the boundary between start and end adjustment times. After this division, the multiple target start adjustment times are the 100th, 200th, 300th, 400th, 500th, and 600th milliseconds, and the multiple end adjustment times are the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds.
In some embodiments, the selecting submodule 3041 can instead take the start adjustment times of the target word as the target start adjustment time one by one, and for each select all end adjustment times not smaller than it as its corresponding target end adjustment times.
As shown in FIG. 5, when the 100th millisecond is selected as the target start adjustment time, the selecting submodule 3041 can select the 500th, 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds as target end adjustment times; when the 600th millisecond is selected as the target start adjustment time, it can select the 600th, 700th, 800th, 900th, 1000th, and 1100th milliseconds. This likewise effectively prevents the unreasonable case in which the target start adjustment time is later than the target end adjustment time. Finally, each such pair of target start adjustment time and target end adjustment time forms a target adjustment time group.
The obtaining submodule 3042 is configured to determine the pitch probability set corresponding to each target adjustment time group, obtaining multiple pitch probability sets; a pitch probability set contains pitches, probabilities, and the association between them.
In some embodiments, the obtaining submodule 3042 can recognize the audio file according to the target start adjustment time and target end adjustment time of each target adjustment time group, obtaining a pitch probability set. The steps by which the obtaining submodule 3042 builds the pitch probability set are as follows:
dividing the audio file into multiple sampling intervals according to the target adjustment time group;
obtaining the pitch corresponding to each sampling interval, and the probability corresponding to each pitch;
storing the pitches, the probabilities, and the association between them, generating the pitch probability set corresponding to the target adjustment time group.
Specifically, take the pair (100th millisecond, 300th millisecond) as one target adjustment time group and 50 milliseconds as one sampling interval. As shown in FIG. 6, the obtaining submodule 3042 can then divide the audio file between the 100th and 300th milliseconds into 4 sampling intervals: the pitch measured in the 100th-150th millisecond interval is m2, in the 150th-200th millisecond interval m4, in the 200th-250th millisecond interval m3, and in the 250th-300th millisecond interval m1. The pitch of each sampling interval can be measured by processing the audio file with a neural-network algorithm to obtain the pitch corresponding to that interval.
In summary, the obtaining submodule 3042 can obtain the pitch probability set {(m1, 0.25), (m2, 0.25), (m3, 0.25), (m4, 0.25)} corresponding to this target adjustment time group, which can also be stored in the form of Table 1.
According to the above method, the obtaining submodule 3042 can obtain a pitch probability set for each target adjustment time group, i.e., multiple pitch probability sets, as shown in Table 2.
The scoring submodule 3043 is configured to score the multiple pitch probability sets and select the pitch probability set with the highest score.
In some embodiments, the scoring submodule 3043 is specifically configured to:
obtain multiple error penalty values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word;
set the multiple error penalty values as the target error penalty value in turn, and obtain a first probability and a second probability from the pitch probability set corresponding to the target error penalty value;
score the pitch probability set corresponding to the target error penalty value according to the first probability, the second probability, and the target error penalty value.
The error penalty value R_i is computed as follows:
R_i = (abs(U_i - Y) + abs(V_i - Z)) * Q
where U_i is the target start adjustment time of the i-th target adjustment time group, V_i is the target end adjustment time of the i-th target adjustment time group, i is a positive integer, Y is the start time of the target word, Z is the end time of the target word, and Q is the error penalty coefficient.
Correspondingly, the formula for scoring a pitch probability set is:
S_i = T_i - O_i - R_i
where T_i is the first probability corresponding to the i-th error penalty value R_i, and O_i is the second probability corresponding to R_i. It should be noted that the further the first probability exceeds the second probability, the more reliable the vocal pitch recognition based on that target adjustment time group, and thus the higher the score S_i.
For the correspondence between target adjustment time groups and pitch probability sets shown in Table 2 above, suppose the error penalty coefficient Q is 0.0001, the end time Z of the target word is the 300th millisecond, and the start time Y of the target word is the 100th millisecond. The scoring submodule 3043 can then obtain an error penalty value R_1 of 0 for target adjustment time group 1, an error penalty value R_2 of 0.01 for group 2, and an error penalty value R_3 of 0.01 for group 3.
Next, the scoring submodule 3043 first takes R_1 as the target error penalty value, reads the first probability T_1 and the second probability O_1 from pitch probability set 1 corresponding to R_1, and finally scores set 1 according to T_1, O_1, and the target error penalty value, giving the score S_1 = T_1 - O_1 - R_1.
Similarly, the scoring submodule 3043 then takes R_2 as the target error penalty value, reads the first probability T_2 and the second probability O_2 from pitch probability set 2 corresponding to R_2, and scores set 2 according to T_2, O_2, and R_2, giving the score S_2 = T_2 - O_2 - R_2.
By the same method, the scoring submodule 3043 can score pitch probability set 3 corresponding to R_3, giving the score S_3 = T_3 - O_3 - R_3; the detailed calculation is not repeated here.
The generating submodule 3044 is configured to generate the pitch information of the target word according to the pitch probability set with the highest score.
Finally, the generating submodule 3044 compares the scores S_1, S_2, and S_3 and finds that the score S_3 is the highest. The generating submodule 3044 therefore generates the pitch information of the target word from pitch probability set 3. Specifically, the generating submodule 3044 can select the pitch with the largest probability in pitch probability set 3 as the pitch of the target word, i.e., take m2 as the pitch of the target word.
The audio recognition apparatus of the embodiments of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time of the target word, and then recognizes the audio file according to these adjustment times, improving the accuracy of audio recognition.
Correspondingly, an embodiment of the present invention further provides an electronic device. FIG. 10 shows a schematic structural diagram of the electronic device involved in the embodiment of the present invention. Specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will understand that the structure shown in FIG. 10 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Specifically:
The processor 401 is the control center of the electronic device; it connects all parts of the electronic device through various interfaces and lines, and performs the device's functions and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles the operating system, the user interface, and applications, and a modem processor, which mainly handles wireless communication. It will be understood that the modem processor may also not be integrated into the processor 401.
The memory 402 can be used to store software programs and modules; the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area, which can store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function), and a data storage area, which can store data created through use of the electronic device. In addition, the memory 402 may include high-speed random-access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes the power supply 403 that powers the components; preferably, the power supply 403 can be logically connected to the processor 401 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any other component.
The electronic device may further include the input unit 404, which can be used to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 401 of the electronic device loads the executable files corresponding to the processes of one or more applications into the memory 402 according to the following instructions, and runs the applications stored in the memory 402 to implement various functions, as follows:
obtaining an audio file and the text information corresponding to the audio file, the text information including multiple words;
setting each word in the text information as the target word in turn, and obtaining the time information corresponding to the target word, the time information including the start time and the end time of the target word;
determining, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determining, according to the end time of the target word, multiple end adjustment times corresponding to the target word;
recognizing the audio file according to the multiple start adjustment times and the multiple end adjustment times of the target word, obtaining the pitch information of the target word.
The electronic device can achieve the beneficial effects achievable by any of the audio recognition apparatuses provided by the embodiments of the present invention; see the foregoing embodiments for details, which are not repeated here.
The electronic device of the embodiments of the present invention first determines multiple start adjustment times and multiple end adjustment times according to the start time and end time of the target word, and then recognizes the audio file according to these adjustment times, improving the accuracy of audio recognition.
Various operations of embodiments are provided herein. In one embodiment, the one or more operations may constitute computer-readable instructions stored on one or more computer-readable media, which, when executed by an electronic device, cause the computing device to perform the operations. The order in which some or all of the operations are described should not be construed as implying that these operations are necessarily order-dependent. Those skilled in the art will appreciate alternative orderings having the benefit of this description. Moreover, it should be understood that not all operations need be present in every embodiment provided herein.
Moreover, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to those skilled in the art based on a reading and understanding of this specification and the drawings. The disclosure includes all such modifications and alterations and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above-described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of the disclosure illustrated herein. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes", "having", "contains", or variants thereof are used in the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
The functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc. Each of the above apparatuses or systems can execute the method in the corresponding method embodiment.
In summary, although the present invention has been disclosed above by way of embodiments, the serial numbers before the embodiments are used for convenience of description only and do not limit the order of the embodiments. Furthermore, the above embodiments are not intended to limit the present invention; those of ordinary skill in the art can make various changes and refinements without departing from the spirit and scope of the present invention, and the protection scope of the present invention is therefore defined by the claims.

Claims (13)

  1. An audio recognition method, comprising:
    obtaining an audio file and text information corresponding to the audio file, the text information comprising multiple words;
    setting each word in the text information as a target word in turn, and obtaining time information corresponding to the target word, the time information comprising a start time of the target word and an end time of the target word;
    determining, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determining, according to the end time of the target word, multiple end adjustment times corresponding to the target word;
    recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word, to obtain pitch information of the target word.
  2. The audio recognition method according to claim 1, wherein the step of determining, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determining, according to the end time of the target word, multiple end adjustment times corresponding to the target word comprises:
    obtaining a preset time step and a preset maximum error value;
    determining, according to the start time of the target word, the preset time step, and the preset maximum error value, the multiple start adjustment times corresponding to the target word, and determining, according to the end time of the target word, the preset time step, and the preset maximum error value, the multiple end adjustment times corresponding to the target word.
  3. The audio recognition method according to claim 1, wherein the step of recognizing the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word, to obtain pitch information of the target word comprises:
    selecting a target start adjustment time from the multiple start adjustment times of the target word, and selecting, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups;
    determining a pitch probability set corresponding to each target adjustment time group, to obtain multiple pitch probability sets, the pitch probability set comprising pitches, probabilities, and the association between them;
    scoring the multiple pitch probability sets, and selecting the pitch probability set with the highest score;
    generating the pitch information of the target word according to the pitch probability set with the highest score.
  4. The audio recognition method according to claim 3, wherein the step of scoring the multiple pitch probability sets and selecting the pitch probability set with the highest score comprises:
    obtaining multiple error penalty values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word;
    setting the multiple error penalty values as a target error penalty value in turn, and obtaining a first probability and a second probability from the pitch probability set corresponding to the target error penalty value, wherein the first probability is the largest probability and the second probability is the second-largest probability;
    scoring the pitch probability set corresponding to the target error penalty value according to the first probability, the second probability, and the target error penalty value.
  5. The audio recognition method according to claim 3, wherein the step of determining a pitch probability set corresponding to each target adjustment time group, to obtain multiple pitch probability sets, the pitch probability set comprising pitches, probabilities, and the association between them comprises:
    dividing the audio file into multiple sampling intervals according to the target adjustment time group;
    obtaining the pitch corresponding to each sampling interval, and the probability corresponding to the pitch;
    storing the pitches, the probabilities, and the association between them, to generate the pitch probability set corresponding to the target adjustment time group.
  6. The audio recognition method according to claim 1, wherein the time information corresponding to the target word further comprises a duration of the target word, and after the step of setting each word in the text information as a target word in turn and obtaining the time information corresponding to the target word, the method further comprises:
    determining whether the duration of the target word is greater than a preset duration;
    if it is greater than the preset duration, splitting the target word, and determining the duration of each split target word;
    re-determining whether the duration of the split target word is greater than the preset duration;
    if it is greater than the preset duration, continuing to split the split target word, until the duration of every word in the text information is not greater than the preset duration.
  7. An audio recognition apparatus, comprising:
    an obtaining module, configured to obtain an audio file and text information corresponding to the audio file, the text information comprising multiple words;
    a setting module, configured to set each word in the text information as a target word in turn, and obtain time information corresponding to the target word, the time information comprising a start time of the target word and an end time of the target word;
    a first determining module, configured to determine, according to the start time of the target word, multiple start adjustment times corresponding to the target word, and determine, according to the end time of the target word, multiple end adjustment times corresponding to the target word;
    a recognition module, configured to recognize the audio file according to the multiple start adjustment times of the target word and the multiple end adjustment times of the target word, to obtain pitch information of the target word.
  8. The audio recognition apparatus according to claim 7, wherein the first determining module comprises:
    an obtaining submodule, configured to obtain a preset time step and a preset maximum error value;
    a determining submodule, configured to determine, according to the start time of the target word, the preset time step, and the preset maximum error value, the multiple start adjustment times corresponding to the target word, and determine, according to the end time of the target word, the preset time step, and the preset maximum error value, the multiple end adjustment times corresponding to the target word.
  9. The audio recognition apparatus according to claim 7, wherein the recognition module comprises:
    a selecting submodule, configured to select a target start adjustment time from the multiple start adjustment times of the target word, and select, from the multiple end adjustment times of the target word, a target end adjustment time corresponding to the target start adjustment time, to obtain multiple target adjustment time groups;
    an obtaining submodule, configured to determine a pitch probability set corresponding to each target adjustment time group, to obtain multiple pitch probability sets, the pitch probability set comprising pitches, probabilities, and the association between them;
    a scoring submodule, configured to score the multiple pitch probability sets and select the pitch probability set with the highest score;
    a generating submodule, configured to generate the pitch information of the target word according to the pitch probability set with the highest score.
  10. The audio recognition apparatus according to claim 9, wherein the scoring submodule is specifically configured to:
    obtain multiple error penalty values according to the start time of the target word, the end time of the target word, and the multiple target adjustment time groups of the target word;
    set the multiple error penalty values as a target error penalty value in turn, and obtain a first probability and a second probability from the pitch probability set corresponding to the target error penalty value;
    score the pitch probability set corresponding to the target error penalty value according to the first probability, the second probability, and the target error penalty value.
  11. The audio recognition apparatus according to claim 9, wherein the obtaining submodule is specifically configured to:
    divide the audio file into multiple sampling intervals according to the target adjustment time group;
    obtain the pitch corresponding to each sampling interval, and the probability corresponding to the pitch;
    store the pitches, the probabilities, and the association between them, to generate the pitch probability set corresponding to the target adjustment time group.
  12. The audio recognition apparatus according to claim 7, wherein the audio recognition apparatus further comprises:
    a second determining module, configured to determine whether the duration of the target word is greater than a preset duration;
    a splitting module, configured to split the target word when the duration is greater than the preset duration, and determine the duration of each split target word;
    a determining module, configured to re-determine whether the duration of the split target word is greater than the preset duration;
    a continued-splitting module, configured to continue splitting the split target word when the duration is still greater than the preset duration, until the duration of every word in the text information is not greater than the preset duration.
  13. A storage medium storing processor-executable instructions, wherein a processor, by executing the instructions, provides the audio recognition method according to any one of claims 1-6.
PCT/CN2019/103883 2018-10-15 2019-08-30 Audio recognition method and apparatus, and storage medium WO2020078120A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811198963.1A CN108962286B (zh) 2018-10-15 2018-10-15 Audio recognition method and apparatus, and storage medium
CN201811198963.1 2018-10-15

Publications (1)

Publication Number Publication Date
WO2020078120A1 true WO2020078120A1 (zh) 2020-04-23



Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788589A (zh) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and apparatus
CN107507628A (zh) * 2017-08-31 2017-12-22 广州酷狗计算机科技有限公司 Singing scoring method, apparatus, and terminal
EP3316257A1 (en) * 2016-10-28 2018-05-02 Fujitsu Limited Pitch extraction device and pitch extraction method
CN108008930A (zh) * 2017-11-30 2018-05-08 广州酷狗计算机科技有限公司 Method and apparatus for determining a karaoke score
CN108206026A (zh) * 2017-12-05 2018-06-26 北京小唱科技有限公司 Method and apparatus for determining pitch deviation of audio content
CN108962286A (zh) * 2018-10-15 2018-12-07 腾讯音乐娱乐科技(深圳)有限公司 Audio recognition method and apparatus, and storage medium



Also Published As

Publication number Publication date
CN108962286B (zh) 2020-12-01
CN108962286A (zh) 2018-12-07

