CN114360508A - Marking method, device, equipment and storage medium - Google Patents

Marking method, device, equipment and storage medium Download PDF

Info

Publication number
CN114360508A
CN114360508A
Authority
CN
China
Prior art keywords
audio
target
marked
phoneme
marking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111605160.5A
Other languages
Chinese (zh)
Inventor
黄丽莉
李良斌
陈孝良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202111605160.5A priority Critical patent/CN114360508A/en
Publication of CN114360508A publication Critical patent/CN114360508A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
                        • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
                    • G10L15/08 Speech classification or search
                        • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
                        • G10L2015/088 Word spotting
                    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a marking method, apparatus, device, and storage medium. The method includes: acquiring audio to be marked and a target phoneme of a wake-up word; determining the similarity between the first phoneme of a target audio in the audio to be marked and the target phoneme; and, when the similarity satisfies a preset condition, marking the target audio with a feature tag, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is residual audio of the wake-up word. The method can improve the efficiency of marking the residual audio of wake-up words in audio.

Description

Marking method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition, and in particular to a marking method, apparatus, device, and storage medium.
Background
Voice interaction is being applied ever more widely across industries, and voice wake-up technology is the entry point for voice interaction with a terminal. At present, a terminal is sometimes switched from the standby state to the working state before the complete wake-up word has been collected, and the residue of the wake-up word collected after the switch can adversely affect subsequent interaction between the user and the terminal.
When technicians reduce such adverse effects by optimizing the speech recognition method, the residual audio of the wake-up word in the audio must first be marked manually.
At present, manually marking the residual audio of the wake-up word in audio is inefficient.
Disclosure of Invention
The embodiments of the present application provide a marking method, apparatus, device, and storage medium that can improve the efficiency of marking the residual audio of wake-up words in audio.
In a first aspect, an embodiment of the present application provides a marking method, where the method includes:
acquiring audio to be marked and a target phoneme of a wake-up word;
determining the similarity between a first phoneme of a target audio in the audio to be marked and the target phoneme, and when the similarity satisfies a preset condition, marking the target audio with a feature tag, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is residual audio of the wake-up word.
In one possible implementation, the audio to be marked includes audio of instruction information, and the method further includes:
receiving a marking operation performed by a user on the audio of the instruction information, together with an instruction text corresponding to the instruction information input by the user;
in response to the marking operation, marking the audio of the instruction information in the audio to be marked with an instruction tag;
recording the audio to be marked that carries the feature tag and the instruction tag as marked audio, and training an instruction recognition model according to the marked audio and its instruction text to obtain a target instruction recognition model, where the target instruction recognition model is used to remove the residual audio of the wake-up word from the marked audio, recognize the audio of the instruction information, and obtain the instruction text corresponding to the instruction information.
In one possible implementation, before determining the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, the method further includes:
determining the target audio according to the amplitude variation pattern of the head-end audio of the audio to be marked.
In one possible implementation, determining the target audio according to the amplitude variation pattern of the head-end audio of the audio to be marked includes:
determining, as the target audio, the head-end audio of the audio to be marked whose amplitude variation pattern decreases from high to zero.
In one possible implementation, the target phoneme includes phonemes of a preset length at the tail end of the wake-up word.
In one possible implementation, determining the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and marking the target audio with the feature tag when the similarity satisfies the preset condition, includes:
determining, with a marking model, the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and marking the target audio with the feature tag when the similarity satisfies the preset condition.
In one possible implementation, before the marking model is used to determine the similarity and mark the target audio, the method further includes:
acquiring positive samples and negative samples, where the head end of a positive sample includes residual audio of the wake-up word and the head end of a negative sample does not;
training a model to be trained according to the positive samples and the negative samples to obtain the marking model.
In a second aspect, an embodiment of the present application provides a marking apparatus, including:
an acquisition module, configured to acquire audio to be marked and a target phoneme of a wake-up word;
a marking module, configured to determine the similarity between a first phoneme of a target audio in the audio to be marked and the target phoneme, and to mark the target audio with a feature tag when the similarity satisfies a preset condition, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is residual audio of the wake-up word.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, performs the method as in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the method as in the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product; when instructions in the computer program product are executed by a processor of an electronic device, the electronic device performs the method as in the first aspect or any possible implementation of the first aspect.
According to the marking method, apparatus, device, and storage medium provided by the embodiments of the present application, the audio to be marked and the target phoneme of the wake-up word are first acquired; then the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme is determined, and when the similarity satisfies a preset condition the target audio is marked with a feature tag, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is residual audio of the wake-up word. In the head-end audio of the audio to be marked, audio whose similarity to the residual phoneme of the wake-up word satisfies the preset condition is automatically marked with the feature tag as residual audio of the wake-up word, which improves the efficiency of marking the residual audio of wake-up words in audio.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below; a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a marking method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a marking device according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application will be described in detail below, and in order to make objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by illustrating examples thereof.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Voice interaction is being applied ever more widely across industries, and voice wake-up technology is the entry point for voice interaction with a terminal. At present, a terminal is sometimes switched from the standby state to the working state before the complete wake-up word has been collected; after the switch it continues to collect the residue of the wake-up word and then collects the user's voice instruction, and both are fed into an instruction recognition model to recognize the user's instruction. Because of the wake-up-word residue, however, the instruction recognition model has difficulty recognizing the user's instruction accurately. Technicians therefore need to optimize the instruction recognition model to improve its recognition accuracy. To do so, a large amount of audio that includes residual wake-up-word audio must first be screened out manually and the residual audio marked, yielding samples for training the instruction recognition model; but manually screening and marking residual wake-up-word audio is inefficient.
To solve the above problems, the embodiments of the present application provide a marking method, apparatus, device, and storage medium. The audio to be marked and the target phoneme of the wake-up word are first acquired; then the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme is determined, and when the similarity satisfies a preset condition the target audio is marked with a feature tag, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is residual audio of the wake-up word. In the head-end audio, audio whose similarity to the residual phoneme of the wake-up word satisfies the preset condition is marked automatically, which improves the efficiency of marking the residual audio of wake-up words in audio.
The method provided by the embodiment of the application is executed by a terminal with audio processing capability, such as a computer.
A marking method provided in an embodiment of the present application will be described in detail below with reference to fig. 1.
As shown in fig. 1, the method may include the steps of:
s110, acquiring target phonemes in the audio to be marked and the awakening words.
The terminal obtains the audio to be marked and the target phoneme in the awakening word.
And the audio to be marked is the audio received after the terminal is switched to the working state. When the user wakes up the terminal by voice, if the terminal is awakened when the terminal does not receive the complete awakening word, the terminal continues to receive the residual voice of the awakening word after being switched to the working state and receives the voice instruction of the user, so that the audio to be marked can also comprise the residual audio of the awakening word while the audio comprises the instruction information. The instruction information represents the instruction of the user to the terminal, and the terminal makes corresponding response according to the instruction information. The target phonemes include the remaining phonemes of the wake-up word.
In one example, the wake-up word is "baby", and if the terminal is woken up when receiving "baby", the audio received after the terminal is switched to the working state includes a residue of the wake-up word: "shellfish". The target phoneme in the wake-up word includes the remaining phoneme of the wake-up word, i.e., the phoneme of "bei".
S120: Determine the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and mark the target audio with the feature tag when the similarity satisfies the preset condition.
The target audio is the head-end audio of the audio to be marked, and the feature tag indicates that the target audio is residual audio of the wake-up word.
Since the residual audio of the wake-up word lies at the beginning of the audio received after the terminal switches to the working state, the terminal first recognizes the first phoneme of the head-end target audio of the audio to be marked and determines its similarity to the target phoneme; when the similarity satisfies the preset condition, the target audio is taken to be residual audio of the wake-up word and is marked with the feature tag.
In one example, the wake-up word is "baobei", the target phoneme is "bei", and the preset condition is a similarity greater than 60%. The terminal recognizes the first phoneme of the target audio as "bei", so the similarity to the target phoneme is 100%, satisfying the preset condition; the terminal therefore determines that the target audio is residual audio of the wake-up word and marks it with the feature tag.
In another example, with the same wake-up word, target phoneme, and preset condition, the terminal recognizes the first phoneme of the target audio as "ei" or "pei", giving a similarity of 66.7% to the target phoneme, which satisfies the preset condition; the terminal therefore determines that the target audio is residual audio of the wake-up word and marks it with the feature tag.
In a further example, with the same settings, the terminal recognizes the first phoneme of the target audio as "i", giving a similarity of 33.3%, which does not satisfy the preset condition; the terminal therefore determines that the target audio is not residual audio of the wake-up word and does not mark it with the feature tag.
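The matching rule in the examples above can be sketched in Python. The similarity metric is a hypothetical reconstruction inferred from the worked figures ("bei" vs "bei" gives 1.0, "pei" and "ei" give 2/3, "i" gives 1/3): the fraction of the target phoneme covered by the longest common suffix. The 60% threshold is the one used in the examples.

```python
def phoneme_similarity(first_phoneme: str, target_phoneme: str) -> float:
    """Fraction of the target phoneme matched by the longest common suffix
    of the two phoneme strings (hypothetical metric inferred from the
    worked examples; the application does not fix a specific formula)."""
    matched = 0
    for a, b in zip(reversed(first_phoneme), reversed(target_phoneme)):
        if a != b:
            break
        matched += 1
    return matched / len(target_phoneme)


def is_wake_word_residue(first_phoneme: str, target_phoneme: str,
                         threshold: float = 0.6) -> bool:
    """Mark the target audio when the similarity satisfies the preset
    condition (here: strictly greater than the threshold)."""
    return phoneme_similarity(first_phoneme, target_phoneme) > threshold
```

With target phoneme "bei", first phonemes "bei", "pei", and "ei" are marked as residue while "i" is not, matching the three examples.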
According to the method provided by this embodiment, the audio to be marked and the target phoneme of the wake-up word are first acquired; then the similarity between the first phoneme of the target audio and the target phoneme is determined, and when the similarity satisfies the preset condition the target audio is marked with the feature tag, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is residual audio of the wake-up word. Audio at the head end whose similarity to the residual phoneme of the wake-up word satisfies the preset condition is marked automatically as residual audio of the wake-up word, which improves the efficiency of marking the residual audio of wake-up words in audio.
In some embodiments, the audio to be marked includes audio of instruction information, and the method may further include the following steps.
First, receive a marking operation performed by the user on the audio of the instruction information, together with an instruction text corresponding to the instruction information input by the user.
The instruction information represents the user's instruction to the terminal, to which the terminal responds accordingly.
To train the instruction recognition model so that it recognizes the instruction information in audio more accurately, a technician marks the audio of the instruction information in the audio to be marked and provides the instruction text corresponding to the instruction information.
The instruction text is the text of the instruction information. For example, the instruction text of a certain piece of instruction information is "turn on the light".
Operating the terminal through a human-computer interaction interface, the user can mark the audio of the instruction information in the audio to be marked and input the corresponding instruction text; the terminal receives the user's marking operation on the audio of the instruction information and the instruction text input by the user.
Second, in response to the marking operation, mark the audio of the instruction information in the audio to be marked with an instruction tag.
The terminal responds to the user's marking operation by marking the audio of the instruction information in the audio to be marked with the instruction tag.
The instruction tag is a preset tag indicating that the marked audio is audio of instruction information.
After the terminal, in response to the marking operation, has marked the audio of the instruction information in the audio to be marked with the instruction tag, the manually marked audio of the instruction information is obtained.
Third, record the audio to be marked that carries the feature tag and the instruction tag as marked audio, and train the instruction recognition model according to the marked audio and its instruction text to obtain a target instruction recognition model.
The terminal records the audio to be marked carrying the feature tag and the instruction tag as marked audio, inputs the marked audio into the instruction recognition model to obtain a recognition result, and compares the result with the instruction text to obtain the recognition accuracy. If the accuracy meets the preset accuracy, the target instruction recognition model is obtained; otherwise the terminal adjusts the instruction recognition model automatically, or a technician adjusts it, and the marked audio is fed into the adjusted model; once the adjusted model's accuracy meets the preset accuracy, the target instruction recognition model is obtained.
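As a sketch of how the feature tag and the instruction tag might be combined into one marked-audio record and consumed during training, assuming span-based labels (the record layout and all field names are hypothetical; the application does not prescribe a storage format):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MarkedAudio:
    """One training sample: raw samples plus a feature tag marking the
    wake-word residue span and an instruction tag marking the
    instruction span (layout is an illustrative assumption)."""
    samples: List[float]
    feature_span: Tuple[int, int]      # wake-word residue, [start, end)
    instruction_span: Tuple[int, int]  # instruction audio, [start, end)
    instruction_text: str


def strip_residue(marked: MarkedAudio) -> List[float]:
    """Drop the feature-tagged residue and keep the rest of the audio,
    as the target instruction recognition model is expected to do."""
    start, end = marked.feature_span
    return marked.samples[:start] + marked.samples[end:]
```

A model trained on such records learns to ignore the `feature_span` region and map the `instruction_span` audio to `instruction_text`.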
The target instruction recognition model is used to remove the residual audio of the wake-up word from the marked audio, recognize the audio of the instruction information, and obtain the instruction text corresponding to the instruction information.
In one embodiment, if the audio to be marked does not include residual audio of the wake-up word, the target instruction recognition model simply recognizes the audio of the instruction information to obtain the corresponding instruction text.
With the terminal's instruction recognition model updated to the target instruction recognition model, the terminal can accurately recognize the audio of the instruction information even when the audio received after switching to the working state includes residual audio of the wake-up word.
According to this embodiment, the terminal receives the user's marking operation on the audio of the instruction information and the corresponding instruction text input by the user; in response to the marking operation, marks the audio of the instruction information in the audio to be marked with the instruction tag; records the audio carrying the feature tag and the instruction tag as marked audio; and trains the instruction recognition model on the marked audio and its instruction text to obtain the target instruction recognition model. The trained model can remove the residual audio of the wake-up word from the marked audio, recognize the audio of the instruction information, and obtain the corresponding instruction text, so the terminal can accurately recognize instruction audio even when the audio received after switching to the working state includes residual audio of the wake-up word.
In some embodiments, before determining the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, i.e., before S120, the method may further include:
determining the target audio according to the amplitude variation pattern of the head-end audio of the audio to be marked.
The terminal analyzes the amplitude variation pattern of the head-end audio of the audio to be marked and determines the target audio within the head-end audio according to that pattern.
Determining the target audio from the amplitude variation pattern of the head-end audio yields audio that may include the residue of the wake-up word.
In some embodiments, determining the target audio according to the amplitude variation pattern of the head-end audio of the audio to be marked may include:
determining, as the target audio, the head-end audio of the audio to be marked whose amplitude variation pattern decreases from high to zero.
Users habitually pause briefly after speaking the complete wake-up word before speaking the voice instruction; consequently, between the residual audio of the wake-up word and the audio of the instruction information, the amplitude of the received audio starts high and falls to zero. The terminal therefore determines that head-end audio whose amplitude decreases from high to zero is the target audio.
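A minimal sketch of this head-end test, using a frame-level mean-amplitude envelope. The frame length and silence threshold are illustrative assumptions, not values given in the application:

```python
import numpy as np


def find_head_target_audio(samples: np.ndarray,
                           frame_len: int = 160,
                           silence_thresh: float = 0.01):
    """Return the head-end slice whose amplitude envelope decreases from
    high to (near) zero, or None if the head end does not match."""
    # Mean absolute amplitude per non-overlapping frame.
    env = [float(np.mean(np.abs(samples[i:i + frame_len])))
           for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not env or env[0] <= silence_thresh:
        return None  # head end starts silent: no residue candidate
    for idx in range(1, len(env)):
        if env[idx] > env[idx - 1]:
            return None  # envelope rose before reaching zero
        if env[idx] <= silence_thresh:
            # High-to-zero pattern found: return the candidate residue.
            return samples[:(idx + 1) * frame_len]
    return None
```

A linearly decaying burst followed by silence is returned as target audio; audio that starts silent, or whose envelope rises before falling to zero, yields None.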
In this embodiment, the terminal determines that head-end audio of the audio to be marked whose amplitude variation pattern decreases from high to zero is the target audio, yielding a more accurate candidate for the residual audio of the wake-up word.
In some embodiments, determining the target audio according to the amplitude variation pattern of the head-end audio of the audio to be marked may include:
determining the target audio with a marking model according to the amplitude variation pattern of the head-end audio of the audio to be marked.
The marking model recognizes the amplitude variation pattern of the head-end audio of the audio to be marked and determines the target audio according to that pattern.
Providing a marking model that recognizes the amplitude variation pattern of the head-end audio and determines the target audio accordingly improves the efficiency of identifying the target audio.
In some embodiments, determining that head-end audio of the audio to be marked whose amplitude decreases from high to zero is the target audio may include:
determining, with the marking model, that head-end audio of the audio to be marked whose amplitude variation pattern decreases from high to zero is the target audio.
The marking model recognizes the amplitude variation pattern of the head-end audio of the audio to be marked and determines audio whose amplitude decreases from high to zero as the target audio.
Between receiving the residual audio of the wake-up word and receiving the audio of the instruction information, the amplitude of the received audio starts high and falls to zero; the marking model therefore determines audio with this high-to-zero pattern as the target audio, i.e., as audio that may include the residue of the wake-up word.
With the marking model recognizing the amplitude variation pattern of the head-end audio and determining audio whose amplitude decreases from high to zero as the target audio, the efficiency of identifying the target audio is improved while audio that may include the wake-up-word residue is selected as the target audio.
In some embodiments, before determining, by using the marking model, that the head-end audio of the audio to be marked whose amplitude change rule decreases from high to zero is the target audio, the method may further include:
first acquiring positive samples and negative samples.
The terminal acquires a large number of positive samples and negative samples for training the model to be trained.
Both the positive samples and the negative samples are audio to be marked: the head end of a positive sample includes audio whose amplitude decreases from high to zero, while the head end of a negative sample does not.
Then, the model to be trained is trained on the positive and negative samples to obtain the marking model.
The terminal inputs the positive and negative samples into the model to be trained. The model identifies the amplitude change rule of the head-end audio of each sample and determines the samples whose amplitude decreases from high to zero as target audio; the accuracy is then computed, the model is adjusted according to the accuracy, and training stops once the accuracy reaches a preset accuracy, yielding the marking model.
According to the method provided by this embodiment of the application, the marking model obtained through training can automatically identify the amplitude rule of the head-end audio in the audio to be marked and determine the samples whose amplitude decreases from high to zero as the target audio, which improves the efficiency of determining the target audio in the audio to be marked.
In some embodiments, the target phonemes include phonemes of a preset length at the end of the wake-up word.
The residue of the wake-up word is the word at its tail end, and the target phonemes in the wake-up word include the phonemes of a preset length at that tail end; that is, the target phonemes include the residual phonemes of the wake-up word.
The preset length is set according to the number of phonemes of the wake-up word.
In one example, the wake-up word is "baby", whose phonemes are "baobei", and the residue of the wake-up word is its tail syllable. With the preset length equal to 3, the number of residual phonemes, the terminal acquires the target phonemes "bei" from the wake-up word.
In another example, with the preset length equal to 2, the terminal acquires the target phonemes "ei" from the wake-up word.
According to the method provided by this embodiment of the application, using the phonemes of a preset length at the tail end of the wake-up word as the target phonemes facilitates identifying the residual audio of the wake-up word in the audio to be marked.
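Following the "baobei" examples above, extracting the preset-length tail of the wake-up word's phoneme sequence might look like the following minimal sketch (representing the phonemes as a plain string is an assumption for illustration):

```python
def target_phonemes(wake_word_phonemes: str, preset_length: int) -> str:
    """Sketch: take the preset-length tail of the wake-up word's
    phoneme string as the target phonemes."""
    return wake_word_phonemes[-preset_length:]
```

With `wake_word_phonemes = "baobei"`, a preset length of 3 yields `"bei"` and a preset length of 2 yields `"ei"`, matching the examples in the text.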
In some embodiments, determining the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and marking the target audio with a feature tag when the similarity satisfies a preset condition (that is, S120), may include:
determining, by using a marking model, the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and marking the target audio with the feature tag when the similarity satisfies the preset condition.
The marking model identifies the first phoneme of the target audio in the audio to be marked, calculates the similarity between that first phoneme and the target phoneme, and marks the target audio with the feature tag when the similarity satisfies the preset condition.
According to the method provided by this embodiment of the application, a marking model is used to determine the similarity between the first phoneme of the target audio and the target phoneme, and the target audio is marked with the feature tag when the similarity satisfies the preset condition, thereby automatically marking the residue of the wake-up word.
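The disclosure does not specify the similarity measure or the preset condition. As one plausible sketch, a normalized edit distance over phoneme strings can serve as the similarity, with a hypothetical threshold standing in for the preset condition:

```python
def phoneme_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity in [0, 1]. The metric is
    an illustrative assumption -- the text only requires some
    similarity measure between phoneme sequences."""
    if not a and not b:
        return 1.0
    m, n = len(a), len(b)
    # Standard Levenshtein dynamic-programming table.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[m][n] / max(m, n)

def maybe_tag(first_phoneme: str, target_phoneme: str,
              threshold: float = 0.8):
    """Attach the feature tag when the similarity meets the preset
    condition (here: >= a hypothetical threshold)."""
    sim = phoneme_similarity(first_phoneme, target_phoneme)
    return {"feature_tag": "wake_word_residual"} if sim >= threshold else None
```

A head-end phoneme matching the target phonemes "bei" exactly would be tagged as wake-word residue, while an unrelated phoneme such as "ni" would fall below the threshold and be left untagged.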
In some embodiments, before determining, by using the marking model, the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme and marking the target audio with the feature tag when the similarity satisfies the preset condition, the method may further include the following steps:
first acquiring positive samples and negative samples.
The terminal acquires a large number of positive samples and negative samples for training the model to be trained.
Both the positive samples and the negative samples are audio: the head end of a positive sample includes the residual audio of the wake-up word, which is marked with the feature tag, while the head end of a negative sample does not include the residual audio of the wake-up word.
Then, the model to be trained is trained on the positive and negative samples to obtain the marking model.
The terminal inputs the positive and negative samples into the model to be trained, computes the marking accuracy, adjusts the model according to the accuracy, and stops training once the accuracy reaches a preset accuracy, yielding the marking model.
According to the method provided by this embodiment of the application, the marking model obtained through training can automatically identify and mark the residual audio of the wake-up word in the audio to be marked, which improves the efficiency of marking the residual audio of the wake-up word.
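The train-until-accurate loop described above (evaluate on the positive and negative samples, adjust, stop when a preset accuracy is reached) can be sketched as follows. The `predict`/`update` interface, the target accuracy, and the round limit are all hypothetical:

```python
def train_marking_model(model, positives, negatives,
                        target_accuracy=0.95, max_rounds=100):
    """Sketch of the train-until-accurate loop. `model` is assumed
    to expose `predict(audio) -> bool` (True = head end tagged as
    wake-word residual) and `update(samples, labels)`; both names
    are hypothetical."""
    samples = [(p, True) for p in positives] + [(n, False) for n in negatives]
    for _ in range(max_rounds):
        # Accuracy over the labeled positive and negative samples.
        correct = sum(model.predict(x) == y for x, y in samples)
        accuracy = correct / len(samples)
        if accuracy >= target_accuracy:
            break  # preset accuracy met: training stops
        model.update([x for x, _ in samples], [y for _, y in samples])
    return model
```

The same loop shape applies to both trainings described in this section: the amplitude-rule marking model and the phoneme-similarity marking model differ only in what `predict` inspects.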
The embodiment of the present application also provides a marking apparatus, as shown in fig. 2, the apparatus 200 may include an obtaining module 210 and a marking module 220.
The obtaining module 210 is configured to obtain the audio to be marked and the target phonemes in the wake-up word.
The labeling module 220 is configured to determine a similarity between a first phoneme of a target audio in the audio to be labeled and the target phoneme, and label the target audio with a feature tag when the similarity satisfies a preset condition.
The target audio is the head end audio of the audio to be marked, and the characteristic label represents that the target audio is the residual audio of the awakening word.
The apparatus provided by this embodiment of the application first obtains the audio to be marked and the target phonemes in the wake-up word; it then determines the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and marks the target audio with the feature tag when the similarity satisfies a preset condition, where the target audio is the head-end audio of the audio to be marked and the feature tag indicates that the target audio is the residual audio of the wake-up word. Among the head-end audio of the audio to be marked, the audio whose similarity to the residual phonemes of the wake-up word satisfies the preset condition is automatically marked with the feature tag as the residual audio of the wake-up word, which improves the efficiency of marking the residual audio of the wake-up word.
In some embodiments, the audio to be tagged includes the audio of the instructional information, and the apparatus 200 may further include a receiving module 230, a response module 240, and a training module 250.
The receiving module 230 is configured to receive the user's marking operation on the audio of the instruction information and the instruction text corresponding to the instruction information input by the user.
The response module 240 is configured to, in response to the marking operation, mark the audio of the instruction information in the audio to be marked with the instruction tag.
The training module 250 is configured to record the audio to be marked carrying the feature tag and the instruction tag as marked audio, and to train the instruction recognition model according to the marked audio and its instruction text to obtain the target instruction recognition model.
The target instruction identification model is used for eliminating residual audio of the awakening words in the marked audio, identifying the audio of the instruction information and obtaining an instruction text corresponding to the instruction information.
In one embodiment, if the audio to be marked does not include the residual audio of the wake-up word, the target instruction recognition model directly recognizes the audio of the instruction information to obtain the corresponding instruction text.
The apparatus provided by this embodiment receives the user's marking operation on the audio of the instruction information, together with the instruction text corresponding to the instruction information input by the user; in response to the marking operation, it marks the audio of the instruction information in the audio to be marked with the instruction tag; it then records the audio to be marked carrying the feature tag and the instruction tag as marked audio and trains the instruction recognition model on the marked audio and its instruction text to obtain the target instruction recognition model. The trained target instruction recognition model can remove the residual audio of the wake-up word from the marked audio, recognize the audio of the instruction information, and obtain the corresponding instruction text. Using this model, the terminal can accurately recognize the audio of the instruction information even when the audio received after switching to the working state includes the residual audio of the wake-up word.
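The removal step the target instruction recognition model performs can be sketched as filtering out feature-tagged segments before recognition. The `(segment, tag)` representation of marked audio and the recognizer callable are assumptions introduced here for illustration:

```python
def recognize_instruction(marked_audio, recognizer):
    """Sketch: drop any span carrying the feature tag (the
    wake-word residual), then recognize the remaining instruction
    audio. `marked_audio` is assumed to be a list of
    (segment, tag) pairs; `recognizer` maps the surviving
    segments to instruction text."""
    instruction_segments = [seg for seg, tag in marked_audio
                            if tag != "wake_word_residual"]
    return recognizer(instruction_segments)
```

For marked audio whose head end carries the feature tag, only the instruction segments reach the recognizer, mirroring the behavior described for the target instruction recognition model.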
In some embodiments, the apparatus 200 may also include a determination module 260.
The determining module 260 is configured to determine the target audio according to the amplitude change rule of the head-end audio of the audio to be marked.
The apparatus provided by this embodiment determines the target audio according to the amplitude change rule of the head-end audio of the audio to be marked, obtaining the audio that may include the residue of the wake-up word.
In some embodiments, the determining module 260 may be specifically configured to:
and determining that the head end amplitude change rule of the audio to be marked is the audio which is reduced from high to zero, and the audio is the target audio.
The device provided by the embodiment of the application determines that the audio frequency of the audio frequency to be marked, the amplitude change rule of which is reduced from high to zero, is the target audio frequency, and obtains more accurate residual audio frequency which possibly comprises the awakening word.
In some embodiments, the target phonemes include phonemes of a preset length at the end of the wake-up word.
According to the device provided by the embodiment of the application, the phoneme with the preset length at the tail end of the awakening word is used as the target phoneme, so that the residual audio of the awakening word in the audio to be marked can be conveniently identified.
In some embodiments, the marking module 220 may be specifically configured to:
determine, by using a marking model, the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and mark the target audio with the feature tag when the similarity satisfies the preset condition.
The apparatus provided by this embodiment uses a marking model to determine the similarity between the first phoneme of the target audio in the audio to be marked and the target phoneme, and marks the target audio with the feature tag when the similarity satisfies the preset condition, thereby automatically marking the residue of the wake-up word.
In some embodiments, the obtaining module 210 may be further configured to obtain positive and negative samples.
The positive sample and the negative sample are audio to be marked, the head end of the positive sample comprises residual audio of the awakening word, and the head end of the negative sample does not comprise the residual audio of the awakening word.
The training module 250 may further be configured to train the model to be trained according to the positive and negative samples to obtain the marking model.
The device provided by the embodiment of the application obtains the mark model through training, and can automatically identify and mark the residual audio frequency of the awakening word in the audio frequency to be marked, so that the efficiency of marking the residual audio frequency of the awakening word in the audio frequency to be marked is improved.
The marking device provided in the embodiment of the present application performs each step in the method shown in fig. 1, and can achieve the technical effect of improving the efficiency of marking the residual audio of the wakeup word in the audio, which is not described in detail herein for brevity.
Fig. 3 shows a hardware structure diagram of an electronic device according to an embodiment of the present application.
The electronic device may comprise a processor 301 and a memory 302 in which computer program instructions are stored.
Specifically, the processor 301 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
Memory 302 may include mass storage for data or instructions. By way of example, and not limitation, memory 302 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 302 may include removable or non-removable (or fixed) media, where appropriate. The memory 302 may be internal or external to the electronic device, where appropriate. In a particular embodiment, the memory 302 is a non-volatile solid-state memory. In a particular embodiment, the memory 302 includes read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), flash memory, or a combination of two or more of these.
The processor 301 implements any of the tagging methods in the embodiment shown in fig. 1 by reading and executing computer program instructions stored in the memory 302.
In one example, the electronic device may also include a communication interface 303 and a bus 310. As shown in fig. 3, the processor 301, the memory 302, and the communication interface 303 are connected via a bus 310 to complete communication therebetween.
The communication interface 303 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiment of the present application.
Bus 310 includes hardware, software, or both to couple the components of the electronic device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), another suitable bus, or a combination of two or more of these. Bus 310 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The electronic device may execute the marking method in the embodiment of the present application, thereby implementing the marking method described in conjunction with fig. 1.
In addition, in combination with the marking method in the foregoing embodiments, the embodiments of the present application may be implemented by providing a computer-readable storage medium. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the tagging methods in the embodiments described above.
In combination with the marking methods in the foregoing embodiments, the embodiments of the present application may provide a computer program product to implement. The instructions in the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform any one of the marking methods as in the above embodiments.
It is to be understood that the present application is not limited to the particular arrangements and instrumentality described above and shown in the attached drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions or change the order between the steps after comprehending the spirit of the present application.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
As described above, only the specific embodiments of the present application are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application.

Claims (11)

1. A method of marking, the method comprising:
acquiring audio to be marked and target phonemes in an awakening word;
determining similarity between a first phoneme of a target audio in the audio to be marked and the target phoneme, and when the similarity meets a preset condition, marking the target audio by using a feature tag, wherein the target audio is a head-end audio of the audio to be marked, and the feature tag represents that the target audio is a residual audio of the awakening word.
2. The method according to claim 1, wherein the audio to be marked comprises audio of instruction information, the method further comprising:
receiving marking operation of the audio frequency of the instruction information by a user and instruction text corresponding to the instruction information input by the user;
in response to the marking operation, marking the audio frequency of the instruction information in the audio frequency to be marked by adopting an instruction label;
recording the audio marked by the feature tag and the audio marked by the instruction tag to be marked as a marked audio, training an instruction recognition model according to the marked audio and the instruction text corresponding to the marked audio to obtain a target instruction recognition model, wherein the target instruction recognition model is used for eliminating residual audio of the awakening words in the marked audio, recognizing the audio of the instruction information and obtaining the instruction text corresponding to the instruction information.
3. The method of claim 1, wherein before the determining the similarity between the first phoneme of the target audio in the audio to be labeled and the target phoneme, the method further comprises:
and determining the target audio according to the amplitude change rule of the head end audio of the audio to be marked.
4. The method as claimed in claim 3, wherein the determining the target audio according to the amplitude variation law of the audio head end audio to be marked comprises:
and determining that the head end amplitude change rule of the audio to be marked is the audio which is reduced from high to zero as the target audio.
5. The method of claim 1, wherein the target phone comprises a phone of a preset length at the end of the wake-up word.
6. The method of claim 1, wherein the determining a similarity between a first phoneme of a target audio in the audio to be labeled and the target phoneme, and when the similarity satisfies a preset condition, labeling the target audio with a feature tag comprises:
and determining the similarity between a first phoneme of a target audio in the audio to be marked and the target phoneme by adopting a marking model, and marking the target audio by adopting a feature tag when the similarity meets a preset condition.
7. The method according to claim 6, wherein before said determining, by using a labeling model, a similarity between a first phoneme of a target audio in the audio to be labeled and the target phoneme, and when the similarity satisfies a preset condition, labeling the target audio by using a feature label, the method further comprises:
acquiring a positive sample and a negative sample, wherein the head end of the positive sample comprises the residual audio frequency of the awakening word, and the head end of the negative sample does not comprise the residual audio frequency of the awakening word;
and training a model to be trained according to the positive sample and the negative sample to obtain the labeled model.
8. A marking device, the device comprising:
the acquisition module is used for acquiring the audio to be marked and target phonemes in the awakening word;
the marking module is used for determining the similarity between a first phoneme of a target audio in the audio to be marked and the target phoneme, and when the similarity meets a preset condition, marking the target audio by adopting a feature tag, wherein the target audio is a head-end audio of the audio to be marked, and the feature tag represents that the target audio is a residual audio of the awakening word.
9. An electronic device, characterized in that the device comprises: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the tagging method of any one of claims 1-7.
10. A computer-readable storage medium, having computer program instructions stored thereon, which, when executed by a processor, implement the marking method of any one of claims 1-7.
11. A computer program product, wherein instructions in the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform the tagging method of any one of claims 1-7.
CN202111605160.5A 2021-12-24 2021-12-24 Marking method, device, equipment and storage medium Pending CN114360508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111605160.5A CN114360508A (en) 2021-12-24 2021-12-24 Marking method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111605160.5A CN114360508A (en) 2021-12-24 2021-12-24 Marking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114360508A true CN114360508A (en) 2022-04-15

Family

ID=81100519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111605160.5A Pending CN114360508A (en) 2021-12-24 2021-12-24 Marking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114360508A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100179811A1 (en) * 2009-01-13 2010-07-15 Crim Identifying keyword occurrences in audio data
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN109994106A (en) * 2017-12-29 2019-07-09 阿里巴巴集团控股有限公司 A kind of method of speech processing and equipment
US10649727B1 (en) * 2018-05-14 2020-05-12 Amazon Technologies, Inc. Wake word detection configuration
CN111833874A (en) * 2020-07-10 2020-10-27 上海茂声智能科技有限公司 Man-machine interaction method, system, equipment and storage medium based on identifier
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium
US20200380978A1 (en) * 2017-12-08 2020-12-03 Samsung Electronics Co., Ltd. Electronic device for executing application by using phoneme information included in audio data and operation method therefor
US20210210073A1 (en) * 2019-05-09 2021-07-08 Lg Electronics Inc. Artificial intelligence device for providing speech recognition function and method of operating artificial intelligence device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100179811A1 (en) * 2009-01-13 2010-07-15 Crim Identifying keyword occurrences in audio data
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
US20200380978A1 (en) * 2017-12-08 2020-12-03 Samsung Electronics Co., Ltd. Electronic device for executing application by using phoneme information included in audio data and operation method therefor
CN109994106A (en) * 2017-12-29 2019-07-09 阿里巴巴集团控股有限公司 A kind of method of speech processing and equipment
US10649727B1 (en) * 2018-05-14 2020-05-12 Amazon Technologies, Inc. Wake word detection configuration
US20210210073A1 (en) * 2019-05-09 2021-07-08 Lg Electronics Inc. Artificial intelligence device for providing speech recognition function and method of operating artificial intelligence device
CN111833874A (en) * 2020-07-10 2020-10-27 上海茂声智能科技有限公司 Man-machine interaction method, system, equipment and storage medium based on identifier
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kim, Byeonggeun, et al.: "Broadcasted Residual Learning for Efficient Keyword Spotting", Interspeech Conference, 3 September 2021 (2021-09-03), pages 4538-4542 *

Similar Documents

Publication Publication Date Title
CN103165129B (en) Method and system for optimizing voice recognition acoustic model
CN104078044B (en) The method and apparatus of mobile terminal and recording search thereof
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN112200273B (en) Data annotation method, device, equipment and computer storage medium
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
CN109326305B (en) Method and system for batch testing of speech recognition and text synthesis
CN109801628B (en) Corpus collection method, apparatus and system
CN111028842B (en) Method and equipment for triggering voice interaction response
CN113903363B (en) Violation behavior detection method, device, equipment and medium based on artificial intelligence
CN109670148A (en) Collection householder method, device, equipment and storage medium based on speech recognition
CN108111538A (en) Smart projector speech control system and its method based on sound groove recognition technology in e
CN113053390B (en) Text processing method and device based on voice recognition, electronic equipment and medium
CN114023315A (en) Voice recognition method and device, readable medium and electronic equipment
CN112417850A (en) Error detection method and device for audio annotation
CN113380238A (en) Method for processing audio signal, model training method, apparatus, device and medium
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN112185425B (en) Audio signal processing method, device, equipment and storage medium
CN111538823A (en) Information processing method, model training method, device, equipment and medium
CN111160026B (en) Model training method and device, and text processing method and device
CN114360508A (en) Marking method, device, equipment and storage medium
CN114267342A (en) Recognition model training method, recognition method, electronic device and storage medium
CN112669850A (en) Voice quality detection method and device, computer equipment and storage medium
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
CN115295020A (en) Voice evaluation method and device, electronic equipment and storage medium
CN111883109B (en) Voice information processing and verification model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination