CN112116908B - Wake-up audio determining method, device, equipment and storage medium

Wake-up audio determining method, device, equipment and storage medium

Info

Publication number
CN112116908B
Authority
CN
China
Prior art keywords
audio
matching degree
wake
awakening
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011293307.7A
Other languages
Chinese (zh)
Other versions
CN112116908A (en)
Inventor
陈孝良
冯大航
陈天峰
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202011293307.7A
Publication of CN112116908A
Application granted
Publication of CN112116908B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a wake-up audio determination method, apparatus, device, and storage medium, belonging to the field of voice technology. In the embodiments of the present application, wake-up audio and non-wake-up audio are modeled separately, each corresponding to several sentence states that together form a sentence state sequence, so that when the audio features of an audio are classified, it can be determined whether the audio is closer to the wake-up audio or to the non-wake-up audio. In this process, the wake-up audio and the non-wake-up audio are modeled directly and independently of each other; no model is built for each phoneme, no model trained on frame-level annotation data is required, and no recognition result has to be determined for each phoneme during recognition, which greatly reduces the amount of computation and improves recognition efficiency.

Description

Wake-up audio determining method, device, equipment and storage medium
Technical Field
The present application relates to the field of voice technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a wake-up audio.
Background
In recent years, with the continuous development of audio processing technology, intelligent voice interaction systems such as smart speakers and in-vehicle voice interaction systems have become widespread. To reduce user operations, such systems provide a voice wake-up function: the collected audio is recognized to determine whether it is wake-up speech, so that the device can be woken up by voice.
In the related art, the wake-up speech determination method is generally as follows: feature extraction is performed on the speech to be processed to obtain fixed-length speech features, and the speech features are input into a wake-up acoustic model for classification. The sample data required to train the wake-up acoustic model must carry frame-level annotation data, which is usually obtained by aligning the speech data with a pre-trained acoustic model.
In this manner of obtaining annotation data through a trained acoustic model, the alignment result greatly affects the performance of the subsequent model. For example, if the alignment model performs poorly and the alignment result has low accuracy, a model trained on that low-accuracy alignment result as annotation data will also perform poorly. Obtaining highly accurate annotation data requires retraining the acoustic model with large-scale sample data, which is costly and inefficient.
Disclosure of Invention
The embodiments of the present application provide a wake-up audio determination method, apparatus, device, and storage medium, which can reduce the amount of computation and improve recognition efficiency. The technical solution of the present application is described below.
In one aspect, a wake-up audio determination method is provided, the method including:
performing feature extraction on an audio to obtain audio features of the audio;
classifying the audio features of the audio to obtain matching degrees between the audio and multiple sentence state sequences, where the multiple sentence state sequences at least include multiple sentence states of wake-up audio and of non-wake-up audio, respectively;
and determining whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences.
In some embodiments, performing feature extraction on the audio to obtain the audio features of the audio includes:
performing feature extraction on each audio frame in the audio to obtain the audio features of each audio frame;
and classifying the audio features of the audio to obtain the matching degrees between the audio and the multiple sentence state sequences includes:
classifying the audio features of each audio frame to obtain the matching degrees between each audio frame and multiple sentence states;
and obtaining the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states.
In some embodiments, classifying the audio features of each audio frame to obtain the matching degrees between each audio frame and the multiple sentence states includes:
classifying the audio features of each audio frame to obtain the probability distribution of each audio frame over the multiple sentence states;
and obtaining the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states includes:
obtaining the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences according to the probability distributions of each audio frame over the multiple sentence states and the word graph containing the multiple sentence states.
In some embodiments, determining whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences includes:
obtaining the difference between a first matching degree of the audio with the sentence state sequence of the wake-up audio and a second matching degree of the audio with the sentence state sequence of the non-wake-up audio;
in response to the difference being greater than a target threshold, determining that the audio is wake-up audio;
in response to the difference being less than the target threshold, determining that the audio is non-wake-up audio.
In some embodiments, the non-wake-up audio includes non-wake-up speech and non-speech, and the multiple sentence state sequences include multiple sentence states of wake-up audio, of non-wake-up speech, and of non-speech.
In some embodiments, the step of classifying the audio features is performed based on an audio processing model, and the audio processing model is trained based on the following steps:
obtaining multiple sample audios, where each sample audio corresponds to a target classification result indicating the target sentence state sequence corresponding to that sample audio;
performing feature extraction on the multiple sample audios to obtain the audio features of the multiple sample audios;
inputting the audio features of the multiple sample audios into an initial audio processing model, which classifies the audio features of each sample audio to obtain a classification result for each sample audio;
obtaining the mutual information corresponding to each sample audio according to the classification result and the target classification result of that sample audio;
and adjusting the model parameters of the initial audio processing model according to the mutual information until a target condition is met, to obtain the audio processing model.
In some embodiments, each sentence state in the sequence of target sentence states corresponds to a plurality of consecutive audio frames.
In some embodiments, the target condition is that the mutual information reaches a maximum value or the number of iterations reaches a target number.
In one aspect, a wake-up audio determination apparatus is provided. The apparatus includes multiple functional modules configured to execute the various optional implementations of the above wake-up audio determination method. In some embodiments, the functional modules may include an extraction module, a classification module, and a determination module.
In one aspect, an apparatus for wake-up audio determination is provided, the apparatus including:
an extraction module, configured to perform feature extraction on an audio to obtain the audio features of the audio;
a classification module, configured to classify the audio features of the audio to obtain the matching degrees between the audio and multiple sentence state sequences, where the multiple sentence state sequences at least include multiple sentence states of wake-up audio and of non-wake-up audio, respectively;
and a determination module, configured to determine whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences.
In some embodiments, the extraction module is configured to perform feature extraction on each audio frame in the audio to obtain the audio features of each audio frame;
the classification module includes a classification unit and an acquisition unit;
the classification unit is configured to classify the audio features of each audio frame to obtain the matching degrees between each audio frame and multiple sentence states;
and the acquisition unit is configured to obtain the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states.
In some embodiments, the classification unit is configured to classify the audio features of each audio frame to obtain the probability distribution of each audio frame over the multiple sentence states;
and the acquisition unit is configured to obtain the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences according to the probability distributions of each audio frame over the multiple sentence states and the word graph containing the multiple sentence states.
In some embodiments, the determination module is configured to:
obtain the difference between a first matching degree of the audio with the sentence state sequence of the wake-up audio and a second matching degree of the audio with the sentence state sequence of the non-wake-up audio;
in response to the difference being greater than a target threshold, determine that the audio is wake-up audio;
in response to the difference being less than the target threshold, determine that the audio is non-wake-up audio.
In some embodiments, the non-wake-up audio includes non-wake-up speech and non-speech, and the multiple sentence state sequences include multiple sentence states of wake-up audio, of non-wake-up speech, and of non-speech.
In some embodiments, the step of classifying the audio features is performed based on an audio processing model, and the audio processing model is trained based on the following steps:
obtaining multiple sample audios, where each sample audio corresponds to a target classification result indicating the target sentence state sequence corresponding to that sample audio;
performing feature extraction on the multiple sample audios to obtain the audio features of the multiple sample audios;
inputting the audio features of the multiple sample audios into an initial audio processing model, which classifies the audio features of each sample audio to obtain a classification result for each sample audio;
obtaining the mutual information corresponding to each sample audio according to the classification result and the target classification result of that sample audio;
and adjusting the model parameters of the initial audio processing model according to the mutual information until a target condition is met, to obtain the audio processing model.
In some embodiments, each sentence state in the sequence of target sentence states corresponds to a plurality of consecutive audio frames.
In some embodiments, the target condition is that the mutual information reaches a maximum value or the number of iterations reaches a target number.
In one aspect, an electronic device is provided that includes one or more processors and one or more memories having at least one computer program stored therein, the at least one computer program being loaded and executed by the one or more processors to implement various alternative implementations of the wake audio determination method described above.
In one aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, which is loaded and executed by a processor to implement various optional implementations of the wake audio determination method described above.
In one aspect, a computer program product or computer program is provided, including one or more program codes stored in a computer-readable storage medium. One or more processors of an electronic device can read the one or more program codes from the computer-readable storage medium and execute them, causing the electronic device to perform the wake-up audio determination method of any of the above possible embodiments.
In the embodiments of the present application, wake-up audio and non-wake-up audio are modeled separately, each corresponding to several sentence states that together form a sentence state sequence, so that when the audio features of an audio are classified, it can be determined whether the audio is closer to the wake-up audio or to the non-wake-up audio. In this process, the wake-up audio and the non-wake-up audio are modeled directly and independently of each other; no model is built for each phoneme, no model trained on frame-level annotation data is required, and no recognition result has to be determined for each phoneme during recognition, which greatly reduces the amount of computation and improves recognition efficiency.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a wake audio determination method according to an embodiment of the present application;
fig. 2 is a flowchart of a wake audio determining method according to an embodiment of the present application;
fig. 3 is a flowchart of a wake audio determining method according to an embodiment of the present application;
fig. 4 is a flowchart of a wake audio determining method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an FST decoding graph provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a wake audio determining apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a block diagram of a terminal according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, the first image can be referred to as a second image, and similarly, the second image can be referred to as a first image without departing from the scope of the various examples. The first image and the second image can both be images, and in some cases, can be separate and distinct images.
The term "at least one" is used herein to mean one or more, and the term "plurality" is used herein to mean two or more, e.g., a plurality of packets means two or more packets.
It is to be understood that the terminology used in the description of the various examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various examples and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association between associated objects, indicating that three relationships can exist; for example, "A and/or B" can mean: A exists alone, A and B both exist, or B exists alone. In addition, the character "/" in this application generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that, in the embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that determining B from a does not mean determining B from a alone, but can also determine B from a and/or other information.
It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also understood that the term "if" may be interpreted to mean "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining," "in response to determining," "upon detecting [the stated condition or event]," or "in response to detecting [the stated condition or event]," depending on the context.
The following describes an embodiment of the present application.
Fig. 1 is a schematic diagram of an implementation environment of a wake-up audio determination method according to an embodiment of the present application. The implementation environment includes the terminal 101, or includes the terminal 101 and the wake-up audio determination platform 102. The terminal 101 is connected to the wake-up audio determination platform 102 through a wireless or wired network.
The terminal 101 can be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, an intelligent robot, or a self-service payment device. The terminal 101 has installed and runs an application supporting wake-up audio determination, which can be, for example, a system application, an instant messaging application, a news push application, a shopping application, an online video application, or a social application.
Illustratively, the terminal 101 can perform wake-up audio determination independently, or the wake-up audio determination platform 102 can provide data services for it. The embodiments of the present application do not limit this.
The wake-up audio determination platform 102 includes at least one of a server, multiple servers, a cloud computing platform, or a virtualization center. The wake-up audio determination platform 102 is used to provide background services for applications that perform wake-up audio determination. Optionally, the wake-up audio determination platform 102 undertakes the primary processing work and the terminal 101 undertakes the secondary processing work; or the wake-up audio determination platform 102 undertakes the secondary processing work and the terminal 101 undertakes the primary processing work; or the wake-up audio determination platform 102 or the terminal 101 can each undertake the processing work alone. Optionally, the wake-up audio determination platform 102 and the terminal 101 perform cooperative computing using a distributed computing architecture.
Optionally, the wake audio determination platform 102 includes at least one server 1021 and a database 1022, where the database 1022 is used to store data, and in this embodiment, the database 1022 can store sample audio or audio processing models to provide data services for the at least one server 1021.
The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The terminal can be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
Those skilled in the art will appreciate that the number of terminals 101 and servers 1021 can be larger or smaller. For example, there may be only one terminal 101 and one server 1021, or tens or hundreds or more; the number and device types of terminals or servers are not limited in the embodiments of the present application.
Fig. 2 is a flowchart of a wake-up audio determination method provided in an embodiment of the present application. The method is applied to an electronic device, which is a terminal or a server. Referring to fig. 2, the method includes the following steps.
201. The electronic device performs feature extraction on the audio to obtain the audio features of the audio.
202. The electronic device classifies the audio features of the audio to obtain the matching degrees between the audio and multiple sentence state sequences, where the multiple sentence state sequences at least include multiple sentence states of wake-up audio and of non-wake-up audio, respectively.
203. The electronic device determines whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences.
In the embodiments of the present application, wake-up audio and non-wake-up audio are modeled separately, each corresponding to several sentence states that together form a sentence state sequence, so that when the audio features of an audio are classified, it can be determined whether the audio is closer to the wake-up audio or to the non-wake-up audio. In this process, the wake-up audio and the non-wake-up audio are modeled directly and independently of each other; no model is built for each phoneme, no model trained on frame-level annotation data is required, and no recognition result has to be determined for each phoneme during recognition, which greatly reduces the amount of computation and improves recognition efficiency.
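To make the three steps concrete, the following is a minimal end-to-end sketch in Python. The feature extractor and the classifier here are simplified stand-ins (hypothetical names with placeholder logic, not the patent's actual model); later sections sketch more concrete pieces.

```python
# Illustrative sketch of steps 201-203; all names are hypothetical.
import numpy as np

def extract_features(audio, frame_len=400, hop=160):
    # Step 201 (simplified): frame the waveform and take log-magnitude spectra.
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, hop)]
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def classify(features):
    # Step 202 (placeholder): a trained audio processing model would map the
    # features to matching degrees for the wake-up and non-wake-up sequences.
    rng = np.random.default_rng(0)
    wake_score, non_wake_score = rng.random(2)
    return wake_score, non_wake_score

def is_wake_audio(audio, threshold=0.3):
    feats = extract_features(audio)
    wake_score, non_wake_score = classify(feats)
    # Step 203: compare the difference of the two matching degrees to a threshold.
    return (wake_score - non_wake_score) > threshold

print(is_wake_audio(np.random.default_rng(1).standard_normal(16000)))
```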
In some embodiments, performing feature extraction on the audio to obtain the audio features of the audio includes:
performing feature extraction on each audio frame in the audio to obtain the audio features of each audio frame;
and classifying the audio features of the audio to obtain the matching degrees between the audio and the multiple sentence state sequences includes:
classifying the audio features of each audio frame to obtain the matching degrees between each audio frame and multiple sentence states;
and obtaining the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states.
In some embodiments, classifying the audio features of each audio frame to obtain the matching degrees between each audio frame and the multiple sentence states includes:
classifying the audio features of each audio frame to obtain the probability distribution of each audio frame over the multiple sentence states;
and obtaining the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states includes:
obtaining the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences according to the probability distributions of each audio frame over the multiple sentence states and the word graph containing the multiple sentence states.
In some embodiments, determining whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences includes:
obtaining the difference between a first matching degree of the audio with the sentence state sequence of the wake-up audio and a second matching degree of the audio with the sentence state sequence of the non-wake-up audio;
in response to the difference being greater than a target threshold, determining that the audio is wake-up audio;
in response to the difference being less than the target threshold, determining that the audio is non-wake-up audio.
In some embodiments, the non-wake-up audio includes non-wake-up speech and non-speech, and the multiple sentence state sequences include multiple sentence states of wake-up audio, of non-wake-up speech, and of non-speech.
In some embodiments, the step of classifying the audio features is performed based on an audio processing model, and the audio processing model is trained based on the following steps:
obtaining multiple sample audios, where each sample audio corresponds to a target classification result indicating the target sentence state sequence corresponding to that sample audio;
performing feature extraction on the multiple sample audios to obtain the audio features of the multiple sample audios;
inputting the audio features of the multiple sample audios into an initial audio processing model, which classifies the audio features of each sample audio to obtain a classification result for each sample audio;
obtaining the mutual information corresponding to each sample audio according to the classification result and the target classification result of that sample audio;
and adjusting the model parameters of the initial audio processing model according to the mutual information until a target condition is met, to obtain the audio processing model.
In some embodiments, each sentence state in the sequence of target sentence states corresponds to a plurality of consecutive audio frames.
In some embodiments, the target condition is that the mutual information reaches a maximum value or that the number of iterations reaches a target number.
Fig. 3 is a flowchart of an audio processing model training method provided by an embodiment of the present application, and referring to fig. 3, the method includes the following steps.
301. The electronic device acquires multiple sample audios, where each sample audio corresponds to a target classification result indicating the target sentence state sequence corresponding to that sample audio.
The electronic device trains the initial audio processing model based on the multiple sample audios, improving the accuracy with which it processes audio, so that the trained audio processing model can accurately process an audio, determine the matching degrees between the audio and the multiple sentence state sequences, and thereby determine whether the audio is wake-up audio.
The multiple sentence state sequences at least include multiple sentence states of wake-up audio and of non-wake-up audio, respectively. Here, sentence states are constructed separately for wake-up audio and non-wake-up audio, and the two are independent of each other: all wake-up audio is modeled uniformly and all non-wake-up audio is modeled uniformly, rather than modeling each phoneme. No model trained on frame-level annotation data is required, and no recognition result has to be determined for each phoneme during recognition, which greatly reduces the amount of computation and improves recognition efficiency.
In some embodiments, the non-wake-up audio may include non-wake-up speech and non-speech, and the multiple sentence state sequences include multiple sentence states of wake-up audio, of non-wake-up speech, and of non-speech. That is, modeling is performed for three kinds of audio, namely wake-up audio, non-wake-up speech, and non-speech, and several sentence states are constructed for each. The states are thus constructed at sentence level rather than for each phoneme, so frame-level annotation data is naturally not required, which avoids the problem in the related art that annotation data is hard to obtain or insufficiently accurate.
In some embodiments, the sentence states may take the form of HMMs (Hidden Markov Models), and the modeling process builds an HMM for each of the three kinds of audio. The HMMs of the three kinds of audio are then connected to construct an FST (Finite State Transducer).
For example, three states are built for the wake-up audio: 0, 1, 2; two states for non-wake-up speech: 3, 4; and two states for non-speech: 5, 6. This yields an FST containing 7 states, which can decode the audio classification result to determine the corresponding path and the matching degree along that path.
In some embodiments, each sentence state in the target sentence state sequence corresponds to multiple consecutive audio frames; that is, each sentence state lasts for multiple audio frames. A group of consecutive audio frames thus corresponds to one sentence state, which is a sentence-level state rather than a frame-level phoneme state. For example, if each sentence state corresponds to three consecutive audio frames, then for a wake-up audio the corresponding target state sequence may be 000111222. As another example, each sentence state in the target sentence state sequence may last at least three audio frames.
The number of sentence states included in each sentence state sequence and the number of audio frames corresponding to each sentence state can be set by those skilled in the art as required, which is not limited in the embodiments of the present application.
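As an illustration of this sentence-level topology, the following sketch (hypothetical code, assuming the 7-state example above and three frames per state) expands a sentence state list into a frame-level target sequence:

```python
# Sketch of the sentence-level state topology from the example above:
# wake-up audio -> states {0,1,2}, non-wake-up speech -> {3,4},
# non-speech -> {5,6}; each state lasts several consecutive frames.
WAKE_STATES = [0, 1, 2]
NON_WAKE_SPEECH_STATES = [3, 4]
NON_SPEECH_STATES = [5, 6]

def target_state_sequence(states, frames_per_state=3):
    """Expand a sentence state list into a frame-level target sequence,
    with each state lasting `frames_per_state` consecutive frames."""
    return [s for s in states for _ in range(frames_per_state)]

# For a wake-up sample this yields 000111222, as in the example.
print("".join(map(str, target_state_sequence(WAKE_STATES))))
```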
302. The electronic device performs feature extraction on the multiple sample audios to obtain the audio features of the multiple sample audios.
After acquiring the sample audios, the electronic device can perform feature extraction on them, and the audio features serve as the data basis for subsequent processing and analysis. The audio features highlight the characteristics of the audio, which leads to more accurate processing results.
Specifically, considering the short-time stationarity of audio, the electronic device may divide the speech to be processed into frames and perform feature extraction on the framed audio segments to obtain the audio features.
In a possible implementation, the speech is converted to the frequency domain for computation, which effectively reduces computational difficulty, increases computation speed, represents the audio features more effectively, and improves the accuracy of wake-up recognition. Therefore, the electronic device can apply a Fourier transform to each framed audio segment to obtain the spectrum of the audio, and perform feature extraction on the spectrum to obtain the audio features. The audio features may be FBank (filter bank) features or MFCCs (Mel-Frequency Cepstral Coefficients), which is not limited in the embodiments of the present application.
In some embodiments, the electronic device may extract features frame by frame to obtain the audio features of each audio frame in the sample audio, so that each audio frame can be analyzed. Specifically, in step 302, the electronic device may perform feature extraction on each audio frame in the sample audio to obtain the audio features of each audio frame.
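As one possible realization of this front end (an assumption, using the librosa library with illustrative frame sizes of 400 samples and a 160-sample hop at 16 kHz, not parameters given by the patent), log-mel FBank features can be computed as follows:

```python
# A sketch of per-frame FBank feature extraction: frame the signal,
# apply an FFT, and take log-mel filter-bank energies.
import librosa
import numpy as np

def fbank_features(wav_path, sr=16000, n_mels=40):
    audio, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-8).T   # shape: (num_frames, n_mels)
```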
303. The electronic device inputs the audio features of the multiple sample audios into an initial audio processing model, and the initial audio processing model classifies the audio features of each sample audio to obtain a classification result for each sample audio.
After obtaining the audio features, the electronic device can analyze, based on them, whether the audio is wake-up audio. For each sample audio, determining whether it is wake-up audio can be understood as a classification process: deciding whether the audio is classified as wake-up audio or as non-wake-up audio.
In some embodiments, the electronic device may frame the audio and analyze each audio frame. After the audio features of each audio frame are extracted in step 302, in step 303 the electronic device may classify the audio features of each audio frame to obtain the matching degrees between each audio frame and multiple sentence states, and then obtain the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states.
By analyzing the matching degrees between the audio and the sentence state sequences, it can be measured whether the audio is closer to wake-up audio or to non-wake-up audio. If the audio is closer to wake-up audio, it matches the sentence state sequence of the wake-up audio to a higher degree; if it is closer to non-wake-up audio, it matches the sentence state sequence of the non-wake-up audio to a higher degree.
In some embodiments, during classification the electronic device can obtain the probability distribution corresponding to each audio frame and then solve over the word graph based on these probability distributions to obtain the matching degrees between the audio and the sentence state sequences. Specifically, the electronic device may classify the audio features of each audio frame to obtain the probability distribution of each audio frame over the multiple sentence states, and then obtain the matching degree between the audio and the path corresponding to each sentence state sequence according to these probability distributions and the word graph containing the multiple sentence states.
In a possible implementation, the electronic device performs decoding to obtain the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences from the probability distributions and the word graph. Specifically, the electronic device may output the probability distributions to a decoder, and the decoder decodes them against the word graph to obtain the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences.
The decoding process may be a Viterbi decoding process; of course, other decoding methods may also be adopted, which is not limited in the embodiments of the present application.
The matching degree may be a probability, for example a value between 0 and 1. It may also be a score, for example a value between 0 and 100. The form of the matching degree can be set by those skilled in the art as required, and its specific form is not limited in the embodiments of the present application.
Taking the matching degree as a score as an example, during decoding the electronic device can decode according to the word graph and the probability distributions to obtain the optimal path of the audio for any sentence state sequence, together with the score of that optimal path. For example, taking the three states 0, 1, 2 of the wake-up audio as an example, the electronic device decodes according to the word graph and the probability distributions and obtains the optimal path of the audio for the wake-up audio as 00000111222, with a corresponding score of 0.8. Similarly, the electronic device can decode to obtain the optimal path and score corresponding to the non-wake-up audio, which is not described again here.
In some embodiments, during classification the audio can be classified according to its observation sequence and its audio features: the probabilities that the audio corresponds to the various recognition results, given the observation sequence, are determined, and these probabilities are called the probability distribution, from which the similarity between each recognition result and the observation sequence can be analyzed.
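A minimal sketch of such decoding follows (hypothetical code): it runs Viterbi over per-frame state probability distributions with a simple left-to-right, stay-or-advance topology for one path, such as the wake-up states 0, 1, 2; transition weights from a real word graph are omitted for simplicity.

```python
# Viterbi sketch: given per-frame probability distributions over the 7
# sentence states, find the best frame-level state sequence and its score
# for one sentence state sequence (e.g., wake-up path 0 -> 1 -> 2).
import numpy as np

def viterbi_path_score(log_probs, path_states):
    """log_probs: (T, S) per-frame log-probabilities over all states.
    path_states: ordered states of one sentence state sequence.
    Returns (best_score, best_frame_states)."""
    T, _ = log_probs.shape
    K = len(path_states)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0, 0] = log_probs[0, path_states[0]]
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1, k]
            advance = score[t - 1, k - 1] if k > 0 else -np.inf
            back[t, k] = k if stay >= advance else k - 1
            score[t, k] = max(stay, advance) + log_probs[t, path_states[k]]
    # Backtrack from the final state of the path.
    states, k = [path_states[K - 1]], K - 1
    for t in range(T - 1, 0, -1):
        k = back[t, k]
        states.append(path_states[k])
    return score[T - 1, K - 1], states[::-1]

rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(7), size=9))  # 9 frames, 7 states
score, path = viterbi_path_score(log_probs, [0, 1, 2])
print(score, path)   # e.g., best wake-up path such as 000111222
```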
304. The electronic device obtains the mutual information corresponding to each sample audio according to the classification result and the target classification result of each sample audio.
After the classification result of each sample audio is determined, and with the corresponding target classification result (the true, correct result) available, the electronic device can obtain the mutual information corresponding to each sample audio and use it as the objective of model optimization. In this way, there is no need to obtain a recognition error through a loss function, nor to determine recognition errors against frame-level annotation data.
Mutual information is a useful measure in information theory: it can be regarded as the amount of information one random variable contains about another or, equivalently, the reduction in uncertainty of one random variable given knowledge of the other. Given the observation sequence, once the probability that the audio is wake-up audio and the probability that it is non-wake-up audio are determined, the more clearly the two are separated, the more accurate the classification result.
For example, the mutual information may be the maximum mutual information (LF-MMI, Lattice-Free Maximum Mutual Information), determined by formula (1) below (reconstructed from the variable descriptions that follow):

$$F_{\mathrm{LF\text{-}MMI}} = \sum_{n=1}^{N} \log \frac{P(O_n \mid S_{\text{wake}})}{P(O_n \mid S_{\text{wake}}) + P(O_n \mid S_{\text{non-wake}})} \tag{1}$$

where $O$ is the observation sequence, namely the target classification result; $P(O_n \mid S_{\text{wake}})$ is the matching degree between the sample audio and the wake-up audio; the denominator is the sum of the matching degree between the sample audio and the wake-up audio and the matching degree between the sample audio and the non-wake-up audio; $N$ is the total number of audio frames, a positive integer; and $F_{\mathrm{LF\text{-}MMI}}$ is the maximum mutual information.
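Under the reconstruction above, the per-utterance objective reduces to the log of the target path's matching degree over the sum of both paths' matching degrees. A small sketch, assuming log-domain scores (so the sum in the denominator becomes a logaddexp):

```python
# Sketch of the (reconstructed) LF-MMI criterion for one sample audio.
import numpy as np

def lf_mmi_objective(log_wake, log_non_wake, target_is_wake):
    """log_wake / log_non_wake: log matching degrees of the sample audio
    with the wake-up and non-wake-up sentence state sequences."""
    numerator = log_wake if target_is_wake else log_non_wake
    denominator = np.logaddexp(log_wake, log_non_wake)  # log of summed scores
    return numerator - denominator                      # maximized in training

print(lf_mmi_objective(-3.2, -7.5, target_is_wake=True))
```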
305. The electronic device adjusts the model parameters of the initial audio processing model according to the mutual information until a target condition is met, obtaining the audio processing model.
After obtaining the mutual information, the electronic device can adjust the model parameters according to it so that the mutual information becomes larger, the audio classification result becomes more accurate, and the resulting audio processing model performs better.
For the audio processing model, a model structure of the audio processing model may be set by a relevant technician as required, for example, the model structure may be a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Time-Delay Neural Network (TDNN), and the like, which is not limited in this embodiment of the present application.
In some embodiments, the target condition is that the mutual information reaches a maximum value or that the number of iterations reaches a target number. The embodiment of the present application does not specifically limit the termination condition of the training process.
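A hypothetical training loop illustrating this step (a sketch using PyTorch as an assumed framework; the patent does not specify one). The model is assumed to output log matching scores for the wake-up and non-wake-up paths, and training maximizes the mutual-information objective by minimizing its negative until a target iteration count is reached:

```python
import torch

def train(model, batches, target_iters=10000):
    """batches yields (features, targets) with targets in {0: wake, 1: non-wake}."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step, (features, targets) in enumerate(batches):
        log_scores = model(features)                      # (batch, 2) path log-scores
        log_num = log_scores.gather(1, targets.unsqueeze(1)).squeeze(1)
        log_den = torch.logsumexp(log_scores, dim=1)      # sum over both paths
        loss = -(log_num - log_den).mean()                # negative mutual information
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= target_iters:                      # target condition: iteration count
            break
    return model
```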
The above description is made on the model training process, and the following description is made on the model using process.
Fig. 4 is a flowchart of a wake audio determination method provided in an embodiment of the present application, and referring to fig. 4, the method includes the following steps.
401. The electronic device obtains audio.
In the embodiments of the present application, the electronic device has an audio processing function: it can obtain an audio, process it, and determine whether it is wake-up audio.
The electronic device may be a terminal or a server. The embodiments of the present application do not limit this.
In some embodiments, the electronic device has an audio collection function and a voice wake-up function: it collects audio, determines whether the audio is wake-up speech, and thereby determines whether to wake up the device. Here, wake-up audio refers to audio used to wake up a device; for example, the wake-up audio may contain a wake-up word.
In other embodiments, the electronic device may have an audio collection function and determine by itself whether collected audio is wake-up audio; or it may collect the audio and send it to another electronic device, which determines whether the audio is wake-up audio and feeds the determination result back.
In still other embodiments, the electronic device may not have an audio collection function. It can receive or download audio collected by other electronic devices, determine whether the audio is wake-up audio, and then either perform further analysis based on the determination result or feed the result back to the other electronic devices.
Correspondingly, the electronic device may acquire the audio in various ways; the acquisition process may include any one of the following modes one to three, which the embodiments of the present application do not specifically limit.
Mode one: the electronic device collects the audio itself.
The electronic device may have an audio collection function and directly record sound to obtain the audio.
Mode two: the electronic device receives audio collected by an audio collection device.
The electronic device can connect to the audio collection device through a network or a data line, acquire the audio it collects, and provide background services for it. The audio collection device can be any kind of device with an audio collection function, such as a smart speaker or a smartphone, which is not limited in the embodiments of the present application.
Mode three: the electronic device extracts the audio from a database.
In mode three, the audio may be stored in a database; when the electronic device needs to process the audio, it extracts it from the database.
After the audio is obtained, the electronic device can perform the feature extraction and classification steps on the audio to determine whether it is wake-up audio; see steps 402 to 404 below.
402. The electronic device performs feature extraction on the audio to obtain the audio features of the audio.
Step 402 is similar to step 302 and is not described again here.
Similarly, in some embodiments, the electronic device may extract features frame by frame to obtain the audio features of each audio frame in the audio, so that each audio frame can be analyzed. Specifically, in step 402, the electronic device may perform feature extraction on each audio frame in the audio to obtain the audio features of each audio frame.
403. The electronic device inputs the audio features of the audio into the audio processing model, and the audio processing model classifies the audio features to obtain the matching degrees between the audio and the multiple sentence state sequences.
The multiple sentence state sequences at least include multiple sentence states of wake-up audio and of non-wake-up audio, respectively.
Step 403 is similar to step 303 and is not described again here.
Similarly, the electronic device may frame the audio and analyze each audio frame. After the audio features of each audio frame are extracted in step 402, in step 403 the electronic device may classify the audio features of each audio frame to obtain the matching degrees between each audio frame and multiple sentence states, and then obtain the matching degrees between the audio and the multiple sentence state sequences according to the matching degrees between each audio frame and the multiple sentence states.
Similarly, in some embodiments, during classification the electronic device can obtain the probability distribution corresponding to each audio frame and solve over the word graph based on these probability distributions to obtain the matching degrees between the audio and the sentence state sequences. Specifically, the electronic device may classify the audio features of each audio frame to obtain the probability distribution of each audio frame over the multiple sentence states, and then obtain the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences according to these probability distributions and the word graph containing the multiple sentence states.
Step 403 is the process of classifying the audio features of the audio to obtain the matching degrees between the audio and the multiple sentence state sequences. The above description takes as an example the case where the classification step is performed based on an audio processing model; in some embodiments, the electronic device may also process the audio features directly, without inputting them into an audio processing model. The specific manner adopted is not limited in the embodiments of the present application.
404. The electronic device determines whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences.
The matching degrees between the audio and the sentence state sequences, as determined by the electronic device, indicate whether the audio is more similar to wake-up audio or to non-wake-up audio. Understandably, the higher the matching degree between the audio and the sentence state sequence of the wake-up audio, the higher the likelihood that the audio is wake-up audio, and vice versa.
In some embodiments, the electronic device may jointly consider the matching degrees between the audio and the sentence state sequences and use their difference as the measure for determining whether the audio is wake-up audio. Specifically, the electronic device may obtain the difference between a first matching degree of the audio with the sentence state sequence of the wake-up audio and a second matching degree of the audio with the sentence state sequence of the non-wake-up audio, and determine whether the audio is wake-up audio according to the relationship between this difference and a target threshold.
In some embodiments, the electronic device determines that the audio is wake-up audio in response to the difference being greater than the target threshold; in other embodiments, the electronic device determines that the audio is non-wake-up audio in response to the difference being less than the target threshold. The target threshold may be set by those skilled in the art as required and is not limited in the embodiments of the present application.
The two matching degrees can be understood as the score that the audio is wake-up audio and the score that it is non-wake-up audio, respectively: the electronic device subtracts the non-wake-up score from the wake-up score to measure whether the audio content is close to the wake-up word. Understandably, the greater the difference, the greater the likelihood that the audio is wake-up audio. Using the difference as the measure jointly considers both how well the audio matches the wake-up audio and how well it matches the non-wake-up audio, making the result more accurate.
In the above manner of solving over the word graph based on the probability distributions, in step 404 the matching degrees between the audio and the sentence state sequences can be understood as the scores of the wake-up audio path and the non-wake-up audio path, and whether the audio is wake-up audio is determined according to the relationship between the difference D of the two scores and the target threshold H.
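The decision rule itself is then a one-liner (a sketch, with D and H named as above):

```python
def is_wake(wake_score: float, non_wake_score: float, H: float) -> bool:
    D = wake_score - non_wake_score   # difference of the two path scores
    return D > H                      # wake-up audio if D exceeds threshold H
```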
For example, for the above classification result, decoding can be performed in the decoding graph of FST as shown in fig. 5 to determine the awake audio path and the non-awake audio path of the audio, and the corresponding scores can also be decoded from the decoding graph with the above probability distribution. The decoding diagram represents a word for awakening, SIL for non-awakening, and freetext for non-speech.
In other embodiments, the electronic device can determine that the audio is the wake audio in response to the audio matching the sequence of sentence states of the wake audio to a degree of match threshold. Alternatively, the electronic device may determine the sentence state sequence corresponding to the largest matching degree of the multiple matching degrees as the type of the audio, for example, if the matching degree corresponding to the non-awakening audio is the largest, the audio is determined to be the non-awakening audio. Therefore, the electronic equipment does not need to calculate again, the calculation amount can be reduced, and the determination efficiency is improved.
The above provides several optional ways to determine whether the audio is wake-up audio according to the matching degrees; the embodiment of the present application does not limit which one is specifically adopted.
In some embodiments, since the wake-up audio is used to wake up the target device, the electronic device may also wake up the target device after determining that the audio is wake-up audio. The target device may or may not be the electronic device itself. In one possible implementation, the target device is the electronic device: the electronic device captures the audio, processes it, determines that it is wake-up audio, and wakes itself up, for example by lighting up its screen. In another possible implementation, the target device is not the electronic device: after the target device collects the audio, the electronic device analyzes whether the audio is wake-up audio, and if so, the electronic device wakes up the target device, e.g., instructs the target device to light up its screen.
The embodiment of the application models the wake-up audio and the non-wake-up audio separately; each corresponds to a plurality of sentence states that form a sentence state sequence, so that when the audio features of the audio are classified, it can be determined whether the audio is more similar to the wake-up audio or to the non-wake-up audio. In this process, the wake-up audio and the non-wake-up audio are modeled directly and independently of each other: no model is built for each phoneme, no model trained on frame-level labeled data is required, and no recognition result needs to be determined for each phoneme during recognition, which greatly reduces the amount of calculation and improves recognition efficiency.
All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 6 is a schematic structural diagram of a wake-up audio determining apparatus according to an embodiment of the present application, and referring to fig. 6, the apparatus includes:
the extraction module 601 is configured to perform feature extraction on the audio to obtain an audio feature of the audio;
a classification module 602, configured to classify the audio features of the audio to obtain the matching degrees between the audio and multiple sentence state sequences, where the multiple sentence state sequences at least include the plurality of sentence states included in wake-up audio and in non-wake-up audio respectively;
the determining module 603 is configured to determine whether the audio is wake-up audio according to the matching degrees between the audio and the multiple sentence state sequences.
In some embodiments, the extracting module 601 is configured to perform feature extraction on each audio frame in the audio to obtain an audio feature of each audio frame;
the classification module 602 includes a classification unit and an acquisition unit;
the classification unit is used for classifying the audio features of each audio frame to obtain the matching degree between each audio frame and the multiple sentence states;
the obtaining unit is used for obtaining the matching degrees between the audio and the multiple sentence state sequences according to the matching degree between each audio frame and the multiple sentence states.
In some embodiments, the classification unit is configured to classify the audio features of each audio frame to obtain the probability distribution of each audio frame corresponding to the multiple sentence states;
the obtaining unit is used for obtaining the matching degrees between the audio and the paths corresponding to the multiple sentence state sequences according to the probability distribution of each audio frame corresponding to the multiple sentence states and a word graph including the multiple sentence states.
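A minimal sketch of how these two units could fit together, assuming a generic per-frame classifier and reusing the path_score helper sketched earlier (the classifier interface and the probability flooring are assumptions):

import numpy as np

def frame_distributions(frame_features: np.ndarray, classify_frame) -> np.ndarray:
    """Classification unit: one probability distribution over the S sentence
    states per audio frame; classify_frame stands in for the trained model."""
    return np.stack([classify_frame(f) for f in frame_features])  # shape (T, S)

def matching_degrees(distributions: np.ndarray, graph_paths: dict) -> dict:
    """Obtaining unit: combine the per-frame distributions with the word
    graph (represented here only by its paths) to get one matching degree
    per sentence state sequence."""
    log_post = np.log(distributions + 1e-10)  # floor to avoid log(0)
    return {name: path_score(log_post, path)  # path_score: see the earlier sketch
            for name, path in graph_paths.items()}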
In some embodiments, the determining module 603 is configured to:
acquiring a difference between a first matching degree and a second matching degree, where the first matching degree is the matching degree between the audio and the sentence state sequence of the wake-up audio, and the second matching degree is the matching degree between the audio and the sentence state sequence of the non-wake-up audio;
in response to the difference being greater than a target threshold, determining the audio to be wake-up audio;
in response to the difference being less than the target threshold, determining the audio to be non-wake-up audio.
In some embodiments, the non-wake-up audio includes non-wake-up speech and non-speech; the multiple sentence state sequences include the plurality of sentence states included in the wake-up audio, the non-wake-up speech and the non-speech respectively.
In some embodiments, the step of classifying the audio features is performed based on an audio processing model;
the audio processing model is obtained by training based on the following steps:
acquiring a plurality of sample audios, wherein each sample audio corresponds to a target classification result, and the target classification result is used for indicating the target sentence state sequence corresponding to the sample audio;
performing feature extraction on the plurality of sample audios to obtain the audio features of the plurality of sample audios;
inputting the audio features of the sample audios into an initial audio processing model, and classifying the audio features of each sample audio by the initial audio processing model to obtain a classification result of each sample audio;
obtaining mutual information corresponding to each sample audio according to the classification result of each sample audio and the target classification result;
and adjusting the model parameters of the initial audio processing model according to the mutual information until a target condition is met, so as to obtain the audio processing model.
In some embodiments, each sentence state in the sequence of target sentence states corresponds to a plurality of consecutive audio frames.
In some embodiments, the target condition is that the mutual information reaches a maximum value or that the number of iterations reaches a target number.
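A heavily simplified version of these training steps, assuming a small PyTorch frame classifier, each sequence collapsed to a single sentence state, and the log-posterior margin of the target sequence as an MMI-style surrogate for the mutual information (all assumptions; the patent fixes neither the architecture nor the exact estimator):

import torch

def mmi_style_loss(seq_scores: torch.Tensor, target_idx: int) -> torch.Tensor:
    """Raise the target sequence's score relative to all competing sentence
    state sequences (negated so that gradient descent maximizes it)."""
    return -(seq_scores[target_idx] - torch.logsumexp(seq_scores, dim=0))

# 40-dim frame features and 3 sequences (wake-up, non-wake speech, non-speech),
# each collapsed to one state for brevity -- all assumed values.
model = torch.nn.Sequential(torch.nn.Linear(40, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):                                # target number of iterations (assumed)
    feats = torch.randn(20, 40)                         # stand-in for one sample audio's features
    log_post = torch.log_softmax(model(feats), dim=-1)  # (T, S): classify each frame
    seq_scores = log_post.sum(dim=0)                    # crude per-sequence scores; real scores
                                                        # would come from word-graph decoding
    loss = mmi_style_loss(seq_scores, target_idx=0)     # target sequence: wake-up, say
    opt.zero_grad(); loss.backward(); opt.step()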
The apparatus provided by the embodiment of the application models the wake-up audio and the non-wake-up audio separately; each corresponds to a plurality of sentence states that form a sentence state sequence, so that when the audio features of the audio are classified, it can be determined whether the audio is more similar to the wake-up audio or to the non-wake-up audio. In this process, the wake-up audio and the non-wake-up audio are modeled directly and independently of each other: no model is built for each phoneme, no model trained on frame-level labeled data is required, and no recognition result needs to be determined for each phoneme during recognition, which greatly reduces the amount of calculation and improves recognition efficiency.
It should be noted that, when the wake-up audio determining apparatus provided in the foregoing embodiment determines whether audio is wake-up audio, the division into the above functional modules is merely used as an example; in practical applications, the functions can be assigned to different functional modules as needed, that is, the internal structure of the wake-up audio determining apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the wake-up audio determining apparatus and the wake-up audio determining method provided in the above embodiments belong to the same concept, and their specific implementation processes are described in the method embodiments and are not repeated here.
Fig. 7 is a schematic structural diagram of an electronic device 700 according to an embodiment of the present application. The electronic device 700 may vary considerably in configuration or performance, and can include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one computer program that is loaded and executed by the processor 701 to implement the wake-up audio determining method provided by the above method embodiments. The electronic device can also include other components for implementing device functions; for example, it can have a wired or wireless network interface and an input/output interface. Details are not described herein.
The electronic device in the above method embodiment can be implemented as a terminal. For example, fig. 8 is a block diagram of a terminal according to an embodiment of the present disclosure. The terminal 800 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit) which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the wake-up audio determining method provided by the method embodiments herein.
In some embodiments, the terminal 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a display screen 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 805, disposed on the front panel of the terminal 800; in other embodiments, there may be at least two displays 805, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 800. The display 805 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly-shaped screen. The display 805 can be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting them into electrical signals, and inputting the electrical signals to the processor 801 for processing or to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker can be a traditional membrane speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 for navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the United States' Global Positioning System (GPS), China's BeiDou system, or the European Union's Galileo system.
The power supply 809 is used to supply power to the various components in the terminal 800. The power supply 809 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast-charge technology.
In some embodiments, terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the display 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the terminal 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side frames of terminal 800 and/or underneath display 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, the holding signal of the user to the terminal 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used to collect a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 itself identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be disposed on the front, back, or side of the terminal 800. When a physical button or a vendor logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, processor 801 may control the display brightness of display 805 based on the ambient light intensity collected by optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the display screen 805 is increased; when the ambient light intensity is low, the display brightness of the display 805 is reduced. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the display 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance gradually increases, the processor 801 controls the display 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The electronic device in the above method embodiment can be implemented as a server. For example, fig. 9 is a schematic structural diagram of a server provided in this embodiment of the present application. The server 900 may vary considerably in configuration or performance, and can include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one computer program that is loaded and executed by the processor 901 to implement the wake-up audio determining method provided in the foregoing method embodiments. Certainly, the server can also have components such as a wired or wireless network interface and an input/output interface to facilitate input and output, and can include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including at least one computer program that is executable by a processor to perform the wake-up audio determining method of the above embodiments. For example, the computer-readable storage medium can be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises one or more pieces of program code stored in a computer-readable storage medium. One or more processors of the electronic device can read the program code from the computer-readable storage medium and execute it, so that the electronic device performs the wake-up audio determining method described above.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments can be implemented by hardware, or by a program instructing relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is intended only to be an alternative embodiment of the present application, and not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A wake-up audio determining method, the method comprising:
carrying out feature extraction on the audio to obtain audio features of the audio;
classifying the audio features of the audio to obtain the matching degrees between the audio and multiple sentence state sequences, wherein the multiple sentence state sequences at least comprise the plurality of sentence states included in a wake-up audio and in a non-wake-up audio respectively, each sentence state sequence comprises a plurality of sentence states, each sentence state is a state in a finite state transducer, and each sentence state sequence is used for indicating one classification of the audio;
acquiring a difference between a first matching degree and a second matching degree, wherein the first matching degree is the matching degree between the audio and the sentence state sequence of the wake-up audio, and the second matching degree is the matching degree between the audio and the sentence state sequence of the non-wake-up audio; and
in response to the difference being greater than a target threshold, determining the audio to be wake-up audio.
2. The method of claim 1, wherein the carrying out feature extraction on the audio to obtain the audio features of the audio comprises:
performing feature extraction on each audio frame in the audio to obtain the audio feature of each audio frame;
the classifying the audio features of the audio to obtain the matching degree of the audio and the multiple sentence state sequences includes:
classifying the audio features of each audio frame to obtain the matching degree between each audio frame and the multiple sentence states;
and acquiring the matching degree of the audio and the multiple sentence state sequences according to the matching degree of each audio frame and the multiple sentence states.
3. The method of claim 2, wherein the classifying the audio features of each audio frame to obtain the degree of matching between each audio frame and a plurality of sentence states comprises:
classifying the audio features of each audio frame to obtain the probability distribution of each audio frame corresponding to the multiple sentence states;
the obtaining the matching degree of the audio and the multiple sentence state sequences according to the matching degree of each audio frame and the multiple sentence states comprises:
and acquiring the matching degree of the audio and the paths corresponding to the multiple sentence state sequences according to the probability distribution of each audio frame corresponding to the multiple sentence states and the word graph comprising the multiple sentence states.
4. The method of claim 1, further comprising:
in response to the difference being less than the target threshold, determining the audio to be non-wake-up audio.
5. The method of claim 1, wherein the non-wake-up audio comprises non-wake-up speech and non-speech; and the multiple sentence state sequences comprise the plurality of sentence states included in the wake-up audio, the non-wake-up speech and the non-speech respectively.
6. The method of claim 1, wherein the step of classifying the audio features is performed based on an audio processing model;
the audio processing model is obtained by training based on the following steps:
obtaining a plurality of sample audios, wherein each sample audio corresponds to a target classification result, and the target classification result is used for indicating a target sentence state sequence corresponding to the sample audio;
performing feature extraction on the plurality of sample audios to obtain audio features of the plurality of sample audios;
inputting the audio features of the plurality of sample audios into an initial audio processing model, and classifying the audio features of each sample audio by the initial audio processing model to obtain a classification result of each sample audio;
obtaining mutual information corresponding to each sample audio according to the classification result of each sample audio and the target classification result;
and adjusting the model parameters of the initial audio processing model according to the mutual information until a target condition is met, so as to obtain the audio processing model.
7. The method of claim 6, wherein each sentence state in the sequence of target sentence states corresponds to a plurality of consecutive audio frames.
8. The method of claim 6, wherein the target condition is that the mutual information reaches a maximum value or that the number of iterations reaches a target number.
9. A wake-up audio determining apparatus, the apparatus comprising:
the extraction module is used for extracting the characteristics of the audio to obtain the audio characteristics of the audio;
the classification module is used for classifying the audio features of the audio to obtain the matching degrees between the audio and multiple sentence state sequences, wherein the multiple sentence state sequences at least comprise the plurality of sentence states included in a wake-up audio and in a non-wake-up audio respectively, each sentence state sequence comprises a plurality of sentence states, each sentence state refers to a state in a finite state transducer, and each sentence state sequence is used for indicating one classification of the audio;
the determining module is used for obtaining a difference between a first matching degree and a second matching degree, wherein the first matching degree is the matching degree between the audio and the sentence state sequence of the wake-up audio, and the second matching degree is the matching degree between the audio and the sentence state sequence of the non-wake-up audio; and in response to the difference being greater than a target threshold, determining the audio to be wake-up audio.
10. An electronic device, comprising one or more processors and one or more memories having at least one computer program stored therein, the at least one computer program being loaded and executed by the one or more processors to implement the wake-up audio determining method as claimed in any one of claims 1 to 8.
11. A computer-readable storage medium, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the wake-up audio determining method as claimed in any one of claims 1 to 8.
CN202011293307.7A 2020-11-18 2020-11-18 Wake-up audio determining method, device, equipment and storage medium Active CN112116908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011293307.7A CN112116908B (en) 2020-11-18 2020-11-18 Wake-up audio determining method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112116908A CN112116908A (en) 2020-12-22
CN112116908B true CN112116908B (en) 2021-02-23

Family

ID=73795014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011293307.7A Active CN112116908B (en) 2020-11-18 2020-11-18 Wake-up audio determining method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112116908B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160802B (en) * 2021-03-12 2023-09-26 北京声智科技有限公司 Voice processing method, device, equipment and storage medium
CN113889084A (en) * 2021-11-19 2022-01-04 北京声智科技有限公司 Audio recognition method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766139A (en) * 2018-12-13 2019-05-17 平安普惠企业管理有限公司 The configuration method and device of configuration file
WO2020001458A1 (en) * 2018-06-26 2020-01-02 华为技术有限公司 Speech recognition method, device, and system
US10553205B2 (en) * 2017-03-09 2020-02-04 Kabushiki Kaisha Toshiba Speech recognition device, speech recognition method, and computer program product
CN110808050A (en) * 2018-08-03 2020-02-18 蔚来汽车有限公司 Voice recognition method and intelligent equipment
CN111968648A (en) * 2020-08-27 2020-11-20 北京字节跳动网络技术有限公司 Voice recognition method and device, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN112116908A (en) 2020-12-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant