CN111724781A - Audio data storage method and device, terminal and storage medium - Google Patents

Audio data storage method and device, terminal and storage medium

Info

Publication number
CN111724781A
Authority
CN (China)
Prior art keywords
audio data
recognition
voice
wake-up
level
Prior art date
Legal status
Granted
Application number
CN202010537664.7A
Other languages
Chinese (zh)
Other versions
CN111724781B (en)
Inventor
陈喆
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010537664.7A
Publication of CN111724781A
Application granted
Publication of CN111724781B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 15/144: Speech classification or search using statistical models (HMMs); training of HMMs
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/34: Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G06F 16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F 16/5846: Retrieval characterised by using metadata automatically derived from the content, using extracted text
    • G06F 9/4418: Bootstrapping; suspend and resume, hibernate and awake
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods


Abstract

An embodiment of the present application discloses an audio data storage method and apparatus, a terminal, and a storage medium, belonging to the technical field of terminals. The method comprises the following steps: acquiring audio data collected by a microphone; recognizing the audio data through k levels of voice wake-up recognition models to obtain a recognition result for each level of model, where the models at different levels correspond to different recognition dimensions, each recognition result indicates whether the audio data passes recognition in the corresponding dimension, and k is an integer greater than or equal to 2; and storing the audio data in a first storage area according to the recognition results, the first storage area being used to store audio data that passes at least one level of voice wake-up recognition model. The cause of a wake-up failure can thus be located precisely, that is, the level at which recognition failed is known, so the failure can be accurately analyzed and the models optimized, improving the wake-up rate in voice wake-up scenarios.

Description

Audio data storage method and device, terminal and storage medium
Technical Field
Embodiments of the present application relate to the field of terminal technology, and in particular to an audio data storage method and apparatus, a terminal, and a storage medium.
Background
As smart devices such as smartphones, smart speakers, and smart televisions become more widely used, voice wake-up technology is commonly built into them for user convenience.
In the related art, a voice wake-up function test is generally performed before a smart device leaves the factory to ensure an adequate wake-up rate. However, the function is affected by the device's usage environment and by differences between users, so voice wake-up failures still commonly occur when users use a smart device with a voice wake-up function. Because manufacturers cannot accurately locate the cause of these failures in the related art, the wake-up rate of the voice wake-up function remains degraded.
Disclosure of Invention
Embodiments of the present application provide an audio data storage method and apparatus, a terminal, and a storage medium. The technical solution is as follows:
In one aspect, an embodiment of the present application provides an audio data storage method, the method comprising:
acquiring audio data collected by a microphone;
recognizing the audio data through k levels of voice wake-up recognition models to obtain a recognition result for each level of model, where the models at different levels correspond to different recognition dimensions, each recognition result indicates whether the audio data passes recognition in the corresponding dimension, and k is an integer greater than or equal to 2;
and storing the audio data in a first storage area according to the recognition results, the first storage area being used to store audio data that passes at least one level of voice wake-up recognition model.
In another aspect, an embodiment of the present application provides an audio data storage apparatus, the apparatus comprising:
an acquisition module, configured to acquire audio data collected by a microphone;
a recognition module, configured to recognize the audio data through k levels of voice wake-up recognition models to obtain a recognition result for each level of model, where the models at different levels correspond to different recognition dimensions, each recognition result indicates whether the audio data passes recognition in the corresponding dimension, and k is an integer greater than or equal to 2;
and a first storage module, configured to store the audio data in a first storage area according to the recognition results, the first storage area being used to store audio data that passes at least one level of voice wake-up recognition model.
In another aspect, an embodiment of the present application provides a terminal comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the audio data storage method according to the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the audio data storage method according to the above aspect.
In another aspect, an embodiment of the present application further provides a computer program product comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the audio data storage method according to the above aspect.
The technical solution provided by the embodiments of the present application yields at least the following beneficial effects:
In a voice wake-up scenario, a k-level voice wake-up recognition model is set in the terminal. After the terminal receives audio data collected by the microphone, it can input the audio data into the k-level model, recognize the audio data in different recognition dimensions, obtain the pass/fail status in each dimension, and store the audio data in the first storage area according to the recognition results. With recognition modules of different dimensions, the recognition result of the audio data in each dimension can be obtained and the audio data stored accordingly, so the wake-up state in a voice wake-up scenario is known precisely. For a failed wake-up, the cause can be located exactly, that is, the recognition dimension at which recognition failed is known, so the failure can be analyzed and the models optimized, improving the wake-up rate in voice wake-up scenarios.
Drawings
FIG. 1 is an architecture diagram of a voice wake-up service system according to an exemplary embodiment of the present application;
FIG. 2 is a flowchart of an audio data storage method according to an exemplary embodiment of the present application;
FIG. 3 is a flowchart of an audio data storage method according to another exemplary embodiment of the present application;
FIG. 4 is a flowchart of an audio data storage method according to another exemplary embodiment of the present application;
FIG. 5 is a flowchart of an audio data storage method according to another exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a voice wake-up recognition and storage process according to an exemplary embodiment of the present application;
FIG. 7 is a flowchart of an audio data storage method according to another exemplary embodiment of the present application;
FIG. 8 is a flowchart of an audio data storage method according to another exemplary embodiment of the present application;
FIG. 9 is a flowchart of an audio data storage method according to another exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a voice wake-up training process according to an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a process for storing training audio according to an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of audio data storage modes corresponding to different scenarios in two modes according to an exemplary embodiment of the present application;
FIG. 13 is a structural block diagram of an audio data storage apparatus according to an exemplary embodiment of the present application;
FIG. 14 is a structural block diagram of a terminal according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes an association between objects and indicates three possible relationships; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the preceding and following objects.
Referring to fig. 1, an architecture diagram of a voice wake-up service system according to an exemplary embodiment of the present application is shown; the voice wake-up service system includes a terminal 101 and a server 102.
The terminal 101 is a device with a voice wake-up function, and may be a smartphone, a smart speaker, a tablet, a smart television, and the like, which is not limited in the embodiments of the present application. In the embodiments of the present application, the terminal 101 has an application mode (or user mode) and a debugging mode (test mode): audio data collected in the application mode (including wake-up audio and test audio) is stored in the data directory, and audio data collected in the debugging mode (including wake-up audio and test audio) is stored in the sdcard directory. Optionally, a k-level voice wake-up recognition model is set in the terminal 101 to provide voice recognition for the user in wake-up scenarios. Optionally, the terminal 101 may periodically send the stored audio data to the server 102; the server 102 analyzes the audio data, optimizes the k-level voice wake-up recognition model, and returns the optimized result to the terminal 101.
The server 102 and the terminal 101 are connected directly or indirectly through wired or wireless communication.
The server 102 is a background server or service server corresponding to the terminal's voice wake-up function. It may be a single server, a server cluster composed of multiple servers, or a cloud server. In this embodiment, the server 102 receives the audio data reported by the terminal 101, analyzes and processes it to optimize the k-level voice wake-up recognition model, and pushes the optimized model to the terminal 101.
It should be noted that the server 102 may exchange data with a large number of terminals; that is, it may receive audio data reported by many terminals and analyze that data in bulk to optimize the voice wake-up recognition model, then push the optimized model back to the terminals to improve the wake-up rate of the terminal voice wake-up function.
Referring to fig. 2, a flowchart of an audio data storage method according to an exemplary embodiment of the present application is shown. The embodiment is illustrated by applying the method to the terminal shown in fig. 1, and the method includes:
Step 201, audio data collected by a microphone is acquired.
In one possible implementation, when a user wants to use the voice wake-up function or voice assistant in a terminal, a corresponding wake-up word must be selected in advance in the voice assistant interface. After training succeeds, the user's voiceprint model is saved for voice recognition in subsequent voice wake-up scenarios.
In a voice wake-up scenario, after the user triggers the voice wake-up function, the microphone on the terminal continuously collects sound signals and converts them into electrical signals; correspondingly, the terminal acquires the audio data collected by the microphone.
The user may be a person actually using the terminal, or a tester testing the terminal's voice wake-up function, which is not limited in this embodiment.
Step 202, the audio data is recognized through k levels of voice wake-up recognition models to obtain a recognition result for each level of model, where the models at different levels correspond to different recognition dimensions, each recognition result indicates whether the audio data passes recognition in the corresponding dimension, and k is an integer greater than or equal to 2.
In one possible implementation, trained voice wake-up recognition models are preset in the terminal and used to recognize received audio data in a voice wake-up scenario, yielding a recognition result for the audio data: if the result indicates that every level of the model passed, the voice wake-up succeeds; otherwise it fails.
In an illustrative example, when a user wants to turn on the terminal screen through the voice wake-up function, the microphone continuously collects sound signals while the user speaks the wake-up word. The sound signals are converted into audio data, which is recognized through the k-level voice wake-up recognition model; if the recognition result indicates a pass, the terminal turns on the screen, otherwise it remains in the screen-off state.
The voice wake-up result is affected by many factors: the audio data may contain environmental sounds (noise, or sounds other than the user's voice); the user's speech may be quiet, slurred, or unusually toned and hence unrecognizable; or the model's recognition range may not be wide enough. To make wake-up audio recognition accurate, in one possible implementation multiple voice wake-up recognition models are trained on different recognition dimensions; for example, a keyword check determines whether the audio data contains the wake-up word or part of it, and voiceprint recognition determines whether the speaker is the user bound to the terminal. In the model application stage, the audio data then passes through each level of model in turn, so that when the result indicates that wake-up did not succeed, the level (and thus the recognition dimension) at which recognition failed can be clearly located, facilitating later analysis and optimization based on the stored audio data.
Step 203, the audio data is stored in a first storage area according to the recognition results, the first storage area being used to store audio data that passes at least one level of voice wake-up recognition model.
The first storage area may be located in the data folder of the terminal or in the sdcard folder, which is not limited in the embodiments of the present application.
As for how the audio data is stored according to the recognition result: in one possible implementation, the audio data is named after the highest level of voice wake-up recognition model it passed; alternatively, the audio file is named after the time at which it passed the models and stored in the first storage area; or audio data that passed the same level of model is stored in the same folder, none of which is limited by the embodiments of the present application.
In the embodiments of the present application, a k-level voice wake-up recognition model is set in the terminal for voice wake-up scenarios. After the terminal receives audio data collected by the microphone, it inputs the audio data into the k-level model, recognizes it in different recognition dimensions, obtains the pass/fail status in each dimension, and stores the audio data in the first storage area according to the recognition results. With recognition modules of different dimensions, the recognition result in each dimension can be obtained and the audio data stored accordingly: the wake-up state in a voice wake-up scenario is known precisely, and for a failed wake-up the cause can be located exactly, namely the recognition dimension at which recognition failed, so the failure can be analyzed, the models optimized, and the wake-up rate in voice wake-up scenarios improved.
During use of the terminal (or during testing of its voice wake-up function), the user may change the wake-up word, or the user themselves may change, and the corresponding audio data differs. To better distinguish audio data collected under these different conditions, in one possible implementation the audio data is named, when stored, according to its corresponding wake-up word.
In an exemplary example, as shown in fig. 3, which shows a flowchart of a method for storing audio data according to another exemplary embodiment of the present application, the present embodiment is illustrated by applying the method to the terminal shown in fig. 1, and the method includes:
step 301, audio data collected by a microphone is acquired.
For the implementation of this step, refer to step 201; details are not repeated here.
Step 302, nth-level recognition is performed on the audio data through the nth-level voice wake-up recognition model to obtain an nth recognition result, where n is a positive integer smaller than k.
Because each level of voice wake-up recognition model corresponds to a different recognition dimension, there is a natural recognition order. For example, the first-level model recognizes whether the collected audio data contains part of the keywords, so screening by the first-level model removes invalid audio; the second-level model recognizes whether the audio data contains the complete wake-up word, removing audio that contains other keywords close to the wake-up word; and so on.
In one possible implementation, a standard recognition reference for each level of model, such as a standard audio feature vector, is preset in the terminal. When the terminal inputs audio data into the nth-level model, features are extracted from the audio data, and the audio feature vector output by the nth-level model is compared with the preset standard feature vector to obtain the nth recognition result, i.e., whether the audio data passes the nth-level model. If the result indicates a pass, the audio data is input into the (n+1)th (next) level model; otherwise, recognition by subsequent models is not performed.
Step 303, in response to the nth recognition result indicating that the audio data passed the nth-level recognition, (n+1)th-level recognition is performed on the audio data through the (n+1)th-level voice wake-up recognition model to obtain an (n+1)th recognition result.
The standard audio feature vector corresponding to the (n+1)th-level voice wake-up recognition model is preset in the terminal.
In one possible implementation, when the nth recognition result indicates that the audio data passed nth-level recognition, the next-level model is entered: the terminal inputs the audio data into the (n+1)th-level model, which extracts features from the audio data to obtain the (n+1)th audio feature vector. That vector is compared with the standard audio feature vector to obtain the (n+1)th recognition result, i.e., whether the audio data passes the (n+1)th-level model. If it passes, the audio data continues into the next-level model for recognition; if not, recognition stops and the audio data is stored.
In one possible implementation, the audio data is input into the nth-level model for nth-level recognition; once the nth recognition result indicates a pass, the audio data is input into the (n+1)th-level model for (n+1)th-level recognition to obtain the (n+1)th recognition result. If that result indicates a pass, the audio data is input into the (n+2)th-level model for (n+2)th-level recognition, and so on, until the kth recognition result is obtained; when the kth recognition result indicates that the audio data passed the kth-level voice wake-up recognition model, the audio data is stored.
In other possible implementations, if the nth recognition result obtained from the nth-level model indicates that the audio data did not pass nth-level recognition, the audio data is no longer input into the (n+1)th-level model, and the audio data is stored.
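To make the cascade concrete, the following minimal Python sketch (illustrative only; the patent does not specify an implementation) shows the level-by-level control flow of steps 302 and 303: each level runs only if the previous one passed, and the index of the highest level passed is what later drives naming and storage.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WakeupModel:
    level: int                           # 1-based level index in the cascade
    recognize: Callable[[bytes], bool]   # True if audio passes this dimension

def run_cascade(audio: bytes, models: List[WakeupModel]) -> int:
    # Returns the highest level passed: 0 means the audio failed level 1
    # (invalid audio), k means every level passed and wake-up succeeded.
    highest_passed = 0
    for model in models:                 # models are ordered level 1 .. k
        if not model.recognize(audio):
            break                        # stop: level n+1 never runs
        highest_passed = model.level
    return highest_passed
```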
Step 304, the audio data is named according to the recognition result and the wake-up word.
The wake-up word is determined during the audio training process that precedes wake-up use, and may be chosen by the user: the user selects a wake-up word during training and trains with it, and after training succeeds, the user's voiceprint model for that wake-up word is saved and used as one of the bases for deciding whether subsequent wake-up attempts succeed.
In one possible embodiment, the recognition result and the wake-up word are combined to name the audio data, so the difference between recordings is visible from the name alone, for example two recordings that used different wake-up words, or that stopped at different levels of the voice wake-up recognition model.
In another possible embodiment, the audio collection time may also be added to the name, further distinguishing recordings along the time dimension of different wake-up attempts.
As for how the name is derived from the recognition result: in one possible embodiment, if the nth recognition result indicates that the audio data passed the nth-level model while the (n+1)th result indicates that it did not pass the (n+1)th-level model, input into the (n+2)th-level model stops. Since the audio data passed only up to the nth-level model, "nth-level audio" can be added to its name; that is, the audio data is named using the model identifier of the highest-level model it passed. During later analysis, the name alone then shows, for example, that a recording passed the first-level model but not the second-level model, so the cause of the failed wake-up can be analyzed in terms of the recognition dimension of the corresponding model.
In an illustrative example, based on FIG. 3, as shown in FIG. 4, step 304 may include step 304A and step 304B.
Step 304A, in response to the nth recognition result indicating that the audio data passed the nth-level recognition and the (n+1)th recognition result indicating that it did not pass the (n+1)th-level recognition, the audio data is named according to the model identifier of the nth-level voice wake-up recognition model and the wake-up word.
For example, if the audio passes first-level recognition but fails second-level recognition, the subsequent recognition process stops and the current audio data is stored. Since the audio passed only the first-level model of the k-level cascade, it can be named "first-level audio", indicating that this wake-up attempt passed only the first-level model, which makes the cause of the wake-up failure easy to locate. Therefore, in one possible implementation, when the nth recognition result indicates a pass and the (n+1)th result indicates a failure, the audio data is named after the model identifier of the nth-level model and the wake-up word.
In an illustrative example, taking n = 2: if the audio data passes level-2 recognition but not level-3 recognition, it is named after the model identifier of the level-2 voice wake-up recognition model and the wake-up word, e.g., "wake-up word + second.pcm".
In other illustrative examples, a time may be added to the name; the time may be when the audio data was collected (i.e., when voice wake-up started), when the audio passed the nth-level model, or when it failed the (n+1)th-level model (i.e., when the audio was stored), which is not limited in this embodiment.
In an illustrative example, if the audio data is named using the time, the recognition result, and the wake-up word, it is named "time + wake-up word + second.pcm".
Step 304B, in response to the kth recognition result indicating that the audio data passed the kth-level recognition, the audio data is named according to the model identifier of the kth-level voice wake-up recognition model and the wake-up word.
Since n in the above embodiment is an integer smaller than k, audio data that passes the kth-level model represents a successful wake-up, and the audio corresponding to a successful wake-up is stored as well; in this case the model identifier of the kth-level model and the wake-up word are likewise used to name the audio data.
In an illustrative example, after the audio data passes the k-level voice wake-up recognition model, if k is 3, the corresponding audio data may be named "wake-up word + third.pcm".
In other illustrative examples, as in the above embodiments, a time may be added to the name; the time may be when the audio data was collected (i.e., when voice wake-up started) or when it passed the kth-level model (i.e., when it was stored), which is not limited by this embodiment.
In an illustrative example, if the audio data is named using the time, the recognition result, and the wake-up word, it is named "time + wake-up word + third.pcm".
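A hedged sketch of the "time + wake-up word + level" naming rule of steps 304A and 304B follows. The timestamp format and the exact tag strings ("first", "second", "vprint"; the last is taken from the FIG. 6 description below) are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional

# Illustrative mapping from the highest level passed to a name fragment.
LEVEL_TAG = {1: "first", 2: "second", 3: "vprint"}

def audio_file_name(wake_word: str, highest_passed: int,
                    when: Optional[datetime] = None) -> str:
    # Build a "time + wake-up word + level.pcm" file name.
    when = when or datetime.now()
    tag = LEVEL_TAG[highest_passed]
    return f"{when:%Y%m%d_%H%M%S}_{wake_word}_{tag}.pcm"

# e.g. audio_file_name("small_cloth", 2) -> "20200612_093001_small_cloth_second.pcm"
```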
Step 305, the named audio data is stored in the first storage area.
In one possible implementation, since the number of wake-up attempts in a voice wake-up scenario is unbounded, the stored wake-up audio can grow large. To store it in an organized way, a main folder for wake-up audio is created in the first storage area, and under it a subfolder is created for the audio files of each wake-up attempt. Alternatively, subfolders may be divided by time (for example, each day's wake-up audio in one subfolder), by wake-up word (audio files for the same wake-up word in one subfolder), or by recognition result (audio data that stopped at the same level of the voice wake-up recognition model in one subfolder). A sketch of one of these schemes follows the example below.
In an illustrative example, a folder named "wakeup audio" is created in the first storage area, and audio data named "time + wake-up word + first.pcm" is stored in it.
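As a sketch of step 305's folder organization (path names such as "wakeup_audio" are assumptions; the embodiment leaves the scheme open), grouping here by wake-up word:

```python
from pathlib import Path

def store_wakeup_audio(first_storage_area: Path, file_name: str,
                       pcm_bytes: bytes, wake_word: str) -> Path:
    # Main folder for wake-up audio, with one subfolder per wake-up word;
    # grouping by day or by recognition level would just change this key.
    folder = first_storage_area / "wakeup_audio" / wake_word
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / file_name
    target.write_bytes(pcm_bytes)
    return target
```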
In this embodiment, the audio data is named according to the recognition result and the wake-up word and then stored, so the recognition result of a wake-up attempt can be read directly from the file name. For example, a file named "wake-up word + first-level wake-up audio" indicates that the audio passed only the first-level voice wake-up recognition model. The cause of the voice wake-up failure is thus located precisely, which facilitates later targeted analysis of the audio data and optimization of the corresponding voice wake-up recognition.
In one possible application scenario, developers configure a three-level voice wake-up recognition model for the voice wake-up scenario; that is, the k-level model includes a first-level, a second-level, and a third-level voice wake-up recognition model, used to recognize the audio data collected in the voice wake-up scenario and decide whether to execute the corresponding voice wake-up operation.
Referring to fig. 5, a flowchart of an audio data storage method according to another exemplary embodiment of the present application is shown. The embodiment is illustrated by applying the method to the terminal shown in fig. 1, and the method includes:
step 501, audio data collected by a microphone is acquired.
For the implementation of this step, refer to the above embodiments; details are not repeated here.
Step 502, first-level recognition is performed on the audio data through the first-level voice wake-up recognition model to obtain a first recognition result, the first recognition result indicating whether the audio data contains a keyword, where a keyword is part of a wake-up word.
The microphone feeds continuously collected audio into the first-level model; that is, the terminal must recognize audio data while still collecting it, and the model may recognize at a fixed sampling interval, for example once every 1 s. As a result, the audio input to a single recognition pass may not contain the complete wake-up word; for example, if the wake-up word is "small cloth small cloth", the audio for a single pass may contain only "small cloth". To avoid wake-up failures caused by never capturing audio containing the complete wake-up word during continuous recognition, the terminal's first-level model only needs to recognize whether the audio contains part of the wake-up word. If the first-level model finds that the audio contains no part of the wake-up word, no wake-up speech has been received: the microphone keeps collecting sound signals, and the resulting audio data keeps being input into the first-level model for continued recognition.
A keyword may be a single character of the wake-up word, or two or three adjacent characters; for example, for the wake-up word "small cloth small cloth", the corresponding keyword may be "small cloth", which is not limited in this embodiment.
In one possible implementation, the keywords must be stored in the terminal in advance, with different keywords stored for different wake-up words.
Regarding how the first-level model recognizes the audio data: in one possible implementation, the first-level voice wake-up recognition model uses a Convolutional Neural Network (CNN) to extract the audio feature vector of the audio data. Correspondingly, standard audio feature vectors for the keywords are stored in the terminal in advance, and the audio feature vector of the audio data is compared with them; if its similarity to the standard vector of any keyword exceeds a preset similarity threshold, the audio data is deemed to contain that keyword, and the corresponding first recognition result is that the audio data passes the first-level model.
The standard audio feature vector of each keyword is stored in advance; it is extracted from recordings made beforehand by multiple testers.
In an illustrative example, with a preset similarity threshold of 95%: if the similarity between the audio feature vector of the audio data and the standard vector of some keyword is 98%, which exceeds the threshold, the audio data is deemed to contain the keyword.
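A minimal sketch of this threshold test, assuming cosine similarity over the CNN feature vectors (the patent specifies only a similarity compared against a preset threshold, so the measure itself is an assumption):

```python
from typing import List
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_keyword_check(audio_vec: np.ndarray,
                         keyword_vecs: List[np.ndarray],
                         threshold: float = 0.95) -> bool:
    # Pass if the extracted feature vector matches ANY stored standard
    # keyword vector above the preset similarity threshold.
    return any(cosine(audio_vec, ref) >= threshold for ref in keyword_vecs)
```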
In another possible implementation, the first few characters of the wake-up word are typically set as the keyword. While the microphone collects audio continuously, if the first-level recognition result indicates that the audio at some collection instant contains the keyword, the audio within a predetermined period before and after that instant is input into the second-level voice wake-up recognition model as the basis for subsequent wake-up word recognition; for example, the audio within 1 s before and after the collection instant is input into the second-level model.
While the audio contains no keyword, the first-level model must run continuously, which clearly has a significant impact on terminal power consumption. Therefore, in one possible implementation, the first-level model runs on a Digital Signal Processor (DSP), reducing its power consumption and thus the power consumption of the whole voice wake-up recognition process.
Correspondingly, in one possible implementation, the microphone continuously sends collected audio data to the DSP, and the DSP inputs the received audio data into the first-level model; when the first recognition result indicates that the audio contains a keyword, the DSP sends the corresponding whole segment of audio data to the second-level voice wake-up recognition model.
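One plausible way the DSP-side buffering could work is a ring buffer from which the segment around the keyword hit (the 1 s window before and after, mentioned above) is cut and forwarded; buffer capacity and sample rate are assumptions:

```python
from collections import deque
from typing import List

SAMPLE_RATE = 16000           # assumed PCM sample rate
WINDOW_S = 1                  # the embodiment's example: 1 s before and after

class DspAudioBuffer:
    """Keeps recent samples so that, when the first-level model fires,
    the surrounding audio can be forwarded for wake-word recognition."""

    def __init__(self, capacity_s: int = 10):
        self.samples = deque(maxlen=capacity_s * SAMPLE_RATE)

    def push(self, chunk: List[int]) -> None:
        self.samples.extend(chunk)

    def window_around_end(self) -> List[int]:
        # In a real DSP the hit time would index into the ring buffer; here
        # we simply take the trailing 2*WINDOW_S seconds as the segment.
        n = 2 * WINDOW_S * SAMPLE_RATE
        return list(self.samples)[-n:]
```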
And 503, performing second-level recognition on the audio data through the second-level voice awakening recognition model to obtain a second recognition result, wherein the second recognition result is used for representing whether the audio data contains awakening words or not.
Because the first-level speech recognition model can only determine that the audio data contains the keywords, namely can only determine that the audio data contains partial awakening words, it may happen that the audio data contains a part of difference between the wake-up word and the target wake-up word (i.e. the wake-up word selected by the user during the test), for example, the wake-up word is "small cloth", the wake-up word contained in the audio data is "Hi, small cloth", but since the first-stage voice awakening recognition model recognizes that the audio data contains the keyword 'little cloth', then with the first stage recognition, it is clear that only the first stage voice wakeup recognition model is used to determine whether wakeup was successful, and therefore, in a possible implementation mode, a second-stage voice awakening recognition model is also arranged in the terminal, the method is used for identifying the whole piece of audio data so as to determine whether the audio data contains the complete awakening word.
Aiming at the recognition mode of the second-level voice awakening recognition model for the audio data, in a possible implementation mode, the second-level voice awakening recognition model also adopts CNN (voice CNN) which is used for extracting the audio feature vector corresponding to the audio data, correspondingly, the standard audio feature vector corresponding to the awakening word is stored in the terminal in advance, the audio feature vector corresponding to the audio data is compared with the standard audio feature vector corresponding to the awakening word, if the similarity between the audio feature vector corresponding to the audio data and the standard audio feature vector corresponding to the awakening word is higher than a preset similarity threshold value, the audio data is represented to contain keywords, and the corresponding second recognition result is that the audio data passes through the second-level voice awakening recognition model.
If the standard audio feature vector corresponding to the awakening word is prestored, the standard audio feature vector is the audio feature vector corresponding to the keyword extracted after being tested by a plurality of testers in advance, and different awakening words correspond to different standard audio feature vectors.
In an exemplary example, if the preset similarity threshold is 96%, and if the similarity between the audio feature vector corresponding to the audio data and the standard audio feature vector corresponding to the wake-up word is 98%, which is higher than the similarity threshold, the audio data is characterized to include the wake-up word.
Although both the first-level and second-level models use CNNs, the audio data they take as input differs in size: each input to the first-level model is only a portion of what the second-level model receives. The sizes and recognition behavior of the two models therefore also differ, the first-level model being smaller than the second-level model.
In an illustrative example, the first-level voice wake-up recognition model may be 200 KB in size, and the second-level model may be 20 MB.
As described above, the first-level model runs on the DSP to reduce the terminal's voice wake-up power consumption. To avoid the additional power draw of running the second-level model at the same time as the first, the second-level model runs on the Central Processing Unit (CPU): after the first-level model determines that the audio contains a keyword, the DSP sends the whole segment of audio data to the CPU, which inputs it into the second-level model and controls the model's recognition run.
Step 504, third-level recognition is performed on the audio data through the third-level voice wake-up recognition model to obtain a third recognition result, the third recognition result indicating whether the voiceprint features of the audio data match the target voiceprint features.
The first-level and second-level models only recognize whether the audio data contains the keyword or the wake-up word, and the audio may have been spoken by another user. If the voiceprint features of the audio data were ignored (different users have different voiceprint features) and the wake-up decision were based on the wake-up word alone, the security of terminal data would be threatened, or erroneous operations could be performed.
The target voiceprint features are obtained before voice wake-up is used: after selecting a wake-up word, the user goes through an audio training process in a test scenario, and once training succeeds the voiceprint features are stored in the terminal.
In one possible implementation, after the audio data passes second-level recognition it is input into the third-level voice wake-up recognition model, which extracts the voiceprint feature vector of the audio data and compares it with the pre-stored target voiceprint feature vector. When their similarity exceeds a preset similarity threshold, the audio data passes third-level recognition, the voice wake-up succeeds, and the terminal responds by executing the corresponding wake-up operation, such as lighting the screen or opening an application.
In an illustrative example, with a preset similarity threshold of 96%: if the similarity between the voiceprint feature vector and the target voiceprint feature vector is 98%, which exceeds the threshold, the collected audio data is determined to belong to the target user (i.e., the user corresponding to the target voiceprint features).
The third-level model extracts and recognizes the voiceprint features of the audio data, whereas the first-level and second-level models recognize keywords or wake-up words in it; since the recognition dimensions differ, the third-level model uses a different network model from the first two. In one possible implementation, the third-level voice wake-up recognition model uses a Gaussian Mixture Model (GMM).
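A hedged sketch of GMM-based voiceprint matching using scikit-learn, assuming frame-level features such as MFCCs; the patent compares a similarity against a threshold, and the average log-likelihood under the enrolled GMM is used here as a stand-in for that similarity:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_voiceprint_gmm(enroll_frames: np.ndarray,
                         n_components: int = 16) -> GaussianMixture:
    # enroll_frames: (num_frames, num_features) frame-level features of the
    # enrolled user's training audio, e.g. MFCCs.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(enroll_frames)
    return gmm

def passes_voiceprint_check(gmm: GaussianMixture, frames: np.ndarray,
                            threshold: float) -> bool:
    # score() is the average per-frame log-likelihood under the enrolled
    # model; used here as a proxy for the patent's similarity measure.
    return gmm.score(frames) >= threshold
```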
It should be noted that the preset similarity thresholds of the various levels of voice wake-up recognition models may be the same or different, which is not limited in this embodiment.
Step 505, the audio data is stored in the first storage area according to the recognition results, the first storage area being used to store audio data that passes at least one level of voice wake-up recognition model.
In one possible implementation, when the k-level model consists of the first-level, second-level, and third-level voice wake-up recognition models, the wake-up audio is likewise named according to the recognition result and the wake-up word when it is stored.
In an illustrative example, as shown in fig. 6, which is a schematic diagram of the voice wake-up recognition and storage process according to an illustrative embodiment of the present application: the microphone 601 continuously collects audio data and sends it to the DSP 602, where the first-level voice wake-up recognition model produces the first recognition result. If the first result indicates a failure of first-level recognition (no keyword), the audio data is invalid and is not stored, and the microphone 601 continues collecting audio data. If the first result indicates a pass (a keyword is present), the microphone 601 stops collecting, and the audio data is sent to the CPU 603, where the second-level model produces the second recognition result. If the second result indicates a failure of second-level recognition (no wake-up word), the audio data is stored under a name such as "time + wake-up word + first.pcm", indicating that it passed only first-level voice wake-up recognition. If the second result indicates a pass (the wake-up word is present), the audio data is input into the third-level model for third-level recognition to obtain the third recognition result. If the third result indicates a failure (the voiceprint features do not match the target voiceprint features), the audio data is stored under "time + wake-up word + second.pcm"; if it indicates a pass (the voiceprint features match the target), the audio data is stored under "time + wake-up word + vprint.pcm".
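Putting the three levels together, a sketch of the FIG. 6 dispatch logic; the three predicates are passed in as stand-ins for the DSP, CPU, and voiceprint models, and `audio_file_name` and `store_wakeup_audio` are the illustrative helpers sketched earlier (the storage path is likewise an assumption):

```python
from pathlib import Path
from typing import Callable, Optional

FIRST_STORAGE_AREA = Path("/data")   # assumed location of the first storage area

def recognize_and_store(audio: bytes, wake_word: str,
                        level1: Callable[[bytes], bool],   # DSP keyword model
                        level2: Callable[[bytes], bool],   # CPU wake-word model
                        level3: Callable[[bytes], bool],   # voiceprint model
                        ) -> Optional[str]:
    if not level1(audio):
        return None                            # invalid audio: not stored
    if not level2(audio):
        name = audio_file_name(wake_word, 1)   # "... first.pcm"
    elif not level3(audio):
        name = audio_file_name(wake_word, 2)   # "... second.pcm"
    else:
        name = audio_file_name(wake_word, 3)   # "... vprint.pcm", success
    store_wakeup_audio(FIRST_STORAGE_AREA, name, audio, wake_word)
    return name
```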
In this embodiment, the voice wake-up recognition models are trained on three recognition dimensions, namely the keyword (part of the wake-up word), the wake-up word, and the voiceprint features, yielding a first-level model that recognizes whether the audio contains a keyword, a second-level model that recognizes whether it contains the wake-up word, and a third-level model that recognizes whether the voiceprint features match the target voiceprint features. At the model application stage, audio data can thus be recognized in different dimensions, improving the accuracy of voice wake-up recognition. Moreover, the models are deployed according to their characteristics and run timing: the first-level model must run continuously while audio data is being collected, whereas the second-level model only needs to run once the first recognition result indicates that the audio contains a keyword. Running the two models on separate processors therefore reduces the power consumption of the terminal's whole voice wake-up process.
The voice wake-up recognition model deployed in a terminal is generally pre-trained, and the audio data used for model training is collected in a specific scenario, such as a testing laboratory, from testers or recordings. When a user actually uses the terminal, both the terminal's environment and the user's own voice characteristics differ from those at training time, so in actual use the voice wake-up function may fail, or fail at a high rate. To further optimize the voice wake-up recognition model, in one possible implementation the wake-up audio generated in the user's wake-up scenarios (including first-level wake-up audio, second-level wake-up audio, and voiceprint audio) is stored and uploaded to a server; the server further optimizes the voice wake-up recognition model according to this audio data and feeds the optimized model back to the user, so that the model better matches the user's environment and characteristics, improving the wake-up rate.
Referring to fig. 7, a flowchart of a method for storing audio data according to another exemplary embodiment of the present application is shown, where the present embodiment is illustrated by applying the method to the terminal shown in fig. 1, and the method includes:
Step 701, acquiring audio data collected by a microphone.
Step 702, recognizing the audio data through k levels of voice wake-up recognition models to obtain the recognition result corresponding to each level of model, where models at different levels correspond to different recognition dimensions, each recognition result represents whether the audio data passes recognition in the corresponding dimension, and k is an integer greater than or equal to 2.
Step 703, storing the audio data into a first storage area according to the recognition result, where the first storage area is used for storing audio data that passes at least one level of the voice wake-up recognition models.
For details of steps 701 to 703, refer to the foregoing embodiments; they are not repeated here.
Step 704, uploading the audio data stored in the first storage area to a server, where the server is configured to determine the recognition quality of the voice wake-up recognition models according to the audio data and its names.
The first storage area stores wake-up audio data from voice wake-up scenarios. Because wake-up failures occur in these scenarios, that is, some audio data does not pass all k levels of the voice wake-up recognition models, the stored wake-up audio data includes both audio corresponding to failed wake-ups and audio corresponding to successful wake-ups.
In an exemplary example, taking the k-level models as the first-level, second-level, and third-level voice wake-up recognition models, the corresponding audio data categories may include: first-level wake-up audio (audio data that passes only the first-level model), second-level wake-up audio (audio data that passes the first- and second-level models but not the third), and voiceprint audio (audio data that passes all three levels, i.e., the audio corresponding to a successful wake-up).
In a possible implementation, the terminal periodically uploads the stored audio data to the server, and the server optimizes the voice wake-up recognition models or the user's stored target voiceprint feature: model optimization uses all categories of audio data, while target voiceprint optimization uses only the audio data corresponding to successful wake-ups (i.e., the voiceprint audio).
The terminal may upload at a predetermined time interval, for example every 15 days, or upload and then delete the local copies once the number of stored audio files exceeds a predetermined threshold, for example 50.
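A minimal sketch of such an upload trigger follows; the 15-day interval and 50-file limit are the example values above, and the function name and storage layout are hypothetical:

```python
import os
import time

UPLOAD_INTERVAL_S = 15 * 24 * 3600   # assumed: upload every 15 days
UPLOAD_COUNT_LIMIT = 50              # assumed: or when >50 files are stored

def should_upload(storage_dir: str, last_upload_ts: float) -> bool:
    """Upload when the interval has elapsed or too many files are stored;
    after a successful upload the terminal deletes the local copies."""
    n_files = len(os.listdir(storage_dir))
    interval_elapsed = time.time() - last_upload_ts >= UPLOAD_INTERVAL_S
    return interval_elapsed or n_files > UPLOAD_COUNT_LIMIT
```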
Since audio data is named according to the recognition result and the wake-up word when stored, the server can determine from the name alone whether a file corresponds to a wake-up failure and, if so, which level of voice wake-up recognition model it failed. The server can then feed the audio data through each level of model again and check whether the result matches the recognition result encoded in the name. If they match, the voice wake-up recognition models ran correctly on the terminal, and the cause of the wake-up failure requires further analysis; if they differ, the models are running poorly on the terminal, recognition errors exist, and optimization is needed.
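The following sketch shows this server-side check under the naming scheme described earlier ("first", "second", "vprint" suffixes); the model callables and the name-parsing details are assumptions:

```python
from typing import Callable, Sequence

LEVEL_SUFFIX = {1: "first", 2: "second", 3: "vprint"}

def terminal_result_confirmed(audio: bytes, file_name: str,
                              models: Sequence[Callable[[bytes], bool]]) -> bool:
    """Re-run the cascade server-side and return True when the level it
    reaches matches the level encoded in the terminal's file name."""
    passed = 0
    for model in models:          # models[0] is the level-1 model, etc.
        if not model(audio):
            break
        passed += 1
    claimed = file_name.rsplit("+", 1)[-1].removesuffix(".pcm")
    # Mismatch means the model ran incorrectly on the terminal.
    return LEVEL_SUFFIX.get(passed) == claimed
```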
If the server and the terminal produce the same result for the same audio data, the failure to wake up may stem from the audio data itself, for example it contains no wake-up word, or its source is not the target user; these causes are unrelated to the voice wake-up recognition model. Other factors are of course possible: the audio data may contain the wake-up word and come from the target user, yet carry so much noise that recognition is impaired, or the user may have an unusual pronunciation that the model cannot recognize. For these two cases, the voice wake-up recognition model needs to be optimized for different wake-up environments: for example, adding noise reduction before the audio data is input into each level of model, to avoid environmental sound (noise) degrading recognition accuracy; or training, for a special user group, a voice wake-up recognition model that matches those users' vocal characteristics, enabling personalized recognition. Both measures improve recognition accuracy and hence the wake-up rate.
In another possible implementation, for audio data corresponding to successful wake-ups, the last-level model (i.e., the third-level voice wake-up recognition model) recognizes the user's voiceprint features, and these may drift from the voiceprint features trained before use; for example, a user's voiceprint differs between having a cold and being healthy, or changes during adolescent voice change. To avoid wake-up failures caused by such drift, the audio data corresponding to successful wake-ups is uploaded to the server, which aggregates the audio from a predetermined period and retrains the user's target voiceprint feature. The target voiceprint feature is thus updated in a way that is imperceptible to the target user, improving the wake-up rate.
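One possible form of this retraining is sketched below. The moving-average update and the alpha weight are illustrative assumptions; the embodiment only states that the server aggregates a period's successful wake-up audio and retrains the target voiceprint feature:

```python
import numpy as np

def refresh_target_voiceprint(old_target: np.ndarray,
                              recent: list[np.ndarray],
                              alpha: float = 0.3) -> np.ndarray:
    """Fold the voiceprint vectors extracted from a period's successful
    wake-up audio into the stored target feature, so the target tracks
    gradual changes such as a cold or voice change."""
    recent_mean = np.stack(recent).mean(axis=0)
    return (1.0 - alpha) * old_target + alpha * recent_mean
```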
Step 705, receiving feedback information of the server.
The feedback information may include at least one of the optimized voice wake-up recognition models at each level and the target voiceprint feature corresponding to the user.
In a possible implementation, the server acquires, at predetermined intervals, the audio data that the terminal has stored from voice wake-up scenarios, and analyzes it from multiple angles for the different wake-up states (wake-up failure or wake-up success), so as to continuously optimize voice wake-up recognition, or to provide a dedicated voice wake-up recognition model for a special user group, and feeds the optimized model back to the corresponding terminal.
In another possible implementation, the server updates the target user's target voiceprint feature at predetermined intervals according to the acquired audio data and feeds the updated feature back to each terminal, so that the target voiceprint feature better matches the user's recent voiceprint and the wake-up rate in voice wake-up scenarios improves.
In this embodiment, the terminal uploads the audio data from voice wake-up scenarios to the server, so that the server can monitor, from the audio data and its naming, how the voice wake-up recognition models actually behave on the terminal, optimize them in time, and feed them back to the terminal, improving recognition accuracy and thus the user's wake-up rate. In addition, the server can further optimize the user's target voiceprint feature from this audio data, avoiding wake-up failures caused by changes in the user's voiceprint and thereby improving the wake-up rate in voice wake-up scenarios.
In another possible application scenario, the optimization of the voice wake-up recognition models can also be performed on the terminal. The server classifies the received audio data into positive and negative samples: if the recognition results for the same audio data are the same on the terminal and the server, indicating that the models ran normally, the audio data is used as a positive sample for the model of the level it passed, or as a negative sample for the model of the next level.
In one illustrative example, the process of determining positive and negative samples may include the steps of:
First, in response to the feedback information indicating that the recognition quality of the voice wake-up recognition models meets the quality index, the audio data is determined as positive sample training data for the nth-level voice wake-up recognition model.
The naming of audio data records which level of model it passed; for example, the name "time + wake-up word + nth-level wake-up audio" indicates that the audio data passed only the nth-level voice wake-up recognition model and failed the (n+1)th-level model. Therefore, in a possible implementation, if the results on both the terminal and the server indicate that the audio data passed the nth-level model but not the (n+1)th level, the audio data serves as positive sample training data for the nth-level model, used for training that model.
Second, the audio data is determined as negative sample training data for the (n+1)th-level voice wake-up recognition model.
In another possible implementation, if the results on both the terminal and the server indicate that the audio data passed the nth-level model but not the (n+1)th level, the audio data serves as negative sample training data for the (n+1)th-level model, used for training that model; a sketch of this sample assignment follows.
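This sketch assigns samples under the naming scheme described earlier; the suffix-to-level mapping and the return format are assumptions:

```python
SUFFIX_LEVEL = {"first": 1, "second": 2}   # 'vprint' passed every level

def assign_samples(audio: bytes, file_name: str, quality_ok: bool):
    """When the server confirms the terminal's result (quality_ok) and
    the name says the audio passed level n but not level n+1, emit it as
    a positive sample for level n and a negative sample for level n+1."""
    if not quality_ok:
        return []
    suffix = file_name.rsplit("+", 1)[-1].removesuffix(".pcm")
    n = SUFFIX_LEVEL.get(suffix)
    if n is None:                           # successful wake-up audio
        return []
    return [(n, "positive", audio), (n + 1, "negative", audio)]
```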
In this embodiment, based on the recognition quality fed back by the server (that is, whether the recognition results for the same audio data are consistent between the terminal and the server) and the model identifier indicated in the audio data's name, the terminal determines which audio data serves as positive or negative sample training data for which level of voice wake-up recognition model, and uses it to further optimize the models. The models thereby better match the user's characteristics, improving their accuracy and hence the user's wake-up rate in voice wake-up scenarios.
In a possible application scenario, terminals are used differently after leaving the factory: a terminal may be used by an end user, or used by testers to evaluate terminal performance. For end users, the security of the stored audio data must be guaranteed; for testers, the audio data must be retrievable at any time for analysis.
In an illustrative example, on the basis of fig. 2, step 203 may be replaced with steps 801 and 802, as shown in fig. 8.
Step 801, in response to the current mode being the application mode, storing the audio data in a first storage area according to the recognition result, where the first storage area cannot be accessed without root privileges.
Because the data directory in the terminal cannot be accessed without root privileges, it offers high security and better protects the user's private information. In a possible implementation, the data directory is therefore chosen as the first storage area: when the user selects the application mode (user mode), the voice wake-up service creates a folder under the data directory for storing the user's audio data.
In other possible implementations, an encrypted storage partition may be provided in the terminal to store the user's audio data, preventing access by non-system applications and protecting the security of the audio data.
When the user uses the voice wake-up function in the application mode, voice wake-up training must be performed first; that is, the user's target voiceprint feature is trained and used as the basis for judging successful wake-up in subsequent voice wake-up scenarios. During this training, the user's audio data can be collected and stored.
In an illustrative example, after the user sets the application mode, the terminal receives the application-mode setting operation and creates a data/kws/ folder for storing the user's audio data. Further, to distinguish the voice wake-up training scenario from the voice wake-up scenario, it creates a data/kws/trainaudio subfolder and a data/kws/wakeupaudio subfolder under the data/kws/ directory: the data/kws/wakeupaudio directory stores the user's audio data from voice wake-up scenarios (i.e., the wake-up audio data), and the data/kws/trainaudio directory stores the user's audio data from voice wake-up training scenarios.
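A minimal sketch of this folder creation follows; the /data root stands in for the terminal's root-protected data directory, and the function name is hypothetical:

```python
from pathlib import Path

def create_user_mode_dirs(data_root: str = "/data") -> dict[str, Path]:
    """Create the application-mode layout described above:
    data/kws/trainaudio for training audio and
    data/kws/wakeupaudio for wake-up audio."""
    dirs = {
        "train": Path(data_root) / "kws" / "trainaudio",
        "wakeup": Path(data_root) / "kws" / "wakeupaudio",
    }
    for d in dirs.values():
        d.mkdir(parents=True, exist_ok=True)
    return dirs
```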
Step 802, in response to the current mode being a debug mode, storing the audio data, according to the recognition result, into the sub-storage space corresponding to the current debug object within a second storage area, where the second storage area can be accessed without root privileges.
The debug mode corresponds to a tester evaluating the terminal's voice wake-up performance in a test scenario. Since user privacy is not a concern in a test scenario, audio data in the debug mode need not be stored in the privacy-protected first storage area, which facilitates subsequent analysis.
In a possible implementation, the second storage area is first divided into two sub-storage spaces according to the training and wake-up scenarios, and each of these is further divided into sub-storage spaces corresponding to different users, where each sub-storage space stores the audio data of a single user.
In an illustrative example, when the user selects the debug mode, the voice wake-up service creates an sdcard/kws/ folder on the SD card. When testing the wake-up rate of User1, it creates subdirectories named after the training time (Time1): User1's training audio is saved under the sdcard/kws/Time1/Trainaudio directory and User1's wake-up audio under the sdcard/kws/Time1/Wakeupaudio directory, and so on until the tests of multiple users are completed.
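The debug-mode layout can be sketched as follows; the session naming ("Time1", "Time2") follows the example above, while the function signature is an assumption:

```python
from pathlib import Path

def debug_session_dir(session: str, scene: str,
                      sdcard_root: str = "/sdcard") -> Path:
    """Return (and create) the per-session debug-mode folder, e.g.
    /sdcard/kws/Time1/Trainaudio for a tester's training audio. The
    session name encodes the test time so each tester stays separate."""
    if scene not in ("Trainaudio", "Wakeupaudio"):
        raise ValueError("scene must be 'Trainaudio' or 'Wakeupaudio'")
    path = Path(sdcard_root) / "kws" / session / scene
    path.mkdir(parents=True, exist_ok=True)
    return path
```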
In this embodiment, different modes, namely an application mode and a debug mode, are set for the terminal's different application scenarios, and different storage areas are allocated to them. In the application mode, audio data is stored in an area that cannot be accessed without root privileges, protecting the user's private information; in the debug mode, audio data is stored in an area accessible without root privileges, making it easy to retrieve for subsequent analysis. Moreover, since the debug mode involves multiple users testing the terminal's wake-up rate, the storage area is divided by debug object (i.e., user) and test time, so that different users are kept apart and a target user's audio data can be located accurately during later analysis.
In a possible implementation, before the user can perform voice wake-up, voice wake-up training is performed: the target voiceprint feature corresponding to the user is obtained through training and used for voiceprint comparison during subsequent wake-ups.
In an exemplary example, as shown in fig. 9, which shows a flowchart of a method for storing audio data according to another exemplary embodiment of the present application, the present embodiment is illustrated by applying the method to the terminal shown in fig. 1, and the method includes:
Step 901, in response to receiving a wake-up word selection operation in the wake-up voice setting interface, determining the wake-up word indicated by the selection operation.
In a possible implementation, when the user needs to train the target voiceprint feature before using voice wake-up, the voice wake-up service provides a wake-up voice setting interface in which the user can select a wake-up word; the terminal accordingly determines the wake-up word to be used in subsequent training and wake-up.
In an exemplary example, fig. 10 shows a schematic diagram of a voice wake-up training process according to an exemplary embodiment of the present application. When the user turns on the voice wake-up switch in the wake-up voice setting interface 1001, a wake-up word option is displayed in the interface. The user can tap the pull-down control 1002, upon which several wake-up word options are displayed over the interface, and select any wake-up word to train. For example, when the user taps the wake-up word "xiao yi", the terminal receives the selection operation and determines that the wake-up word for this voice wake-up training is "xiao yi".
Step 902, in response to the collected training audio data containing the wake-up word indicated by the wake-up word selection operation, storing the training audio data in a third storage area.
In a possible implementation, after the user selects the wake-up word, voice wake-up training is performed according to the prompts in the wake-up voice setting interface, and the audio data collected by the microphone is stored in the third storage area.
In an exemplary example, as shown in fig. 10, after the terminal determines the wake-up word selected by the user, the user is prompted in the wake-up voice setting interface 1001 to speak the wake-up word; when the terminal determines that the training is successful, a corresponding success prompt is displayed in the interface.
In another possible implementation, if the training fails, the user is prompted in the wake-up voice setting interface that training failed and should be repeated, and the audio data corresponding to the failed training is stored in the corresponding location so that the cause of the failure can be analyzed later.
As for the voice wake-up training method, in a possible implementation a voiceprint recognition model is preset in the terminal. It extracts features from the audio data collected by the microphone to obtain an audio feature vector and compares it with the pre-stored audio feature vector corresponding to the wake-up word; when the two meet a preset similarity threshold, the training attempt is deemed successful. After repeated successful attempts, the trained target voiceprint feature is stored and used for voiceprint comparison in subsequent voice wake-up scenarios.
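A sketch of one training attempt's comparison step follows; cosine similarity and the 0.8 threshold are illustrative assumptions, as the embodiment only requires that a preset similarity threshold be met:

```python
import numpy as np

def attempt_succeeds(feature: np.ndarray, reference: np.ndarray,
                     threshold: float = 0.8) -> bool:
    """Compare the feature vector extracted from the captured audio with
    the pre-stored vector for the wake-up word; one training attempt
    succeeds when the similarity reaches the preset threshold."""
    cos = float(feature @ reference /
                (np.linalg.norm(feature) * np.linalg.norm(reference)))
    return cos >= threshold
```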
Because a voice wake-up training scenario produces two kinds of audio data, corresponding to successful and failed training, the two must likewise be distinguished when stored in the third storage area. The audio data can be named by time plus training state, where the training state is either success or failure.
In an exemplary example, fig. 11 shows a schematic diagram of a storage process for training audio according to an exemplary embodiment of the present application. When training starts, the terminal creates a new folder, Trainaudio, in the third storage area and creates a subfolder named after the wake-up word selected by the user, for example Trainaudio/xiaobuxiaobu, where "xiaobuxiaobu" is the wake-up word. If the training succeeds, the audio data is named "time + success.pcm"; if the training fails, the audio data is named "time + fail.pcm".
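A small sketch of this naming rule, with an assumed timestamp format:

```python
import time

def training_audio_name(success: bool) -> str:
    """Name a training recording by capture time plus training state,
    following the 'time+success.pcm' / 'time+fail.pcm' convention."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return f"{stamp}+{'success' if success else 'fail'}.pcm"
```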
Note that the storage region for training audio also depends on the terminal's mode: in the application mode, the training audio must be stored in a region inaccessible without root privileges to protect the user's private information, while in the debug mode it is stored in a region accessible without root privileges to facilitate later retrieval and analysis.
Optionally, in the debug mode, the audio data is also named according to the debug object and test time, making it easy to accurately locate a target test user.
Step 903, acquiring audio data collected by a microphone.
Step 904, recognizing the audio data through k levels of voice wake-up recognition models to obtain the recognition result corresponding to each level of model, where models at different levels correspond to different recognition dimensions, each recognition result represents whether the audio data passes recognition in the corresponding dimension, and k is an integer greater than or equal to 2.
Step 905, storing the audio data into a first storage area according to the recognition result, where the first storage area is used for storing audio data that passes at least one level of the voice wake-up recognition models.
For details of steps 903 to 905, refer to the foregoing embodiments; they are not repeated here.
In this embodiment, in a voice wake-up training scenario, audio data is stored according to the wake-up word and the training result: audio from failed training can be used to analyze the cause of the failure, and audio from successful training can be used to analyze the user's voiceprint characteristics.
The above embodiments took the training scenario and the wake-up scenario as examples to introduce the audio data storage manner in each scenario. Refer to fig. 12, which shows schematic diagrams of the audio data storage manners corresponding to the different scenarios in the two modes, according to an exemplary embodiment of the present application.
As shown in fig. 12, the terminal provides an attribute setting function through which the user selects a mode. When the user selects the user mode (application mode), audio data from the user's wake-up scenario is stored under the "data/kws/wakeupaudio" directory, and audio data from the user's training scenario under the "data/kws/trainaudio" directory. If the user selects the test mode (debug mode), audio data from each training scenario is stored per debug object (user 1 corresponding to training 1): for example, user 1's training audio is stored under the "sdcard/kws/Time1/Trainaudio" directory and user 2's training audio under the "sdcard/kws/Time2/Trainaudio" directory. Similarly, audio data from each wake-up scenario is stored per debug object (wake-up 1 corresponding to user 1): for example, user 1's wake-up audio is stored under the "sdcard/kws/Time1/Wakeupaudio" directory and user 2's wake-up audio under the "sdcard/kws/Time2/Wakeupaudio" directory.
Referring to fig. 13, a block diagram of an audio data storage device according to an exemplary embodiment of the present application is shown. The apparatus may be implemented as all or a portion of the terminal in software, hardware, or a combination of both. The device includes:
an obtaining module 1301, configured to obtain audio data collected by a microphone;
the recognition module 1302 is configured to recognize the audio data through k-level voice awakening recognition models to obtain recognition results corresponding to the voice awakening recognition models at each level, where the voice awakening recognition models at different levels correspond to different recognition dimensions, the recognition results are used to represent recognition passing conditions of the audio data in the corresponding recognition dimensions, and k is an integer greater than or equal to 2;
and the first storage module 1303 is configured to store the audio data into a first storage area according to the recognition result, where the first storage area is used to store audio data that passes at least one level of the voice awakening recognition models.
Optionally, the first storage module 1303 includes:
the naming unit is used for naming the audio data according to the identification result and the awakening word;
the first storage unit is used for storing the named audio data into the first storage area.
Optionally, the identifying module 1302 includes:
the first identification unit is used for carrying out nth-level identification on the audio data through an nth-level voice awakening identification model to obtain an nth identification result, wherein n is a positive integer smaller than k;
the second identification unit is used for performing, in response to the nth recognition result representing that the audio data passes the nth-level recognition, (n + 1)th-level recognition on the audio data through an (n + 1)th-level voice awakening recognition model to obtain an (n + 1)th recognition result;
optionally, the naming unit is further configured to:
in response to the nth recognition result representing that the audio data passes the nth level recognition and the (n + 1) th recognition result representing that the audio data does not pass the (n + 1) th level recognition, naming the audio data according to the model identification of the nth level voice awakening recognition model and the awakening word;
or,
and in response to the fact that the kth recognition result represents that the audio data passes through kth-level recognition, naming the audio data according to the model identification of the kth-level voice awakening recognition model and the awakening word.
Optionally, the k-level voice awakening recognition model includes a first-level voice awakening recognition model, a second-level voice awakening recognition model, and a third-level voice awakening recognition model;
the first identification unit is further configured to:
performing first-stage recognition on the audio data through the first-stage voice awakening recognition model to obtain a first recognition result, wherein the first recognition result is used for representing whether the audio data contains keywords which are part of the awakening words;
or,
performing second-level recognition on the audio data through the second-level voice awakening recognition model to obtain a second recognition result, wherein the second recognition result is used for representing whether the audio data contains the awakening words or not;
optionally, the second identifying unit is further configured to:
and performing third-level recognition on the audio data through the third-level voice awakening recognition model to obtain a third recognition result, wherein the third recognition result is used for representing whether the voiceprint features of the audio data are matched with the target voiceprint features.
Optionally, the first-stage voice awakening recognition model runs on the DSP, and the second-stage voice awakening recognition model and the third-stage voice awakening recognition model run on the CPU;
the first-stage voice awakening recognition model and the second-stage voice awakening recognition model are based on CNN, and the third-stage voice awakening recognition model is based on GMM.
Optionally, the apparatus further comprises:
the uploading module is used for uploading the audio data stored in the first storage area to a server, and the server is used for determining the recognition quality of the voice awakening recognition model according to the audio data and the name of the audio data;
and the receiving module is used for receiving the feedback information of the server.
Optionally, the apparatus further comprises:
a first determining module, configured to determine the audio data as positive sample training data of the nth level voice wakeup recognition model and/or determine the audio data as negative sample training data of the (n + 1) th level voice wakeup recognition model in response to the feedback information indicating that the recognition quality of the voice wakeup recognition model meets a quality index.
Optionally, the first storage module 1303 further includes:
the second storage unit is used for storing, in response to the current mode being an application mode, the audio data into the first storage area according to the recognition result, where the first storage area cannot be accessed without root authority;
the device further comprises:
and the second storage module is used for storing, in response to the current mode being a debugging mode, the audio data into a sub-storage space corresponding to the current debugging object in a second storage area according to the recognition result, where the second storage area allows access without root authority.
Optionally, the apparatus further comprises:
the second determination module is used for responding to the received awakening word selection operation in the awakening voice setting interface and determining the awakening word indicated by the awakening word selection operation;
and the third storage module is used for responding to the fact that the acquired training audio data contain the awakening words indicated by the awakening word selection operation, and storing the training audio data in a third storage area.
In the embodiments of the present application, k levels of voice wake-up recognition models are provided in the terminal for voice wake-up scenarios. After receiving the audio data collected by the microphone, the terminal inputs it into the k-level models, which recognize it in different recognition dimensions, yielding the recognition passing condition in each dimension, and the audio data is stored in the first storage area according to the recognition results. With voice wake-up recognition models of different recognition dimensions, recognition results in each dimension can be obtained and the audio data stored accordingly, so the wake-up state in a voice wake-up scenario is captured accurately. In the case of a wake-up failure, the cause can be located precisely, namely the recognition dimension of the level at which recognition failed, so the cause can be analyzed and optimized accurately, improving the wake-up rate in voice wake-up scenarios.
Referring to fig. 14, a block diagram of a terminal 1400 according to an exemplary embodiment of the present application is shown. Terminal 1400 in embodiments of the present application may include one or more of the following: a processor 1410, a memory 1420, and a screen 1430.
Processor 1410 may include one or more processing cores. The processor 1410 connects various parts throughout the terminal 1400 using various interfaces and lines, and performs the various functions of the terminal 1400 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1420 and calling data stored in the memory 1420. Optionally, the processor 1410 may be implemented in at least one hardware form of a DSP, a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 1410 may integrate one or more of a CPU, a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, user interface, application programs, and so on; the GPU is responsible for rendering and drawing the content that the screen 1430 needs to display; and the modem handles wireless communications. It is understood that the modem may also not be integrated into the processor 1410 but instead be implemented by a separate communication chip.
The memory 1420 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1420 includes a non-transitory computer-readable storage medium. The memory 1420 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1420 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the above method embodiments, and the like; the operating system may be an Android system (including systems based on deep development of Android), an IOS system developed by Apple Inc. (including systems based on deep development of IOS), or another system. The data storage area may also store data created by the terminal 1400 in use (such as a phone book, audio and video data, and chat log data), and the like.
The screen 1430 may be a capacitive touch display screen for receiving touch operations by a user on or near the screen using a finger, stylus, or any other suitable object, as well as displaying user interfaces for various applications. The touch display screen is generally provided at a front panel of the terminal 1400. The touch display screen may be designed as a full-face screen, a curved screen, or a profiled screen. The touch display screen can also be designed to be a combination of a full-face screen and a curved-face screen, and a combination of a special-shaped screen and a curved-face screen, which is not limited in the embodiment of the present application.
In this embodiment, the terminal 1400 further includes a microphone, which is an energy conversion device that converts a sound signal into an electrical signal, and is configured to collect the sound signal, convert the sound signal into audio data, and send the audio data to each level of voice wakeup recognition models for recognition.
In addition, those skilled in the art will appreciate that the structure of the terminal 1400 shown in fig. 14 does not limit the terminal 1400; a terminal may include more or fewer components than shown, combine certain components, or arrange components differently. For example, the terminal 1400 further includes components such as a radio frequency circuit, a camera assembly, a sensor, an audio circuit, a Wireless Fidelity (WiFi) component, a power supply, and a Bluetooth component, which are not described herein again.
The embodiment of the present application further provides a computer-readable medium, which stores at least one instruction, where the at least one instruction is loaded and executed by the processor to implement the storage method of audio data according to the above embodiments.
Embodiments of the present application also provide a computer program product including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the storage method of audio data provided in the various alternative implementations of the above aspect.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method of storing audio data, the method comprising:
acquiring audio data collected by a microphone;
identifying the audio data through k-level voice awakening identification models to obtain identification results corresponding to all levels of voice awakening identification models, wherein the voice awakening identification models of different levels correspond to different identification dimensions, the identification results are used for representing the identification passing condition of the audio data on the corresponding identification dimensions, and k is an integer greater than or equal to 2;
and storing the audio data into a first storage area according to the recognition result, wherein the first storage area is used for storing the audio data passing through at least one stage of voice awakening recognition model.
2. The method of claim 1, wherein storing the audio data in a first storage area according to the recognition result comprises:
naming the audio data according to the identification result and the awakening word;
and storing the named audio data into the first storage area.
3. The method according to claim 2, wherein the recognizing the audio data by the k-level voice wakeup recognition model to obtain the recognition result corresponding to each level of the voice wakeup recognition model comprises:
performing nth-level recognition on the audio data through an nth-level voice awakening recognition model to obtain an nth recognition result, wherein n is a positive integer smaller than k;
responding to the nth recognition result representing that the audio data passes the nth-level recognition, and performing (n + 1)th-level recognition on the audio data through an (n + 1)th-level voice awakening recognition model to obtain an (n + 1)th recognition result;
naming the audio data according to the recognition result and the awakening word, including:
in response to the nth recognition result representing that the audio data passes the nth level recognition and the (n + 1) th recognition result representing that the audio data does not pass the (n + 1) th level recognition, naming the audio data according to the model identification of the nth level voice awakening recognition model and the awakening word;
or,
and in response to the fact that the kth recognition result represents that the audio data passes through kth-level recognition, naming the audio data according to the model identification of the kth-level voice awakening recognition model and the awakening word.
4. The method of claim 3, wherein the k-level voice wake recognition models comprise a first level voice wake recognition model, a second level voice wake recognition model, and a third level voice wake recognition model;
the nth stage recognition is carried out on the audio data through the nth stage voice awakening recognition model to obtain an nth recognition result, and the nth recognition result comprises the following steps:
performing first-stage recognition on the audio data through the first-stage voice awakening recognition model to obtain a first recognition result, wherein the first recognition result is used for representing whether the audio data contains keywords which are part of the awakening words;
or,
performing second-level recognition on the audio data through the second-level voice awakening recognition model to obtain a second recognition result, wherein the second recognition result is used for representing whether the audio data contains the awakening words or not;
the n + 1-level recognition of the audio data through the n + 1-level voice awakening recognition model comprises:
and performing third-level recognition on the audio data through the third-level voice awakening recognition model to obtain a third recognition result, wherein the third recognition result is used for representing whether the voiceprint features of the audio data are matched with the target voiceprint features.
5. The method of claim 4,
the first-stage voice awakening recognition model runs on a Digital Signal Processor (DSP), and the second-stage voice awakening recognition model and the third-stage voice awakening recognition model run on a Central Processing Unit (CPU);
the first-stage voice awakening recognition model and the second-stage voice awakening recognition model are based on a Convolutional Neural Network (CNN), and the third-stage voice awakening recognition model is based on a Gaussian Mixture Model (GMM).
6. The method according to any one of claims 2 to 5, wherein after storing the audio data in the first storage area according to the identification result, the method further comprises:
uploading the audio data stored in the first storage area to a server, wherein the server is used for determining the recognition quality of the voice awakening recognition model according to the audio data and the name of the audio data;
and receiving feedback information of the server.
7. The method of claim 6, further comprising:
in response to the feedback information indicating that the recognition quality of the voice wakeup recognition model meets a quality index, determining the audio data as positive sample training data of the nth-level voice wakeup recognition model, and/or determining the audio data as negative sample training data of the (n + 1) th-level voice wakeup recognition model.
8. The method according to any one of claims 1 to 5, wherein the storing the audio data in the first storage area according to the recognition result further comprises:
responding to that the current mode is an application mode, storing the audio data into the first storage area according to the identification result, wherein the first storage area cannot be accessed when the root authority is not available;
the method further comprises the following steps:
and responding to the fact that the current mode is a debugging mode, storing the audio data into a sub-storage space corresponding to the current debugging object in a second storage area according to the identification result, wherein the second storage area allows access when root authority is not available.
9. The method of any of claims 1 to 5, wherein prior to the acquiring the audio data collected by the microphone, the method further comprises:
responding to the received awakening word selection operation in the awakening voice setting interface, and determining awakening words indicated by the awakening word selection operation;
and in response to the acquired training audio data containing the awakening words indicated by the awakening word selection operation, storing the training audio data in a third storage area.
10. An apparatus for storing audio data, the apparatus comprising:
the acquisition module is used for acquiring audio data acquired by a microphone;
the recognition module is used for recognizing the audio data through k levels of voice awakening recognition models to obtain recognition results corresponding to the voice awakening recognition models at all levels, wherein the voice awakening recognition models at different levels correspond to different recognition dimensions, the recognition results are used for representing the recognition passing condition of the audio data on the corresponding recognition dimensions, and k is an integer greater than or equal to 2;
and the first storage module is used for storing the audio data into a first storage area according to the recognition result, and the first storage area is used for storing the audio data passing through at least one stage of voice awakening recognition model.
11. A terminal, characterized in that it comprises a processor and a memory in which at least one instruction, at least one program, set of codes or set of instructions is stored, which is loaded and executed by the processor to implement a method of storing audio data according to any one of claims 1 to 9.
12. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of storing audio data according to any one of claims 1 to 9.
CN202010537664.7A 2020-06-12 2020-06-12 Audio data storage method, device, terminal and storage medium Active CN111724781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537664.7A CN111724781B (en) 2020-06-12 2020-06-12 Audio data storage method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN111724781A true CN111724781A (en) 2020-09-29
CN111724781B CN111724781B (en) 2023-10-20



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant