CN116312619A - Voice activity detection model generation method and device, medium and electronic equipment - Google Patents

Voice activity detection model generation method and device, medium and electronic equipment

Info

Publication number
CN116312619A
Authority
CN
China
Prior art keywords
voice
activity detection
detection model
sample
voice activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310087996.3A
Other languages
Chinese (zh)
Inventor
文仕学
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310087996.3A
Publication of CN116312619A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The disclosure relates to a voice activity detection model generation method and apparatus, a medium, and an electronic device. The method includes: acquiring a voice sample data set, where the voice sample data set includes voice frame samples and sample labels corresponding to the voice frame samples; inputting the voice frame samples into a voice activity detection model to obtain voice probability results output by the voice activity detection model and corresponding to the voice frame samples; determining a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, where the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score; and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model and improve the effect of the voice activity detection model in practical application.

Description

Voice activity detection model generation method and device, medium and electronic equipment
Technical Field
The disclosure relates to the technical field of neural networks, and in particular to a voice activity detection model generation method and apparatus, a medium, and an electronic device.
Background
A voice activity detection (VAD) model can detect whether speech is present in a piece of audio.
In the related art, the voice activity detection model is usually optimized with a cross entropy loss function as the objective function. However, in practical applications of the voice activity detection model, the objective of the model is not strictly consistent with the cross entropy loss function used during optimization, so the effect of the voice activity detection model in practical application is not ideal.
Disclosure of Invention
This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for generating a voice activity detection model, including: acquiring a voice sample data set, wherein the voice sample data set includes voice frame samples and sample labels corresponding to the voice frame samples;
inputting the voice frame samples into a voice activity detection model to obtain voice probability results output by the voice activity detection model and corresponding to the voice frame samples;
determining a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, wherein the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score;
and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model.
In a second aspect, the present disclosure provides a voice activity detection model generating apparatus, including:
an acquisition module, configured to acquire a voice sample data set, wherein the voice sample data set includes voice frame samples and sample labels corresponding to the voice frame samples;
an input module, configured to input the voice frame samples into a voice activity detection model to obtain voice probability results output by the voice activity detection model and corresponding to the voice frame samples;
a determining module, configured to determine a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, wherein the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score;
and an adjusting module, configured to adjust the model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method described in the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the method described in the first aspect.
According to the above technical solution, it is considered that the F1 score is the index used to evaluate the model in practical application, that the F1 score itself is not differentiable, and that in model training a lower loss value generally indicates better model performance. The constructed first loss function is therefore the negative of a smooth approximation of the F1 score: the smooth approximation of the F1 score is differentiable, and its negative satisfies the convention that a lower loss value corresponds to better model performance during training. By using the negative of the smooth approximation of the F1 score as the first loss function for training the voice activity detection model, the loss function used in generating the voice activity detection model is made consistent with the objective of the model in practical application, which improves the effect of the trained voice activity detection model in practical application.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart illustrating a method of generating a voice activity detection model, according to an example embodiment.
FIG. 2 is a block diagram illustrating a voice activity detection model generation apparatus according to an example embodiment.
Fig. 3 is a schematic diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
All actions in this disclosure to obtain signals, information or data are performed in compliance with the corresponding data protection legislation policies of the country of location and to obtain authorization granted by the owner of the corresponding device.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly prompt the user that the operation it is requesting to perform will require personal information to be obtained and used with the user. Thus, the user can autonomously select whether to provide personal information to software or hardware such as an electronic device, an application program, a server or a storage medium for executing the operation of the technical scheme of the present disclosure according to the prompt information.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Meanwhile, it can be understood that the data (including but not limited to the data itself, the acquisition or the use of the data) related to the technical scheme should conform to the requirements of the corresponding laws and regulations and related regulations.
As mentioned in the background, a voice activity detection (VAD) model is typically optimized with a cross entropy loss function as the objective function, which may also be referred to as the loss function. The optimization objective of the cross entropy loss function is to make the results recognized by the model as consistent as possible with the labels of the samples. However, the objective of a voice activity detection model in practical application can be quantified as the precision and the recall of speech: the precision represents the ratio of the number of frames that the model recognizes as speech and that are actually speech to the total number of frames the model recognizes as speech, and the recall represents the ratio of the number of frames that the model recognizes as speech and that are actually speech to the total number of frames that are actually speech. The two trade off against each other: the higher the precision, the lower the recall. Therefore, in order to balance precision and recall, the harmonic mean of precision and recall is taken, that is, the F1-score, also referred to as the F1 score. In practical use of the voice activity detection model, it is generally desirable that the precision and recall, or equivalently the F1-score, be as high as possible. However, this objective is not strictly consistent with the cross entropy optimization objective used in the model generation process, so the effect of the voice activity detection model in practical application is not ideal.
In view of this, the embodiments of the present disclosure provide a method, an apparatus, a medium, and an electronic device for generating a voice activity detection model, which improve the effect of the voice activity detection model in practical application.
The present disclosure is explained below with reference to the drawings.
FIG. 1 is a flowchart illustrating a method of generating a voice activity detection model according to an example embodiment. The voice activity detection model generation method can be applied to an electronic device, such as a mobile terminal (for example, a mobile phone or a tablet) or a fixed terminal (for example, a server or a desktop computer). Referring to fig. 1, the method includes the following steps:
step S101, a voice sample data set is obtained, where the voice sample data set includes a voice sample frame and a sample tag corresponding to the voice sample frame.
The sample label corresponding to the voice sample frame is used for representing whether the voice sample frame is voice or not. For example, tag 0 may be characterized as non-speech and tag 1 may be characterized as speech.
The voice sample data set may be sampled from a pre-collected sample, and the pre-collection method may refer to related technology, which is not described herein.
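For illustration only, a voice sample data set of this kind could be wrapped as a PyTorch dataset roughly as in the sketch below; the class name, the feature tensors, and the per-utterance organisation are assumptions made for the example and are not part of the patent.

```python
from torch.utils.data import Dataset

class VoiceSampleDataset(Dataset):
    """Illustrative wrapper for a voice sample data set: each item holds one
    utterance's frame features together with per-frame sample labels
    (1 = speech frame, 0 = non-speech frame)."""
    def __init__(self, utterance_feats, utterance_labels):
        assert len(utterance_feats) == len(utterance_labels)
        self.utterance_feats = utterance_feats    # list of (frames, feat_dim) float tensors
        self.utterance_labels = utterance_labels  # list of (frames,) tensors of 0/1 labels

    def __len__(self):
        return len(self.utterance_feats)

    def __getitem__(self, idx):
        return self.utterance_feats[idx], self.utterance_labels[idx]
```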
Step S102, inputting the voice frame sample into the voice activity detection model to obtain a voice probability result corresponding to the voice frame sample output by the voice activity detection model.
Wherein the speech probability result corresponding to a speech frame sample may be used to characterize the probability that the speech frame sample is speech.
Any model structure known in the related art may be used for the voice activity detection model; it is not described in detail in this embodiment.
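As an illustration only (the embodiment leaves the architecture to the related art), a minimal network that outputs a speech probability per frame might look like the following sketch in PyTorch; the GRU choice, layer sizes, and feature dimension are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SimpleVAD(nn.Module):
    """Minimal illustrative VAD network: per-frame acoustic features in,
    per-frame speech probabilities out. Not the architecture of the patent."""
    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) -> speech probabilities of shape (batch, frames)
        out, _ = self.rnn(feats)
        return torch.sigmoid(self.head(out)).squeeze(-1)
```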
Step S103, determining a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, where the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score.
A smooth approximation of the F1 score can be characterized by the following formula (1):
[Formula (1), defining the smooth approximation F1_1 of the F1 score, is reproduced as an image in the original publication.]
wherein F1_1 is the smooth approximation of the F1 score, torch.sum() is the summation function, y_i is the sample label of the i-th voice frame sample in the voice sample data set, p_i is the probability that the i-th voice frame sample is predicted to be speech, and e is a natural constant.
The negative of this smooth approximation of the F1 score is then -F1_1.
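Formula (1) itself is only available as an image in the original publication, so its exact form is not reproduced here. As an illustrative sketch of the idea, the commonly used differentiable ("soft") F1 approximation replaces hard 0/1 predictions with the probabilities p_i and uses torch.sum() as mentioned above; the implementation below is an assumption for illustration, not a transcription of formula (1).

```python
import torch

def smooth_f1(probs: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Common differentiable approximation of the F1 score (an assumption, not the
    patent's exact formula (1)): probabilities stand in for hard predictions, so
    the value can be back-propagated."""
    tp = torch.sum(probs * labels)             # soft count of true positives
    precision = tp / (torch.sum(probs) + eps)  # soft precision
    recall = tp / (torch.sum(labels) + eps)    # soft recall
    return 2 * precision * recall / (precision + recall + eps)

def first_loss(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # First loss function: the negative of the smooth F1 approximation,
    # so a lower loss corresponds to a higher approximate F1 score.
    return -smooth_f1(probs, labels)
```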
Step S104, adjusting the model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model.
It should be noted that adjusting the model parameters of the voice activity detection model is an iterative update process: when the training parameter for evaluating the voice activity detection model does not satisfy the preset condition, the voice frame samples are input again into the voice activity detection model whose model parameters have been adjusted, a new loss value is determined based on the voice probability results output by the adjusted model and the sample labels, and the model parameters of the voice activity detection model are then adjusted again based on that loss value.
The training parameter for evaluating the voice activity detection model may be the number of times the model parameters of the voice activity detection model have been adjusted. In this case, when the number of adjustments reaches a preset number, it may be determined that the training parameter satisfies the preset condition, and the voice activity detection model after the current parameter adjustment is the trained voice activity detection model.
Alternatively, the training parameter for evaluating the voice activity detection model may be the difference between the current loss value and the previously determined loss value. In this case, when this difference is smaller than a first preset threshold, it may be determined that the training parameter satisfies the preset condition, and the model whose parameters were adjusted based on the previous loss value is the trained voice activity detection model.
The trained voice activity detection model can be used for voice detection in different scenarios.
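A minimal sketch of the iterative adjustment described in step S104 is shown below; the optimizer, learning rate, helper names (model, loss_fn, loader) and default limits are assumptions made for illustration, and both of the preset conditions discussed above are included.

```python
import torch

def train_vad(model, loss_fn, loader, max_steps: int = 10000, threshold: float = 1e-5):
    """Illustrative training loop for step S104 (not the patent's exact procedure).
    Stops either when a preset number of parameter adjustments is reached or when
    the loss value changes by less than a preset threshold between iterations."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev_loss = None
    for step, (feats, labels) in enumerate(loader):
        probs = model(feats)            # voice probability results for the frame samples
        loss = loss_fn(probs, labels)   # loss value from the preset loss function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                # adjust the model parameters
        if step + 1 >= max_steps:       # preset number of adjustments reached
            break
        if prev_loss is not None and abs(prev_loss - loss.item()) < threshold:
            break                       # loss difference below the preset threshold
        prev_loss = loss.item()
    return model
```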
According to the above scheme, it is considered that the F1 score is the index used to evaluate the model in practical application, that the F1 score itself is not differentiable, and that in model training a lower loss value generally indicates better model performance. The constructed first loss function is therefore the negative of a smooth approximation of the F1 score: the smooth approximation of the F1 score is differentiable, and its negative satisfies the convention that a lower loss value corresponds to better model performance during training. Using the negative of the smooth approximation of the F1 score as the first loss function for training the voice activity detection model thus makes the loss function used in generating the voice activity detection model consistent with the objective of the model in practical application, improving the effect of the trained voice activity detection model in practical application.
In some embodiments, the loss function further includes a second loss function, and the second loss function is a cross entropy loss function. In this case, step S103 described above may be implemented as follows: determining a first loss value according to the first loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels; determining a second loss value according to the second loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels; and weighting the first loss value and the second loss value according to a weight relation to obtain the loss value, where the weight relation indicates that the sum of the first weight corresponding to the first loss value and the second weight corresponding to the second loss value is a preset value, the first weight is proportional to the number of times the model parameters of the voice activity detection model have been adjusted, and the second weight is inversely proportional to that number of adjustments.
Illustratively, weighting the first loss value and the second loss value may be characterized by the following equation (2):
loss = W_1 * loss_1 + W_2 * loss_2    (2);
wherein loss is the weighted result of the first loss value and the second loss value, that is, the loss value; W_1 is the first weight, loss_1 is the first loss value, W_2 is the second weight, and loss_2 is the second loss value. It is noted that loss_1 here may be calculated from the first loss function constructed as the negative of the smooth approximation of the F1 score in the above example, that is, loss_1 can be characterized by the following formula (3):
[Formula (3), defining loss_1 as the negative of the smooth approximation F1_1 given in formula (1), is reproduced as an image in the original publication.]
The first loss value loss_1 can be determined by the above formula (3). The meaning of each parameter in formula (3) can be found in the explanation of formula (1) and is not repeated here.
The cross entropy loss function can be characterized by the following equation (4):
loss_2 = -(1/N) * Σ_{i=1}^{N} [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]    (4);
wherein loss_2 is the second loss value, N is the number of voice frame samples in the voice sample data set, y_i is the sample label of the i-th voice frame sample in the voice sample data set, and p_i is the probability that the i-th voice frame sample is predicted to be speech; the second loss value may be determined by equation (4) above.
Wherein the preset value may be 1, and the relationship between the first weight and the second weight may be represented by the following formula (5):
1 = W_1 + W_2    (5);
wherein W_1 is the first weight and W_2 is the second weight. The first weight increases as the number of times the model parameters of the voice activity detection model have been adjusted increases, that is, the first weight is proportional to the number of adjustments; for example, the first weight may increase toward 1 as the number of adjustments increases. The second weight decreases as the number of adjustments increases, that is, the second weight is inversely proportional to the number of adjustments; for example, the second weight may decrease toward 0 as the number of adjustments increases.
Further, the second weight may be characterized by the following formula (6):
W_2 = 1 * e^(-k/K)    (6);
wherein W_2 is the second weight, k is the number of times the model parameters of the voice activity detection model have currently been adjusted, K is the total number of times of adjusting the model parameters of the voice activity detection model, that is, the preset number of times, and e is a natural constant.
The second weight can be determined by the above equation (6), the first weight can be determined based on the weight relation, and the first loss value and the second loss value are weighted based on the first weight and the second weight to obtain the loss value.
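Putting formulas (2), (5) and (6) together, a loss of this kind might be computed as in the sketch below; the cross entropy term uses standard binary cross entropy, and smooth_f1 refers to the illustrative approximation sketched earlier (itself an assumption, since formula (1) is only available as an image).

```python
import math
import torch
import torch.nn.functional as F

def combined_loss(probs: torch.Tensor, labels: torch.Tensor, k: int, K: int) -> torch.Tensor:
    """Illustrative weighted loss following formulas (2), (5) and (6):
    W2 = 1 * e^(-k/K), W1 = 1 - W2, loss = W1 * loss1 + W2 * loss2.
    k is the number of adjustments made so far, K the preset total number."""
    w2 = math.exp(-k / K)                                   # second weight, decays over training
    w1 = 1.0 - w2                                           # weights sum to the preset value 1
    loss1 = -smooth_f1(probs, labels)                       # first loss value (negative smooth F1)
    loss2 = F.binary_cross_entropy(probs, labels.float())   # second loss value (cross entropy)
    return w1 * loss1 + w2 * loss2
```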
It should be noted that when the voice sample data set used for training the voice activity detection model is constructed by sampling from a large number of samples, the data set obtained by each sampling is not uniformly distributed. Because the F1 score involves a summation over the samples in the voice sample data set, the resulting estimate of the F1 score is biased, so using the negative of the smooth approximation of the F1 score alone as the objective function can cause training to diverge. Therefore, in order to solve the problem of training divergence, the cross entropy loss function is combined with the first loss function constructed from the negative of the smooth approximation of the F1 score to train the voice activity detection model cooperatively. The first weight is set proportional to the number of times the model parameters of the voice activity detection model have been adjusted and the second weight inversely proportional to that number, so that the model is first trained to convergence with the cross entropy loss function and then fine-tuned with the first loss function, yielding the trained voice activity detection model. Moreover, because the second weight changes automatically based on the preset total number of adjustments and the number of adjustments already made, the first weight and the second weight change automatically while the loss value is determined, so the contribution of the first loss function and the second loss function to the loss value changes across training stages without manual intervention, that is, there is no need to manually replace the loss function of the model at different training stages.
In the case where the model parameters of the voice activity detection model are adjusted using the loss value obtained by weighting the first loss value and the second loss value according to the weight relation, the preset condition includes that the number of times the model parameters of the voice activity detection model have been adjusted reaches the total number of times.
In some embodiments, step S102 described above may be implemented as follows: inputting the voice frame samples into a candidate voice activity detection model to obtain voice probability results output by the candidate voice activity detection model and corresponding to the voice frame samples, wherein the candidate voice activity detection model is obtained by training with a cross entropy loss function as the objective function.
It should be noted that the foregoing embodiment provides a scheme in which the cross entropy loss function and the negative of the smooth approximation of the F1 score are combined as the loss function of the model to cooperatively generate the voice activity detection model. It can be understood that, alternatively, different loss functions may be used as the objective function of the voice activity detection model at different stages to adjust its model parameters and obtain the trained voice activity detection model.
For example, training with the cross entropy loss function as the objective function of the voice activity detection model yields a candidate voice activity detection model; this is the first stage of obtaining the trained voice activity detection model. Training then continues with the negative of the smooth approximation of the F1 score as the objective function of the candidate voice activity detection model, yielding the trained voice activity detection model; this is the second stage.
The candidate voice activity detection model may be obtained as follows: inputting the voice frame samples into a voice activity detection model, which outputs the voice probability results corresponding to the voice frame samples; determining a loss value using the cross entropy loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels; and adjusting the model parameters of the voice activity detection model based on the loss value until a training parameter for evaluating the voice activity detection model satisfies a second preset condition, thereby obtaining the trained candidate voice activity detection model. The second preset condition may include that the difference between the loss value and the previously determined loss value is smaller than a second preset threshold; such a difference may indicate that the model has converged.
In this way, the cross entropy loss function is first used as the objective function of the voice activity detection model to obtain a candidate voice activity detection model, and the negative of the smooth approximation of the F1 score is then used as the objective function of the candidate voice activity detection model to obtain the trained voice activity detection model, which avoids the problem of training divergence.
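The two-stage scheme above could be sketched as follows, reusing the illustrative train_vad and first_loss helpers from the earlier sketches; all names and thresholds here are assumptions for illustration, not the patent's prescribed procedure.

```python
import torch.nn.functional as F

def two_stage_training(model, loader):
    """Illustrative two-stage generation: stage 1 trains with cross entropy as the
    objective to obtain the candidate VAD model; stage 2 fine-tunes the candidate
    with the negative smooth F1 approximation to obtain the trained VAD model."""
    # Stage 1: cross entropy objective; stop when the loss difference is small
    # (the second preset condition, indicating convergence).
    ce_loss = lambda probs, labels: F.binary_cross_entropy(probs, labels.float())
    candidate = train_vad(model, ce_loss, loader, threshold=1e-5)
    # Stage 2: negative smooth F1 objective, fine-tuning the candidate model.
    trained = train_vad(candidate, first_loss, loader)
    return trained
```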
FIG. 2 is a block diagram illustrating a voice activity detection model generation apparatus according to an example embodiment. Referring to fig. 2, the voice activity detection model generating apparatus may include:
an obtaining module 201, configured to obtain a voice sample data set, wherein the voice sample data set includes voice frame samples and sample labels corresponding to the voice frame samples;
an input module 202, configured to input the voice frame samples into a voice activity detection model to obtain voice probability results output by the voice activity detection model and corresponding to the voice frame samples;
a determining module 203, configured to determine a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, wherein the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score;
and an adjusting module 204, configured to adjust the model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model.
In some embodiments, the loss function further comprises a second loss function, the second loss function being a cross entropy loss function, the determining module 203 comprising:
a first determining submodule, configured to determine a first loss value according to the first loss function, a corresponding speech probability result of a speech frame sample in the speech sample dataset, and a sample label;
a second determining submodule, configured to determine a second loss value according to the second loss function, a corresponding speech probability result of the speech frame samples in the speech sample dataset, and a sample label;
the weighting sub-module is used for weighting the first loss value and the second loss value according to a weight relation to obtain a loss value, wherein the weight relation is used for representing that the sum of the first weight corresponding to the first loss value and the second weight corresponding to the second loss value is a preset value, the first weight is in direct proportion to the adjusted times of the model parameters of the voice activity detection model, and the second weight is in inverse proportion to the adjusted times of the model parameters of the voice activity detection model.
In some embodiments, the second weight is characterized by the following formula:
W_2 = 1 * e^(-k/K)
wherein W_2 is the second weight, k is the number of times the model parameters of the voice activity detection model have currently been adjusted, K is the preset total number of times of adjusting the model parameters of the voice activity detection model, and e is a natural constant.
In some embodiments, the preset condition includes the adjusted number of times of model parameters of the voice activity detection model reaching the total number of times.
In some embodiments, the input module 202 is specifically configured to:
and inputting the voice frame sample into a candidate voice activity detection model to obtain a voice probability result which is output by the candidate voice activity detection model and corresponds to the voice frame sample, wherein the candidate voice activity detection model is obtained by training with a cross entropy loss function as an objective function.
In some embodiments, the preset condition includes the loss value differing from a last determined loss value by less than a first preset threshold.
The embodiments of each module in the above apparatus may refer to the above related embodiments, which are not described herein.
The present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the above method.
The present disclosure also provides an electronic device, including:
a storage device having at least one computer program stored thereon;
at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the above method.
Referring now to fig. 3, a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 3 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the electronic device may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a voice sample data set, wherein the voice sample data set includes voice frame samples and sample labels corresponding to the voice frame samples; input the voice frame samples into a voice activity detection model to obtain voice probability results output by the voice activity detection model and corresponding to the voice frame samples; determine a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, wherein the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score; and adjust model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, example 1 provides a voice activity detection model generation method, comprising:
acquiring a voice sample data set, wherein the voice sample data set comprises a voice sample frame and a sample label corresponding to the voice sample frame;
inputting the voice frame sample into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice frame sample;
determining a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, wherein the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score;
and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model satisfies a preset condition, so as to obtain the trained voice activity detection model.
According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein the loss function further includes a second loss function, the second loss function is a cross entropy loss function, and the determining the loss value includes:
Determining a first loss value according to the first loss function, a corresponding voice probability result of the voice frame samples in the voice sample dataset and a sample label;
determining a second loss value according to the second loss function, a corresponding voice probability result of the voice frame samples in the voice sample dataset and a sample label;
and weighting the first loss value and the second loss value according to a weight relation to obtain a loss value, wherein the weight relation is used for representing that the sum of the first weight corresponding to the first loss value and the second weight corresponding to the second loss value is a preset value, the first weight is in direct proportion to the adjusted times of the model parameters of the voice activity detection model, and the second weight is in inverse proportion to the adjusted times of the model parameters of the voice activity detection model.
In accordance with one or more embodiments of the present disclosure, example 3 provides the method of example 2, the second weight is characterized by:
W_2 = 1 * e^(-k/K)
wherein W_2 is the second weight, k is the number of times the model parameters of the voice activity detection model have currently been adjusted, K is the preset total number of times of adjusting the model parameters of the voice activity detection model, and e is a natural constant.
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of example 3, the preset condition includes an adjusted number of times of model parameters of the voice activity detection model reaching the total number of times.
According to one or more embodiments of the present disclosure, example 5 provides the method of example 1, the inputting the speech frame samples into a speech activity detection model, obtaining speech probability results corresponding to the speech frame samples output by the speech activity detection model, including:
and inputting the voice frame sample into a candidate voice activity detection model to obtain a voice probability result which is output by the candidate voice activity detection model and corresponds to the voice frame sample, wherein the candidate voice activity detection model is obtained by training with a cross entropy loss function as an objective function.
According to one or more embodiments of the present disclosure, example 6 provides the method of example 5, the preset condition comprising the loss value differing from a last determined loss value by less than a first preset threshold.
According to one or more embodiments of the present disclosure, example 7 provides a voice activity detection model generating apparatus, comprising:
an acquisition module, configured to acquire a voice sample data set, wherein the voice sample data set includes voice frame samples and sample labels corresponding to the voice frame samples;
The input module is used for inputting the voice frame sample into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice frame sample;
a determining module, configured to determine a loss value according to a preset loss function, the corresponding voice probability results of the voice frame samples in the voice sample data set, and the sample labels, wherein the loss function includes a first loss function and the first loss function is the negative of a smooth approximation of the F1 score;
and the adjusting module is used for adjusting the model parameters of the voice activity detection model according to the loss value until the training parameters for evaluating the voice activity detection model meet the preset conditions so as to obtain the voice activity detection model.
According to one or more embodiments of the present disclosure, example 8 provides the apparatus of example 7, the loss function further comprising a second loss function, the second loss function being a cross entropy loss function, the determining module comprising:
a first determining submodule, configured to determine a first loss value according to the first loss function, a corresponding speech probability result of a speech frame sample in the speech sample dataset, and a sample label;
A second determining submodule, configured to determine a second loss value according to the second loss function, a corresponding speech probability result of the speech frame samples in the speech sample dataset, and a sample label;
the weighting sub-module is used for weighting the first loss value and the second loss value according to a weight relation to obtain a loss value, wherein the weight relation is used for representing that the sum of the first weight corresponding to the first loss value and the second weight corresponding to the second loss value is a preset value, the first weight is in direct proportion to the adjusted times of the model parameters of the voice activity detection model, and the second weight is in inverse proportion to the adjusted times of the model parameters of the voice activity detection model.
In accordance with one or more embodiments of the present disclosure, example 9 provides the apparatus of example 8, the second weight is characterized by:
W_2 = 1 * e^(-k/K)
wherein W_2 is the second weight, k is the number of times the model parameters of the voice activity detection model have currently been adjusted, K is the preset total number of times of adjusting the model parameters of the voice activity detection model, and e is a natural constant.
In accordance with one or more embodiments of the present disclosure, example 10 provides the apparatus of example 9, the preset condition includes an adjusted number of times of model parameters of the voice activity detection model reaching the total number of times.
In accordance with one or more embodiments of the present disclosure, example 11 provides the apparatus of example 7, the inputting the speech frame samples into a speech activity detection model to obtain speech probability results corresponding to the speech frame samples output by the speech activity detection model, including:
and inputting the voice frame sample into a candidate voice activity detection model to obtain a voice probability result which is output by the candidate voice activity detection model and corresponds to the voice frame sample, wherein the candidate voice activity detection model is obtained by training with a cross entropy loss function as an objective function.
According to one or more embodiments of the present disclosure, example 12 provides the apparatus of example 11, the preset condition includes that the loss value differs from a last determined loss value by less than a first preset threshold.
According to one or more embodiments of the present disclosure, example 13 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-6.
Example 14 provides an electronic device according to one or more embodiments of the present disclosure, comprising:
A storage device having at least one computer program stored thereon;
at least one processing means for executing the at least one computer program in the storage means to implement the steps of the method of any one of examples 1-6.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and will not be elaborated here.

Claims (10)

1. A method for generating a speech activity detection model, comprising:
acquiring a voice sample data set, wherein the voice sample data set comprises a voice frame sample and a sample label corresponding to the voice frame sample;
inputting the voice frame sample into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice frame sample;
determining a loss value according to a preset loss function, a voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function, and the first loss function is the negative of a smooth approximation of an F1 score;
and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model meets a preset condition, so as to obtain the voice activity detection model.
2. The method of claim 1, wherein the loss function further comprises a second loss function, the second loss function being a cross entropy loss function, and the determining a loss value according to a preset loss function, a voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label comprises:
determining a first loss value according to the first loss function, the voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label;
determining a second loss value according to the second loss function, the voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label;
and weighting the first loss value and the second loss value according to a weight relation to obtain the loss value, wherein the weight relation is used for representing that the sum of a first weight corresponding to the first loss value and a second weight corresponding to the second loss value is a preset value, the first weight increases with the number of times the model parameters of the voice activity detection model have been adjusted, and the second weight decreases with the number of times the model parameters of the voice activity detection model have been adjusted.
3. The method of claim 2, wherein the second weight is characterized by the formula:
W₂ = 1 × e^(−k/K)
wherein W₂ is the second weight, k is the number of times the model parameters of the voice activity detection model have currently been adjusted, K is the preset total number of times the model parameters of the voice activity detection model are to be adjusted, and e is the natural constant.
4. A method according to claim 3, wherein the preset condition comprises that the number of times the model parameters of the voice activity detection model have been adjusted reaches the total number of times.
5. The method of claim 1, wherein the inputting the voice frame sample into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice frame sample comprises:
inputting the voice frame sample into a candidate voice activity detection model to obtain a voice probability result which is output by the candidate voice activity detection model and corresponds to the voice frame sample, wherein the candidate voice activity detection model is obtained by training with a cross entropy loss function as an objective function.
6. The method of claim 5, wherein the preset condition comprises that the loss value differs from the last determined loss value by less than a first preset threshold.
7. A voice activity detection model generation apparatus, comprising:
an acquisition module, which is used for acquiring a voice sample data set, wherein the voice sample data set comprises a voice frame sample and a sample label corresponding to the voice frame sample;
the input module is used for inputting the voice frame sample into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice frame sample;
the determining module is used for determining a loss value according to a preset loss function, the voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function, and the first loss function is the negative of a smooth approximation of the F1 score;
and the adjusting module is used for adjusting the model parameters of the voice activity detection model according to the loss value until a training parameter for evaluating the voice activity detection model meets a preset condition, so as to obtain the voice activity detection model.
8. The apparatus of claim 7, wherein the loss function further comprises a second loss function, the second loss function being a cross entropy loss function, the determining module comprising:
a first determining submodule, which is used for determining a first loss value according to the first loss function, the voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label;
a second determining submodule, which is used for determining a second loss value according to the second loss function, the voice probability result corresponding to the voice frame sample in the voice sample data set, and the sample label;
the weighting sub-module is used for weighting the first loss value and the second loss value according to a weight relation to obtain the loss value, wherein the weight relation is used for representing that the sum of a first weight corresponding to the first loss value and a second weight corresponding to the second loss value is a preset value, the first weight increases with the number of times the model parameters of the voice activity detection model have been adjusted, and the second weight decreases with the number of times the model parameters of the voice activity detection model have been adjusted.
9. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-6.
10. An electronic device, comprising:
A storage device having at least one computer program stored thereon;
at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the method according to any one of claims 1-6.
CN202310087996.3A 2023-01-29 2023-01-29 Voice activity detection model generation method and device, medium and electronic equipment Pending CN116312619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310087996.3A CN116312619A (en) 2023-01-29 2023-01-29 Voice activity detection model generation method and device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310087996.3A CN116312619A (en) 2023-01-29 2023-01-29 Voice activity detection model generation method and device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116312619A true CN116312619A (en) 2023-06-23

Family

ID=86831460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310087996.3A Pending CN116312619A (en) 2023-01-29 2023-01-29 Voice activity detection model generation method and device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116312619A (en)

Similar Documents

Publication Publication Date Title
CN113392018B (en) Traffic distribution method and device, storage medium and electronic equipment
CN112241761B (en) Model training method and device and electronic equipment
CN116306981A (en) Policy determination method, device, medium and electronic equipment
CN116483891A (en) Information prediction method, device, equipment and storage medium
CN116072108A (en) Model generation method, voice recognition method, device, medium and equipment
CN111582456B (en) Method, apparatus, device and medium for generating network model information
CN116244431A (en) Text classification method, device, medium and electronic equipment
CN111680754B (en) Image classification method, device, electronic equipment and computer readable storage medium
CN116312619A (en) Voice activity detection model generation method and device, medium and electronic equipment
CN111444384B (en) Audio key point determining method, device, equipment and storage medium
CN116364066A (en) Classification model generation method, audio classification method, device, medium and equipment
CN116343905B (en) Pretreatment method, pretreatment device, pretreatment medium and pretreatment equipment for protein characteristics
CN116504269A (en) Pronunciation evaluation method and device, readable medium and electronic equipment
WO2023217263A1 (en) Data processing method and apparatus, device, and medium
CN117131366B (en) Transformer maintenance equipment control method and device, electronic equipment and readable medium
CN113435528B (en) Method, device, readable medium and electronic equipment for classifying objects
CN116010867A (en) Negative sample determination method, device, medium and electronic equipment
CN116340632A (en) Object recommendation method, device, medium and electronic equipment
CN117744145A (en) Model federal fine tuning method, text classification method, device, medium and equipment
CN117435780A (en) Content determination method, device, medium and electronic equipment
CN117591690A (en) Entity identification method and device in image, medium and electronic equipment
CN112819165A (en) Concept identification method and device, electronic equipment and medium
CN117251639A (en) Content recommendation method, recommendation model training method, device, medium and equipment
CN116467583A (en) Object recognition method and device, readable medium and electronic equipment
CN117726480A (en) Power equipment acquisition method based on digital virtual marking room

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination