CN115376563A - Voice endpoint detection method and device, computer equipment and storage medium

Info

Publication number: CN115376563A
Application number: CN202210992114.3A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 谭风云 (Tan Fengyun), 魏韬 (Wei Tao), 马骏 (Ma Jun), 王少军 (Wang Shaojun)
Original and current assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd; priority to CN202210992114.3A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Abstract

The embodiments of the present application belong to the technical field of audio processing in artificial intelligence, and relate to a voice endpoint detection method for vibration audio, together with a corresponding apparatus, computer device, and storage medium. The present application also relates to blockchain technology: the target voice endpoint of a user can be stored in a blockchain. The application performs data enhancement on valid data from real business scenarios to alleviate the scarcity of non-human-voice noise data such as mobile phone vibration, thereby addressing the pain point that sparse vibration data leads to poor model learning during training. At the same time, by adjusting the data proportions, the model can, without affecting the recognition rates of other business, solve the vibration misrecognition problem in both quiet and noisy background-voice environments, eliminating the influence of non-human sounds at the root. This makes VAD detection more accurate, improves speech recognition accuracy, and reduces bandwidth resource consumption.

Description

Voice endpoint detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of audio processing technology in artificial intelligence, and in particular, to a method and an apparatus for detecting a speech endpoint for vibration audio, a computer device, and a storage medium.
Background
VAD (Voice Activity Detection, also known as voice endpoint detection) aims to identify and eliminate long periods of silence from the voice signal stream so as to save speech channel resources without degrading the quality of service. However, when non-human-voice noise is erroneously detected and recognized as human voice, it consumes communication bandwidth, which not only wastes resources but also harms the recognition accuracy of the downstream speech recognition system.
The existing voice endpoint detection methods for vibration audio start from the acoustic model: they optimize the acoustic-model learning method and feed in different types of training data so that the model continuously learns the characteristics of non-human-voice noise. After multiple rounds of iterative training, the model gains a better ability to learn non-human-voice noise, thereby mitigating the drop in model recognition rate.
However, the applicant has found that the conventional voice endpoint detection method for vibration audio does not eliminate the influence of non-human-voice noise at its source, because the VAD model still falsely detects non-human-voice segments, which wastes bandwidth resources. It can thus be seen that the conventional method cannot identify non-human-voice noise, which reduces the accuracy of voice endpoint detection for vibration audio.
Disclosure of Invention
An embodiment of the present application provides a voice endpoint detection method for vibration audio, together with an apparatus, a computer device, and a storage medium, so as to solve the problem that the conventional voice endpoint detection method for vibration audio cannot identify non-human-voice noise, which reduces the accuracy of voice endpoint detection for vibration audio.
In order to solve the above technical problem, an embodiment of the present application provides a method for detecting a voice endpoint for a vibration audio, which adopts the following technical solutions:
acquiring vibration audio data corresponding to a target scene;
performing data enhancement operation on the vibration audio data to obtain audio enhancement data;
performing feature extraction operation on the audio enhancement data according to an open source voice recognition tool to obtain audio feature data;
performing feature labeling operation on the audio feature data according to a preset standard audio feature and a feature tag corresponding to the standard audio feature to obtain audio labeling data;
obtaining a mute audio characteristic, a human voice audio characteristic and a vibration audio characteristic in a preset proportion from the audio labeling data to obtain model training data;
performing model training operation on the initial VAD model according to the model training data to obtain a target VAD model;
acquiring audio to be identified;
and inputting the audio to be identified into the target VAD model to perform voice endpoint detection operation aiming at the vibration audio to obtain a target voice endpoint.
In order to solve the above technical problem, an embodiment of the present application further provides a voice endpoint detection apparatus for vibration audio, which adopts the following technical solutions:
the vibration audio module is used for acquiring vibration audio data corresponding to a target scene;
the data enhancement module is used for carrying out data enhancement operation on the vibration audio data to obtain audio enhancement data;
the feature extraction module is used for carrying out feature extraction operation on the audio enhancement data according to an open source voice recognition tool to obtain audio feature data;
the characteristic marking module is used for carrying out characteristic marking operation on the audio characteristic data according to preset standard audio characteristics and characteristic labels corresponding to the standard audio characteristics to obtain audio marking data;
the training data acquisition module is used for acquiring a mute audio characteristic, a human voice audio characteristic and a vibration audio characteristic in a preset proportion from the audio labeling data to obtain model training data;
the model training module is used for carrying out model training operation on the initial VAD model according to the model training data to obtain a target VAD model;
the audio acquisition module to be identified is used for acquiring the audio to be identified;
and the to-be-identified audio detection module is used for inputting the to-be-identified audio to the target VAD model to perform voice endpoint detection operation aiming at the vibration audio so as to obtain a target voice endpoint.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
comprising a memory in which computer readable instructions are stored and a processor which, when executing the computer readable instructions, implements the steps of the voice endpoint detection method for vibration audio described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the voice endpoint detection method for rumble audio as described above.
The application provides a voice endpoint detection method for vibration audio, comprising the following steps: acquiring vibration audio data corresponding to a target scene; performing a data enhancement operation on the vibration audio data to obtain audio enhancement data; performing a feature extraction operation on the audio enhancement data according to an open source speech recognition tool to obtain audio feature data; performing a feature labeling operation on the audio feature data according to preset standard audio features and the feature tags corresponding to the standard audio features to obtain audio labeling data; obtaining silence audio features, human voice audio features, and vibration audio features in a preset ratio from the audio labeling data to obtain model training data; performing a model training operation on the initial VAD model according to the model training data to obtain a target VAD model; acquiring the audio to be recognized; and inputting the audio to be recognized into the target VAD model to perform a voice endpoint detection operation for vibration audio, obtaining a target voice endpoint. Compared with the prior art, the present application performs data enhancement on valid data from real business scenarios to alleviate the scarcity of non-human-voice noise data such as mobile phone vibration, thereby addressing the pain point that sparse vibration data leads to poor model learning during training. At the same time, by adjusting the data proportions, the model can, without affecting the recognition rates of other business, solve the vibration misrecognition problem in both quiet and noisy background-voice environments, eliminating the influence of non-human sounds at the root. This makes VAD detection more accurate, improves speech recognition accuracy, and reduces bandwidth resource consumption.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flowchart of an implementation of a method for detecting a voice endpoint for vibration audio according to an embodiment of the present application;
FIG. 3 is a flowchart of one embodiment of step S201 in FIG. 2;
fig. 4 is a flowchart of a specific implementation of a method for acquiring a human voice audio feature according to an embodiment of the present application;
fig. 5 is a flowchart of a specific implementation of a mel filter bank obtaining method according to an embodiment of the present application;
FIG. 6 is a flowchart of one embodiment of step S503 of FIG. 5;
fig. 7 is a schematic structural diagram of a speech endpoint detection apparatus for vibration audio according to a second embodiment of the present application;
FIG. 8 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the voice endpoint detection method for vibration audio provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the voice endpoint detection apparatus for vibration audio is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Example one
With continuing reference to fig. 2, a flowchart of an implementation of a method for detecting a voice endpoint for vibration audio according to an embodiment of the present application is shown, and for convenience of explanation, only the relevant portions of the present application are shown.
The voice endpoint detection method for the vibration audio comprises the following steps:
step S201: and acquiring vibration audio data corresponding to the target scene.
In the embodiment of the present application, the vibration audio data refers to non-human-voice noise in the form of vibration, for example, the noise generated by the vibration of a mobile phone when it receives an incoming call.
Step S202: and carrying out data enhancement operation on the vibration audio data to obtain audio enhancement data.
In the embodiment of the application, because mobile phone vibration audio segments are scarce, the audio is first duplicated 10 times; in addition, data enhancement methods such as speed perturbation, volume perturbation, and noise addition are used to further expand the data. Speed perturbation is done with the sox command: on the basis of the original audio, audio at different speech rates is generated by speeding the speech up or slowing it down, and the training data of this application use three speed coefficients, 0.9, 1.0, and 1.1. Volume perturbation operates like speed perturbation: it also uses the sox command to increase or decrease the volume by a random volume coefficient on the basis of the original audio, generating new audio at different volumes. Finally, noise is superimposed on the original audio to generate mobile phone vibration audio containing background noise.
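The sox-based augmentation described here can be sketched as follows; the directory layout, the gain range, and the noise corpus are assumptions, and the sox invocations (speed, vol, and -m mixing) are standard effects rather than the patent's exact commands.

```python
import random
import subprocess
from pathlib import Path

SPEED_FACTORS = [0.9, 1.0, 1.1]  # the three speed coefficients from the text

def perturb_speed(src: Path, dst: Path, factor: float) -> None:
    # sox "speed" resamples the audio, changing rate (and pitch) by `factor`.
    subprocess.run(["sox", str(src), str(dst), "speed", str(factor)], check=True)

def perturb_volume(src: Path, dst: Path) -> None:
    # Random gain coefficient; "vol" scales the amplitude. Range is assumed.
    gain = random.uniform(0.5, 1.5)
    subprocess.run(["sox", str(src), str(dst), "vol", str(gain)], check=True)

def add_noise(src: Path, noise: Path, dst: Path) -> None:
    # "-m" mixes the vibration clip with a background-noise recording.
    subprocess.run(["sox", "-m", str(src), str(noise), str(dst)], check=True)

def augment(clip: Path, noise_dir: Path, out_dir: Path, copies: int = 10) -> None:
    # Duplicate the clip 10 times, then perturb speed and volume and add noise.
    noises = list(noise_dir.glob("*.wav"))
    for i in range(copies):
        for factor in SPEED_FACTORS:
            sp = out_dir / f"{clip.stem}_c{i}_s{factor}.wav"
            perturb_speed(clip, sp, factor)
            perturb_volume(sp, out_dir / f"{sp.stem}_v.wav")
            add_noise(sp, random.choice(noises), out_dir / f"{sp.stem}_n.wav")
```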
Step S203: and performing feature extraction operation on the audio enhancement data according to the open source speech recognition tool to obtain audio feature data.
In the embodiment of the application, 13-dimensional MFCC features are extracted from the newly generated mobile phone vibration data using a script provided by Kaldi, and the corresponding feature file is then generated.
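The patent extracts these features with a Kaldi script; the following librosa-based sketch is an illustrative equivalent, with the 16 kHz rate and 25 ms / 10 ms framing assumed rather than taken from the patent.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)
    # 13-dimensional MFCCs; 25 ms window and 10 ms shift are common VAD framing.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )
    return mfcc.T  # shape: (num_frames, 13), one row per frame
```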
Step S204: and performing characteristic marking operation on the audio characteristic data according to a preset standard audio characteristic and a characteristic label corresponding to the standard audio characteristic to obtain audio marking data.
In this embodiment of the application, the standard audio features include a mute audio feature, a human voice audio feature, and a vibration audio feature, where a feature tag corresponding to the mute audio feature is 0, a feature tag corresponding to the human voice audio feature is 1, and a feature tag corresponding to the vibration audio feature is 2.
In the embodiment of the present application, the background of the present application is that mobile phone vibration audio segments were being judged as human voice by the VAD model, so the sample tags corresponding to vibration audio need to be mapped to non-voice during processing. The main idea of the algorithm is that MFCC feature extraction takes the frame as its unit: the number of frames occupied by the mobile phone vibration segment in each channel of audio is first calculated, the label corresponding to each frame feature of the vibration segment is set to 2 (0 = silence, 1 = voice, 2 = noise), other silent frames are set to 0, and voice frames are set to 1. The target alignment file is then generated.
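A minimal sketch of this per-frame labelling, assuming segment boundaries given in seconds and the 10 ms frame shift used in the MFCC sketch above; the function and argument names are illustrative.

```python
import numpy as np

FRAME_SHIFT = 0.010  # seconds, matching the assumed MFCC hop length

def label_frames(num_frames: int,
                 voice_segments: list[tuple[float, float]],
                 vibration_segments: list[tuple[float, float]]) -> np.ndarray:
    labels = np.zeros(num_frames, dtype=np.int64)  # default: 0 (silence)
    for start, end in voice_segments:
        labels[int(start / FRAME_SHIFT):int(end / FRAME_SHIFT)] = 1
    # Vibration frames are forced to label 2 so the model learns to treat
    # them as non-voice, overriding any overlap with voice segments.
    for start, end in vibration_segments:
        labels[int(start / FRAME_SHIFT):int(end / FRAME_SHIFT)] = 2
    return labels
```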
Step S205: and obtaining a mute audio characteristic, a human voice audio characteristic and a vibration audio characteristic in a preset proportion from the audio labeling data to obtain model training data.
In the embodiment of the application, after data processing is finished, the data proportion needs to be adjusted, and the data is then fed into the model for learning as training data. The training data is divided into three parts: mobile phone vibration data, business voice data, and voice noise data.
Step S206: and carrying out model training operation on the initial VAD model according to the model training data to obtain a target VAD model.
Step S207: and acquiring the audio to be identified.
Step S208: and inputting the audio to be identified into a target VAD model to perform voice endpoint detection operation aiming at the vibration audio to obtain a target voice endpoint.
In an embodiment of the present application, a voice endpoint detection method for vibration audio is provided, comprising: acquiring vibration audio data corresponding to a target scene; performing a data enhancement operation on the vibration audio data to obtain audio enhancement data; performing a feature extraction operation on the audio enhancement data according to an open source speech recognition tool to obtain audio feature data; performing a feature labeling operation on the audio feature data according to preset standard audio features and the feature labels corresponding to the standard audio features to obtain audio labeling data, wherein the standard audio features comprise a silence audio feature, a human voice audio feature, and a vibration audio feature, the feature label corresponding to the silence audio feature being 0, the feature label corresponding to the human voice audio feature being 1, and the feature label corresponding to the vibration audio feature being 2; obtaining silence audio features, human voice audio features, and vibration audio features in a preset ratio from the audio labeling data to obtain model training data; performing a model training operation on the initial VAD model according to the model training data to obtain a target VAD model; acquiring the audio to be recognized; and inputting the audio to be recognized into the target VAD model to perform a voice endpoint detection operation for vibration audio, obtaining a target voice endpoint. Compared with the prior art, the present application performs data enhancement on valid data from real business scenarios to alleviate the scarcity of non-human-voice noise data such as mobile phone vibration, thereby addressing the pain point that sparse vibration data leads to poor model learning during training. At the same time, by adjusting the data proportions, the model can, without affecting the recognition rates of other business, solve the vibration misrecognition problem in both quiet and noisy background-voice environments, eliminating the influence of non-human sounds at the root. This makes VAD detection more accurate, improves speech recognition accuracy, and reduces bandwidth resource consumption.
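For orientation, the following minimal skeleton chains steps S201 through S208 end to end. Every helper name and interface here is assumed for illustration (several helpers are sketched in more detail later in this description); it is not the patent's implementation.

```python
# Skeleton of steps S201-S208; all helper names and interfaces are assumed.
def detect_voice_endpoints(scene_audio: str, audio_to_recognize: str):
    vibration = cut_vibration_segments(scene_audio)          # S201 (ffmpeg)
    enhanced = augment_vibration(vibration)                  # S202 (sox)
    features = [extract_mfcc(wav) for wav in enhanced]       # S203 (13-dim MFCC)
    labelled = label_features(features)                      # S204 (labels 0/1/2)
    train_set = sample_by_ratio(labelled, ratio=(1, 33, 4))  # S205 (preset ratio)
    vad_model = train_vad(initial_vad_model(), train_set)    # S206
    query = extract_mfcc(audio_to_recognize)                 # S207
    return vad_model.detect_endpoints(query)                 # S208
```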
Continuing to refer to fig. 3, a flowchart of one embodiment of step S201 of fig. 2 is shown, and for convenience of illustration, only the relevant portions of the present application are shown.
In some optional implementation manners of this embodiment, step S201 specifically includes:
step S301: acquiring scene audio data corresponding to a target scene;
step S302: and carrying out cutting operation on the scene audio data according to the ffmpeg tool to obtain vibration audio data.
In the embodiment of the present application, ffmpeg is a set of open source computer programs that can be used to record and convert digital audio and video, and to convert them into streams; it is licensed under the LGPL or GPL. It provides a complete solution for recording, converting, and streaming audio and video.
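For illustration, cutting a vibration segment with ffmpeg might look like the following; the timestamps, file names, and the choice of stream-copying are assumptions.

```python
import subprocess

def cut_segment(src: str, dst: str, start: float, duration: float) -> None:
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", src,
         "-ss", str(start),    # segment start, in seconds
         "-t", str(duration),  # segment length, in seconds
         "-c:a", "copy",       # copy the audio stream without re-encoding
         dst],
        check=True,
    )

# e.g. extract a 0.8 s vibration burst starting at 12.3 s:
# cut_segment("scene.wav", "vibration_001.wav", 12.3, 0.8)
```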
In some optional implementations of this embodiment, the standard audio features include a vibration audio feature, a human voice audio feature, and a silence audio feature, and the preset ratio is:

P1 : P2 : P3 = 1 : (30~50) : 4

wherein P1 represents the vibration audio feature, P2 represents the human voice audio feature, and P3 represents the silence audio feature.
In the embodiment of the application, through multiple experiments in which different data ratios were tried by adjusting parameters, the optimal ratio of mobile phone vibration, business voice data, and voice noise data was found to be 1 : 33 : 4. In the prior art, only mobile phone vibration audio was added for fine-tuning on the base VAD, and the effect was not obvious; when vibration, noise, and business voice were tried at a ratio of 1 : 14 : 2, mobile phone vibration under quiet conditions could be detected accurately, but mobile phone vibration in a noisy environment could not, and because the proportion of business voice data was insufficient during fine-tune training, the accuracy on the business test set dropped by about 10 points. Therefore, the proportions of noise data and business voice data were increased; after VAD model training and learning, tests show that mobile phone vibration can be detected accurately in a quiet environment, vibration sound can be identified accurately in a noisy environment, and the business ASR recognition rate improves by 1.2 points. Finally, regression testing of accuracy was carried out on more than 20 business test sets and compared against the base VAD: the overall speech recognition accuracy of the newly trained VAD fluctuates within about 0.1 point. The newly trained VAD model can thus better solve the problem of vibration-sound misrecognition without affecting the accuracy of other business, thereby improving the business recognition rate and reducing unnecessary waste of bandwidth resources.
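As a rough illustration of the proportion-adjustment step, the sketch below draws the three data classes at the reported 1 : 33 : 4 ratio (vibration : business voice : voice noise); the per-class pool structure and all names are assumptions.

```python
import random

def sample_by_ratio(vibration, voice, noise, ratio=(1, 33, 4)):
    # The largest whole "unit" the three pools can support at this ratio.
    unit = min(len(vibration) // ratio[0],
               len(voice) // ratio[1],
               len(noise) // ratio[2])
    train = (random.sample(vibration, unit * ratio[0])
             + random.sample(voice, unit * ratio[1])
             + random.sample(noise, unit * ratio[2]))
    random.shuffle(train)  # mix the classes before training
    return train
```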
Continuing to refer to fig. 4, a flowchart of a specific implementation of a human voice audio feature obtaining method provided in an embodiment of the present application is shown, and for convenience of description, only the portions related to the present application are shown.
In some optional implementations of this embodiment, before step S204, the method further includes:
step S401: acquiring a conventional human voice audio corresponding to a target scene;
step S402: converting the time domain of the conventional human voice audio into a frequency domain according to fast Fourier transform;
step S403: and filtering the converted conventional voice audio according to a Mel filter bank to obtain voice audio characteristics.
In the embodiment of the application, after the conventional human voice audio is obtained, the audio information is first preprocessed to enhance the quality of the voice signal; a fast Fourier transform then converts the audio information from the time domain to the frequency domain; and the frequency-domain audio information is filtered by a Mel filter bank whose frequencies are set based on the language information of the audio, yielding the audio feature vector of the audio information.
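A minimal numpy sketch of this time-domain to frequency-domain to Mel-filtering chain, assuming 16 kHz audio, 25 ms frames with a 10 ms shift, and a 512-point FFT (none of these values come from the patent). `mel_fb` is a (num_filters, n_fft // 2 + 1) filter-bank matrix; one way to build it is sketched a little later in this description.

```python
import numpy as np

def mel_features(signal: np.ndarray, mel_fb: np.ndarray,
                 frame_len: int = 400, hop: int = 160) -> np.ndarray:
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Fast Fourier transform: each frame goes from time to frequency domain.
    spectrum = np.abs(np.fft.rfft(frames, n=512)) ** 2
    # Mel filter bank: weight and sum power-spectrum bins per filter.
    return np.log(spectrum @ mel_fb.T + 1e-10)
```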
Continuing to refer to fig. 5, a flowchart of a specific implementation of a mel filter bank obtaining method provided in an embodiment of the present application is shown, and for convenience of illustration, only the relevant portions of the present application are shown.
In some optional implementations of this embodiment, before step S402, the method further includes:
step S501: and acquiring training language information corresponding to the training audio data.
Step S502: and calling a preset number of calling filters corresponding to the training language information to carry out sequential arrangement to obtain an initial Mel filter set.
Step S503: and determining the starting Mel frequency and the ending Mel frequency of each calling filter in the initial Mel filter bank to obtain the Mel filter bank.
In the embodiment of the application, after the audio information is obtained, it is preprocessed, and at the same time the language of the audio information is analyzed to determine the corresponding language information, that is, which language the current audio information is spoken in: English, Chinese, Japanese, and so on.
In the embodiment of the present application, because the pronunciation frequency response of audio differs across languages, once the language information is acquired, the start frequency and termination frequency of each Mel filter in the Mel filter bank need to be set based on the characteristics of that language. In this way, after the current audio information passes through a Mel filter bank whose frequencies are set according to the language characteristics, those characteristics are highlighted, so that the audio feature vector obtained for the audio information reflects the language and the audio information can be identified accurately.
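The patent's concrete language-dependent start-frequency formulas survive in this text only as images (see step S601 below), so the sketch that follows substitutes the standard uniform spacing on the Mel scale as an assumption: center frequencies f_i = i * F / (M + 1), with filter i running from f_{i-1} to f_{i+1}, matching the prose around step S601. The triangular weighting is a common convention, not taken from the patent; a bank built this way plugs directly into the mel_features sketch shown earlier.

```python
import numpy as np

# ASSUMPTION: uniform spacing on the Mel scale stands in for the patent's
# language-dependent first/second start-frequency formulas, which appear
# only as images in this text.
def mel_filter_bank(num_filters: int, n_fft: int, sr: int) -> np.ndarray:
    f_max_mel = 2595.0 * np.log10(1.0 + (sr / 2.0) / 700.0)  # "F" in the prose
    mel_points = np.linspace(0.0, f_max_mel, num_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        start, center, end = bins[i - 1], bins[i], bins[i + 1]
        for b in range(start, center):   # rising slope: f_{i-1} -> f_i
            fb[i - 1, b] = (b - start) / max(center - start, 1)
        for b in range(center, end):     # falling slope: f_i -> f_{i+1}
            fb[i - 1, b] = (end - b) / max(end - center, 1)
    return fb
```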
Continuing to refer to fig. 6, a flowchart of one embodiment of step S503 of fig. 5 is shown, and for ease of illustration, only the portions relevant to the present application are shown.
In some optional implementation manners of this embodiment, step S503 specifically includes:
Step S601: determining a first start frequency algorithm and a second start frequency algorithm according to the training language information.
[The first start frequency algorithm is given only as an image in the original publication (Figure BDA0003803172050000101).]
[The second start frequency algorithm is likewise given only as an image (Figure BDA0003803172050000102).]
In the formulas, F represents the maximum frequency after conversion into the Mel spectrum, and i = 1, 2, …, M. When determining the center frequency of a given Mel filter, f_i in the formula represents the center frequency of the i-th Mel filter; when determining the specific start Mel frequency of a given Mel filter, f_{i-1} represents the specific start Mel frequency of the i-th Mel filter; and when determining the specific termination Mel frequency of a given Mel filter, f_{i+1} represents the specific termination Mel frequency of the i-th Mel filter.
In the embodiments of the present application, as an example: i in the formula is equal to k when determining the center frequency of the k-th Mel filter, and i is equal to k-1 when determining the specific termination Mel frequency of the k-th Mel filter.
Step S602: determining the specific start Mel frequency of the k-th Mel filter and the preceding Mel filters in the Mel filter bank according to the first start frequency algorithm, and determining the specific start Mel frequency of the (k+1)-th Mel filter and the following Mel filters according to the second start frequency algorithm, wherein k is a positive integer less than half of the sum of the preset number and 1, and k+1 is a positive integer greater than or equal to half of the sum of the preset number and 1;
Step S603: determining the specific termination Mel frequency of the (k-1)-th Mel filter and the preceding Mel filters in the Mel filter bank according to the first start frequency algorithm, and determining the specific termination Mel frequency of the k-th Mel filter and the following Mel filters according to the second start frequency algorithm, wherein the specific termination Mel frequency of each Mel filter is the specific start Mel frequency of the next Mel filter.
It is emphasized that, to further ensure the privacy and security of the target voice endpoint, the target voice endpoint may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another using cryptographic methods, each data block containing information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and so on.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions directing the relevant hardware; the instructions can be stored in a computer readable storage medium, and when executed may include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on their execution order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages; these are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential: they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Example two
With further reference to fig. 7, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a speech endpoint detection apparatus for vibration audio, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 7, the voice endpoint detection apparatus 200 for vibration audio according to the present embodiment includes: the system comprises a vibration audio module 210, a data enhancement module 220, a feature extraction module 230, a feature labeling module 240, a training data acquisition module 250, a model training module 260, an audio to be recognized acquisition module 270 and an audio to be recognized detection module 280. Wherein:
a vibration audio module 210, configured to obtain vibration audio data corresponding to a target scene;
the data enhancement module 220 is configured to perform data enhancement operation on the vibration audio data to obtain audio enhancement data;
the feature extraction module 230 is configured to perform feature extraction operation on the audio enhancement data according to an open-source speech recognition tool to obtain audio feature data;
the feature labeling module 240 is configured to perform feature labeling operation on the audio feature data according to a preset standard audio feature and a feature tag corresponding to the standard audio feature, so as to obtain audio labeled data;
a training data obtaining module 250, configured to obtain a mute audio feature, a human voice audio feature, and a vibration audio feature in a preset ratio from the audio labeling data to obtain model training data;
the model training module 260 is configured to perform model training operation on the initial VAD model according to the model training data to obtain a target VAD model;
the audio to be identified acquiring module 270 is configured to acquire an audio to be identified;
and the audio to be recognized detection module 280 is configured to input the audio to be recognized to the target VAD model to perform a voice endpoint detection operation for the vibration audio, so as to obtain a target voice endpoint.
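Purely to illustrate how the eight modules compose, the following sketch wires placeholder callables in the order listed above; none of the names, signatures, or the run() flow come from the patent.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceEndpointDetectionApparatus:
    get_vibration_audio: Callable   # vibration audio module (210)
    enhance: Callable               # data enhancement module (220)
    extract_features: Callable      # feature extraction module (230)
    label_features: Callable        # feature labeling module (240)
    sample_training_data: Callable  # training data acquisition module (250)
    train_model: Callable           # model training module (260)
    get_audio: Callable             # audio-to-be-recognized acquisition (270)
    detect: Callable                # audio-to-be-recognized detection (280)

    def run(self, scene: str, audio_path: str):
        # Chain the modules in the order the apparatus lists them.
        data = self.sample_training_data(
            self.label_features(
                self.extract_features(
                    self.enhance(self.get_vibration_audio(scene)))))
        model = self.train_model(data)
        return self.detect(model, self.get_audio(audio_path))
```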
In the embodiment of the present application, the vibration audio data refers to non-human-voice noise in the form of vibration, for example, the noise generated by the vibration of a mobile phone when it receives an incoming call.
In the embodiment of the application, because mobile phone vibration audio segments are scarce, the audio is first duplicated 10 times; in addition, data enhancement methods such as speed perturbation, volume perturbation, and noise addition are used to further expand the data. Speed perturbation is done with the sox command: on the basis of the original audio, audio at different speech rates is generated by speeding the speech up or slowing it down, and the training data of this application use three speed coefficients, 0.9, 1.0, and 1.1. Volume perturbation operates like speed perturbation: it also uses the sox command to increase or decrease the volume by a random volume coefficient on the basis of the original audio, generating new audio at different volumes. Finally, noise is superimposed on the original audio to generate mobile phone vibration audio containing background noise.
In the embodiment of the application, 13-dimensional MFCC features are extracted from the newly generated mobile phone vibration data using a script provided by Kaldi, and the corresponding feature file is then generated.
In the embodiment of the present application, the background of the present application is that mobile phone vibration audio segments were being judged as human voice by the VAD model, so the sample tags corresponding to vibration audio need to be mapped to non-voice during processing. The main idea of the algorithm is that MFCC feature extraction takes the frame as its unit: the number of frames occupied by the mobile phone vibration segment in each channel of audio is first calculated, the label corresponding to each frame feature of the vibration segment is set to 2 (0 = silence, 1 = voice, 2 = noise), other silent frames are set to 0, and voice frames are set to 1. The target alignment file is then generated.
In the embodiment of the application, after data processing is finished, the data proportion needs to be adjusted, and the data is then fed into the model for learning as training data. The training data is divided into three parts: mobile phone vibration data, business voice data, and voice noise data.
In an embodiment of the present application, there is provided a voice endpoint detection apparatus 200 for vibration audio, including: a vibration audio module 210 for acquiring vibration audio data corresponding to a target scene; a data enhancement module 220 for performing a data enhancement operation on the vibration audio data to obtain audio enhancement data; a feature extraction module 230 for performing a feature extraction operation on the audio enhancement data according to an open source speech recognition tool to obtain audio feature data; a feature labeling module 240 for performing a feature labeling operation on the audio feature data according to preset standard audio features and the feature labels corresponding to the standard audio features to obtain audio labeling data, wherein the standard audio features comprise a silence audio feature, a human voice audio feature, and a vibration audio feature, the feature label corresponding to the silence audio feature being 0, the feature label corresponding to the human voice audio feature being 1, and the feature label corresponding to the vibration audio feature being 2; a training data acquisition module 250 for obtaining silence audio features, human voice audio features, and vibration audio features in a preset ratio from the audio labeling data to obtain model training data; a model training module 260 for performing a model training operation on the initial VAD model according to the model training data to obtain a target VAD model; an audio to be recognized acquisition module 270 for acquiring the audio to be recognized; and an audio to be recognized detection module 280 for inputting the audio to be recognized into the target VAD model to perform a voice endpoint detection operation for vibration audio, obtaining a target voice endpoint. Compared with the prior art, the present application performs data enhancement on valid data from real business scenarios to alleviate the scarcity of non-human-voice noise data such as mobile phone vibration, thereby addressing the pain point that sparse vibration data leads to poor model learning during training. At the same time, by adjusting the data proportions, the model can, without affecting the recognition rates of other business, solve the vibration misrecognition problem in both quiet and noisy background-voice environments, eliminating the influence of non-human sounds at the root. This makes VAD detection more accurate, improves speech recognition accuracy, and reduces bandwidth resource consumption.
In some optional implementations of the present embodiment, the above-mentioned voice endpoint detection apparatus 200 for vibration audio further includes: a human voice audio acquisition module, a domain conversion module, and a filtering processing module, wherein:
the voice audio acquisition module is used for acquiring conventional voice audio corresponding to the target scene;
the domain conversion module is used for converting the time domain of the conventional human voice audio into a frequency domain according to fast Fourier transform;
and the filtering processing module is used for filtering the converted conventional voice audio according to the Mel filter bank to obtain voice audio characteristics.
In some optional implementations of this embodiment, the standard audio features include a vibration audio feature, a human voice audio feature, and a silence audio feature, and the preset ratio is:

P1 : P2 : P3 = 1 : (30~50) : 4

wherein P1 represents the vibration audio feature, P2 represents the human voice audio feature, and P3 represents the silence audio feature.
In the embodiment of the application, through multiple experiments in which different data ratios were tried by adjusting parameters, the optimal ratio of mobile phone vibration, business voice data, and voice noise data was found to be 1 : 33 : 4. In the prior art, only mobile phone vibration audio was added for fine-tuning on the base VAD, and the effect was not obvious; when vibration, noise, and business voice were tried at a ratio of 1 : 14 : 2, mobile phone vibration under quiet conditions could be detected accurately, but mobile phone vibration in a noisy environment could not, and because the proportion of business voice data was insufficient during fine-tune training, the accuracy on the business test set dropped by about 10 points. Therefore, the proportions of noise data and business voice data were increased; after VAD model training and learning, tests show that mobile phone vibration can be detected accurately in a quiet environment, vibration sound can be identified accurately in a noisy environment, and the business ASR recognition rate improves by 1.2 points. Finally, regression testing of accuracy was carried out on more than 20 business test sets and compared against the base VAD: the overall speech recognition accuracy of the newly trained VAD fluctuates within about 0.1 point. Therefore, the newly trained VAD model can better solve the problem of vibration-sound misrecognition without affecting the accuracy of other business, thereby improving the business recognition rate and reducing unnecessary waste of bandwidth resources.
In some optional implementations of the present embodiment, the above-mentioned voice endpoint detection apparatus 200 for vibration audio further includes: a language information acquisition submodule, a sequential arrangement submodule, and a frequency determination submodule, wherein:
the language information acquisition submodule is used for acquiring training language information corresponding to the training audio data;
the sequential arrangement submodule is used for calling a preset number of calling filters corresponding to the training language information to carry out sequential arrangement to obtain an initial Mel filter bank;
and the frequency determining submodule is used for determining the starting Mel frequency and the ending Mel frequency of each calling filter in the initial Mel filter bank to obtain the Mel filter bank.
In the embodiment of the application, after the audio information is obtained, it is preprocessed, and at the same time the language of the audio information is analyzed to determine the corresponding language information, that is, which language the current audio information is spoken in: English, Chinese, Japanese, and so on.
In the embodiment of the present application, because the pronunciation frequency response of audio differs across languages, once the language information is acquired, the start frequency and termination frequency of each Mel filter in the Mel filter bank need to be set based on the characteristics of that language. In this way, after the current audio information passes through a Mel filter bank whose frequencies are set according to the language characteristics, those characteristics are highlighted, so that the audio feature vector obtained for the audio information reflects the language and the audio information can be identified accurately.
In some optional implementations of this embodiment, the frequency determining sub-module includes: a mode determination unit, a start frequency determination unit, and a termination frequency determination unit, wherein:
the mode determining unit is used for determining a first start frequency algorithm and a second start frequency algorithm according to the training language information;
the start frequency determining unit is used for determining the specific start Mel frequency of the k-th Mel filter and the preceding Mel filters in the Mel filter bank according to the first start frequency algorithm, and determining the specific start Mel frequency of the (k+1)-th Mel filter and the following Mel filters according to the second start frequency algorithm, wherein k is a positive integer less than half of the sum of the preset number and 1, and k+1 is a positive integer greater than or equal to half of the sum of the preset number and 1;
the termination frequency determining unit is used for determining the specific termination Mel frequency of the (k-1)-th Mel filter and the preceding Mel filters in the Mel filter bank according to the first start frequency algorithm, and determining the specific termination Mel frequency of the k-th Mel filter and the following Mel filters according to the second start frequency algorithm, wherein the specific termination Mel frequency of each Mel filter is the specific start Mel frequency of the next Mel filter.
In the embodiment of the present application, the first start frequency algorithm and the second start frequency algorithm are given only as images in the original publication (Figure BDA0003803172050000161 and Figure BDA0003803172050000171) and are not reproduced in this text. In the formulas, F represents the maximum frequency after conversion into the Mel spectrum, and i = 1, 2, …, M. When determining the center frequency of a given Mel filter, f_i in the formula represents the center frequency of the i-th Mel filter; when determining the specific start Mel frequency of a given Mel filter, f_{i-1} represents the specific start Mel frequency of the i-th Mel filter; and when determining the specific termination Mel frequency of a given Mel filter, f_{i+1} represents the specific termination Mel frequency of the i-th Mel filter. For example, i in the formula is equal to k when determining the center frequency of the k-th Mel filter, and i is equal to k-1 when determining the specific termination Mel frequency of the k-th Mel filter.
In order to solve the technical problem, the embodiment of the application further provides computer equipment. Referring to fig. 8, fig. 8 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 300 includes a memory 310, a processor 320, and a network interface 330 communicatively coupled to each other via a system bus. It is noted that only a computer device 300 having components 310-330 is shown, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user in a keyboard mode, a mouse mode, a remote controller mode, a touch panel mode or a voice control equipment mode.
The memory 310 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 310 may be an internal storage unit of the computer device 300, such as a hard disk or a memory of the computer device 300. In other embodiments, the memory 310 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 300. Of course, the memory 310 may also include both internal and external storage devices of the computer device 300. In this embodiment, the memory 310 is generally used for storing an operating system and various types of application software installed in the computer device 300, such as computer readable instructions of a voice endpoint detection method for vibration audio. In addition, the memory 310 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 320 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 320 is generally operative to control overall operation of the computer device 300. In this embodiment, the processor 320 is configured to execute computer readable instructions stored in the memory 310 or process data, such as executing computer readable instructions of the method for detecting a voice endpoint for vibration audio.
The network interface 330 may include a wireless network interface or a wired network interface, and the network interface 330 is generally used to establish a communication connection between the computer device 300 and other electronic devices.
The application provides a computer device that performs data enhancement on valid data from real business scenarios to alleviate the scarcity of non-human-voice noise data such as mobile phone vibration, thereby addressing the pain point that sparse vibration data leads to poor model learning during training. At the same time, by adjusting the data proportions, the model can, without affecting the recognition rates of other business, solve the vibration misrecognition problem in both quiet and noisy background-voice environments, eliminating the influence of non-human sounds at the root, making VAD detection more accurate, improving speech recognition accuracy, and reducing bandwidth resource consumption.
The present application provides yet another embodiment: a computer-readable storage medium having stored thereon computer readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the voice endpoint detection method for vibration audio described above.
The computer readable storage medium provided by the application performs data enhancement on valid data from real business scenarios, alleviating the scarcity of non-human-voice noise data such as mobile phone vibration and thereby addressing the pain point that sparse vibration data leads to poor model learning during training. At the same time, the data-proportion adjustment technique of this application enables the model, without affecting other business recognition rates, to solve the vibration misrecognition problem in both quiet and noisy background-voice environments and to eliminate non-human-voice influence at the root, making VAD detection more accurate, improving speech recognition accuracy, and reducing bandwidth resource consumption.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely some, and not all, of the embodiments of the present application, and that the drawings illustrate preferred embodiments without limiting the scope of the appended claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A voice endpoint detection method for vibration audio is characterized by comprising the following steps:
acquiring vibration audio data corresponding to a target scene;
performing data enhancement operation on the vibration audio data to obtain audio enhancement data;
performing feature extraction operation on the audio enhancement data according to an open source voice recognition tool to obtain audio feature data;
performing feature labeling operation on the audio feature data according to a preset standard audio feature and a feature tag corresponding to the standard audio feature to obtain audio labeling data;
obtaining silence audio features, human voice audio features and vibration audio features in a preset proportion from the audio labeling data to obtain model training data;
performing model training operation on the initial VAD model according to the model training data to obtain a target VAD model;
acquiring audio to be identified;
and inputting the audio to be identified into the target VAD model to perform a voice endpoint detection operation for the vibration audio, so as to obtain a target voice endpoint.
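For illustration only, the following is a minimal Python sketch of the flow in claim 1, from frame-level training data to endpoint detection. The classifier type, feature dimensions, label encoding and run-scanning heuristic are assumptions made for the sketch; the patent does not specify a model architecture beyond "VAD model".

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

SILENCE, SPEECH, VIBRATION = 0, 1, 2  # assumed label encoding

def train_vad(features: np.ndarray, labels: np.ndarray) -> MLPClassifier:
    """features: (n_frames, n_dims) per-frame audio features from the
    annotated, ratio-balanced training data; labels: one class per frame."""
    model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
    model.fit(features, labels)
    return model

def detect_endpoints(model: MLPClassifier, features: np.ndarray,
                     frame_shift_s: float = 0.01) -> list:
    """Scan predicted frame labels and return (start_s, end_s) pairs for
    contiguous SPEECH runs; VIBRATION frames end a run just like silence."""
    pred = model.predict(features)
    endpoints, start = [], None
    for i, p in enumerate(pred):
        if p == SPEECH and start is None:
            start = i
        elif p != SPEECH and start is not None:
            endpoints.append((start * frame_shift_s, i * frame_shift_s))
            start = None
    if start is not None:
        endpoints.append((start * frame_shift_s, len(pred) * frame_shift_s))
    return endpoints
```

The point mirrored from the claims is that vibration is a class of its own, so vibration frames terminate a speech run exactly as silence does rather than being misread as speech.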
2. The method according to claim 1, wherein the step of obtaining the vibration audio data corresponding to the target scene specifically comprises the steps of:
acquiring scene audio data corresponding to a target scene;
and cutting the scene audio data with the ffmpeg tool to obtain the vibration audio data.
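As a hedged sketch of the cutting step in claim 2: only standard ffmpeg options (-ss, -t, -i) are used, and the file names and time offsets are hypothetical.

```python
import subprocess

def cut_segment(src: str, dst: str, start_s: float, dur_s: float) -> None:
    """Extract [start_s, start_s + dur_s) from src into dst with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_s), "-t", str(dur_s), "-i", src, dst],
        check=True,
    )

# e.g. isolate a 2-second vibration burst observed at 13.5 s in a call recording:
# cut_segment("scene_call.wav", "vibration_0001.wav", 13.5, 2.0)
```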
3. The method of claim 1, wherein the standard audio features comprise a vibration audio feature, a human voice audio feature, and a silence audio feature at a predetermined ratio:
P1 : P2 : P3 = 1 : (30~50) : 4
wherein P1 represents the vibration audio feature; P2 represents the human voice audio feature; and P3 represents the silence audio feature.
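A small illustrative helper for assembling training data at the claimed vibration : human voice : silence ratio of 1 : (30~50) : 4, here fixing the middle term at 40. The function name and the idea of scaling off the scarce vibration class are assumptions for the sketch, not the patented procedure.

```python
import random

def sample_at_ratio(vib: list, speech: list, silence: list,
                    speech_factor: int = 40, silence_factor: int = 4,
                    seed: int = 0) -> list:
    """vib/speech/silence are lists of labeled feature frames (or file paths)."""
    rng = random.Random(seed)
    n = len(vib)  # vibration is the scarce class and sets the scale
    picked = (
        vib
        + rng.sample(speech, min(speech_factor * n, len(speech)))
        + rng.sample(silence, min(silence_factor * n, len(silence)))
    )
    rng.shuffle(picked)
    return picked
```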
4. The method of claim 1, wherein before the step of performing a feature labeling operation on the audio feature data according to a preset standard audio feature and a feature tag corresponding to the standard audio feature to obtain audio labeling data, the method further comprises the following steps:
acquiring a conventional human voice audio corresponding to the target scene;
converting the conventional human voice audio from the time domain to the frequency domain using a fast Fourier transform;
and filtering the converted conventional human voice audio with a Mel filter bank to obtain the human voice audio features.
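A sketch of the feature path in claim 4 (time domain → FFT → Mel filter bank), using librosa as a stand-in for whatever open source tool the patent intends; the sample rate, window, hop and filter count are assumed values.

```python
import librosa
import numpy as np

def mel_features(wav_path: str, n_mels: int = 40) -> np.ndarray:
    """Load mono audio and return log-Mel features of shape (n_frames, n_mels)."""
    y, sr = librosa.load(wav_path, sr=16000)             # time-domain signal, 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels
    )                                                    # STFT magnitude -> Mel bank
    return librosa.power_to_db(mel).T                    # log scale, frames as rows
```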
5. The method according to claim 4, wherein before the step of filtering the converted conventional human voice audio with the Mel filter bank to obtain the human voice audio features, the method further comprises the following steps:
acquiring training language information corresponding to the conventional human voice audio;
invoking a preset number of filters corresponding to the training language information and arranging them in sequence to obtain an initial Mel filter bank;
and determining the starting Mel frequency and the terminating Mel frequency of each filter in the initial Mel filter bank to obtain the Mel filter bank.
6. The method of claim 5, wherein the step of determining the starting Mel frequency and the terminating Mel frequency of each filter in the initial Mel filter bank to obtain the Mel filter bank comprises the following steps:
determining a first starting frequency algorithm and a second starting frequency algorithm according to the training language information, wherein the first starting frequency algorithm is represented as:
[formula image FDA0003803172040000021; the first starting frequency formula appears only as an image in the source]
the second start frequency algorithm is represented as:
[formula image FDA0003803172040000022; the second starting frequency formula appears only as an image in the source]
wherein F represents the maximum frequency after conversion to the Mel spectrum, and i = 1, 2, …, M; when the center frequency of a given Mel filter is determined, f_i in the formula represents the center frequency of the i-th Mel filter; when the specific starting Mel frequency of a given Mel filter is determined, f_(i-1) represents the specific starting Mel frequency of the i-th Mel filter; and when the specific terminating Mel frequency of a given Mel filter is determined, f_(i+1) represents the specific terminating Mel frequency of the i-th Mel filter;
determining the specific starting Mel frequencies of the k-th Mel filter and the Mel filters preceding it in the Mel filter bank according to the first starting frequency algorithm, and determining the specific starting Mel frequencies of the (k+1)-th Mel filter and the Mel filters following it according to the second starting frequency algorithm, wherein k is a positive integer less than half of the sum of the preset number and 1, and k+1 is a positive integer greater than or equal to half of the sum of the preset number and 1;
determining the specific terminating Mel frequencies of the (k-1)-th Mel filter and the Mel filters preceding it in the Mel filter bank according to the first starting frequency algorithm, and determining the specific terminating Mel frequencies of the k-th Mel filter and the Mel filters following it according to the second starting frequency algorithm, wherein the specific terminating Mel frequency of each Mel filter is the specific starting Mel frequency of the next Mel filter.
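The two frequency formulas of claim 6 survive only as image references, so they cannot be reproduced here. The sketch below instead builds an abutting Mel filter bank that honors the one property the claim does state in text, namely that each filter's terminating Mel frequency equals the next filter's starting Mel frequency; the uniform spacing on the Mel axis is an assumption standing in for the unrecoverable piecewise formulas.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)   # standard Mel scale

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def abutting_mel_filter_edges(num_filters: int, f_max_hz: float) -> list:
    """Return (start_hz, end_hz) per filter; each filter's terminating
    frequency equals the next filter's starting frequency, as claim 6
    requires for the terminating/starting Mel frequencies."""
    edges_mel = np.linspace(0.0, hz_to_mel(f_max_hz), num_filters + 1)
    edges_hz = mel_to_hz(edges_mel)
    return [(float(edges_hz[i]), float(edges_hz[i + 1]))
            for i in range(num_filters)]

# e.g. 40 filters up to the 8 kHz Nyquist frequency of 16 kHz audio:
# print(abutting_mel_filter_edges(40, 8000.0)[:3])
```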
7. The method according to claim 1, wherein after the step of inputting the audio to be identified into the target VAD model to perform a voice endpoint detection operation for the vibration audio to obtain a target voice endpoint, the method further comprises the following step:
and storing the target voice endpoint into a blockchain.
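Claim 7 states only that the target voice endpoint is stored in a blockchain, naming no platform. Purely as an illustration of the idea, a minimal hash-chained record follows; a real deployment would use an actual blockchain client, which the patent does not specify.

```python
import hashlib
import json
import time

def append_endpoint_block(chain: list, endpoint: tuple) -> dict:
    """Append a tamper-evident block recording one (start_s, end_s) endpoint."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"endpoint": list(endpoint), "ts": time.time(), "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    block = {**body, "hash": digest}
    chain.append(block)
    return block
```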
8. A voice endpoint detection apparatus for vibration audio, comprising:
the vibration audio module is used for acquiring vibration audio data corresponding to a target scene;
the data enhancement module is used for carrying out data enhancement operation on the vibration audio data to obtain audio enhancement data;
the feature extraction module is used for carrying out feature extraction operation on the audio enhancement data according to an open source voice recognition tool to obtain audio feature data;
the characteristic labeling module is used for performing characteristic labeling operation on the audio characteristic data according to preset standard audio characteristics and characteristic labels corresponding to the standard audio characteristics to obtain audio labeling data;
the training data acquisition module is used for acquiring silence audio features, human voice audio features and vibration audio features in a preset proportion from the audio labeling data to obtain model training data;
the model training module is used for carrying out model training operation on the initial VAD model according to the model training data to obtain a target VAD model;
the audio acquisition module to be identified is used for acquiring the audio to be identified;
and the to-be-identified audio detection module is used for inputting the to-be-identified audio into the target VAD model to perform a voice endpoint detection operation for the vibration audio, so as to obtain a target voice endpoint.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, performs the steps of the voice endpoint detection method for vibration audio of any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the voice endpoint detection method for vibration audio of any one of claims 1 to 7.
CN202210992114.3A 2022-08-17 2022-08-17 Voice endpoint detection method and device, computer equipment and storage medium Pending CN115376563A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210992114.3A | 2022-08-17 | 2022-08-17 | Voice endpoint detection method and device, computer equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN115376563A | 2022-11-22

Family

ID=84066270

Family Applications (1)

Application Number | Status | Publication | Title
CN202210992114.3A | Pending | CN115376563A (en) | Voice endpoint detection method and device, computer equipment and storage medium

Country Status (1)

Country | Link
CN (1) | CN115376563A (en)

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination