CN117457016A - Method and system for filtering invalid voice recognition data - Google Patents

Method and system for filtering invalid voice recognition data

Info

Publication number
CN117457016A
Authority
CN
China
Prior art keywords
audio data
voice
judging
audio
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311449861.3A
Other languages
Chinese (zh)
Inventor
郑大川
陈振标
杜晓祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunshang Technology Co ltd
Original Assignee
Beijing Yunshang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunshang Technology Co ltd filed Critical Beijing Yunshang Technology Co ltd
Priority to CN202311449861.3A
Publication of CN117457016A
Pending legal-status Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/26 - Speech to text systems
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The application discloses a method and a system for filtering invalid speech recognition data. The method comprises: receiving audio data to be recognized; judging whether the audio data is silence or noise, and if so, not recognizing the audio data and outputting a determination that it is invalid speech; evaluating the voice quality of the audio data and judging whether it is too low, and if so, not recognizing the audio data and outputting a determination that it is invalid speech; judging whether the audio data is music, and if so, not recognizing the audio data and outputting a determination that it is invalid speech; and judging whether the audio data is in a target language, and if not, not recognizing the audio data and outputting a determination that it is invalid speech. The method greatly improves the filtering of poor-quality audio and avoids the waste of resources caused by incorrectly recognizing the input audio.

Description

Method and system for filtering invalid voice recognition data
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method and system for filtering invalid speech recognition data.
Background
In recent years, with advances in data processing technology and the rapid spread of the mobile internet, computer technology has been widely applied across society, bringing with it the generation of massive amounts of data. Among these data, voice data is receiving increasing attention.
The function of speech recognition is to convert the speech in a piece of audio into text. In practical applications, however, large amounts of audio data contain no meaningful speech, or are so noisy as to be unintelligible. Performing speech recognition directly, without filtering, has two consequences: first, machine resources are wasted; second, the recognized text may contain a large number of errors, with unpredictable consequences for subsequent processing; for example, a voice auditing system may misjudge a piece of ordinary music. It is therefore necessary to screen the input audio before speech recognition.
In the prior art, however, only two means, energy detection and human-voice detection, are used to decide whether audio data is silent, and speech recognition is performed on all non-silent data. Many other factors can cause audio data to be recognized incorrectly, so filtering out only silent data filters poor-quality audio badly: recognition errors may occur, and resources are wasted.
Disclosure of Invention
Based on the above, to solve the above technical problems, a method and a system for filtering invalid speech recognition data are provided, so as to improve the filtering of poor-quality audio and avoid the recognition errors and resource waste it causes.
In a first aspect, a method of filtering invalid speech recognition data, the method comprising:
receiving audio data to be recognized;
judging whether the audio data is silence or short-time noise, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
evaluating the voice quality of the audio data and judging whether it is too low, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
judging whether the audio data is music, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
judging whether the audio data is in a target language, and if not, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
and if the audio data is not silence, noise, or music, its voice quality is not too low, and it is in the target language, recognizing the audio data and returning a text recognition result.
In the above scheme, optionally, after the audio data to be recognized is received, the audio data is framed.
In the above aspect, further optionally, the judging whether the audio data is silence or short-time noise comprises: calculating the audio energy of each frame of audio data using the following formula:

$$E = \frac{1}{N} \sum_{n=1}^{N} x(n)^{2}$$

wherein N is the number of sampling points of a single frame of audio, and x(n) is the value of the n-th sampling point, in the range (-1, 1);
and calculating the ratio of the number of frames whose audio energy is greater than a first set value to the total number of frames in the audio data, denoting this the first ratio; if the first ratio is 0, the audio data is judged to be silence, and if the first ratio is smaller than a second set value, the audio data is judged to be noise.
In the above solution, optionally, the first set value is 0.005 and the second set value is 0.1.
In the foregoing aspect, optionally, evaluating the voice quality of the audio data and judging whether it is too low comprises:
extracting mel log filterbank features of each frame of the audio data based on a mel scale;
extracting high-dimensional features of the audio data based on mel log filterbank features of each frame of the audio data using a first neural network;
scoring the voice quality of each time dimension of the high-dimensional feature by using a classifier;
calculating the proportion of high scores among all scores, denoted the second ratio, and calculating the average of all scores; if the second ratio is smaller than a third set value or the average score is smaller than a fourth set value, judging that the voice quality of the audio data is too low, wherein a high score is a score exceeding a fifth set value.
In the above solution, further optionally, the third set value is 0.5, the fourth set value is a score of 0.6, and the fifth set value is a score of 0.7.
In the above aspect, optionally, the determining whether the audio data is music includes:
extracting mel log filterbank features of each frame of the audio data based on a mel scale;
acquiring a score of the audio data as a music probability based on mel log filterbank features of each frame of the audio data by using a second neural network;
and if the music probability score of the audio data exceeds a sixth set value, judging that the audio data is music.
In the above aspect, optionally, the sixth set value is 0.5.
In the above solution, optionally, determining whether the audio data is in the target language includes:
extracting mel log filterbank features of each frame of the audio data based on a mel scale;
extracting high-dimensional features of the audio data based on mel log filterbank features of each frame of the audio data using a first neural network;
judging the language corresponding to the audio data based on the high-dimensional characteristics of the audio data by using a second neural network;
judging whether the language is a target language.
In a second aspect, a system for filtering invalid speech recognition data, the system comprising:
the data receiving module is used for receiving the audio data to be recognized;
the silence-or-noise judging module is used for judging whether the audio data is silence or noise, and if so, returning a determination that the audio data is invalid speech;
the voice quality evaluation module is used for evaluating the voice quality of the audio data and judging whether it is too low, and if so, outputting a determination that the audio data is invalid speech;
the music judging module is used for judging whether the audio data is music, and if so, outputting a determination that the audio data is invalid speech;
the target language judging module is used for judging whether the audio data is in a target language, and if not, outputting a determination that the audio data is invalid speech;
and an audio data recognition module, used for recognizing the audio data and outputting a text recognition result when the silence-or-noise judging module, the voice quality evaluation module, the music judging module, and the target language judging module judge that the audio data is not silence, noise, or music, that its voice quality is not too low, and that it is in the target language.
The application has at least the following beneficial effects:
according to the voice frequency recognition method and device, not only is the mute noise detection carried out on the input voice frequency, but also the music detection, the language detection and the voice quality detection are carried out, and when the input voice frequency is not mute, noise, music, voice quality is too low and the voice quality is not the target language, the voice frequency data are recognized, so that the filtering effect on the voice frequency with poor quality can be greatly improved, the problem that the input voice frequency is poor in quality and the input voice frequency is wrongly recognized to cause resource waste is solved.
Drawings
FIG. 1 is a flowchart of a method for filtering invalid speech recognition data according to one embodiment of the present application;
FIG. 2 is a schematic diagram showing a specific flow of step S103 in one embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a specific flow of step S104 in one embodiment of the present application;
FIG. 4 is a schematic flow chart of step S105 in one embodiment of the present application;
FIG. 5 is an overall flowchart of a method for filtering invalid speech recognition data in one embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In the description of the present application: unless otherwise indicated, "a plurality" means two or more. The terms "first," "second," "third," and the like are used only to distinguish between the objects they refer to and carry no special technical meaning (in particular, they should not be read as indicating degree or order of importance). Expressions such as "comprising," "including," and "having" mean "including but not limited to" (certain units, components, materials, steps, etc.).
In one embodiment, as shown in FIG. 1, a method of filtering invalid speech recognition data is provided, comprising the steps of:
step S101, receiving audio data to be identified;
The received audio data is short-duration audio. The present application judges whether a short segment of speech is of good enough quality to be recognized into correct text; speech of poor quality is filtered out and not recognized.
Step S102, judging whether the audio data is silence or short-time noise, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
In step S102, for an audio signal, loudness can be quantified by audio energy. If the energy of an audio segment is very low or zero, the segment must be silent and can be filtered out directly.
Because of the short-time stationarity of speech signals, framing is typically used in such processing, with a frame length of 10-30 ms. For example, for audio with a sampling rate of 16 kHz (16000 sampling points per second) and a frame length of 10 ms, N = 16000 × (10 ms / 1000 ms) = 160.
The energy of each frame is calculated with the following formula:

$$E = \frac{1}{N} \sum_{n=1}^{N} x(n)^{2}$$

where N is the number of sampling points per frame and x(n) is the value of the n-th sampling point.
Once each frame's audio energy has been computed, the conditions for judging whether the audio is valid are:
1): calculate the ratio of the number of frames whose audio energy exceeds the first set value to the total number of frames, denoted the first ratio; if the first ratio is 0, the energy of every frame is below the set threshold, the sound in the audio segment is negligible, and the segment can be filtered out;
2): if the first ratio is smaller than the second set value, the proportion of frames above the threshold is too low; such audio is usually noise (such as impacts or channel interference) and can be filtered out.
Typically, the first set value is 0.005 and the second set value is 0.1.
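For illustration only (this sketch is not part of the patent; the normalization assumption and the function name are ours), the energy-based silence/noise check of step S102 could look like this in Python:

import numpy as np

def silence_or_noise(x, sr=16000, frame_ms=10,
                     energy_thresh=0.005, ratio_thresh=0.1):
    """Classify audio samples x (floats in (-1, 1)) as 'silence', 'noise', or None."""
    n = int(sr * frame_ms / 1000)                 # samples per frame, e.g. 160
    num_frames = len(x) // n
    frames = np.reshape(x[:num_frames * n], (num_frames, n))
    energy = (frames ** 2).mean(axis=1)           # mean-square energy per frame
    first_ratio = float((energy > energy_thresh).mean())
    if first_ratio == 0:
        return "silence"                          # no frame exceeds the threshold
    if first_ratio < ratio_thresh:
        return "noise"                            # too few energetic frames
    return None                                   # passes the energy check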
Step S103, evaluating the voice quality of the audio data and judging whether it is too low, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
Specifically, the received audio signal is first framed; speech signals have short-time stationarity, so any analysis or processing of a speech signal must be "short-time", i.e. a "short-time analysis". The speech signal is divided into segments, each called a "frame", whose characteristic parameters are then analyzed. The frame length is typically 10-30 ms; this application uses 25 ms. For the whole speech signal, the analysis thus yields a time series of feature parameters, one set per frame. Framing generally uses overlapping segmentation so that frames transition smoothly and continuity is preserved; the frame shift in this application is 10 ms.
Mel log filterbank features are then extracted from each frame of audio. A signal has two common representations used in speech processing, the time domain and the frequency domain, which can be converted into each other by the Fourier transform. Human perception of frequency is not linear: we are more sensitive to low-frequency signals than to high-frequency ones. The mel scale (Mel Scale) matches the sensitivity of the human ear comparatively well by expanding the low frequencies and compressing the high-frequency components. The mel log filterbank is therefore a frequency-domain feature consistent with human hearing and is widely used in speech processing.
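As a minimal sketch (assuming librosa, 16 kHz input, and 80 mel bands; the patent does not fix the number of bands), the 25 ms / 10 ms mel log filterbank extraction could be:

import librosa
import numpy as np

def log_mel_filterbank(x, sr=16000, n_mels=80):
    # 25 ms window (400 samples) and 10 ms shift (160 samples), per the text above
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=400,
                                         hop_length=160, win_length=400,
                                         n_mels=n_mels)
    return np.log(mel + 1e-10).T                  # shape: (num_frames, n_mels)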
Next, a first neural network is used to extract high-dimensional features of the audio data from the mel log filterbank features of each frame; this application adopts a Transformer encoder network. The Transformer is a powerful deep neural network with excellent results in both machine translation and speech recognition. Speech recognition typically uses a Transformer encoder + decoder scheme, in which the encoder extracts important high-dimensional features from each frame of the speech signal and the decoder converts those high-dimensional features into text. The Transformer encoder employed in this application is the front part of a Transformer speech recognition model, and the high-dimensional features it extracts are used to evaluate audio quality.
After the features are fed into the Transformer encoder network, the output of its last layer is the high-dimensional feature extracted from the input speech. This feature is two-dimensional (T × D): T is the size of the time axis and is related to the length of the input speech (it equals the speech duration divided by 60 ms, so for 6 s of audio, T is 100), and D means that each time step is represented by a one-dimensional vector; here D is 512.
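A minimal PyTorch sketch of such an encoder (the layer count, head count, and the 80-dim filterbank projection are our assumptions; the text only fixes D = 512 and T = duration / 60 ms):

import torch
import torch.nn as nn

d_model = 512
proj = nn.Linear(80, d_model)                     # map filterbank features to D = 512
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

fbank = torch.randn(1, 100, 80)                   # (batch, T, mel): T = 100 for 6 s audio
high_dim = encoder(proj(fbank))                   # (1, 100, 512) high-dimensional feature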
Then a classifier scores the voice quality at each time step of the high-dimensional feature. For example, for 6 s of audio we obtain 100 scores, each in the interval 0-1.
Finally, the proportion of high scores among all scores is computed and denoted the second ratio, and the average of all scores is computed; if the second ratio is smaller than the third set value, or the average score is smaller than the fourth set value, the voice quality of the audio data is judged to be too low.
For a single piece of speech we thus obtain a number of scores, from which we compute the average score and the proportion of high scores; the threshold for a high score is set to 0.7. These two conditions determine whether the segment is retained: for example, if we require the average score to be greater than 0.6 and the high-score proportion to be greater than 0.5, speech meeting both requirements is considered of acceptable quality.
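With the example thresholds above (high score 0.7, average 0.6, high-score proportion 0.5), the retention decision reduces to a few lines; the function name is ours:

import numpy as np

def quality_acceptable(scores, high=0.7, ratio_min=0.5, avg_min=0.6):
    """scores: per-time-step quality scores in [0, 1] from the classifier."""
    scores = np.asarray(scores)
    second_ratio = (scores > high).mean()         # proportion of high scores
    return bool(second_ratio > ratio_min and scores.mean() > avg_min)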
Step S103 detects whether the audio segment contains human voice and evaluates its quality; audio segments that contain no human voice, or whose voice quality is very low, can be filtered out. The specific steps for detecting whether speech is present and evaluating its quality are shown in fig. 2.
Step S104: judging whether the audio data is music, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
Specifically, mel log filterbank features of each frame of the audio data are first extracted based on the mel scale; the method is the same as the extraction of mel log filterbank features in step S103 and is not repeated here.
Then, a second neural network is used to obtain a score for the probability that the audio data is music, based on the mel log filterbank features of each frame; the second neural network used in this application is a resnet32 network.
Resnet is a powerful deep residual neural network structure that performs well on audio classification tasks. The resnet32 used here consists mainly of:
(1) an input convolution layer;
(2) multi-layer residual deep convolution, comprising 32 convolution layers with a residual connection every 2 layers, which alleviates the performance degradation that otherwise comes with increasing model depth;
(3) an average pooling layer: the output of the multi-layer residual deep convolution is a two-dimensional result, while in music classification we classify the whole segment directly, i.e. decide whether the entire piece of audio is music or non-music. The frame-level information must therefore be converted into whole-segment information along the time axis. Here we simply average the multi-frame results, which can be implemented with an average pooling layer in the neural network;
(4) a feature dense layer: a single fully-connected layer that converts the output of the previous layer into two numbers, corresponding to the scores for music and non-music respectively;
(5) softmax: the two scores output by the previous layer generally do not sum to 1, so a softmax normalization function is used to make the final results sum to 1.
The resnet32 network thus outputs a music score for the audio segment; if the score exceeds a preset threshold (0.5 in this application), the segment is judged to be music and is filtered out.
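An illustrative PyTorch sketch of the head described in items (3)-(5), omitting the resnet32 backbone (its 512-dim frame output and the class ordering are our assumptions):

import torch
import torch.nn as nn

class MusicHead(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 2)          # scores for music / non-music

    def forward(self, x):                         # x: (batch, T, feat_dim)
        x = x.mean(dim=1)                         # average pooling over time
        return torch.softmax(self.fc(x), dim=-1)  # two probabilities summing to 1

head = MusicHead()
probs = head(torch.randn(1, 100, 512))            # backbone output for one segment
is_music = probs[0, 0].item() > 0.5               # index 0 assumed to be "music"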
Step S104 detects whether the audio segment is music. Music, especially rap music, produces a large number of wrongly recognized words when it enters a speech recognition system; these wrong words can yield erroneous commands or offensive text that interferes with the downstream tasks of speech recognition, so music must be filtered out. The specific steps for judging whether the audio data is music are shown in fig. 3.
Step S105: judging whether the audio data is in a target language, and if not, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
Specifically, mel log filterbank features of each frame of the audio data are first extracted based on the mel scale; the method is the same as in step S103 and is not repeated here.
Secondly, a first neural network (the Transformer encoder network) is used to extract high-dimensional features of the audio data from the mel log filterbank features of each frame; the method is the same as the extraction of high-dimensional features in step S103 and is not repeated here.
Then, a second neural network is used to judge the language of the audio data from its high-dimensional features; the second neural network is again a resnet32 network. Specifically:
1) the average pooling layer of the resnet32 network: in language classification we classify the whole segment directly, i.e. for the entire piece of audio we output a score for each language, and the highest-scoring language is the language of the speech. As in music classification, the frame-level information must be converted into whole-segment information along the time axis; here we simply average the multi-frame results, which can be implemented with an average pooling layer in the neural network;
2) the feature dense layer of the resnet32 network: a single fully-connected layer that converts the output of the previous layer into several numbers, each corresponding to the score of a different language;
3) the softmax layer of the resnet32 network: the several scores output by the previous layer generally do not sum to 1, so a softmax normalization function is used to make the final results sum to 1.
The resnet32 network thus outputs a score for each language, and the highest-scoring language is the language the network judges the audio segment to be in.
Finally, whether this language is the target language is judged; if not, the audio is filtered out.
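Sketched the same way (the language inventory, feature size, and target language are placeholders, not fixed by the patent), the language head plus target check could be:

import torch
import torch.nn as nn

languages = ["zh", "en", "other"]                 # hypothetical language inventory
lang_fc = nn.Linear(512, len(languages))          # dense layer: one score per language

feats = torch.randn(1, 100, 512)                  # high-dimensional encoder output
scores = torch.softmax(lang_fc(feats.mean(dim=1)), dim=-1)  # pool, classify, normalize
predicted = languages[int(scores.argmax(dim=-1))]
is_target = predicted == "zh"                     # filter the segment out if False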
Step S105 detects whether the spoken language in the audio segment is the target language. Speech in a non-target language produces meaningless garbled text when it enters speech recognition, interfering with downstream tasks, so it must be filtered out. The specific steps for judging whether the audio data is in the target language are shown in fig. 4.
Step S106: if the audio data is not silence, noise, or music, its voice quality is not too low, and it is in the target language, recognizing the audio data and returning a text recognition result. An overall architecture diagram of the method of filtering invalid speech recognition data of the present application is shown in fig. 5.
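Putting the stages together, the overall flow of fig. 5 can be sketched as a sequential filter; silence_or_noise and log_mel_filterbank are the sketches above, while quality_ok, is_music, is_target_language, and recognize are hypothetical wrappers around the models described in steps S103-S106:

def filter_and_recognize(audio):
    """Return recognized text, or None if the audio is judged invalid speech."""
    if silence_or_noise(audio):                   # step S102: energy-based filter
        return None
    feats = log_mel_filterbank(audio)
    if not quality_ok(feats):                     # step S103: voice quality too low
        return None
    if is_music(feats):                           # step S104: music filter
        return None
    if not is_target_language(feats):             # step S105: language filter
        return None
    return recognize(audio)                       # step S106: run speech recognition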
The method for filtering invalid speech recognition data of the present application performs not only silence and noise detection on the input audio but also music detection, language detection, and voice quality detection, and recognizes the audio data only when the input audio is not silence, noise, or music, its voice quality is not too low, and it is in the target language. This improves the accuracy of speech recognition while avoiding the waste of resources caused by recognizing the input audio incorrectly.
It should be understood that although the steps in the flowcharts of fig. 1 and fig. 5 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 and fig. 5 may comprise several sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and not necessarily in sequence; they may be performed in turn or in alternation with other steps or with the sub-steps or stages of other steps.
In one embodiment, a system for filtering invalid speech recognition data is provided, comprising:
the data receiving module is used for receiving the audio data to be recognized;
the silence-or-noise judging module is used for judging whether the audio data is silence or noise, and if so, returning a determination that the audio data is invalid speech;
the voice quality evaluation module is used for evaluating the voice quality of the audio data and judging whether it is too low, and if so, outputting a determination that the audio data is invalid speech;
the music judging module is used for judging whether the audio data is music, and if so, outputting a determination that the audio data is invalid speech;
the target language judging module is used for judging whether the audio data is in a target language, and if not, outputting a determination that the audio data is invalid speech;
and an audio data recognition module, used for recognizing the audio data and outputting a text recognition result when the silence-or-noise judging module, the voice quality evaluation module, the music judging module, and the target language judging module judge that the audio data is not silence, noise, or music, that its voice quality is not too low, and that it is in the target language.
For the specific limitations of the system for filtering invalid speech recognition data, reference may be made to the limitations of the method for filtering invalid speech recognition data above, which are not repeated here. Each module in the above system may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application; although they are described in some detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and such modifications and improvements fall within the protection scope of the present application. Accordingly, the protection scope of the present application is defined by the appended claims.

Claims (10)

1. A method of filtering invalid speech recognition data, the method comprising:
receiving audio data to be recognized;
judging whether the audio data is silence or short-time noise, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
evaluating the voice quality of the audio data and judging whether it is too low, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
judging whether the audio data is music, and if so, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
judging whether the audio data is in a target language, and if not, not recognizing the audio data and outputting a determination that the audio data is invalid speech;
and if the audio data is not silence, noise, or music, its voice quality is not too low, and it is in the target language, recognizing the audio data and returning a text recognition result.
2. The method of claim 1, wherein the audio data is framed after the audio data to be recognized is received.
3. The method of claim 2, wherein the judging whether the audio data is silence or short-time noise comprises: calculating the audio energy of each frame of audio data using the following formula:

$$E = \frac{1}{N} \sum_{n=1}^{N} x(n)^{2}$$

wherein N is the number of sampling points of a single frame of audio, and x(n) is the value of the n-th sampling point, in the range (-1, 1);
and calculating the ratio of the number of frames whose audio energy is greater than a first set value to the total number of frames in the audio data, denoting this the first ratio; if the first ratio is 0, the audio data is judged to be silence, and if the first ratio is smaller than a second set value, the audio data is judged to be noise.
4. The method according to claim 3, wherein the first set value is 0.005 and the second set value is 0.1.
5. The method of claim 2, wherein evaluating the voice quality of the audio data and judging whether it is too low comprises:
extracting mel log filterbank features of each frame of the audio data based on a mel scale;
extracting high-dimensional features of the audio data based on mel log filterbank features of each frame of the audio data using a first neural network;
scoring the voice quality of each time dimension of the high-dimensional feature by using a classifier;
calculating the proportion of high scores among all scores, denoted the second ratio, and calculating the average of all scores; if the second ratio is smaller than a third set value or the average score is smaller than a fourth set value, judging that the voice quality of the audio data is too low, wherein a high score is a score exceeding a fifth set value.
6. The method of claim 5, wherein the third set value is 0.5, the fourth set value is a score of 0.6, and the fifth set value is a score of 0.7.
7. The method of claim 2, wherein said determining whether the audio data is music comprises:
extracting mel log filterbank features of each frame of the audio data based on a mel scale;
acquiring a score of the audio data as a music probability based on mel log filterbank features of each frame of the audio data by using a second neural network;
and if the music probability score of the audio data exceeds a sixth set value, judging that the audio data is music.
8. The method of claim 7, wherein the sixth set value is 0.5.
9. The method of claim 2, wherein determining whether the audio data is in a target language comprises:
extracting mel log filterbank features of each frame of the audio data based on a mel scale;
extracting high-dimensional features of the audio data based on mel log filterbank features of each frame of the audio data using a first neural network;
judging the language corresponding to the audio data based on the high-dimensional characteristics of the audio data by using a second neural network;
judging whether the language is a target language.
10. A system for filtering invalid speech recognition data, the system comprising:
the data receiving module is used for receiving the audio data to be recognized;
the silence-or-noise judging module is used for judging whether the audio data is silence or noise, and if so, returning a determination that the audio data is invalid speech;
the voice quality evaluation module is used for evaluating the voice quality of the audio data and judging whether it is too low, and if so, outputting a determination that the audio data is invalid speech;
the music judging module is used for judging whether the audio data is music, and if so, outputting a determination that the audio data is invalid speech;
the target language judging module is used for judging whether the audio data is in a target language, and if not, outputting a determination that the audio data is invalid speech;
and an audio data recognition module, used for recognizing the audio data and outputting a text recognition result when the silence-or-noise judging module, the voice quality evaluation module, the music judging module, and the target language judging module judge that the audio data is not silence, noise, or music, that its voice quality is not too low, and that it is in the target language.
CN202311449861.3A 2023-11-02 2023-11-02 Method and system for filtering invalid voice recognition data Pending CN117457016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311449861.3A CN117457016A (en) 2023-11-02 2023-11-02 Method and system for filtering invalid voice recognition data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311449861.3A CN117457016A (en) 2023-11-02 2023-11-02 Method and system for filtering invalid voice recognition data

Publications (1)

Publication Number Publication Date
CN117457016A (en) 2024-01-26

Family

ID=89585089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311449861.3A Pending CN117457016A (en) 2023-11-02 2023-11-02 Method and system for filtering invalid voice recognition data

Country Status (1)

Country Link
CN (1) CN117457016A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6427133B1 (en) * 1996-08-02 2002-07-30 Ascom Infrasys Ag Process and device for evaluating the quality of a transmitted voice signal
CN103578466A (en) * 2013-11-11 2014-02-12 清华大学 Voice and non-voice detection method based on fractional order Fourier transformation
CN111161759A (en) * 2019-12-09 2020-05-15 科大讯飞股份有限公司 Audio quality evaluation method and device, electronic equipment and computer storage medium
CN112885330A (en) * 2021-01-26 2021-06-01 北京云上曲率科技有限公司 Language identification method and system based on low-resource audio
CN112560449A (en) * 2021-02-23 2021-03-26 北京远鉴信息技术有限公司 Text quality detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20090177466A1 (en) Detection of speech spectral peaks and speech recognition method and system
CN105529028A (en) Voice analytical method and apparatus
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
CN106098079B (en) Method and device for extracting audio signal
CN111710332B (en) Voice processing method, device, electronic equipment and storage medium
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
Poorna et al. Emotion recognition using multi-parameter speech feature classification
KR101808810B1 (en) Method and apparatus for detecting speech/non-speech section
CN108735230B (en) Background music identification method, device and equipment based on mixed audio
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
KR101122590B1 (en) Apparatus and method for speech recognition by dividing speech data
Bonet et al. Speech enhancement for wake-up-word detection in voice assistants
US6823304B2 (en) Speech recognition apparatus and method performing speech recognition with feature parameter preceding lead voiced sound as feature parameter of lead consonant
CN111866289A (en) Outbound number state detection method and device and intelligent outbound method and system
CN117457016A (en) Method and system for filtering invalid voice recognition data
Song et al. Feature extraction and classification for audio information in news video
CN112397073B (en) Audio data processing method and device
CN112397059B (en) Voice fluency detection method and device
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
Sudhakar et al. Automatic speech segmentation to improve speech synthesis performance
CN112466287A (en) Voice segmentation method and device and computer readable storage medium
CN113129926A (en) Voice emotion recognition model training method, voice emotion recognition method and device
CN113257284B (en) Voice activity detection model training method, voice activity detection method and related device
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination