CN114495985A - Audio quality detection method, intelligent terminal and storage medium - Google Patents

Audio quality detection method, intelligent terminal and storage medium

Info

Publication number
CN114495985A
CN114495985A (application number CN202011254415.3A)
Authority
CN
China
Prior art keywords
audio
audio data
preset
frame
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011254415.3A
Other languages
Chinese (zh)
Inventor
唐延欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Technology Group Co Ltd
Original Assignee
TCL Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Technology Group Co Ltd filed Critical TCL Technology Group Co Ltd
Priority to CN202011254415.3A
Publication of CN114495985A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques for comparison or discrimination, for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit

Abstract

The invention provides an audio quality detection method, an intelligent terminal and a storage medium. The method comprises the following steps: acquiring audio data to be detected; calculating an internal similarity value of the audio data; and determining the quality of the audio data according to the internal similarity value. The invention effectively improves the accuracy of audio data quality detection and ensures that the audio data subsequently used for voiceprint recognition or speech recognition is usable.

Description

Audio quality detection method, intelligent terminal and storage medium
Technical Field
The present invention relates to the field of signal processing technologies, and in particular, to an audio quality detection method, an intelligent terminal, and a storage medium.
Background
At present, identification and contactless operation based on human biometrics are becoming increasingly popular. For example, identity authentication by face, fingerprint or voiceprint improves the security of user information, and speech recognition lets a user direct a device to perform specific operations. Voice thus serves both as a credential for identity recognition and as a channel for issuing commands, which gives audio a very wide application space. Taking voiceprint systems as an example: a voiceprint recognition system identifies the speaker by recording the speaker's voice in real time and comparing it with the registrant's original voice recording. Compared with identity authentication systems such as face or fingerprint recognition, a voiceprint recognition system needs no camera, fingerprint collector or other complex biometric acquisition devices; simple recording equipment suffices to collect the information needed to identify the speaker. A mature voiceprint system essentially consists of an audio filtering mechanism, a core voiceprint recognition model, and an audio classification judgment mechanism. The audio filtering mechanism ensures that the input audio meets the requirements of the voiceprint model, for example that the effective audio duration is not too short and the background noise is not too strong; the core model extracts the voiceprint features of the audio; and the audio classification judgment mechanism sets the relevant judgment thresholds according to the voiceprint features and compares and classifies the audio.
Qualified audio quality is a prerequisite for audio applications; in a voiceprint system, for example, whether the audio filtering mechanism is effective is crucial for the subsequent voiceprint recognition. However, real application scenarios of voiceprint recognition systems are complex: the background noise may be excessive, so that noise energy masks the effective audio energy; the recording may be too short to extract sufficiently effective and stable voiceprint features; the recording may mix the voices of several people, so that it is unclear which voice is the target audio; or the recorded audio may differ too much from the pre-stored user audio, meaning the current speaker is not the target user of the currently logged-in account. During voice registration, for instance, it may happen that after user A registers the first sentence of audio, A stays silent during the second sentence while another person B speaks, so that the recorded audio very likely fails to meet the requirements of the voiceprint system. The audio filtering mechanism in the prior art filters audio rather coarsely, for example only by the length of the audio data, so anomalies are often encountered later when the voiceprint features are extracted.
Disclosure of Invention
The invention provides an audio quality detection method, an intelligent terminal and a storage medium, and aims to solve the problem that the screening efficiency of audio data for identification is low in the prior art.
In order to achieve the above object, the present invention provides an audio quality detection method, including the steps of:
acquiring audio data to be detected;
and calculating an internal similarity value of the audio data, and determining the quality of the audio data according to the internal similarity value.
Optionally, the method for detecting audio quality, before the calculating an internal similarity value of the audio data and determining the quality of the audio data according to the internal similarity value, further includes:
and calculating the tone quality parameters of the audio data, and judging whether the audio data accords with preset tone quality qualified conditions or not according to the tone quality parameters.
Optionally, the audio quality detection method, where the sound quality parameters include a signal-to-noise ratio and an effective audio length, the calculating the sound quality parameters of the audio data, and determining whether the audio data meets a preset sound quality qualified condition according to the sound quality parameters specifically includes:
respectively extracting noise frames and effective frames in the audio data according to a preset list generation rule, and generating a noise frame list and an effective frame list corresponding to the audio data;
respectively calculating the effective audio length and the signal-to-noise ratio of the audio data according to the effective frame list and the noise frame list;
and judging whether the audio data meets the preset sound quality qualified condition or not according to the effective audio length, a preset effective audio length threshold value, the signal-to-noise ratio and a preset signal-to-noise ratio threshold value.
Optionally, the audio quality detection method, wherein the extracting, according to a preset list generation rule, a noise frame and an effective frame in the audio data, and generating a noise frame list and an effective frame list corresponding to the audio data specifically include:
carrying out noise reduction processing on the audio data to generate noise reduction audio data;
calculating the energy difference of the same audio frame in the audio data and the noise reduction audio data;
determining a noise frame and an effective frame in the audio data according to the energy difference;
respectively writing the audio parameters of the noise frame and the audio parameters of the effective frame into a preset blank list to generate a noise frame list and an effective frame list.
Optionally, the audio quality detection method, wherein the determining, according to the energy difference, a noise frame and a valid frame in the audio data specifically includes:
judging whether the energy difference is larger than an energy difference threshold value or not;
if so, taking the audio frame corresponding to the energy difference larger than the energy difference threshold value as a noise frame;
and if not, taking the audio frame corresponding to the energy difference smaller than or equal to the energy difference threshold value as an effective frame.
Optionally, the method for detecting audio quality, wherein after determining a noise frame and a valid frame in the audio data according to the energy difference, the method further includes:
and sequentially writing the zone bits corresponding to the noise frames and the effective frames into a preset blank list according to the sampling sequence of the audio data to generate an audio zone bit list.
Optionally, the audio quality detection method, wherein the calculating an internal similarity value of the audio data and determining the quality of the audio data according to the internal similarity value specifically includes:
calculating an internal similarity value of the audio data;
judging whether the audio data meet preset homologous qualification conditions or not according to the internal similarity value;
and if the audio data meets the homologous qualified conditions, determining that the audio data is qualified audio.
Optionally, the audio quality detection method, wherein the calculating an internal similarity value of the audio data specifically includes:
determining an audio splitting position of the audio data according to a preset splitting rule;
splitting the audio data according to the audio splitting position to generate a plurality of audio clips;
an internal similarity value between voiceprint features of the audio segment is calculated.
Optionally, the audio quality detection method, wherein the determining an audio splitting position of the audio data according to a preset splitting rule specifically includes:
determining split frames in the effective frame list according to a preset split number and the effective frame list;
and determining a corresponding audio splitting position in the audio data according to the splitting frame and the audio zone bit list.
Optionally, the audio quality detection method, wherein the calculating an internal similarity value between the voiceprint features of the audio segment specifically includes:
extracting the segment preliminary features of the audio segments according to a preset preliminary feature extraction rule;
controlling a preset voiceprint model to perform voiceprint feature extraction on the segment preliminary features to generate segment voiceprint features;
and calculating the similarity value between the voiceprint features of the segments as the internal similarity value between the audio segments.
Optionally, the audio quality detection method, wherein the determining, according to the internal similarity value, whether the audio data meets a preset homologous qualification condition specifically includes:
and judging whether the audio data meet a preset homologous qualification condition or not according to the internal similarity value and a preset internal similarity threshold value.
Optionally, the audio quality detection method, wherein, after determining that the audio data is a quality-qualified audio if the audio data meets the homology-qualified condition, the method further includes:
judging whether the audio with qualified quality is from a target user corresponding to the current account or not according to the target voiceprint characteristics corresponding to the current account;
and if the qualified audio comes from the target user corresponding to the current account, determining that the qualified audio is the sound source qualified audio.
Optionally, the audio quality detection method, wherein the determining, according to a target voiceprint feature corresponding to a current account, whether the quality-qualified audio is from a target user corresponding to the current account specifically includes:
extracting audio preliminary features of the audio with qualified quality according to a preset audio feature extraction rule;
controlling a preset voiceprint model to extract the voiceprint characteristics of the audio preliminary characteristics to generate audio voiceprint characteristics;
judging whether a target voiceprint feature exists in an audio feature group corresponding to the current account;
if the target voiceprint feature exists in the audio feature group corresponding to the current account, judging whether the audio with qualified quality is from a target user corresponding to the current account according to the target voiceprint feature and the audio voiceprint feature;
and if the target voiceprint characteristics do not exist in the audio characteristic group corresponding to the current account, storing the audio voiceprint characteristics into a preset blank array, and generating the audio characteristic group corresponding to the current account.
Optionally, the audio quality detection method, wherein, if the quality-qualified audio is from the target user corresponding to the current account, after determining that the quality-qualified audio is a sound source-qualified audio, the method further includes:
and calculating the average value of the audio voiceprint characteristics and the target voiceprint characteristics, and storing the average value as the updated target voiceprint characteristics.
In addition, to achieve the above object, the present invention further provides an intelligent terminal, wherein the intelligent terminal includes: a memory, a processor and an audio quality detection program stored on the memory and executable on the processor, the audio quality detection program when executed by the processor implementing the steps of the audio quality detection method as described above.
In addition, to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores an audio quality detection program, and the audio quality detection program implements the steps of the audio quality detection method as described above when executed by a processor.
The invention calculates the internal similarity value of the audio data, judges from the internal similarity value whether the audio data comes from the same speaker, and, if it does, determines that the quality of the audio data is qualified. The invention can therefore detect the quality of audio data according to the source of the audio data, effectively screening out unqualified audio data, which benefits the subsequent voiceprint recognition and speech recognition work.
In addition, before the internal similarity value is judged, whether the sound quality is qualified is judged from audio parameters of the audio data such as the signal-to-noise ratio, and the invention provides a fast and effective way to compute the sound quality parameters, namely the signal-to-noise ratio and the effective audio length. When the internal similarity value is calculated, the audio is split according to the number of effective frames in the audio data, so that the similarity computation after splitting is based on effective voiceprint features, which improves the accuracy of judging whether the audio data is homologous. After the audio data passes the quality detection, it is applied to the field of voiceprint recognition: its voiceprint is compared with the target voiceprint feature corresponding to the current account to judge whether the audio data comes from the target user of that account, thereby safeguarding user information.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the audio quality detection method of the present invention;
FIG. 2 is a schematic diagram illustrating the flow of the overall method according to the preferred embodiment of the audio quality detection method of the present invention;
fig. 3 is a schematic operating environment diagram of an intelligent terminal according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1 and 2, the audio quality detection method according to the preferred embodiment of the present invention includes the following steps:
and step S100, acquiring audio data to be detected.
Specifically, the execution subject of the embodiment is an audio quality detection program installed in a voiceprint recognition system on the intelligent terminal.
The user logs in to an account on the intelligent terminal; call this user A. The intelligent terminal may display an interface reminding the user to verify by voiceprint whether he or she is user A, then turn on its microphone and collect the user's voice to obtain audio data. The microphone sends the audio data to the audio quality detection program, which receives it, completing the acquisition of the audio data to be detected.
Step S200, calculating an internal similarity value of the audio data, and determining the quality of the audio data according to the internal similarity value.
Specifically, while audio data is being collected, other people's voices may be present; for example, someone near the user may suddenly interject while the microphone is recording, so that the collected audio originates from two different people, and performing voiceprint recognition or speech recognition directly on it may produce recognition errors. It is therefore necessary to determine whether the voices of two or more people exist in the audio data. In this embodiment the internal similarity value of the audio data is calculated: the audio data is divided into multiple segments, and the similarity values between the voiceprint features of the segments are computed. Once the internal similarity value has been calculated, the quality of the audio data can be determined from its magnitude: a high internal similarity value indicates that the audio data is more likely to originate from a single person, and thus that its quality is better.
Further, in the process of determining the quality of the audio data, a homologous qualification condition may be preset in this embodiment, which is used to determine the quality of the audio data according to the internal similarity value of the audio data, and the specific process is as follows:
step A1, according to the internal similarity value, judging whether the audio data meets the preset homologous qualification condition.
Specifically, a homologous qualification condition is preset for judging whether the audio data meets the condition according to the internal similarity value. The homology qualified condition may be a threshold value of a similarity value in the audio data, and if the similarity value is greater than the threshold value, the audio data is qualified. In addition, if the audio data is divided into multiple sections to obtain multiple internal similarity values, the judgment can be performed according to the number of the internal similarity values larger than the threshold value.
Further, to ensure audio quality, besides ensuring that the audio data originates from the same person, the sound quality of the audio data should also be qualified, so before step A1 the method further includes:
and B100, calculating the tone quality parameters of the audio data, and judging whether the audio data meets the preset tone quality qualified conditions or not according to the tone quality parameters.
Specifically, the sound quality parameter is a parameter indicating whether the sound quality of the audio data is good or bad, such as the signal-to-noise ratio of the audio data. The signal-to-noise ratio is the ratio of the effective signal to the noise signal; the smaller it is, the more noise the audio data contains and the worse its quality. A signal-to-noise ratio threshold may be preset, and whether the signal-to-noise ratio of the audio data exceeds this threshold used as the sound quality qualification condition: if the signal-to-noise ratio is greater than the threshold, the audio data meets the sound quality qualification condition.
Further, the sound quality parameters include a signal-to-noise ratio and an effective audio length, and step B100 includes:
and step B110, respectively extracting noise frames and effective frames in the audio data according to a preset list generation rule, and generating a noise frame list and an effective frame list corresponding to the audio data.
Specifically, the sound quality parameters in this embodiment include the signal-to-noise ratio and the effective audio length. Noise frames exist in the audio data and can seriously interfere with the subsequent voiceprint feature extraction; the effective audio length refers to the number of effective frames in the audio data. A list generation rule is preset, namely a rule for distinguishing noise frames from effective frames in the audio data; the noise frames and effective frames are extracted separately and written in order into blank lists during extraction, generating a noise frame list and an effective frame list.
Further, step B110 includes:
and step B111, performing noise reduction processing on the audio data to generate noise reduction audio data.
Specifically, the signal in a collected audio file is a time-domain signal: plotted as a graph, the abscissa is time and the ordinate is amplitude. Signal processing mainly operates on frequency-domain signals, where the abscissa is frequency and the ordinate is the magnitude of the frequency component; the frequency-domain form describes the frequency structure of the signal and the relationship between frequency and magnitude, which makes processing more convenient. The audio data is therefore first converted from time-domain form to frequency-domain form by a Fourier transform, and then noise reduction is applied to generate noise-reduced audio data; the noise reduction may use wiener filtering, spectral subtraction, adaptive filters and the like.
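As a minimal sketch of one of the noise-reduction options named above (spectral subtraction): the noise-estimation strategy of treating the first few frames as noise-only, the window choice and the function names are illustrative assumptions, not taken from the patent; the frame parameters follow the embodiment.

```python
import numpy as np

def spectral_subtraction(signal, frame_len=1280, frame_shift=200, noise_frames=5):
    """Subtract an estimated noise magnitude spectrum from each frame."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Assume the first few frames are noise-only and use their mean magnitude
    # as the noise estimate (an illustrative choice, not from the patent).
    noise_mag = mag[:noise_frames].mean(axis=0)
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    denoised = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add the denoised frames back into one time-domain signal.
    out = np.zeros(len(signal))
    for i in range(n_frames):
        out[i * frame_shift:i * frame_shift + frame_len] += denoised[i] * window
    return out
```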
And step B112, calculating the energy difference of the same audio frame in the audio data and the noise reduction audio data.
Specifically, noise reduction affects the amplitude of each sampled signal point of the audio data to some extent, so the energy of each signal point may be reduced. Since the audio data contains many sampling points, to improve computational efficiency the audio signal is divided into frames according to a preset frame length and frame shift. The energy difference of each audio frame between the audio data and the noise-reduced audio data is then calculated, and how strongly a frame was affected by the noise reduction can be judged from the size of this energy difference. In this embodiment the energy difference of the same audio frame in the audio data and the noise-reduced audio data is calculated with the following formula:
$$dt(i) = \frac{\sum_{k=iL_1}^{iL_1+L_2-1} y(k)^2}{\max\left(\sum_{k=iL_1}^{iL_1+L_2-1} st(k)^2,\ 0.00001\right)}$$

where dt(i) is the energy difference and i is the index of the current audio frame; L_1 is the frame shift, i.e. how many data points to advance before extracting the next frame; L_2 is the frame length, i.e. the number of signal points contained in one audio frame; st(k) is the time-domain signal of the audio data; and y(k) is the time-domain signal of the noise-reduced audio data. For the same audio frame, the formula takes as the energy difference the ratio of the sum of squared amplitudes of all sampling points in the noise-reduced audio data to the maximum of the sum of squared amplitudes of all sampling points in the audio data and the constant 0.00001. In this embodiment the sampling rate of the audio data is 16000, the frame length is 1280, and the frame shift is 200.
Further, the above formula is not the only way to compute the energy difference; for example, the difference between the energies of the same audio frame in the audio data and in the noise-reduced audio data may also be taken as the energy difference.
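A direct implementation of the dt(i) formula above might look as follows; the function and variable names are our own, and the frame parameters match the embodiment.

```python
import numpy as np

def frame_energy_difference(audio, denoised, frame_len=1280, frame_shift=200):
    """Energy difference dt(i) per audio frame: denoised frame energy divided
    by the original frame energy, floored at 0.00001 to avoid division by zero."""
    n_frames = 1 + (len(audio) - frame_len) // frame_shift
    dt = np.empty(n_frames)
    for i in range(n_frames):
        s = slice(i * frame_shift, i * frame_shift + frame_len)
        dt[i] = np.sum(denoised[s] ** 2) / max(np.sum(audio[s] ** 2), 0.00001)
    return dt
```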
And step B113, determining a noise frame and an effective frame in the audio data according to the energy difference.
Since the noise reduction process is mainly to attenuate a noise signal in the audio signal, it is possible to determine whether a certain audio frame is a noise frame or a valid frame, based on the energy difference.
Further, step B113 includes:
judging whether the energy difference is larger than an energy difference threshold value or not;
if so, taking the audio frame corresponding to the energy difference larger than the energy difference threshold value as a noise frame;
and if not, taking the audio frame corresponding to the energy difference smaller than or equal to the energy difference threshold value as an effective frame.
Specifically, an energy difference threshold is preset, and if the energy difference is greater than the energy difference threshold, it is indicated that the audio frame is greatly affected by the noise reduction processing, so that the audio frame is taken as a noise frame; if the energy difference is less than or equal to the energy difference threshold, it is determined that the audio frame is less affected by the noise reduction processing, and therefore the audio frame is taken as an effective frame.
Further, after step B113, the method further includes: and sequentially writing the flag bits corresponding to the noise frames and the effective frames into a preset blank list according to the sampling sequence of the audio data to generate an audio flag bit list.
Specifically, while the audio frames are being divided into the noise frame list and the effective frame list, an audio flag bit list is generated at the same time. The audio frames of the audio data are traversed in sampling order: if a frame is a noise frame, a flag bit 0 is written into the audio flag bit list, and if it is an effective frame, a flag bit 1 is written. These flag bit values for noise frames and effective frames are chosen to simplify the later calculation of the split frame position; other values, such as 1 and 2, would not affect the implementation of the scheme.
Step B114, writing the audio parameters of the noise frame and the audio parameters of the valid frame into a preset blank list, respectively, to generate a noise frame list and a valid frame list.
Specifically, a blank list is preset; it may exist in the form of an array or a file. Whenever an audio frame is determined to be a noise frame or an effective frame, its audio parameters are written into the corresponding blank list, until all audio frames of the audio data have been traversed, generating a noise frame list containing the noise frames and an effective frame list containing the effective frames. The two lists are mainly used later for calculating the signal-to-noise ratio of the audio data, so audio parameters useful for that calculation, such as amplitude or energy, can be chosen when writing into the lists. In this embodiment the signal-to-noise ratio is calculated from the amplitude of each audio frame, so the formula for the audio parameter written into the blank list is:
$$V(i) = \frac{1}{L_2} \sum_{k=iL_1}^{iL_1+L_2-1} \left| st(k) \right|$$

where V(i) is the amplitude average of audio frame i.
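Steps B113 and B114 together might be sketched as below; storing the amplitude average V(i) as the per-frame audio parameter follows the formula above, and the 0/1 flag values follow the flag-bit description. The names and the list representation are illustrative.

```python
import numpy as np

def build_frame_lists(audio, dt, energy_diff_threshold,
                      frame_len=1280, frame_shift=200):
    """Classify each frame as noise (dt above threshold) or valid, and build
    the noise frame list, effective frame list and audio flag bit list."""
    noise_list, valid_list, flag_list = [], [], []
    for i, d in enumerate(dt):
        frame = audio[i * frame_shift:i * frame_shift + frame_len]
        v = float(np.mean(np.abs(frame)))   # amplitude average V(i) of the frame
        if d > energy_diff_threshold:       # strongly affected by noise reduction
            noise_list.append(v)
            flag_list.append(0)             # flag bit 0 marks a noise frame
        else:
            valid_list.append(v)
            flag_list.append(1)             # flag bit 1 marks an effective frame
    return noise_list, valid_list, flag_list
```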
And step B120, respectively calculating the effective audio length and the signal-to-noise ratio of the audio data according to the effective frame list and the noise frame list.
Specifically, the lengths of the effective frame list and the noise frame list represent the numbers of effective frames and noise frames in the audio data, from which the number of effective sampling points, i.e. the effective audio length, can be determined. The calculation formula of the effective audio length adopted in this embodiment is:
$$L_V = \frac{l_v}{sr} \cdot \frac{l_3}{l_3 + l_4}$$

where L_V is the effective audio length; l_v is the length (number of sampling points) of the audio data; sr is the sampling rate of the audio data; l_3 is the length of the effective frame list; and l_4 is the length of the noise frame list. The formula computes the effective audio length as the audio duration scaled by the ratio of the length of the effective frame list to the combined length of both frame lists.
In this embodiment, the SNR calculation formula is
$$SNR = \frac{\frac{1}{n_1} \sum_{i=1}^{n_1} V_1(i)}{\frac{1}{n_2} \sum_{i=1}^{n_2} V_2(i)}$$

where SNR is the signal-to-noise ratio; V_1(i) are the amplitude averages in the effective frame list; V_2(i) are the amplitude averages in the noise frame list; n_1 is the number of audio parameters in the effective frame list; and n_2 is the number of audio parameters in the noise frame list. If different audio parameters, for example the energy or power of an audio frame, are used to calculate the signal-to-noise ratio, the formulas for the audio parameters and the signal-to-noise ratio can be adjusted accordingly.
And B130, judging whether the audio data meets the preset sound quality qualified condition or not according to the effective audio length, a preset effective audio length threshold value, the signal-to-noise ratio and a preset signal-to-noise ratio threshold value.
Specifically, an effective audio length threshold and a signal-to-noise ratio threshold are preset. If the signal-to-noise ratio is greater than the signal-to-noise ratio threshold and the effective audio length is greater than the effective audio length threshold, the audio data is determined to meet the sound quality qualification condition. If the signal-to-noise ratio is less than or equal to its threshold, the audio data contains too many noise frames and its sound quality is unqualified; if the effective audio length is less than or equal to its threshold, the audio data contains too few effective frames and its sound quality is likewise unqualified. After the audio data is judged unqualified, the intelligent terminal can be controlled to display a prompt indicating that the audio is unqualified, remind the user to record again, and re-acquire the audio data.
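Combining steps B120 and B130, a sketch under the formulas above; the two threshold values are illustrative placeholders, not values from the patent.

```python
def sound_quality_ok(valid_list, noise_list, num_samples, sample_rate=16000,
                     min_effective_seconds=1.0, min_snr=2.0):
    """Effective audio length and SNR from the two lists, then the
    sound quality qualification test of step B130."""
    l3, l4 = len(valid_list), len(noise_list)
    if l3 == 0 or l4 == 0:
        return False                      # degenerate audio: cannot evaluate
    # Total duration scaled by the share of effective frames among all frames.
    effective_len = (num_samples / sample_rate) * l3 / (l3 + l4)
    # Ratio of mean effective-frame amplitude to mean noise-frame amplitude.
    snr = (sum(valid_list) / l3) / (sum(noise_list) / l4)
    return effective_len > min_effective_seconds and snr > min_snr
```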
Further, step a1 includes:
step A110, determining an audio splitting position of the audio data according to a preset splitting rule.
Specifically, if the audio data meets the sound quality qualification, the audio data needs to be split, so that the audio splitting position of the audio data is determined according to a preset splitting rule. The audio splitting position may be determined in many ways, for example, the number of a split segment is set in a splitting rule, and then the audio splitting position of the audio data is randomly determined according to the number. For example, the split audio length after splitting is set in the split rule, and then the audio splitting positions are sequentially determined from the starting point of the audio data according to the split audio length.
Further, step a110 includes:
step A111, determining split frames in the effective frame list according to a preset split number and the effective frame list.
Further, a splitting number is preset in the splitting rule, for example 2. If the audio data meets the sound quality qualification condition, the middle frame of the effective frame list, determined from the number of audio parameters in the list, is taken as the split frame. If the splitting number is 3, the audio frames at the 1/3 and 2/3 positions of the effective frame list are taken as split frames.
Step a112, determining a corresponding audio splitting position in the audio data according to the splitting frame and the audio flag bit list.
Further, after the split frame is determined, the flag bits of the audio flag bit list are accumulated in order according to the position of the split frame in the effective frame list. For example, if the split frame is the 3rd entry of the effective frame list, the flag bits are accumulated until the running total reaches 3; the flag bit reached at that moment, say the 5th one, shows that the split frame is the 5th audio frame of the audio data, so the audio splitting position is the 5th audio frame of the audio data.
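The flag-bit accumulation of steps A111 and A112 might be implemented like this; the names and the evenly spaced split-frame choice follow the examples above and are otherwise our own.

```python
def audio_split_positions(flag_list, split_number=2):
    """Walk the audio flag bit list, counting effective frames, and return
    the frame indices in the full audio where the split frames fall."""
    n_valid = sum(flag_list)
    # Ordinals of the split frames inside the effective frame list, e.g. the
    # middle frame for split_number=2, the 1/3 and 2/3 frames for split_number=3.
    targets = [n_valid * j // split_number for j in range(1, split_number)]
    positions, count = [], 0
    for idx, flag in enumerate(flag_list):
        count += flag
        if flag == 1 and targets and count == targets[0]:
            positions.append(idx)         # audio splitting position (frame index)
            targets.pop(0)
    return positions
```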
And A120, splitting the audio data according to the audio splitting position to generate a plurality of audio clips.
Specifically, taking the audio splitting position as the 5th audio frame of the audio data as an example, the 5th audio frame serves as the dividing point: the 1st to 4th audio frames form one audio clip, and the 5th and subsequent audio frames form another. Depending on the splitting number, the audio data is thus split into a plurality of audio clips.
Because the audio data has a certain length, the splitting number can be adjusted to it: for example, when the audio data lasts 3 seconds the splitting number is 2, to prevent the split clips from being too short, which would make the subsequently extracted voiceprint features unstable and hard to distinguish. When the audio data is very long, the splitting number should be increased, to avoid a single clip containing the voiceprint features of several people, which would make the subsequently extracted voiceprint features unusable.
After the audio data is split, the clips can be adjusted to a preset standard clip length; in this embodiment the standard length corresponds to 3 seconds. If a clip corresponds to, say, 2 seconds, its front and rear ends are filled with the signal value 0 until it corresponds to 3 seconds; if a clip corresponds to more than 3 seconds, say 4 seconds, it is cut until it corresponds to 3 seconds, as sketched below. This adjustment unifies the subsequent voiceprint feature extraction over the clips and improves extraction efficiency.
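The length normalization just described, padding short clips with zeros at both ends and cutting long ones to the 3-second standard; the function name is hypothetical and the sample rate follows the embodiment.

```python
import numpy as np

def normalize_clip(clip, sample_rate=16000, standard_seconds=3.0):
    """Pad with zeros front and back, or cut, until the clip has the
    standard length of 3 seconds."""
    target = int(sample_rate * standard_seconds)
    if len(clip) < target:
        pad = target - len(clip)
        return np.pad(clip, (pad // 2, pad - pad // 2))  # fill with signal value 0
    return clip[:target]                                 # cut an over-long clip
```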
Step A130, calculating an internal similarity value between the voiceprint features of the audio segments.
Specifically, after the audio segments are generated, the internal similarity value between the voiceprint features of the audio segments is calculated. The internal similarity value may be a similarity value between the voiceprint features of the audio segments.
Further, step a130 includes:
step A131, extracting the segment preliminary features of the audio segments according to a preset preliminary feature extraction rule.
Specifically, the segment preliminary features are preliminary voiceprint features of the audio segment, such as the conventional Linear Prediction Cepstral Coefficients (LPCCs) or Mel Frequency Cepstral Coefficients (MFCCs); different preliminary feature extraction rules correspond to different feature types. Taking MFCC as an example to briefly describe the implementation: MFCC features are feature parameters extracted on the basis of human auditory characteristics, which is why they are commonly used when extracting features from audio signals; the preliminary feature extraction rule in this embodiment is an MFCC feature extraction rule. The segment preliminary features are extracted by applying pre-emphasis, framing, windowing, a Fast Fourier Transform (FFT), a Mel filter bank and a discrete cosine transform to the audio segment.
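As a sketch of step A131, the librosa library's MFCC routine covers the framing, windowing, FFT, Mel filter bank and DCT stages named above; pre-emphasis is applied manually here. The coefficient count and function name are illustrative choices, not from the patent.

```python
import numpy as np
import librosa

def segment_preliminary_features(clip, sample_rate=16000, n_mfcc=20):
    """MFCC preliminary features of one audio clip (frames x coefficients)."""
    emphasized = np.append(clip[0], clip[1:] - 0.97 * clip[:-1])  # pre-emphasis
    return librosa.feature.mfcc(y=emphasized, sr=sample_rate, n_mfcc=n_mfcc).T
```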
And A132, controlling a preset voiceprint model to extract the voiceprint characteristics of the preliminary features of the segments to generate the voiceprint characteristics of the segments.
Specifically, a voiceprint model is preset. The voiceprint model adopted in this embodiment is created on the basis of deep learning and is trained in advance on a large number of preliminary features such as LPCCs and MFCCs. The segment preliminary features are input into the voiceprint model, which is controlled to further extract from them the segment voiceprint features.
Step a133, calculating a similarity value between the voiceprint features of the segments as an internal similarity value between the audio segments.
Specifically, a similarity algorithm is used to compute the similarity values between the segment voiceprint features, which serve as the internal similarity values between the audio segments. Candidate algorithms include the Euclidean distance, the cosine distance and the Jaccard similarity; this embodiment uses the cosine distance for its description. A coordinate system is created and the segment voiceprint features are converted into vectors in it; the cosine of the angle between each pair of vectors is then computed. Since the cosine lies between -1 and 1, the closer it is to 1 the closer the two vectors are, i.e. the more similar the features; the cosine values obtained between the vectors are taken as the internal similarity values between the segment voiceprint features.
Step A140, determining whether the audio data meets a preset homologous qualification condition according to the internal similarity value and a preset internal similarity threshold.
Specifically, an internal similarity threshold, such as 0.7, is preset. If the internal similarity value between two vectors is greater than 0.7, the two vectors are very close and the corresponding segment voiceprint features are close, i.e. the audio segments originate from the same speaker. The internal similarity values between all pairs of segment voiceprint features are compared with the internal similarity threshold; if every internal similarity value is greater than the preset threshold, the audio data is determined to meet the preset homologous qualification condition.
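Steps A133 and A140 combined might look like the following: pairwise cosine similarity between the segment voiceprint vectors, then the all-above-threshold test with the 0.7 threshold of this embodiment. Function names are our own.

```python
import numpy as np
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two voiceprint vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_homologous(segment_voiceprints, threshold=0.7):
    """Qualified only if every pairwise internal similarity value
    exceeds the internal similarity threshold."""
    return all(cosine_similarity(a, b) > threshold
               for a, b in combinations(segment_voiceprints, 2))
```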
Step A2, if the audio data meets the homologous qualification condition, determining the audio data to be a qualified audio.
Specifically, if the audio data meets the homologous qualification condition, no other person speaks in it; and if sound quality qualification was additionally established before the homology was judged, the audio data also has little noise, a suitable audio length and similar properties. The audio data is therefore judged to be of qualified quality and determined to be quality-qualified audio. The quality-qualified audio can then be used for voiceprint feature extraction, obtaining the user's biometric features and safeguarding the user's information, or it can be input into a speech recognition system so that the instruction the user wants to convey is recognized. Since several users are sometimes registered on one terminal or one piece of software, and different users have different preferences, habits, collections and so on, to ensure the security of user information the method further includes, after the quality-qualified audio is determined:
step S310, judging whether the qualified audio comes from the target user corresponding to the current account according to the target voiceprint characteristics corresponding to the current account.
Specifically, the target user corresponding to the currently logged-in account is user a. When the user A registers an account, audio acquisition is carried out on the user A, and voiceprint features in the user A are extracted and stored to be used as target voiceprint features. And after the audio with qualified quality is determined, extracting the voiceprint characteristics of the audio with qualified quality to generate audio voiceprint characteristics. The audio voiceprint feature is then compared to the target voiceprint feature to determine whether the audio data originated from user a.
Further, step S310 includes:
and step S311, extracting the audio preliminary characteristics of the audio with qualified quality according to a preset audio characteristic extraction rule.
And S312, controlling a preset voiceprint model to perform voiceprint feature extraction on the audio preliminary features to generate audio voiceprint features.
Specifically, similar to the process of extracting the voiceprint features of the audio segments described above, the preliminary voiceprint features of the quality-qualified audio are extracted with a preset audio feature extraction rule, and a preset voiceprint model then extracts from them the audio voiceprint features. Note that many kinds of preliminary features exist, so the segment preliminary features and the audio preliminary features may differ, for example MFCC for the former and LPCCs for the latter; the voiceprint models may likewise differ, for example in the magnitude of their training samples or in their model parameters, which is not described in detail here.
Step S313, determining whether a target voiceprint feature exists in the audio feature group corresponding to the current account.
Specifically, when each user performs account registration, an audio feature group is created to store a target voiceprint feature of the user. The audio feature sets may exist in the form of files, arrays, and the like. Calculating the number of target voiceprint features in the audio feature group, wherein if the number is zero, the target voiceprint features do not exist in the audio feature group, and when the user is in a registration state, the number is zero; and if the number is not zero, the target voiceprint characteristics exist in the audio characteristic group.
Step S314, if a target voiceprint feature exists in the audio feature group corresponding to the current account, determining whether the audio data is from a target user corresponding to the current account according to the target voiceprint feature and the audio voiceprint feature.
Specifically, in this embodiment, the target voiceprint feature is a voiceprint feature extracted by previously acquiring the audio data of the user a. And if the target voiceprint feature exists in the audio feature group corresponding to the current account, calculating the similarity value between the target voiceprint feature and the audio voiceprint feature. If the inter-similarity value is larger than a preset inter-similarity threshold value, determining that the audio data is from a target user corresponding to the current account; and if the inter-similarity value is smaller than or equal to the inter-similarity threshold, determining that the audio data does not originate from the target user, namely that the user currently using the equipment is not the user A.
It is worth mentioning that, in this embodiment, only a single target voiceprint feature is used for description, but in other embodiments, a plurality of target voiceprint features may be used to calculate a similarity value between the audio voiceprint feature and each target voiceprint feature, and then determine whether a number of the similarity values that is greater than the similarity threshold exceeds a preset qualified number, if so, determine that the audio data is from the target user.
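The multi-template variant just mentioned might be sketched as follows; both threshold parameters are illustrative placeholders, not values from the patent.

```python
import numpy as np

def is_target_user(audio_voiceprint, target_voiceprints,
                   similarity_threshold=0.7, qualified_number=1):
    """Count how many stored target voiceprints the audio voiceprint matches;
    accept the speaker when the count reaches the qualified number."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    hits = sum(cosine(audio_voiceprint, t) > similarity_threshold
               for t in target_voiceprints)
    return hits >= qualified_number
```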
Step S315, if the target voiceprint feature does not exist in the audio feature group corresponding to the current account, storing the audio voiceprint feature in a preset blank array, and generating the audio feature group corresponding to the current account.
Specifically, in this embodiment, if no target voiceprint feature exists in the audio feature group corresponding to the current account, the user is registering the account or target voiceprint features have not previously been collected; the audio voiceprint feature is therefore written into a preset blank array as the target voiceprint feature of the target user, generating the audio feature group corresponding to the current account.
Step S320, if the quality-qualified audio is from the target user corresponding to the current account, determining that the quality-qualified audio is a sound source-qualified audio.
Specifically, if the quality-qualified audio is from the target user corresponding to the current account, i.e. the user currently using the device is user A, the quality-qualified audio is determined to be audio uttered by the target user, i.e. sound-source-qualified audio. Speech recognition can subsequently be performed on the sound-source-qualified audio, converting the audio signal into an instruction signal, and the corresponding operation is executed according to its content, such as "power off". If the user is performing a login verification at this time, the user is determined to be user A and the initial interface of the device is displayed. If the audio data is not from the target user corresponding to the current account, the user using the device is not user A: if a login operation is in progress, a login failure is prompted; if the user is entering an instruction by voice, information such as a user authentication error is prompted.
Further, after step S320, the method further includes: and calculating the average value of the audio voiceprint characteristics and the target voiceprint characteristics, and storing the average value as the updated target voiceprint characteristics.
Specifically, a person's voice may change somewhat over time; for example, the older a person becomes, the lower the voice. The target voiceprint feature stored at registration may therefore stop being applicable over time. In this embodiment, whenever sound-source-qualified audio is determined, its audio voiceprint feature is stored as a target voiceprint feature in the audio feature group corresponding to the target user, so that the target voiceprint features in the group are continuously updated.
However, the audio voiceprint feature of a single sound-source-qualified audio deviates from the true voiceprint, so in this embodiment the target voiceprint feature is the average of the voiceprint features of a plurality of the target user's audio data. The target voiceprint feature is updated by taking the average of the audio voiceprint feature and the current target voiceprint feature as the new target voiceprint feature, which prevents the deviation of any single piece of audio data from making the sound-source qualification judgment inaccurate.
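The update rule of this paragraph, averaging the new audio voiceprint feature into the stored target voiceprint feature, is a one-liner; the function name is hypothetical.

```python
import numpy as np

def update_target_voiceprint(target_feature, audio_voiceprint):
    """Average the verified audio voiceprint into the stored target voiceprint
    so that no single recording's deviation dominates the template."""
    return (np.asarray(target_feature) + np.asarray(audio_voiceprint)) / 2.0
```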
Further, as shown in fig. 3, based on the above audio quality detection method, the present invention also provides an intelligent terminal, which includes a processor 10, a memory 20, and a display 30. Fig. 3 shows only some of the components of the smart terminal, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may in some embodiments be an internal storage unit of the intelligent terminal, such as its hard disk or memory. In other embodiments the memory 20 may be an external storage device of the intelligent terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the intelligent terminal. The memory 20 may also include both an internal storage unit and an external storage device of the intelligent terminal. It is used to store application software installed on the intelligent terminal and all kinds of data, such as the program code installed on the intelligent terminal, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores an audio quality detection program 40, which can be executed by the processor 10 to implement the audio quality detection method of the present application.
The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or other data Processing chip, and is used for executing the program codes stored in the memory 20 or Processing data, such as executing the audio quality detection method.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.
In one embodiment, when the processor 10 executes the audio quality detection program 40 in the memory 20, the following steps are implemented:
acquiring audio data to be detected;
and calculating an internal similarity value of the audio data, and determining the quality of the audio data according to the internal similarity value.
Before calculating the internal similarity value of the audio data and determining the quality of the audio data according to the internal similarity value, the method further comprises:
and calculating the tone quality parameters of the audio data, and judging whether the audio data accords with preset tone quality qualified conditions or not according to the tone quality parameters.
Wherein, the tone quality parameter includes SNR and effective audio length, calculate the tone quality parameter of audio data, and according to the tone quality parameter judges whether audio data accords with predetermined tone quality qualification, specifically includes:
respectively extracting noise frames and effective frames in the audio data according to a preset list generation rule, and generating a noise frame list and an effective frame list corresponding to the audio data;
respectively calculating the effective audio length and the signal-to-noise ratio of the audio data according to the effective frame list and the noise frame list;
and judging whether the audio data meets the preset sound quality qualified condition or not according to the effective audio length, a preset effective audio length threshold value, the signal-to-noise ratio and a preset signal-to-noise ratio threshold value.
The method includes the steps of respectively extracting noise frames and effective frames in the audio data according to a preset list generation rule, and generating a noise frame list and an effective frame list corresponding to the audio data, and specifically includes:
carrying out noise reduction processing on the audio data to generate noise reduction audio data;
calculating the energy difference of the same audio frame in the audio data and the noise reduction audio data;
determining a noise frame and an effective frame in the audio data according to the energy difference;
respectively writing the audio parameters of the noise frame and the audio parameters of the effective frame into a preset blank list to generate a noise frame list and an effective frame list.
Wherein, the determining a noise frame and an effective frame in the audio data according to the energy difference specifically includes:
judging whether the energy difference is larger than an energy difference threshold value or not;
if so, taking the audio frame corresponding to the energy difference larger than the energy difference threshold value as a noise frame;
and if not, taking the audio frame corresponding to the energy difference smaller than or equal to the energy difference threshold value as an effective frame.
Wherein, after determining the noise frame and the valid frame in the audio data according to the energy difference, the method further comprises:
and sequentially writing the flag bits corresponding to the noise frames and the effective frames into a preset blank list according to the sampling sequence of the audio data to generate an audio flag bit list.
Wherein the calculating an internal similarity value of the audio data and determining the quality of the audio data according to the internal similarity value specifically includes:
calculating an internal similarity value of the audio data;
judging, according to the internal similarity value, whether the audio data meets a preset homologous qualification condition;
and if the audio data meets the homologous qualification condition, determining that the audio data is quality-qualified audio.
Wherein the calculating the internal similarity value of the audio data specifically includes:
determining an audio splitting position of the audio data according to a preset splitting rule;
splitting the audio data according to the audio splitting position to generate a plurality of audio segments;
and calculating an internal similarity value between the voiceprint features of the audio segments.
Wherein, according to a preset splitting rule, determining an audio splitting position of the audio data specifically includes:
determining split frames in the effective frame list according to a preset split number and the effective frame list;
and determining a corresponding audio splitting position in the audio data according to the split frames and the audio flag bit list.
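A sketch of this split-position mapping: the split frames are chosen so that the effective frames are divided into a preset number of roughly equal runs, and each split frame's index in the flag bit list is converted back to a sample offset in the original audio. The split number and frame length here are assumed values.

```python
def split_positions(flag_bits, num_splits=3, frame_len=320):
    """Return sample offsets at which the original audio is split."""
    # Indices (in sampling order) of the effective frames.
    effective_indices = [i for i, flag in enumerate(flag_bits) if flag == 1]
    if len(effective_indices) < num_splits:
        return []
    positions = []
    for k in range(1, num_splits):
        # The k-th split frame is the effective frame at fraction
        # k / num_splits of the effective frame list.
        split_frame = effective_indices[k * len(effective_indices) // num_splits]
        positions.append(split_frame * frame_len)  # sample offset in audio
    return positions
```

The returned offsets can then be passed to numpy.split to produce the audio segments.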
Wherein, the calculating of the internal similarity value between the voiceprint features of the audio segments specifically comprises:
extracting the segment preliminary features of the audio segments according to a preset preliminary feature extraction rule;
controlling a preset voiceprint model to extract the voiceprint characteristics of the preliminary features of the segments to generate the voiceprint characteristics of the segments;
and calculating the similarity value between the voiceprint features of the segments as the internal similarity value between the audio segments.
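A sketch of this computation. The "preset voiceprint model" is not specified in the disclosure; a log-magnitude spectrum serves here as a hypothetical stand-in embedding, cosine similarity as the similarity measure, and the minimum pairwise score as the internal similarity value, all of which are assumptions.

```python
import numpy as np
from itertools import combinations

def embed(segment):
    """Hypothetical stand-in for the preset voiceprint model:
    the log-magnitude spectrum of the segment."""
    return np.log1p(np.abs(np.fft.rfft(segment)))

def internal_similarity(segments):
    """Minimum pairwise cosine similarity between segment embeddings."""
    embeddings = [embed(s) for s in segments]
    scores = []
    for a, b in combinations(embeddings, 2):
        n = min(len(a), len(b))  # segments (and spectra) may differ in length
        a, b = a[:n], b[:n]
        scores.append(float(np.dot(a, b) /
                            (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)))
    return min(scores) if scores else 0.0
```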
Wherein, the judging, according to the internal similarity value, of whether the audio data meets a preset homologous qualification condition specifically includes:
judging whether the audio data meets the preset homologous qualification condition according to the internal similarity value and a preset internal similarity threshold.
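For illustration, this test reduces to a single comparison; the threshold value here is an assumption:

```python
INTERNAL_SIMILARITY_THRESHOLD = 0.7  # hypothetical preset value

def is_homologous(similarity):
    """Segments similar enough are treated as coming from one source."""
    return similarity >= INTERNAL_SIMILARITY_THRESHOLD
```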
Wherein, if the audio data meets the homologous qualification condition, after determining that the audio data is quality-qualified audio, the method further comprises:
judging, according to a target voiceprint feature corresponding to the current account, whether the quality-qualified audio comes from a target user corresponding to the current account;
and if the quality-qualified audio comes from the target user corresponding to the current account, determining that the quality-qualified audio is sound-source-qualified audio.
The judging, according to the target voiceprint feature corresponding to the current account, of whether the quality-qualified audio comes from the target user corresponding to the current account specifically includes:
extracting audio preliminary features of the quality-qualified audio according to a preset audio feature extraction rule;
controlling a preset voiceprint model to extract the voiceprint characteristics of the audio preliminary characteristics to generate audio voiceprint characteristics;
judging whether a target voiceprint feature exists in an audio feature group corresponding to the current account;
if the target voiceprint feature exists in the audio feature group corresponding to the current account, judging, according to the target voiceprint feature and the audio voiceprint feature, whether the quality-qualified audio comes from the target user corresponding to the current account;
and if the target voiceprint characteristics do not exist in the audio characteristic group corresponding to the current account, storing the audio voiceprint characteristics into a preset blank array, and generating the audio characteristic group corresponding to the current account.
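A sketch of this account-level check, reusing a cosine-similarity comparison. The per-account dictionary standing in for the audio feature group, the match threshold, and all names are assumptions, not the disclosed data structures; fixed-length feature vectors are assumed throughout.

```python
import numpy as np

TARGET_MATCH_THRESHOLD = 0.8   # hypothetical preset value
account_features = {}          # account id -> list of voiceprint features

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_from_target_user(account_id, audio_voiceprint):
    """Compare the new audio voiceprint feature with the account's target
    voiceprint feature; enrol the feature if the account has none yet."""
    group = account_features.get(account_id)
    if not group:
        # No target voiceprint: store the feature in a fresh array to
        # create the audio feature group for this account.
        account_features[account_id] = [audio_voiceprint]
        return False
    target = group[0]  # the target voiceprint feature
    return cosine(target, audio_voiceprint) >= TARGET_MATCH_THRESHOLD
```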
Wherein, if the quality-qualified audio comes from the target user corresponding to the current account, after determining that the quality-qualified audio is sound-source-qualified audio, the method further comprises:
calculating the average of the audio voiceprint feature and the target voiceprint feature, and storing the average as the updated target voiceprint feature.
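The update step is then an element-wise average, again assuming fixed-length feature vectors:

```python
import numpy as np

def update_target_voiceprint(target, new_feature):
    """Store the mean of the old target voiceprint and the new audio
    voiceprint as the updated target voiceprint feature."""
    return (np.asarray(target) + np.asarray(new_feature)) / 2.0
```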
The present invention also provides a storage medium, wherein the storage medium stores an audio quality detection program, and the audio quality detection program, when executed by a processor, implements the steps of the audio quality detection method as described above.
In summary, the present invention provides an audio quality detection method, an intelligent terminal and a storage medium, where the method includes: acquiring audio data to be detected; and calculating an internal similarity value of the audio data, and determining the quality of the audio data according to the internal similarity value. The invention effectively improves the accuracy of the audio data quality detection and ensures the effectiveness of the subsequent audio data for voiceprint recognition or voice recognition.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program may be stored in a computer readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (16)

1. An audio quality detection method, characterized in that the audio quality detection method comprises:
acquiring audio data to be detected;
and calculating an internal similarity value of the audio data, and determining the quality of the audio data according to the internal similarity value.
2. The audio quality detection method according to claim 1, wherein, before the calculating of an internal similarity value of the audio data and the determining of the quality of the audio data according to the internal similarity value, the method further comprises:
calculating sound quality parameters of the audio data, and judging, according to the sound quality parameters, whether the audio data meets a preset sound quality qualification condition.
3. The audio quality detection method according to claim 2, wherein the sound quality parameters include a signal-to-noise ratio and an effective audio length, and the calculating of the sound quality parameters of the audio data and the judging, according to the sound quality parameters, of whether the audio data meets the preset sound quality qualification condition specifically includes:
respectively extracting noise frames and effective frames in the audio data according to a preset list generation rule, and generating a noise frame list and an effective frame list corresponding to the audio data;
respectively calculating the effective audio length and the signal-to-noise ratio of the audio data according to the effective frame list and the noise frame list;
and judging whether the audio data meets the preset sound quality qualification condition according to the effective audio length, a preset effective audio length threshold, the signal-to-noise ratio and a preset signal-to-noise ratio threshold.
4. The audio quality detection method according to claim 3, wherein the extracting of noise frames and effective frames from the audio data according to a preset list generation rule, and the generating of a noise frame list and an effective frame list corresponding to the audio data, specifically comprises:
carrying out noise reduction processing on the audio data to generate noise reduction audio data;
calculating the energy difference of the same audio frame in the audio data and the noise reduction audio data;
determining a noise frame and an effective frame in the audio data according to the energy difference;
and respectively writing the audio parameters of the noise frame and the audio parameters of the effective frame into a preset blank list to generate a noise frame list and an effective frame list.
5. The audio quality detection method according to claim 4, wherein the determining of the noise frame and the effective frame in the audio data according to the energy difference specifically comprises:
judging whether the energy difference is larger than an energy difference threshold;
if so, taking the audio frame whose energy difference is larger than the energy difference threshold as a noise frame;
and if not, taking the audio frame whose energy difference is smaller than or equal to the energy difference threshold as an effective frame.
6. The audio quality detection method according to claim 4, wherein, after the determining of the noise frame and the effective frame in the audio data according to the energy difference, the method further comprises:
and sequentially writing the flag bits corresponding to the noise frames and the effective frames into a preset blank list according to the sampling sequence of the audio data to generate an audio flag bit list.
7. The audio quality detection method according to claim 6, wherein the calculating of an internal similarity value of the audio data and the determining of the quality of the audio data according to the internal similarity value specifically comprises:
calculating an internal similarity value of the audio data;
judging, according to the internal similarity value, whether the audio data meets a preset homologous qualification condition;
and if the audio data meets the homologous qualification condition, determining that the audio data is quality-qualified audio.
8. The audio quality detection method according to claim 7, wherein the calculating the internal similarity value of the audio data specifically comprises:
determining an audio splitting position of the audio data according to a preset splitting rule;
splitting the audio data according to the audio splitting position to generate a plurality of audio segments;
and calculating an internal similarity value between the voiceprint features of the audio segments.
9. The audio quality detection method according to claim 8, wherein the determining the audio splitting position of the audio data according to a preset splitting rule specifically includes:
determining split frames in the effective frame list according to a preset split number and the effective frame list;
and determining a corresponding audio splitting position in the audio data according to the split frames and the audio flag bit list.
10. The audio quality detection method according to claim 8, wherein the calculating of the internal similarity value between the voiceprint features of the audio segments specifically comprises:
extracting the segment preliminary features of the audio segments according to a preset preliminary feature extraction rule;
controlling a preset voiceprint model to extract the voiceprint characteristics of the preliminary features of the segments to generate the voiceprint characteristics of the segments;
and calculating the similarity value between the voiceprint features of the segments as the internal similarity value between the audio segments.
11. The audio quality detection method according to claim 7, wherein the judging, according to the internal similarity value, of whether the audio data meets a preset homologous qualification condition specifically includes:
judging whether the audio data meets the preset homologous qualification condition according to the internal similarity value and a preset internal similarity threshold.
12. The audio quality detection method according to claim 11, wherein, if the audio data meets the homologous qualification condition, after determining that the audio data is quality-qualified audio, the method further comprises:
judging, according to a target voiceprint feature corresponding to the current account, whether the quality-qualified audio comes from a target user corresponding to the current account;
and if the quality-qualified audio comes from the target user corresponding to the current account, determining that the quality-qualified audio is sound-source-qualified audio.
13. The audio quality detection method according to claim 12, wherein the judging, according to the target voiceprint feature corresponding to the current account, of whether the quality-qualified audio comes from the target user corresponding to the current account specifically includes:
extracting audio preliminary features of the quality-qualified audio according to a preset audio feature extraction rule;
controlling a preset voiceprint model to extract the voiceprint characteristics of the audio preliminary characteristics to generate audio voiceprint characteristics;
judging whether a target voiceprint feature exists in an audio feature group corresponding to the current account;
if the target voiceprint feature exists in the audio feature group corresponding to the current account, judging, according to the target voiceprint feature and the audio voiceprint feature, whether the quality-qualified audio comes from the target user corresponding to the current account;
and if the target voiceprint characteristics do not exist in the audio characteristic group corresponding to the current account, storing the audio voiceprint characteristics into a preset blank array, and generating the audio characteristic group corresponding to the current account.
14. The audio quality detection method according to claim 12, wherein, if the quality-qualified audio comes from the target user corresponding to the current account, after determining that the quality-qualified audio is sound-source-qualified audio, the method further comprises:
calculating the average of the audio voiceprint feature and the target voiceprint feature, and storing the average as the updated target voiceprint feature.
15. An intelligent terminal, characterized in that the intelligent terminal comprises: a memory, a processor, and an audio quality detection program stored on the memory and executable on the processor, the audio quality detection program, when executed by the processor, implementing the steps of the audio quality detection method according to any one of claims 1-14.
16. A storage medium, characterized in that the storage medium stores an audio quality detection program, which when executed by a processor implements the steps of the audio quality detection method according to any one of claims 1 to 14.
CN202011254415.3A 2020-11-11 2020-11-11 Audio quality detection method, intelligent terminal and storage medium Pending CN114495985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011254415.3A CN114495985A (en) 2020-11-11 2020-11-11 Audio quality detection method, intelligent terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114495985A (en) 2022-05-13

Family

ID=81490840

Country Status (1)

Country Link
CN (1) CN114495985A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination