CN117877510A - Voice automatic test method, device, electronic equipment and storage medium

Voice automatic test method, device, electronic equipment and storage medium

Info

Publication number: CN117877510A
Application number: CN202311714221.0A (filed 2023-12-14, published 2024-04-12)
Authority: CN
Other languages: Chinese (zh)
Legal status: Pending
Prior art keywords: voice, recognition, model, scene, data
Inventors: 赵峻毅, 李井峰, 张品品, 周杨
Current and original assignee: Best Tone Information Service Corp Ltd
Application filed by Best Tone Information Service Corp Ltd; priority to CN202311714221.0A


Abstract

The invention relates to a voice automated testing method and apparatus, an electronic device, and a storage medium. The voice automated testing method comprises the following steps. S1, voice signal acquisition and preprocessing: obtain the recording of a user call, perform noise reduction on the recording, and obtain a denoised voice signal. S2, timbre recognition: generate a timbre model by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition. S3, scene recognition: classify scenes and generate a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition. S4, automated test scheduling: perform timbre recognition, use scene recognition to classify the input voice signal, trigger the corresponding automated test cases and test scripts according to the classification result, and generate the test results once the automated test is completed.

Description

Voice automatic test method, device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of wireless communication, and in particular to a voice automated testing method and apparatus, an electronic device, and a storage medium based on timbre recognition and scene recognition.
Background
With the diversification of interaction modes, voice interaction has been widely applied in many scenarios.
However, most existing automated voice testing techniques only test text or specific utterances, and cannot schedule tests intelligently according to different scenes and different timbres.
It is therefore necessary to develop an automated voice test scheduling system based on timbre and scene recognition.
Disclosure of Invention
The invention aims to solve the technical problem that existing automated voice testing techniques cannot schedule tests intelligently according to different scenes and different timbres, so that the accuracy and efficiency of automated voice testing are low.
To solve the above technical problem, according to one aspect of the present invention, there is provided a voice automated testing method comprising the following steps. S1, voice signal acquisition and preprocessing: obtain the recording of a user call, perform noise reduction on the recording, and obtain a denoised voice signal. S2, timbre recognition: perform voice data preprocessing and model training on the voice signal, comprising effective data acquisition, data annotation, data augmentation, feature extraction, and model training; a timbre model is generated by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition. S3, scene recognition: classify scenes and generate a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition. S4, automated test scheduling: perform timbre recognition, use scene recognition to classify the input voice signal, trigger the corresponding automated test cases and test scripts according to the classification result, and generate the test results once the automated test is completed.
According to an embodiment of the present invention, step S1 may comprise the following steps. S11, acquire the recording of the user's call and generate a recording file in WAV format. S12, apply a wavelet transform to the recording file, decomposing the recorded data into different wavelet levels, and threshold the coefficients at each wavelet level; then apply the inverse wavelet transform to the thresholded wavelet coefficients to obtain the denoised voice signal.
According to an embodiment of the present invention, step S2 may comprise the following steps. S21, after noise reduction, perform voice segmentation (VAD, Voice Activity Detection, i.e. voice endpoint detection): first divide the voice signal into small time windows, process them to generate speech segments, and store the segments as audio files in WAV format; after the speech segments are generated, match each segment to the user's ID to produce the annotation data of the training data set. S22, augment the generated data set: vary the speed, pitch, and volume of the existing speech segments, label the newly generated segments with the annotation data of the source audio, and add the new data to the data set. S23, extract MFCC features: extract the voiceprint features of the speech segments with the MFCC algorithm; divide the speech segments in the data set into short, quasi-stationary frames using a Hamming window, apply a Fast Fourier Transform (FFT) to the signal of each frame, and take the squared magnitude of the spectrum to obtain the power spectrum; apply a bank of Mel filters to the power spectrum, take the logarithm of each filter's output, apply a Discrete Cosine Transform (DCT) to the outputs of the Mel filter bank to obtain the cepstrum, and extract the MFCC features of each speech segment. S24, train the timbre model: feed the MFCC features and labels of the voice data into the model for iterative training until the model converges, generating the timbre model. S25, timbre recognition: feed the MFCC features of the speech segments into the model for recognition, obtaining the speaker of each segment.
In step S23, according to an embodiment of the present invention, the formula of the Fast Fourier Transform (FFT) may be:

X(k) = Σ_{n=0}^{N-1} x(n) · e^{-j2πkn/N},  k = 0, 1, ..., N-1

where X(k) is the kth discrete frequency component in the frequency domain, x(n) denotes the nth sample point of the time-domain signal, and N is the total number of samples of the signal.
According to an embodiment of the present invention, step S3 may comprise the following steps. S31, generate a pre-trained model using a RoBERTa model plus an ELECTRA model, both based on the Transformer architecture, and train the model for the different automated voice test scenarios, so that the corresponding automated test model can be selected according to the results of speech recognition and scene recognition. S32, obtain the speech recognition result of the voice to be processed using a connectionist temporal classification (CTC) algorithm based on an improved RNN (Recurrent Neural Network) model, and convert the result into text for subsequent processing; at the same time, extract useful features from the audio data, including Mel Frequency Cepstral Coefficient (MFCC) features, which assist in analyzing the current test scenario. S33, based on the text generated from the customer's speech, perform semantic analysis with a natural language processing technique based on a long short-term memory (LSTM) model, identify the customer's intention, match the corresponding service labels, and group the service labels into different automated voice test scenarios for scene recognition and scene classification, so that different automated test models are selected according to the recognized scene.
According to a second aspect of the present invention, there is provided a voice automated testing apparatus, comprising:
a voice signal acquisition and preprocessing module, configured to obtain the recording of a user call, perform noise reduction on the recording, and obtain a denoised voice signal: it acquires the user's call recording, generates a recording file in WAV format, applies a wavelet transform to the file, decomposing the recorded data into different wavelet levels, thresholds the coefficients at each wavelet level, and applies the inverse wavelet transform to the thresholded coefficients to obtain the denoised voice signal; a timbre recognition module, configured to perform voice data preprocessing and model training on the voice signal, comprising effective data acquisition, data annotation, data augmentation, feature extraction, and model training, and to generate a timbre model by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition; a scene recognition module, configured to classify scenes and to generate a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition; and a voice automated test scheduling module, configured to perform timbre recognition, use scene recognition to classify the input voice signal, trigger the corresponding automated test cases and test scripts according to the classification result, and generate the test results once the automated test is completed.
According to an embodiment of the present invention, the timbre recognition module may comprise: a voice data processing unit, configured to perform voice segmentation (VAD, Voice Activity Detection, i.e. voice endpoint detection) after noise reduction, first dividing the voice signal into small time windows, processing them to generate speech segments, and storing the segments as audio files in WAV format; to match each generated segment to the user's ID, producing the annotation data of the training data set; and to augment the generated data set by varying the speed, pitch, and volume of the existing speech segments, labelling the newly generated segments with the annotation data of the source audio, and adding the new data to the data set; and an MFCC feature extraction unit, configured to extract the voiceprint features of the speech segments with the MFCC algorithm, dividing the speech segments in the data set into short, quasi-stationary frames using a Hamming window, applying a Fast Fourier Transform (FFT) to the signal of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum; then applying a bank of Mel filters to the power spectrum, taking the logarithm of each filter's output, applying a Discrete Cosine Transform (DCT) to the outputs of the Mel filter bank to obtain the cepstrum, and extracting the MFCC features of each speech segment; wherein the formula of the Fast Fourier Transform (FFT) is:
X(k) = Σ_{n=0}^{N-1} x(n) · e^{-j2πkn/N},  k = 0, 1, ..., N-1

where X(k) is the kth discrete frequency component in the frequency domain, x(n) denotes the nth sample point of the time-domain signal, and N is the total number of samples of the signal;
a timbre model training unit, configured to feed the MFCC features and labels of the voice data into the model for iterative training until the model converges, generating the timbre model, and to feed the MFCC features of the speech segments into the model for recognition, obtaining the speaker of each segment.
According to an embodiment of the invention, the scene recognition module may generate a pre-trained model using a RoBERTa model plus an ELECTRA model, both based on the Transformer architecture, and train the model for the different automated voice test scenarios, so that the corresponding automated test model is subsequently selected according to the results of speech recognition and scene recognition; it obtains the speech recognition result of the voice to be processed using a CTC algorithm based on an improved RNN model, converts the result into text for subsequent processing, and extracts useful features from the audio data, including Mel Frequency Cepstral Coefficient (MFCC) features, which assist in analyzing the current test scenario; and, based on the text generated from the customer's speech, it performs semantic analysis with a natural language processing technique based on a long short-term memory model, identifies the customer's intention, matches the corresponding service labels, and groups the service labels into different automated voice test scenarios for scene recognition and scene classification, so that different automated test models are selected according to the recognized scene.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor, and a voice automated test program stored on the memory and executable on the processor, wherein the voice automated test program, when executed by the processor, implements the steps of the voice automated testing method described above.
According to a fourth aspect of the present invention, there is provided a computer storage medium on which a voice automated test program is stored, wherein the voice automated test program, when executed by a processor, implements the steps of the voice automated testing method described above.
Compared with the prior art, the technical solution provided by the embodiments of the present invention has at least the following beneficial effects:
The invention performs automated voice testing through timbre recognition and scene recognition. For timbre recognition, the collected voice signal is preprocessed, including steps such as framing, windowing, and feature extraction, and the voice signal is processed with a deep learning model to obtain a feature vector reflecting the timbre. For scene recognition, framing, windowing, and feature extraction are performed, and the voice signal is processed with methods such as the short-time Fourier transform or the wavelet transform to obtain a feature vector reflecting the scene information; based on this feature vector, a scene classifier is trained with a machine learning algorithm and used to classify and predict the scene information reflected by the voice signal.
Through automated voice testing based on timbre recognition and scene recognition, the method and apparatus of the invention can recognize the timbre and scene information of the called party more accurately, so that the corresponding test script is selected and executed, improving the accuracy of testing.
Through automated voice testing based on timbre recognition and scene recognition, the invention realizes intelligent test scheduling for called parties in different scenes and with different timbres, thereby improving the efficiency of automated voice testing.
The automated voice testing method based on timbre recognition and scene recognition reduces human participation and the test error rate, and thus reduces labor costs.
Through automated voice testing based on timbre recognition and scene recognition, the invention realizes intelligent test scheduling for called parties in different scenes and with different timbres, improving the scalability and flexibility of automated voice testing.
Through automated voice testing based on timbre recognition and scene recognition, the invention enhances the reliability and stability of automated voice testing and improves the quality and user experience of software products.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings of the embodiments are briefly described below. It is apparent that the drawings described below relate only to some embodiments of the present invention and do not limit the invention.
Fig. 1 is a flowchart illustrating a voice automated test method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by a person skilled in the art without creative effort, based on the described embodiments, fall within the protection scope of the present invention.
Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by a person of ordinary skill in the art to which this invention belongs. The terms "first," "second," and the like in the description and in the claims do not denote any order, quantity, or importance, but merely distinguish different elements. Likewise, the terms "a" or "an" and the like do not denote a limitation of quantity, but rather the presence of at least one.
Fig. 1 is a flowchart illustrating a voice automated test method according to an embodiment of the present invention.
As shown in Fig. 1, the voice automated testing method comprises the following steps.
s1, voice signal acquisition and preprocessing are carried out, user call records during user call are obtained, noise reduction processing is carried out on the call records, and noise-reduced voice signals are obtained.
S2, timbre recognition: perform voice data preprocessing and model training on the voice signal, comprising effective data acquisition, data annotation, data augmentation, feature extraction, and model training; a timbre model is generated by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition.
S3, scene recognition: classify scenes and generate a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition.
S4, automated test scheduling: perform timbre recognition, use scene recognition to classify the input voice signal, trigger the corresponding automated test cases and test scripts according to the classification result, and generate the test results once the automated test is completed.
Through automated voice testing based on timbre recognition and scene recognition, the method and apparatus of the invention can recognize the timbre and scene information of the called party more accurately, so that the corresponding test script is selected and executed and the accuracy of testing is improved; intelligent test scheduling for called parties in different scenes and with different timbres can also be realized, improving the efficiency of automated voice testing.
According to one or some embodiments of the invention, step S1 comprises the following steps.
s11, acquiring a user call record when the acquired user calls, and generating a wav format record file.
S12, apply a wavelet transform to the recording file, decomposing the recorded data into different wavelet levels, and threshold the coefficients at each wavelet level; then apply the inverse wavelet transform to the thresholded wavelet coefficients to obtain the denoised voice signal.
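To make S11 and S12 concrete, the following is a minimal sketch of the wavelet denoising step in Python using the PyWavelets library. The wavelet family (db4), decomposition depth, and the universal soft-threshold rule are illustrative assumptions, since the patent only specifies wavelet transform, thresholding, and inverse transform:

```python
import numpy as np
import pywt
from scipy.io import wavfile

def wavelet_denoise(wav_path, wavelet="db4", level=4):
    """Denoise a WAV call recording by wavelet coefficient thresholding (S12)."""
    sr, signal = wavfile.read(wav_path)
    signal = signal.astype(np.float64)

    # Decompose the recorded data into coefficients at several wavelet levels.
    coeffs = pywt.wavedec(signal, wavelet, level=level)

    # Universal threshold estimated from the finest-scale coefficients
    # (an assumed rule; the patent only says "threshold processing").
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))

    # Soft-threshold every detail level; keep the approximation coefficients.
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]

    # The inverse wavelet transform reconstructs the denoised voice signal.
    return sr, pywt.waverec(coeffs, wavelet)
```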
According to one or some embodiments of the invention, step S2 comprises the following steps.
s21, after the noise reduction is finished, performing voice segmentation (VAD, voiceActivity Detection, namely voice endpoint Detection), firstly dividing a voice signal into small time windows, processing to generate voice fragments, and storing the voice fragments as an audio file in wav format; after the speech segments are generated, the speech segments are matched with the user's ID, generating annotation data for the training dataset.
S22, augment the generated data set: vary the speed, pitch, and volume of the existing speech segments, label the newly generated segments with the annotation data of the source audio, and add the new data to the data set.
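A minimal sketch of the S22 augmentation using librosa; the stretch rate, pitch step, and gain factor are illustrative values:

```python
import librosa

def augment(y, sr):
    """Create new labelled variants of an existing speech segment (S22)."""
    faster = librosa.effects.time_stretch(y, rate=1.2)          # change speed
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # change pitch
    louder = 1.5 * y                                            # change volume
    # Each variant inherits the annotation (speaker ID) of the source audio.
    return [faster, pitched, louder]
```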
S23, extract MFCC features: extract the voiceprint features of the speech segments with the MFCC algorithm; divide the speech segments in the data set into short, quasi-stationary frames using a Hamming window, apply a Fast Fourier Transform (FFT) to the signal of each frame, and take the squared magnitude of the spectrum to obtain the power spectrum; apply a bank of Mel filters to the power spectrum, take the logarithm of each filter's output, apply a Discrete Cosine Transform (DCT) to the outputs of the Mel filter bank to obtain the cepstrum, and extract the MFCC features of each speech segment.
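A condensed sketch of the S23 pipeline (Hamming window, per-frame FFT, power spectrum, Mel filter bank, logarithm, DCT) in numpy/scipy; the frame length, hop size, filter count, and number of retained coefficients are assumed values:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr, frame_len=400, hop=160, n_filters=26, n_mfcc=13):
    """Extract MFCC features following the S23 pipeline."""
    n_fft = frame_len
    window = np.hamming(frame_len)

    # Hamming-windowed short-time frames -> per-frame FFT -> power spectrum.
    frames = np.array([signal[i:i + frame_len] * window
                       for i in range(0, len(signal) - frame_len, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular Mel filter bank spanning 0 Hz to the Nyquist frequency.
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2.0) / 700.0)
    hz_pts = 700.0 * (10.0 ** (np.linspace(0, mel_max, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log filter-bank energies, then DCT yields the cepstrum; keep n_mfcc coefficients.
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```

In practice a library routine such as librosa.feature.mfcc would normally replace this hand-rolled version; the sketch only mirrors the order of operations the patent describes.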
S24, train the timbre model: feed the MFCC features and labels of the voice data into the model for iterative training until the model converges, generating the timbre model.
S25, timbre recognition: feed the MFCC features of the speech segments into the model for recognition, obtaining the speaker of each segment.
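The patent does not name the classifier behind the timbre model; one conventional choice for voiceprint modelling is a Gaussian mixture per speaker over MFCC frames. A minimal sketch with scikit-learn, where mfcc_by_speaker is a hypothetical dict mapping speaker IDs to MFCC matrices:

```python
from sklearn.mixture import GaussianMixture

def train_timbre_models(mfcc_by_speaker, n_components=8):
    """Fit one GMM per speaker on that speaker's MFCC frames (S24 sketch)."""
    models = {}
    for speaker, feats in mfcc_by_speaker.items():  # feats: (n_frames, n_mfcc)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[speaker] = gmm.fit(feats)
    return models

def identify_speaker(models, feats):
    """Score a segment's MFCC frames against every model; best score wins (S25)."""
    return max(models, key=lambda speaker: models[speaker].score(feats))
```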
In step S23, according to one or some embodiments of the present invention, the Fast Fourier Transform (FFT) is formulated as:

X(k) = Σ_{n=0}^{N-1} x(n) · e^{-j2πkn/N},  k = 0, 1, ..., N-1

where X(k) is the kth discrete frequency component in the frequency domain, x(n) denotes the nth sample point of the time-domain signal, and N is the total number of samples of the signal.
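The DFT sum above can be checked directly against a library FFT; a small numpy sanity check:

```python
import numpy as np

x = np.random.randn(8)        # time-domain samples x(n), N = 8
N = len(x)
n = np.arange(N)

# Direct evaluation of X(k) = sum_{n=0}^{N-1} x(n) * exp(-j*2*pi*k*n/N)
X_direct = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

assert np.allclose(X_direct, np.fft.fft(x))  # matches the library FFT
```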
According to one or some embodiments of the invention, step S3 comprises the following steps.
s31, generating a pre-training model by adopting a roberta model and an electric model based on a transducer structure so as to train the model for different voice automation test scenes, and selecting a corresponding voice automation test model according to the voice recognition and scene recognition results.
S32, obtain the speech recognition result of the voice to be processed using a connectionist temporal classification (CTC) algorithm based on an improved RNN model, and convert the result into text for subsequent processing; at the same time, extract useful features from the audio data, including Mel Frequency Cepstral Coefficient (MFCC) features, which assist in analyzing the current test scenario.
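A minimal sketch of an RNN acoustic model with a CTC output layer, in the spirit of S32; the layer sizes and vocabulary are illustrative, and the "improved RNN" is rendered here as a plain bidirectional GRU:

```python
import torch
import torch.nn as nn

class CTCSpeechModel(nn.Module):
    """Bidirectional RNN over MFCC frames with a CTC output layer (S32 sketch)."""
    def __init__(self, n_mfcc=13, hidden=256, vocab=5000):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab + 1)   # +1 for the CTC blank

    def forward(self, feats):                 # feats: (batch, time, n_mfcc)
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(-1)    # per-frame log-probabilities

model = CTCSpeechModel()
loss_fn = nn.CTCLoss(blank=5000)              # blank index = vocab size
# nn.CTCLoss expects (time, batch, classes), so permute before calling:
# loss = loss_fn(model(feats).permute(1, 0, 2), targets, input_lens, target_lens)
```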
S33, based on the text generated from the customer's speech, perform semantic analysis with a natural language processing technique based on a long short-term memory (LSTM) model, identify the customer's intention, match the corresponding service labels, and group the service labels into different automated voice test scenarios for scene recognition and scene classification, so that different automated test models are selected according to the recognized scene.
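A minimal LSTM intent classifier over the recognized text, as S33 describes; the vocabulary size, embedding width, and number of service labels are assumptions:

```python
import torch
import torch.nn as nn

class IntentLSTM(nn.Module):
    """LSTM text classifier mapping an utterance to a service label (S33 sketch)."""
    def __init__(self, vocab=30000, embed=128, hidden=128, n_labels=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_labels)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1])             # logits over service labels

logits = IntentLSTM()(torch.randint(0, 30000, (2, 16)))  # two toy utterances
```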
The invention performs automated voice testing through timbre recognition and scene recognition. For timbre recognition, the collected voice signal is preprocessed, including steps such as framing, windowing, and feature extraction, and the voice signal is processed with a deep learning model to obtain a feature vector reflecting the timbre. For scene recognition, framing, windowing, and feature extraction are performed, and the voice signal is processed with methods such as the short-time Fourier transform or the wavelet transform to obtain a feature vector reflecting the scene information; based on this feature vector, a scene classifier is trained with a machine learning algorithm and used to classify and predict the scene information reflected by the voice signal.
According to a second aspect of the present invention, there is provided a voice automated testing apparatus comprising a voice signal acquisition and preprocessing module, a timbre recognition module, a scene recognition module, and a voice automated test scheduling module.
The voice signal acquisition and preprocessing module is configured to obtain the recording of a user call, perform noise reduction on the recording, and obtain a denoised voice signal: it acquires the user's call recording, generates a recording file in WAV format, applies a wavelet transform to the file, decomposing the recorded data into different wavelet levels, thresholds the coefficients at each wavelet level, and applies the inverse wavelet transform to the thresholded coefficients to obtain the denoised voice signal.
The timbre recognition module is configured to perform voice data preprocessing and model training on the voice signal to realize timbre recognition, comprising effective data acquisition, data annotation, data augmentation, feature extraction, and model training; it generates a timbre model by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition.
The scene recognition module is configured to classify scenes and to generate a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition.
The voice automated test scheduling module is configured to perform timbre recognition, use scene recognition to classify the input voice signal, trigger the corresponding automated test cases and test scripts according to the classification result, and generate the test results once the automated test is completed.
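Functionally, the scheduling step reduces to a lookup from the recognized scene (and speaker) to a registered test script; a minimal sketch, with the scene names, script paths, and pytest invocation purely hypothetical:

```python
import os
import subprocess

# Hypothetical registry mapping recognized scenes to automated test scripts.
TEST_SCRIPTS = {
    "takeaway": "tests/test_takeaway.py",
    "billing": "tests/test_billing.py",
}

def schedule_test(speaker: str, scene: str) -> subprocess.CompletedProcess:
    """Trigger the automated test case matching the classification result (S4)."""
    script = TEST_SCRIPTS.get(scene)
    if script is None:
        raise ValueError(f"no automated test registered for scene {scene!r}")
    env = dict(os.environ, TEST_SPEAKER=speaker)  # expose the timbre result
    return subprocess.run(["pytest", script], env=env,
                          capture_output=True, text=True)
```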
Through automated voice testing based on timbre recognition and scene recognition, the apparatus realizes intelligent test scheduling for called parties in different scenes and with different timbres, thereby improving the scalability and flexibility of automated voice testing.
According to one or more embodiments of the present invention, the timbre recognition module comprises a voice data processing unit, an MFCC feature extraction unit, and a timbre model training unit.
The voice data processing unit is configured to perform voice segmentation (VAD, Voice Activity Detection, i.e. voice endpoint detection) after noise reduction, first dividing the voice signal into small time windows, processing them to generate speech segments, and storing the segments as audio files in WAV format; to match each generated segment to the user's ID, producing the annotation data of the training data set; and to augment the generated data set by varying the speed, pitch, and volume of the existing speech segments, labelling the newly generated segments with the annotation data of the source audio, and adding the new data to the data set.
The MFCC feature extraction unit is configured to extract the voiceprint features of the speech segments with the MFCC algorithm, dividing the speech segments in the data set into short, quasi-stationary frames using a Hamming window, applying a Fast Fourier Transform (FFT) to the signal of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum; then applying a bank of Mel filters to the power spectrum, taking the logarithm of each filter's output, applying a Discrete Cosine Transform (DCT) to the outputs of the Mel filter bank to obtain the cepstrum, and extracting the MFCC features of each speech segment. The formula of the Fast Fourier Transform (FFT) is:
X(k) = Σ_{n=0}^{N-1} x(n) · e^{-j2πkn/N},  k = 0, 1, ..., N-1

where X(k) is the kth discrete frequency component in the frequency domain, x(n) denotes the nth sample point of the time-domain signal, and N is the total number of samples of the signal.
The timbre model training unit is configured to feed the MFCC features and labels of the voice data into the model for iterative training until the model converges, generating the timbre model, and to feed the MFCC features of the speech segments into the model for recognition, obtaining the speaker of each segment.
According to one or some embodiments of the present invention, the scene recognition module generates a pre-trained model using a RoBERTa model plus an ELECTRA model, both based on the Transformer architecture, and trains the model for the different automated voice test scenarios, so that the corresponding automated test model is subsequently selected according to the results of speech recognition and scene recognition. It obtains the speech recognition result of the voice to be processed using a CTC algorithm based on an improved RNN model, converts the result into text for subsequent processing, and extracts useful features from the audio data, including Mel Frequency Cepstral Coefficient (MFCC) features, which assist in analyzing the current test scenario. Based on the text generated from the customer's speech, it performs semantic analysis with a natural language processing technique based on a long short-term memory model, identifies the customer's intention, matches the corresponding service labels, and groups the service labels into different automated voice test scenarios for scene recognition and scene classification, so that different automated test models are selected according to the recognized scene.
According to one or some embodiments of the invention, the present solution is employed for the automated testing of the Tianyi communication assistant.
First, voice signals are collected: call recording audio with different timbres in different scenes, generated in real use of the Tianyi communication assistant, is gathered.
Then the voice signals are preprocessed: the collected call recordings are preprocessed. For timbre recognition, parameters such as Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) are used as feature inputs, and deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are trained to obtain the timbre recognition model. For framing, windowing, and feature extraction, the voice signal is processed with methods such as the Short-Time Fourier Transform (STFT) or the wavelet transform to obtain feature vectors reflecting the scene information.
Then scene recognition is performed: the preprocessed voice signal is classified with a deep learning model. A scene classifier is trained with the feature vectors as input; the classifier uses deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to distinguish the different scenes.
Finally, automated test scheduling: scene recognition is used to classify the input voice signals, and the corresponding automated test scripts are triggered according to the classification result. Specifically, according to the classification results for the different scenes, the corresponding automated test script can be triggered to test a specific scene, for example takeaway scenarios, dating scenarios, and the like. The module also includes a feedback component that provides feedback based on the results of the automated tests, helping to optimize the training of the deep learning models and the design of the test scripts.
The results show that the method can accurately recognize the timbre and scene of the called party and can be applied to automated test scheduling. This improves the accuracy and efficiency of automated testing and reduces labor costs and the test error rate. Through automated voice testing based on timbre recognition and scene recognition, the reliability and stability of automated voice testing are enhanced, and the quality and user experience of software products are improved.
According to yet another aspect of the present invention, there is provided an electronic device for voice automated testing, comprising a memory, a processor, and a voice automated test program stored on the memory and executable on the processor, wherein the voice automated test program, when executed by the processor, implements the steps of the voice automated testing method described above.
A computer storage medium is also provided according to the present invention.
The computer storage medium stores a voice automated test program which, when executed by a processor, implements the steps of the voice automated testing method described above.
For the methods implemented when the voice automated test program is executed on the processor, reference may be made to the embodiments of the voice automated testing method of the present invention, which are not repeated here.
The invention also provides a computer program product.
The computer program product of the invention comprises a voice automated test program which, when executed by a processor, implements the steps of the voice automated testing method described above.
For the methods implemented when the voice automated test program is executed on the processor, reference may be made to the embodiments of the voice automated testing method of the present invention, which are not repeated here.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software on a necessary general-purpose hardware platform, or by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present invention, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disc) and comprising instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present invention.
The foregoing is merely exemplary embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the appended claims.

Claims (10)

1. A voice automated testing method, comprising the following steps:
S1, voice signal acquisition and preprocessing: obtaining the recording of a user call, performing noise reduction on the recording, and obtaining a denoised voice signal;
S2, timbre recognition: performing voice data preprocessing and model training on the voice signal, comprising effective data acquisition, data annotation, data augmentation, feature extraction, and model training, wherein a timbre model is generated by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition;
S3, scene recognition: classifying scenes and generating a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition;
S4, automated test scheduling: performing timbre recognition, using scene recognition to classify the input voice signal, triggering the corresponding automated test cases and test scripts according to the classification result, and generating the test results once the automated test is completed.
2. The voice automated testing method according to claim 1, wherein step S1 comprises the following steps:
S11, acquiring the recording of the user's call and generating a recording file in WAV format;
S12, applying a wavelet transform to the recording file, decomposing the recorded data into different wavelet levels, thresholding the coefficients at each wavelet level, and applying the inverse wavelet transform to the thresholded wavelet coefficients to obtain the denoised voice signal.
3. The voice automated testing method according to claim 1, wherein step S2 comprises the following steps:
S21, after noise reduction, performing voice segmentation: first dividing the voice signal into small time windows, processing them to generate speech segments, and storing the segments as audio files in WAV format; after the speech segments are generated, matching each segment to the user's ID to produce the annotation data of the training data set;
S22, augmenting the generated data set: varying the speed, pitch, and volume of the existing speech segments, labelling the newly generated segments with the annotation data of the source audio, and adding the new data to the data set;
S23, extracting MFCC features: extracting the voiceprint features of the speech segments with the MFCC algorithm, dividing the speech segments in the data set into short, quasi-stationary frames using a Hamming window, applying a fast Fourier transform to the signal of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum; applying a bank of Mel filters to the power spectrum, taking the logarithm of each filter's output, applying a discrete cosine transform to the outputs of the Mel filter bank to obtain the cepstrum, and extracting the MFCC features of each speech segment;
S24, training the timbre model: feeding the MFCC features and labels of the voice data into the model for iterative training until the model converges, generating the timbre model;
S25, timbre recognition: feeding the MFCC features of the speech segments into the model for recognition, obtaining the speaker of each segment.
4. The voice automated testing method according to claim 3, wherein in step S23 the fast Fourier transform FFT is formulated as:
X(k) = Σ_{n=0}^{N-1} x(n) · e^{-j2πkn/N},  k = 0, 1, ..., N-1
where X(k) is the kth discrete frequency component in the frequency domain, x(n) denotes the nth sample point of the time-domain signal, and N is the total number of samples of the signal.
5. The voice automated testing method according to claim 1, wherein step S3 comprises the following steps:
S31, generating a pre-trained model using a RoBERTa model plus an ELECTRA model, both based on the Transformer architecture, and training the model for the different automated voice test scenarios, so that the corresponding automated test model is selected according to the results of speech recognition and scene recognition;
S32, obtaining the speech recognition result of the voice to be processed using a CTC algorithm based on an improved RNN model, converting the result into text for subsequent processing, and simultaneously extracting useful features from the audio data, the features comprising Mel Frequency Cepstral Coefficient (MFCC) features, which are used to assist in analyzing the current test scenario;
S33, based on the text generated from the customer's speech, performing semantic analysis with a natural language processing technique based on a long short-term memory model, identifying the customer's intention, matching the corresponding service labels, and grouping the service labels into different automated voice test scenarios for scene recognition and scene classification, so that different automated test models are selected according to the recognized scene.
6. A voice automated testing apparatus, comprising:
a voice signal acquisition and preprocessing module, configured to obtain the recording of a user call, perform noise reduction on the recording, and obtain a denoised voice signal: it acquires the user's call recording, generates a recording file in WAV format, applies a wavelet transform to the file, decomposing the recorded data into different wavelet levels, thresholds the coefficients at each wavelet level, and applies the inverse wavelet transform to the thresholded coefficients to obtain the denoised voice signal;
a timbre recognition module, configured to perform voice data preprocessing and model training on the voice signal to realize timbre recognition, comprising effective data acquisition, data annotation, data augmentation, feature extraction, and model training, and to generate a timbre model by performing, in sequence, voice data preprocessing, MFCC feature extraction, and timbre model training, thereby realizing timbre recognition;
a scene recognition module, configured to classify scenes and to generate a scene model by performing, in sequence, speech recognition, semantic recognition, and scene model training, thereby realizing scene recognition;
a voice automated test scheduling module, configured to trigger the corresponding automated test cases and test scripts according to the classification result and to generate the test results once the automated test is completed.
7. The voice automated testing apparatus according to claim 6, wherein the timbre recognition module comprises:
a voice data processing unit, configured to perform voice segmentation after noise reduction, first dividing the voice signal into small time windows, processing them to generate speech segments, and storing the segments as audio files in WAV format; to match each generated segment to the user's ID, producing the annotation data of the training data set; and to augment the generated data set by varying the speed, pitch, and volume of the existing speech segments, labelling the newly generated segments with the annotation data of the source audio, and adding the new data to the data set;
an MFCC feature extraction unit, configured to extract the voiceprint features of the speech segments with the MFCC algorithm, dividing the speech segments in the data set into short, quasi-stationary frames using a Hamming window, applying a fast Fourier transform to the signal of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum; applying a bank of Mel filters to the power spectrum, taking the logarithm of each filter's output, applying a discrete cosine transform to the outputs of the Mel filter bank to obtain the cepstrum, and extracting the MFCC features of each speech segment;
wherein the formula of the fast Fourier transform is:
X(k) = Σ_{n=0}^{N-1} x(n) · e^{-j2πkn/N},  k = 0, 1, ..., N-1
where X(k) is the kth discrete frequency component in the frequency domain, x(n) denotes the nth sample point of the time-domain signal, and N is the total number of samples of the signal; and
a timbre model training unit, configured to feed the MFCC features and labels of the voice data into the model for iterative training until the model converges, generating the timbre model, and to feed the MFCC features of the speech segments into the model for recognition, obtaining the speaker of each segment.
8. The voice automated testing apparatus according to claim 6, wherein the scene recognition module generates a pre-trained model using a RoBERTa model plus an ELECTRA model, both based on the Transformer architecture, and trains the model for the different automated voice test scenarios, so that the corresponding automated test model is subsequently selected according to the results of speech recognition and scene recognition;
the scene recognition module obtains the speech recognition result of the voice to be processed using a CTC algorithm based on an improved RNN model, converts the result into text for subsequent processing, and simultaneously extracts useful features from the audio data, the features comprising Mel frequency cepstral coefficient features, which are used to assist in analyzing the current test scenario; and
the scene recognition module performs semantic analysis on the text with a natural language processing technique based on a long short-term memory model according to the text content generated from the customer's speech, identifies the customer's intention, matches the corresponding service labels, and groups the service labels into different automated voice test scenarios for scene recognition and scene classification, so that different automated test models are selected according to the recognized scene.
9. An electronic device, comprising: a memory, a processor, and a voice automated test program stored on the memory and executable on the processor, wherein the voice automated test program, when executed by the processor, implements the steps of the voice automated testing method of any one of claims 1 to 5.
10. A computer storage medium having stored thereon a voice automated test program which, when executed by a processor, implements the steps of the voice automated testing method of any one of claims 1 to 5.
CN202311714221.0A (filed 2023-12-14, priority date 2023-12-14): Voice automatic test method, device, electronic equipment and storage medium. Status: Pending. Published as CN117877510A (en).

Priority Applications (1)

CN202311714221.0A, priority and filing date 2023-12-14: Voice automatic test method, device, electronic equipment and storage medium

Publications (1)

CN117877510A, published 2024-04-12

Family ID: 90580025

Country Status (1)

CN: CN117877510A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination