CN117240958A - Audio and video processing performance test method and device - Google Patents


Info

Publication number
CN117240958A
CN117240958A (application CN202210633014.1A)
Authority
CN
China
Prior art keywords
audio
video
video data
information
degradation
Prior art date
Legal status: Pending (assumption, not a legal conclusion)
Application number
CN202210633014.1A
Other languages
Chinese (zh)
Inventor
蔡禄汀
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN202210633014.1A priority Critical patent/CN117240958A/en
Priority to PCT/CN2023/094308 priority patent/WO2023236730A1/en
Publication of CN117240958A publication Critical patent/CN117240958A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; centralised arrangements for recording messages for absent or busy subscribers
    • H04M3/51 Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing

Abstract

Embodiments of the present application provide an audio and video processing performance test method and device. A marked audio and video data set is input into a CNN+LSTM network for training; audio and video data from terminals are input into the trained CNN+LSTM neural network model to obtain audio and video quality information; it is then determined whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds, and if not, the number of terminals is increased and the analysis step is repeated until both are less than or equal to the corresponding thresholds, at which point a processing performance test result is output. This solves the problem in the related art that the audio and video processing quality and performance of a call center cannot be accurately evaluated and detected, eliminating a large amount of manual participation and improving test efficiency.

Description

Audio and video processing performance test method and device
Technical Field
The embodiment of the application relates to the field of communication, in particular to an audio and video processing performance test method and device.
Background
With the development of 5G, audio and video, and related technologies, and in order to provide customers with a better experience and higher service quality, the call center has evolved from simple call answering into a new-generation system offering strong audio and video interaction capabilities. How to detect and evaluate the audio and video processing capability of the call center, so as to improve its service quality, is therefore a problem the new-generation call center needs to solve.
Current call center performance detection tools mainly focus on indicators such as the number of agents simultaneously online, the number of real-time calls, and the maximum number of calls. Evaluation of audio and video quality is mostly limited to capturing audio and video streams and combining manual analysis by professionals with assessments from agent users, a detection and evaluation approach that is extremely inefficient.
Disclosure of Invention
The embodiments of the present application provide an audio and video processing performance test method and device, which at least solve the problem in the related art that the audio and video processing quality and performance of a call center cannot be evaluated and detected efficiently and accurately.
According to an embodiment of the present application, an audio and video processing performance test method is provided, including: constructing an audio and video data set and marking it; inputting the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory network (CNN+LSTM) for training; configuring parameter information of a call model, and setting a quality threshold for the audio and video data and a system resource usage information threshold; inputting audio and video data from terminals into the trained CNN+LSTM neural network model to obtain audio and video quality information; and determining whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds, and if not, increasing the number of terminals and repeating the analysis step until both are less than or equal to the corresponding thresholds, then outputting a processing performance test result.
In an exemplary embodiment, marking the audio and video data set includes at least one of: degrading the audio and video data set in different degradation modes; dividing each degradation mode into a plurality of levels and constructing a degradation score vector; and marking the audio and video data set according to the degradation mode and level.
In an exemplary embodiment, the different degradation modes include at least one of: video blurring; video frame extraction; audio and video desynchronization; and audio noise addition.
In an exemplary embodiment, dividing each degradation mode into a plurality of levels and constructing a degradation score vector includes at least one of the following: video blurring uses five levels, namely no blurring and Gaussian filtering with convolution kernels of four sizes (3x3, 5x5, 9x9, and 15x15); video frame extraction is processed at intervals of 5, 10, 15, and 30 frames, plus no frame extraction; audio and video desynchronization is processed at offsets of -5 s, -2 s, 0 s, 2 s, and 5 s; audio noise addition applies zero-mean white Gaussian noise at four increasing variances, plus no processing; and a degradation score vector [a1, a2, a3, a4] is constructed, where a1 denotes the level score of video blurring, a2 the level score of video frame extraction, a3 the level score of audio and video desynchronization, and a4 the level score of audio noise addition, so that the highest score, i.e. the quality score vector of an undegraded video, is [5, 5, 5, 5].
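To make the scoring scheme concrete, the following Python sketch builds the four-component degradation score vector; the function name, level tables, and value choices are hypothetical illustrations, not taken from the patent:

```python
# Hypothetical sketch of the degradation score vector [a1, a2, a3, a4].
# Each of the four degradation modes is graded on five levels (1..5),
# where 5 means "no degradation". The level tables below are
# illustrative assumptions mirroring the levels described in the text.

BLUR_KERNELS = [15, 9, 5, 3, None]            # Gaussian kernel size; None = no blur
FRAME_DROP_INTERVALS = [5, 10, 15, 30, None]  # extract every N frames; None = no extraction
AV_OFFSETS_S = [-5, -2, 2, 5, 0]              # audio/video offset in seconds; 0 = in sync

def score_vector(blur_level, frame_level, sync_level, noise_level):
    """Build [a1, a2, a3, a4]; each grade is in 1..5, with 5 = undegraded."""
    grades = [blur_level, frame_level, sync_level, noise_level]
    for g in grades:
        if not 1 <= g <= 5:
            raise ValueError("grades must be in 1..5")
    return grades

print(score_vector(5, 5, 5, 5))  # [5, 5, 5, 5]: an undegraded clip
print(score_vector(1, 3, 5, 2))  # [1, 3, 5, 2]: heavy blur, noisy audio
```

A training label for a degraded clip is then simply its score vector.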
In an exemplary embodiment, the parameter information of the call model includes at least one of: the number of call paths; the sending time of the audio and video data; the video resolution; the encoding mode of the audio and video data; and the transmission code rate of the audio and video data.
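As a hedged illustration, a call-model configuration covering the listed parameters might look like the following; every field name and value here is an assumption made for the sketch, not part of the patent:

```python
# Illustrative call-model configuration; all names and values are assumptions.
call_model = {
    "call_paths": 100,                             # number of call paths
    "send_duration_s": 60,                         # sending time of the audio/video data
    "video_resolution": (1280, 720),               # video resolution
    "codec": {"audio": "opus", "video": "h264"},   # encoding modes
    "bitrate_kbps": {"audio": 64, "video": 1500},  # transmission code rates
}

# Thresholds configured in the same step of the method.
thresholds = {
    "quality_threshold": 4.0,   # audio/video quality threshold
    "cpu_percent": 80.0,        # system resource usage threshold
}

print(call_model["call_paths"], thresholds["cpu_percent"])  # 100 80.0
```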
In an exemplary embodiment, after configuring the parameter information of the call model, the method further includes: configuring access information of a target call center, where the target call center supports a plurality of different call models; and accessing the terminals to the target call center through the SIP protocol.
According to another embodiment of the present application, an audio and video processing performance test apparatus is also provided, including: a data processing module, configured to construct an audio and video data set and mark it; a neural network training module, configured to input the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory network (CNN+LSTM) for training; a first configuration module, configured to configure parameter information of the call model and set a quality threshold for the audio and video data and a system resource usage information threshold; a test module, configured to input audio and video data from terminals into the trained CNN+LSTM neural network model to obtain audio and video quality information; a determining and looping module, configured to determine whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds and, if not, to increase the number of terminals and return to the test module until both are less than or equal to the corresponding thresholds; and a reporting module, configured to output a processing performance test result according to the audio and video quality information and the system resource usage information when both are less than or equal to the corresponding thresholds.
In an exemplary embodiment, the data processing module further includes at least one of: a degradation unit, configured to degrade the audio and video data set in different degradation modes; a level dividing unit, configured to divide each degradation mode into a plurality of levels and construct a degradation score vector; and a marking unit, configured to mark the audio and video data set according to the degradation mode and level.
In an exemplary embodiment, the different degradation modes include at least one of: video blurring; video frame extraction; audio and video desynchronization; and audio noise addition.
In an exemplary embodiment, the parameter information of the call model includes at least one of: number of call paths; the sending time of the audio and video data; video resolution; an encoding mode of audio and video data; the transmission code rate of the audio and video data.
In one exemplary embodiment, the apparatus further includes: a second configuration module, configured to configure access information of a target call center, where the target call center supports a plurality of different call models; and a terminal access module, configured to access the terminals to the target call center through the SIP protocol.
According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Through the above embodiments of the present application, the marked audio and video data set is input into CNN+LSTM for training; audio and video data from terminals are input into the trained CNN+LSTM neural network model to obtain audio and video quality information; it is then determined whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds, and if not, the number of terminals is increased and the analysis step is repeated until both are less than or equal to the corresponding thresholds, at which point a processing performance test result is output. This solves the problem in the related art that the audio and video processing quality and performance of a call center cannot be accurately evaluated and detected, eliminating a large amount of manual participation and improving test efficiency.
Drawings
Fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an audio and video processing performance test method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a network architecture of an audio/video processing performance test method according to an embodiment of the present application;
FIG. 3 is a flow chart of an audio and video processing performance test method according to an embodiment of the application;
FIG. 4 is a flow chart of an audio and video processing performance test method according to an embodiment of the application;
fig. 5 is a block diagram of an audio/video processing performance test apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram showing the structure of a data processing module of an audio/video processing performance test apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an audio/video processing performance test apparatus according to an embodiment of the present application;
fig. 8 is a flowchart of an audio and video processing performance test method according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of the mobile terminal according to the audio/video processing performance testing method according to the embodiment of the present application. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to an audio/video processing performance test method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
The embodiment of the present application may be operated on the network architecture shown in fig. 2, and as shown in fig. 2, the audio/video processing performance test network architecture 20 includes: a call center access module 210, an audio video analysis module 220, and a configuration module 230. The call center access module is used for accessing different call centers needing to be tested through the SIP protocol; the audio and video analysis module is used for integrating the CNN+LSTM neural network, can train the neural network by utilizing audio and video data, and is used for analyzing the audio and video stream in the call flow; the configuration module is used for configuring the models and parameters related to the performance test.
In this embodiment, an audio/video processing performance testing method running on the mobile terminal or the network architecture is provided, and fig. 3 is a flowchart of the audio/video processing performance testing method according to an embodiment of the present application, as shown in fig. 3, where the flowchart includes the following steps:
step S302, an audio and video data set is constructed, and the audio and video data set is marked;
step S304, inputting the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory network (CNN+LSTM) for training;
step S306, configuring parameter information of a call model, and setting a quality threshold of audio and video data and a system resource usage information threshold;
step S308, inputting audio and video data from a terminal to the trained CNN+LSTM neural network model to obtain audio and video quality information;
and step S310, determining whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds; if not, increasing the number of terminals and repeating step S308 until both are less than or equal to the corresponding thresholds, and then outputting a processing performance test result.
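As a rough, non-authoritative illustration of steps S304 and S308, the minimal NumPy sketch below shows the CNN+LSTM idea: a small convolution extracts one feature per frame and a single LSTM cell folds the frame sequence into a quality estimate. Every shape, weight, and the final score mapping is invented for this sketch; it is not the patented model:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_feature(frame, kernel):
    """Valid 2-D convolution followed by global average pooling (one feature per frame)."""
    kh, kw = kernel.shape
    h, w = frame.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * kernel)
    return np.array([out.mean()])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM cell step; W packs the four gate weight matrices."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

hidden = 8
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + 1))  # untrained demo weights
kernel = rng.normal(scale=0.1, size=(3, 3))

frames = rng.normal(size=(16, 32, 32))  # a 16-frame clip of 32x32 frames
h, c = np.zeros(hidden), np.zeros(hidden)
for frame in frames:
    h, c = lstm_step(conv_feature(frame, kernel), h, c, W)

score = 5.0 * sigmoid(h.sum())  # map the final hidden state to a 0..5 quality score
print(round(float(score), 3))
```

In a real implementation the convolution stack, hidden size, and output head would of course be trained on the marked data set from step S302 rather than using random weights.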
Through the above steps, the marked audio and video data set is input into CNN+LSTM for training; audio and video data from terminals are input into the trained CNN+LSTM neural network model to obtain audio and video quality information; it is then determined whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds, and if not, the number of terminals is increased and the analysis step is repeated until both are less than or equal to the corresponding thresholds, at which point a processing performance test result is output. This solves the problem in the related art that the audio and video processing quality and performance of a call center cannot be accurately evaluated and detected, eliminating a large amount of manual participation and improving test efficiency.
The main execution body of the above steps may be a base station, a terminal, or other processor platform on which the program required for executing the above method can be installed, but is not limited thereto.
In an exemplary embodiment, marking the audio and video data set includes at least one of: degrading the audio and video data set in different degradation modes; dividing each degradation mode into a plurality of levels and constructing a degradation score vector; and marking the audio and video data set according to the degradation mode and level.
In an exemplary embodiment, the different degradation modes include at least one of: video blurring; video frame extraction; audio and video desynchronization; and audio noise addition.
In an exemplary embodiment, dividing each degradation mode into a plurality of levels and constructing a degradation score vector includes at least one of the following: video blurring uses five levels, namely no blurring and Gaussian filtering with convolution kernels of four sizes (3x3, 5x5, 9x9, and 15x15); video frame extraction is processed at intervals of 5, 10, 15, and 30 frames, plus no frame extraction; audio and video desynchronization is processed at offsets of -5 s, -2 s, 0 s, 2 s, and 5 s; audio noise addition applies zero-mean white Gaussian noise at four increasing variances, plus no processing; and a degradation score vector [a1, a2, a3, a4] is constructed, where a1 denotes the level score of video blurring, a2 the level score of video frame extraction, a3 the level score of audio and video desynchronization, and a4 the level score of audio noise addition, so that the highest score, i.e. the quality score vector of an undegraded video, is [5, 5, 5, 5].
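The audio noise-addition levels above can be sketched as follows; the patent specifies only a zero mean and four increasing variances, so the concrete standard-deviation values here are assumptions:

```python
import numpy as np

# Five audio noise levels: level 5 = no noise, levels 4..1 add zero-mean
# white Gaussian noise at increasing variance. The std values are assumed.
NOISE_STDS = [0.0, 0.01, 0.05, 0.1, 0.2]  # indexed by (5 - level)

def add_noise(audio, level, rng=None):
    """Degrade float audio samples at the given grade 1..5 (5 = no noise)."""
    rng = rng or np.random.default_rng(0)
    std = NOISE_STDS[5 - level]
    if std == 0.0:
        return audio.copy()
    return audio + rng.normal(0.0, std, size=audio.shape)

tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s of 440 Hz at 8 kHz
clean = add_noise(tone, 5)
noisy = add_noise(tone, 1)
print(np.allclose(clean, tone))  # True: level 5 leaves the signal untouched
```

The other three degradation modes (blur, frame extraction, desynchronization) would be applied analogously, each producing its own 1..5 grade for the score vector.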
In an exemplary embodiment, the parameter information of the call model includes at least one of: number of call paths; the sending time of the audio and video data; video resolution; an encoding mode of audio and video data; the transmission code rate of the audio and video data.
In an exemplary embodiment, after configuring the parameter information of the call model, the method further includes: configuring access information of a target call center, where the target call center supports a plurality of different call models; and accessing the terminals to the target call center through the SIP protocol. Fig. 4 is a flowchart of an audio and video processing performance test method according to an embodiment of the present application; as shown in fig. 4, the flow includes the following steps:
step S402, an audio and video data set is constructed, and the audio and video data set is marked;
step S404, inputting the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory network (CNN+LSTM) for training;
step S406, configuring parameter information of a call model, and setting a quality threshold of audio and video data and a system resource usage information threshold;
step S408, configuring access information of a target call center, wherein the target call center supports a plurality of different call models;
step S410, the terminals access the target call center through the SIP protocol;
step S412, inputting the audio and video data from the terminal to the trained CNN+LSTM neural network model to obtain audio and video quality information;
step S414, determining whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds; if not, increasing the number of terminals and repeating step S412 until both are less than or equal to the corresponding thresholds, and then outputting a processing performance test result.
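The loop in steps S412-S414 can be sketched as follows. The direction of the comparison follows the document's wording (the measured information must end up less than or equal to the thresholds), and measure is a hypothetical stand-in for one round of CNN+LSTM analysis plus resource sampling:

```python
# Hedged sketch of the S412-S414 test loop; all names are hypothetical.
def run_performance_test(measure, quality_threshold, cpu_threshold,
                         terminals=1, step=10, max_rounds=100):
    """Repeat measurement, adjusting the terminal count, until both the
    quality information and the resource usage are within thresholds."""
    for _ in range(max_rounds):
        quality, cpu = measure(terminals)
        if quality <= quality_threshold and cpu <= cpu_threshold:
            # thresholds satisfied: output the performance test result
            return {"terminals": terminals, "quality": quality, "cpu": cpu}
        terminals += step  # not satisfied: increase the number of terminals
    raise RuntimeError("thresholds never satisfied")

# Scripted measurements so the sketch terminates deterministically.
readings = iter([(4.8, 95.0), (4.2, 88.0), (3.1, 72.0)])
result = run_performance_test(lambda n: next(readings),
                              quality_threshold=4.0, cpu_threshold=80.0)
print(result["terminals"])  # prints 21: two increases of 10 from the initial 1
```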
From the description of the above embodiments, it will be clear to those skilled in the art that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may essentially be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods according to the embodiments of the present application.
This embodiment also provides an audio and video processing performance test apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the terms "module" and "unit" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a block diagram of an audio/video processing performance test apparatus according to an embodiment of the present application, and as shown in fig. 5, the audio/video processing performance test apparatus 50 includes:
the data processing module 510 is configured to construct an audio/video data set, and mark the audio/video data set;
the neural network training module 520 is configured to input the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory network (CNN+LSTM) for training;
a first configuration module 530, configured to configure parameter information of the call model, and set a quality threshold of the audio/video data and a system resource usage information threshold;
the test module 540 is configured to input audio and video data from a terminal to the trained cnn+lstm neural network model, to obtain audio and video quality information;
a determining and looping module 550, configured to determine whether the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding thresholds and, if not, to increase the number of terminals and return to the test module 540 until both are less than or equal to the corresponding thresholds;
and the reporting module 560 is configured to output a processing performance test result according to the audio and video quality information and the system resource usage information when the audio and video quality information and the preconfigured system resource usage information are less than or equal to the corresponding threshold values.
In an exemplary embodiment, fig. 6 is a block diagram of a data processing module of an audio/video processing performance testing apparatus according to an embodiment of the present application, and as shown in fig. 6, the data processing module 510 further includes at least one of the following: a degradation unit 610, configured to degrade the audio and video data set in different degradation manners; a ranking unit 620 configured to rank each of the degradation manners into a plurality of ranks, and construct a degradation score vector; and a marking unit 630, configured to mark the audio/video data set according to the degradation manner and the classification level.
In one exemplary embodiment, the different degradation modes in the degradation unit 610 include at least one of: video blurring; video frame extraction; audio and video desynchronization; and audio noise addition.
In one exemplary embodiment, the parameter information of the call model configured by the first configuration module 530 includes at least one of: number of call paths; the sending time of the audio and video data; video resolution; an encoding mode of audio and video data; the transmission code rate of the audio and video data.
In an exemplary embodiment, fig. 7 is a block diagram of an audio/video processing performance testing apparatus according to an embodiment of the present application, and as shown in fig. 7, the audio/video processing performance testing apparatus 70 includes, in addition to the respective modules mentioned in fig. 5: a second configuration module 710, configured to configure access information of a target call center, wherein the target call center supports a plurality of different call models; and a terminal access module 720, configured to access the terminal to the target call center through SIP protocol.
It should be noted that each of the above modules and units may be implemented by software or hardware, and the latter may be implemented by, but not limited to: the modules and the units are all positioned in the same processor; alternatively, the above modules and units may be located in different processors in any combination.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; details are not repeated here.
In order to enable those skilled in the art to better understand the technical solution of the present application, the technical solution of the present application will be described below with reference to specific scenario embodiments.
Scene embodiment one
Deep learning technology is currently developing rapidly, and neural networks can be trained to realize a wide range of functions. The CNN+LSTM architecture (a convolutional neural network cascaded with a long short-term memory network) is widely applied to processing audio/video time series. Its basic idea is to use the convolutional neural network to extract features from the audio and video and the LSTM to generate the description, combining two popular deep learning models. The deep-learning-based call center audio/video processing performance detection system provided by the application leverages the capability of the CNN+LSTM neural network to eliminate a large amount of manual participation and to evaluate the audio/video processing quality and performance of a call center efficiently and accurately.
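As a concrete illustration of the LSTM half of this architecture, the sketch below implements a single LSTM cell step in plain NumPy and runs a toy sequence of per-frame feature vectors through it. The gate layout, dimensions, and random weights are illustrative only and are not the patent's actual network.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step over a feature vector x (e.g. per-frame CNN features).

    W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,).
    Gates are stacked in the order [input, forget, cell candidate, output].
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = 1.0 / (1.0 + np.exp(-z[:hidden]))            # input gate
    f = 1.0 / (1.0 + np.exp(-z[hidden:2*hidden]))    # forget gate
    g = np.tanh(z[2*hidden:3*hidden])                # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3*hidden:]))          # output gate
    c = f * c_prev + i * g                           # new cell state
    h = o * np.tanh(c)                               # new hidden state
    return h, c

# Run a toy sequence of 10 "frame feature" vectors through the cell.
rng = np.random.default_rng(0)
inp, hid = 8, 4
W = rng.normal(size=(4*hid, inp)) * 0.1
U = rng.normal(size=(4*hid, hid)) * 0.1
b = np.zeros(4*hid)
h, c = np.zeros(hid), np.zeros(hid)
for _ in range(10):
    h, c = lstm_step(rng.normal(size=inp), h, c, W, U, b)
print(h.shape)  # (4,)
```

In a full CNN+LSTM quality model, the final hidden state would feed a small regression head that outputs the four-dimensional degradation score vector.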
Fig. 8 is a flowchart of a method for testing audio and video processing performance according to an embodiment of the present application, as shown in fig. 8, the flowchart includes the following steps:
step S802, training a CNN+LSTM neural network;
First, a large number of audio/video clips are collected from the Internet to construct a data set, which is classified into training and test sets at several standard resolutions (720p, 1080p, 2K, 4K). The original audio and video of the training set are then degraded in different manners: video blurring, video frame extraction, audio/video desynchronization, and audio noise addition, each graded into 5 levels by intensity. Video blurring applies Gaussian filtering with convolution kernels of four sizes (3×3, 5×5, 9×9, 15×15), plus no blurring; frame extraction is performed at intervals of 5, 10, 15, or 30 frames, plus no frame extraction; audio/video desynchronization shifts the audio by -5 s, -2 s, 0 s, 2 s, or 5 s; audio noise addition uses zero-mean Gaussian white noise at four increasing variances, plus no noise. The degradation score of each training sample thus forms a four-dimensional vector, where the highest score, i.e., the score vector of undegraded video, is [5, 5, 5, 5].
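Two of the degradation manners above can be sketched as follows. The concrete noise variances and the interpretation of "frame extraction at intervals" (removing one frame every N frames) are assumptions for illustration; the patent fixes only the level counts, not these values.

```python
import numpy as np

# Hypothetical mapping: level 1 = no noise, levels 2-5 add zero-mean
# Gaussian white noise with increasing variance (values are illustrative).
NOISE_VARIANCES = {1: 0.0, 2: 0.001, 3: 0.005, 4: 0.02, 5: 0.1}

def degrade_audio(samples: np.ndarray, level: int, rng=None) -> np.ndarray:
    """Add zero-mean white Gaussian noise at the given degradation level."""
    var = NOISE_VARIANCES[level]
    if var == 0.0:
        return samples.copy()
    rng = rng or np.random.default_rng(0)
    return samples + rng.normal(0.0, np.sqrt(var), size=samples.shape)

def extract_frames(frames: list, interval: int) -> list:
    """Simulate frame extraction by removing every `interval`-th frame;
    interval=0 means no frame is removed (assumed interpretation)."""
    if interval == 0:
        return list(frames)
    return [f for i, f in enumerate(frames) if (i + 1) % interval != 0]

audio = np.zeros(1000)
noisy = degrade_audio(audio, 5)          # strongest noise level
frames = list(range(30))
kept = extract_frames(frames, 5)         # removes frames 5, 10, ..., 30 (1-based)
print(len(kept))  # 24
```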
With the multi-dimensional degradation Score as the training target, the four degradation manners (video blurring, video frame extraction, audio/video desynchronization, audio noise addition) are applied to the training set. The permutations and combinations of the 5 levels yield the different Score labels, and the labeled data are fed into the constructed CNN+LSTM neural network for training to obtain a pre-trained network;
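The label space produced by permuting the five levels across the four degradation dimensions can be enumerated directly; the dimension names below are illustrative shorthand for the four manners.

```python
from itertools import product

# Each dimension is graded 1 (strongest degradation) .. 5 (none),
# following the five-level scheme described above.
LEVELS = range(1, 6)
DIMENSIONS = ("blur", "frame_extraction", "av_desync", "audio_noise")

# Every permutation-and-combination of per-dimension levels gives one
# possible Score label for a degraded training clip.
score_labels = list(product(LEVELS, repeat=len(DIMENSIONS)))

print(len(score_labels))   # 625 = 5 ** 4
print(score_labels[-1])    # (5, 5, 5, 5): the undegraded reference
```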
step S804, configuring model parameters;
configure the call model information of the test system, namely the number of members required per call path, the time each member sends audio/video streams, the video resolution, the audio/video encoding mode, the transmission bit rate, and other information, as well as the initial number of call paths. Configure an audio/video quality threshold Score over the four dimensions according to the specific contact center's video quality requirements, for example [4, 4, 4, 4], meaning the minimum acceptable quality must reach a score of 4 in each of the four degradation dimensions described in step S802. Also configure the system resource (CPU, memory) usage information and thresholds for the call center server;
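A minimal sketch of such a configuration is shown below. The field names and default values are hypothetical: the patent lists the parameters but prescribes no concrete API.

```python
from dataclasses import dataclass, field

@dataclass
class CallModelConfig:
    # Parameters named in step S804 (field names are illustrative).
    members_per_call: int = 2          # members required per call path
    stream_duration_s: int = 60        # time each member sends A/V streams
    resolution: str = "1080p"
    codec: str = "H.264/Opus"
    bitrate_kbps: int = 2000
    initial_call_paths: int = 10

@dataclass
class Thresholds:
    # Minimum acceptable per-dimension quality scores, e.g. [4, 4, 4, 4],
    # plus server resource ceilings (values are illustrative).
    quality_score: list = field(default_factory=lambda: [4, 4, 4, 4])
    cpu_percent: float = 90.0
    memory_percent: float = 85.0

cfg = CallModelConfig(resolution="720p", initial_call_paths=4)
th = Thresholds()
print(cfg.initial_call_paths, th.quality_score)  # 4 [4, 4, 4, 4]
```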
step S806, the simulated terminals access the call center;
configure the access information of the target call center and access it through the SIP protocol. The simulated terminals (which provide the audio/video streams) place calls according to the model and number of call paths configured in step S804 and send the test-set audio/video streams from step S802 (actual audio/video streams may also be used);
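For illustration, a terminal's SIP access could begin with a REGISTER request such as the one built below. All addresses, tags, and identifiers are placeholder values, and a real client would also handle authentication challenges and responses per RFC 3261.

```python
def build_register(user: str, domain: str, contact_ip: str, cseq: int = 1) -> str:
    """Build a minimal SIP REGISTER request as a raw message string."""
    lines = [
        f"REGISTER sip:{domain} SIP/2.0",
        f"Via: SIP/2.0/UDP {contact_ip}:5060;branch=z9hG4bK-test-0001",
        "Max-Forwards: 70",
        f"From: <sip:{user}@{domain}>;tag=test-tag",
        f"To: <sip:{user}@{domain}>",
        f"Call-ID: test-call-id@{contact_ip}",
        f"CSeq: {cseq} REGISTER",
        f"Contact: <sip:{user}@{contact_ip}:5060>",
        "Expires: 3600",
        "Content-Length: 0",
    ]
    # SIP messages use CRLF line endings and a blank line after the headers.
    return "\r\n".join(lines) + "\r\n\r\n"

msg = build_register("term001", "callcenter.example.com", "192.0.2.10")
print(msg.splitlines()[0])  # REGISTER sip:callcenter.example.com SIP/2.0
```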
step S808, the simulated terminals receive the audio/video streams;
the audio/video streams received by each simulated terminal are fed into the CNN+LSTM neural network trained in step S802;
step S810, performing audio and video quality analysis by using the CNN+LSTM neural network;
audio/video quality analysis is performed by the CNN+LSTM neural network to obtain audio/video quality information, while the test system monitors the system resource (CPU, memory) usage of the server where the call center resides;
step S812, judging whether a threshold value is reached;
based on the quality information and the CPU usage information from step S810, determine whether the thresholds configured in step S804 have been crossed. If not, increase the number of call paths (i.e., the number of simulated terminals) to raise the video processing pressure, and jump to step S808; otherwise, proceed to step S814;
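The threshold-driven ramp-up loop of steps S808 to S812 can be sketched as follows, with a toy measurement function standing in for the neural network analysis and server resource monitoring; the stopping conditions and numbers are illustrative.

```python
def run_stress_test(measure, quality_floor, cpu_ceiling, start_paths=1, max_paths=1000):
    """Increase the number of simulated call paths until the measured quality
    drops below `quality_floor` in any dimension or server CPU usage exceeds
    `cpu_ceiling`. `measure(paths)` returns (quality_scores, cpu_percent)."""
    paths = start_paths
    while paths <= max_paths:
        scores, cpu = measure(paths)
        if any(s < f for s, f in zip(scores, quality_floor)) or cpu > cpu_ceiling:
            return paths, scores, cpu   # capacity found: threshold crossed here
        paths += 1                      # raise video processing pressure
    return paths - 1, scores, cpu

# Toy measurement: quality decays and CPU rises as call paths grow.
def fake_measure(paths):
    scores = [max(1, 5 - paths // 10)] * 4
    return scores, 5.0 * paths

paths, scores, cpu = run_stress_test(fake_measure, [4, 4, 4, 4], 90.0)
print(paths)  # 19: the first path count where CPU exceeds 90%
```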
step S814, outputting a performance test result report;
finally, a performance test result report is output according to the parameter information obtained in step S810, and the flow ends.
It should be understood by those skilled in the art that the number of levels, the division manner, the degradation manner, and the scoring manner in this scenario embodiment are illustrative; the application is not limited to any single manner.
The audio/video processing performance testing method and apparatus provided by the application can be applied to the audio/video performance and quality detection of a call center, where the call center is required to have normal network communication conditions and audio/video processing capability, and to support terminal registration, access, and media negotiation through the standard SIP protocol.
The application provides an audio/video processing performance testing method and apparatus, aiming to provide a performance testing system for the audio/video processing capability of a new generation of call centers, and to detect defects and bottlenecks in a call center's audio/video processing by combining deep learning and other technologies, thereby helping improve call center service quality.
The application constructs a data set from rich Internet audio/video resources, classifies it into training and test sets at multiple resolutions, and, by degrading and labeling the original audio and video of the training set at different levels, trains the test system's CNN+LSTM neural network to recognize video quality. The test system provides configurable parameters for customized audio/video call models, so that various call models for testing can be defined directly through configuration. The test system can access different call centers through the standard SIP protocol, send the test-set audio/video streams according to the model, and feed the received streams into the trained neural network to obtain an evaluation report of the current call center's video quality, gradually increasing the number of sent streams until the quality index drops to the set threshold.
The application provides each call center system with a detection tool for its audio/video processing capability, namely a deep-learning-based call center audio/video processing quality and performance detection system. It provides reliable performance detection indexes for the call center's audio/video processing capability, helps the call center find performance bottlenecks and fix them, and improves audio/video processing capability and user experience.
It should be understood by those skilled in the art that any system or scheme that detects the audio/video processing capability of a call center through deep learning and neural networks falls within the scope of the present application.
It will be apparent to those skilled in the art that the modules, units, or steps of the application described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by computing devices, and in some cases the steps shown or described may be performed in a different order; alternatively, they may be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. The audio and video processing performance test method is characterized by comprising the following steps of:
constructing an audio and video data set, and marking the audio and video data set;
inputting the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory neural network (CNN+LSTM) for training;
configuring parameter information of a call model, and setting a quality threshold of audio and video data and a system resource use information threshold;
inputting audio and video data from a terminal to the trained CNN+LSTM neural network model to obtain audio and video quality information;
judging whether the audio and video quality information and the pre-configured system resource usage information are less than or equal to the corresponding thresholds; if not, increasing the number of the terminals and repeating the step of inputting the audio and video data from the terminals into the trained CNN+LSTM neural network model to obtain the audio and video quality information, until the audio and video quality information and the pre-configured system resource usage information are less than or equal to the corresponding thresholds; and outputting a processing performance test result.
2. The method of claim 1, wherein the tagging of the audio-visual data set comprises at least one of:
degrading the audio and video data set according to different degradation modes;
dividing each degradation mode into a plurality of grades, and constructing a degradation grading vector;
and marking the audio and video data set according to the degradation mode and the classification level.
3. The method of claim 2, wherein the different degradation patterns comprise at least one of:
blurring the video;
video frame extraction;
the audio and video are not synchronous;
and adding noise to the audio.
4. A method according to claim 3, wherein said classifying each of said degradation patterns into a plurality of levels, constructing a degradation score vector, comprises at least one of:
the video blurring is graded as no blurring plus Gaussian filtering with convolution kernels of four sizes: 3×3, 5×5, 9×9, and 15×15;
the video frame extraction is graded as no frame extraction plus extraction at intervals of 5, 10, 15, and 30 frames;
the audio and video desynchronization is graded as shifts of -5 s, -2 s, 0 s, 2 s, and 5 s;
the audio noise addition is graded as no noise plus zero-mean Gaussian white noise at four increasing variances;
constructing a degradation score vector Score = [a1, a2, a3, a4], wherein a1 represents the level score of video blurring, a2 the level score of video frame extraction, a3 the level score of audio/video desynchronization, and a4 the level score of audio noise addition, with 1 ≤ ai ≤ 5; the highest score, i.e., the score vector of undegraded video, is [5, 5, 5, 5].
5. The method of claim 1, wherein the parameter information of the call model comprises at least one of:
number of call paths;
the sending time of the audio and video data;
video resolution;
an encoding mode of audio and video data;
the transmission code rate of the audio and video data.
6. The method of claim 5, further comprising, after said configuring the parameter information of the call model:
configuring access information of a target call center, wherein the target call center supports a plurality of different call models;
the terminal accesses the target call center through the SIP protocol.
7. An audio and video processing performance testing device, comprising:
the data processing module is used for constructing an audio and video data set and marking the audio and video data set;
the neural network training module is used for inputting the marked audio and video data set into a convolutional neural network cascaded with a long short-term memory neural network (CNN+LSTM) for training;
the first configuration module is used for configuring parameter information of the call model and setting a quality threshold value of audio and video data and a system resource use information threshold value;
the test module is used for inputting the audio and video data from the terminal into the trained CNN+LSTM neural network model to obtain audio and video quality information;
the judging and circulating module is used for judging whether the audio and video quality information and the pre-configured system resource usage information are less than or equal to the corresponding thresholds; if not, increasing the number of the terminals and returning to the test module, until the audio and video quality information and the pre-configured system resource usage information are less than or equal to the corresponding thresholds;
and the reporting module is used for outputting a processing performance test result according to the audio and video quality information and the system resource usage information when both are less than or equal to the corresponding thresholds.
8. The apparatus of claim 7, wherein the data processing module further comprises at least one of:
a degradation unit for degrading the audio-video data set according to different degradation modes;
a grade dividing unit for dividing each degradation mode into a plurality of grades and constructing a degradation grading vector;
and the marking unit is used for marking the audio and video data set according to the degradation mode and the classification level.
9. The apparatus of claim 8, wherein the different degradation patterns comprise at least one of:
blurring the video;
video frame extraction;
the audio and video are not synchronous;
and adding noise to the audio.
10. The apparatus of claim 7, wherein the parameter information of the call model comprises at least one of:
number of call paths;
the sending time of the audio and video data;
video resolution;
an encoding mode of audio and video data;
the transmission code rate of the audio and video data.
11. The apparatus as recited in claim 7, further comprising:
a second configuration module, configured to configure access information of a target call center, where the target call center supports a plurality of different call models;
and the terminal access module is used for accessing the terminal into the target call center through an SIP protocol.
12. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 6.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 6 when executing the computer program.