CN116013367A - Audio quality analysis method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116013367A
Authority
CN
China
Prior art keywords: audio, audio data, analysis, quality, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211739631.6A
Other languages
Chinese (zh)
Inventor
方博伟
朋尔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211739631.6A priority Critical patent/CN116013367A/en
Publication of CN116013367A publication Critical patent/CN116013367A/en
Pending legal-status Critical Current


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The application provides an audio quality analysis method and apparatus, an electronic device, and a storage medium, with which audio quality can be analyzed directly and accurately from multiple dimensions. First, target audio data to be analyzed is acquired. Audio signal processing is then performed on the acquired target audio data to obtain target audio data with an enlarged frequency bandwidth, which improves the accuracy of quality analysis and enables quality analysis of audio of various frequency bandwidths. Finally, audio analysis is performed on the target audio data using a neural network model to obtain audio analysis information of the target audio data, where the audio analysis information includes analysis results in a plurality of quality analysis dimensions. Because each quality analysis dimension corresponds to a factor that affects audio quality, problems affecting audio quality can be identified from the obtained analysis results, and the causes affecting audio quality can be located.

Description

Audio quality analysis method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio quality analysis method and apparatus, a neural network model processing method and apparatus for audio analysis, a voice quality analysis method and apparatus, an electronic device, and a storage medium.
Background
In recent years, with the wide application of real-time communication (RTC) technology in education, office work, entertainment, and social networking, applications such as online education, video conferencing, and live streaming have grown explosively. In real-time communication scenarios, audio quality directly affects the user experience. However, the universality and accuracy of audio quality analysis in the related art still need improvement.
Therefore, how to provide a universally applicable audio quality analysis method, so as to objectively and accurately evaluate audio quality in real-time communication, is of research significance for further advancing real-time communication technology.
Disclosure of Invention
The embodiments of the application provide an audio quality analysis method and apparatus, an electronic device, and a storage medium, so as to solve one or more of the technical problems in the related art.
In a first aspect, embodiments of the present application provide a method for analyzing audio quality, the method including:
acquiring target audio data to be analyzed;
performing audio signal processing on the target audio data to obtain target audio data with enlarged frequency bandwidth;
and performing audio analysis on the target audio data by using a neural network model to obtain audio analysis information of the target audio data, wherein the audio analysis information comprises analysis results under a plurality of quality analysis dimensions.
In a second aspect, embodiments of the present application provide a method for processing a neural network model for audio analysis, the method including:
acquiring an audio data sample marked with corresponding audio analysis information;
performing audio signal processing on the audio data samples to obtain audio data samples with enlarged frequency bandwidth;
training a neural network model based on the audio data samples, the neural network model for determining audio analysis information by audio analysis, the audio analysis information comprising analysis results in a plurality of quality analysis dimensions.
In a third aspect, an embodiment of the present application provides a method for analyzing speech quality, where the method includes:
acquiring voice data transmitted in real time in an audio-video session;
performing audio signal processing on the voice data to obtain voice data with enlarged frequency bandwidth;
performing audio analysis on the voice data by using a neural network model to obtain audio analysis information of the voice data, wherein the audio analysis information comprises analysis results under a plurality of quality analysis dimensions;
and providing the audio analysis information to at least one client participating in the audio-video session.
In a fourth aspect, embodiments of the present application provide a method for analyzing audio quality, the method including:
acquiring audio analysis information of target audio data, wherein the audio analysis information is obtained by performing audio analysis on the target audio data by using a neural network model, the frequency bandwidth of the target audio data is enlarged by audio signal processing, and the audio analysis information comprises analysis results under a plurality of quality analysis dimensions;
the audio analysis information is provided based on a client.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor implements any of the above methods when executing the computer program.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements a method as in any of the above.
Compared with the related art, the method has the following advantages:
according to the embodiment of the application, the target audio data to be analyzed is first obtained, where the target audio data may be a time domain audio signal in real-time communication or an audio file. Because the neural network model can directly analyze the target audio data without data except audio signals, such as data of packet loss rate, signal to noise ratio and the like monitored by an SDK (Software Development Kit ) embedded point in a real-time communication program, the audio quality analysis method provided by the embodiment of the application is higher in application flexibility, wider in application range and higher in universality.
And performing audio signal processing on the obtained target audio data to obtain target audio data with enlarged frequency bandwidth, so as to recover the lost information of the target audio data with low sampling rate in a higher frequency band, and realize quality analysis of various frequency bandwidth audios. If the target audio data is processed by an audio signal, and the up-sampling (Upsampling) is carried out on the audio data with the frequency of 48kHz, the full-band audio quality analysis of narrow band, wide band and ultra-wide band can be realized, the method is suitable for application scenes such as narrow-band conversation, high-quality live broadcasting, conferences, education and the like, and the universality and usability of the audio quality analysis method are improved.
The target audio data is subjected to audio analysis by using the neural network model, so that analysis results under a plurality of quality analysis dimensions are obtained, multi-angle analysis of the target audio data is realized, and dimensions causing problems can be examined according to the analysis results under the condition that the audio quality is in question. Since the quality analysis dimension has a correspondence with the factor affecting the audio quality, it is also possible to locate the root cause affecting the audio quality based on the obtained analysis result and provide advice for improving the audio quality based on the located cause.
Before the audio analysis is performed on the target audio data using the neural network model, frequency-division processing can be performed on the target audio data, and audio features can then be extracted separately from a plurality of sub-band spectra corresponding to different frequency ranges. When the sub-band spectra obtained after frequency division carry different amounts of information, audio features of more dimensions can be extracted for the frequency bands carrying more information, thereby improving the accuracy of the audio quality analysis. When the audio features are extracted, mel-spectrum features, which better match the auditory characteristics of the human ear, can be selected so that the audio quality analysis results are closer to human hearing.
When constructing the neural network model for audio analysis, a fitting function between audio data parameters and audio quality analysis information can be constructed from the initial audio data samples and their corresponding audio quality analysis information; the fitting function is then used to label parameter-adjusted copies of the initial audio data with corresponding audio quality analysis information, producing new audio data samples. This expands the volume of audio data samples while reducing the labor cost of annotating them. Training the neural network model for audio analysis with the enlarged sample set can improve the accuracy of the model.
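As an illustrative sketch of this augmentation idea (not part of the claimed method: the choice of parameters, the linear form of the fitting function, and all numeric values below are assumptions), the following Python snippet fits a simple least-squares model from audio data parameters to MOS labels and uses it to auto-label parameter-adjusted samples:

```python
import numpy as np

# Hypothetical setup: each initial sample is described by adjustable
# parameters (here: gain in dB, simulated packet-loss rate) and carries
# a human-annotated MOS label.
params = np.array([[0.0, 0.00], [-6.0, 0.05], [-12.0, 0.10], [-3.0, 0.20]])
mos = np.array([4.6, 3.8, 2.9, 3.1])

# Fit a linear "fitting function" MOS ~ w . params + b by least squares.
X = np.hstack([params, np.ones((len(params), 1))])
w, *_ = np.linalg.lstsq(X, mos, rcond=None)

def label_augmented(new_params: np.ndarray) -> np.ndarray:
    """Auto-label parameter-adjusted copies of the initial audio samples."""
    Xn = np.hstack([new_params, np.ones((len(new_params), 1))])
    return np.clip(Xn @ w, 0.0, 5.0)  # keep predicted MOS within [0, 5]

print(label_augmented(np.array([[-9.0, 0.08]])))  # label for a new sample
```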
The foregoing description is merely an overview of the technical solutions of the present application. In order that the technical means of the present application may be more clearly understood and implemented in accordance with the content of this specification, and in order to make the above and other objects, features, and advantages of the present application more apparent, the detailed description of the present application follows.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the application and are not to be considered limiting of its scope.
FIG. 1A shows a schematic view of a scenario of an analysis scheme for audio quality provided in an embodiment of the present application;
FIG. 1B illustrates a schematic view of a scenario of another audio quality analysis scheme provided in an embodiment of the present application;
FIG. 2 shows a flow chart of a method of analyzing audio quality provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a neural network model structure according to an embodiment of the present disclosure;
FIG. 4A is a schematic diagram of a Mel-spectrum feature sequence provided in an embodiment of the present application;
FIG. 4B is a schematic view of another Mel-spectrum feature sequence provided in an embodiment of the present application;
FIG. 5A illustrates a process diagram for extracting audio features provided in an embodiment of the present application;
FIG. 5B illustrates another process diagram for extracting audio features provided in an embodiment of the present application;
FIG. 6A shows a schematic diagram of a client-based audio quality analysis scheme provided in an embodiment of the present application;
FIG. 6B shows a schematic diagram of a server-based audio quality analysis scheme provided in an embodiment of the present application;
FIG. 7 illustrates a flow chart of a method of processing a neural network model for audio analysis provided in an embodiment of the present application;
FIG. 8 is a flow chart illustrating a method of analyzing speech quality provided in an embodiment of the present application;
FIG. 9 shows a flow chart of another method of analysis of audio quality provided in an embodiment of the present application;
FIG. 10 shows a block diagram of the structure of an audio quality analysis apparatus provided in an embodiment of the present application;
FIG. 11 shows a block diagram of a processing device for neural network models for audio analysis, provided in an embodiment of the present application;
FIG. 12 is a block diagram showing the construction of a voice quality analysis apparatus provided in an embodiment of the present application;
FIG. 13 shows a block diagram of another audio quality analysis apparatus provided in an embodiment of the present application; and
FIG. 14 shows a block diagram of an electronic device used to implement an embodiment of the present application.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following describes related technologies of the embodiments of the present application. The following related technologies may be optionally combined with the technical solutions of the embodiments of the present application, which all belong to the protection scope of the embodiments of the present application.
In one related art prior to this application, audio quality analysis relies on data provided by an SDK integrated in the audio/video communication program: the audio quality analysis result is obtained through algorithmic analysis based on information such as the packet loss rate, silence, and signal-to-noise ratio of the audio signal collected by the data collection module in the SDK. Since this audio quality analysis scheme depends on SDK-internal data, its use is limited, and its universality across scenarios is low.
In other related art prior to this application, audio quality analysis was limited by one or more of the following factors: whether the quality analysis requires reference content (e.g., a clean audio signal used as a reference sample), whether real-time quality analysis is supported, the bandwidth of the audio signal (e.g., narrowband, wideband, or ultra-wideband), and whether multi-dimensional quality analysis is supported. Table 1 summarizes how several related technologies are limited with respect to these factors.
| | PESQ | POLQA | P.563 |
|---|---|---|---|
| Reference content required | Yes | Yes | No |
| Frequency band | Narrowband/wideband | Narrowband/wideband/ultra-wideband | Narrowband |
| Supports real-time analysis | No | No | Yes |
| Supports multi-dimensional analysis | No | No | No |

TABLE 1 Limiting factors of audio quality analysis techniques
As shown in Table 1, PESQ (Perceptual Evaluation of Speech Quality) requires reference content when performing quality analysis on audio, supports only narrowband and wideband audio signals, and does not support real-time analysis; POLQA (Perceptual Objective Listening Quality Analysis) can perform quality analysis on full-bandwidth (narrowband, wideband, and ultra-wideband) audio signals, but requires reference content and does not support real-time analysis; P.563 (a single-ended objective speech quality evaluation standard for narrowband communications) requires no reference content and supports real-time analysis, but can only analyze narrowband audio signals. It follows that PESQ, POLQA, and P.563 are each limited to different degrees by the factors above, and all three techniques can only provide a single MOS (mean opinion score) rather than audio quality analysis results from multiple dimensions. In practical applications, the audio quality analysis results may also be less accurate due to these limitations.
In view of the foregoing, embodiments of the present application provide a new audio quality analysis scheme to solve the above-mentioned technical problems in whole or in part.
The embodiments of the application relate to an audio quality analysis scheme applied to real-time voice or video communication scenarios, where the audio quality analysis module involved may be deployed on the client side of the audio-video communication or on the server side. In an audio-video communication scenario, the audio quality analysis scheme provided by the embodiments of the application can perform multi-dimensional quality analysis on audio transmitted in real time. In this scheme, the neural network model for analyzing audio quality can analyze ultra-wideband audio sampled at 48 kHz. By upsampling audio data with a lower sampling rate to 48 kHz before it is sent to the neural network model, the scheme is suitable for quality analysis of full-bandwidth audio data: narrowband (e.g., 8 kHz), wideband (e.g., 16 kHz), and ultra-wideband. Because the neural network model analyzes the real-time audio signal directly, no other audio is needed as a reference sample, giving the scheme higher usability than related-art schemes that must analyze audio quality against reference content. The analyzed dimensions can include audio continuity, noise, coloring, loudness, and a comprehensive dimension; after the neural network model analyzes the audio data, the audio quality analysis results corresponding to these five dimensions can be output separately, realizing multi-dimensional analysis of the audio data. It can be understood that the audio quality analysis scheme provided in the embodiments of the application may also be applied to offline analysis scenarios, where audio quality analysis is performed on offline audio files.
FIG. 1A is a schematic view of an exemplary scenario for implementing the audio quality analysis scheme according to an embodiment of the present application. As shown in FIG. 1A, a user at a first client communicates in real time with multiple users at second clients through a network, and the audio data generated by the user at the first client is transmitted to the second clients through the network after audio preprocessing at the first client. In this scenario, the audio quality analysis module may be deployed at the first client, the second clients, and the server; it may analyze the quality of the audio data obtained at nodes A, B, C, and D and return the audio quality analysis results to the user at the first client, so that this user can monitor the audio quality of the real-time communication as it happens. It can be appreciated that, in practical applications, the audio quality analysis results may also be provided to the users at the second clients according to application requirements. It should be noted that all data referred to in this application (including but not limited to data for analysis, stored data, and displayed data) is authorized by the user or fully authorized by all parties; the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for users to grant or decline authorization.
In the above scenario, the specific transmission process of the audio data is illustrated in FIG. 1B, a schematic diagram of another audio quality analysis scheme provided in an embodiment of the present application. FIG. 1B details, for the application scenario of FIG. 1A, the process in which the first client collects audio data, preprocesses the audio, encodes it with its audio encoder, and pushes it to the server, and in which the second client pulls the stream from the server, decodes it with its audio decoder, and plays it after processing such as mixing and a limiter. As shown in FIG. 1B, the audio data acquired at node A is collected by the microphone device of the first client, i.e., the original audio stream in the real-time audio communication scenario. The audio data acquired at node B has been processed by one or more audio preprocessing methods at the first client, i.e., the uplink audio stream in the real-time audio communication scenario; the preprocessing methods involved may include AGC (Automatic Gain Control), ANS (Adaptive Noise Suppression), AEC (Acoustic Echo Cancellation), and the like. The audio data acquired at node C is obtained at the server and decoded by a server-side decoder, and also belongs to the uplink audio stream. The audio data acquired at node D is obtained by decoding the downlink audio stream with the decoder of the second client, i.e., the audio data obtained at the receiving end, which is the audio heard by the user at the second client.
The audio quality analysis results fed back to the first client by the audio quality analysis module may cover five dimensions: comprehensive, continuity, noise, coloring, and loudness. The comprehensive dimension measures the overall quality of the audio. The continuity dimension measures the auditory continuity of the audio, so as to evaluate the impact of frame loss during capture at the first client and of network packet loss during transmission. The noise dimension measures the noise condition of the audio, so as to evaluate the noisiness of the capture environment at the first client and the suppression capability of the noise reduction algorithm, without requiring the signal-to-noise ratio of the audio. The coloring dimension measures the clarity and integrity of the audio spectrum; in practice, factors such as bitrate changes, filtering, and packet loss concealment all cause obvious changes in the coloring analysis result. The loudness dimension measures the volume of the audio; when the loudness drops, the listening experience also degrades, affecting the perceived quality of the audio.
It is noted that the above five dimensions are orthogonal and mutually independent: a change in audio quality in any one dimension does not affect the quality analysis of the audio in the other dimensions. It will be appreciated that, beyond these five dimensions, other audio quality analysis dimensions may be added or the existing dimensions adjusted as required in practical applications; the specific analysis dimensions are not limited in this application.
In the quality analysis results provided by the embodiments of the present application, each dimension reflects audio quality with a MOS value; the possible MOS values and their corresponding descriptions are shown in Table 2.
| MOS value | Description |
|---|---|
| [4,5] | Excellent |
| [3,4) | Good |
| [2,3) | Fair |
| [1,2) | Poor |
| [0,1) | Bad |

TABLE 2 MOS values and corresponding descriptions
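A minimal Python sketch of the Table 2 mapping (the function name is ours, not the patent's):

```python
def mos_description(mos: float) -> str:
    """Map a MOS value in [0, 5] to its verbal rating from Table 2."""
    if mos >= 4:
        return "Excellent"
    if mos >= 3:
        return "Good"
    if mos >= 2:
        return "Fair"
    if mos >= 1:
        return "Poor"
    return "Bad"

assert mos_description(3.15) == "Good"       # matches the examples below
assert mos_description(4.62) == "Excellent"
```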
In a real-time communication scenario, on the one hand, the audio quality of the uplink audio data can be analyzed and the analysis results dynamically updated to the user at the first client, so that this user can monitor the quality of the uplink audio data in several analysis dimensions during real-time communication and can locate and solve quality problems in a particular dimension. For example, in a live teaching scenario, when a teacher sees that the MOS of the uplink audio data in the continuity dimension has dropped below 3, packet loss caused by poor network quality can be checked in time and the network switched to improve the students' listening experience, avoiding a situation in which choppy audio in the live lesson leaves students unable to follow the content and delays the teaching process.
On the other hand, the audio quality of the downlink audio data may also be analyzed. The audio quality analysis results of the downlink audio data may be used to evaluate the end-to-end audio quality. For example, in a multi-person video conference, by analyzing the audio quality of each participant individually, when a participant reports poor audio, the multi-dimensional analysis results can be used to trace the problem to the downlink network or to devices or acoustic components at the second client. Further, combined with the quality analysis results of the uplink audio, a solution for improving the audio quality can be offered to the user at the second client. For example, once it has been determined that the uplink audio quality is fine, a low continuity MOS for the downlink audio data can prompt the user at the second client to switch networks for a smoother listening experience. Similarly, the audio quality analysis scheme provided by the embodiments of the application can also be applied to scenarios such as voice chat rooms, show live streaming, online singing practice rooms, video calls with relatives, online desktops, and network radio stations; the embodiments of the application do not limit the specific application scenario.
The execution body of the embodiments of the present application may be an application, a service, an instance, a functional module in software form, a virtual machine (VM), a container, a cloud server, or the like, or a hardware device (such as a server or terminal device) or hardware chip (such as a CPU, GPU, FPGA, NPU, AI accelerator card, or DPU) with data processing capability. The apparatus implementing the audio quality analysis may be deployed on a computing device of the application side providing the corresponding service, or on a cloud computing platform providing computing power, storage, and network resources; the platform may provide its services externally as IaaS (Infrastructure as a Service), PaaS (Platform as a Service), SaaS (Software as a Service), or DaaS (Data as a Service). Taking SaaS as an example, the cloud computing platform can use its own computing resources to provide training of the neural network model or the audio quality analysis module, and a specific application architecture can be built according to service requirements. For example, the platform may provide a model-based construction service to applications or individuals using the platform's resources, and further invoke the model to analyze audio quality online or offline in response to audio quality analysis requests submitted by relevant clients or servers.
An embodiment of the present application provides a method for analyzing audio quality, as shown in fig. 2, which is a flowchart of a method 200 for analyzing audio quality according to an embodiment of the present application, where the method 200 may include:
in step S201, target audio data to be analyzed is acquired.
According to the audio quality analysis method provided by the embodiments of the application, quality analysis can be performed on an audio signal. To distinguish the audio signals involved in the embodiments of the application, the audio data whose quality is to be analyzed is denoted as the target audio data to be analyzed. In an online application scenario, the target audio data may be audio data collected in real time by the client's audio capture device, or audio data collected at some node along the network transmission path, such as uplink audio data and downlink audio data. In an offline analysis scenario, the target audio data may be an audio file, such as a wav file, or an audio signal extracted from a video file.
In one possible implementation, acquiring the target audio data to be analyzed may include acquiring at least one of: original audio data captured at the audio providing end, uplink audio data submitted from the audio providing end to the server, and downlink audio data obtained from the server by the audio receiving end, where the uplink audio data and/or the downlink audio data have undergone at least one of audio preprocessing and audio encoding/decoding. The original audio data refers to audio data directly captured by the sound capture device (such as a microphone) on the client side; the audio data acquired at node A in FIG. 1B is original audio data. The uplink audio data refers to audio data submitted by the audio provider to the server, i.e., transmitted to the server over the uplink network. As shown in FIG. 1B, the uplink audio data may be acquired at node B in the first client, i.e., the audio data to be encoded after audio preprocessing of the original audio data, or at node C on the server side, i.e., the audio data pushed by the first client to the server and then decoded by a server-side decoder. The downlink audio data may be acquired at node D, i.e., the audio data played to the user at the second client after the second client decodes it, mixes it, and applies a limiter.
By performing real-time quality analysis on the original audio data, the quality of the audio captured by the sound capture device can be fed back to the audio providing end. For example, when the original audio data receives a MOS below 3 in the noise dimension, this indicates that its signal-to-noise ratio is low and that noise has a large impact on the audio. By performing real-time quality analysis on the uplink audio data and combining it with the quality analysis results of the original audio data, the effect of audio preprocessing on audio quality can be analyzed dynamically. For example, in real-time communication, when the quality analysis result of the uplink audio data is significantly better than that of the original audio data, the audio preprocessing has greatly improved the audio quality. If the quality analysis result of the uplink audio data is lower than that of the original audio data, the audio providing end needs to adjust the audio preprocessing in time, for example the automatic gain control, adaptive noise suppression, and acoustic echo cancellation applied to the original audio data, so as to eliminate the negative impact of preprocessing on audio quality. By performing real-time quality analysis on the downlink audio data, the end-to-end audio quality in a communication scenario can be monitored in real time, so that when a user at the receiving client reports poor audio, the cause of the poor quality can be investigated promptly.
In one possible implementation, before the audio quality analysis is performed, the initial audio data may be segmented into sentences, and the single-sentence audio data after segmentation is used as the target audio data. The audio data initially provided by the audio providing end, without any processing, is denoted as the initial audio data. In a real-time analysis scenario, the initial audio data may be the audio continuously captured by the audio capture device of the client providing the audio; in an offline analysis scenario, it may be an audio file. The initial audio data can be segmented according to a preset audio duration, i.e., by setting the duration of the single-sentence audio data after segmentation, to obtain the target audio data. Where storage and computing resources are limited, limiting the duration of the target audio data through sentence segmentation allows audio quality analysis results to be obtained more quickly. It can be understood that an overly short duration (e.g., 2-3 seconds) leaves the target audio data with too little data, which degrades the accuracy of the audio quality analysis; a reasonable single-sentence duration therefore needs to be set to guarantee both the amount of data and the accuracy of the analysis.
For example, in one application example, the duration of the single-sentence audio data may be set to 8-15 seconds. In a scenario where the client analyzes audio quality in real time, a data buffer of 8-15 seconds can be set up on the client, real-time initial audio data is continuously written into the buffer, and the audio data in the buffer is then taken as the target audio data.
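A minimal sketch of such a rolling client-side buffer, assuming 48 kHz mono float samples (the class and method names are illustrative, not from the patent):

```python
import numpy as np

class AudioAnalysisBuffer:
    """Rolling buffer holding the most recent `seconds` of audio."""

    def __init__(self, sample_rate: int = 48_000, seconds: float = 10.0):
        self.max_samples = int(sample_rate * seconds)
        self._buf = np.zeros(0, dtype=np.float32)

    def push(self, frame: np.ndarray) -> None:
        # Append the newest frame, dropping the oldest samples over capacity.
        self._buf = np.concatenate([self._buf, frame.astype(np.float32)])
        if len(self._buf) > self.max_samples:
            self._buf = self._buf[-self.max_samples:]

    def ready(self) -> bool:
        return len(self._buf) >= self.max_samples

    def snapshot(self) -> np.ndarray:
        """Return the buffered audio as the target audio data to analyze."""
        return self._buf.copy()
```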
In addition, interfering audio data may be removed from the target audio data before the audio quality analysis. The interfering audio data includes at least one of noise, blank audio, and music, i.e., audio data that would distort the audio quality result. For example, if the target audio data is mostly blank audio, i.e., the proportion of actual speech is low, the analysis result will be inaccurate because of the insufficient amount of audio data, and such a result has no reference value. Removing the interfering audio data from the target audio data in advance therefore further improves the accuracy of the audio quality analysis. In one application example, algorithms such as voice activity detection, noise reduction, and automatic gain control may be used to identify and cut out the interfering audio before the quality analysis; the remaining audio segments are then spliced together and the splice result is taken as the updated target audio data.
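As a rough illustration of this step, the sketch below drops near-silent frames with a plain energy threshold and splices the remainder; a production system would use a real voice activity detector, and the frame length and threshold here are assumptions:

```python
import numpy as np

def strip_interference(audio: np.ndarray, sr: int,
                       frame_ms: int = 20, db_floor: float = -50.0) -> np.ndarray:
    """Remove near-silent frames and splice the remaining audio together."""
    frame = int(sr * frame_ms / 1000)
    kept = []
    for i in range(0, len(audio) - frame + 1, frame):
        chunk = audio[i:i + frame]
        rms = np.sqrt(np.mean(chunk ** 2)) + 1e-12
        if 20 * np.log10(rms) > db_floor:  # keep frames above the energy floor
            kept.append(chunk)
    return np.concatenate(kept) if kept else np.zeros(0, dtype=audio.dtype)
```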
While acquiring the target audio data to be analyzed, the following basic information of the corresponding audio signal can also be obtained: the sampling rate, total duration, speech duration, pop duration ratio, double-talk ratio, inter-frame silence ratio, special event marks, and the like. The total duration refers to the overall audio duration of the target audio data; the speech duration refers to the duration of speech in the target audio data (i.e., the duration excluding silent frames); the pop duration ratio refers to the proportion of time occupied by pops caused by excessive sensitivity of the audio capture device; the double-talk ratio refers to the proportion of time during which two parties talk simultaneously; the inter-frame silence ratio refers to the proportion of silence excluding double-talk periods; and special event marks flag special situations such as no speech or complete silence in the target audio data.
In step S202, audio signal processing is performed on the target audio data to obtain target audio data with an enlarged frequency bandwidth.
Due to limitations of the audio capture device and the capture environment, the sampling rate of the target audio data may be low, losing the information the audio carries in the higher frequency band. To recover this lost high-band information, audio signal processing can be performed on the target audio data to expand its frequency bandwidth, yielding target audio data with an enlarged frequency bandwidth. The audio signal processing involved may include upsampling, high-frequency reconstruction, spectral extrapolation, and the like. For example, the target audio signal may be upsampled with an interpolation filter to obtain target audio data with an enlarged frequency bandwidth, or high-frequency information may be reconstructed by band replication, copying information from the low-frequency sub-band to the high-frequency sub-band. Taking upsampling as an example, the target audio data can be upsampled to 48 kHz audio data, enabling full-band audio quality analysis covering narrowband, wideband, and ultra-wideband and improving the universality of the audio quality analysis method.
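A minimal sketch of the upsampling option using a polyphase interpolation filter (SciPy's resample_poly; high-frequency reconstruction and spectral extrapolation would be separate steps and are not shown):

```python
import numpy as np
from scipy.signal import resample_poly

def to_48k(audio: np.ndarray, orig_sr: int, target_sr: int = 48_000) -> np.ndarray:
    """Upsample narrowband/wideband audio to 48 kHz."""
    if orig_sr == target_sr:
        return audio
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# e.g. 1 s of 8 kHz narrowband speech -> a 48 kHz signal for full-band analysis
narrowband = np.random.randn(8_000).astype(np.float32)
fullband = to_48k(narrowband, orig_sr=8_000)
assert len(fullband) == 48_000
```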
In step S203, audio analysis is performed on the target audio data using a neural network model, so as to obtain audio analysis information of the target audio data, where the audio analysis information includes analysis results in a plurality of quality analysis dimensions.
Through pre-training, the neural network model involved can perform audio analysis on the input target audio data and feed back audio analysis information in a plurality of quality analysis dimensions. FIG. 3 shows a schematic diagram of a neural network model structure according to an embodiment of the present application. As shown in FIG. 3, when audio analysis is performed on a target audio signal in the form of a time-domain signal, frequency-division processing may first be applied to the signal; fast Fourier transforms (FFT, Fast Fourier Transform) are then performed separately on the low, mid, and high frequency sub-band spectra obtained by frequency division to extract mel-spectrum features, which are input into the neural network model for audio analysis. The neural network model may be constructed from a convolutional neural network (CNN, Convolutional Neural Networks) combined with a self-attention mechanism (Self-Attention) and a plurality of attention pooling layers (Attention-Pooling) corresponding to the quality analysis dimensions. The convolutional neural network further captures the data characteristics of the target audio data, and combining it with the self-attention mechanism improves analysis efficiency while preserving analysis accuracy.
The quality analysis dimensions include audio continuity, noise, audio coloring, and loudness, and the audio analysis information further includes a comprehensive analysis result of the target audio data across the quality analysis dimensions. Performing audio analysis on the target audio data with the neural network model provided by the embodiments of the application yields quality analysis results in the continuity, noise, audio coloring, loudness, and comprehensive dimensions. In the neural network model, the attention pooling layers are constructed according to the quality analysis dimensions: the model shown in FIG. 3 includes attention pooling layers corresponding to the five dimensions of comprehensive, continuity, noise, coloring, and loudness, and its quality analysis results for the target data are MOS scores in these dimensions. In practical applications, to analyze audio quality from more dimensions, pooling layers corresponding to other analysis dimensions can be added to the model as required; the embodiments of the present application do not limit this.
To further improve the accuracy of the neural network model's audio analysis, other algorithms or processing methods may be added to or adjusted in the model as appropriate; for example, the convolutional neural network combined with the self-attention mechanism may be replaced by a bidirectional long short-term memory (Bi-LSTM) recurrent neural network. The specific structure or choice of algorithms for the neural network model is not limited here.
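A PyTorch sketch of the described architecture, with CNN feature extraction, self-attention, and one attention-pooling head per quality dimension; all layer sizes and the pooling formulation are our assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention-weighted pooling over time, followed by a scoring head."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        w = torch.softmax(self.attn(x), dim=1)           # attention over time
        pooled = (w * x).sum(dim=1)                      # (B, D)
        return self.head(pooled).squeeze(-1)             # (B,) one MOS score

class AudioQualityNet(nn.Module):
    """CNN + self-attention trunk with one attention-pooling head per dimension."""
    DIMS = ("comprehensive", "continuity", "noise", "coloring", "loudness")

    def __init__(self, n_mels: int = 48, d_model: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * n_mels, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                               batch_first=True)
        self.heads = nn.ModuleDict({d: AttentionPool(d_model) for d in self.DIMS})

    def forward(self, mel: torch.Tensor) -> dict:
        # mel: (batch, frames, n_mels) -- a sequence of mel-spectrum features
        f = self.cnn(mel.unsqueeze(1))        # (B, 32, T, n_mels)
        f = f.permute(0, 2, 1, 3).flatten(2)  # (B, T, 32 * n_mels)
        f = self.proj(f)                      # (B, T, d_model)
        f, _ = self.self_attn(f, f, f)        # self-attention over frames
        return {d: head(f) for d, head in self.heads.items()}

scores = AudioQualityNet()(torch.randn(2, 15, 48))  # 15 frames of 48-dim mel
```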
In one possible implementation, before audio analysis is performed on the target audio data using the neural network model, frequency-division processing may be applied to the target audio data, i.e., band-splitting filtering. By filtering the target audio data into frequency bands, target audio data for a plurality of sub-band spectra corresponding to different frequency ranges can be obtained. It will be appreciated that the filtering conditions may be determined in advance based on how the information carried by the audio signal is distributed. For example, speech consists mainly of audio signals in the low and mid frequency bands, with the low band carrying the most information; filtering the target audio data into low, mid, and high sub-band spectra and analyzing the information carried in each band separately, such as by per-band feature extraction, therefore yields more accurate audio quality analysis results. In one application example, when frequency division is performed on target audio data with a sampling rate of 48 kHz, the data is filtered in 16 kHz bands, finally yielding a low-frequency sub-band spectrum up to 16 kHz, a mid-frequency sub-band spectrum from 16 kHz to 32 kHz, and a high-frequency sub-band spectrum from 32 kHz to 48 kHz.
In one possible implementation, before audio analysis is performed on the target audio data using the neural network model, a Fourier transform may also be applied to the target audio data to transform it from the time domain to the frequency domain. Because the audio signal carries more information in the frequency domain, performing the audio quality analysis on the target audio signal converted to the frequency domain yields more accurate results.
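A minimal sketch of this time-to-frequency step for a single 20 ms frame (the Hann window is a common choice here, not specified by the patent):

```python
import numpy as np

sr = 48_000
frame = np.random.randn(960).astype(np.float32)         # 20 ms at 48 kHz
spectrum = np.fft.rfft(frame * np.hanning(len(frame)))  # complex frequency bins
magnitude = np.abs(spectrum)                            # frequency-domain feature
```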
In one possible implementation, when audio analysis is performed on the target audio data using the neural network model, audio feature information of the target audio data is first extracted; the audio feature information includes at least one of time-domain information, frequency-domain information, perceptually relevant amplitude information, and perceptually relevant frequency-domain information. Time-domain and frequency-domain information refer to the information carried by the audio signal in the time and frequency dimensions, respectively; perceptually relevant amplitude and frequency-domain information refer to amplitude and frequency-domain information related to human perception. The audio feature information is then input into the neural network model for audio analysis of the target audio data.
The extracted audio feature information may be spectral features obtained by performing a Fourier transform on the target audio data, or mel-spectrum features extracted further from those spectral features. The mel spectrum is the audio spectrum under the mel scale; since the mel scale matches the auditory characteristics of the human ear, characterizing the target audio data with the mel spectrum yields quality analysis results closer to human hearing.
When extracting audio features from the target audio data, the features may be extracted directly via a fast Fourier transform, or frequency-division processing may be applied first and the audio features of the sub-band spectra corresponding to different frequency ranges extracted separately. When the sub-band spectra obtained after frequency division carry different amounts of information, more feature dimensions can be allocated to the bands carrying more information to improve the accuracy of the audio quality analysis. FIGS. 4A and 4B show mel-spectrum feature sequences obtained with the conventional scheme and the frequency-division scheme, respectively. Comparing them shows that the mel-spectrum features extracted with the frequency-division scheme better represent the amount of information carried by the audio signal in the low, mid, and high bands (the left, middle, and right parts of FIG. 4B), with the low band carrying the most. Thus, when extracting mel-spectrum features for the target audio data, the largest share of feature dimensions can be assigned to the low band; for example, when extracting 48-dimensional mel-spectrum features, the low, mid, and high bands may be allocated 24, 16, and 8 dimensions respectively.
FIGS. 5A and 5B illustrate the process of extracting 48-dimensional mel-spectrum features from target audio data upsampled to 48 kHz using the two schemes. As shown in FIG. 5A, in the conventional scheme a 4096-point FFT may be performed directly on the target audio data to obtain the 48-dimensional mel-spectrum features. As shown in FIG. 5B, in the frequency-division scheme the target audio data may first be divided into low, mid, and high sub-band spectra; a 1024-point FFT is then performed on the frequency-divided data to obtain 24-, 16-, and 8-dimensional mel spectra for the low, mid, and high sub-bands respectively, which are concatenated into a 48-dimensional mel spectrum serving as the audio features of the target audio data. In one application example, when extracting the 48-dimensional mel spectrum under the frequency-division scheme, the relevant parameters may be set as follows: audio frames of 20 ms with a 10 ms hop, mel-spectrum features extracted per frame, features grouped 15 frames at a time (150 ms), with a group hop of 4 frames (40 ms).
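A sketch of the frequency-division feature extraction under stated assumptions: rather than filtering into time-domain sub-bands, it approximates the split by applying three mel filterbanks over adjacent frequency ranges of a single 1024-point STFT, with the band edges (thirds of the Nyquist range) chosen for illustration:

```python
import numpy as np
import librosa

def bandsplit_mel(audio: np.ndarray, sr: int = 48_000) -> np.ndarray:
    """48-dim mel features: 24/16/8 dimensions for the low/mid/high bands.

    Uses 20 ms frames with a 10 ms hop, as in the application example.
    """
    n_fft, hop = 1024, sr // 100  # 10 ms hop at 48 kHz
    spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop,
                               win_length=int(0.02 * sr))) ** 2
    edges = [(0, sr / 6, 24), (sr / 6, sr / 3, 16), (sr / 3, sr / 2, 8)]
    feats = []
    for fmin, fmax, n_mels in edges:
        fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                 fmin=fmin, fmax=fmax)
        feats.append(np.log(fb @ spec + 1e-10))  # log-mel energy per band
    return np.concatenate(feats, axis=0)         # (48, n_frames)

mel48 = bandsplit_mel(np.random.randn(48_000).astype(np.float32))
print(mel48.shape)  # (48, n_frames)
```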
In one possible implementation, the audio analysis information obtained by the neural network model may include problem analysis information; when the audio quality of the target audio data is poor, i.e., there is a quality problem, this information can indicate the quality analysis dimension in which the problem occurs and the cause of the problem. When performing audio analysis on the target audio data with the neural network model, the audio quality scores output by the model can first be obtained, specifically the scores of the target audio data in the five quality analysis dimensions of continuity, noise, audio coloring, loudness, and the comprehensive dimension. In one application example, a possible set of audio quality scores output by the neural network model is shown in the following table:
| Dimension | Score |
|---|---|
| Integrated MOS | Good, 3.15 |
| Continuity MOS | Excellent, 4.16 |
| Coloring MOS | Good, 3.46 |
| Noise MOS | Fair, 2.9 |
| Loudness MOS | Excellent, 4.62 |

TABLE 3 Audio quality scores
After the audio quality scores are obtained, problem analysis information for the target audio data may be determined from them; the problem analysis information may include at least one of the audio quality problem, the root cause of the problem, and a solution. Specifically, the root cause may be analyzed for dimensions scoring 3 or less, and a solution provided at the same time. For example, a problem in the continuity dimension typically results from a high network packet loss rate; the solution may include checking the network connection and switching to a smoother network. A problem in the coloring dimension arises from missing parts of the audio spectrum, which lowers clarity and integrity; when it appears in both the uplink and downlink audio data, one can check whether the acoustic components of the capture device at the audio-providing client are in poor condition, whether the encoding bitrate is set improperly, or whether the encoder has quality problems. A problem in the noise dimension results from a low signal-to-noise ratio; the solution may include prompting the user at the audio-providing client to avoid capturing audio in a noisy environment. A problem in the loudness dimension arises from low audio volume; the solution may include prompting the user at the audio-providing client to raise the volume or adjust the microphone position.
In the above example, the scores of the target audio data in the continuity and loudness dimensions are above 4, indicating that the network transmitting the target audio data is smooth, the sensitivity of the capture device at the audio providing end is appropriate, and the audio volume shows no obvious problem. The coloring rating is Good, indicating that the target audio data sounds clear and complete. The score in the noise dimension is lower, indicating that the signal-to-noise ratio is low and noise is noticeable. For this application example, the client providing the audio can be prompted that the capture environment is noisy, which degrades the perceived quality of the target audio data.
In one possible implementation, the audio analysis information of the target audio data may also be presented as prompts to the terminal device corresponding to the target audio data. By prompting the device with the likely problems and their solutions, the user can adjust the operations of receiving or sending audio in time. For example, with reference to FIG. 1B, when the first client is prompted that the loudness of the uplink audio data is too low and that the volume should be adjusted, the user at the first client can speak louder or adjust the microphone position to increase the loudness of the captured audio; when the network condition of the second client is poor and the continuity MOS of the downlink audio data is low, a prompt to switch networks is received, and the user at the second client can adjust the network connection accordingly for a smoother listening experience.
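A sketch of the problem-analysis step, condensing the examples above into a lookup (the threshold of 3 follows the text; the wording of causes and suggestions is paraphrased):

```python
# Mapping from low-scoring dimensions to root causes and suggested fixes,
# condensed from the examples above.
DIAGNOSIS = {
    "continuity": ("high network packet loss rate",
                   "check the network connection or switch to a smoother network"),
    "coloring": ("missing audio spectrum (capture hardware, bitrate, or encoder)",
                 "check the capture device, encoding bitrate, and encoder"),
    "noise": ("low signal-to-noise ratio",
              "avoid capturing audio in a noisy environment"),
    "loudness": ("low audio volume",
                 "raise the volume or adjust the microphone position"),
}

def diagnose(scores: dict, threshold: float = 3.0) -> list:
    """Return (dimension, root cause, suggestion) for each problem dimension."""
    return [(dim, *DIAGNOSIS[dim])
            for dim, mos in scores.items()
            if dim in DIAGNOSIS and mos <= threshold]

print(diagnose({"integrated": 3.15, "continuity": 4.16, "coloring": 3.46,
                "noise": 2.9, "loudness": 4.62}))  # -> flags the noise dimension
```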
FIG. 6A shows a schematic diagram of a client-based audio quality analysis scheme provided in an embodiment of the present application. As shown in FIG. 6A, a data buffer that can temporarily hold 10 s of audio data is set up at the client to store the target audio data to be used for audio quality analysis in real-time communication. When audio quality analysis is performed on the target audio data in the buffer, basic information statistics are computed first, covering the sampling rate, the speech duration ratio, the pop duration ratio, and whether the target audio data is completely silent. In practical applications, to save computation, a threshold (e.g., 60%) can be set on the speech duration ratio, so that the audio quality analysis is run only when the speech duration ratio of the buffered target audio data exceeds the threshold.
After the basic information statistics are computed, the target audio data, upsampled to 48 kHz according to its sampling rate, is sent to the audio quality analysis module. The module divides the target audio data into frequency bands, extracts mel-spectrum features from the several sub-band spectra after the fast Fourier transform, and finally feeds the extracted mel-spectrum features to the pre-trained neural network model to obtain the audio quality analysis results. The neural network model outputs five scores: comprehensive MOS, continuity MOS, noise MOS, coloring MOS, and loudness MOS. These five scores are instantaneous scores for the target audio data currently in the buffer. After the analysis, the scores can be reported through instrumentation or saved to a log for the client, and an instrumentation curve can be drawn according to the needs of the user at the client, making it convenient for the user in real-time communication to monitor the audio quality trend.
Fig. 6B shows a schematic diagram of an analysis scheme of server-based audio quality provided in an embodiment of the present application. As shown in fig. 6B, according to the embodiment of the present application, the audio quality detection may also be performed on the offline audio file at the server. Wherein the offline audio file may be a 16-bit quantized wav file. And firstly, carrying out basic information statistics on the file, wherein the basic information can comprise the sampling rate of the audio file, statistics on the situation of complete silence or no speaking in the audio file, the sound burst duration ratio, the double-talk ratio, the inter-frame silence ratio, the voice duration and the like. The total duration of the audio file is not limited in the scene of offline audio quality analysis, but in order to accelerate the analysis efficiency when the actual analysis is carried out, the file can be segmented into small segments meeting the duration requirement for one-by-one analysis. Therefore, after acquiring a 16-bit quantized wav file to be subjected to audio quality analysis, it is necessary to segment the file in addition to the basic information statistics. When the audio file is segmented, blank audio and music audio in the file can be removed, and only voice audio is reserved. After slicing, the remaining audio is spliced into a plurality of target audio data (e.g., statement 1, statement 2, …, statement n in the figure) in time sequence, and saved as a wav file for later operation.
After the target audio data is acquired, the target audio data, up-sampled to 48kHz according to its sampling rate, is sent to the audio quality analysis module for audio quality analysis, so as to obtain the comprehensive MOS, continuity MOS, noise MOS, coloring MOS, and loudness MOS. The obtained results can be stored as CSV (Comma-Separated Values) files, JSON (JavaScript Object Notation) character strings, or other data forms; the storage form of the audio quality analysis result is not limited in the embodiments of the present application. In one example, the audio quality analysis results of the audio file may include basic information statistics, segmented sentence MOS results, and a final result.
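As an illustrative sketch of persisting the per-sentence results in the CSV and JSON forms mentioned above (the file names and field layout are assumptions):

```python
import csv
import json

FIELDS = ["file", "mos_pred", "noi_pred", "dis_pred", "col_pred", "loud_pred"]

def save_results(rows, csv_path="mos_results.csv", json_path="mos_results.json"):
    """Persist per-sentence MOS results as a CSV file and a JSON file.
    `rows` is a list of dicts keyed by FIELDS."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
```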
The basic information statistics result may read, for example, "detection valid, ret=0, file total duration=182.60s, voice duration ratio=0.44, voice duration=80.21s, voice average volume=-4.39dB, noise average volume=-65.21dB, inter-frame silence frame count=4, silence frame ratio=0.00, double-talk ratio=0.07, pop ratio=0.00". Here ret (result) being 0 indicates that the audio file is valid. When ret is not equal to 0, the audio file is an invalid file; for example, when most of the audio data is silence, the file is determined to be invalid.
The segmented sentence MOS result may be as shown in the following table:
file name mos_pred noi_pred dis_pred col_pred loud_pred
9.wav 0.978223 2.07175 2.83322 1.602849 1.273467
8.wav 2.65958 2.513888 3.073782 2.541225 3.116537
6.wav 3.208017 2.809731 4.399879 3.314154 3.499624
7.wav 3.271022 2.800086 3.494078 2.82462 3.53301
5.wav 3.124479 2.847238 4.129698 2.972225 3.676475
4.wav 2.630993 2.201362 3.989583 3.111258 3.229288
0.wav 2.85819 2.586794 3.437651 2.803623 3.159785
1.wav 2.832904 2.896578 4.096958 3.23921 2.750009
3.wav 2.758041 2.720349 3.973233 2.663443 3.02315
2.wav 3.289295 3.182507 3.875646 2.788188 3.575136
Table 4: Segmented sentence MOS results
As shown in Table 4, the segmented sentence MOS result records the neural network model's audio quality analysis result for each segmented sentence, each stored as an audio file in wav format. In the table, 0.wav, 1.wav, …, 9.wav are the file names of the audio files, and the obtained audio quality analysis results are the comprehensive MOS (mos_pred), noise MOS (noi_pred), continuity MOS (dis_pred), coloring MOS (col_pred), and loudness MOS (loud_pred) of each file (i.e., each sentence).
The final result of the audio file refers to the final audio quality analysis result obtained by weighted summation according to the duration of each audio sentence. For example, combining the basic information and the segmented sentence MOS results, the final result of the audio file may read: "tag: valid; comprehensive MOS: Good, 3.15; continuity: Excellent, 4.16; coloring: Good, 3.46; noise: Fair, 2.9, -65.21dB; loudness: Excellent, 4.62, -4.39dB; inter-frame silence ratio: 0.08; double-talk ratio: 6.84; pop ratio: 0.0."
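A minimal sketch of the duration-weighted summation described above (the function and key names are assumptions):

```python
def final_mos(segment_scores, durations):
    """Duration-weighted average of per-sentence MOS scores.

    segment_scores: list of dicts holding the five MOS values per sentence
    durations:      list of the corresponding sentence durations in seconds
    """
    total = sum(durations)
    keys = ["mos_pred", "noi_pred", "dis_pred", "col_pred", "loud_pred"]
    return {
        k: sum(s[k] * d for s, d in zip(segment_scores, durations)) / total
        for k in keys
    }
```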
The embodiment of the application further provides a method for processing a neural network model for audio analysis, as shown in fig. 7, which is a flowchart of a method 700 for processing a neural network model for audio analysis according to an embodiment of the application, where the method 700 may include:
In step S701, an audio data sample is acquired, the audio data sample being marked with corresponding audio analysis information.
In order to train the neural network model for audio analysis provided by the embodiments of the present application, audio data samples for training the model first need to be obtained. The audio data samples may be pre-recorded audio files, audio files of audio tracks separated from audio-video calls or live scenes, or similar audio signal files. Each audio data sample is correspondingly marked with audio analysis information, so that the neural network model can learn based on the audio data samples and their corresponding audio analysis information. In particular, the audio analysis information may include scores of the audio data sample in a plurality of quality analysis dimensions (e.g., continuity, noise, coloring, and loudness). Each audio data sample may be marked with scores corresponding to one or more quality analysis dimensions.
In one possible implementation, in order to expand the sample capacity and increase the amount of training data so that the training result is more accurate, when acquiring audio data samples, an initial audio data sample may be acquired first, and then the value corresponding to a parameter of the initial audio data sample may be changed to obtain a newly added audio data sample. The initial audio data sample has already been correspondingly marked with audio analysis information. The parameters of the initial audio data sample are parameters that affect the audio quality in a quality analysis dimension, such as packet loss rate (continuity), signal-to-noise ratio (noise), volume (loudness), and code rate (coloring). When acquiring the newly added data sample, a data sample for a specific quality analysis dimension can be generated. For example, for an initial audio data sample with a volume of -1dB, a newly added audio data sample for the loudness dimension can be obtained by adjusting the volume of the initial audio data sample to -10dB and saving the volume-adjusted audio as the new sample. Similarly, the values of the packet loss rate (e.g., from 0% to 50%), the signal-to-noise ratio (e.g., from -20dB to 20dB), and the code rate may also be adjusted to obtain more newly added audio data samples, as sketched below.
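A sketch of such parameter-based sample generation (the helper names are assumptions; the volume, SNR, and packet-loss manipulations follow standard signal-processing practice rather than any procedure specified here):

```python
import numpy as np

def augment_volume(audio: np.ndarray, gain_db: float) -> np.ndarray:
    """New loudness-dimension sample: scale the initial sample by gain_db."""
    return audio * (10.0 ** (gain_db / 20.0))

def augment_snr(audio: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """New noise-dimension sample: mix in noise at a target SNR."""
    noise = noise[: len(audio)]
    speech_power = np.mean(audio ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return audio + scale * noise

def augment_packet_loss(audio: np.ndarray, sr: int, loss_rate: float,
                        packet_ms: int = 20) -> np.ndarray:
    """New continuity-dimension sample: zero out randomly 'lost' packets."""
    out = audio.copy()
    plen = int(sr * packet_ms / 1000)
    for start in range(0, len(out) - plen, plen):
        if np.random.rand() < loss_rate:
            out[start:start + plen] = 0.0
    return out
```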
After the newly added audio data sample is obtained, the audio analysis information corresponding to it is determined, and then the initial audio data sample and the newly added audio data sample, both marked with audio analysis information, are taken as the audio data samples.
In one possible implementation, when determining the audio analysis information corresponding to the newly added audio data sample, a first fitting function corresponding to the quality analysis dimension may first be obtained. The first fitting function characterizes the numerical relationship between the audio analysis information corresponding to the initial and newly added audio data samples and the parameter of the initial audio data sample, where the audio analysis information refers to the score in the corresponding quality analysis dimension. For example, for the loudness dimension, when generating newly added data samples for that dimension, a small set of reference audio data samples may first be generated in order to derive the first fitting function for loudness. The reference audio data samples are obtained by adjusting the volume of the initial audio data sample, and since the quality analysis dimensions provided in the embodiments of the present application are mutually orthogonal, adjusting only the volume changes the resulting audio only in the loudness dimension compared with the initial audio. Because the reference audio data samples are generated by adjusting a parameter of the initial audio (e.g., the volume from -1dB to -10dB), the first fitting function can be determined by subjectively scoring the reference audio data samples (loudness MOS) and then fitting the loudness MOS against the volume.
Similarly, first fitting functions corresponding to the continuity, noise, and audio coloring dimensions can be generated. Then, according to the first fitting function, the audio analysis information corresponding to the initial audio data sample, and the values corresponding to the parameters of the initial audio data sample, the audio analysis information corresponding to each newly added audio data sample is determined.
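As a minimal illustration of deriving a first fitting function for the loudness dimension (the sample values and the quadratic form are illustrative assumptions; the application does not specify the fitting form):

```python
import numpy as np

# Volumes of a small set of manually scored reference samples and their
# subjective loudness MOS (values are illustrative placeholders).
volumes_db = np.array([-30.0, -20.0, -10.0, -5.0, -1.0])
loudness_mos = np.array([1.4, 2.3, 3.5, 4.1, 4.5])

# Quadratic fit as the first fitting function for the loudness dimension.
coeffs = np.polyfit(volumes_db, loudness_mos, deg=2)
loudness_fit = np.poly1d(coeffs)

# Label a newly generated -10 dB sample without re-scoring it manually.
print(loudness_fit(-10.0))
```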
In one possible implementation, the audio analysis information further includes a comprehensive analysis result, which may be used to measure the overall quality of the audio; in practical applications, it can express the overall auditory perception of the audio by the user at the receiving end. To obtain the comprehensive analysis result, when acquiring an audio data sample, the audio analysis information corresponding to the sample in the individual quality analysis dimensions may first be obtained (see the above embodiments, which are not repeated here). Then, according to a second fitting function and the audio analysis information corresponding to the plurality of quality analysis dimensions, the comprehensive analysis result corresponding to the audio data sample is determined and marked on the sample, so that the audio data sample is marked with audio analysis information in the continuity, noise, coloring, loudness, and comprehensive dimensions. The second fitting function characterizes, based on the initial audio data, the numerical relationship between the comprehensive analysis result (e.g., the comprehensive MOS) and the audio analysis information corresponding to the other quality analysis dimensions. In one application example, the second fitting function may be expressed as: comprehensive MOS = continuity MOS × w_1 + noise MOS × w_2 + coloring MOS × w_3 + loudness MOS × w_4, where w_1, w_2, w_3, and w_4 represent the coefficients corresponding to the continuity, noise, coloring, and loudness dimensions, respectively.
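The coefficients w_1 to w_4 of the second fitting function could, for instance, be estimated by least squares from subjectively scored initial samples; a sketch under illustrative placeholder values (the estimation method itself is an assumption):

```python
import numpy as np

# Rows: per-sample MOS scores in the continuity, noise, coloring, and
# loudness dimensions; y: subjective comprehensive MOS of the same
# initial audio data samples. All values are illustrative placeholders.
X = np.array([[4.2, 3.1, 3.8, 4.0],
              [2.5, 2.0, 3.0, 3.2],
              [3.6, 3.3, 3.5, 3.9],
              [4.8, 4.5, 4.6, 4.7],
              [1.9, 1.5, 2.2, 2.0]])
y = np.array([3.7, 2.4, 3.4, 4.6, 1.8])

# Least-squares estimate of w_1..w_4 in the second fitting function.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted-sum prediction of the comprehensive MOS for labeled samples.
comprehensive_mos = X @ w
```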
In order to further improve the accuracy of the audio analysis information marked on the audio data samples, a portion of the samples can be extracted for manual subjective verification, and the audio analysis information, the first fitting function, or the second fitting function can be calibrated accordingly.
In step S702, audio signal processing is performed on the audio data samples, and audio data samples with enlarged frequency bandwidth are obtained.
It can be understood that even after audio signal processing is performed on audio data with a low sampling rate, its audio quality analysis result (such as the MOS value) is still lower than that of audio data natively recorded at a high sampling rate. Therefore, in order to further improve the accuracy of the audio quality analysis, while the frequency bandwidth of the audio data is enlarged through audio signal processing, the MOS values marked on low-sampling-rate audio data samples can also be mapped accordingly, so that high-quality, low-sampling-rate audio data samples obtain audio quality analysis results that match the actual hearing perception.
In step S703, a neural network model is trained based on the audio data samples, the neural network model being used to determine audio analysis information by audio analysis, the audio analysis information comprising analysis results in a plurality of quality analysis dimensions.
For a specific implementation of the above steps and a specific description of the structure of the neural network model provided in the embodiments of the present application, reference may be made to the above embodiments, which are not repeated herein.
The embodiment of the present application further provides a method for analyzing voice quality, as shown in fig. 8, which is a flowchart of a voice quality analysis method 800 according to an embodiment of the present application, where the method 800 may include:
in step S801, voice data transmitted in real time in an audio-video session is acquired. The voice data may be a voice signal directly obtained from an audio session or a voice signal extracted from video data of a video session.
In step S802, audio signal processing is performed on the voice data to obtain voice data with an enlarged frequency bandwidth.
In step S803, audio analysis is performed on the voice data using a neural network model, so as to obtain audio analysis information of the voice data, where the audio analysis information includes analysis results in a plurality of quality analysis dimensions.
In step S804, the audio analysis information is provided to at least one client participating in the audio-video session.
The voice quality analysis method provided by the embodiments of the present application can be deployed at a client or a server, and can also be integrated in an audio-video session program in the form of an SDK. In audio-video session scenarios such as live broadcast and video conferencing, the voice quality of the plurality of clients participating in the session can be analyzed, and the audio analysis information can be provided to the clients. On the one hand, the sender of the voice can, according to the audio analysis information, make timely adjustments in the dimensions with lower MOS scores, thereby improving the audio quality of the uplink audio. On the other hand, the receiver can also make corresponding adjustments according to the audio analysis information of the downlink audio, thereby improving the user experience in the audio-video session. For specific implementations of voice quality analysis, reference may be made to the above examples, which are not repeated here.
The embodiment of the present application further provides another method for analyzing audio quality, as shown in fig. 9, which is a flowchart of a method 900 for analyzing audio quality according to an embodiment of the present application, where the method 900 may include:
in step S901, audio analysis information of target audio data, which is obtained by audio analysis of the target audio data using a neural network model, which has been expanded in frequency bandwidth by audio signal processing, is acquired, the audio analysis information including analysis results in a plurality of quality analysis dimensions.
In step S902, the audio analysis information is provided based on the client.
According to the embodiments of the present application, the audio analysis information of the target audio data can be provided to the client. In a real-time communication scenario, the audio analysis information is recorded at the client by buried-point reporting or by saving to a log; in practical applications, a buried-point curve can be drawn from the reported audio analysis information, so that the user on the client side can conveniently monitor the change curve of the audio quality in real time. During real-time communication, the audio analysis information can be dynamically updated on the interactive interface of the real-time communication; the displayed information may include the MOS values of the target audio data in multiple dimensions, as well as currently possible problems and corresponding solutions. For example, when the audio continuity is poor, a prompt suggesting that the user change the network connection can pop up on the interactive interface; when the loudness is poor, a prompt suggesting that the user adjust the microphone can pop up. A minimal sketch of such prompt logic is given below.
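This sketch maps low per-dimension scores to interface prompts; the threshold, key names, and wording are illustrative assumptions:

```python
def quality_hints(scores: dict, threshold: float = 3.0) -> list:
    """Map low per-dimension MOS scores to user-facing suggestions,
    as displayed on the real-time communication interface."""
    hints = []
    if scores.get("continuity", 5.0) < threshold:
        hints.append("Audio is choppy: consider changing the network connection.")
    if scores.get("loudness", 5.0) < threshold:
        hints.append("Volume is low: consider adjusting the microphone.")
    return hints
```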
Specific embodiments of audio quality analysis may be referred to the above examples, and will not be described herein.
Corresponding to the application scenarios and methods provided by the embodiments of the present application, an embodiment of the present application further provides an audio quality analysis apparatus. As shown in fig. 10, which is a block diagram of the structure of an audio quality analysis apparatus 1000 according to an embodiment of the present application, the apparatus 1000 may include:
a data acquisition module 1001, configured to acquire target audio data to be analyzed;
a signal processing module 1002, configured to perform audio signal processing on the target audio data to obtain target audio data with an enlarged frequency bandwidth;
the result analysis module 1003 is configured to perform audio analysis on the target audio data using a neural network model, so as to obtain audio analysis information of the target audio data, where the audio analysis information includes analysis results under a plurality of quality analysis dimensions.
In one possible implementation, the result analysis module 1003 may specifically include:
the information extraction sub-module is used for extracting audio characteristic information of the target audio data, wherein the audio characteristic information comprises at least one of time domain information, frequency domain information, perceptually relevant amplitude information and perceptually relevant frequency domain information;
and the information input sub-module is used for inputting the audio characteristic information into the neural network model so as to perform audio analysis on the target audio data.
In a possible implementation manner, the result analysis module 1003 may further include a data transformation sub-module, configured to perform fourier transform on the target audio data to transform the target audio data from a time domain dimension to a frequency domain dimension before performing audio analysis on the target audio data using the neural network model to obtain audio analysis information of the target audio data.
In one possible implementation, the apparatus 1000 may further include:
the frequency division module is used for carrying out frequency division processing on the target audio data before the audio analysis information of the target audio data is obtained by carrying out audio analysis on the target audio data by using the neural network model, so as to obtain target audio data comprising a plurality of sub-band spectrums corresponding to different frequency band ranges.
In one possible implementation, the audio analysis information includes problem analysis information; the result analysis module 1003 may include:
the score acquisition sub-module is used for acquiring an audio quality score output by using a neural network model for the target audio data;
An information determination sub-module for determining problem analysis information for the target audio data according to the audio quality score, the problem analysis information comprising: at least one of audio quality problems, root cause of problems, and problem solution.
In a possible implementation manner, the apparatus 1000 may be further configured to prompt, to a terminal device corresponding to the target audio data, audio analysis information of the target audio data.
In one possible implementation, the target audio data acquired by the data acquisition module 1001 includes at least one of the following:
original audio data collected by an audio providing end, uplink audio data submitted from the audio providing end to a server, and downlink audio data acquired by an audio receiving end from the server;
the uplink audio data and/or the downlink audio data are subjected to at least one of audio preprocessing and audio encoding and decoding.
In one possible implementation, the quality analysis dimension includes audio continuity, noise, audio coloration, or loudness, and the audio analysis information further includes a composite analysis result of the target audio data at a plurality of quality analysis dimensions.
In one possible implementation, interfering audio data is segmented out of the target audio data, the interfering audio data including at least one of noise, blank audio, and music; the apparatus 1000 may be further configured to splice the audio data remaining after the segmentation and update the splicing result as the target audio data.
In one possible implementation, the apparatus 1000 may be further configured to perform clause processing on the initial audio data, and use the single-sentence audio data after the clause as the target audio data.
Corresponding to the application scenarios and methods provided by the embodiments of the present application, an embodiment of the present application further provides a processing apparatus of a neural network model for audio analysis. As shown in fig. 11, which is a block diagram of a processing apparatus 1100 of a neural network model for audio analysis according to an embodiment of the present application, the apparatus 1100 may include:
a sample acquisition module 1101, configured to acquire an audio data sample, where the audio data sample is marked with corresponding audio analysis information;
a signal processing module 1102, configured to perform audio signal processing on the audio data samples to obtain audio data samples with enlarged frequency bandwidth;
A result analysis module 1103 for training a neural network model based on the audio data samples, the neural network model for determining audio analysis information by audio analysis, the audio analysis information comprising analysis results in a plurality of quality analysis dimensions.
In one possible implementation, the sample acquisition module 1101 may be specifically configured to acquire an initial audio data sample; changing the value corresponding to the parameter of the initial audio data sample to obtain a new audio data sample; determining audio analysis information corresponding to the newly added audio data sample; the initial audio data sample marked with the audio analysis information and the newly added audio data sample are taken as audio data samples.
In one possible implementation manner, when determining the audio analysis information corresponding to the new audio data sample, the sample obtaining module 1101 may be specifically configured to obtain a first fitting function corresponding to a quality analysis dimension, where the first fitting function characterizes a numerical relationship between the audio analysis information corresponding to the initial audio data sample and the new audio data sample, and parameters of the initial audio data sample; and determining the audio analysis information corresponding to the newly added audio data samples respectively according to the first fitting function, the audio analysis information corresponding to the initial audio data samples and the values corresponding to the parameters of the initial audio data samples.
In a possible implementation manner, the audio analysis information further comprises comprehensive analysis results; the sample obtaining module 1101 may be specifically configured to obtain audio analysis information corresponding to the audio data sample in a plurality of quality analysis dimensions; and determining a comprehensive analysis result corresponding to the audio data sample according to the second fitting function and the audio analysis information corresponding to the plurality of quality analysis dimensions.
Corresponding to the application scenarios and methods provided by the embodiments of the present application, an embodiment of the present application further provides a voice quality analysis apparatus. As shown in fig. 12, which is a block diagram of a voice quality analysis apparatus 1200 according to an embodiment of the present application, the apparatus 1200 may include:
the data acquisition module 1201 is configured to acquire voice data transmitted in real time in an audio/video session;
a signal processing module 1202, configured to perform audio signal processing on the voice data to obtain voice data with an enlarged frequency bandwidth;
the result analysis module 1203 is configured to perform audio analysis on the voice data using a neural network model, to obtain audio analysis information of the voice data, where the audio analysis information includes analysis results under a plurality of quality analysis dimensions;
An information providing module 1204, configured to provide the audio analysis information to at least one client participating in the audio-video session.
Corresponding to the application scenarios and methods provided by the embodiments of the present application, an embodiment of the present application further provides an audio quality analysis apparatus. As shown in fig. 13, which is a block diagram of an audio quality analysis apparatus 1300 according to an embodiment of the present application, the apparatus 1300 may include:
an analysis information acquisition module 1301, configured to acquire audio analysis information of target audio data, where the audio analysis information is obtained by performing audio analysis on the target audio data using a neural network model, the target audio data has a bandwidth that is enlarged by audio signal processing, and the audio analysis information includes analysis results in a plurality of quality analysis dimensions;
the analysis information providing module 1302 is configured to provide the audio analysis information based on a client.
The functions of each module in each device of the embodiments of the present application may be referred to the corresponding descriptions in the above methods, and have corresponding beneficial effects, which are not described herein.
Fig. 14 is a block diagram of an electronic device used to implement an embodiment of the present application. As shown in fig. 14, the electronic device includes: a memory 1401 and a processor 1402, the memory 1401 storing a computer program executable on the processor 1402. The processor 1402, when executing the computer program, implements the methods of the embodiments described above. The number of memories 1401 and processors 1402 may be one or more.
The electronic device further includes:
the communication interface 1403 is used for communicating with external devices and performing data interaction transmission.
If the memory 1401, the processor 1402, and the communication interface 1403 are implemented independently, they can be connected to each other and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 14, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 1401, the processor 1402, and the communication interface 1403 are integrated on a chip, the memory 1401, the processor 1402, and the communication interface 1403 may perform communication with each other through internal interfaces.
The present embodiments provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the methods provided in the embodiments of the present application.
The embodiment of the application also provides a chip, which comprises a processor and is used for calling the instructions stored in the memory from the memory and running the instructions stored in the memory, so that the communication device provided with the chip executes the method provided by the embodiment of the application.
The embodiment of the application also provides a chip, which comprises: the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the application embodiment.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting an advanced reduced instruction set machine (Advanced RISC Machines, ARM) architecture.
Further alternatively, the memory may include a read-only memory and a random access memory. The memory may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. The volatile memory may include Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate Synchronous DRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Any process or method described in flow charts or otherwise herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application also includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved.
Logic and/or steps described in the flowcharts or otherwise described herein, for example, may be considered an ordered listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware; the program, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The foregoing is merely exemplary embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, which should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method of analyzing audio quality, comprising:
acquiring target audio data to be analyzed;
performing audio signal processing on the target audio data to obtain target audio data with enlarged frequency bandwidth;
and performing audio analysis on the target audio data by using a neural network model to obtain audio analysis information of the target audio data, wherein the audio analysis information comprises analysis results under a plurality of quality analysis dimensions.
2. The method of claim 1, wherein the audio analysis of the target audio data using a neural network model comprises:
extracting audio feature information of the target audio data, wherein the audio feature information comprises at least one of time domain information, frequency domain information, perceptually relevant amplitude information and perceptually relevant frequency domain information;
The audio feature information is input to the neural network model for audio analysis of the target audio data.
3. The method of claim 2, wherein prior to the audio analysis of the target audio data using the neural network model to obtain audio analysis information for the target audio data, the method further comprises:
and carrying out Fourier transform on the target audio data, and transforming the target audio data from a time domain dimension to a frequency domain dimension.
4. The method of claim 1, wherein prior to the audio analysis of the target audio data using the neural network model to obtain audio analysis information for the target audio data, the method further comprises:
and carrying out frequency division processing on the target audio data to obtain target audio data comprising a plurality of sub-band spectrums corresponding to different frequency band ranges.
5. The method of claim 1, wherein the audio analysis information comprises problem analysis information;
the audio analysis of the target audio data by using the neural network model comprises the following steps of:
Acquiring an audio quality score output for the target audio data by using the neural network model;
determining problem analysis information for the target audio data according to the audio quality score, wherein the problem analysis information comprises: at least one of audio quality problems, root cause of problems, and problem solution.
6. The method of claim 1, wherein the method further comprises:
and prompting the audio analysis information of the target audio data to the terminal equipment corresponding to the target audio data.
7. The method of claim 1, wherein the acquiring target audio data to be analyzed comprises at least one of:
acquiring original audio data collected by an audio providing end, acquiring uplink audio data submitted from the audio providing end to a server, and acquiring downlink audio data acquired by an audio receiving end from the server;
the uplink audio data and/or the downlink audio data are subjected to at least one of audio preprocessing and audio encoding and decoding.
8. The method of claim 1, wherein the quality analysis dimensions comprise audio continuity, noise, audio coloration, or loudness, the audio analysis information further comprising a composite analysis of the target audio data at a plurality of quality analysis dimensions.
9. The method of claim 1, wherein interfering audio data is segmented from the target audio data, the interfering audio data comprising at least one of noise, blank audio, and music;
and splicing the audio data remaining after the segmentation, and updating the splicing result as the target audio data.
10. The method of claim 1, wherein the method further comprises:
and carrying out clause processing on the initial audio data, and taking the single-sentence audio data after the clause as target audio data.
11. A method of processing a neural network model for audio analysis, comprising:
acquiring an audio data sample marked with corresponding audio analysis information;
performing audio signal processing on the audio data samples to obtain audio data samples with enlarged frequency bandwidth;
training a neural network model based on the audio data samples, the neural network model for determining audio analysis information by audio analysis, the audio analysis information comprising analysis results in a plurality of quality analysis dimensions.
12. The method of claim 11, wherein the acquiring audio data samples comprises:
Acquiring an initial audio data sample;
changing the value corresponding to the parameter of the initial audio data sample to obtain a new audio data sample;
determining audio analysis information corresponding to the newly added audio data sample;
the initial audio data sample marked with the audio analysis information and the newly added audio data sample are taken as audio data samples.
13. The method of claim 12, wherein the determining audio analysis information corresponding to the new audio data sample comprises:
acquiring a first fitting function corresponding to the quality analysis dimension, wherein the first fitting function characterizes the numerical relation between audio analysis information respectively corresponding to the initial audio data sample and the newly added audio data sample and parameters of the initial audio data sample;
and determining the audio analysis information corresponding to the newly added audio data samples respectively according to the first fitting function, the audio analysis information corresponding to the initial audio data samples and the values corresponding to the parameters of the initial audio data samples.
14. The method of claim 11, wherein the audio analysis information further comprises a composite analysis result;
the acquiring audio data samples includes:
Acquiring audio analysis information of the audio data sample corresponding to each of a plurality of quality analysis dimensions;
and determining a comprehensive analysis result corresponding to the audio data sample according to the second fitting function and the audio analysis information corresponding to the plurality of quality analysis dimensions.
15. A method of analyzing speech quality, comprising:
acquiring voice data transmitted in real time in an audio-video session;
performing audio signal processing on the voice data to obtain voice data with enlarged frequency bandwidth;
performing audio analysis on the voice data by using a neural network model to obtain audio analysis information of the voice data, wherein the audio analysis information comprises analysis results under a plurality of quality analysis dimensions;
and providing the audio analysis information to at least one client participating in the audio-video session.
16. A method of analyzing audio quality, comprising:
acquiring audio analysis information of target audio data, wherein the audio analysis information is obtained by performing audio analysis on the target audio data by using a neural network model, the frequency bandwidth of the target audio data is enlarged by audio signal processing, and the audio analysis information comprises analysis results under a plurality of quality analysis dimensions;
The audio analysis information is provided based on a client.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1-16 when the computer program is executed.
18. A computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-16.
CN202211739631.6A 2022-12-30 2022-12-30 Audio quality analysis method and device, electronic equipment and storage medium Pending CN116013367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211739631.6A CN116013367A (en) 2022-12-30 2022-12-30 Audio quality analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211739631.6A CN116013367A (en) 2022-12-30 2022-12-30 Audio quality analysis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116013367A true CN116013367A (en) 2023-04-25

Family

ID=86024950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211739631.6A Pending CN116013367A (en) 2022-12-30 2022-12-30 Audio quality analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116013367A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994609A (en) * 2023-09-28 2023-11-03 苏州芯合半导体材料有限公司 Data analysis method and system applied to intelligent production line
CN116994609B (en) * 2023-09-28 2023-12-01 苏州芯合半导体材料有限公司 Data analysis method and system applied to intelligent production line
CN117061039A (en) * 2023-10-09 2023-11-14 成都思为交互科技有限公司 Broadcast signal monitoring device, method, system, equipment and medium
CN117061039B (en) * 2023-10-09 2024-01-19 成都思为交互科技有限公司 Broadcast signal monitoring device, method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination