CN114329040B - Audio data processing method, device, storage medium, equipment and program product


Info

Publication number: CN114329040B
Application number: CN202111266011.0A
Authority: CN (China)
Prior art keywords: audio data, audio, scaling, data, scores
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN114329040A
Inventor: 黄江泉
Current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original assignee: Tencent Technology (Shenzhen) Co., Ltd.
Events: application filed by Tencent Technology (Shenzhen) Co., Ltd.; priority to CN202111266011.0A; publication of application CN114329040A; application granted; publication of grant CN114329040B


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

The application discloses an audio data processing method, apparatus, storage medium, device and program product, applicable to scenarios such as audio processing, text processing, natural language processing and artificial intelligence. The method comprises the following steps: obtaining scaling data corresponding to a target test question, the scaling data being audio data for which scaling scoring has been completed; clustering the scaling data to find the divergent audio data in it, i.e. audio data whose answers to the same target test question are similar or identical but whose scaling scores differ; sending the divergent audio data to grading experts for re-evaluation; and re-determining the scaling scores of the divergent audio data according to the re-evaluation results. This effectively reduces inconsistency in the scaling data corresponding to the audio data, improves the quality of the scaling data, and thereby improves the final scoring effect of the intelligent scoring system for spoken-language examinations.

Description

Audio data processing method, device, storage medium, equipment and program product
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an audio data processing method, an apparatus, a storage medium, a device, and a program product.
Background
In an intelligent scoring system for spoken-language examinations, the test questions, assessment points, scoring standards and the like vary greatly across regions, grades and examination levels, so no single universal model can complete the scoring work for every examination. In practice, a small amount of manual scaling is performed by grading experts on each sub-question of each paper set of each examination; the model then adaptively learns from the scaling data before the machine intelligently scores the remaining data to be scored.
Because of limitations of manpower, time and cost, the scaling data for each question set is only about 200 to 500 items. With so little data, the adaptive learning effect of the model is affected to some extent.
In general, manual scaling has two grading experts score the same audio independently; a preset arbitration rule then determines whether the audio's score needs arbitration. If arbitration is needed, a third grading expert or the grading group leader gives the final score.
Even with an arbitration scheme, the arbitration threshold is usually set relatively loosely for reasons of scaling time and cost, or arbitration is triggered per topic or per question group. As a result, similar or even identical answers can end up with different, sometimes widely different, final scaling scores. Scaling quality is therefore low, which inevitably harms the scoring accuracy of the intelligent scoring system.
Disclosure of Invention
The embodiments of the application provide an audio data processing method, apparatus, storage medium, device and program product, which can effectively reduce inconsistency in the scaling data corresponding to audio data and improve its quality, thereby improving the final scoring effect of an intelligent scoring system for spoken-language examinations.
In a first aspect, an audio data processing method is provided, the method comprising: obtaining scaling data corresponding to a target test question, the scaling data being audio data for which scaling scoring has been completed; clustering the scaling data and finding the divergent audio data in it, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ; and sending the divergent audio data to a grading expert for re-evaluation, and re-determining the scaling scores of the divergent audio data according to the re-evaluation result obtained from the grading expert's re-evaluation.
In a second aspect, an audio data processing apparatus is provided, the apparatus comprising: an acquisition unit for obtaining scaling data corresponding to a target test question, the scaling data being audio data for which scaling scoring has been completed; a clustering unit for clustering the scaling data and finding the divergent audio data in it, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ; and a processing unit for sending the divergent audio data to a grading expert for re-evaluation and re-determining the scaling scores of the divergent audio data according to the re-evaluation result obtained from the grading expert's re-evaluation.
In a third aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program adapted to be loaded by a processor for performing the steps in the audio data processing method according to any of the embodiments above.
In a fourth aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing a computer program, and the processor executing the steps in the audio data processing method according to any of the embodiments above by invoking the computer program stored in the memory.
In a fifth aspect, a computer program product is provided comprising computer instructions which, when executed by a processor, implement the steps in the audio data processing method as described in any of the embodiments above.
According to the embodiments of the application, scaling data corresponding to a target test question is obtained, the scaling data being audio data for which scaling scoring has been completed; the scaling data is clustered to find the divergent audio data in it, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ; the divergent audio data is then sent to a grading expert for re-evaluation, and its scaling scores are re-determined according to the re-evaluation result obtained from the grading expert's re-evaluation. During spoken-language examination scaling, clustering the audio data for which scaling scoring has been completed automatically surfaces the divergent audio data and its corresponding scaling scores; sending this audio data to grading experts for re-evaluation to determine its final scaling scores effectively reduces inconsistency in the scaling data, improves its quality, and thereby improves the final scoring effect of the intelligent scoring system for spoken-language examinations.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of an intelligent scoring system according to an embodiment of the present application.
Fig. 2 is a first flowchart of an audio data processing method according to an embodiment of the present application.
Fig. 3 is an application scenario schematic diagram of an audio data processing method according to an embodiment of the present application.
Fig. 4 is a second flowchart of an audio data processing method according to an embodiment of the present application.
Fig. 5 is a third flow chart of an audio data processing method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an audio data processing device according to an embodiment of the present application.
Fig. 7 is another schematic structural diagram of an audio data processing device according to an embodiment of the present application.
Detailed Description
The following clearly and fully describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
The embodiment of the application provides an audio data processing method, an audio data processing device, computer equipment and a storage medium. Specifically, the audio data processing method of the embodiment of the application may be performed by a computer device, where the computer device may be a terminal or a server. The embodiment of the application can be applied to various scenes such as audio processing, text processing, natural language processing, artificial intelligence and the like.
First, some of the terms and terminology appearing in the description of the embodiments of the present application are explained as follows:
Artificial Intelligence (AI): a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Deep Learning (DL): a branch of machine learning; an algorithm that attempts to perform high-level abstraction of data using multiple processing layers comprising complex structures or multiple nonlinear transformations. Deep learning learns the inherent laws and representation levels of training sample data, and the information obtained during learning greatly helps the interpretation of data such as text, images and sound. Its ultimate goal is to enable machines to analyze and learn like humans and to recognize text, image and sound data. Deep learning is a complex machine learning algorithm whose results in speech and image recognition far surpass those of earlier techniques.
Neural Network (NN): a deep learning model, in the fields of machine learning and cognitive science, that imitates the structure and function of biological neural networks.
Deep Neural Network (DNN): a technology in the field of machine learning.
Natural Language Processing (NLP): a field of computer science and linguistics concerned with the interactions between computers and human (natural) language. A natural language generation system converts computer database information into human-readable language. A natural language understanding system converts samples of human language into more formal representations that are easier for computers to handle, such as parse trees or first-order logic. Many challenges in NLP involve both generation and understanding; for example, to understand a sentence a computer must be able to model morphology (word construction), and to produce a grammatically correct English sentence it must also have a morphological model. NLP overlaps significantly with the field of computational linguistics and is generally considered a branch of artificial intelligence.
Automatic Speech Recognition (ASR): a technology that converts human speech into text.
Mel-Frequency Cepstral Coefficients (MFCC): the coefficients that make up a mel-frequency cepstrum, a feature widely used in automatic speech and speaker recognition. MFCC takes human auditory characteristics into account: based on auditory perception, it maps a linear spectrum onto the nonlinear mel spectrum and then converts it to a cepstrum. The extraction process is briefly as follows. First, pre-emphasis, framing and windowing are applied to the speech as pre-processing to improve aspects of signal quality (e.g. signal-to-noise ratio and processing accuracy). Then, for each short-time analysis window, the corresponding spectrum is obtained via FFT (fast Fourier transform), yielding spectra distributed over different time windows along the time axis. These spectra are passed through a mel filter bank to obtain mel spectra, converting the linear natural spectrum into a mel spectrum that reflects human auditory characteristics. Finally, cepstral analysis is performed on the mel spectrum: the logarithm is taken and an inverse transform applied (in practice usually a DCT, discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. The MFCCs obtained by this cepstral analysis are the features of the corresponding speech frames.
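As an illustration of the extraction pipeline just described, the following is a minimal sketch using the open-source librosa library (an assumed tooling choice; the patent does not name a toolkit, and the file name is hypothetical):

```python
import librosa

# Load an answer recording; 16 kHz is a common rate for speech processing.
y, sr = librosa.load("answer.wav", sr=16000)

# librosa internally performs the framing, windowing, FFT, mel filtering and
# DCT-based cepstral analysis described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```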
wav2vec: an Unsupervised Pre-training method for speech recognition, from the Facebook Speech Representation Unsupervised Pre-training model, entitled wav2vec: insupervised Pre-training for Speech Recognition, in which a noise contrast learning classification task (noise contrastive binary classification task) is presented, whereby the wav2vec model can be trained on large scale unlabeled data and the resulting representation is then used to improve the acoustic model training.
Goodness of Pronunciation (GOP): a speech evaluation algorithm. The basic idea of GOP is to use the text known in advance to force-align the speech with the corresponding text, compare the likelihood obtained by forced alignment with the likelihood obtained without knowing the corresponding text, and use the likelihood ratio as the evaluation of the pronunciation. The GOP algorithm calculates the likelihood that the input speech corresponds to the known text; the higher the likelihood, the more standard the pronunciation.
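The likelihood-ratio idea can be written as a frame-normalized score. A minimal sketch, assuming the forced-alignment and best-path log-likelihoods have already been produced by an acoustic model (function and variable names are illustrative, not from the patent):

```python
def gop_score(forced_loglik: float, best_loglik: float, num_frames: int) -> float:
    """Frame-normalized log likelihood ratio: the closer to zero (or higher),
    the more the audio matches the expected text's pronunciation."""
    return (forced_loglik - best_loglik) / num_frames

# Example: a phone segment of 12 frames.
print(gop_score(forced_loglik=-85.2, best_loglik=-80.7, num_frames=12))  # -0.375
```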
Hidden Markov Model (HMM): a statistical model used to describe a Markov process with hidden, unknown parameters. The difficulty is to determine the hidden parameters of the process from the observable parameters; these parameters are then used for further analysis, such as pattern recognition.
BERT: bidirectional Encoder Representation from Transformers, is a pre-trained language characterization model.
XLNET: generalized Autoregressive Pretraining for Language Understanding, generalized autoregressive pre-training for language understanding. XLNET is a generalized autoregressive pretraining method that learns a bi-directional context by maximizing the expected likelihood of all permutations of a factorization sequence.
GPT: generating Pre-trained Transformer, generating a pretraining transducer (generating pretraining transducer). GPT uses the decoder portion of the Transformers on the model structure to process supervised tasks by learning a generic language model on unlabeled data, followed by fine tuning according to the specific task.
PCA: principal component analysis principal component analysis, also known as principal component analysis or principal component regression analysis, is an unsupervised data dimension reduction method. Firstly, transforming data into a new coordinate system by utilizing linear transformation; then, the dimension reduction concept is utilized, so that the first large variance of any data projection is on the first coordinate (called the first principal component) and the second large variance is on the second coordinate (called the second principal component). The dimension reduction idea firstly reduces the dimension of the data set, simultaneously maintains the characteristic of the data set with the greatest contribution to the difference, and finally enables the data to be visually presented in a two-dimensional coordinate system.
LDA: latent Dirichlet Allocation, implicit dirichlet distribution, is a topic model for processing documents. LDA is a dimension reduction technique for supervised learning, that is, each sample of its data set is output in class, the data is projected in a low dimension, it is desired that the projected points of each class of data are as close as possible after projection, and the distance between class centers of different classes of data is as large as possible.
k-means: the K-means clustering algorithm (K-means clustering algorithm) is an iterative solution clustering analysis algorithm, and comprises the steps of dividing data into K groups, randomly selecting K objects as initial clustering centers, calculating the distance between each object and each seed clustering center, and distributing each object to the closest clustering center. The cluster centers and the objects assigned to them represent a cluster. For each sample assigned, the cluster center of the cluster is recalculated based on the existing objects in the cluster. This process will repeat until a certain termination condition is met. The termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, no (or a minimum number of) cluster centers are changed again, and the sum of squares of errors is locally minimum.
DBSCAN: density-Based Spatial Clustering of Applications with Noise is a relatively representative Density-based clustering algorithm. Unlike the partitioning and hierarchical clustering method, which defines clusters as the largest set of densely connected points, it is possible to partition a region having a sufficiently high density into clusters and find clusters of arbitrary shape in a noisy spatial database.
AutoEncoder: an autoencoder is a neural network, trained with the back-propagation algorithm to make the output value equal to the input value, that compresses the input into a latent-space representation and then reconstructs the output from that representation. An autoencoder consists of two parts, an encoder and a decoder: the encoder compresses the input into the latent-space representation, which can be expressed by the encoding function h = f(x); the decoder reconstructs the input from the latent-space representation, which can be expressed by the decoding function r = g(h).
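A minimal PyTorch sketch of the encoder/decoder pair h = f(x), r = g(h) (layer sizes are assumptions for illustration):

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim: int = 128, latent_dim: int = 16):
        super().__init__()
        # h = f(x): compress the input into the latent-space representation.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        # r = g(h): reconstruct the input from the latent representation.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(32, 128)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
loss.backward()
```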
Generative Adversarial Network (GAN): based on a game model in which a generator model (Generator) competes with its opponent, a discriminator model (Discriminator). The generator directly produces fake samples, while the discriminator tries to distinguish samples produced by the generator (fake samples) from samples drawn from the training data (real samples). A GAN is thus a generative model composed of a Generator and a Discriminator: the generator attempts to learn the feature distribution of real data samples and generate new data samples, while the discriminator is a classifier that judges whether its input is real data or a generated sample. Either a perceptron or a deep learning model can be used for both the generator and the discriminator. The optimization process is a minimax game problem whose optimization target is Nash equilibrium, reached when the discriminator can no longer tell whether the fake samples produced by the generator are real or fake.
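The minimax game described above is commonly written as the following objective (the standard formulation from the GAN literature, reproduced here for reference rather than taken from the patent):

```latex
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```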
Cloud technology: a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize the computation, storage, processing and sharing of data. Cloud technology is the general term for the network, information, integration, management-platform, application and other technologies applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: the background services of technical network systems, such as video websites, image websites and numerous portals, require large amounts of computing and storage resources. With the development of the internet industry, each item may in the future carry its own identification mark, which needs to be transmitted to a background system for logical processing; data at different levels will be processed separately, and all kinds of industry data need strong backend system support, which can only be realized through cloud computing.
Blockchain system: a distributed system formed by a client, a plurality of nodes (any form of computing device in an access network, such as a server, user terminal) connected by way of network communications. The nodes form a point-To-point (P2P, peer To Peer) network, the P2P protocol is an application layer protocol running on top of a transmission control protocol (TCP, transmission Control Protocol) protocol, in a distributed system, any machine such as a server and a terminal can be added To become a node, and the node comprises a hardware layer, a middle layer, an operating system layer and an application layer.
In an intelligent scoring system for spoken-language examinations, the test questions, assessment points, scoring standards and the like vary greatly across regions, grades and examination levels, so no single universal model can complete the scoring work for every examination. In practice, a small amount of manual scaling is performed by grading experts on each paper set of each examination; after the model adaptively learns from the scaling data, it intelligently scores the data to be scored. Due to limitations of manpower, time and cost, the scaling data for each set of test questions is usually about 200 to 500 items. The scaling data comprises the audio of the question answers, the corresponding reference text/reference answers, and the scaling score given by the grading expert.
The audio data is processed by traditional feature extraction (such as MFCC) or end-to-end feature extraction (such as wav2vec), automatic speech recognition (ASR) and the like to obtain audio features such as phoneme sequences and pronunciation quality (GOP); alternatively, text features such as text similarity are obtained from the ASR-recognized text using natural language processing (NLP) techniques.
To ensure the quality of the scaling data, the roles of grader and arbitrator are distinguished during manual scaling. First, two grading experts (graders) score the same audio; then a preset arbitration rule determines whether the audio's score needs arbitration. If arbitration is required, a third grading expert or the grading group leader (the arbitrator) gives the final score.
Another solution for controlling scaling quality is a review mechanism: during scoring, audio that a grader has previously scored is randomly re-drawn and distributed back to the same grader for re-scoring. If the new score is inconsistent with the grader's earlier score, the grader's grasp of the standard is inadequate and the scoring quality is substandard; the grader then needs to be retrained or replaced.
Through the arbitration mechanism and the review mechanism, the quality of the scaling data can be guaranteed to a certain extent, the internal consistency of the scaling data is improved, and better input is provided for the intelligent scoring algorithm to learn the scoring standard. However, even with an arbitration scheme, the arbitration threshold is generally set loosely, or arbitration is triggered on the total score of a topic or topic group, because of the time and economic cost of scaling. A loose arbitration threshold, or arbitration on topic or topic-group totals, can mask part of the audio that actually needs arbitration, so that similar or even identical answers receive different, sometimes widely different, final scaling scores. The review mechanism only ensures the consistency of a single grader during scoring and cannot solve the problem of inconsistent standards between graders. If the proportion of inconsistency in the scaling data is high, the quality of the scaling data is low, and the effect of the scoring model of the intelligent scoring system will inevitably suffer greatly after learning from low-quality scaling data. Therefore, improving the consistency and quality of the scaling data improves the final scoring effect of the scoring model, so that the data to be scored can be assessed more accurately.
The embodiments of the application start from the source: they solve the problem of inconsistent scaling data and improve its quality. During spoken-language examination scaling, clustering the audio data that has already been scaled and scored automatically surfaces audio data whose answers are similar or identical but whose scaling scores differ, together with the corresponding scaling scores; this audio is then sent back to grading experts for re-evaluation to determine the final scaling score for that answer class. For example, the questionable audio can be sent back to other graders or to arbitrators for re-evaluation, with the final scaling score confirmed by voting or similar methods; or all the audio and scaling data in the questionable cluster can be sent to the arbitrator, who identifies the wrongly scored audio and corrects its score. The embodiments of the application can effectively reduce inconsistency in the scaling data and improve its quality, thereby improving the final effect of the intelligent scoring system for spoken-language examinations.
The embodiments of the application can be implemented in combination with cloud technology or blockchain network technology. In the audio data processing method disclosed in the embodiments of the present application, the data can be stored on a blockchain: for example, the scaling data, the divergent audio data in the scaling data, and the scaling scores of the divergent audio data re-determined from the grading experts' re-evaluation results can all be stored on the blockchain.
To facilitate storage and query of the examinees' answer audio, the scaling data manually scaled by grading experts, the divergent audio data in the scaling data, and the scaling scores of the divergent audio data re-determined from the grading experts' re-evaluation results, the audio data processing method optionally further includes: sending these data to a blockchain network, so that the nodes of the blockchain network fill them into a new block and, when consensus is reached on the new block, append the new block to the tail of the blockchain. Storing these records on-chain provides a backup: when the divergent audio data in the scaling data needs to be re-evaluated, the corresponding records can be obtained directly and quickly from the blockchain, without the intelligent scoring system having to run a whole series of processing again to obtain the scaling scores of the divergent audio data, which improves the efficiency of data processing and data retrieval.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an intelligent scoring system according to an embodiment of the present application. The intelligent scoring system includes a terminal 10, a server 20 and the like; the terminal 10 and the server 20 are connected through a network, for example a wired or wireless network connection.
The terminal 10 may be used to display a graphical user interface, through which it interacts with a user, for example by downloading, installing and running the corresponding application. In the embodiment of the present application, the terminal 10 may be a terminal device used by an examinee taking a spoken-language test, or a terminal device used by a grading expert. During the spoken-language test, the terminal device used by the examinee records the audio data of the examinee's answers for the current test and sends it to the server 20 to await scoring. The server 20 selects part of the audio data corresponding to the test questions of each paper as audio data to be scaled and sends it to the terminal devices used by grading experts; the grading experts manually scale the audio data to obtain scaling data and send it back to the server 20. A scoring model in the server 20 adaptively learns from the scaling data, and the adapted scoring model then intelligently scores the remaining audio data to be scored.
In the embodiment of the present application, during the scaling process the server 20 may specifically be used to: obtain scaling data corresponding to a target test question, the scaling data being audio data for which scaling scoring has been completed; cluster the scaling data and find the divergent audio data in it, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ; and send the divergent audio data to a terminal 10 used by a grading expert for re-evaluation, re-determining the scaling scores of the divergent audio data according to the re-evaluation result obtained from the grading expert's re-evaluation.
After receiving the divergent audio data, the terminal 10 used by the grading expert may display the question information, the scoring standard, the information of the divergent audio data and so on on the terminal screen and provide scoring option buttons; based on the grading expert's touch operations on the scoring option buttons, it records the re-evaluation result for the divergent audio data and transmits it back to the server 20, so that the server 20 re-determines the scaling scores of the divergent audio data according to the re-evaluation result.
The following will describe in detail. It should be noted that the following description order of embodiments is not a limitation of the priority order of embodiments.
The embodiments of the present application provide an audio data processing method, where the method may be performed by a terminal or a server, or may be performed by the terminal and the server together; the embodiments of the present application will be described with an audio data processing method executed by a server as an example.
Referring to fig. 2 to 5, fig. 2 is a first flow chart of an audio data processing method according to an embodiment of the present application, fig. 3 is an application scenario of the audio data processing method according to an embodiment of the present application, fig. 4 is a second flow chart of the audio data processing method according to an embodiment of the present application, and fig. 5 is a third flow chart of the audio data processing method according to an embodiment of the present application. The method comprises the following steps:
Step 210, obtaining scaling data corresponding to the target test question, wherein the scaling data is audio data for which scaling scoring has been completed.
Optionally, the obtaining of the scaling data corresponding to the target test question includes:
obtaining the scaling data corresponding to the target test question at a preset fixed time; or
obtaining the scaling data corresponding to the target test question according to the scaling progress.
During the spoken-language examination, the terminal device used by the examinee records the audio data of the examinee's answers to the test questions in the current examination and sends the answer audio to the server to await scoring. The server selects part of the audio data from all the audio data of the examinees' answers to each test question in each paper set as the audio data to be scaled and sends it to the terminal device used by a grading expert; the grading expert manually scales the audio data to obtain the scaling data, which is sent back to the server.
The main framework of the scaling process can be completed on a scaling platform. First, after scaling starts, the scaling platform distributes the audio data to be scaled to the terminal devices of the grading experts (graders or arbitrators), displays it through the display interface of the scaling platform installed on those devices, and receives and stores the scaling scores the grading experts enter for the audio data through the corresponding display interface. Then, during scaling, the platform can obtain the audio data corresponding to the target test question for which scaling scoring has been completed, either at regular times (e.g. fixed time intervals) or according to the scaling progress (e.g. when completion reaches a preset milestone), and call the audio clustering module to cluster the scored audio data.
Step 220, clustering the scaling data and finding the divergent audio data in it, wherein the divergent audio data comprises audio data whose answers to the same target test question are similar or identical but whose scaling scores differ.
For example, the embodiment of the application is mainly applied to the manual scaling process of an intelligent scoring system for spoken-language examinations. More and more provinces and cities now include spoken English tests in the senior high school and college entrance examinations, and several provinces already run the spoken English test as a machine test; some areas have put spoken-English practice examinations for these examinations into their plans. Some provinces also require candidates applying for English majors in the college entrance examination to take a spoken English test, conducted as a machine test and automatically scored by an intelligent scoring system. More and more spoken examinations are now marked with manual scaling combined with an intelligent scoring system, or even with a fully intelligent scoring system. To make the results of the intelligent scoring system better conform to the local scoring standard, in practice a small amount of manual scaling is performed by grading experts on each paper set of each examination; after the model adaptively learns from the scaling data, the machine intelligently scores the data to be scored.
Manual scaling can be completed on the scaling platform of the intelligent scoring system: the system distributes one question at a time to a grader/arbitrator, displays information such as the question, the scoring standard and the student audio on the screen, and provides scoring option buttons. During scaling, the platform can cluster the audio that has completed scaling scoring, at regular times or according to the scaling progress, and automatically find the audio groups whose answers are similar or identical but whose scaling scores differ, together with the corresponding scaling scores; that is, it finds the audio data on which the graders diverge.
Taking the question in fig. 3 as an example, the stem is "How is the boy?". The scoring criteria are: if "left", "arm" and "hurt" are all stated correctly, 1 point is given; if the three words "left", "arm" and "hurt" are basically stated, or up to two words are pronounced inaccurately but their meanings can still be recognized, 0.5 points are given; if the vowels or consonants of "left", "arm" and "hurt" are pronounced so inaccurately as to be unintelligible, or are read with severe errors, 0 points are given.
For example, if there are 9 audio recordings whose answer content is "My left arm hurts", of which 7 have a scaling score of 1 and 2 have a scaling score of 0.5, the clustering algorithm will identify these 9 audio recordings as divergent audio data and extract them for further processing.
Clustering the scaling data aggregates audio data with the same or similar answers into the same class, so that the list of audio on which the graders diverge is found and can then be re-evaluated.
For example, audio feature extraction, ASR recognition and text feature extraction may be performed on the audio data for which scaling scoring has been completed, yielding audio features and text features respectively. Audio feature extraction may be based on traditional methods such as MFCC, or use newer deep learning methods such as wav2vec. Likewise, text feature extraction may use rule/statistics-based methods, or deep learning methods such as BERT, XLNet and GPT.
After audio and text feature extraction, the number of features may be relatively large and some of them invalid, so feature merging and feature screening are needed. Feature screening can use traditional methods such as PCA and LDA, or a simple model can be trained to perform feature transformation. Experimental results show that appropriate feature screening effectively speeds up the clustering algorithm and may also slightly improve the clustering effect.
The screened features can then be clustered using a clustering algorithm such as k-means, DBSCAN or hierarchical clustering, aggregating the same or similar audio data into one class. The clustering algorithm requires a distance function; for example, Euclidean distance, cosine distance, or another suitable distance function may be used.
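As one possible realization of this step, the following sketch clusters screened features hierarchically under a cosine distance using SciPy (an assumed tooling choice; the cut threshold and data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # stand-in for the screened answer features

d = pdist(X, metric="cosine")         # pairwise cosine distances
Z = linkage(d, method="average")      # agglomerative (hierarchical) clustering
labels = fcluster(Z, t=0.3, criterion="distance")  # cut the tree at an assumed threshold
```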
Then, according to the clustering result combined with the existing scaling scores, it can be analyzed whether the scaling scores within the same cluster differ. If they differ, the graders disagree and re-evaluation is needed. The scaling system then automatically distributes the divergent audio to graders and/or arbitrators according to a preset scheme.
After the graders and/or arbitrators complete the re-evaluation, the scaling system updates the scaling scores of the corresponding audio for subsequent cluster analysis and for export as the final result.
For example, the scaling data may be clustered using an audio clustering module, which automatically aggregates audio with the same or similar answers into the same audio group, yielding the initial audio groups G_1, G_2, …, G_N. By judging whether the scaling scores within a single initial audio group G_i are consistent, it can be determined whether the scaling scores of G_i diverge.
To reduce as far as possible the number of audio groups judged divergent, the judgment of whether the scaling scores within an initial audio group are consistent can be further optimized in combination with the processing flow of the scaling process.
Typically, during scaling, the scaling score of an audio is the arithmetic mean of the two graders' scores when their difference does not exceed the arbitration threshold that triggers the arbitration mechanism. When the difference between the two graders' scores is an odd number of gears, the arithmetic mean can fall on a half gear. For example, for the question in fig. 3 the scoring gear/granularity is 0.5: if a piece of audio is scored 0.5 by one grader and 1 by the other, and the audio is not arbitrated, its final score is 0.75.
Therefore, when judging whether the scaling scores within an initial audio group G_i are consistent, half-gear scaling scores are ignored, and the scaling scores of G_i are considered divergent only when the difference between the scaling scores of two audios reaches one gear or more.
For example, if an initial audio group G_i is judged not to diverge but contains half-gear scaling scores, the half-gear scaling scores are adjusted to the whole-gear scaling score used within that initial audio group.
For example, if all the scores in an initial audio group are half-gear scores, the initial audio group is considered divergent and requires re-evaluation by a grader and/or an arbitrator.
Optionally, the clustering of the scaling data to find the divergent audio data in the scaling data includes: clustering the scaling data, and aggregating the audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group; judging whether the scaling scores of all audio data in the initial audio group are consistent; and if the scaling scores of all the audio data in the initial audio group are inconsistent, determining all the audio data contained in the initial audio group as divergent audio data.
Optionally, after the judging whether the scaling scores of all audio data in the initial audio group are consistent, the method further includes: if the scaling scores of all the audio data in the initial audio group are consistent, judging whether all of them are half-gear scaling scores; and if all the scaling scores of all the audio data in the initial audio group are half-gear scaling scores, determining all the audio data contained in the initial audio group as divergent audio data.
Optionally, after the judging whether the scaling scores of all audio data in the initial audio group are consistent, the method further includes: if the scaling scores of all the audio data in the initial audio group are inconsistent, eliminating the audio data with half-gear scaling scores from the initial audio group to obtain an updated audio group, in which the scaling scores of all audio data are whole-gear scaling scores; judging whether the scaling scores of all audio data in the updated audio group are consistent; and if they are consistent, determining that the scaling scores of the audio data in the updated audio group do not diverge, and adjusting the scaling scores of the audio data with half-gear scaling scores in the initial audio group to the same whole-gear scaling score as that of the audio data in the updated audio group.
Optionally, after the judging whether the scaling scores of all audio data in the updated audio group are consistent, the method further includes: if the scaling scores of all the audio data in the updated audio group are inconsistent, determining that the scaling scores of the audio data in the updated audio group diverge, and determining all the audio data contained in the initial audio group as divergent audio data.
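A minimal sketch of the half-gear divergence rules set out in the clauses above (pure Python; the patent specifies the logic but no code, so all names are illustrative, and the gear size of 0.5 follows the example of fig. 3):

```python
def is_half_gear(score: float, gear: float = 0.5) -> bool:
    """True for scores such as 0.75 that sit between two whole gears."""
    return (score / gear) % 1 != 0

def check_group(scores: list[float], gear: float = 0.5):
    """Return (divergent, adjusted_scores) for one clustered audio group."""
    if len(set(scores)) == 1:
        # Identical scores diverge only when every one sits on a half gear.
        return all(is_half_gear(s, gear) for s in scores), scores
    whole = [s for s in scores if not is_half_gear(s, gear)]
    if whole and len(set(whole)) == 1:
        # Whole-gear scores agree: snap the half-gear scores to that value.
        return False, [whole[0]] * len(scores)
    return True, scores  # genuine disagreement: send for re-evaluation

print(check_group([1.0, 1.0, 0.75]))  # (False, [1.0, 1.0, 1.0])
print(check_group([1.0, 0.5, 0.75]))  # (True, [1.0, 0.5, 0.75])
print(check_group([0.75, 0.75]))      # (True, [0.75, 0.75])
```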
Optionally, the clustering of the scaling data and aggregating the audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group includes: performing feature extraction on the scaling data to obtain audio features and text features of the scaling data; and clustering the audio features and text features of the scaling data based on a preset algorithm, aggregating the audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group.
Optionally, before the clustering of the audio features and text features of the scaling data based on the preset algorithm, the method further includes: performing feature merging and feature screening on the audio features and text features of the scaling data to eliminate the invalid features among them.
Alternatively, as shown in fig. 4, step 220 may be implemented by steps 401 to 407, specifically:
Step 401, extracting features from the scaling data to obtain audio features and text features of the scaling data.
For example, the scaling data is audio data for which scaling scoring has been completed. Audio feature extraction, ASR recognition and text feature extraction are performed on this audio data, yielding audio features and text features respectively. Audio feature extraction may be based on traditional methods such as MFCC, or use newer deep learning methods such as wav2vec. Likewise, text feature extraction may use rule/statistics-based methods, or deep learning methods such as BERT, XLNet and GPT.
Feature extraction, merging and screening can also be based on deep learning clustering methods, so that the resulting features are more amenable to clustering with a traditional machine learning clustering method.
Step 402, performing feature merging and feature screening on the audio features and text features of the scaling data to eliminate the invalid features among them.
For example, after audio and text feature extraction, the number of features may be relatively large and some of them invalid, so feature merging and feature screening are needed. Feature screening can use traditional methods such as PCA and LDA, or a simple model can be trained to perform feature transformation. Experimental results show that appropriate feature screening effectively speeds up the clustering algorithm and may also slightly improve the clustering effect.
Step 403, clustering the audio features and text features of the scaling data based on a preset algorithm, and aggregating the audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group.
For example, the screened features may be clustered using a preset clustering algorithm such as k-means, DBSCAN or hierarchical clustering, aggregating the same or similar audio data into one class. The clustering algorithm requires a distance function; for example, Euclidean distance, cosine distance, or another suitable distance function may be used.
For example, besides the aforementioned traditional machine learning methods such as k-means and DBSCAN, the preset clustering algorithm may use a deep learning clustering algorithm based on AutoEncoder, GAN or the like.
Step 404, judging whether the scaling scores of all audio data in the initial audio group are consistent; if the scaling scores of all the audio data in the initial audio group are consistent, executing step 405; if the scaling scores of all audio data in the initial audio set are not consistent, step 406 is performed.
For example, the scaled data may be clustered using an audio clustering module by which the same or similar answer audio may be automatically clustered into the same initial audio group, by which the initial audio group G may be obtained 1 ,G 2 ,…,G N . By determining only one initial audio group G i If the middle scaling scores are consistent, the initial audio group G can be determined i Whether there is a divergence in the scaled score of (a).
Step 405, determining whether all scaling scores of all audio data in the initial audio group are half-gear scaling scores; if yes, go to step 406; if not, step 407 is performed.
For example, in order to minimize the audio group data for which a divergence is determined, in conjunction with the processing flow in the scaling process, further optimization may be performed to determine whether the scaling scores in the initial audio group are consistent.
Typically, in the scaling process, the scaling score for an audio uses the arithmetic average of the scores of two graders without exceeding the arbitration threshold to trigger the arbitration mechanism. When the difference value of the scores of the two scoring staff is an odd number of gears, the arithmetic average value can be in a half gear condition. For example, take the title in fig. 3 as an example, its scored gear/granularity is 0.5. If a piece of audio data is scored by two graders as 0.5 and 1, respectively, and the audio data is not arbitrated, then the final score of the audio data is 0.75.
For example, if the initial audio group G i When the scaling scores of all the audio data in the audio data set are consistent, an initial audio group G cannot be completely determined i If no divergence exists, further judging whether all the scaling scores of all the audio data in the initial audio group are half-gear scaling scores. If the scaling scores are consistent, the initial audio group G i The initial audio group G is also considered if the inner is all half-range scores i If the disagreement exists, a paper reading expert is required to carry out re-evaluation adjustment.
Conversely, if the scaling scores in Gi are consistent and are not half-gear scores, i.e. all audio data in Gi share the same whole-gear score, Gi is considered free of divergence and no re-evaluation adjustment is required.
Step 406, determining all audio data contained in the initial audio group as divergent audio data.
For example, if the scaling scores of the audio data in an initial audio group Gi are inconsistent, Gi is considered divergent and must be sent to a marking expert for re-evaluation adjustment.
Likewise, if the scaling scores in Gi are consistent but are all half-gear scores, Gi is also considered divergent and requires re-evaluation adjustment.
Step 407, determining that there is no divergence in the initial audio group.
For example, if the scaling scores in Gi are consistent and are not half-gear scores, i.e. all audio data in Gi share the same whole-gear score, Gi is considered free of divergence and no re-evaluation adjustment is required.
Alternatively, as shown in fig. 5, step 220 may also be implemented by steps 501 to 511, specifically:
Step 501, extracting features of the scaling data to obtain audio features and text features of the scaling data.
For example, the scaling data is audio data for which scaling scoring has been completed. Audio feature extraction, ASR recognition, and text feature extraction are performed on this audio data to obtain audio features and text features respectively. Audio feature extraction may be based on conventional features such as MFCC, or may use recent deep learning methods such as wav2vec. Similarly, text feature extraction may use rule- or statistics-based methods, or deep learning methods such as BERT, XLNet, or GPT.
Feature extraction, merging, and screening may also be driven by a deep learning clustering method, so that the resulting features are better suited to clustering with a traditional machine learning algorithm.
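Under stated assumptions, here is a sketch of the two extraction paths: librosa for conventional MFCC audio features, and a statistics-based TF-IDF stand-in for the text side (deep learning alternatives such as wav2vec or BERT embeddings would slot into the same interfaces). The 16 kHz sample rate and 13 coefficients are illustrative defaults, not values from the patent.

```python
import librosa
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_audio_features(wav_path: str) -> np.ndarray:
    """Mean-pooled MFCCs as a fixed-length audio feature vector."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # (13, n_frames)
    return mfcc.mean(axis=1)                                 # (13,)

def extract_text_features(asr_transcripts: list[str]) -> np.ndarray:
    """Statistics-based text features (TF-IDF) over the ASR transcripts."""
    return TfidfVectorizer().fit_transform(asr_transcripts).toarray()
```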
Step 502, performing feature merging and feature screening operations on the audio features and the text features of the scaling data to remove invalid features from them.
For example, after audio feature extraction and text feature extraction are completed, the number of features may be large and some of them invalid, so feature merging and feature screening operations are required. Screening can use traditional methods such as PCA or LDA, or a simple model can be trained to perform the feature transformation. Experimental results show that appropriate feature screening effectively speeds up the clustering algorithm and may also slightly improve the clustering quality.
Step 503, clustering the audio features and the text features of the scaling data based on a preset algorithm, and aggregating audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group.
For example, the screened features may be clustered using a preset clustering algorithm such as k-means, DBSCAN, or hierarchical clustering, so that identical or similar audio data are aggregated into one class. The clustering algorithm requires a distance function: Euclidean distance, cosine distance, or another suitable distance function may be used.
For example, besides conventional machine learning methods such as k-means and DBSCAN, the preset clustering algorithm may be a deep learning clustering algorithm based on an AutoEncoder, a GAN, or the like.
Step 504, judging whether the scaling scores of all audio data in the initial audio group are consistent; if they are inconsistent, executing step 505; if they are consistent, executing step 509.
For example, the scaling data may be clustered by an audio clustering module, which automatically gathers audio with the same or similar answers into the same initial audio group, yielding initial audio groups G1, G2, …, GN. Whether the scaling scores of an initial audio group Gi are divergent can then be determined simply by checking whether the scaling scores within Gi are consistent.
Typically, during scaling, when the difference between two graders' scores does not exceed the arbitration threshold that triggers the arbitration mechanism, the scaling score of an audio is the arithmetic mean of the two scores. When the two scores differ by an odd number of gears, the mean falls on a half gear. Again taking the question in fig. 3 as an example, the scoring gear/granularity is 0.5: if two graders score a piece of audio data 0.5 and 1 respectively and the audio is not arbitrated, its final score is 0.75.
When judging whether the scaling scores within an initial audio group Gi are consistent, half-gear scaling scores are ignored; the scaling scores of Gi are considered divergent only when the scaling scores of two audios within it differ by a whole gear or more.
Therefore, if the scaling scores of all audio data in the initial audio group are inconsistent, step 505 is executed to exclude the audio data with half-gear scaling scores from the initial audio group, yielding an updated audio group in which every scaling score is a whole-gear score. If the scaling scores of all audio data in the initial audio group are consistent, step 509 is executed to further judge whether all of those scaling scores are half-gear scaling scores.
Step 505, excluding the audio data with half-gear scaling scores from the initial audio group to obtain an updated audio group, where the scaling scores of all audio data in the updated audio group are whole-gear scaling scores.
For example, constructing the updated audio group effectively ignores the half-gear scaling scores in the initial audio group; whether the remaining scaling scores diverge is then analyzed further.
Step 506, judging whether the scaling scores of all audio data in the updated audio group are consistent; if they are consistent, executing step 507; if they are inconsistent, executing step 508.
Step 507, determining that there is no divergence in the scaling scores of the audio data contained in the updated audio group, and adjusting the scaling score of any audio data in the initial audio group that has a half-gear scaling score to the whole-gear scaling score shared by all audio data in the updated audio group.
For example, if the scaling scores of all audio data in the updated audio group are consistent, the scaling scores of the audio data in the updated audio group are judged free of divergence, and hence the initial audio group Gi is too; but when Gi contains half-gear scaling scores, those scores must be adjusted to the whole-gear scaling score used within Gi.
Step 508, determining that there is a divergence in the scaling scores of the audio data contained in the updated audio group, and further executing step 510.
For example, after the half-gear scaling scores in the initial audio group are ignored, if the scaling scores of all audio data in the updated audio group are inconsistent, the scaling scores of the audio data in the updated audio group are judged divergent, hence the initial audio group is divergent, and step 510 is executed to determine all audio data contained in the initial audio group as divergent audio data.
Step 509, determining whether all scaling scores of all audio data in the initial audio group are half-gear scaling scores; if yes, go to step 510; if not, step 511 is performed.
For example, even when the scaling scores of all audio data in an initial audio group Gi are consistent, it cannot yet be concluded that Gi is free of divergence; it must further be judged whether all of those scaling scores are half-gear scaling scores. If the scores are consistent but all of them are half-gear scores, Gi is still considered divergent and must be sent to a marking expert for re-evaluation adjustment.
Conversely, if the scaling scores in Gi are consistent and are not half-gear scores, i.e. all audio data in Gi share the same whole-gear score, Gi is considered free of divergence and no re-evaluation adjustment is required.
Step 510, determining all audio data contained in the initial audio group as divergent audio data.
For example, if the scaling scores of all audio data in the updated audio group are inconsistent, the scaling scores of the audio data in the updated audio group are judged divergent, and the initial audio group Gi is therefore considered divergent and must be sent to a marking expert for re-evaluation adjustment.
Likewise, if the scaling scores in Gi are consistent but are all half-gear scores, Gi is also considered divergent and requires re-evaluation adjustment.
Step 511, determining that there is no divergence in the initial audio group.
For example, if the scaling scores in Gi are consistent and are not half-gear scores, i.e. all audio data in Gi share the same whole-gear score, Gi is considered free of divergence and no re-evaluation adjustment is required.
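The decision flow of steps 504 to 511 condenses into a single function. This is a minimal sketch: it assumes each group is represented by the list of its scaling scores and reuses the hypothetical is_half_gear helper from the earlier snippet.

```python
def find_divergent_groups(groups: dict[int, list[float]],
                          gear: float = 0.5) -> tuple[list[int], dict[int, float]]:
    """Return (divergent group ids, {group id: adjusted whole-gear score}).

    groups maps a group id to the scaling scores of its audio data.
    """
    divergent: list[int] = []
    adjustments: dict[int, float] = {}
    for gid, scores in groups.items():
        if len(set(scores)) == 1:                  # step 504/509: consistent scores
            if is_half_gear(scores[0], gear):      # all half-gear -> divergent
                divergent.append(gid)              # step 510
            continue                               # step 511: no divergence
        whole = [s for s in scores if not is_half_gear(s, gear)]  # step 505
        if whole and len(set(whole)) == 1:         # steps 506-507: consistent
            adjustments[gid] = whole[0]            # half-gear scores get adjusted
        else:                                      # steps 508 and 510
            divergent.append(gid)
    return divergent, adjustments
```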
Step 230, sending the divergent audio data to a marking expert for re-evaluation adjustment, and re-determining the scaling scores of the divergent audio data according to the re-evaluation result obtained from the marking expert's re-evaluation adjustment.
For example, the scaling platform records the divergent audio groups G1′, G2′, …, GM′. Following the scheme described above, each piece of audio in an audio group Gj′ is distributed to graders, or the entire audio group Gj′ together with its corresponding scaling scores is sent to an arbitrator, and the scaling scores of the divergent audio data are re-determined according to the re-evaluation results obtained from the graders' and/or the arbitrator's re-evaluation adjustment.
After the graders and/or the arbitrator finish re-evaluating the divergent audio groups, if the scaling process is not yet complete, scaling tasks continue to be assigned to them, the audio clustering module continues to be invoked automatically at fixed intervals or according to the scaling progress, and the clustering, analysis, and re-evaluation workflow is repeated.
Optionally, the marking expert includes graders and a secondary arbitrator, and sending the divergent audio data to the marking expert for re-evaluation adjustment and re-determining the scaling scores of the divergent audio data according to the re-evaluation result includes:
sending the divergent audio data to graders who did not participate in the original scoring for re-scoring, so as to obtain new scaling scores of the divergent audio data;
judging whether the new scaling scores of all audio data in the divergent audio data are consistent;
if the new scaling scores of all audio data in the divergent audio data are inconsistent, sending the divergent audio data to the secondary arbitrator for re-evaluation adjustment, and re-determining the scaling score of each piece of audio data according to the final scaling scores obtained through the secondary arbitrator's re-evaluation adjustment; or
if the new scaling scores of all audio data in the divergent audio data are consistent, re-determining the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data.
For example, suppose there are 9 pieces of audio whose answer content is "My left arm stems," of which 7 have a scaling score of 1 and 2 have a scaling score of 0.5. The clustering algorithm will identify these 9 pieces of audio data as divergent and extract them. The 9 pieces of audio may each be distributed to graders who did not participate in the original scoring for re-scoring. If the re-scoring result of an audio in the re-scored cluster (the initial audio group) is inconsistent with its scaling result, the audio is sent to a secondary arbitrator for scoring.
Optionally, sending the divergent audio data to graders who did not participate in the original scoring for re-scoring, so as to obtain new scaling scores of all audio data in the divergent audio data, includes: dividing the divergent audio data into a first audio subgroup and a second audio subgroup according to their different scaling scores, where the first audio subgroup has fewer members than the second audio subgroup; sending each piece of audio data in the first audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain a new scaling score for each piece of audio data in the first audio subgroup; judging whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup; if they are inconsistent, sending each piece of audio data in the second audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain a new scaling score for each piece of audio data in the second audio subgroup; and obtaining the new scaling scores of the divergent audio data from the new scaling scores of the pieces of audio data in the first and second audio subgroups.
Optionally, after judging whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup, the method further includes: if they are consistent, re-determining the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data in the first audio subgroup.
For example, there is an optimization: the minority of audios whose scaling results disagree (the first audio subgroup) are re-scored first, and if the scaling results within the cluster (the initial audio group) are then all consistent, the majority of audios (the second audio subgroup) need not be re-scored, which effectively saves scaling cost. If the scaling results within the cluster are still inconsistent after the minority has been re-scored, the majority is re-scored as well; if, after all audio data in the cluster have been re-scored, the results remain inconsistent, the audios in the cluster must be sent to a secondary arbitrator for scoring. A sketch of this minority-first logic follows.
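In the sketch below, rescore is a hypothetical callback standing in for dispatching one audio to a grader who did not take part in the original scaling; the id-to-score dictionary representation is likewise an assumption.

```python
from typing import Callable

def rescore_group(scores: dict[str, float],
                  rescore: Callable[[str], float]) -> dict[str, float]:
    """Re-score the minority subgroup first; touch the majority only if needed.

    scores maps audio id -> current scaling score within one divergent group.
    """
    values = list(scores.values())
    majority = max(set(values), key=values.count)      # the majority score
    minority_ids = [a for a, s in scores.items() if s != majority]

    new_scores = dict(scores)
    for audio_id in minority_ids:                      # first audio subgroup
        new_scores[audio_id] = rescore(audio_id)

    if len(set(new_scores.values())) == 1:
        return new_scores                              # consistent: done cheaply

    for audio_id in scores:                            # second audio subgroup too
        if audio_id not in minority_ids:
            new_scores[audio_id] = rescore(audio_id)
    return new_scores  # if still inconsistent, hand the group to the arbitrator
```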
Optionally, the marking expert includes a secondary arbitrator, and sending the divergent audio data to the marking expert for re-evaluation adjustment and re-determining the scaling scores of the divergent audio data according to the re-evaluation result includes:
sending all audio data in the divergent audio data together with their corresponding scaling scores to the arbitrator for re-evaluation adjustment, and re-determining the scaling scores of the divergent audio data according to the final scaling score of each piece of audio data obtained through the arbitrator's re-evaluation adjustment.
For example, suppose again that there are 9 pieces of audio whose answer content is "My left arm stems," of which 7 have a scaling score of 1 and 2 have a scaling score of 0.5, so the clustering algorithm identifies the 9 pieces of audio data as divergent and extracts them. The 9 pieces of audio may be sent to the arbitrator together with their scaling scores, and the arbitrator makes the comparison decisions and score adjustments uniformly. The advantage of this scheme is that all divergent data are compared directly, so problems can be spotted and corrected more intuitively; the drawback is that the arbitrator's workload becomes correspondingly heavy.
In practical projects, an appropriate scheme can be selected by weighing the requirements on scaling quality, time, cost, and other factors.
After a divergent audio group is identified, besides distributing individual audios to graders or sending the entire audio group to the arbitrator as described above, a combination of the two schemes may also be employed.
By re-evaluating and adjusting the divergent scaling data, the amount of divergent scaling data is greatly reduced and the consistency of the scaling data is markedly improved, which means the quality of the scaling data is also greatly improved.
For example, the effect of the present application is illustrated with an answer-after-listening question type from an English listening and speaking examination in a certain region, comprising 4 sub-questions with 237 pieces of scaling data each.
After one of the sub-questions was clustered with the audio clustering module, 7 high-quality clusters were obtained, i.e. 7 audio groups covering 67 pieces of data. Of the 7 audio groups, one was divergent, involving 10 pieces of data; another group of 16 pieces of data showed no divergence, but 4 of its scaling scores were half-gear scores. After the divergent audio group was re-evaluated and adjusted according to the present scheme, the scaling scores of 3 audios changed; for the audio judged non-divergent but containing half-gear scores, the scaling scores were adjusted automatically according to the rules.
The same clustering, analysis, and re-evaluation operations were then performed on the other 3 sub-questions. A scoring model for the intelligent marking system was trained once with the original scaling data and once with the scaling data optimized by the present application, and each model was evaluated on a manually marked test set.
The test results show that after the quality of the scaling data is improved by the present method, the accuracy of the scoring model on the test set increases by 1.5 to 2.2 percentage points, 1.88 on average. Moreover, regardless of how well the original scaling data trained, the present application brings a stable and reliable improvement.
All the above technical solutions may be combined to form an optional embodiment of the present application, which is not described here in detail.
According to the embodiment of the application, scaling data corresponding to the target test question are obtained, the scaling data being audio data for which scaling scoring has been completed; the scaling data are clustered to find the divergent audio data among them, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ; the divergent audio data are then sent to a marking expert for re-evaluation adjustment, and their scaling scores are re-determined according to the re-evaluation result obtained from the marking expert's re-evaluation adjustment. During spoken-examination scaling, clustering the audio data that already carry scaling scores automatically surfaces the divergent audio data and their corresponding scaling scores, and sending these to the marking expert to determine their final scaling scores effectively reduces the inconsistency of the scaling data, improves its quality, and thereby improves the final scoring effect of the intelligent marking system for spoken examinations.
In order to facilitate better implementation of the audio data processing method of the embodiment of the application, the embodiment of the application also provides an audio data processing device. Referring to fig. 6, fig. 6 is a schematic structural diagram of an audio data processing device according to an embodiment of the present application. Wherein the audio data processing device 600 may include:
an obtaining unit 601, configured to obtain scaling data corresponding to a target test question, where the scaling data is audio data for which scaling scoring has been completed;
a clustering unit 602, configured to cluster the scaling data and find the divergent audio data among them, where the divergent audio data includes audio data whose answers to the same target test question are similar or identical but whose scaling scores differ;
a processing unit 603, configured to send the divergent audio data to a marking expert for re-evaluation adjustment, and to re-determine the scaling scores of the divergent audio data according to the re-evaluation result obtained from the marking expert's re-evaluation adjustment.
Optionally, the clustering unit 602 is specifically configured to: cluster the scaling data and aggregate audio data whose answers to the same target test question are similar or identical into the same initial audio group; judge whether the scaling scores of all audio data in the initial audio group are consistent; and, if they are inconsistent, determine all audio data contained in the initial audio group as divergent audio data.
Optionally, after judging whether the scaling scores of all audio data in the initial audio group are consistent, the clustering unit 602 may be further configured to: if the scaling scores are consistent, judge whether all of them are half-gear scaling scores; and, if all scaling scores of all audio data in the initial audio group are half-gear scaling scores, determine all audio data contained in the initial audio group as divergent audio data.
Optionally, after judging whether the scaling scores of all audio data in the initial audio group are consistent, the clustering unit 602 may be further configured to: if the scaling scores are inconsistent, exclude the audio data with half-gear scaling scores from the initial audio group to obtain an updated audio group in which every scaling score is a whole-gear score; judge whether the scaling scores of all audio data in the updated audio group are consistent; and, if they are consistent, determine that the scaling scores of the audio data in the updated audio group are not divergent, and adjust the scaling score of any audio data in the initial audio group that has a half-gear scaling score to the whole-gear scaling score shared by all audio data in the updated audio group.
Optionally, after judging whether the scaling scores of all audio data in the updated audio group are consistent, the clustering unit 602 may be further configured to: if they are inconsistent, determine that the scaling scores of the audio data in the updated audio group are divergent, and determine all audio data contained in the initial audio group as divergent audio data.
Optionally, when clustering the scaling data and aggregating audio data whose answers to the same target test question are similar or identical into the same initial audio group, the clustering unit 602 is specifically configured to: extract features of the scaling data to obtain its audio features and text features; and cluster those audio features and text features based on a preset algorithm, so that audio data with similar or identical answers to the same target test question are aggregated into the same initial audio group.
Optionally, before clustering the audio features and text features of the scaling data based on the preset algorithm, the clustering unit may be further configured to: perform feature merging and feature screening on the audio features and text features of the scaling data to remove invalid features from them.
Optionally, the marking expert includes graders and a secondary arbitrator, and when sending the divergent audio data to the marking expert for re-evaluation adjustment, the processing unit 603 may be configured to: send the divergent audio data to graders who did not participate in the original scoring for re-scoring, so as to obtain new scaling scores of the divergent audio data; judge whether the new scaling scores of all audio data in the divergent audio data are consistent; if they are inconsistent, send the divergent audio data to the secondary arbitrator for re-evaluation adjustment and re-determine the scaling score of each piece of audio data according to the final scaling scores obtained through the secondary arbitrator's re-evaluation adjustment; or, if they are consistent, re-determine the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data.
Optionally, when sending the divergent audio data to graders who did not participate in the original scoring for re-scoring, the processing unit 603 may be configured to: divide the divergent audio data into a first audio subgroup and a second audio subgroup according to their different scaling scores, the first audio subgroup having fewer members than the second; send each piece of audio data in the first audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain its new scaling score; judge whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup; if they are inconsistent, send each piece of audio data in the second audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain its new scaling score; and obtain the new scaling scores of the divergent audio data from the new scaling scores of the pieces of audio data in the first and second audio subgroups.
Optionally, after judging whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup, the processing unit 603 may be further configured to: if they are consistent, re-determine the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data in the first audio subgroup.
Optionally, the marking expert includes a secondary arbitrator, and when sending the divergent audio data to the marking expert for re-evaluation adjustment, the processing unit 603 may be configured to: send all audio data in the divergent audio data together with their corresponding scaling scores to the arbitrator for re-evaluation adjustment, and re-determine the scaling scores of the divergent audio data according to the final scaling score of each piece of audio data obtained through the arbitrator's re-evaluation adjustment.
Optionally, the obtaining unit 601 may be configured to: obtain the scaling data corresponding to the target test question at a preset fixed time; or obtain the scaling data corresponding to the target test question according to the scaling progress.
It should be noted that, for the functions of each module in the audio data processing device 600, reference may be made to the specific implementation of any of the above method embodiments, which is not repeated here.
Each of the above units in the audio data processing device may be implemented wholly or partly in software, hardware, or a combination of the two. The units may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke them to perform their corresponding operations.
The audio data processing device 600 may be integrated in a terminal or server that is equipped with a memory and a processor and has computing capability, or the audio data processing device 600 may itself be that terminal or server. The terminal can be a smartphone, tablet computer, notebook computer, smart television, smart speaker, wearable smart device, personal computer (PC), or similar device, and may further include a client, such as a video client, browser client, or instant messaging client. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
Fig. 7 is another schematic structural diagram of an audio data processing device according to an embodiment of the present application, and as shown in fig. 7, an audio data processing device 700 may include: a communication interface 701, a memory 702, a processor 703 and a communication bus 704. A communication interface 701, a memory 702, and a processor 703 communicate with each other via a communication bus 704. The communication interface 701 is used for data communication of the apparatus 700 with external devices. The memory 702 may be used to store software programs and modules, and the processor 703 may be configured to execute the software programs and modules stored in the memory 702, such as the software programs for corresponding operations in the foregoing method embodiments.
Alternatively, the processor 703 may invoke a software program and module stored in the memory 702 to perform the following operations: obtaining scaling data corresponding to a target test question, wherein the scaling data is audio data with scaling scoring completed; clustering the calibration data, and finding out the audio data with divergence in the calibration data, wherein the audio data with divergence comprises audio data with similar or same answers to the same target test question but different calibration scores; and sending the branched audio data to a scoring expert for re-evaluation adjustment, and re-determining the scaling score of the branched audio data according to the re-evaluation result obtained by the re-evaluation adjustment of the scoring expert.
Alternatively, the audio data processing device 700 may be integrated in a terminal or server that is equipped with a memory and a processor and has computing capability, or the audio data processing device 700 may itself be that terminal or server. The terminal can be a smartphone, tablet computer, notebook computer, smart television, smart speaker, wearable smart device, personal computer, or similar device. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
Optionally, the application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
The present application also provides a computer-readable storage medium for storing a computer program. The computer readable storage medium may be applied to a computer device, and the computer program causes the computer device to execute a corresponding flow in the audio data processing method in the embodiment of the present application, which is not described herein for brevity.
The present application also provides a computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes a corresponding flow in the audio data processing method in the embodiment of the present application, which is not described herein for brevity.
The present application also provides a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes a corresponding flow in the audio data processing method in the embodiment of the present application, which is not described herein for brevity.
It should be appreciated that the processor of an embodiment of the present application may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in software form. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It will be appreciated that the memory in embodiments of the present application may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct rambus RAM (DR RAM). The memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should be understood that the above memory is exemplary but not limiting; for example, the memory in the embodiments of the present application may also be static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), direct rambus RAM (DR RAM), and the like. That is, the memory in embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may essentially be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer or a server) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of audio data processing, the method comprising:
obtaining scaling data corresponding to a target test question, wherein the scaling data is audio data with scaling scoring completed;
clustering the scaling data to find the divergent audio data among them, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ, including: extracting features of the scaling data to obtain audio features and text features of the scaling data, clustering the audio features and text features of the scaling data based on a preset algorithm, and aggregating audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group; judging whether the scaling scores of all audio data in the initial audio group are consistent; and, if the scaling scores of all audio data in the initial audio group are inconsistent, determining all audio data contained in the initial audio group as divergent audio data;
sending the divergent audio data to a marking expert for re-evaluation adjustment, and re-determining the scaling scores of the divergent audio data according to the re-evaluation result obtained from the marking expert's re-evaluation adjustment, including:
the marking expert including graders and a secondary arbitrator, sending the divergent audio data to graders of the marking expert who did not participate in the original scoring for re-scoring, so as to obtain new scaling scores of the divergent audio data, specifically including: dividing the divergent audio data into a first audio subgroup and a second audio subgroup according to their different scaling scores, the first audio subgroup having fewer members than the second audio subgroup; sending each piece of audio data in the first audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain a new scaling score for each piece of audio data in the first audio subgroup; judging whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup; if they are inconsistent, sending each piece of audio data in the second audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain a new scaling score for each piece of audio data in the second audio subgroup; and obtaining the new scaling scores of the divergent audio data from the new scaling scores of the pieces of audio data in the first audio subgroup and the second audio subgroup;
judging whether the new scaling scores of all audio data in the divergent audio data are consistent;
if the new scaling scores of all audio data in the divergent audio data are inconsistent, sending the divergent audio data to the secondary arbitrator for re-evaluation adjustment, and re-determining the scaling score of each piece of audio data in the divergent audio data according to the final scaling scores obtained through the secondary arbitrator's re-evaluation adjustment; or, if the new scaling scores of all audio data in the divergent audio data are consistent, re-determining the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data in the divergent audio data.
2. The audio data processing method of claim 1, further comprising, after the judging whether the scaling scores of all audio data in the initial audio group are consistent:
if the scaling scores of all audio data in the initial audio group are consistent, judging whether all of those scaling scores are half-gear scaling scores;
and, if all scaling scores of all audio data in the initial audio group are half-gear scaling scores, determining all audio data contained in the initial audio group as divergent audio data.
3. The audio data processing method of claim 1, further comprising, after the judging whether the scaling scores of all audio data in the initial audio group are consistent:
if the scaling scores of all audio data in the initial audio group are inconsistent, excluding the audio data with half-gear scaling scores from the initial audio group to obtain an updated audio group, the scaling scores of all audio data in the updated audio group being whole-gear scaling scores;
judging whether the scaling scores of all audio data in the updated audio group are consistent;
and, if the scaling scores of all audio data in the updated audio group are consistent, determining that the scaling scores of the audio data contained in the updated audio group are not divergent, and adjusting the scaling score of any audio data in the initial audio group that has a half-gear scaling score to the whole-gear scaling score shared by all audio data in the updated audio group.
4. The audio data processing method of claim 3, further comprising, after the judging whether the scaling scores of all audio data in the updated audio group are consistent:
if the scaling scores of all audio data in the updated audio group are inconsistent, determining that the scaling scores of the audio data contained in the updated audio group are divergent, and determining all audio data contained in the initial audio group as divergent audio data.
5. The audio data processing method of claim 1, further comprising, before the clustering of the audio features and text features of the scaling data based on a preset algorithm:
performing feature merging and feature screening on the audio features and text features of the scaling data to remove invalid features from them.
6. The audio data processing method of claim 1, further comprising, after the judging whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup:
if the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup, re-determining the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data in the first audio subgroup.
7. The audio data processing method of claim 1, wherein the obtaining of the scaling data corresponding to the target test question comprises:
obtaining the scaling data corresponding to the target test question at a preset fixed time; or
obtaining the scaling data corresponding to the target test question according to the scaling progress.
8. An audio data processing device, the device comprising:
an obtaining unit, configured to obtain scaling data corresponding to a target test question, the scaling data being audio data for which scaling scoring has been completed;
a clustering unit, configured to cluster the scaling data and find the divergent audio data among them, the divergent audio data comprising audio data whose answers to the same target test question are similar or identical but whose scaling scores differ, including: extracting features of the scaling data to obtain audio features and text features of the scaling data, clustering the audio features and text features of the scaling data based on a preset algorithm, and aggregating audio data in the scaling data whose answers to the same target test question are similar or identical into the same initial audio group; judging whether the scaling scores of all audio data in the initial audio group are consistent; and, if the scaling scores of all audio data in the initial audio group are inconsistent, determining all audio data contained in the initial audio group as divergent audio data;
a processing unit, configured to send the divergent audio data to a marking expert for re-evaluation adjustment, and to re-determine the scaling scores of the divergent audio data according to the re-evaluation result obtained from the marking expert's re-evaluation adjustment, including:
the marking expert including graders and a secondary arbitrator, sending the divergent audio data to graders of the marking expert who did not participate in the original scoring for re-scoring, so as to obtain new scaling scores of the divergent audio data, specifically including: dividing the divergent audio data into a first audio subgroup and a second audio subgroup according to their different scaling scores, the first audio subgroup having fewer members than the second audio subgroup; sending each piece of audio data in the first audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain a new scaling score for each piece of audio data in the first audio subgroup; judging whether the new scaling scores of all audio data in the first audio subgroup are consistent with the initial scaling scores of all audio data in the second audio subgroup; if they are inconsistent, sending each piece of audio data in the second audio subgroup to a grader who did not participate in the original scoring for re-scoring, so as to obtain a new scaling score for each piece of audio data in the second audio subgroup; and obtaining the new scaling scores of the divergent audio data from the new scaling scores of the pieces of audio data in the first audio subgroup and the second audio subgroup;
judging whether the new scaling scores of all audio data in the divergent audio data are consistent;
if the new scaling scores of all audio data in the divergent audio data are inconsistent, sending the divergent audio data to the secondary arbitrator for re-evaluation adjustment, and re-determining the scaling score of each piece of audio data in the divergent audio data according to the final scaling scores obtained through the secondary arbitrator's re-evaluation adjustment; or, if the new scaling scores of all audio data in the divergent audio data are consistent, re-determining the scaling scores of the divergent audio data according to the new scaling score of each piece of audio data in the divergent audio data.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is adapted to be loaded by a processor for performing the steps in the audio data processing method according to any of claims 1-7.
10. A computer device, characterized in that it comprises a processor and a memory, in which a computer program is stored, the processor being arranged to perform the steps of the audio data processing method according to any of claims 1-7 by calling the computer program stored in the memory.
CN202111266011.0A 2021-10-28 2021-10-28 Audio data processing method, device, storage medium, equipment and program product Active CN114329040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111266011.0A CN114329040B (en) 2021-10-28 2021-10-28 Audio data processing method, device, storage medium, equipment and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111266011.0A CN114329040B (en) 2021-10-28 2021-10-28 Audio data processing method, device, storage medium, equipment and program product

Publications (2)

Publication Number Publication Date
CN114329040A (en) 2022-04-12
CN114329040B (en) 2024-02-20

Family

ID=81045326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111266011.0A Active CN114329040B (en) 2021-10-28 2021-10-28 Audio data processing method, device, storage medium, equipment and program product

Country Status (1)

Country Link
CN (1) CN114329040B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978971A (en) * 2014-04-08 2015-10-14 安徽科大讯飞信息科技股份有限公司 Oral evaluation method and system
CN104464423A (en) * 2014-12-19 2015-03-25 科大讯飞股份有限公司 Calibration optimization method and system for speaking test evaluation
KR20200119358A (en) * 2019-03-15 2020-10-20 (주) 데이터뱅크 System for providing tofel course recommendation service using artificial intelligence machine learning based automatic evaluation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic scoring of retelling questions for large-scale computer-based English speaking tests; Yan Ke et al.; Journal of Tsinghua University (Science and Technology); Vol. 49, No. S1; pp. 110-116 *

Also Published As

Publication number Publication date
CN114329040A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN106663383B (en) Method and system for analyzing a subject
US11908483B2 (en) Inter-channel feature extraction method, audio separation method and apparatus, and computing device
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN109313892A (en) Steady language identification method and system
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN111192659A (en) Pre-training method for depression detection and depression detection method and device
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN111128240B (en) Voice emotion recognition method based on anti-semantic-erasure
Liu et al. Audio and video bimodal emotion recognition in social networks based on improved alexnet network and attention mechanism.
EP4276822A1 (en) Method and apparatus for processing audio, electronic device and storage medium
CN114329040B (en) Audio data processing method, device, storage medium, equipment and program product
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN114999633A (en) Depression identification method and system based on multi-mode shared vector space
Gomes Implementation of i-vector algorithm in speech emotion recognition by using two different classifiers: Gaussian mixture model and support vector machine
Ashenfelter Simultaneous analysis of verbal and nonverbal data during conversation: symmetry and turn-taking
CN112951270A (en) Voice fluency detection method and device and electronic equipment
CN114863939B (en) Panda attribute identification method and system based on sound
CN112017690B (en) Audio processing method, device, equipment and medium
ElGohary et al. Interactive Virtual Rehabilitation for Aphasic Arabic-Speaking Patients
CN115186083A (en) Data processing method, device, server, storage medium and product
Barbany Mayor Speech Style Transfer
Bhimanwar et al. Deep Learning Approaches for English-Marathi Code-Switched Detection
DE102023127746A1 (en) HYBRID LANGUAGE MODELS FOR CONVERSATION SYSTEMS AND APPLICATIONS WITH ARTIFICIAL INTELLIGENCE

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant