CN117746901A - Deep learning-based primary and secondary school performance scoring method and system - Google Patents

Deep learning-based primary and secondary school performance scoring method and system

Info

Publication number
CN117746901A
Authority
CN
China
Prior art keywords
noise
student
sequence
sample
performance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311758565.1A
Other languages
Chinese (zh)
Inventor
刘云光
尚兴年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guanghui Interactive Network Technology Co ltd
Original Assignee
Nanjing Guanghui Interactive Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Guanghui Interactive Network Technology Co ltd filed Critical Nanjing Guanghui Interactive Network Technology Co ltd
Priority to CN202311758565.1A priority Critical patent/CN117746901A/en
Publication of CN117746901A publication Critical patent/CN117746901A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a deep learning-based method and system for scoring primary and secondary school music performances. The method comprises the following steps: S1, collecting student recordings for noise extraction and training; S2, collecting the playing sounds of the corresponding instruments as training samples, randomly screening audio from students' examination performances as test samples, and training the output model structure of a convolutional neural network; S3, processing the student recordings, the xml files of the performance questions, and the standard overtone samples; and S4, scoring with a multi-dimensional evaluation module and summing its scores according to their weights through an output module to obtain the student's final instrument performance score. The deep learning-based music performance scoring system provided by the invention overcomes a number of defects in the prior art, improving the efficiency and accuracy of scoring through an automated and intelligent method.

Description

Deep learning-based primary and secondary school performance scoring method and system
Technical Field
The invention relates to the technical field of deep learning, and in particular to a deep learning-based performance scoring method and system for primary and secondary schools.
Background
Conventional assessment systems for primary and secondary school education, and musical performance scoring in particular, generally rely on the subjective judgment of human examiners; this is time-consuming and susceptible to personal preference and mood, resulting in inconsistent and inaccurate scores. In addition, conventional scoring methods are limited in capturing and evaluating the finer aspects of a performance, such as intonation, rhythm, timbre, and expressiveness. The dynamic and emotional expression of a performance, which is essential to a comprehensive understanding and evaluation of music, is often ignored. Conventional methods are also limited in providing comprehensive, objective, and consistent evaluation, especially when dealing with the performances of a large number of students. How to provide a deep learning-based performance scoring method and system for primary and secondary schools is therefore a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a deep learning-based music performance scoring method and system for primary and secondary schools that overcome a number of defects in the prior art and improve the efficiency and accuracy of scoring through an automated and intelligent approach.
According to an embodiment of the invention, a deep learning-based primary and secondary school performance scoring system comprises:
a training module, used to perform deep learning training on the sounds of various musical instruments and on environmental noise to obtain an instrument classification model and a noise reduction model, and to transmit both models to the preprocessing module;
a preprocessing module, used to process student performance audio and track information recorded in a specific environment; the processing, combining the instrument classification model and the noise reduction model, includes AI noise reduction, signal enhancement, signal conversion, and normalization of the performance audio, together with xml parsing of the midi and overtone data;
a multi-dimensional evaluation module, used to evaluate the preprocessed student performance audio against the track information along multiple dimensions, including instrument type, pitch, rhythm, integrity, and emotion scores;
an identification module, used to intelligently identify and classify the instrument used in the performance audio;
and an output module, used to apply a weighted combination to the scores of each evaluation dimension and the instrument identification result, and to output the final student performance score.
Optionally, the method comprises the following steps:
S1, collecting student recordings for noise extraction and training;
S2, collecting the playing sounds of the corresponding instruments as training samples, randomly screening audio from students' examination performances as test samples, and training the output model structure of a convolutional neural network;
S3, processing the student recordings, the xml files of the performance questions, and the standard overtone samples;
and S4, scoring with the multi-dimensional evaluation module and summing its scores according to their weights through the output module to obtain the student's final instrument performance score.
Optionally, S1 specifically includes:
S11, converting the student recordings into mono 16-bit wav audio format;
S12, performing noise extraction and classification labeling on the student recordings (noise 1, noise 2, ... noise n) to obtain noise samples;
S13, classifying and labeling the clean student recordings (clean tone 1, clean tone 2, ... clean tone n) to obtain clean tone samples;
S14, randomly combining the noise samples and clean tone samples with set signal-to-noise ratios to generate noisy audio segments, which are fed into the model so that it quickly fits the characteristics of various noise environments.
Optionally, S14 specifically includes cutting the clean tone samples to obtain clean-tone frame segments, randomly selecting one noise for each frame segment for noise training, and setting a random signal-to-noise ratio to obtain frame signals containing different noise types and intensities:
purgain = rand(-24, 24) dB;
noisegain = rand(-12, 12) dB;
noiseFrame_n = f_n * purgain + noise[rand(0, noise_len)][0:frame_len] * noisegain;
where noiseFrame_n denotes the nth noisy frame signal, f_n denotes the nth clean audio frame, purgain and noisegain denote the random gain values of the clean tone and the noise respectively, noise_len denotes the length of the noise class, and frame_len denotes the frame length;
this is executed in a loop over all clean-tone frame segments, yielding the full set of randomly generated frame signals with different noise types and intensities.
Optionally, the noise training specifically adopts a recurrent neural network scheme, with keras and tensorflow as the deep learning framework and librosa for audio processing. DCT transforms are applied to the noisy frame data and the clean tone data to obtain BFCC (Bark-frequency cepstral coefficient) features, and the MFCC features, spectral centroid, spectral attenuation, and fundamental frequency period coefficients are calculated. These are fed into a training network composed of dense layers and GRU layers and trained repeatedly, yielding a noise-filtering network model that is stored in the training module for subsequent testing and operation.
Optionally, S2 specifically includes:
S21, processing the training samples and test samples into 16-bit, mono, 16 kHz wav files and cutting them to obtain the data structure of a sample unit;
S22, performing MFCC (Mel-frequency cepstral coefficient) conversion on the data structure to obtain a sample-unit data structure sized by the number of Mel bands, and batching it to obtain a single training-sample data structure;
S23, performing one-hot conversion on the category of each instrument sample unit and expanding it to a form consistent with the sample-unit data structure;
S24, processing the training data with a convolutional neural network architecture: the convolutional layers reduce the dimensionality of the data feature map, which is then fed into a dense layer for prediction, yielding the final prediction data and thereby the output model structure of the convolutional neural network.
Optionally, the xml file of the performance question is a score information file exported from the midi produced with professional notation software, and the overtone sample is sample audio recorded on site by the person who set the performance question, used as the standard for judging performance emotion.
Optionally, S3 specifically includes:
S31, reducing the noise of the student recording with the training module of S1 and enhancing the signal of the noise-reduced student audio with a band-pass filter to obtain the final preprocessed student recording;
S32, parsing the xml data file of the performance question to obtain basic track information, including the number of bars, the beats per bar, the beat duration, the key, and the duration and frequency sequence of the notes, filling or compressing the xml data, and converting the xml information into a sequence of note frequencies in the time domain: XmlSeq;
S33, extracting the fundamental frequency sequence from the noise-reduced student wav audio with the pyin technique to obtain a frequency sequence on the same time axis as the xml: StuSeq;
S34, performing loudness extraction on the standard overtone sample to obtain the amplitude data of each sampling point:
sample_loundness1, sample_loundness2, ... sample_loundnessn;
yielding the sound amplitude sequence over the overtone time domain;
S35, applying the same processing to the student audio to obtain the amplitude sequence over the student time domain.
Optionally, the integrity evaluation takes the XmlSeq standard score sequence as a reference and compares the student performance sequence against it: a mask is taken of the student performance sequence StuSeq, the silence mask is extracted, and the distribution of the silence mask is calculated:
the parts that are silent in the XmlSeq standard score sequence itself are filtered out;
breath recognition is performed on the small gaps of the filtered silence sequence; if their distribution is periodic and the intervals fall within a preset range, they are counted as normal performance breathing and excluded from the deduction;
the bars and beat points of the finally filtered silence sequence are archived, and the ratio of consecutive lost frames to the total number of frames of a beat point is taken as the deduction interval (losing all frames of a beat point counts as a full deduction), which is finally multiplied by the deduction coefficient to obtain the integrity deduction;
The pitch evaluation takes the XmlSeq standard score sequence as a reference and compares the student performance sequence against it. The XmlSeq and StuSeq sequences are normalized so that both value ranges are fixed to the same frequency dimension; if the overall normalization is abnormal, the sequences are corrected by bar-by-bar normalization instead. The two sequences are segmented by bar and beat point, each segment is aligned with the DTW algorithm, and the bar-aligned sequences are compared frame by frame for pitch:
the frame sequence of a standard beat point is xf1, xf2, xf3, ..., xfn;
the student beat-point frame sequence is sf1, sf2, sf3, ..., sfn;
the standard mean square deviation and mean of the two sequences are compared:
std_xf = sqrt((1/n) * Σ(xf_i - mean_xf)²), std_sf = sqrt((1/n) * Σ(sf_i - mean_sf)²);
where std_xf and std_sf denote the sequence standard mean square deviation of a beat point for the standard score and the student performance respectively (the closer the two deviations, the more stable the performance), and mean_xf and mean_sf denote the sequence means (the closer the two means, the more accurate the pitch); a negative result of mean_xf - mean_sf is regarded as a pitch decrease, and a positive result as a pitch increase;
deduction thresholds are set for pitch stability and accuracy: a stability deduction is applied when the absolute difference of the standard mean square deviations falls within the set range, and an accuracy deduction is applied when the absolute difference of the means falls within the preset multiple of the standard mean; the two indices are multiplied by the deduction coefficient and the index weight ratio to obtain the pitch deduction for each beat point;
The rhythm evaluation uses the sequences processed in the same way as for pitch: a DTW path comparison is performed directly on the aligned XmlSeq and StuSeq sequences to obtain the positional distribution of the frame offset of each beat point, giving a frame offset sequence:
shf1, shf2, shf3, ..., shfn;
the frame offsets are archived by bar, and the standard mean square deviation of the frame offsets of each bar is calculated:
std_shf = sqrt((1/n) * Σ(shf_i - mean_shf)²);
the smaller std_shf is, the more stable the rhythm deviation of the current bar; at the same time the mean of the bar's rhythm deviation, mean_shf, is calculated, representing the average number of frames by which the bar's rhythm points are offset. When mean_shf is within 10% of the number of frames per beat point, the current bar's playing points are regarded as accurate; outside that range a deduction is applied. The rhythm evaluation thus also comprises two indices, rhythm stability and rhythm accuracy, and the deduction for each bar's rhythm is obtained from the index configuration and the weighted deduction coefficients.
Optionally, following the DTW calibration from the integrity, pitch, and rhythm steps, the overtone loudness reference sequence and the student performance loudness sequence are recalibrated in the time domain against the overtones after the silence masks are removed, giving the sound intensity comparison sequences for each bar and beat. The two calibrated intensity sequences are normalized to a decibel scale in [0,1], and the sampling points of each beat are archived to obtain the intensity subsequence of each beat:
batch1:[ln1,ln2,ln3,...lnn];
The per-beat intensity sequences of the overtones and of the student audio are obtained:
sample_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
stu_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
The two sequences are quantified simultaneously for intensity trend and intensity difference. The intensity change trend of each beat and the beat-to-beat change difference can be compared directly: a rising trend is represented by 1, a falling trend by -1, and a flat trend by 0, and an AND operation and a multiplication are applied to each beat of the two sequences:
result_up_i = sample_up_i AND stu_up_i * (sample_up_i * stu_up_i);
where result_up_i represents the result for the ith beat: a non-negative value means the intensity change trends of the two audios agree, and a negative value means they are opposite;
for the intensity difference, the difference between the beat-wise intensity means is calculated directly:
result_sample_shift_i = sample_shift_i - sample_shift_(i-1);
the same serialized operation is applied to the student audio:
result_stu_shift_i = stu_shift_i - stu_shift_(i-1);
this gives the sequences of differences between intensity beat points; a point-wise difference of the two sequences is taken and the standard mean square error of the result is computed, giving the overall trend of intensity amplitude change across the beat points:
shift_diff_i = result_sample_shift_i - result_stu_shift_i;
MSE_shift = (1/n) * Σ shift_diff_i²;
where MSE_shift is the standard mean square error of the intensity amplitude change over all beat points of the whole piece: the smaller the result, the more consistent the overall intensity change trends of the two pieces.
The beneficial effects of the invention are as follows:
The deep learning-based music performance scoring system provided by the invention overcomes a number of defects in the prior art, improving the efficiency and accuracy of scoring through an automated and intelligent method. The system can comprehensively analyze the audio and objectively evaluate multiple dimensions such as pitch, rhythm, and timbre, thereby providing a more comprehensive and consistent evaluation. This reduces the subjectivity and time cost of manual scoring, making the assessment process fairer and more efficient. The system is especially beneficial for large-scale evaluation of student performances and can bring higher quality and standards to primary and secondary school music education.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a system flowchart of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 2 is a flowchart of noise training in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 3 is a flowchart of the output model structure of the network in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 4 is a flowchart of instrument identification in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 5 is a flowchart of clipping compensation in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 6 is a graph of the integrity evaluation results in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 7 is a graph of the pitch evaluation results in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention;
FIG. 8 is a graph of the rhythm evaluation results in embodiment 1 of the deep learning-based primary and secondary school performance scoring method and system of the invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to FIG. 1, a deep learning-based primary and secondary school performance scoring system comprises:
a training module, used to perform deep learning training on the sounds of various musical instruments and on environmental noise to obtain an instrument classification model and a noise reduction model, and to transmit both models to the preprocessing module;
a preprocessing module, used to process student performance audio and track information recorded in a specific environment; the processing, combining the instrument classification model and the noise reduction model, includes AI noise reduction, signal enhancement, signal conversion, and normalization of the performance audio, together with xml parsing of the midi and overtone data;
a multi-dimensional evaluation module, used to evaluate the preprocessed student performance audio against the track information along multiple dimensions, including instrument type, pitch, rhythm, integrity, and emotion scores;
an identification module, used to intelligently identify and classify the instrument used in the performance audio;
and an output module, used to apply a weighted combination to the scores of each evaluation dimension and the instrument identification result, and to output the final student performance score.
In this embodiment, the method comprises the following steps:
S1, collecting student recordings for noise extraction and training;
S2, collecting the playing sounds of the corresponding instruments as training samples, randomly screening audio from students' examination performances as test samples, and training the output model structure of a convolutional neural network;
S3, processing the student recordings, the xml files of the performance questions, and the standard overtone samples;
and S4, scoring with the multi-dimensional evaluation module and summing its scores according to their weights through the output module to obtain the student's final instrument performance score.
In this embodiment, S1 specifically includes:
S11, converting the student recordings into mono 16-bit wav audio format;
S12, performing noise extraction and classification labeling on the student recordings (noise 1, noise 2, ... noise n) to obtain noise samples;
S13, classifying and labeling the clean student recordings (clean tone 1, clean tone 2, ... clean tone n) to obtain clean tone samples;
S14, randomly combining the noise samples and clean tone samples with set signal-to-noise ratios to generate noisy audio segments, which are fed into the model so that it quickly fits the characteristics of various noise environments.
In this embodiment, S14 specifically includes cutting the clean tone samples to obtain clean-tone frame segments, randomly selecting one noise for each frame segment for noise training, and setting a random signal-to-noise ratio to obtain frame signals containing different noise types and intensities:
purgain = rand(-24, 24) dB;
noisegain = rand(-12, 12) dB;
noiseFrame_n = f_n * purgain + noise[rand(0, noise_len)][0:frame_len] * noisegain;
where noiseFrame_n denotes the nth noisy frame signal, f_n denotes the nth clean audio frame, purgain and noisegain denote the random gain values of the clean tone and the noise respectively, noise_len denotes the length of the noise class, and frame_len denotes the frame length;
this is executed in a loop over all clean-tone frame segments, yielding the full set of randomly generated frame signals with different noise types and intensities.
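As an illustration, a minimal Python sketch of this random mixing step is given below; the dB-to-linear gain conversion, the frame length handling, and the layout of the noise clip list are assumptions, since the text only specifies the gain ranges and the mixing formula.

import random

def make_noisy_frame(clean_frame, noise_clips, frame_len):
    """Mix one clean-tone frame with a randomly chosen noise clip at a random SNR."""
    # clean_frame and the entries of noise_clips are assumed to be numpy float arrays.
    purgain_db = random.uniform(-24, 24)            # rand(-24, 24) dB gain for the clean tone
    noisegain_db = random.uniform(-12, 12)          # rand(-12, 12) dB gain for the noise
    purgain = 10 ** (purgain_db / 20)               # convert dB gains to linear factors
    noisegain = 10 ** (noisegain_db / 20)
    noise = random.choice(noise_clips)              # pick one labeled noise clip at random
    start = random.randint(0, len(noise) - frame_len)
    noise_seg = noise[start:start + frame_len]
    return clean_frame * purgain + noise_seg * noisegain

Looping this function over all clean-tone frame segments reproduces the random generation of noisy frames described above.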
Referring to fig. 2, in this embodiment the noise training adopts a recurrent neural network scheme, with keras and tensorflow as the deep learning framework and librosa for audio processing. DCT transforms are applied to the noisy frame data and the clean tone data to obtain BFCC (Bark-frequency cepstral coefficient) features, and the MFCC features, spectral centroid, spectral attenuation, and fundamental frequency period coefficients are calculated. These are fed into a training network composed of dense layers and GRU layers and trained repeatedly, yielding a noise-filtering network model that is stored in the training module for subsequent testing and operation.
In this embodiment, S2 specifically includes:
S21, processing the training samples and test samples into 16-bit, mono, 16 kHz wav files and cutting them to obtain the data structure of a sample unit;
S22, performing MFCC (Mel-frequency cepstral coefficient) conversion on the data structure to obtain a sample-unit data structure sized by the number of Mel bands, and batching it to obtain a single training-sample data structure;
S23, performing one-hot conversion on the category of each instrument sample unit and expanding it to a form consistent with the sample-unit data structure;
S24, processing the training data with a convolutional neural network architecture: the convolutional layers reduce the dimensionality of the data feature map, which is then fed into a dense layer for prediction, yielding the final prediction data and thereby the output model structure of the convolutional neural network.
In this embodiment, the xml file of the performance question is a score information file exported from the midi produced with professional notation software, and the overtone sample is sample audio recorded on site by the person who set the performance question, used as the standard for judging performance emotion.
In this embodiment, S3 specifically includes:
S31, reducing the noise of the student recording with the training module of S1 and enhancing the signal of the noise-reduced student audio with a band-pass filter to obtain the final preprocessed student recording;
S32, parsing the xml data file of the performance question to obtain basic track information, including the number of bars, the beats per bar, the beat duration, the key, and the duration and frequency sequence of the notes, filling or compressing the xml data, and converting the xml information into a sequence of note frequencies in the time domain: XmlSeq;
S33, extracting the fundamental frequency sequence from the noise-reduced student wav audio with the pyin technique to obtain a frequency sequence on the same time axis as the xml: StuSeq;
S34, performing loudness extraction on the standard overtone sample to obtain the amplitude data of each sampling point:
sample_loundness1, sample_loundness2, ... sample_loundnessn;
yielding the sound amplitude sequence over the overtone time domain;
S35, applying the same processing to the student audio to obtain the amplitude sequence over the student time domain.
In this embodiment, the integrity evaluation takes the XmlSeq standard score sequence as a reference and compares the student performance sequence against it: a mask is taken of the student performance sequence StuSeq, the silence mask is extracted, and the distribution of the silence mask is calculated:
the parts that are silent in the XmlSeq standard score sequence itself are filtered out;
breath recognition is performed on the small gaps of the filtered silence sequence; if their distribution is periodic and the intervals fall within a preset range, they are counted as normal performance breathing and excluded from the deduction;
the bars and beat points of the finally filtered silence sequence are archived, and the ratio of consecutive lost frames to the total number of frames of a beat point is taken as the deduction interval (losing all frames of a beat point counts as a full deduction), which is finally multiplied by the deduction coefficient to obtain the integrity deduction;
The pitch evaluation takes the XmlSeq standard score sequence as a reference and compares the student performance sequence against it. The XmlSeq and StuSeq sequences are normalized so that both value ranges are fixed to the same frequency dimension; if the overall normalization is abnormal, the sequences are corrected by bar-by-bar normalization instead. The two sequences are segmented by bar and beat point, each segment is aligned with the DTW algorithm, and the bar-aligned sequences are compared frame by frame for pitch:
the frame sequence of a standard beat point is xf1, xf2, xf3, ..., xfn;
the student beat-point frame sequence is sf1, sf2, sf3, ..., sfn;
the standard mean square deviation and mean of the two sequences are compared:
std_xf = sqrt((1/n) * Σ(xf_i - mean_xf)²), std_sf = sqrt((1/n) * Σ(sf_i - mean_sf)²);
where std_xf and std_sf denote the sequence standard mean square deviation of a beat point for the standard score and the student performance respectively (the closer the two deviations, the more stable the performance), and mean_xf and mean_sf denote the sequence means (the closer the two means, the more accurate the pitch); a negative result of mean_xf - mean_sf is regarded as a pitch decrease, and a positive result as a pitch increase;
deduction thresholds are set for pitch stability and accuracy: a stability deduction is applied when the absolute difference of the standard mean square deviations falls within the set range, and an accuracy deduction is applied when the absolute difference of the means falls within the preset multiple of the standard mean; the two indices are multiplied by the deduction coefficient and the index weight ratio to obtain the pitch deduction for each beat point;
The rhythm evaluation uses the sequences processed in the same way as for pitch: a DTW path comparison is performed directly on the aligned XmlSeq and StuSeq sequences to obtain the positional distribution of the frame offset of each beat point, giving a frame offset sequence:
shf1, shf2, shf3, ..., shfn;
the frame offsets are archived by bar, and the standard mean square deviation of the frame offsets of each bar is calculated:
std_shf = sqrt((1/n) * Σ(shf_i - mean_shf)²);
the smaller std_shf is, the more stable the rhythm deviation of the current bar; at the same time the mean of the bar's rhythm deviation, mean_shf, is calculated, representing the average number of frames by which the bar's rhythm points are offset. When mean_shf is within 10% of the number of frames per beat point, the current bar's playing points are regarded as accurate; outside that range a deduction is applied. The rhythm evaluation thus also comprises two indices, rhythm stability and rhythm accuracy, and the deduction for each bar's rhythm is obtained from the index configuration and the weighted deduction coefficients.
In this embodiment, following the DTW calibration from the integrity, pitch, and rhythm steps, the overtone loudness reference sequence and the student performance loudness sequence are recalibrated in the time domain against the overtones after the silence masks are removed, giving the sound intensity comparison sequences for each bar and beat. The two calibrated intensity sequences are normalized to a decibel scale in [0,1], and the sampling points of each beat are archived to obtain the intensity subsequence of each beat:
batch1:[ln1,ln2,ln3,...lnn];
The per-beat intensity sequences of the overtones and of the student audio are obtained:
sample_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
stu_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
The two sequences are quantified simultaneously for intensity trend and intensity difference. The intensity change trend of each beat and the beat-to-beat change difference can be compared directly: a rising trend is represented by 1, a falling trend by -1, and a flat trend by 0, and an AND operation and a multiplication are applied to each beat of the two sequences:
result_up_i = sample_up_i AND stu_up_i * (sample_up_i * stu_up_i);
where result_up_i represents the result for the ith beat: a non-negative value means the intensity change trends of the two audios agree, and a negative value means they are opposite;
for the intensity difference, the difference between the beat-wise intensity means is calculated directly:
result_sample_shift_i = sample_shift_i - sample_shift_(i-1);
the same serialized operation is applied to the student audio:
result_stu_shift_i = stu_shift_i - stu_shift_(i-1);
this gives the sequences of differences between intensity beat points; a point-wise difference of the two sequences is taken and the standard mean square error of the result is computed, giving the overall trend of intensity amplitude change across the beat points:
shift_diff_i = result_sample_shift_i - result_stu_shift_i;
MSE_shift = (1/n) * Σ shift_diff_i²;
where MSE_shift is the standard mean square error of the intensity amplitude change over all beat points of the whole piece: the smaller the result, the more consistent the overall intensity change trends of the two pieces.
Example 1:
Referring to fig. 6, the integrity evaluation takes the XmlSeq standard score sequence as a reference and compares the student performance sequence against it: a mask is taken of the student performance sequence StuSeq, the silence mask is extracted, and its distribution is calculated. 1. The parts that are silent in the XmlSeq standard score sequence itself are filtered out, since the score may contain rest beats and placeholders (parts that were never meant to be played); the filtered silence sequence is the set of potential integrity deduction points. 2. Breath recognition is performed on the small gaps of the filtered silence sequence. Wind instruments require the student to take a breath at the end of some bars (non-wind instruments do not), so in combination with the identified instrument type, the distribution of silence at bar endings is extracted when a wind instrument is played; if the distribution is periodic and the intervals lie between 250 ms and 1000 ms, the silence is counted as normal performance breathing and excluded from the deduction. 3. The bars and beat points of the finally filtered silence sequence are archived, and the ratio of consecutive lost frames to the total number of frames of a beat point, within the interval [0.5, 1], is taken as the deduction (frame-loss) interval, with all frames of a beat point lost counting as a full deduction; multiplying by the deduction coefficient finally gives the integrity deduction: integrity_dp.
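A simplified sketch of this integrity deduction is given below. It is only an illustration under assumptions: the breathing test is reduced to the 250 ms to 1000 ms gap check, silence is detected as 0 Hz frames, and the deduction coefficient is a configurable parameter.

import numpy as np

def longest_true_run(mask):
    best = cur = 0
    for v in mask:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best

def integrity_deduction(stu_seq, xml_seq, frames_per_beat, frame_ms,
                        is_wind_instrument, deduction_coeff=1.0):
    """Deduct for beats where the student is silent although the score expects sound."""
    silent = (np.asarray(stu_seq) == 0) & (np.asarray(xml_seq) != 0)
    breath_min, breath_max = 250 / frame_ms, 1000 / frame_ms   # allowed breathing gap, in frames
    deduction = 0.0
    n_beats = len(xml_seq) // frames_per_beat
    for b in range(n_beats):
        seg = silent[b * frames_per_beat:(b + 1) * frames_per_beat]
        lost = longest_true_run(seg)                # consecutive lost frames in this beat
        if is_wind_instrument and breath_min <= lost <= breath_max:
            continue                                # short periodic gaps count as breathing
        ratio = lost / frames_per_beat              # all frames lost -> full deduction
        if ratio >= 0.5:                            # the [0.5, 1] frame-loss interval
            deduction += ratio * deduction_coeff
    return deduction                                # integrity_dp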
Referring to fig. 7, the pitch evaluation takes the XmlSeq standard score sequence as a reference and compares the student performance sequence against it. The XmlSeq and StuSeq sequences are normalized so that both value ranges are fixed to the same frequency dimension, i.e. the vertical-axis values all lie in the 0 to 1 range. This makes the pitch-difference evaluation more reasonable and intuitive and prevents octave shifts of the instrument's pitch, caused by different playing or blowing strengths, from affecting the pitch comparison; the normalization makes the algorithm more robust. Note that during an actual performance a student may use different octaves before and after a certain point (blowing harder suddenly or changing the playing register), which makes the sequences asymmetric under a single global normalization and affects the objective pitch comparison. For this case the algorithm applies bar-by-bar normalization as a correction, and if the correction is abnormal the sequence is processed entirely in bar-by-bar normalized form.
Deduction settings for pitch stability and accuracy: when the absolute difference of the two standard mean square deviations, abs(std_xf - std_sf), falls within the configured range, a stability deduction is applied on a (0, 1) per-beat-point scale; when the absolute difference of the means satisfies abs(mean_xf - mean_sf) ∈ [0.2, 1] * mean_xf, i.e. falls within 0.2 to 1 times the standard mean, an accuracy deduction is applied on a (0, 1) per-beat-point scale. The two indices are multiplied by the deduction coefficient and the index weight ratio to finally obtain the pitch deduction for each beat point: pitch_dp.
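The per-beat pitch comparison can be sketched as follows; the thresholds are exposed as parameters (the stability range is not fully legible in the published text), and the within-range deduction is applied as a flat amount here, whereas the text scales it on a (0, 1) per-beat basis.

import numpy as np

def pitch_deduction_per_beat(xf, sf, stability_range=(0.2, 1.0), accuracy_range=(0.2, 1.0),
                             coeff=1.0, w_stability=0.5, w_accuracy=0.5):
    """Compare one beat point of the standard score (xf) and the student performance (sf)."""
    std_xf, std_sf = np.std(xf), np.std(sf)
    mean_xf, mean_sf = np.mean(xf), np.mean(sf)
    deduction = 0.0
    # Stability: how close the spread of the student's beat is to the score's.
    if stability_range[0] * std_xf <= abs(std_xf - std_sf) <= stability_range[1] * std_xf:
        deduction += coeff * w_stability
    # Accuracy: how close the mean pitch is; the sign of (mean_xf - mean_sf) says low vs. high.
    if accuracy_range[0] * mean_xf <= abs(mean_xf - mean_sf) <= accuracy_range[1] * mean_xf:
        deduction += coeff * w_accuracy
    return deduction    # contribution to pitch_dp for this beat point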
Referring to fig. 8, the rhythm evaluation uses the sequences processed in the same way as for pitch: a DTW path comparison is performed directly on the aligned XmlSeq and StuSeq sequences, giving the positional distribution of the frame offset of each beat point.
From the frame offset distribution a frame offset sequence is obtained: shf1, shf2, shf3, ..., shfn, all of which are theoretically 0 at a perfect rhythm point (identical to the standard score). The frame offsets are archived by bar and the standard mean square deviation of the frame offsets of each bar, std_shf, is calculated; the smaller the value, the more stable the rhythm of the current bar, and when std_shf falls within the configured range the tempo of the current bar is considered stable, with a deduction applied outside that range. The mean of the bar's frame offsets, mean_shf, is calculated at the same time and represents the average number of frames by which the bar's rhythm points are offset; when mean_shf < 10% of the number of frames per beat point, the bar's playing points are regarded as accurate, with a deduction applied outside that range. The rhythm evaluation thus comprises two indices, rhythm stability and rhythm accuracy, and according to the index configuration and the weighted deduction coefficients the deduction for each bar's rhythm is obtained: rhythm_dp.
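A per-bar sketch of this rhythm scoring follows; the stability threshold is an assumed parameter (the published range is not legible), the 10%-of-beat-frames accuracy rule is taken from the text, and mean_shf is computed here over absolute offsets.

import numpy as np

def rhythm_deduction_per_bar(frame_offsets, frames_per_beat, stability_thresh,
                             coeff=1.0, w_stability=0.5, w_accuracy=0.5):
    """Score one bar from the DTW frame offsets of its beat points (0 = perfectly on time)."""
    offsets = np.asarray(frame_offsets, dtype=float)
    std_shf = np.std(offsets)              # rhythm stability of the bar
    mean_shf = np.mean(np.abs(offsets))    # average frame offset of the bar's rhythm points
    deduction = 0.0
    if std_shf > stability_thresh:                  # unstable tempo within the bar
        deduction += coeff * w_stability
    if mean_shf >= 0.10 * frames_per_beat:          # outside 10% of a beat's frames
        deduction += coeff * w_accuracy
    return deduction                                # contribution to rhythm_dp for this bar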
Finally, the emotion evaluation. The evaluation of emotion mainly concerns the fullness of the student's sound and the intensity changes of the music: highlighting the contrast between loud and soft and expressing emotions such as cheerfulness, grandeur, and enthusiasm according to the character of the piece. This expression shows up as changes in sound intensity: for a track whose character is cheerful, enthusiastic, lyrical, and so on, the attack and release (beginning and ending) of the notes associated with each emotion are characteristic. A cheerful sound, for example, is usually crisp, with short note durations and a clean ending; an enthusiastic emotion has rapid onsets, steep intensity, a periodically sensitive contrast between loud and soft, and so on. Each track can define its own emotion requirements, and the overtones are used directly to define a template of the intensity changes for that emotion, which serves as the emotion scoring standard.
The invention is mainly applied to examination scenarios where students perform on site in a computer-room environment. Limited by objective factors such as the computer-room hardware and the order in which students take the exam, the quality of the student performance recordings varies greatly. For this specific scenario, a large number of student recordings are extracted and used to train an AI noise reduction method, reducing the influence of environment and equipment differences on the student performance score.
The piece played by the student is converted into mono 16-bit wav audio format. The student recordings are manually subjected to noise extraction and classification labeling (noise 1, noise 2, ... noise n) to obtain noise samples.
The clean student recordings are manually classified and labeled (clean tone 1, clean tone 2, ... clean tone n) to obtain clean tone samples.
To let the deep learning model better extract noise characteristics, the invention randomly combines noise and clean tone with randomly set signal-to-noise ratios to generate a large number of noisy audio segments, which are fed into the model so that it quickly fits the characteristics of various noise environments. The specific processing is designed as follows:
The clean tone is cut every 20 ms and Fourier-transformed to obtain clean-tone frame segments; each segment is mixed with one randomly selected noise at a random signal-to-noise ratio, finally yielding frame signals containing different noise types and intensities.
The noise training adopts a recurrent neural network scheme (RNNoise). The deep learning framework is keras/tensorflow and the audio processing uses librosa. DCT transforms are applied to the noisy frame data and the clean tone data to obtain BFCC (Bark-frequency cepstral coefficient) features, which cover a smaller frequency band range and are more efficient to compute, giving 22 feature values in total; together with the MFCC features, spectral centroid, spectral attenuation, fundamental frequency period, and similar coefficients, 36 feature coefficients are obtained and fed into a training network composed of 3 dense layers and 3 GRU layers for repeated training.
Through the steps, a network model for filtering noise can be obtained, and the model is stored in a training module for subsequent testing and operation.
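A minimal Keras sketch of such a 3-dense, 3-GRU network is shown below; the layer widths, activations, and the choice of per-band suppression gains as the output (as in RNNoise) are assumptions, since the text specifies only the layer counts and the 36-dimensional feature input.

from tensorflow import keras
from tensorflow.keras import layers

def build_denoise_net(n_features=36, n_bands=22):
    """Dense + GRU network mapping per-frame features to per-band noise-suppression gains."""
    inp = keras.Input(shape=(None, n_features))              # sequence of 36-dim feature frames
    x = layers.Dense(32, activation="tanh")(inp)             # dense layer 1
    x = layers.GRU(48, return_sequences=True)(x)             # GRU layer 1
    x = layers.GRU(96, return_sequences=True)(x)             # GRU layer 2
    x = layers.GRU(96, return_sequences=True)(x)             # GRU layer 3
    x = layers.Dense(64, activation="relu")(x)               # dense layer 2
    gains = layers.Dense(n_bands, activation="sigmoid")(x)   # dense layer 3: gain per BFCC band
    model = keras.Model(inp, gains)
    model.compile(optimizer="adam", loss="mse")
    return model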
Professional musicians record the playing sounds of different instruments, including clarinet, ceramic flute (ocarina), accordion, harmonica, and cucurbit flute (hulusi), extended according to the instruments to be assessed; for each instrument about 2 hours of long-form audio is recorded as the training sample base, dynamically supplemented later according to the classification accuracy of each instrument. Audio from 200 students' examination performances is randomly screened as test samples.
The sample audio is uniformly processed into 16-bit, mono, 16 kHz wav files and cut into segments of about 4 seconds, giving the data structure of a sample unit (channels, samples): (1, 16000*4).
The clipped audio is converted with MFCC (Mel-frequency cepstral coefficients) into 64 band features, giving a sample-unit data structure of (1, 64, 250), where 250 is the amount of data per Mel band at the 16 kHz sampling rate (16000/64/1).
Fitting is performed in batches of 32 sample units, giving a single training-sample data structure of (32, 1, 64, 250); the batch size can be adjusted dynamically according to training performance.
A one-hot transform is applied to the class of each instrument sample unit and extended to a form consistent with the sample-unit data structure: (5), where 5 is the number of instrument classes currently trained.
Referring to fig. 3, the training data is processed with a CNN (convolutional neural network) architecture: three CNN layers reduce the data feature map to (32, 64, 3, 31), which is fed into 1 dense layer for prediction, giving (32, 5) final prediction result data, where 5 represents the number of instrument classes currently trained and the output is the estimated similarity (a number from 0 to 1) of the 32 sample units to each of the 5 instruments. At this point the output model structure of the network is fully determined.
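For illustration, a Keras sketch of a comparable CNN classifier is given below; the filter counts, kernel sizes, and strides are assumptions (the text only states three convolutional layers followed by one dense layer and a 5-class output), and the MFCC input is arranged channels-last rather than in the (1, 64, 250) layout quoted above.

from tensorflow import keras
from tensorflow.keras import layers

def build_instrument_cnn(n_mels=64, n_frames=250, n_classes=5):
    """Three convolutional layers + one dense layer over an MFCC 'image' of a ~4 s clip."""
    inp = keras.Input(shape=(n_mels, n_frames, 1))
    x = layers.Conv2D(16, 3, strides=2, activation="relu")(inp)
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)   # similarity per instrument class
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model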
Instrument identification feeds the audio file recorded by the student into the model; the model outputs a prediction, and the instrument index corresponding to the maximum value of the result from the previous step is taken directly as the identified instrument. The identified instrument type can be compared against the test samples: the 200 audios randomly screened manually in the first step are labeled with the instrument actually played, so the identification accuracy can be obtained.
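A sketch of the identification step is shown below; the label order, the use of 64 MFCC coefficients as the band features, and the fixed 250-frame length are assumptions made for illustration.

import numpy as np
import librosa

INSTRUMENTS = ["clarinet", "ocarina", "accordion", "harmonica", "hulusi"]   # assumed label order

def identify_instrument(model, wav_path):
    """Run the trained classifier on one student recording and take the arg-max class."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    feats = librosa.feature.mfcc(y=y[:16000 * 4], sr=sr, n_mfcc=64)    # (64, n_frames)
    feats = librosa.util.fix_length(feats, size=250, axis=1)           # pad/trim to 250 frames
    probs = model.predict(feats[np.newaxis, ..., np.newaxis])[0]       # (n_classes,)
    idx = int(np.argmax(probs))
    return INSTRUMENTS[idx], float(probs[idx])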
Referring to fig. 4, to let professional musicians generate the instrument identification model in one stop, the training module asks the musician to create a folder named after each instrument and to file and upload each instrument's recordings to the training module. The training module automatically creates classification labels from the folder names and runs the fully automated pipeline of audio cutting, silence removal, feature extraction, network training, training result evaluation, model generation, and model testing. It outputs a running log of each stage for developers to track problems and a test report for the musician, who can adjust or supplement the training data and retrain based on the misidentified audio shown in the report. The final model is output once the test result is 100% correct.
The training module thus separates the work of developers and business staff so that both sides can focus on their own professional fields, greatly improving the efficiency, scalability, and professionalism of instrument identification.
In preprocessing, the student recording, the xml file of the performance question, and the standard overtone sample all need to be processed so that the student's performance level can be scored along multiple dimensions. The xml file of the performance question is a score information file exported from the midi produced with professional notation software (e.g., Sibelius). The overtone sample is sample audio recorded on site by the person who set the performance question and serves as the standard for judging performance emotion.
First, the student recording is noise-reduced with the training module from step 1, and the noise-reduced student audio is signal-enhanced with a band-pass filter, giving the final preprocessed student recording.
The xml data file of the performance question is parsed to obtain basic track information such as the number of bars, beats per bar, beat duration, key, and the duration and frequency sequence of the notes. Because the xml information carries no tempo attribute, the xml data must be filled or compressed (its total duration is unified to the real playing duration of the question, ensuring the data are aligned on the time axis before subsequent operations) and finally converted into a sequence of note frequencies in the time domain: XmlSeq (334 Hz, 334 Hz ... | 112 Hz, 231 Hz, 231 Hz ... | ...).
Similarly, the fundamental frequency sequence is extracted from the noise-reduced student wav audio with the pyin technique, giving a frequency sequence on the same time axis as the xml: StuSeq.
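A sketch of the StuSeq extraction with librosa's pyin implementation is shown below; the pitch range and hop length are assumptions, and silent or unvoiced frames are mapped to 0 Hz to match the placeholder convention used for XmlSeq.

import numpy as np
import librosa

def extract_stu_seq(wav_path, hop_length=512):
    """Extract the fundamental-frequency sequence (StuSeq) from the denoised student audio."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        hop_length=hop_length,
    )
    return np.nan_to_num(f0, nan=0.0)    # one frequency value per frame, 0 Hz when silent/unvoiced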
Referring to fig. 5, during performance and recording the student is limited by his or her playing level, by the hardware, and by uncontrollable factors in system operation; the recording may be delayed or have trailing audio, leaving blanks at the head and tail and rhythm deviations that would seriously affect the subsequent rhythm scoring. The preprocessing module therefore adopts a head-and-tail compensation and clipping scheme that pads or clips the beginning and end of the student audio with 1 to n beats of a 0 Hz note (a placeholder), set according to the beat points and duration of the question. The student recording duration then exceeds the length of the xml data; the xml audio sequence XmlSeq is first aligned to the head of the student audio sequence StuSeq and shifted beat by beat until it is aligned to the tail, and at each shift the DTW technique is used to compute the minimum distance between the two sequences. The best result (the alignment of XmlSeq and StuSeq with the shortest distance and one-to-one correspondence) determines the student's real playing intent and resolves problems such as late entry, trailing audio, and recording delay.
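The beat-by-beat sliding alignment can be sketched as follows; it assumes both sequences are sampled at the same frame rate and uses librosa's DTW implementation as the distance measure.

import numpy as np
import librosa

def best_alignment(xml_seq, stu_seq, frames_per_beat):
    """Slide XmlSeq across the padded StuSeq one beat at a time and keep the DTW-optimal offset."""
    xml_seq = np.asarray(xml_seq, dtype=float)
    stu_seq = np.asarray(stu_seq, dtype=float)
    best_offset, best_cost, best_path = 0, np.inf, None
    max_offset = len(stu_seq) - len(xml_seq)
    for offset in range(0, max_offset + 1, frames_per_beat):
        window = stu_seq[offset:offset + len(xml_seq)]
        D, wp = librosa.sequence.dtw(X=xml_seq[np.newaxis, :], Y=window[np.newaxis, :])
        if D[-1, -1] < best_cost:                      # total accumulated DTW cost
            best_offset, best_cost, best_path = offset, D[-1, -1], wp
    return best_offset, best_cost, best_path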
Next, loudness extraction is performed on the standard overtone sample, giving the amplitude data of each sampling point: sample_loundness1, sample_loundness2, ... sample_loundnessn. The amplitude data of every sampling point are kept so that the sequence remains continuously predictable, making it easier to derive the direction of loudness changes; converting into coarser time-domain ranges would lose much precision. Finally the sound amplitude sequence over the overtone time domain is obtained: sampleLoundness.
The same processing is applied to the student audio, finally giving the amplitude sequence over the student time domain: StuLoundness.
After the signal preprocessing, four sequences on the same time axis are obtained: XmlSeq, the standard score sequence; StuSeq, the student performance sequence; sampleLoundness, the overtone loudness reference sequence; and StuLoundness, the student performance loudness sequence.
With the sampleLoundness and StuLoundness loudness sequences obtained by the preprocessing module, and using the DTW calibration from the earlier integrity, pitch, and rhythm steps, the two intensity sequences can be recalibrated in the time domain against the overtones after the silence masks are removed, giving the sound intensity comparison sequences for each bar and beat.
The two calibrated intensity sequences are normalized to a decibel scale in [0,1]; both sequences now hold the intensity at each sampling point, and the sampling points of each beat are archived to obtain the intensity subsequence of each beat: batch1: [ln1, ln2, ln3, ... lnn]. The subsequences are split so that the direction of intensity change can be judged locally and the change difference calculated. Finally the per-beat intensity sequences of the overtones and of the student audio are obtained:
sample_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
stu_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]].
The two sequences are quantified simultaneously for intensity trend and difference: for both audios, the intensity trend of each beat (up, down, level) and the beat-to-beat difference (+0.2, -0.1, 0.0, +3.0, ...) are obtained. The two sets of change sequences are defined as:
sample_ln_dir_seq=[[up1,up2,up3,...upn],...[up1,up2,up3,...upk]];
stu_ln_dir_seq=[[up1,up2,up3,...upn],...[up1,up2,up3,...upk]];
sample_ln_shift_seq=[[shift1,...shiftn],...[shift1,shift2,...shiftk]];
stu_ln_shift_seq=[[shift1,...shiftn],...[shift1,shift2,...shiftk]];
For the intensity trend, beat-point comparison can be performed directly: rising is represented by 1, falling by -1, and level by 0, and an AND operation and a multiplication are applied to each beat of the two sequences:
result_up_i = sample_up_i AND stu_up_i * (sample_up_i * stu_up_i);
The intensity change trends of the two audios agree if the result is non-negative and are opposite if it is negative. One point deserves special attention: a value of exactly 0 is very unlikely (even the overtones can hardly keep the mean intensity identical between beats), yet comparisons between relatively flat beats can produce opposite results (e.g. consecutive beats differing by +0.05 dB and -0.05 dB respectively, which should be regarded as similar, but whose product is negative). All values whose magnitude falls within the 0.05 to 0.1 dB interval are therefore reset to 0, reducing the trend evaluation error caused by this sensitivity.
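A sketch of the per-beat trend comparison is given below; it treats the AND factor in the formula as "both changes are non-zero", uses the sign product as the agreement test, and exposes the flatness threshold as a parameter.

import numpy as np

def trend_agreement(sample_shift, stu_shift, flat_eps=0.05):
    """Per-beat intensity-trend comparison between the overtone reference and the student."""
    s = np.asarray(sample_shift, dtype=float)
    t = np.asarray(stu_shift, dtype=float)
    s = np.where(np.abs(s) <= flat_eps, 0.0, s)       # near-flat changes are treated as level
    t = np.where(np.abs(t) <= flat_eps, 0.0, t)
    sample_up = np.sign(s)                            # 1 = rising, -1 = falling, 0 = level
    stu_up = np.sign(t)
    both_moving = (sample_up != 0) & (stu_up != 0)    # the "AND" part of the formula
    return both_moving * (sample_up * stu_up)         # >= 0: trends agree, < 0: opposite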
For the intensity difference (the amplitude of intensity change at each beat), the difference between the beat-wise intensity means is calculated directly:
result_sample_shift_i = sample_shift_i - sample_shift_(i-1);
Likewise, the same serialized operation is applied to the student audio:
result_stu_shift_i = stu_shift_i - stu_shift_(i-1);
This gives the sequences of differences between intensity beat points. A point-wise difference of the two sequences is taken and the standard mean square error (MSE) of the result is computed, giving the overall trend of intensity amplitude change across the beat points:
shift_diff_i = result_sample_shift_i - result_stu_shift_i;
MSE_shift = (1/n) * Σ shift_diff_i²;
The smaller the result, the more consistent the overall intensity trends of the two pieces. Intervals of MSE_shift are mapped to emotion scores; for example, a value of 0.01 or less represents a complete emotional match. The emotion score weight is configured, and the emotion bonus is defined as: emotion_bp.
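The difference-quantification side can be sketched as follows; the mapping from MSE_shift intervals to an emotion score is left to a configurable table, since only the 0.01 example threshold appears in the text.

import numpy as np

def emotion_intensity_mse(sample_beat_means, stu_beat_means):
    """Quantify how closely the student's beat-to-beat intensity changes follow the overtone template."""
    result_sample_shift = np.diff(sample_beat_means)    # intensity change between consecutive beats
    result_stu_shift = np.diff(stu_beat_means)
    shift_diff = result_sample_shift - result_stu_shift
    return float(np.mean(shift_diff ** 2))              # smaller value -> more consistent dynamics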
The instrument identification part classifies the student audio with the instrument timbre model obtained from the training module and matches it against the instrument required by the track. A successful match means the correct instrument was used; otherwise, if the question is not configured accordingly, it is treated as playing the wrong instrument or free playing, and the corresponding deduction and weight are configured flexibly according to business requirements.
The instrument decision deduction is defined as: celesta_dp.
The output module sums the scores of the multi-dimensional evaluation modules above according to their weights, finally giving the student's final instrument performance score:
score = 100 - (integrity_dp * w_i + pitch_dp * w_p + rhythm_dp * w_r + celesta_dp * w_c) + emotion_bp.
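A sketch of this final aggregation is shown below; the weight values and the clamping to the 0 to 100 range are assumptions, since the text leaves the weights configurable.

def final_score(integrity_dp, pitch_dp, rhythm_dp, celesta_dp, emotion_bp,
                w_i=0.3, w_p=0.3, w_r=0.3, w_c=0.1):
    """Combine the per-dimension deductions and the emotion bonus into the final score."""
    score = 100 - (integrity_dp * w_i + pitch_dp * w_p
                   + rhythm_dp * w_r + celesta_dp * w_c) + emotion_bp
    return max(0.0, min(100.0, score))    # clamp to 0-100 (an assumption)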
The primary beneficial effects of the deep-learning-based primary and secondary school performance scoring method and system are improved scoring efficiency and accuracy: through an automated and intelligent approach, the invention overcomes a number of defects of the prior art. The system analyzes the audio comprehensively and objectively evaluates multiple dimensions such as pitch, rhythm and timbre, providing a more complete and consistent assessment. This reduces the subjectivity and time cost of manual scoring and makes the evaluation process fairer and more efficient. The system is particularly beneficial for large-scale evaluation of student performances and can raise the quality and standard of music education in primary and secondary schools.
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto; any equivalent substitution or modification made, within the technical scope disclosed herein, by a person skilled in the art according to the technical scheme and the inventive concept of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A deep-learning-based primary and secondary school performance scoring system, characterized by comprising:
The training module is used for performing deep learning training on the sound and the environmental noise of various musical instruments to obtain a musical instrument classification model and a noise reduction model, and transmitting the musical instrument classification model and the noise reduction model to the preprocessing module;
the preprocessing module is used for processing the student performance audio and track information in a specific environment, the processing comprising AI noise reduction, signal enhancement, signal conversion and normalization of the performance audio, together with xml parsing of the midi and overtone material, by combining the musical instrument classification model and the noise reduction model;
the multi-dimensional evaluation module is used for carrying out a multi-dimensional evaluation of the preprocessed student performance audio according to the track information, wherein the multi-dimensional evaluation comprises scoring the instrument type, pitch, rhythm, integrity and emotion;
the identification module is used for intelligently identifying and classifying musical instruments used for playing the audio;
and the output module is used for carrying out weighted operation on each dimension score of the multi-dimension evaluation and the musical instrument identification result, and finally outputting student performance scores.
2. A scoring method of a deep learning-based performance scoring system for middle and primary schools as claimed in claim 1, comprising the steps of:
S1, collecting student records for noise extraction and training;
s2, collecting playing sounds of corresponding instruments as training samples, randomly screening audio played by student examination instruments as test samples, and training the output model structure of the convolutional neural network;
s3, processing the recorded sound of the students, xml files of playing questions and standard overtone samples;
and S4, scoring by adopting a multi-dimensional evaluation module, and summarizing the scores of the multi-dimensional evaluation module according to the weighted values through an output module to obtain the final score of the student playing musical instrument.
3. The method for scoring a performance of middle and primary schools based on deep learning according to claim 2, wherein the step S1 specifically includes:
s11, converting the student record into a mono 16-bit wav audio format;
s12, carrying out noise extraction and classification labeling on the student records: noise 1, noise 2 and … noise n, and obtaining noise samples;
s13, classifying and labeling the pure student records: pure tone 1, pure tone 2, … pure tone n, obtain a pure tone sample;
s14, randomly combining the noise samples and the clean-tone samples with a configured signal-to-noise ratio to generate noisy audio segments, which are fed into the model so that it quickly fits the characteristics of various noise environments.
4. The method for scoring performance of middle and primary schools based on deep learning as claimed in claim 3, wherein S14 specifically includes cutting the clean-tone samples into frame segments, randomly selecting a noise for each frame segment for noise training, and setting a random signal-to-noise ratio to obtain frame signals with different noise types and noise intensities:
purgain = rand(-24, 24) dB;
noisegain = rand(-12, 12) dB;
noiseFrame_n = f_n * purgain + noise[rand(0, noise_len)][0:frame_len] * noisegain;
wherein noiseFrame_n represents the n-th noisy frame signal, f_n represents the n-th clean audio frame, purgain and noisegain represent the random gain values of the clean sound and of the noise respectively, noise_len represents the length of the noise class, and frame_len represents the frame length;
the above is executed in a loop over all clean-tone frame segments until processing is complete, yielding all of the randomly generated frame signals with different noise types and noise intensities.
5. The method for scoring performance of middle and primary schools based on deep learning according to claim 4, wherein the noise training specifically comprises: adopting a recurrent-neural-network scheme, with keras and tensorflow as the deep-learning framework and librosa for the audio-processing algorithms; performing DCT on the noise-frame data and the clean-tone data respectively to obtain Bark-frequency cepstral coefficients (BFCC); calculating MFCC features together with spectral centroid, spectral attenuation and fundamental-frequency periodicity coefficients; inputting these into a training network consisting of dense layers and GRU layers for multiple training passes; obtaining a network model for filtering noise; and storing the network model in the training module for subsequent testing and operation.
6. The method for scoring a performance of middle and primary schools based on deep learning according to claim 2, wherein S2 specifically comprises:
s21, processing the training samples and test samples into 16-bit, mono wav files at a 16 kHz sampling rate, and cutting them to obtain the data structure of a sample unit;
s22, performing MFCC (Mel-frequency cepstral coefficient) conversion on the data structure to obtain a sample-unit data structure of Mel-cepstrum frequency-band data, and fitting to obtain a single training-sample data structure;
s23, performing oneHot conversion on the category of each instrument sample unit, and expanding the sample unit to a form consistent with the sample unit data structure;
s24, processing the training data with a convolutional neural network architecture: the convolutional layers reduce the dimensionality of the data feature maps, which are then input into a dense layer for approximate prediction, and the final prediction result data form the output model structure of the convolutional neural network.
7. The deep learning-based performance scoring method for middle and primary schools according to claim 2, wherein the xml file of the performance question is a score-information file exported from a midi file produced with professional music software, and the overtone sample is sample audio recorded on site by the person who set the performance question, serving as the standard for judging the emotion of the performance.
8. The deep learning-based performance scoring method for middle and primary schools according to claim 7, wherein S3 specifically comprises:
s31, noise reduction is carried out on the student recording with the model obtained by the training module in S1, and a band-pass filter is applied to the noise-reduced student audio for signal enhancement, yielding the final preprocessed student recording audio;
s32, parsing the xml data file of the performance question to obtain basic track information, including the number of bars, the number of beats, the beat duration, the key, and the duration and frequency sequence of the notes; the xml data is padded or compressed, and the xml information is converted into a time-domain sequence of note frequencies: XmlSeq;
s33, extracting a fundamental-frequency sequence from the noise-reduced student wav audio via the pYIN algorithm to obtain a frequency sequence on the same time axis as the xml data: StuSeq;
s34, carrying out loudness extraction on the standard overtone sample to obtain the amplitude data of each sampling point:
sample_loudness1, sample_loudness2, ... sample_loudnessn;
obtaining a sound amplitude sequence in an overtone time domain range;
s35, the same processing method is used for loudness extraction of the student audio to obtain an amplitude sequence in the student time domain range.
9. The deep learning-based middle and primary school performance scoring method of claim 8, wherein the integrity evaluation refers to the XmlSeq standard score sequence and compares the student performance sequence: the student performance sequence StuSeq is masked, silence masks are extracted, and the distribution of the silence masks is calculated:
filtering out the parts that are also silent in the XmlSeq standard score sequence;
performing breath (ventilation) identification of small gaps on the filtered mute sequence; if the distribution is periodic and the intervals fall within a preset range, the gaps are incorporated into the normal performance breathing sequence and are not counted in the deduction;
archiving the finally filtered mute sequence by bar and beat point, taking the number of consecutively lost frames relative to the total number of frames of the beat point as the deduction interval, that is, losing all frames of a beat point counts as a full deduction, and finally multiplying by the deduction coefficient to obtain the final deduction;
the pitch evaluation refers to the XmlSeq standard score sequence and compares the student performance sequence: the two sequences XmlSeq and StuSeq are normalized so that their value ranges lie in the same frequency dimension; the correction is carried out by normalization, and if the correction is abnormal, the sequences are processed with bar-by-bar normalization instead; the two sequences are segmented by bar and beat point, each segment is aligned with a DTW algorithm, and the bar-by-bar aligned sequences are compared frame by frame for pitch:
the frame sequence of a standard beat point is xf1, xf2, xf3, ..., xfn;
the frame sequence of the student beat point is sf1, sf2, sf3, ..., sfn;
the standard mean square deviations of the two sequences are compared:
wherein std_xf and std_sf denote the standard mean square deviations of a beat-point sequence of the standard score and of the student performance respectively; the closer the two deviations, the more stable the performance. The sequence means mean_xf and mean_sf are also taken; the closer the two means, the more accurate the pitch. If mean_xf - mean_sf is negative, the pitch is considered to be lowered; if positive, the pitch is considered to be raised;
pitch stability and pitch accuracy are set as two indexes: a stability deduction is applied when the absolute difference of the standard mean square deviations falls within the configured range, and an accuracy deduction is applied when the absolute mean difference falls within a preset multiple of the standard mean; the two indexes, multiplied by the deduction coefficient and the index weight ratio, give the pitch deduction of each beat point;
the rhythm evaluation uses the sequences processed in the same way as for pitch: the aligned XmlSeq and StuSeq sequences are compared directly along the DTW path to obtain the positional distribution of the frame offset at each beat point, giving a frame-offset sequence:
shf1, shf2, shf3, ..., shfn;
the frame offsets are archived by bar, and the standard mean square error of the frame offsets of each bar is calculated:
the smaller the resulting value, the more stable the rhythm deviation of the current bar; at the same time the mean of the bar's rhythm offsets, mean_shf, is calculated, which represents the average number of frames by which the bar's rhythm points are offset. When mean_shf is within 10% of the number of frames per beat point, the playing points of the current bar are identified as accurate; outside that range they are not. The rhythm evaluation likewise comprises two indexes, rhythm stability and rhythm accuracy, and the deduction for each bar's rhythm is obtained from the index allocation and the weight allocation of the deduction coefficient.
10. The middle and primary school performance scoring method based on deep learning according to claim 9, wherein the overtone loudness reference sequence and the student performance loudness sequence are calibrated according to the DTW alignment used for integrity, pitch and rhythm; after the mute masks are removed, the student loudness sequence is recalibrated in the time domain against the overtone sequence to obtain intensity-comparison sequences for each bar and each beat; the two calibrated intensity sequences are normalized to obtain sequences with decibel values scaled to [0,1], and the sampling points of each beat are archived to give a per-beat intensity subsequence:
batch1:[ln1,ln2,ln3,...lnn];
the sequence of overtones and student audio intensity for each beat is obtained:
sample_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
stu_ln_seq=[[ln1,ln2,ln3...lnn],...[ln1,ln2,..lnk]];
the two sequences are quantified simultaneously for intensity trend and intensity difference; the intensity-change trend and the beat-to-beat change difference can be compared directly beat point by beat point, a rise being represented by 1, a fall by -1 and a flat trend by 0, and an AND-style multiplication is performed for each beat of the two sequences:
result_up_i = sample_up_i * stu_up_i;
wherein result_up_i is the result for the i-th beat; a non-negative value means the intensity trends of the two recordings agree, and a negative value means they are opposite;
for the intensity difference, the difference between the mean intensities of adjacent beat points is computed directly:
result_sample_shift_i = sample_shift_i - sample_shift_(i-1);
and the same serialization is applied to the student audio:
result_stu_shift_i = stu_shift_i - stu_shift_(i-1);
obtaining the sequences of beat-to-beat intensity differences for the two recordings; a point-by-point difference is taken between the two sequences and the standard mean square error of the result is computed, giving the trend of the intensity-amplitude change over all beat points:
shift_diff_i = result_sample_shift_i - result_stu_shift_i;
wherein the computed value is the standard mean square error of the intensity-amplitude variation over all beat points of the whole piece; the smaller the result, the more consistent the overall intensity-variation trends of the two pieces.
CN202311758565.1A 2023-12-20 2023-12-20 Deep learning-based primary and secondary school performance scoring method and system Pending CN117746901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311758565.1A CN117746901A (en) 2023-12-20 2023-12-20 Deep learning-based primary and secondary school performance scoring method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311758565.1A CN117746901A (en) 2023-12-20 2023-12-20 Deep learning-based primary and secondary school performance scoring method and system

Publications (1)

Publication Number Publication Date
CN117746901A true CN117746901A (en) 2024-03-22

Family

ID=90258824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311758565.1A Pending CN117746901A (en) 2023-12-20 2023-12-20 Deep learning-based primary and secondary school performance scoring method and system

Country Status (1)

Country Link
CN (1) CN117746901A (en)

Similar Documents

Publication Publication Date Title
CN109448754B (en) Multidimensional singing scoring system
Marolt A connectionist approach to automatic transcription of polyphonic piano music
US7667125B2 (en) Music transcription
Eronen Automatic musical instrument recognition
US7288710B2 (en) Music searching apparatus and method
Wu et al. Polyphonic music transcription with semantic segmentation
US20050234366A1 (en) Apparatus and method for analyzing a sound signal using a physiological ear model
CN106997765B (en) Quantitative characterization method for human voice timbre
Marolt SONIC: Transcription of polyphonic piano music with neural networks
Osmalsky et al. Neural networks for musical chords recognition
Abeßer et al. Score-informed analysis of tuning, intonation, pitch modulation, and dynamics in jazz solos
Lerch Software-based extraction of objective parameters from music performances
Zwan et al. System for automatic singing voice recognition
Shih et al. Analysis and synthesis of the violin playing style of heifetz and oistrakh
CN115331648A (en) Audio data processing method, device, equipment, storage medium and product
CN117746901A (en) Deep learning-based primary and secondary school performance scoring method and system
Chen et al. An efficient method for polyphonic audio-to-score alignment using onset detection and constant Q transform
Ali-MacLachlan Computational analysis of style in Irish traditional flute playing
Yokoyama et al. Identification of violin timbre by neural network using acoustic features
Zlatintsi et al. Musical instruments signal analysis and recognition using fractal features
Chuan et al. The KUSC classical music dataset for audio key finding
Singh et al. Deep learning based Tonic identification in Indian Classical Music
Donnelly et al. Transcription of audio to midi using deep learning
US11749237B1 (en) System and method for generation of musical notation from audio signal
Bando et al. A chord recognition method of guitar sound using its constituent tone information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination