CN114822455A - Audio processing method and device, electronic equipment and computer readable medium - Google Patents


Info

Publication number
CN114822455A
CN114822455A (application CN202210332724.0A)
Authority
CN
China
Prior art keywords
audio information
audio
standard
information sequence
grouped
Prior art date
Legal status
Pending
Application number
CN202210332724.0A
Other languages
Chinese (zh)
Inventor
郑正
徐豪骏
李山亭
王敬群
Current Assignee
Shanghai Miaoke Information Technology Co ltd
Original Assignee
Shanghai Miaoke Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Miaoke Information Technology Co., Ltd.
Priority to CN202210332724.0A
Publication of CN114822455A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H1/0008: Associated control or indicating means
    • G10H1/36: Accompaniment arrangements
    • G10H1/40: Rhythm
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/375: Tempo or beat alterations; Music timing control
    • G10H2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/005: Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The embodiments of the present disclosure disclose an audio processing method and apparatus, an electronic device, and a computer-readable medium. One embodiment of the method comprises: in response to receiving audio sent by a user terminal, performing format conversion on the audio to generate converted audio; inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence; grouping the audio information in the audio information sequence according to the audio frame numbers it contains, to obtain a set of grouped audio information sequences; matching the set of grouped audio information sequences against a set of standard audio information sequences to obtain a set of standard grouped audio information sequences; and generating an audio detection text from the set of grouped audio information sequences and the set of standard grouped audio information sequences. This embodiment reduces wasted teaching time.

Description

Audio processing method and device, electronic equipment and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a computer-readable medium.
Background
With the development of modern musical instruments, determining the accuracy of the content a user plays has become an important research topic. At present, this accuracy is usually determined as follows: a teacher listens to and analyzes the content played by the user.
However, determining accuracy in this manner often runs into the following technical problems:
firstly, when checking whether the content played by the user is correct, a teacher easily misses wrongly played passages, so the user must play the content repeatedly for the errors to be found, which wastes teaching time;
secondly, a teacher cannot determine the rhythm of each individual tone the user plays, so the user has to repeatedly practice the entire piece, which wastes practice time.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose audio processing methods, apparatuses, electronic devices and computer readable media to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide an audio processing method, including: responding to the received audio sent by the user terminal, and performing format conversion processing on the audio to generate converted audio; inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence, wherein the audio information in the audio information sequence comprises audio frame numbers; according to the number of each audio frame included in the audio information sequence, performing grouping processing on each audio information in the audio information sequence to obtain a grouped audio information sequence set, wherein each grouped audio information sequence in the grouped audio information sequence set is arranged according to the number of each included audio frame; matching the packet audio information sequence set with a standard audio information sequence set to obtain a standard packet audio information sequence set, wherein the number of the standard packet audio information sequences included in the standard packet audio information sequence set is equal to the number of the packet audio information sequences included in the packet audio information sequence set; and generating an audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set.
In a second aspect, some embodiments of the present disclosure provide an audio processing apparatus, the apparatus comprising: the conversion processing unit is configured to respond to the received audio sent by the user terminal and carry out format conversion processing on the audio so as to generate converted audio; the input unit is configured to input the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence, wherein the audio information in the audio information sequence comprises audio frame numbers; the grouping processing unit is configured to perform grouping processing on each audio information in the audio information sequence according to each audio frame number included in the audio information sequence to obtain a grouping audio information sequence set, wherein each grouping audio information sequence in the grouping audio information sequence set is arranged according to each included audio frame number; a matching processing unit configured to match the set of packet audio information sequences with a set of standard audio information sequences to obtain a set of standard packet audio information sequences, wherein the number of standard packet audio information sequences included in the set of standard packet audio information sequences is equal to the number of packet audio information sequences included in the set of packet audio information sequences; and the generating unit is configured to generate the audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device, on which one or more programs are stored, which when executed by one or more processors cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.
The above embodiments of the present disclosure have the following beneficial effects: by the audio processing method of some embodiments of the present disclosure, waste of teaching time is reduced. Specifically, the reason why the teaching time is wasted is that: when the teacher determines whether the content played by the user is correct, the teacher easily omits the content played wrongly, so that the content played by the user needs to be played repeatedly to find out the content played wrongly, and the teaching time is wasted. Based on this, the audio processing method according to some embodiments of the present disclosure first performs format conversion processing on the audio in response to receiving the audio sent by the user end, so as to generate a converted audio. Thus, the audio format can be converted into an audio format to which the audio information extraction model can be applied. And secondly, inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence. Therefore, the audio information can be grouped, and the matching with the standard audio information sequence set is convenient. Then, according to the number of each audio frame included in the audio information sequence, grouping each audio information in the audio information sequence to obtain a grouped audio information sequence set; and matching the packet audio information sequence set with a standard audio information sequence set to obtain the standard packet audio information sequence set. Therefore, each grouping audio information sequence played by the user is matched with the standard audio information sequence set, so that whether each grouping audio information sequence is matched with the corresponding standard audio information sequence or not is convenient to determine, and the waste of teaching time is reduced. 
And finally, generating an audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set. Therefore, the playing condition of the user can be displayed in the audio detection text, and the correctness of the content played by the user is determined. Thus, the waste of teaching time is reduced.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
Fig. 1 is a schematic diagram of one application scenario of an audio processing method of some embodiments of the present disclosure;
fig. 2 is a flow diagram of some embodiments of an audio processing method according to the present disclosure;
fig. 3 is a schematic structural diagram of some embodiments of an audio processing apparatus according to the present disclosure;
FIG. 4 is a schematic block diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will appreciate that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of one application scenario of an audio processing method of some embodiments of the present disclosure.
In the application scenario of fig. 1, first, the computing device 101 may perform format conversion processing on the audio 102 in response to receiving the audio 102 sent by the user terminal to generate converted audio 103. Second, the computing device 101 may input the converted audio 103 into a pre-trained audio information extraction model, resulting in an audio information sequence 104. The audio information in the audio information sequence 104 includes audio frame numbers. Then, the computing device 101 may perform grouping processing on each audio information in the audio information sequence 104 according to each audio frame number included in the audio information sequence 104, and obtain a grouped audio information sequence set 105. Wherein the respective packet audio information sequences in the above-mentioned packet audio information sequence set 105 are arranged in accordance with the respective audio frame numbers included. The computing device 101 may then match the set of packet audio information sequences 105 described above with the set of standard audio information sequences 106, resulting in a set of standard packet audio information sequences 107. Wherein the standard packet audio information sequence set 107 includes a number of standard packet audio information sequences equal to the number of packet audio information sequences included in the packet audio information sequence set 105. Finally, the computing device 101 can generate audio detection text 108 based on the set of grouped audio information sequences 105 and the set of standard grouped audio information sequences 107.
The computing device 101 may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or may be implemented as a single server or a single terminal device. When the computing device is embodied as software, it may be installed in the hardware devices enumerated above. It may be implemented, for example, as multiple software or software modules to provide distributed services, or as a single software or software module. And is not particularly limited herein.
It should be understood that the number of computing devices in FIG. 1 is merely illustrative. There may be any number of computing devices, as implementation needs dictate.
With continued reference to fig. 2, a flow 200 of some embodiments of an audio processing method according to the present disclosure is shown. The audio processing method comprises the following steps:
step 201, in response to receiving the audio sent by the user terminal, performing format conversion processing on the audio to generate a converted audio.
In some embodiments, an executing entity of the audio processing method (e.g., the computing device 101 shown in fig. 1) may perform format conversion processing on the audio in response to receiving the audio sent by the user terminal to generate converted audio. The user terminal may be a terminal device wirelessly connected to the execution main body. The audio may be the audio recorded by the user terminal and played by the user. The audio may also be audio uploaded by the user to the user terminal. The conversion process may convert the audio into a predetermined audio format. For example, the preset audio format may be a WAV format.
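The format conversion in step 201 can be illustrated with a minimal sketch. The patent only specifies conversion into a preset format such as WAV; the hypothetical helper below assumes the upload has already been decoded to float PCM samples (a real service would first run an external decoder) and shows only the final re-encoding into WAV, using Python's standard library:

```python
import io
import struct
import wave

def pcm_to_wav(samples, sample_rate=16000):
    """Pack float samples in [-1.0, 1.0] into 16-bit mono WAV bytes.

    Hypothetical helper: a real deployment would first decode the
    uploaded file (mp3, m4a, ...) with an external decoder; this only
    covers re-encoding into the preset WAV format of step 201.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples))
    return buf.getvalue()
```

The clamping before packing guards against samples slightly outside [-1.0, 1.0], which would otherwise overflow the 16-bit range.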
Step 202, inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence.
In some embodiments, the execution subject may input the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence. The audio information in the audio information sequence may include, but is not limited to: audio frame number, pitch value, frame number time. The above audio frame number may characterize the minimum duration unit (time granularity) of the audio duration in the audio. The pitch values described above may be used to characterize the pitch. The frame time may be a certain time point in the audio playing time. The audio playing time length may be a time length required for playing the audio. For example, the audio playing time period may be 3 minutes, and the frame number time may be 5s, which represents 5 seconds of the audio playing time period "3 minutes". The audio information extraction model may be a convolutional neural network model trained in advance for extracting audio information. For example, the audio information extraction model may be a ResNet network model.
As an example, the audio information sequence may be:
{"audio frame number": "1", "pitch value": "25", "frame number time": "1s"};
{"audio frame number": "2", "pitch value": "40", "frame number time": "1.01s"};
{"audio frame number": "3", "pitch value": "34", "frame number time": "1.52s"};
{"audio frame number": "4", "pitch value": "89", "frame number time": "2.05s"}.
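An audio information sequence of this shape can be assembled mechanically once per-frame pitch values are available. The sketch below is an illustration, not the patent's extraction model: it assumes the pitch values were already produced by some tracker (the patent uses a pre-trained convolutional model such as a ResNet) and merely attaches frame numbers and timestamps:

```python
def build_audio_info_sequence(pitches, hop_seconds):
    """Attach frame numbers and timestamps to per-frame pitch values.

    Hypothetical helper: `pitches` stands in for the output of the
    patent's pre-trained extraction model; here it is just a list of
    integers, and `hop_seconds` is the assumed time per frame.
    """
    return [
        {"audio frame number": i + 1,
         "pitch value": p,
         "frame number time": round((i + 1) * hop_seconds, 2)}
        for i, p in enumerate(pitches)
    ]
```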
And step 203, grouping each audio information in the audio information sequence according to each audio frame number included in the audio information sequence to obtain a grouped audio information sequence set.
In some embodiments, the execution main body may perform grouping processing on each audio information in the audio information sequence according to the number of each audio frame included in the audio information sequence, so as to obtain a grouped audio information sequence set. In practice, first, the executing entity may perform a note name extraction process on the audio to obtain a note name sequence. The note name extraction process may be extracting each note name in the audio. Second, the executing body may extract a note duration corresponding to the note name. Third, for each note name in the note name sequence, the execution main body may merge the respective audio information corresponding to the note duration corresponding to the note name in the audio information sequence into a grouped audio information sequence, so as to obtain a grouped audio information sequence set. Wherein, each group audio information sequence in the group audio information sequence set is arranged according to the number of the included audio frames from small to large.
As an example, the execution main body may merge audio information corresponding to every two audio frame numbers in the audio information sequence exemplified in step 202 into a packet audio information sequence, resulting in a set of packet audio information sequences:
[{"audio frame number": "1", "pitch value": "25", "frame number time": "1s"},
{"audio frame number": "2", "pitch value": "40", "frame number time": "1.01s"}],
[{"audio frame number": "3", "pitch value": "34", "frame number time": "1.52s"},
{"audio frame number": "4", "pitch value": "89", "frame number time": "2.05s"}].
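The two-frames-per-group merging in the example can be sketched as follows. Note this is a simplification: the patent actually groups frames by the duration of each extracted note name, while a fixed group size is used here so the mechanics stay visible:

```python
def group_audio_info(audio_info_seq, group_size):
    """Split a frame-ordered audio information sequence into groups.

    Simplification: the patent groups frames by note duration; a fixed
    group size is assumed here for illustration only.
    """
    return [audio_info_seq[i:i + group_size]
            for i in range(0, len(audio_info_seq), group_size)]
```

Because the input sequence is already ordered by audio frame number, each resulting group is automatically arranged from the smallest frame number to the largest, as the method requires.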
And 204, matching the grouped audio information sequence set with the standard audio information sequence set to obtain a standard grouped audio information sequence set.
In some embodiments, the execution body may match the set of packet audio information sequences with a set of standard audio information sequences to obtain a set of standard packet audio information sequences. The matching process may be: and for each packet audio information sequence in the packet audio information sequence set, selecting the standard audio information sequence with the minimum distance from the packet audio information sequence from the standard audio information sequence set as a standard packet audio information sequence to obtain a standard packet audio information sequence set. The distance may be a characteristic distance of the sequence of packet audio information from the sequence of standard audio information. For example, the distance may be an euclidean distance. The distance may also be a manhattan distance.
In practice, the execution body may generate the standard audio information sequence set by:
in the first step, the execution subject may input the standard music score to a pre-trained audio information extraction model to obtain a standard audio information sequence. The standard music score may be an audio frequency corresponding to a pre-stored music score.
And secondly, the execution main body carries out grouping processing on the standard audio information sequence to obtain a standard audio information sequence set. The standard audio information in the standard audio information sequence comprises a standard audio frame number, a standard frame number time and a standard pitch value. The standard pitch value can represent the tone height corresponding to a standard audio frame number in the standard music score. The standard audio frame number described above may characterize the minimum duration unit (time granularity) of the score duration in the standard score. The standard frame number time may be a certain time point in the music score playing time length. The music score playing time length can be the time length required for playing the standard music score. For example, the music score playing time period may be 5 minutes, and the standard frame number time may be 3s, which represents the third second of the music score playing time period "5 minutes".
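The nearest-sequence selection described in step 204 can be sketched as follows, under the assumption that each sequence is reduced to a list of pitch values of equal length; Euclidean distance is used here, though the patent also permits Manhattan distance:

```python
import math

def nearest_standard_sequence(group, standards):
    """Return the standard sequence closest to `group`.

    Assumption: sequences are reduced to equal-length lists of pitch
    values; Euclidean distance is used (Manhattan is also allowed by
    the patent).
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(standards, key=lambda s: dist(group, s))
```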
In some optional implementations of some embodiments, the executing body may match the set of packet audio information sequences with a set of standard audio information sequences to obtain a set of standard packet audio information sequences by:
and step one, matching the grouped audio information sequence set with the standard audio information sequence set to generate a matched audio information sequence set. In practice, the executing body may match the set of packet audio information sequences with the set of standard audio information sequences by:
the first substep, for each block audio information sequence in the block audio information sequence set, determining a distance between the block audio information sequence and each standard audio information sequence in the standard audio information sequence set, and obtaining a distance sequence.
And a second substep of composing the obtained distance sequences into a two-dimensional distance matrix. Wherein, the horizontal distances from left to right of the two-dimensional distance matrix correspond to the grouped audio information sequences in the grouped audio information sequence set in sequence. And sequencing each group audio information sequence in the group audio information sequence set according to the number of audio frames from small to large. And the distances in the longitudinal direction of the two-dimensional distance matrix from bottom to top sequentially correspond to the standard audio information sequences in the standard audio information sequence set. And sequencing each standard audio information sequence in the standard audio information sequence set according to the standard audio frame number from small to large. For example, the last distance of the first column in the two-dimensional distance matrix is a distance between the first packet audio information sequence in the set of packet audio information sequences and the first standard audio information sequence in the set of standard audio information sequences. The execution subject may establish a rectangular coordinate system with a lower left corner of a last distance in a first column of the two-dimensional distance matrix as an origin. As an example, the coordinate of the distance of the first packet audio information sequence in the set of packet audio information sequences from the first standard audio information sequence in the set of standard audio information sequences is (1, 1).
And a third sub-step of creating an empty set of target matching audio information sequences.
And a fourth substep of selecting a minimum distance from the distance sequences corresponding to the first grouped audio information sequence in the grouped audio information sequence set, and determining the minimum distance as the initial distance.
A fifth substep of executing the following processing steps based on the initial distance and the two-dimensional distance matrix:
a first processing step of determining coordinates of the initial distance in the two-dimensional distance matrix as initial coordinates;
a second processing step of determining a minimum distance between the initial coordinates and the target coordinates according to the following formula:
λ = d(A_i, B_j) + min[λ(A_{i+1}, B_j), λ(A_i, B_{j+1}), λ(A_{i+1}, B_{j+1})].
Where λ represents the minimum distance. A_i represents the i-th grouped audio information sequence in the set of grouped audio information sequences. B_j represents the j-th standard audio information sequence in the set of standard audio information sequences. (A_i, B_j) represents the initial coordinates. d(A_i, B_j) represents the initial distance. min denotes taking the minimum. λ(A_{i+1}, B_j) represents the distance, accumulated with the initial distance, between the (i+1)-th grouped audio information sequence and the j-th standard audio information sequence. λ(A_i, B_{j+1}) represents the distance, accumulated with the initial distance, between the i-th grouped audio information sequence and the (j+1)-th standard audio information sequence. λ(A_{i+1}, B_{j+1}) represents the distance, accumulated with the initial distance, between the (i+1)-th grouped audio information sequence and the (j+1)-th standard audio information sequence. The target coordinates are the coordinates in the two-dimensional distance matrix corresponding to the result of the minimum function.
And a third processing step of adding a standard audio information sequence corresponding to the initial distance to the target matching audio information sequence set in response to the abscissa of the target coordinate being smaller than the number of the grouped audio information sequences in the grouped audio information sequence set, and executing the processing steps again by using the added target matching audio information sequence set as the target matching audio information sequence set, using the target coordinate as the initial coordinate, and using the minimum distance as the initial distance.
And a fourth processing step of adding a standard audio information sequence corresponding to the initial distance to the target matching audio information sequence set in response to the abscissa of the target coordinate being equal to the number of the grouped audio information sequences in the grouped audio information sequence set, and taking the added target matching audio information sequence set as a matching audio information sequence set.
And a second step of determining the set of matching audio information sequences as a set of standard packet audio information sequences in response to a last matching audio information sequence in the set of matching audio information sequences corresponding to a last standard audio information sequence in the set of standard audio information sequences. In practice, the execution body may determine the set of matching audio information sequences as a set of standard packet audio information sequences in response to a last matching audio information sequence in the set of matching audio information sequences being identical to a last standard audio information sequence in the set of standard audio information sequences.
And thirdly, in response to the last matching audio information sequence in the matching audio information sequence set corresponding to any standard audio information sequence except the last standard audio information sequence in the standard audio information sequence set, matching the grouped audio information sequence set with the matching audio information sequence set to generate a re-matching audio information sequence set serving as a standard grouped audio information sequence set. Wherein the execution body may perform matching processing on the set of packet audio information sequences and the set of matching audio information sequences to generate a set of re-matching audio information sequences as a set of standard packet audio information sequences in response to a last matching audio information sequence in the set of matching audio information sequences being identical to any standard audio information sequence except for the last standard audio information sequence in the set of standard audio information sequences.
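The matching procedure above — an initial distance, a minimum function over the three neighboring coordinates, and advancing until the abscissa reaches the number of grouped sequences — has the shape of a dynamic-time-warping style alignment. A minimal sketch under that reading, where `d` is a caller-supplied distance between one grouped sequence and one standard sequence (all names are illustrative, not from the source):

```python
def min_cumulative_distance(grouped, standard, d):
    """DTW-style cumulative distance between a set of grouped audio
    information sequences and a set of standard audio information
    sequences, following the recurrence
    lam(A_i, B_j) = d(A_i, B_j) + min(lam(A_{i+1}, B_j),
                                      lam(A_i, B_{j+1}),
                                      lam(A_{i+1}, B_{j+1}))."""
    n, m = len(grouped), len(standard)
    INF = float("inf")
    # lam[i][j]: best cumulative distance aligning grouped[i:] with standard[j:]
    lam = [[INF] * (m + 1) for _ in range(n + 1)]
    lam[n][m] = 0.0  # both sequences fully consumed
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            lam[i][j] = d(grouped[i], standard[j]) + min(
                lam[i + 1][j], lam[i][j + 1], lam[i + 1][j + 1]
            )
    return lam[0][0]
```

Identical sequence sets yield a cumulative distance of 0; the index pair achieving the minimum at each step plays the role of the "target coordinates" in the text.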
Step 205, generating an audio detection text according to the packet audio information sequence set and the standard packet audio information sequence set.
In some embodiments, first, the execution body may determine, for each grouped audio information sequence in the grouped audio information sequence set, whether the grouped audio information sequence is identical to the standard grouped audio information sequence corresponding to it in the standard grouped audio information sequence set. Then, the execution body may determine, as the total score for playing, the ratio of the number of grouped audio information sequences identical to their corresponding standard grouped audio information sequences to the number of grouped audio information sequences included in the grouped audio information sequence set. Finally, the execution body may add the total score for playing to an audio detection text template to generate the audio detection text. The audio detection text template may be a preset text template for filling in the total score for playing. For example, the text template may be "your total score for playing is _ score".
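The default scoring path described above can be sketched as follows; the equality comparison of whole grouped sequences and the text template come from the description, while the function name and data shapes are assumptions:

```python
def total_playing_score(grouped_set, standard_grouped_set):
    # count grouped sequences identical to their corresponding standard sequence
    correct = sum(
        1 for g, s in zip(grouped_set, standard_grouped_set) if g == s
    )
    ratio = correct / len(grouped_set)
    # fill the preset audio detection text template with the total score
    return f"your total score for playing is {round(ratio * 100)} score"
```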
In some optional implementations of some embodiments, the executing subject may generate the audio detection text by:
the first step is to obtain the standard audio selection information sent by the user terminal. The standard audio selection information may be information sent by the user through the user terminal to characterize the length of the music score the user has selected to play. For example, the standard audio selection information may be "play the whole song". The standard audio selection information may also be "play from the second sentence to the 15th sentence".
And secondly, in response to the standard audio selection information being the first selection information, determining a ratio of the number of the grouped audio information sequences included in the grouped audio information sequence set to the number of the standard audio information sequences included in the standard audio information sequence set as a playing integrity value. The first selection information can represent that the user has selected, through the user terminal, to play the full song of the standard music score corresponding to the grouped audio information sequence set.
And thirdly, in response to the standard audio selection information being second selection information, intercepting the standard audio information sequence set according to the standard audio selection information to generate an intercepted standard audio information sequence set. In practice, the execution main body may select each standard audio information sequence corresponding to the playing part selected by the user through the user side from the standard audio information sequence set as the intercepted standard audio information sequence set. The second selection information may represent a part of the music score content in the standard music score corresponding to the group audio information sequence set selected by the user through the user terminal.
And fourthly, determining the ratio of the number of the standard grouped audio information sequences in the standard grouped audio information sequence set corresponding to the grouped audio information sequence set to the number of the intercepted standard audio information sequences included in the intercepted standard audio information sequence set as a playing integrity value.
And fifthly, for each pitch value included in the grouped audio information sequence set, in response to determining that the pitch value is the same as the standard pitch value corresponding to the pitch value, determining the pitch value as a pitch accurate value. Here, for each pitch value included in the set of grouped audio information sequences, the execution body may determine the pitch value as a pitch-accurate value in response to determining that the pitch value is the same as a standard pitch value included in standard audio information corresponding to the pitch value.
And sixthly, determining the ratio of the number of determined pitch accurate values to the number of audio information included in the audio information sequence set as the audio accuracy.
And seventhly, generating an audio evaluation score according to the playing integrity value and the audio accuracy. In practice, the executive may generate the audio rating score by the following equation:
Num=α×Pluck+β×Pitch。
where Num represents the audio evaluation score; α represents a first weight and β represents a second weight. The setting of the first weight and the second weight is not limited here; they may be weights obtained from experimental data. Pluck represents the playing integrity value. Pitch represents the audio accuracy.
And eighthly, adding the audio evaluation score, the playing integrity value and the audio accuracy to an audio detection text template to generate an audio detection text.
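Steps two through eight above can be sketched end-to-end. The `selection` argument for the excerpt case, the default weights, and all names are assumptions; the last function follows the equation Num = α × Pluck + β × Pitch from the description:

```python
def playing_integrity(grouped_set, standard_set, selection=None):
    # first selection information: full song; second: excerpt (1-based sentence span)
    if selection is None:
        return len(grouped_set) / len(standard_set)
    start, end = selection
    intercepted = standard_set[start - 1:end]  # intercepted standard sequence set
    return len(grouped_set) / len(intercepted)

def pitch_accuracy(pitches, standard_pitches):
    # ratio of pitch values equal to their corresponding standard pitch value
    accurate = sum(1 for p, s in zip(pitches, standard_pitches) if p == s)
    return accurate / len(pitches)

def audio_rating_score(pluck, pitch, alpha=0.5, beta=0.5):
    # Num = alpha * Pluck + beta * Pitch (the weights are left open in the text)
    return alpha * pluck + beta * pitch
```

The three returned values — playing integrity, audio accuracy, and the evaluation score — are what the eighth step fills into the audio detection text template.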
Optionally, after the eighth step, for each of the grouped audio information sequences in the grouped audio information sequence set except for the last grouped audio information sequence, a difference between a frame number time corresponding to a first audio frame number included in the target grouped audio information sequence and a frame number time corresponding to a last audio frame number included in the above grouped audio information sequence is determined as the interval duration.
In some embodiments, the execution body may determine, for each grouped audio information sequence in the grouped audio information sequence set except the last grouped audio information sequence, the difference between the frame number time included in the first target grouped audio information of the target grouped audio information sequence and the frame number time included in the last grouped audio information of the grouped audio information sequence as the interval duration. The target grouped audio information sequence is the first grouped audio information sequence following the grouped audio information sequence.
Optionally, for each standard packet audio information sequence in the standard packet audio information sequence set except for the last standard packet audio information sequence, a difference between a standard frame number time corresponding to the first standard audio frame number included in the target standard packet audio information sequence and a standard frame number time corresponding to the last standard audio frame number included in the standard packet audio information sequence is determined as the standard interval duration.
In some embodiments, the execution subject may determine, as the standard interval duration, a difference between a standard frame number time included in a first target standard packet audio information of the target standard packet audio information sequence and a standard frame number time included in a last standard packet audio information of the standard packet audio information sequence, for each standard packet audio information sequence of the standard packet audio information sequence set other than the last standard packet audio information sequence. Wherein the target standard packet audio information sequence is a first standard packet audio information sequence after the standard packet audio information sequence.
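The interval-duration computation described above (for both the played and the standard groups) reduces to a difference between adjacent group boundaries. A sketch assuming each grouped sequence is represented by a `(first_frame_time, last_frame_time)` tuple, which is an illustrative representation rather than one fixed by the source:

```python
def interval_durations(group_times):
    """group_times: per grouped sequence, (frame time of the first audio
    frame, frame time of the last audio frame). Returns one interval
    duration per adjacent pair of grouped sequences."""
    return [
        group_times[k + 1][0] - group_times[k][1]  # next group's start minus this group's end
        for k in range(len(group_times) - 1)
    ]
```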
Optionally, for each determined interval duration in each interval duration, determining a ratio of the interval duration to a standard interval duration corresponding to the interval duration as a rhythm accuracy weight.
In some embodiments, the execution subject may determine, for each of the determined interval durations, a ratio of the interval duration to a standard interval duration corresponding to the interval duration as a tempo accuracy weight.
Optionally, a mode of the determined respective tempo accuracy weights is determined as a standard tempo value.
In some embodiments, the execution subject may determine a mode of the determined respective tempo accuracy weights as a standard tempo value.
Optionally, a standard tempo value range is generated according to the standard tempo value.
In some embodiments, first, the execution subject may determine a difference between the standard tempo value and a preset tempo value as a minimum value of the standard tempo value range. Secondly, the executing body may determine a sum of the standard tempo value and the preset tempo value as a maximum value of the standard tempo value range, to obtain the standard tempo value range. Wherein, the preset rhythm value is a positive number smaller than the standard rhythm value.
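A sketch of the mode-based standard tempo value and its range, where `preset` stands in for the preset tempo value (a positive number smaller than the standard tempo value); the tie-breaking choice is an assumption the source does not specify:

```python
from statistics import multimode

def standard_tempo_range(tempo_weights, preset):
    # mode of the tempo accuracy weights; on ties, multimode returns every
    # mode and we take the first (an assumption -- the source does not say)
    standard = multimode(tempo_weights)[0]
    # range is [mode - preset, mode + preset]
    return (standard - preset, standard + preset)
```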
Optionally, for each of the determined tempo accuracy weights, performing the following processing steps:
the first processing step, determine whether the accurate weight of the rhythm is in the range of the standard rhythm value.
In some embodiments, the executing entity may determine whether the tempo accuracy weight is within the standard tempo value range.
And a second processing step, wherein an accurate rhythm mark corresponding to the accurate rhythm weight value is generated in response to the fact that the accurate rhythm weight value is determined to be within the range of the standard rhythm value.
In some embodiments, the executing entity may generate an accurate tempo flag corresponding to the tempo accuracy weight in response to determining that the tempo accuracy weight is within the standard tempo value range. Wherein, the accurate rhythm mark can represent that the accurate rhythm weight value is in the standard rhythm value range.
And a third processing step, wherein a fast rhythm mark corresponding to the accurate rhythm weight value is generated in response to the fact that the accurate rhythm weight value is smaller than the minimum value of all standard rhythm values included in the standard rhythm value range.
In some embodiments, the executing entity may generate a fast tempo flag corresponding to the tempo accuracy weight in response to determining that the tempo accuracy weight is smaller than a minimum value of standard tempo values included in the standard tempo value range. The fast rhythm mark can represent that the accurate rhythm weight value is smaller than the minimum value of all standard rhythm values included in the standard rhythm value range.
A fourth processing step, in response to determining that the tempo accuracy weight is greater than the maximum value of the standard tempo values included in the standard tempo value range, generating a slow tempo flag corresponding to the tempo accuracy weight.
In some embodiments, the executing entity may generate a slow tempo flag corresponding to the tempo accuracy weight in response to determining that the tempo accuracy weight is greater than a maximum value of standard tempo values included in the standard tempo value range. The slow tempo mark can represent that the tempo accurate weight is larger than the maximum value of each standard tempo value included in the standard tempo value range.
Optionally, a ratio of the number of accurate tempo markers included in each generated accurate tempo marker to the number of audio information included in the above-mentioned audio information sequence set is determined as a tempo accuracy amount.
In some embodiments, the execution subject may determine, as the tempo accuracy amount, a ratio of a number of accurate tempo markers included in each of the generated accurate tempo markers to a number of audio information included in the audio information sequence set.
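The four processing steps and the tempo accuracy amount above amount to classifying each weight against the standard tempo value range; the string marker names are illustrative stand-ins for the accurate, fast, and slow tempo marks:

```python
def classify_tempo_weights(weights, lo, hi):
    # accurate: within [lo, hi]; fast: below the range; slow: above it
    marks = []
    for w in weights:
        if lo <= w <= hi:
            marks.append("accurate")
        elif w < lo:
            marks.append("fast")
        else:
            marks.append("slow")
    return marks

def tempo_accuracy_amount(marks, num_audio_info):
    # ratio of accurate tempo marks to the number of audio information entries
    return marks.count("accurate") / num_audio_info
```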
Optionally, a tempo detection text is generated according to the tempo accuracy, the generated fast tempo markers, the generated accurate tempo markers, and the generated slow tempo markers.
In some embodiments, first, the executing entity may create an empty tempo marker set as the target tempo marker set. Second, the execution body may add each of the generated fast tempo markers, each of the generated accurate tempo markers, and each of the generated slow tempo markers to the target tempo marker set. Third, the executing body may arrange the target tempo markers in the target tempo marker set from smallest to largest according to their corresponding audio frame numbers, so as to obtain a tempo marker sequence. Fourth, the execution subject may add the tempo accuracy amount and the tempo marker sequence to a tempo detection text template to generate a tempo detection text. The tempo detection text template may be a text template used for filling in the tempo accuracy amount and the tempo marker sequence. For example, the text template may be: "your tempo accuracy is _ and your tempo marker sequence is _".
Optionally, the rhythm detection text and the audio detection text are sent to the user side for display.
In some embodiments, the execution subject may send the rhythm detection text and the audio detection text to the user side for display.
Optionally, the related content above serves as an invention point of the present disclosure, and solves the second technical problem mentioned in the background art, namely that when a teacher determines whether the content played by the user is correct, the teacher cannot determine the tempo speed of each tone played by the user, so that the user needs to repeatedly practice the full-song content, wasting practice time. The factor causing the waste of practice time is often as follows: when the teacher determines whether the content played by the user is correct, the teacher cannot determine the tempo speed of each tone played by the user, so that the user needs to repeatedly practice the full-song content, wasting practice time. If this factor is addressed, the waste of practice time can be reduced. To achieve this, first, for each grouped audio information sequence in the grouped audio information sequence set except the last one, the difference between the first audio frame number included in the target grouped audio information sequence and the last audio frame number included in the grouped audio information sequence is determined as the interval duration; for each standard grouped audio information sequence in the standard grouped audio information sequence set except the last one, the difference between the first standard audio frame number included in the target standard grouped audio information sequence and the last standard audio frame number included in the standard grouped audio information sequence is determined as the standard interval duration. This provides data support for determining the tempo accuracy weights. 
Secondly, for each determined interval duration in each interval duration, determining the ratio of the interval duration to the standard interval duration corresponding to the interval duration as a rhythm accurate weight. Therefore, a mode can be selected from the determined accurate rhythm weights, and a standard rhythm value range is generated according to the mode. Then, determining the mode in each determined accurate rhythm weight as a standard rhythm value; and generating a standard rhythm value range according to the standard rhythm value. Thereby, the degree of the tempo of the user's playing can be determined. Then, for each rhythm accurate weight value in the determined rhythm accurate weight values, executing the following processing steps: firstly, whether the accurate rhythm weight value is in the standard rhythm value range is determined. Thereby, a corresponding mark can be generated according to the result of the determination. Secondly, generating an accurate rhythm mark corresponding to the accurate rhythm weight value in response to the fact that the accurate rhythm weight value is determined to be within the standard rhythm value range; in response to determining that the accurate tempo weight is smaller than the minimum value of the standard tempo values included in the standard tempo value range, generating a fast tempo mark corresponding to the accurate tempo weight; and generating a slow rhythm mark corresponding to the rhythm accurate weight value in response to the fact that the rhythm accurate weight value is larger than the maximum value of all standard rhythm values included in the standard rhythm value range. Therefore, the speed of the rhythm played by the user can be determined according to the generated marks, and data support is provided for generating the accurate rhythm. 
And then, determining the ratio of the number of the accurate rhythm marks included in each generated accurate rhythm mark to the number of the audio information included in the audio information sequence set as the rhythm accurate amount. Thereby, generation of tempo detection text is facilitated. Finally, generating a rhythm detection text according to the rhythm accuracy quantity, the generated fast rhythm marks, the generated accurate rhythm marks and the generated slow rhythm marks; and sending the rhythm detection text and the audio detection text to the user side for display. Thereby, the degree of how fast the rhythm content is played by the user is determined. And the speed degree of the rhythm content played by the user is determined, so that the user can practice aiming at the content with inaccurate playing rhythm, and the waste of the practice time is reduced.
The above embodiments of the present disclosure have the following beneficial effects: by the audio processing method of some embodiments of the present disclosure, waste of teaching time is reduced. Specifically, the reason why the teaching time is wasted is that: when the teacher determines whether the content played by the user is correct, the teacher easily omits the content played wrongly, so that the content played by the user needs to be played repeatedly to find out the content played wrongly, and teaching time is wasted. Based on this, the audio processing method according to some embodiments of the present disclosure first performs format conversion processing on the audio in response to receiving the audio sent by the user end, so as to generate a converted audio. Thereby, the audio format can be converted into an audio format to which the audio information extraction model can be applied. And secondly, inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence. Therefore, the audio information can be grouped, and the matching with the standard audio information sequence set is convenient. Then, according to the number of each audio frame included in the audio information sequence, grouping each audio information in the audio information sequence to obtain a grouped audio information sequence set; and matching the packet audio information sequence set with a standard audio information sequence set to obtain the standard packet audio information sequence set. Therefore, each grouped audio information sequence played by the user is matched with the standard audio information sequence set, whether each grouped audio information sequence is matched with the corresponding standard audio information sequence or not is convenient to determine, and therefore waste of teaching time is reduced. 
And finally, generating an audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set. Therefore, the playing condition of the user can be displayed in the audio detection text, and the correctness of the content played by the user is determined. Thus, the waste of teaching time is reduced.
With further reference to fig. 3, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of an audio processing apparatus, which correspond to those shown in fig. 2, and which may be applied in particular in various electronic devices.
As shown in fig. 3, the audio processing apparatus 300 of some embodiments includes: a conversion processing unit 301, an input unit 302, a packet processing unit 303, a matching processing unit 304, and a generation unit 305. The conversion processing unit 301 is configured to, in response to receiving audio sent by a user, perform format conversion processing on the audio to generate converted audio; the input unit 302 is configured to input the converted audio into a pre-trained audio information extraction model, resulting in an audio information sequence, wherein the audio information in the audio information sequence includes audio frame numbers; the grouping processing unit 303 is configured to perform grouping processing on each audio information in the audio information sequence according to each audio frame number included in the audio information sequence to obtain a grouped audio information sequence set, where each grouped audio information sequence in the grouped audio information sequence set is arranged according to each included audio frame number; the matching processing unit 304 is configured to match the set of packet audio information sequences with a set of standard audio information sequences, resulting in a set of standard packet audio information sequences, wherein the set of standard packet audio information sequences comprises a number of standard packet audio information sequences equal to a number of packet audio information sequences comprised by the set of packet audio information sequences; the generating unit 305 is configured to generate the audio detection text based on the set of grouped audio information sequences and the set of standard grouped audio information sequences.
It will be understood that the units described in the apparatus 300 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 300 and the units included therein, and are not described herein again.
Referring now to FIG. 4, a block diagram of an electronic device 400 (such as the computing device 101 shown in FIG. 1) suitable for implementing some embodiments of the present disclosure is shown. The electronic device shown in FIG. 4 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. The RAM 403 also stores various programs and data necessary for the operation of the electronic device 400. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 4 illustrates an electronic device 400 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 4 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 409, or from the storage device 408, or from the ROM 402. The computer program, when executed by the processing apparatus 401, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: and responding to the received audio sent by the user terminal, and performing format conversion processing on the audio to generate converted audio. And inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence, wherein the audio information in the audio information sequence comprises audio frame numbers. And according to the number of each audio frame included in the audio information sequence, grouping each audio information in the audio information sequence to obtain a grouped audio information sequence set, wherein each grouped audio information sequence in the grouped audio information sequence set is arranged according to the number of each included audio frame. And matching the packet audio information sequence set with a standard audio information sequence set to obtain a standard packet audio information sequence set, wherein the number of the standard packet audio information sequences included in the standard packet audio information sequence set is equal to the number of the packet audio information sequences included in the packet audio information sequence set. And generating an audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a conversion processing unit, an input unit, a grouping processing unit, a matching processing unit, and a generating unit. The names of these units do not, in some cases, limit the units themselves; for example, the conversion processing unit may also be described as "a unit that, in response to receiving audio sent by a user terminal, performs format conversion processing on the audio to generate converted audio".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description covers merely preferred embodiments of the present disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (8)

1. An audio processing method, comprising:
in response to receiving audio sent by a user terminal, performing format conversion processing on the audio to generate converted audio;
inputting the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence, wherein each piece of audio information in the audio information sequence comprises an audio frame number;
grouping the audio information in the audio information sequence according to the audio frame numbers included in the audio information sequence to obtain a grouped audio information sequence set, wherein each grouped audio information sequence in the grouped audio information sequence set is ordered by the audio frame numbers it includes;
matching the grouped audio information sequence set against a standard audio information sequence set to obtain a standard grouped audio information sequence set, wherein the number of standard grouped audio information sequences in the standard grouped audio information sequence set equals the number of grouped audio information sequences in the grouped audio information sequence set; and
generating an audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set.
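A minimal Python sketch of the grouping step in claim 1. The fixed-size frame window and the record layout are illustrative assumptions; the claim only requires that records be grouped and ordered by audio frame number.

```python
from itertools import groupby

def group_audio_info(audio_info_seq, frames_per_group=4):
    """Group audio-information records by audio frame number.

    Each record is a dict with a 'frame' key (the frame number produced
    by the extraction model). Records whose frame numbers fall in the
    same window of `frames_per_group` frames form one grouped sequence,
    and every grouped sequence is ordered by frame number.
    """
    ordered = sorted(audio_info_seq, key=lambda info: info["frame"])
    return [list(members)
            for _, members in groupby(
                ordered, key=lambda info: info["frame"] // frames_per_group)]

infos = [{"frame": 5}, {"frame": 0}, {"frame": 3}, {"frame": 4}]
grouped = group_audio_info(infos)
# frames 0 and 3 land in the first window; frames 4 and 5 in the second
```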
2. The method of claim 1, wherein matching the grouped audio information sequence set against the standard audio information sequence set to obtain the standard grouped audio information sequence set comprises:
matching the grouped audio information sequence set with the standard audio information sequence set to generate a matched audio information sequence set;
in response to the last matched audio information sequence in the matched audio information sequence set corresponding to the last standard audio information sequence in the standard audio information sequence set, determining the matched audio information sequence set as the standard grouped audio information sequence set; and
in response to the last matched audio information sequence corresponding to any standard audio information sequence other than the last standard audio information sequence, matching the grouped audio information sequence set with the matched audio information sequence set to generate a re-matched audio information sequence set as the standard grouped audio information sequence set.
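The match/re-match logic of claim 2 can be sketched as follows; the similarity metric (the count of shared values) is an assumption, since the claim does not specify how sequences are compared.

```python
def match_sequences(grouped_set, standard_set):
    """Match each grouped sequence to the most similar standard sequence.

    Similarity is the (illustrative) number of shared values; the
    patent leaves the actual comparison unspecified.
    """
    return [max(standard_set, key=lambda s: len(set(g) & set(s)))
            for g in grouped_set]

def standard_grouped_set(grouped_set, standard_set):
    """If the last match already corresponds to the final standard
    sequence, keep the matches; otherwise re-match against the
    matched subset, per claim 2."""
    matched = match_sequences(grouped_set, standard_set)
    if matched[-1] == standard_set[-1]:
        return matched
    return match_sequences(grouped_set, matched)

standards = [[60, 62], [64, 65], [67, 69]]
full = standard_grouped_set([[60, 62], [64, 65], [67, 69]], standards)
partial = standard_grouped_set([[60, 62], [64, 65]], standards)
```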
3. The method of claim 1, wherein generating the audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set comprises:
acquiring standard audio selection information sent by the user terminal;
in response to the standard audio selection information being first selection information, determining the ratio of the number of grouped audio information sequences in the grouped audio information sequence set to the number of standard audio information sequences in the standard audio information sequence set as a playing integrity value; and
in response to the standard audio selection information being second selection information, truncating the standard audio information sequence set according to the standard audio selection information to generate a truncated standard audio information sequence set, and determining the ratio of the number of standard grouped audio information sequences corresponding to the grouped audio information sequence set to the number of truncated standard audio information sequences in the truncated standard audio information sequence set as the playing integrity value.
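The playing-integrity computation in claim 3 reduces to a ratio of sequence counts; encoding the second selection information as a `(start, end)` slice is an assumption.

```python
def playing_integrity(grouped_set, standard_set, excerpt=None):
    """Claim 3: ratio of grouped sequences to standard sequences.

    With the first selection information (excerpt=None) the full
    standard set is the denominator; with the second, the standard set
    is first truncated to the selected range (an assumed (start, end)
    encoding of the selection information).
    """
    if excerpt is not None:
        start, end = excerpt
        standard_set = standard_set[start:end]
    return len(grouped_set) / len(standard_set)

full_score = playing_integrity(["g1", "g2", "g3"], ["s1", "s2", "s3", "s4"])
excerpt_score = playing_integrity(["g1", "g2", "g3"],
                                  ["s1", "s2", "s3", "s4"], excerpt=(0, 3))
```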
4. The method of claim 3, wherein the audio information in the audio information sequence further comprises a pitch value, and the standard audio information in the standard audio information sequence comprises a standard pitch value; and
generating the audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set further comprises:
for each pitch value included in the grouped audio information sequence set, determining the pitch value as an accurate pitch value in response to determining that the pitch value is the same as its corresponding standard pitch value; and
determining the ratio of the number of accurate pitch values so determined to the number of audio information items included in the grouped audio information sequence set as an audio accuracy value.
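Claim 4's audio-accuracy value can be sketched as an element-wise pitch comparison; pairing pitch values by position is an assumption, since the claim only states that each pitch value corresponds to a standard pitch value.

```python
def audio_accuracy(pitch_values, standard_pitch_values):
    """Count the pitch values equal to their corresponding standard
    pitch values, then divide by the total number of audio records."""
    accurate = sum(1 for p, s in zip(pitch_values, standard_pitch_values)
                   if p == s)
    return accurate / len(pitch_values)

accuracy = audio_accuracy([60, 62, 64], [60, 61, 64])  # 2 of 3 pitches match
```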
5. The method of claim 4, wherein generating the audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set further comprises:
generating an audio evaluation score according to the playing integrity value and the audio accuracy value; and
adding the audio evaluation score, the playing integrity value, and the audio accuracy value to an audio detection text template to generate the audio detection text.
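Claim 5 leaves the scoring formula and the template wording open; a weighted average and a simple format string are assumed here.

```python
def audio_detection_text(integrity, accuracy, w_integrity=0.5, w_accuracy=0.5):
    """Combine the playing-integrity value and the audio-accuracy value
    into an evaluation score, then fill an (assumed) detection-text
    template with all three figures."""
    score = 100 * (w_integrity * integrity + w_accuracy * accuracy)
    return (f"Score: {score:.1f}/100; "
            f"completeness: {integrity:.0%}; pitch accuracy: {accuracy:.0%}")

report = audio_detection_text(0.8, 0.9)
```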
6. An audio processing apparatus, comprising:
a conversion processing unit configured to, in response to receiving audio sent by a user terminal, perform format conversion processing on the audio to generate converted audio;
an input unit configured to input the converted audio into a pre-trained audio information extraction model to obtain an audio information sequence, wherein each piece of audio information in the audio information sequence comprises an audio frame number;
a grouping processing unit configured to group the audio information in the audio information sequence according to the audio frame numbers included in the audio information sequence to obtain a grouped audio information sequence set, wherein each grouped audio information sequence in the grouped audio information sequence set is ordered by the audio frame numbers it includes;
a matching processing unit configured to match the grouped audio information sequence set against a standard audio information sequence set to obtain a standard grouped audio information sequence set, wherein the number of standard grouped audio information sequences in the standard grouped audio information sequence set equals the number of grouped audio information sequences in the grouped audio information sequence set; and
a generating unit configured to generate an audio detection text according to the grouped audio information sequence set and the standard grouped audio information sequence set.
7. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
8. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 5.
CN202210332724.0A 2022-03-31 2022-03-31 Audio processing method and device, electronic equipment and computer readable medium Pending CN114822455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210332724.0A CN114822455A (en) 2022-03-31 2022-03-31 Audio processing method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN114822455A true CN114822455A (en) 2022-07-29

Family

ID=82533327

Country Status (1)

Country Link
CN (1) CN114822455A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination