CN114822457A - Music score determination method and device, electronic equipment and computer readable medium - Google Patents

Music score determination method and device, electronic equipment and computer readable medium

Info

Publication number
CN114822457A
CN114822457A (application CN202210372740.2A)
Authority
CN
China
Prior art keywords
information
target
audio
score
fragment
Prior art date
Legal status
Pending
Application number
CN202210372740.2A
Other languages
Chinese (zh)
Inventor
张航
徐豪骏
李山亭
Current Assignee
Shanghai Miaoke Information Technology Co ltd
Original Assignee
Shanghai Miaoke Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Miaoke Information Technology Co., Ltd.
Priority to CN202210372740.2A
Publication of CN114822457A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0033 - Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/0041 - Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
    • G10H1/0058 - Transmission between separate instruments or between individual components of a musical system
    • G10H1/0066 - Transmission between separate instruments or between individual components of a musical system using a MIDI interface
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00 - Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/005 - Non-interactive screen display of musical or status data
    • G10H2220/015 - Musical staff, tablature or score displays, e.g. for score reading during a performance
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 - Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121 - Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131 - Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141 - Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure provide a music score determination method and apparatus, an electronic device, and a computer-readable medium. One embodiment of the method comprises: acquiring target audio information; inputting the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set; generating target music score fragment information based on a music score fragment information set in a pre-constructed music score fragment library, the audio feature information set, and a music score matching model included in the pre-trained audio recognition model, obtaining a target music score fragment information sequence; and sending the target music score fragment information sequence to a display terminal corresponding to a target user for display. This embodiment accurately locates the specific fragments of the score that correspond to the audio, improving the accuracy of score fragment determination.

Description

Music score determination method and device, electronic equipment and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a music score determination method and apparatus, an electronic device, and a computer-readable medium.
Background
Score determination is a technique for determining the music score a user wants. Currently, a score is usually determined as follows: first, note information is identified in the audio information (e.g., the audio information corresponding to an audio piece). The note information is then converted into a MIDI (Musical Instrument Digital Interface) file. Finally, the desired score is determined from the note sequence in the MIDI file.
However, determining the score in this way often runs into the following technical problems:
first, the specific position within the score of the fragment corresponding to the audio cannot be located, so the score fragment corresponding to the audio cannot be determined accurately;
second, because MIDI files do not support audio sung by the user, when the audio information is audio sung by the user, the corresponding score fragment cannot be determined;
third, when the audio information is audio sung by the user, the standard pitches and the score fragment corresponding to the audio cannot be determined from the feature information corresponding to the audio.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a score determination method, apparatus, electronic device and computer readable medium to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a score determination method, including: acquiring target audio information, where the target audio information is audio information to be recognized and represents audio played by a target user; inputting the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set; generating target music score fragment information based on a music score fragment information set in a pre-constructed music score fragment library, the audio feature information set, and a music score matching model included in the pre-trained audio recognition model, obtaining a target music score fragment information sequence; and sending the target music score fragment information sequence to a display terminal corresponding to the target user for display.
In a second aspect, some embodiments of the present disclosure provide a score determination apparatus, including: an acquisition unit configured to acquire target audio information, where the target audio information is audio information to be recognized and represents audio played by a target user; an information input unit configured to input the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set; an information generation unit configured to generate target music score fragment information based on a music score fragment information set in a pre-constructed music score fragment library, the audio feature information set, and a music score matching model included in the pre-trained audio recognition model, obtaining a target music score fragment information sequence; and an information sending unit configured to send the target music score fragment information sequence to a display terminal corresponding to the target user for display.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.
The above embodiments of the present disclosure have the following beneficial effects: the score determination method of some embodiments of the present disclosure can accurately determine the score fragment corresponding to an audio. Specifically, the reasons a score fragment could not be determined accurately before are these: first, the specific position within the score of the fragment corresponding to the audio cannot be located, so the fragment cannot be determined accurately; second, because MIDI files do not support audio sung by the user, when the audio information is sung audio, the corresponding score fragment cannot be determined. Based on this, the score determination method of some embodiments of the present disclosure proceeds as follows. First, target audio information is acquired, where the target audio information is audio information to be recognized and represents audio played by a target user. Then, the target audio information is input into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set. Because the target audio information does not need to be converted into a MIDI file, the problem that MIDI files do not support audio sung by the user is avoided; meanwhile, the audio features contained in the target audio information can be extracted well. Next, target music score fragment information is generated based on the music score fragment information set in a pre-constructed music score fragment library, the audio feature information set, and the music score matching model included in the pre-trained audio recognition model, obtaining a target music score fragment information sequence. In this way, the score fragment corresponding to the audio information can be matched accurately. Finally, the target music score fragment information sequence is sent to a display terminal corresponding to the target user for display. The specific fragments of the score can thus be determined accurately from the audio information, improving the accuracy of score fragment determination.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
Fig. 1 is a schematic diagram of an application scenario of a score determination method of some embodiments of the present disclosure;
fig. 2 is a flow chart of some embodiments of a score determination method according to the present disclosure;
fig. 3 is a schematic structural diagram of some embodiments of a score determination apparatus according to the present disclosure;
FIG. 4 is a schematic block diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the invention are shown in the drawings. The embodiments and the features of the embodiments in the present disclosure may be combined with each other where there is no conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of a score determination method of some embodiments of the present disclosure.
In the application scenario of fig. 1, first, in step 102, an information processing terminal in the computing device 101 may acquire target audio information, which represents audio played and/or sung by a target user. Then, in step 103, the information processing terminal may input the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set. Next, in step 104, the information processing terminal may generate target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, obtaining a target score fragment information sequence. Finally, in step 105, the information processing terminal may send the target score fragment information sequence 106 to a display terminal corresponding to the target user for display.
The computing device 101 may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or as a single server or a single terminal device. When the computing device is software, it may be installed in the hardware devices listed above, implemented, for example, as multiple pieces of software or software modules providing a distributed service, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the number of computing devices in FIG. 1 is merely illustrative. There may be any number of computing devices, as implementation needs dictate.
With continuing reference to fig. 2, a flow 200 of some embodiments of a score determination method according to the present disclosure is shown. The music score determining method comprises the following steps:
step 201, obtaining target audio information.
In some embodiments, the executing body of the score determination method (e.g., the computing device 101 shown in fig. 1) may obtain the target audio information through a wired or wireless connection. The target audio information is audio information to be recognized and represents audio played and/or sung by a target user.
As an example, the target audio information may be information characterizing audio time-domain features, i.e., the features in the time domain of the audio corresponding to the target audio information.
Step 202, inputting the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, so as to obtain an audio feature information set.
In some embodiments, the executing entity may input the target audio information into a feature recognition model included in the pre-trained audio recognition model to generate audio feature information, so as to obtain the audio feature information set. The audio feature information may be used to characterize time-frequency features of the target audio information. The audio recognition model may be a model for recognizing audio information to generate a corresponding score fragment. The audio recognition model comprises a feature recognition model and a music score matching model. For example, the audio feature information may be a mel-frequency spectrogram. The feature recognition model may be a convolutional neural network model.
In some optional implementations of some embodiments, the executing body inputting the target audio information into the feature recognition model included in the pre-trained audio recognition model to generate audio feature information and obtain the audio feature information set may include the following steps:
first, a first feature information set is generated according to the target audio information.
The first feature information in the first feature information set may be log mel-frequency filter bank feature information.
Optionally, the generating, by the execution main body, the first feature information set according to the target audio information may include the following sub-steps:
the first substep, with preset step length, adopt the fixed time window to carry on the segmentation processing to the above-mentioned goal audio information, in order to produce the segmental audio information, get the segmental audio information set.
The preset step may be a moving step of the fixed time window. The fixed time window may be a window for segmenting the target audio information. Wherein, the window length of the fixed time window is the target time length. For example, the target time length may be 15 ms. The segmented audio information includes a plurality of sampling points and sound intensity values corresponding to the sampling points.
As an example, the time length of the target audio information may be 60 ms and the preset step 15 ms. Segmenting the target audio information then generates four pieces of segmented audio information, each 15 ms long, and each piece includes 240 sampling points and the sound intensity value of each sampling point.
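As a concrete illustration, the segmentation can be sketched in a few lines of Python. The 16 kHz sample rate (240 samples per 15 ms) is an assumption inferred from the numbers above, and frame_audio is a hypothetical helper, not part of the disclosure:

```python
import numpy as np

def frame_audio(samples: np.ndarray, frame_len: int, hop_len: int) -> np.ndarray:
    """Split a 1-D sample array into fixed-length frames with a moving step."""
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# 60 ms of audio at an assumed 16 kHz = 960 samples; 15 ms window and step = 240 samples.
audio = np.random.randn(960)
frames = frame_audio(audio, frame_len=240, hop_len=240)
print(frames.shape)  # (4, 240), matching the four segments in the example
```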
And a second substep of performing weighting processing on each piece of segmented audio information in the set of segmented audio information to generate weighted audio information, thereby obtaining a set of weighted audio information.
The weighted audio information may be obtained by weighting the sound intensity value corresponding to each sampling point in the segmented audio information. The executing body performs weighting processing on the sound intensity value corresponding to each sampling point in each piece of segmented audio information in the segmented audio information set to generate weighted audio information, so as to obtain a weighted audio information set, which can be implemented by the following formula:
[Formula image in the original: definition of the window weight W(n).]
S′(n) = S(n) × W(n)
where W(n) denotes the weight of the n-th sampling point, n denotes the index of a sampling point, N denotes the number of sampling points included in the segmented audio information, S(n) denotes the sound intensity value of the n-th sampling point, and S′(n) denotes the weighted sound intensity value of the n-th sampling point.
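The windowing itself is a single multiplication per sample. The weight function W(n) survives only as an image in the source, so the Hamming window below is an assumption; only S′(n) = S(n) × W(n) is taken from the text:

```python
import numpy as np

def apply_window(frame: np.ndarray) -> np.ndarray:
    """Weight each sample of one segmented frame: S'(n) = S(n) * W(n)."""
    n = len(frame)
    # Assumed weight function: a Hamming window over the N samples.
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))
    return frame * w
```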
And a third substep of performing feature transformation processing on each weighted audio information in the weighted audio information set to generate feature transformation information, so as to obtain a feature transformation information set.
The feature transformation information may be information obtained by performing feature transformation processing on a sound intensity value corresponding to each sampling point in the weighted audio information. The feature transformation information is used for characterizing the features of the target audio information in the frequency domain. The above-mentioned characteristic transform may be a fast fourier transform.
A fourth substep of squaring each feature transformation information in the feature transformation information set to generate squared feature information, to obtain a squared feature information set.
The executing body squares the sound intensity value corresponding to each sampling point in each feature transformation information in the feature transformation information set to generate squared feature information, so as to obtain a squared feature information set, which can be implemented by the following formula:
X′(k) = |X(k)|²
where k denotes the index of a sample point after the fast Fourier transform, X(k) denotes the sound intensity value of the k-th sample point, and X′(k) denotes the value of the k-th sample point after squaring.
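A minimal sketch of this substep, assuming NumPy's real FFT; the FFT size of 256 (the next power of two above 240 samples) is an assumption:

```python
import numpy as np

def power_spectrum(windowed_frame: np.ndarray, n_fft: int = 256) -> np.ndarray:
    """Fast Fourier transform followed by squaring: X'(k) = |X(k)|**2."""
    spectrum = np.fft.rfft(windowed_frame, n=n_fft)
    return np.abs(spectrum) ** 2
```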
A fifth substep, performing filtering processing on each squared feature information in the squared feature information set to generate first feature information, so as to obtain the first feature information set.
The executing body may filter each piece of squared feature information in the squared feature information set through the filters of a filter bank to generate mel-frequency information (i.e., the first feature information), obtaining a mel-frequency information set. For example, the filter bank may be a mel filter bank.
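The sketch below renders this substep with librosa's mel filter bank; using librosa, the 40 mel bands, and the small epsilon are assumptions, while the log applied to the filter outputs follows from the log mel filter-bank features named above:

```python
import numpy as np
import librosa  # any mel filter-bank implementation works; librosa is assumed here

def log_mel_features(power_spec: np.ndarray, sr: int = 16000,
                     n_fft: int = 256, n_mels: int = 40) -> np.ndarray:
    """Apply a mel filter bank to a power spectrum and take the log,
    yielding log mel filter-bank features (the first feature information)."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ power_spec + 1e-10)  # epsilon avoids log(0)
```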
And secondly, sequentially inputting the first feature information in the first feature information set into the feature recognition model to generate audio feature information, so as to obtain the audio feature information set.
Optionally, the feature recognition model may include: a convolutional subnetwork. The step of the executing main body sequentially inputting the first feature information in the first feature information set into the feature recognition model to generate audio feature information, and obtaining the audio feature information set may include the following substeps:
a first sub-step of performing convolution processing on each first feature information in the first feature information set through the convolution sub-network to generate second feature information, thereby obtaining a second feature information set.
The convolution sub-network may be a convolutional neural network.
And a second substep of generating a third feature information set based on the second feature information set.
The execution body may screen, for each second feature information in the second feature information set, a maximum value from the second feature information as third feature information.
As an example, the second feature information may be a feature matrix (shown as an image in the original) whose largest element is 28; the maximum value 28 is selected from the second feature information as the third feature information.
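In code this screening is a single reduction; the matrix values below are illustrative only:

```python
import numpy as np

# Illustrative second feature information; 28 is its largest element.
second_feature = np.array([[12.0, 7.0, 28.0],
                           [3.0, 19.0, 5.0]])
third_feature = second_feature.max()  # 28.0, kept as the third feature information
```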
And a third substep of normalizing each piece of third feature information in the third feature information set to generate audio feature information, so as to obtain the audio feature information set.
The execution body may perform normalization processing on each third feature information in the third feature information set through the following formula to generate audio feature information, so as to obtain the audio feature information set.
[Formula image in the original: the normalization formula.]
where i denotes the index of the third feature information, Z_i denotes the frequency value corresponding to the i-th third feature information, Z′_i denotes the frequency value corresponding to the i-th audio feature information, and C denotes the total number of third feature information.
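The normalization formula itself survives only as an image in the source, so the simple sum normalization over the C values below is an assumption:

```python
import numpy as np

def normalize(z: np.ndarray) -> np.ndarray:
    # Assumed form: Z'_i = Z_i / sum_j Z_j; the patent's exact
    # normalization formula is not recoverable from the text.
    return z / z.sum()
```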
Step 202 is an inventive point of the embodiments of the present disclosure and addresses the second technical problem mentioned in the background section: because MIDI files do not support audio sung by the user, when the audio information is sung audio, the corresponding score fragment cannot be determined. To determine the score fragment corresponding to the audio accurately, a first feature information set is first generated from the target audio information, yielding first feature information (e.g., log mel filter-bank features) that better fits the target audio information. Then, the first feature information in the first feature information set is input into the feature recognition model in turn to generate audio feature information, obtaining the audio feature information set. Because the target audio information does not need to be converted into a MIDI file, the problem that MIDI files do not support sung audio is avoided. Meanwhile, the feature recognition model re-extracts features from the first feature information, producing features that express the target audio information accurately.
Step 203, generating target score fragment information based on the score fragment information set in a pre-constructed score fragment library, the audio feature information set, and the score matching model included in the pre-trained audio recognition model, to obtain a target score fragment information sequence.
In some embodiments, the executing body may generate target score fragment information based on the score fragment information set in a pre-constructed score fragment library, the audio feature information set, and the score matching model included in the pre-trained audio recognition model, to obtain a target score fragment information sequence. The score fragment library may be a pre-constructed database containing a plurality of score fragment information items. The score fragment information may be information characterizing a fragment in the corresponding score. The target score fragment information may be used to characterize the fragments in the score that match the target audio information.
As an example, the score matching model may include: a first sub-network, a second sub-network, a third sub-network, and a fourth sub-network. The first sub-network may be used to re-extract features from the audio feature information in the audio feature information set. The first sub-network may include a first filter layer and a second filter layer, where the first filter layer may be a 7 × 7 convolution layer with stride 2 and the second filter layer a 3 × 3 pooling layer with stride 2. The second sub-network includes 16 first feature sub-networks and 8 second feature sub-networks; the first feature sub-network is used for local feature extraction and the second feature sub-network for global feature extraction. The first feature sub-network may include: a down-projection convolution layer, a spatial convolution layer, an up-projection convolution layer, and a residual network, where the down-projection convolution layer may be a 1 × 1 convolution layer, the spatial convolution layer a 3 × 3 convolution layer, and the up-projection convolution layer a 1 × 1 convolution layer. The residual network may be a network (e.g., a ResNet-style network) that adds a residual connection between the input and output of the first feature sub-network. The second feature sub-network may include: a multi-head self-attention network and a multi-layer perceptron. The third sub-network may be used to let the output of the first feature sub-network interact with the output of the second feature sub-network. The third sub-network includes: a third convolution layer, a first normalization layer, and a second normalization layer. The third convolution layer may be a 1 × 1 convolution layer. The first normalization layer may be a BatchNorm layer used to normalize the output of the second feature sub-network. The second normalization layer may be a LayerNorm layer used to normalize the inputs and outputs of the multi-head self-attention network. The fourth sub-network may include: a first fully-connected layer and a second fully-connected layer sharing the same weights.
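The description above pins down layer shapes but not channel widths. The PyTorch sketch below renders the first sub-network and one local-feature block under assumed channel counts; it is a reading of the text, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class FirstSubNetwork(nn.Sequential):
    """7x7 conv with stride 2 followed by 3x3 pooling with stride 2,
    as described; channel counts are assumptions."""
    def __init__(self, in_ch: int = 1, out_ch: int = 64):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=2, padding=3),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

class LocalFeatureBlock(nn.Module):
    """One first feature sub-network: 1x1 down-projection, 3x3 spatial
    convolution, 1x1 up-projection, with a residual connection."""
    def __init__(self, ch: int = 64, hidden: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, hidden, kernel_size=1),                # down-projection
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), # spatial conv
            nn.Conv2d(hidden, ch, kernel_size=1),                # up-projection
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection between input and output

x = torch.randn(1, 1, 128, 40)  # one log-mel feature map, sizes assumed
print(LocalFeatureBlock(64)(FirstSubNetwork()(x)).shape)  # torch.Size([1, 64, 32, 10])
```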
The above score matching model is an inventive point of the embodiments of the present disclosure and addresses the third technical problem mentioned in the background section: when the audio information is audio sung by the user, the standard pitches and the score fragment corresponding to the audio cannot be determined from the feature information corresponding to the audio. To address this, the executing body may first re-extract features from the audio feature information in the audio feature information set through the first sub-network, then obtain the probability value of each standard pitch corresponding to the target audio through the second, third, and fourth sub-networks. The executing body may then obtain a probability value for each score fragment through the score matching model, so the score fragment corresponding to the target audio can be determined accurately.
Optionally, the score fragment information in the score fragment information set may include: a set of pitch information. The pitch information in the pitch information set may be used to represent the pitches contained in the score fragment corresponding to the score fragment information. For example, the pitch information may be A₁.
In some optional implementations of some embodiments, the executing body generating the target score fragment information based on the score fragment information set in a pre-constructed score fragment library, the audio feature information set, and the score matching model included in the pre-trained audio recognition model, to obtain the target score fragment information sequence, may include the following steps:
firstly, sequentially inputting the audio characteristic information in the audio characteristic information set into the music score matching model to generate standard pitch information, and obtaining a standard pitch information sequence.
The standard pitch information can be used to characterize the standard pitches in the audio feature information. The standard pitch information includes a target number of standard pitches and the probability value corresponding to each standard pitch. For example, the target number may be 88.
And secondly, determining the standard pitches in the standard pitch information sequence that meet a first target condition, together with their corresponding probability values, as candidate standard pitch information to obtain a candidate standard pitch information sequence.
Wherein the first target condition is that the probability value corresponding to a standard pitch included in the standard pitch information is greater than a first target probability value. The candidate standard pitch information may be used to characterize the standard pitches whose corresponding probability values are greater than the first target probability value. For example, the first target probability value may be 50%, and the candidate standard pitch information may be [standard pitch A₁: 60%].
And thirdly, de-duplicating the standard pitches included in the candidate standard pitch information sequence to generate de-duplicated pitch information and obtain a de-duplicated pitch information set.
The de-duplicated pitch information may represent the standard pitches included in the audio corresponding to the target audio information. The de-duplicated pitch information may include: the standard pitches after de-duplication.
For example, the candidate standard pitch information sequence may be {[standard pitch A₁: 60%], [standard pitch A₁: 70%, standard pitch C₁: 60%]}, which includes the standard pitches [A₁, A₁, C₁]. The de-duplicated pitch information set obtained after de-duplication is [standard pitch A₁, standard pitch C₁].
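The thresholding and de-duplication of the second and third steps can be expressed directly; the pitch names are strings here purely for illustration:

```python
# Keep the pitches whose probability exceeds the first target probability
# value (50% in the example), then de-duplicate while preserving order.
candidates = [("A1", 0.60), ("A1", 0.70), ("C1", 0.60)]
kept = [pitch for pitch, prob in candidates if prob > 0.50]
deduped = list(dict.fromkeys(kept))
print(deduped)  # ['A1', 'C1']
```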
And fourthly, screening score segment information meeting a second target condition from the score segment information set to serve as candidate score segment information to obtain a candidate score segment information set.
Wherein the second target condition is that the score fragment information includes target pitch information. The target pitch information is pitch information in the pitch information set corresponding to a pitch that also appears in the de-duplicated pitch information set. The candidate score fragment information may be score fragment information whose pitch information set shares a pitch with the de-duplicated pitch information set.
As an example, the de-duplicated pitch information set may be [standard pitch A₁, standard pitch C₁], and the score fragment information may include the pitch information set [standard pitch A₁, standard pitch F₁]. The pitch shared by the de-duplicated pitch information set and the pitch information set is A₁, so the target pitch information is A₁, and the score fragment information corresponding to the pitch information set [standard pitch A₁, standard pitch F₁] is determined as candidate score fragment information.
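The second target condition amounts to a non-empty set intersection; the fragment names below are hypothetical:

```python
deduped_pitches = {"A1", "C1"}
score_fragments = {"fragment_1": {"A1", "F1"},
                   "fragment_2": {"D1", "E1"}}
# A fragment qualifies if it shares at least one pitch with the
# de-duplicated pitch information set.
candidates = {name: pitches for name, pitches in score_fragments.items()
              if pitches & deduped_pitches}
print(candidates)  # {'fragment_1': {'A1', 'F1'}}
```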
And fifthly, for each candidate score fragment information in the candidate score fragment information set, performing similarity matching between the pitch information included in the candidate score fragment information and the standard pitches included in the de-duplicated pitch information set to generate a matching value corresponding to the candidate score fragment information.
The matching value may characterize the similarity between the pitch information included in the candidate score fragment information and the standard pitches included in the de-duplicated pitch information set.
As an example, the matching value can be obtained by the following formula:
[Formula image in the original: computation of the matching value P.]
where P denotes the matching value, S denotes the number of standard pitches in the pitch information set, and r denotes the number of standard pitches in the pitch information.
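Since the matching formula survives only as an image, the sketch below assumes a plain overlap ratio P = r / S, built from the S and r defined above:

```python
def match_value(fragment_pitches: set, deduped_pitches: set) -> float:
    # Assumed form: P = r / S, where S is the number of standard pitches in
    # the fragment's pitch information set and r is the number of those
    # pitches that also appear in the de-duplicated pitch information set.
    s = len(fragment_pitches)
    r = len(fragment_pitches & deduped_pitches)
    return r / s if s else 0.0

print(match_value({"A1", "F1"}, {"A1", "C1"}))  # 0.5
```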
And sixthly, screening candidate music score fragment information meeting a third target condition from the candidate music score fragment information set to serve as first target music score fragment information, and obtaining a first target music score fragment information set.
The first target score fragment information may be candidate score fragment information whose matching value is greater than or equal to a target matching value; the third target condition is exactly this requirement. The target matching value may be the matching value at a first target position after the obtained matching values are sorted by size. For example, the first target position may be the 100th.
And seventhly, inputting the audio characteristic information set and the first target music score fragment information set into the music score matching model to generate second target music score fragment information to obtain a second target music score fragment information set.
The second target score fragment information in the second target score fragment information set includes a second target score fragment and its corresponding probability value. The number of second target score fragment information items in the second target score fragment information set matches the number of first target score fragment information items in the first target score fragment information set.
As an example, the second target score fragment information set may be {A: 0.8, B: 0.5, C: 0.6}.
And eighthly, screening second target music score fragment information meeting a fourth target condition from the second target music score fragment information sequence, and taking the second target music score fragment information as target music score fragment information to obtain the target music score fragment information sequence.
The fourth target condition may be that the probability value corresponding to the second target score fragment information is greater than or equal to a second target probability value. The second target probability value may be the probability value at a second target position after the probability values corresponding to the obtained score fragment information are sorted by size. For example, the second target position may be the 3rd.
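The fourth-condition screening is a top-k selection by probability; the dictionary below reuses the example set from the seventh step:

```python
second_targets = {"A": 0.8, "B": 0.5, "C": 0.6}
# Sort by probability and keep everything down to the value at the
# second target position (position 3 in the example).
ranked = sorted(second_targets.items(), key=lambda kv: kv[1], reverse=True)
target_fragments = ranked[:3]
print(target_fragments)  # [('A', 0.8), ('C', 0.6), ('B', 0.5)]
```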
Step 204, sending the target score fragment information sequence to a display terminal corresponding to the target user for display.
In some embodiments, the executing main body may send the target musical score fragment information sequence to a display terminal corresponding to the target user for display.
As an example, the display terminal described above may be a terminal having a display function.
The above embodiments of the present disclosure have the following beneficial effects: the score determination method of some embodiments of the present disclosure can accurately determine the score fragment corresponding to an audio. Specifically, the reasons a score fragment could not be determined accurately before are these: first, the specific position within the score of the fragment corresponding to the audio cannot be located, so the fragment cannot be determined accurately; second, because MIDI files do not support audio sung by the user, when the audio information is sung audio, the corresponding score fragment cannot be determined. Based on this, the score determination method of some embodiments of the present disclosure first acquires target audio information, where the target audio information is audio information to be recognized and represents audio played by a target user. Then, the target audio information is input into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set. Because the target audio information does not need to be converted into a MIDI file, the problem that MIDI files do not support audio sung by the user is avoided; meanwhile, the audio features contained in the target audio information can be extracted well. Next, target score fragment information is generated based on the score fragment information set in a pre-constructed score fragment library, the audio feature information set, and the score matching model included in the pre-trained audio recognition model, obtaining a target score fragment information sequence. In this way, the score fragment corresponding to the audio information can be matched accurately. Finally, the target score fragment information sequence is sent to a display terminal corresponding to the target user for display. The specific fragments of the score can thus be determined accurately from the audio information, improving the accuracy of score fragment determination.
With further reference to fig. 3, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a score determination apparatus, which correspond to those method embodiments shown in fig. 2, and which are particularly applicable in various electronic devices.
As shown in fig. 3, the score determining apparatus 300 of some embodiments includes: an acquisition unit 301, an information input unit 302, an information generation unit 303, and an information sending unit 304. The acquisition unit 301 is configured to acquire target audio information, which represents audio played and/or sung by a target user; the information input unit 302 is configured to input the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set; the information generation unit 303 is configured to generate target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, obtaining a target score fragment information sequence; the information sending unit 304 is configured to send the target score fragment information sequence to a display terminal corresponding to the target user for display.
It will be understood that the units described in the apparatus 300 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and advantages described above with respect to the method are also applicable to the apparatus 300 and the units included therein, and are not described herein again.
Referring now to FIG. 4, a block diagram of an electronic device 400 (such as the computing device 101 shown in FIG. 1) suitable for implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 4, electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. Various programs and data necessary for the operation of the electronic device 400 are also stored in the RAM 403. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 408 including, for example, tape, hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 illustrates an electronic device 400 having various devices, it should be understood that not all illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in fig. 4 may represent one device or multiple devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 409, or from the storage device 408, or from the ROM 402. The computer program, when executed by the processing device 401, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target audio information, where the target audio information is audio information to be recognized and represents audio played by a target user; input the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set; generate target music score fragment information based on a music score fragment information set in a pre-constructed music score fragment library, the audio feature information set, and a music score matching model included in the pre-trained audio recognition model, obtaining a target music score fragment information sequence; and send the target music score fragment information sequence to a display terminal corresponding to the target user for display.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, an information input unit, an information generation unit, and an information sending unit. In some cases, the names of these units do not limit the units themselves; for example, the information input unit may also be described as a unit that inputs the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, obtaining an audio feature information set.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description covers only preferred embodiments of the disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combinations of the above features; it also covers other technical solutions formed by arbitrarily combining the above features or their equivalents without departing from the inventive concept, for example, a solution formed by interchanging the above features with (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure.

Claims (9)

1. A score determination method, comprising:
acquiring target audio information, wherein the target audio information is audio information to be subjected to audio identification, and the target audio information represents audio played by a target user;
inputting the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information to obtain an audio feature information set;
generating target music score fragment information based on a music score fragment information set in a pre-constructed music score fragment library, the audio feature information set, and a music score matching model included in the pre-trained audio recognition model, to obtain a target music score fragment information sequence;
and sending the target music score fragment information sequence to a display terminal corresponding to the target user for display.
2. The method of claim 1, wherein generating target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, to obtain a target score fragment information sequence, comprises:
sequentially inputting the audio feature information in the audio feature information set into the score matching model to generate standard pitch information and obtain a standard pitch information sequence, wherein the standard pitch information in the standard pitch information sequence comprises a target number of standard pitches and the probability values corresponding to the standard pitches;
and determining the standard pitches in the standard pitch information sequence that meet a first target condition, together with their corresponding probability values, as candidate standard pitch information, resulting in a candidate standard pitch information sequence, wherein the first target condition is that the probability value corresponding to the standard pitch is greater than a first target probability value.
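
A minimal sketch of the claim-2 filtering step, assuming MIDI-numbered pitches, a callable stand-in for the score matching model, and an arbitrary first target probability value of 0.5; none of these specifics are fixed by the claim.

```python
from typing import Callable, List, Tuple

# One piece of standard pitch information: a target number of
# (standard pitch, probability) pairs. MIDI numbering is an assumption.
PitchInfo = List[Tuple[int, float]]

def candidate_standard_pitches(
    audio_features: List[list],
    score_matching_model: Callable[[list], PitchInfo],
    first_target_probability: float = 0.5,  # assumed threshold
) -> PitchInfo:
    # Sequentially input each feature to obtain the standard pitch
    # information sequence.
    pitch_info_sequence = [score_matching_model(f) for f in audio_features]
    # First target condition: probability strictly greater than the
    # first target probability value.
    return [
        (pitch, prob)
        for info in pitch_info_sequence
        for pitch, prob in info
        if prob > first_target_probability
    ]
```
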
3. The method of claim 2, wherein the score fragment information in the score fragment information set comprises a pitch information set; and
the generating target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, resulting in a target score fragment information sequence, further comprises:
performing de-duplication on the standard pitches included in the candidate standard pitch information sequence to generate de-duplicated pitch information, resulting in a de-duplicated pitch information set;
and screening, from the score fragment information set, score fragment information that meets a second target condition as candidate score fragment information, resulting in a candidate score fragment information set, wherein the second target condition is that the score fragment information comprises target pitch information, the target pitch information being pitch information in the pitch information set that corresponds to the same pitch as de-duplicated pitch information in the de-duplicated pitch information set.
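
The de-duplication and screening of claim 3 might look as follows. Representing each score fragment by its set of pitches, keyed by a hypothetical fragment id, is an illustrative choice, not part of the disclosure.

```python
from typing import Dict, List, Set, Tuple

def screen_candidate_fragments(
    candidate_pitch_info: List[Tuple[int, float]],
    fragment_library: Dict[str, Set[int]],  # hypothetical fragment id -> pitch set
) -> Dict[str, Set[int]]:
    # De-duplication: keep each candidate standard pitch once.
    dedup_pitches = {pitch for pitch, _prob in candidate_pitch_info}
    # Second target condition: the fragment's pitch information set
    # shares at least one pitch with the de-duplicated set.
    return {
        frag_id: pitches
        for frag_id, pitches in fragment_library.items()
        if pitches & dedup_pitches
    }
```
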
4. The method of claim 3, wherein the generating target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, resulting in a target score fragment information sequence, further comprises:
for each piece of candidate score fragment information in the candidate score fragment information set, performing similarity matching between the pitch information included in the candidate score fragment information and the standard pitches included in the de-duplicated pitch information in the de-duplicated pitch information set, so as to generate a matching value corresponding to the candidate score fragment information;
and screening, from the candidate score fragment information set, candidate score fragment information that meets a third target condition as first target score fragment information, resulting in a first target score fragment information set, wherein the third target condition is that the matching value corresponding to the candidate score fragment information is greater than or equal to a target matching value.
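
Claim 4 leaves the similarity measure open; the sketch below substitutes Jaccard similarity over pitch sets and an assumed target matching value of 0.6 purely to make the thresholding concrete.

```python
from typing import Dict, List, Set, Tuple

def first_target_fragments(
    candidate_fragments: Dict[str, Set[int]],
    dedup_pitches: Set[int],
    target_matching_value: float = 0.6,  # assumed threshold
) -> List[Tuple[str, float]]:
    selected = []
    for frag_id, pitches in candidate_fragments.items():
        # Matching value: Jaccard similarity between the fragment's
        # pitches and the de-duplicated pitches (a stand-in metric).
        union = pitches | dedup_pitches
        match_value = len(pitches & dedup_pitches) / len(union) if union else 0.0
        # Third target condition: matching value >= target matching value.
        if match_value >= target_matching_value:
            selected.append((frag_id, match_value))
    return selected
```
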
5. The method of claim 4, wherein the generating target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, resulting in a target score fragment information sequence, further comprises:
inputting the audio feature information set and the first target score fragment information set into the score matching model to generate second target score fragment information, resulting in a second target score fragment information set, wherein each piece of second target score fragment information in the second target score fragment information set comprises a second target score fragment and a probability value corresponding to the second target score fragment, and the number of pieces of second target score fragment information in the second target score fragment information set is consistent with the number of pieces of first target score fragment information in the first target score fragment information set;
and screening, from the second target score fragment information set, second target score fragment information that meets a fourth target condition as target score fragment information, resulting in the target score fragment information sequence, wherein the fourth target condition is that the probability value corresponding to the second target score fragment information is greater than or equal to a second target probability value.
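
A sketch of the final selection in claim 5, again with an assumed model interface (one probability per first target fragment) and an assumed second target probability value of 0.8.

```python
from typing import Callable, List, Tuple

def target_fragment_sequence(
    audio_features: List[list],
    first_targets: List[str],  # hypothetical first target fragment ids
    score_matching_model: Callable[[List[list], str], float],  # returns a probability
    second_target_probability: float = 0.8,  # assumed threshold
) -> List[Tuple[str, float]]:
    # Second target score fragment information set: one fragment per
    # first target fragment, paired with its probability value.
    second_targets = [
        (frag_id, score_matching_model(audio_features, frag_id))
        for frag_id in first_targets
    ]
    # Fourth target condition: probability >= second target probability.
    return [(f, p) for f, p in second_targets if p >= second_target_probability]
```
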
6. The method of claim 1, wherein the inputting the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, resulting in an audio feature information set, comprises:
generating a first feature information set according to the target audio information;
and sequentially inputting the first feature information in the first feature information set into the feature recognition model to generate audio feature information, resulting in the audio feature information set.
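
Claim 6 does not say how the first feature information set is derived; the sketch below assumes fixed-size framing of the raw samples before the per-frame model call.

```python
from typing import Callable, List, Sequence

def extract_audio_features(
    target_audio: Sequence[float],
    feature_recognition_model: Callable[[Sequence[float]], list],
    frame_size: int = 1024,  # assumed framing parameter
) -> List[list]:
    # First feature information set: fixed-size frames of the target
    # audio (one possible derivation, not mandated by the claim).
    first_features = [
        target_audio[i:i + frame_size]
        for i in range(0, len(target_audio), frame_size)
    ]
    # Sequentially input each first feature into the model.
    return [feature_recognition_model(frame) for frame in first_features]
```
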
7. A score determination apparatus, comprising:
an acquisition unit configured to acquire target audio information, wherein the target audio information is audio information to be subjected to audio recognition and represents audio played by a target user;
an information input unit configured to input the target audio information into a feature recognition model included in a pre-trained audio recognition model to generate audio feature information, resulting in an audio feature information set;
an information generation unit configured to generate target score fragment information based on a score fragment information set in a pre-constructed score fragment library, the audio feature information set, and a score matching model included in the pre-trained audio recognition model, resulting in a target score fragment information sequence;
and an information sending unit configured to send the target score fragment information sequence to a display terminal corresponding to the target user for display.
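
The four units of claim 7 can be pictured as interchangeable callables composed on one apparatus object; the dataclass below is an illustrative composition with assumed signatures, not the claimed device.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScoreDeterminationApparatus:
    acquisition_unit: Callable[[], list]                  # acquires target audio information
    information_input_unit: Callable[[list], list]        # feature recognition step
    information_generation_unit: Callable[[list], list]   # score fragment matching step
    information_sending_unit: Callable[[list], None]      # delivery to the display terminal

    def determine(self) -> None:
        audio = self.acquisition_unit()
        features = self.information_input_unit(audio)
        fragments = self.information_generation_unit(features)
        self.information_sending_unit(fragments)
```
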
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 6.
9. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 6.
CN202210372740.2A (published as CN114822457A) — Music score determination method and device, electronic equipment and computer readable medium — priority date 2022-04-11, filing date 2022-04-11, status: Pending

Priority Applications (1)

Application Number: CN202210372740.2A · Publication: CN114822457A · Priority Date: 2022-04-11 · Filing Date: 2022-04-11 · Title: Music score determination method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number: CN114822457A · Publication Date: 2022-07-29

Family ID: 82534762

Family Applications (1)

Application Number: CN202210372740.2A (Pending) · Publication: CN114822457A · Title: Music score determination method and device, electronic equipment and computer readable medium

Country Status (1): CN — CN114822457A

Cited By (1)

* Cited by examiner, † Cited by third party
CN115329105A * · Priority Date: 2022-10-12 · Publication Date: 2022-11-11 · Assignee: Hangzhou NetEase Cloud Music Technology Co., Ltd. · Title: Multimedia data matching method and device, storage medium and electronic equipment

Similar Documents

Publication Title
CN108197652B (en) Method and apparatus for generating information
CN109637525B (en) Method and apparatus for generating an on-board acoustic model
CN111798821A (en) Sound conversion method, device, readable storage medium and electronic equipment
CN114443891B (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
CN113468344B (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN114822457A (en) Music score determination method and device, electronic equipment and computer readable medium
CN114429658A (en) Face key point information acquisition method, and method and device for generating face animation
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN112382266A (en) Voice synthesis method and device, electronic equipment and storage medium
CN114970470B (en) Method and device for processing file information, electronic equipment and computer readable medium
CN111312224A (en) Training method and device of voice segmentation model and electronic equipment
CN115116469B (en) Feature representation extraction method, device, equipment, medium and program product
CN111582456B (en) Method, apparatus, device and medium for generating network model information
CN113610228A (en) Neural network model construction method and device
CN113555031A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN111292766B (en) Method, apparatus, electronic device and medium for generating voice samples
CN117636100B (en) Pre-training task model adjustment processing method and device, electronic equipment and medium
CN114093389B (en) Speech emotion recognition method and device, electronic equipment and computer readable medium
CN111583945B (en) Method, apparatus, electronic device, and computer-readable medium for processing audio
CN117316160B (en) Silent speech recognition method, silent speech recognition apparatus, electronic device, and computer-readable medium
CN115641879B (en) Music tag acquisition model training method, music tag acquisition method and device
CN113345426B (en) Voice intention recognition method and device and readable storage medium
CN111582482B (en) Method, apparatus, device and medium for generating network model information
CN110808035B (en) Method and apparatus for training hybrid language recognition models

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination