WO2024004564A1 - Acoustic analysis system, acoustic analysis method, and program - Google Patents

Acoustic analysis system, acoustic analysis method, and program

Info

Publication number
WO2024004564A1
Authority
WO
WIPO (PCT)
Prior art keywords
beat, point, beat point, points, target
Application number
PCT/JP2023/021287
Other languages
French (fr)
Japanese (ja)
Inventor
Kazuhiko Yamamoto (和彦 山本)
Original Assignee
Yamaha Corporation (ヤマハ株式会社)
Application filed by Yamaha Corporation (ヤマハ株式会社)
Publication of WO2024004564A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G3/00 Recording music in notation form, e.g. recording the mechanical operation of a musical instrument
    • G10G3/04 Recording music in notation form, e.g. recording the mechanical operation of a musical instrument using electrical means
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • The present disclosure relates to techniques for analyzing acoustic signals.
  • Patent Document 1 discloses a technique for estimating beat points of a song using a probability model such as a hidden Markov model.
  • An acoustic analysis system includes: a beat point estimation unit that estimates a plurality of first beat points by an estimation process performed on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
  • An acoustic analysis method estimates a plurality of first beat points by an estimation process performed on an acoustic signal; moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points; and estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
  • A program causes a computer system to function as: a beat point estimation unit that estimates a plurality of first beat points by an estimation process performed on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
  • FIG. 1 is a block diagram illustrating the configuration of an acoustic analysis system in a first embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of the acoustic analysis system.
  • FIG. 3 is a flowchart of the estimation process.
  • FIG. 4 is an explanatory diagram of machine learning for establishing an estimation model.
  • FIG. 5 is a schematic diagram of a confirmation image.
  • FIG. 6 is an explanatory diagram of beat point movement and the update process.
  • FIG. 7 is a flowchart of the update process.
  • FIG. 8 is a flowchart of the acoustic analysis process.
  • FIG. 9 is a block diagram illustrating the functional configuration of an acoustic analysis system in a second embodiment.
  • FIG. 1 is a block diagram illustrating the configuration of an acoustic analysis system 100 according to a first embodiment.
  • The acoustic analysis system 100 is a computer system that estimates a plurality of beat points B of a song by analyzing an acoustic signal A representing the performance sound of the song.
  • The acoustic analysis system 100 includes a control device 11, a storage device 12, a display device 13, an operating device 14, and a sound emitting device 15.
  • The acoustic analysis system 100 is realized by, for example, an information device such as a smartphone, a tablet terminal, or a personal computer. Note that the acoustic analysis system 100 may be realized not only as a single device but also as a plurality of devices configured separately from each other.
  • The control device 11 is one or more processors that control each element of the acoustic analysis system 100. Specifically, the control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), GPU (Graphics Processing Unit), SPU (Sound Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • The storage device 12 is one or more memories that store programs executed by the control device 11 and various data used by the control device 11. A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, is used as the storage device 12. Note that a portable recording medium that can be attached to and detached from the acoustic analysis system 100, or a recording medium that the control device 11 can access via a communication network (for example, cloud storage), may also be used as the storage device 12.
  • The storage device 12 stores the acoustic signal A. The acoustic signal A is a sample series representing the waveform of the performance sound of a music piece. Specifically, the acoustic signal A represents at least one of an instrumental sound and a singing sound of a song. The data format of the acoustic signal A is arbitrary. Note that the acoustic signal A may be supplied to the acoustic analysis system 100 from a signal supply device separate from the acoustic analysis system 100. The signal supply device is, for example, a playback device that supplies the acoustic signal A recorded on a recording medium to the acoustic analysis system 100, or a communication device that supplies the acoustic signal A received from a distribution device (not shown) via a communication network to the acoustic analysis system 100.
  • The display device 13 displays images under the control of the control device 11. Various display panels, such as a liquid crystal display panel or an organic EL (electroluminescence) panel, are used as the display device 13. Note that a display device 13 separate from the acoustic analysis system 100 may be connected to the acoustic analysis system 100 by wire or wirelessly.
  • The operating device 14 is an input device that accepts instructions from the user. The operating device 14 is, for example, an operator operated by the user or a touch panel that detects a touch by the user.
  • The sound emitting device 15 reproduces sound under the control of the control device 11. For example, a speaker or headphones are used as the sound emitting device 15. Note that a sound emitting device 15 separate from the acoustic analysis system 100 may be connected to the acoustic analysis system 100 by wire or wirelessly.
  • FIG. 2 is a block diagram illustrating the functional configuration of the acoustic analysis system 100.
  • The control device 11 implements a plurality of functions for processing the acoustic signal A (a beat point estimation section 21, a display control section 22, a playback control section 23, a beat point editing section 24, and an update processing section 25) by executing a program stored in the storage device 12.
  • The beat point estimation unit 21 estimates a plurality of beat points B in the song by analyzing the acoustic signal A. Specifically, the beat point estimation unit 21 generates time series data that specifies the time of each of the plurality of beat points B in the song. The beat point estimation unit 21 of the first embodiment includes a feature extraction unit 30, a first processing unit 31, and a second processing unit 32.
  • The feature extraction unit 30 calculates a feature amount F(t) of the acoustic signal A for each of a plurality of time points (hereinafter referred to as "analysis time points") t on the time axis. Each analysis time point t is a time point set on the time axis at a predetermined interval. The interval between analysis time points t is sufficiently smaller than the interval assumed between beat points B in the song.
  • The feature amount F(t) is information representing the acoustic characteristics of the acoustic signal A at the analysis time point t. For example, the feature amount F(t) at each analysis time point t is a time series of acoustic information within a predetermined period including the analysis time point t. The acoustic information is, for example, information regarding the intensity of the acoustic signal A, such as volume and amplitude. Information regarding the frequency characteristics (timbre) of the acoustic signal A may also be used as the acoustic information. Examples of information regarding frequency characteristics include MFCC (Mel-Frequency Cepstrum Coefficients), MSLS (Mel-Scale Log Spectrum), and the Constant-Q Transform (CQT). Note that a plurality of pieces of acoustic information corresponding to one analysis time point t may be used as the feature amount F(t), and the acoustic information may be a combination of multiple types of acoustic information regarding the acoustic signal A; the types of acoustic information are not limited to the above examples.
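  • As an illustration only (not part of the disclosure), the following is a minimal sketch of how such feature amounts F(t) might be computed, assuming the librosa library; the sampling rate, hop length, and choice of features are assumptions of this sketch, not values specified by the present disclosure.

```python
# Illustrative sketch only: one possible computation of feature amounts F(t).
# Assumes librosa; hop_length and the chosen features are assumptions.
import librosa
import numpy as np

def extract_features(path: str, hop_length: int = 512) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050, mono=True)
    # Intensity-related acoustic information: frame-wise RMS amplitude.
    rms = librosa.feature.rms(y=y, hop_length=hop_length)
    # Frequency-characteristic (timbre) information: MFCC and CQT magnitudes.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop_length)
    cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=hop_length))
    # One feature vector F(t) per analysis time point t (one row per frame).
    frames = min(rms.shape[1], mfcc.shape[1], cqt.shape[1])
    return np.vstack([rms[:, :frames], mfcc[:, :frames], cqt[:, :frames]]).T
```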
  • The first processing unit 31 and the second processing unit 32 estimate a plurality of beat points B from the feature amounts F(t) of the acoustic signal A.
  • FIG. 3 is a flowchart of the process S2 of estimating a plurality of beat points B (hereinafter referred to as "estimation process").
  • The estimation process S2 includes a first process S21 and a second process S22. The first processing section 31 executes the first process S21, and the second processing section 32 executes the second process S22.
  • The first process S21 generates, for each analysis time point t, a probability P(t) that the analysis time point t corresponds to a beat point B of the song. The greater the probability P(t), the more likely it is that the analysis time point t corresponds to a beat point B. The first processing unit 31 generates a time series of probabilities P(t) by repeating the first process S21 for every analysis time point t. An estimation model M is used in the first process S21.
  • The estimation model M is a statistical model that has learned the correlation between the feature amount F(t) and the probability P(t). That is, the estimation model M is a trained model that has acquired the relationship between the feature amount F(t) and the probability P(t) through training (machine learning).
  • The first processing unit 31 generates the probability P(t) by processing the feature amount F(t) of the acoustic signal A at each analysis time point t using the estimation model M. Specifically, the first processing unit 31 generates the probability P(t) by inputting input data including the feature amount F(t) to the estimation model M.
  • The estimation model M is composed of, for example, a deep neural network (DNN). Any type of deep neural network, such as a recurrent neural network (RNN) or a convolutional neural network (CNN), may be used as the estimation model M. The estimation model M may also be configured as a combination of multiple types of deep neural networks. Further, additional elements such as long short-term memory (LSTM) units or attention may be included in the estimation model M.
  • The estimation model M is realized as a combination of a program that causes the control device 11 to execute a calculation for generating the probability P(t) from the feature amount F(t), and a plurality of variables (specifically, weight values and biases) applied to that calculation. The program and the plurality of variables that realize the estimation model M are stored in the storage device 12. The numerical values of the plurality of variables that define the estimation model M are set in advance by machine learning.
  • The second process S22 in FIG. 3 estimates a plurality of beat points B in the song from the time series of probabilities P(t) generated in the first process S21. Various state transition models may be used in the second process S22. For example, the state transition model is composed of a hidden semi-Markov model (HSMM), and the plurality of beat points B are estimated by the Viterbi algorithm, which is an example of dynamic programming. For example, time points at which the probability P(t) becomes maximum are estimated as beat points B.
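  • As an illustration only, the following is a simplified stand-in for the second process S22: a dynamic-programming search, in the spirit of the Viterbi search mentioned above, that picks beat points from the probability time series P(t). The full HSMM is not reproduced; the assumed beat period (in analysis frames) and the deviation tolerance are assumptions of this sketch.

```python
# Illustrative sketch only: simplified dynamic-programming beat picking from
# the probability series P(t). Not the HSMM of the disclosure; `period` is an
# assumed beat period in analysis frames, `tol` a deviation tolerance.
import numpy as np

def pick_beats(p: np.ndarray, period: int, tol: int = 5) -> list[int]:
    n = len(p)
    score = p.astype(float)
    back = np.full(n, -1)
    for t in range(max(1, period - tol), n):
        lo = max(0, t - period - tol)
        hi = max(1, t - period + tol + 1)
        prev = lo + int(np.argmax(score[lo:hi]))  # best predecessor beat
        score[t] += score[prev]
        back[t] = prev
    start = max(0, n - period)
    t = start + int(np.argmax(score[start:]))     # best-scoring final beat
    beats = []
    while t >= 0:
        beats.append(int(t))
        t = int(back[t])
    return beats[::-1]                            # frame indices of beat points B
```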
  • FIG. 4 is an explanatory diagram of machine learning to establish the estimation model M.
  • The estimation model M is established by machine learning using a machine learning system 200 that is separate from the acoustic analysis system 100. The estimation model M is provided from the machine learning system 200 to the acoustic analysis system 100. Note that the functions of the machine learning system 200 may instead be installed in the acoustic analysis system 100.
  • A plurality of pieces of training data Z are used for the machine learning of the estimation model M. Each piece of training data Z is composed of a combination of a feature amount Fm for machine learning and a probability Pm for machine learning. The feature amount Fm is the feature amount F(t) at a specific time point of an acoustic signal Am prepared for machine learning. The acoustic signal Am is a signal recording sound radiated into an acoustic space, or a signal synthesized by known sound synthesis processing. The probability Pm for machine learning corresponding to a specific time point is the probability (that is, the correct value) that the time point corresponds to a beat point B of the song. The plurality of pieces of training data Z are prepared for a large number of songs whose beat points B are known. Note that the acoustic signal Am is an example of a "learning acoustic signal."
  • The machine learning system 200 calculates an error function representing the error between the probability P(t) that an initial or provisional model (hereinafter referred to as the "provisional model") M0 outputs when the feature amount Fm of each piece of training data Z is input, and the probability Pm of that training data Z. The machine learning system 200 then updates the plurality of variables of the provisional model M0 so that the error function is reduced. The provisional model M0 obtained when the above process has been repeated for each of the plurality of pieces of training data Z is determined as the estimation model M. Accordingly, the estimation model M outputs a statistically valid probability P(t) for an unknown feature amount F(t). That is, the estimation model M is a trained model that has learned the relationship between the feature amount Fm of the acoustic signal Am for machine learning and the probability Pm that the time point at which the feature amount Fm is observed corresponds to a beat point B.
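  • As an illustration only, the following is a minimal sketch of such a training step, assuming PyTorch; the network architecture, the binary cross-entropy error function, and the hyperparameters are assumptions of this sketch, not elements specified by the present disclosure.

```python
# Illustrative sketch only: one training step of a provisional model M0 so
# that the error between its output P(t) and the reference probability Pm
# shrinks. Assumes PyTorch; architecture and hyperparameters are assumptions.
import torch
import torch.nn as nn

model_m0 = nn.Sequential(              # provisional model M0
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),   # outputs a probability P(t)
)
optimizer = torch.optim.Adam(model_m0.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                 # error function between P(t) and Pm

def training_step(fm: torch.Tensor, pm: torch.Tensor) -> float:
    """One update from a batch of training data Z = (Fm, Pm)."""
    optimizer.zero_grad()
    p = model_m0(fm).squeeze(-1)       # P(t) for each feature amount Fm
    loss = loss_fn(p, pm)              # error between P(t) and Pm
    loss.backward()
    optimizer.step()                   # update variables to reduce the error
    return loss.item()
```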
  • The first processing unit 31 processes the feature amount F(t) at each analysis time point t using the estimation model M established by the above procedure, thereby generating the probability P(t) that the analysis time point t corresponds to a beat point B of the song.
  • As described above, in the first embodiment, the plurality of beat points B are estimated from the acoustic signal A using the estimation model M, which has learned the relationship between the feature amount Fm of the acoustic signal Am for machine learning and the probability Pm that the time point at which the feature amount Fm is observed corresponds to a beat point B. Therefore, a plurality of beat points B can be estimated with high accuracy even for an unknown acoustic signal A in which the feature amount F(t) varies in various ways.
  • The display control unit 22 in FIG. 2 displays images on the display device 13. Specifically, the display control unit 22 displays the confirmation image G of FIG. 5 on the display device 13.
  • The confirmation image G includes a waveform area Ga and a beat point area Gb. A common time axis is set for the waveform area Ga and the beat point area Gb. In the waveform area Ga, the waveform within a specific range (hereinafter referred to as the "display range") of the acoustic signal A is displayed. The display control unit 22 changes the display range of the acoustic signal A according to the user's instruction to the operating device 14.
  • In the beat point area Gb, the plurality of beat points B estimated from the acoustic signal A by the beat point estimation unit 21 are displayed. Specifically, the plurality of beat points B within the display range of the acoustic signal A are displayed in the beat point area Gb. Note that the beat point area Gb is an example of a "beat point image."
  • The user can instruct reproduction of the acoustic signal A by operating the operating device 14. The reproduction control unit 23 in FIG. 2 reproduces the sound represented by the acoustic signal A by supplying the acoustic signal A to the sound emitting device 15. As illustrated in FIG. 5, the display control unit 22 displays a reproduction position Gc on the confirmation image G in parallel with the reproduction of the acoustic signal A. The reproduction position Gc is the time point at which the acoustic signal A is currently being reproduced by the sound emitting device 15. Accordingly, the reproduction position Gc advances in the direction of the time axis in parallel with the reproduction of the acoustic signal A.
  • The user can confirm the positions of the beat points B estimated by the immediately preceding estimation process S2 by viewing the beat point area Gb while listening to the sound reproduced by the sound emitting device 15. If the current position of a beat point B does not match the user's intention, the user can instruct correction of the estimated position of the beat point B by operating the operating device 14.
  • The beat point editing unit 24 in FIG. 2 moves each beat point B on the time axis according to instructions from the user. Moving a beat point B means changing the position of the beat point B on the time axis.
  • FIG. 6 is an explanatory diagram regarding the movement of the beat point B.
  • State 1 in FIG. 6 is a state in which a plurality of beat points B have been estimated by the estimation process S2 described above.
  • FIG. 6 also shows a time series of the probability P(t) calculated in the first process S21.
  • The user can select any one of the plurality of beat points B displayed in the beat point area Gb (hereinafter referred to as the "target beat point Bn") by operating the operating device 14 while checking the beat point area Gb. Further, the user can instruct movement of the target beat point Bn on the time axis by operating the operating device 14 while checking the beat point area Gb. Specifically, the user can instruct the movement direction (forward/backward) and the movement amount δ of the target beat point Bn. For example, the user instructs that the target beat point Bn be moved to a position the user deems appropriate. As illustrated as state 2 in FIG. 6, the beat point editing unit 24 moves the target beat point Bn on the time axis in the direction (forward/backward) specified by the user, by the movement amount δ specified by the user. Although FIG. 6 illustrates a case in which the target beat point Bn moves forward, the target beat point Bn may also move backward.
  • In conjunction with the target beat point Bn, the beat point editing unit 24 also moves, on the time axis, the beat point B located immediately before the target beat point Bn among the plurality of beat points B (hereinafter referred to as the "adjacent beat point Bn-1") and the beat point B located immediately after the target beat point Bn (hereinafter referred to as the "adjacent beat point Bn+1"). Specifically, the beat point editing unit 24 moves the adjacent beat point Bn-1 and the adjacent beat point Bn+1 on the time axis in the movement direction (forward/backward) specified by the user for the target beat point Bn, by the movement amount δ specified by the user for the target beat point Bn.
  • As described above, the beat point area Gb displayed on the display device 13 includes the target beat point Bn and the adjacent beat points Bn±1. The display control section 22 causes the movement of each beat point B by the beat point editing section 24 to be reflected in the beat point area Gb displayed on the display device 13. Specifically, the display control unit 22 moves the target beat point Bn and each adjacent beat point Bn±1 within the beat point area Gb on the time axis according to the instruction from the user.
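  • As an illustration only, the following is a minimal sketch of the linked movement described above; beat times in seconds, the function name, and the sign convention (negative = forward/earlier, positive = backward/later) are assumptions of this sketch.

```python
# Illustrative sketch only: move the target beat point Bn and its adjacent
# beat points Bn-1 and Bn+1 together by the same movement amount delta.
def move_target_and_neighbors(beats: list[float], n: int, delta: float) -> list[float]:
    """Return a copy of `beats` (in seconds) with beats n-1, n, n+1 shifted."""
    edited = list(beats)
    for i in (n - 1, n, n + 1):
        if 0 <= i < len(edited):       # skip neighbors that do not exist
            edited[i] += delta
    return edited

# Example: move the 4th beat and its neighbors 30 ms earlier.
# edited = move_target_and_neighbors(beats, 3, -0.030)
```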
  • The update processing unit 25 in FIG. 2 updates the estimation model M according to the movement of the target beat point Bn and each adjacent beat point Bn±1. Specifically, the update processing unit 25 updates the estimation model M by machine learning that reflects the movement of the target beat point Bn and each adjacent beat point Bn±1.
  • FIG. 7 is a flowchart of the process S8 (hereinafter referred to as the "update process") in which the control device 11 (update processing unit 25) updates the estimation model M. The update process S8 is started upon the movement of the target beat point Bn and each adjacent beat point Bn±1.
  • The update processing unit 25 sets, on the time axis, a numerical value string C corresponding to the moved target beat point Bn and each adjacent beat point Bn±1 (S81). As illustrated as state 4 in FIG. 6, the numerical value string C is a time series of numerical values Q(t) set for each analysis time point t on the time axis. The numerical value string C includes numerical distributions D corresponding to the moved target beat point Bn and each adjacent beat point Bn±1.
  • A numerical distribution D is a distribution of numerical values Q(t) over a specific range on the time axis. For example, the numerical distribution D is expressed by a probability distribution function defined with the time t on the time axis as a variable. Specifically, the numerical distribution D in the first embodiment is a line-symmetric triangular distribution over a predetermined distribution width. A numerical distribution D is set individually for each beat point B, and its position on the time axis is determined so that the distribution takes its maximum value at that beat point B. That is, the numerical distribution D corresponding to the target beat point Bn takes its maximum value at the target beat point Bn, and the numerical distribution D corresponding to each adjacent beat point Bn±1 takes its maximum value at that adjacent beat point Bn±1. The numerical value Q(t) at each analysis time point t outside the numerical distributions D is set to zero.
  • The update processing unit 25 calculates, for each analysis time point t within an applicable interval T on the time axis, the error e(t) between the probability P(t) and the numerical value Q(t) (S82). The applicable interval T is a continuous interval including the adjacent beat point Bn-1 and the adjacent beat point Bn+1. Specifically, the period on the time axis whose end points are the adjacent beat point Bn-1 and the adjacent beat point Bn+1 is set as the applicable interval T.
  • The update processing unit 25 calculates an error function E from the plurality of errors e(t) calculated for the different analysis time points t within the applicable interval T (S83). The error function E is an objective function representing the difference between the probabilities P(t) and the numerical values Q(t) within the applicable interval T. For example, the sum of the errors e(t) within the applicable interval T is calculated as the error function E. The update processing unit 25 then updates the estimation model M so that the error function E is minimized (S84). For updating the estimation model M, any known technique may be adopted. For example, adaptive processing using Self-Attention may be employed to update the estimation model M. Such adaptive processing of the estimation model M is described in, for example, Kazuhiko Yamamoto, "HUMAN-IN-THE-LOOP ADAPTATION FOR INTERACTIVE MUSICAL BEAT TRACKING," Proceedings of the 22nd ISMIR Conference, Online, November 7-12, 2021.
  • As described above, the update processing unit 25 updates the estimation model M so that the error e(t) between the numerical distributions D (numerical values Q(t)) corresponding to the moved target beat point Bn and each adjacent beat point Bn±1, and the time series of probabilities P(t) estimated in the preceding estimation process S2 (first process S21), is reduced. Therefore, the movement of the target beat point Bn and each adjacent beat point Bn±1 can be appropriately reflected in the estimation model M.
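  • As an illustration only, the following sketches the update process S8 under simplifying assumptions: triangular numerical distributions D peaking at the moved beat points, squared errors e(t) within the applicable interval T, and a plain gradient step on a PyTorch model (the disclosure cites a Self-Attention-based adaptation as one concrete technique). The frame indexing, distribution width, and learning rate are assumptions of this sketch.

```python
# Illustrative sketch only: update step S8. Q(t) is a numerical string C with
# triangular distributions D peaking at the moved beat points; the model is
# updated so the error between P(t) and Q(t) within [t_start, t_end) shrinks.
import torch

def triangular_targets(n_frames: int, beat_frames: list[int], width: int = 8) -> torch.Tensor:
    """Numerical string C: Q(t) is a triangle of half-width `width` per beat."""
    q = torch.zeros(n_frames)
    for b in beat_frames:
        for t in range(max(0, b - width), min(n_frames, b + width + 1)):
            q[t] = max(float(q[t]), 1.0 - abs(t - b) / width)
    return q

def update_step(model, features, moved_beat_frames, t_start, t_end, lr=1e-4):
    """One adaptation step within the applicable interval T = [t_start, t_end)."""
    q = triangular_targets(features.shape[0], moved_beat_frames)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    p = model(features).squeeze(-1)                    # probabilities P(t)
    err = (p[t_start:t_end] - q[t_start:t_end]) ** 2   # errors e(t)
    err.sum().backward()                               # error function E
    opt.step()                                         # reduce E
```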
  • FIG. 8 is a flowchart of the process (hereinafter referred to as "acoustic analysis process") executed by the control device 11. For example, the acoustic analysis process is started in response to a user's instruction to the operating device 14.
  • When the acoustic analysis process is started, the control device 11 (feature extraction unit 30) calculates the feature amount F(t) of the acoustic signal A for each analysis time point t on the time axis (S1). The control device 11 (beat point estimation unit 21) then estimates a plurality of beat points B from the feature amounts F(t) of the acoustic signal A by the estimation process S2 illustrated in FIG. 3. In the estimation process S2, the estimation model M, which has learned the relationship between the feature amount F(t) and the probability P(t) by machine learning, is used.
  • The control device 11 (display control section 22) displays the confirmation image G on the display device 13 (S3). In the beat point area Gb of the confirmation image G, the plurality of beat points B estimated by the estimation process S2 are displayed.
  • The control device 11 determines whether a termination condition is satisfied (S4). The termination condition is, for example, that the user instructs termination of the acoustic analysis process by operating the operating device 14. If the termination condition is satisfied (S4: YES), the control device 11 terminates the acoustic analysis process. If the termination condition is not satisfied (S4: NO), the control device 11 (beat point editing unit 24) determines whether an instruction to move a target beat point Bn has been received from the user (S5). If movement of the target beat point Bn is not instructed (S5: NO), the control device 11 returns the process to step S4. That is, the control device 11 waits for either an instruction to terminate the acoustic analysis process or an instruction to move a target beat point Bn.
  • When receiving an instruction to move the target beat point Bn (S5: YES), the control device 11 (beat point editing unit 24) moves the target beat point Bn and the adjacent beat points Bn±1 before and after it on the time axis according to the instruction from the user (S6). Further, the control device 11 (display control unit 22) moves the target beat point Bn and each adjacent beat point Bn±1 within the beat point area Gb according to the instruction from the user (S7). The control device 11 (update processing unit 25) then updates the estimation model M by the update process S8 illustrated in FIG. 7.
  • After the update process S8, the control device 11 returns the process to the estimation process S2. That is, the control device 11 (beat point estimation unit 21) estimates a plurality of beat points B by executing the estimation process S2 on the acoustic signal A using the updated estimation model M. Note that the feature amounts F(t) calculated immediately after the start of the acoustic analysis process are applied to the second and subsequent estimation processes S2.
  • As described above, the update process S8 of the estimation model M and the estimation process S2 using the updated estimation model M are repeated each time a target beat point Bn is moved. Therefore, each time the estimation process S2 is repeated, the positions of the beat points B estimated by the estimation process S2 approach positions in which the user's instructions are reflected.
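  • As an illustration only, the following sketches the overall control flow of FIG. 8. To keep the sketch self-contained, the estimation, display, instruction, and update steps are injected as callables, and the linked movement reuses move_target_and_neighbors from the earlier sketch; all names are assumptions of this sketch.

```python
# Illustrative sketch only: control flow of the acoustic analysis process
# (FIG. 8). `estimate` plays the role of S2, `show` of S3/S7, and
# `apply_update` of S8; `next_instruction` returns None when S4 is satisfied.
from typing import Callable, Optional, Tuple

def acoustic_analysis_loop(
    estimate: Callable[[], list],                                 # S2
    show: Callable[[list], None],                                 # S3/S7
    next_instruction: Callable[[], Optional[Tuple[int, float]]],  # S4/S5
    apply_update: Callable[[list], None],                         # S8
) -> list:
    beats = estimate()                 # S2: estimate beat points B
    show(beats)                        # S3: display confirmation image G
    while True:
        instruction = next_instruction()
        if instruction is None:        # S4: termination condition satisfied
            return beats
        n, delta = instruction         # S5: move instruction (target, amount)
        beats = move_target_and_neighbors(beats, n, delta)  # S6
        show(beats)                    # S7: reflect the movement on screen
        apply_update(beats)            # S8: update the estimation model M
        beats = estimate()             # S2 again, with the updated model
        show(beats)
```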
  • Note that the beat points B estimated by any one estimation process S2 are an example of the "first beat points," and the beat points B estimated by the next estimation process S2 after the estimation model M is updated are an example of the "second beat points."
  • As described above, in the first embodiment, the estimation model M is updated according to the movement of the target beat point Bn selected by the user and the adjacent beat points Bn±1 around the target beat point Bn, and the plurality of beat points B are re-estimated by the estimation process S2 to which the updated estimation model M is applied. That is, in updating the estimation model M, not only the movement of the target beat point Bn but also the temporal relationship between the target beat point Bn and each adjacent beat point Bn±1 is reflected in the estimation model M. Therefore, compared with a configuration in which only the movement of the target beat point Bn is reflected in the estimation model M (hereinafter referred to as the "comparative example"), beat points B that appropriately match the user's intention can be estimated.
  • In the comparative example, for instance, moving only the target beat point Bn gives the estimation model M a tendency (ritardando) for the performance speed to decrease after the target beat point Bn. However, when the target beat point Bn is moved, it is more likely that the user intends to correct the beat points B throughout the song than that the user intends to change the performance speed locally. In the first embodiment, because the adjacent beat points Bn±1 are moved together with the target beat point Bn, this problem is resolved. That is, compared with the comparative example, beat points B that appropriately match the user's intention can be estimated, and the user can be provided with the experience of beat points B being estimated in a manner that appropriately reflects the user's intention.
  • Further, in the first embodiment, the beat point area Gb is displayed on the display device 13, so the user can visually check how the target beat point Bn and each adjacent beat point Bn±1 move in accordance with the user's instructions. Therefore, the user can instruct the movement of the target beat point Bn and each adjacent beat point Bn±1 while anticipating the beat points B that will be estimated by the updated estimation model M.
  • FIG. 9 is a block diagram illustrating the functional configuration of the acoustic analysis system 100 in the second embodiment.
  • The control device 11 of the second embodiment functions as a section setting section 26 in addition to the same elements as in the first embodiment (the beat point estimation section 21, display control section 22, playback control section 23, beat point editing section 24, and update processing section 25).
  • The section setting unit 26 sets a partial section (hereinafter referred to as a "specific section") of the acoustic signal A on the time axis. Specifically, the section setting unit 26 sets the specific section according to an instruction from the user. For example, by operating the operating device 14, the user can designate any section of the acoustic signal A displayed in the waveform area Ga. The section setting unit 26 sets the section designated by the user as the specific section.
  • The control device 11 of the second embodiment executes the acoustic analysis process of FIG. 8 on the specific section of the acoustic signal A. That is, the estimation process S2 by the beat point estimation unit 21 is executed only for the specific section, so the plurality of beat points B are estimated within the specific section of the song.
  • The second embodiment achieves the same effects as the first embodiment. Further, in the second embodiment, beat points B can be estimated selectively for a partial section (the specific section) of the acoustic signal A.
  • Note that the section setting unit 26 may set the specific section according to a predetermined rule, without requiring an instruction from the user. For example, the section setting unit 26 may set any one of a plurality of structural sections of the song represented by the acoustic signal A as the specific section. A structural section is a section into which a piece of music is divided on the time axis according to musical meaning; the structural sections are, for example, sections such as an intro, a verse, a bridge, a chorus, and an outro. For example, the section setting unit 26 divides the acoustic signal A into a plurality of structural sections by analyzing the acoustic signal A, and sets a specific structural section among the plurality of structural sections as the specific section. With the above configuration, beat points B can be estimated selectively for a specific structural section.
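  • As an illustration only, the following is a minimal sketch of restricting the estimation to a specific section; the frame rate, the section boundaries in seconds, and the injected `estimate` callable are assumptions of this sketch.

```python
# Illustrative sketch only: run the estimation process S2 on a specific
# section [start_sec, end_sec) of the signal and map the resulting beat
# frames back to absolute times.
import numpy as np

def estimate_in_section(features: np.ndarray, frame_rate: float,
                        start_sec: float, end_sec: float, estimate) -> list[float]:
    lo = int(start_sec * frame_rate)              # first frame of the section
    hi = min(len(features), int(end_sec * frame_rate))
    section_beats = estimate(features[lo:hi])     # beat frames within the section
    return [(lo + b) / frame_rate for b in section_beats]  # absolute seconds
```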
  • In each of the above embodiments, the adjacent beat point Bn-1 immediately before the target beat point Bn and the adjacent beat point Bn+1 immediately after it are moved together with the target beat point Bn; however, a mode in which only one of the adjacent beat point Bn-1 and the adjacent beat point Bn+1 is moved together with the target beat point Bn is also assumed.
  • For example, the beat point editing unit 24 may move only the target beat point Bn and the immediately preceding adjacent beat point Bn-1 on the time axis according to the instruction from the user, and the update processing unit 25 may calculate the errors e(t) within an applicable interval T between the adjacent beat point Bn-1 and the target beat point Bn. Similarly, the beat point editing unit 24 may move only the target beat point Bn and the immediately following adjacent beat point Bn+1 on the time axis according to the instruction from the user, and the update processing unit 25 may calculate the errors e(t) within an applicable interval T between the target beat point Bn and the adjacent beat point Bn+1.
  • As understood from the above examples, the beat point editing unit 24 is expressed as an element that moves, on the time axis, one or more adjacent beat points Bn±1 located around the target beat point Bn among the plurality of beat points B.
  • In each of the above embodiments, a triangular distribution was exemplified as the numerical distribution D corresponding to the moved target beat point Bn and each adjacent beat point Bn±1, but the type and shape of the numerical distribution D are not limited to the above example. For example, a probability distribution such as a normal distribution, or a pulse-like distribution, may also be employed as the numerical distribution D.
  • The type of feature amount F(t) that the feature extraction unit 30 calculates from the acoustic signal A is not limited to the examples in the above embodiments. For example, a time series of a predetermined number of samples constituting the acoustic signal A may be applied to the estimation process S2 as the feature amount F(t). In this configuration, the feature extraction unit 30 extracts a time series of samples from the acoustic signal A; however, from the viewpoint that part of the acoustic signal A itself is applied to the estimation process S2, this configuration can also be interpreted as a form in which the feature extraction unit 30 is omitted.
  • In each of the above embodiments, the target beat point Bn and each adjacent beat point Bn±1 were moved according to the movement direction (forward/backward) and the movement amount δ specified by the user, but the method by which the user instructs the movement of the target beat point Bn and each adjacent beat point Bn±1 is not limited to the above example. For example, the beat point editing unit 24 may move the target beat point Bn and each adjacent beat point Bn±1 according to a sign (±) and a numerical value input by the user. When a negative number is input, the beat point editing unit 24 moves the target beat point Bn and each adjacent beat point Bn±1 forward on the time axis by a movement amount δ corresponding to the absolute value of the negative number. When a positive number is input, the beat point editing unit 24 moves the target beat point Bn and each adjacent beat point Bn±1 backward on the time axis by a movement amount δ corresponding to the positive number.
  • The beat point editing unit 24 may also move the target beat point Bn and each adjacent beat point Bn±1 on the time axis by a predetermined unit amount for each instruction from the user. For example, each time a movement instruction is received from the user, the beat point editing unit 24 moves the target beat point Bn and each adjacent beat point Bn±1 by the unit amount in the direction (forward/backward) specified by the user. Accordingly, the target beat point Bn and each adjacent beat point Bn±1 move on the time axis by a movement amount δ corresponding to the product of the predetermined unit amount and the number of movement instructions.
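  • As an illustration only, the following sketches the two instruction styles described above, reusing move_target_and_neighbors from the earlier sketch; the unit amount of 10 ms and all names are assumptions of this sketch.

```python
# Illustrative sketch only: compute the movement amount delta either from a
# signed value (sign = direction, magnitude = amount) or from repeated
# unit-amount instructions, then shift the target beat and its neighbors.
from typing import Optional

UNIT_AMOUNT = 0.010  # assumed unit amount: 10 ms per instruction

def apply_move_instruction(beats: list[float], n: int, sign: int,
                           magnitude: Optional[float] = None,
                           presses: int = 1) -> list[float]:
    """sign: -1 = forward (earlier), +1 = backward (later)."""
    amount = magnitude if magnitude is not None else UNIT_AMOUNT * presses
    return move_target_and_neighbors(beats, n, sign * amount)
```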
  • As understood from the above examples, moving a beat point in response to an instruction from the user means that the conditions of the movement (for example, the direction and amount of movement) are based on the instruction from the user; the method of instruction and the matters instructed by the user are arbitrary in the present disclosure. Moving a beat point means changing the position of the beat point on the time axis.
  • In each of the above embodiments, a deep neural network is illustrated as the estimation model M, but the configuration of the estimation model M is not limited to the above examples. For example, a statistical model such as a hidden Markov model (HMM) or a support vector machine (SVM) may also be used as the estimation model M.
  • In each of the above embodiments, the estimation model M is updated by the update process S8. Since the estimation model M is applied to the estimation process S2, the update process S8 can also be expressed as a process of updating the estimation process S2.
  • The acoustic analysis system 100 may also be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, the acoustic analysis system 100 estimates a plurality of beat points B by analyzing an acoustic signal A received from the information device, and transmits data representing the plurality of beat points B to the information device.
  • A program according to the present disclosure may be provided in a form stored in a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but recording media of any known form, such as a semiconductor recording medium or a magnetic recording medium, are also included. A non-transitory recording medium includes any recording medium excluding transitory, propagating signals, and does not exclude volatile recording media. In a form in which the program is provided from a distribution device, a storage device that stores the program in the distribution device corresponds to the above non-transitory recording medium.
  • An acoustic analysis system according to one aspect of the present disclosure includes: a beat point estimation unit that estimates a plurality of first beat points by an estimation process performed on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
  • In the above aspect, the estimation process is updated according to the movement on the time axis of the target beat point selected by the user and the one or more adjacent beat points located around the target beat point, and the plurality of second beat points are estimated by the updated estimation process.
  • The acoustic analysis system may also be expressed as an acoustic analysis device. It does not matter whether the "acoustic analysis system" or the "acoustic analysis device" is composed of a single device or a plurality of mutually separate devices.
  • The "estimation process" is a process of estimating a plurality of beat points (first beat points/second beat points) from the acoustic signal. An example of the "estimation process" is a process that uses an estimation model that has learned the relationship between the feature amount of a learning acoustic signal and the probability that the time point at which the feature amount is observed corresponds to a beat point. Specifically, by processing the feature amount at a specific time point of the acoustic signal to be processed using the estimation model, the probability that the time point corresponds to a beat point is output.
  • "Updating the estimation process" means updating elements applied to the estimation process. For example, assuming an estimation process that uses an estimation model, machine learning that updates the variables defining the estimation model corresponds to "updating the estimation process."
  • In one aspect, the one or more adjacent beat points include a first beat point located immediately before the target beat point among the plurality of first beat points, and a first beat point located immediately after the target beat point among the plurality of first beat points.
  • In one aspect, the acoustic analysis system further includes a display control unit 22 that displays, on the display device 13, a beat point image representing the target beat point and the one or more adjacent beat points, and that moves the target beat point and the one or more adjacent beat points included in the beat point image in accordance with the instruction from the user. In the above aspect, the user can visually confirm how the target beat point and the one or more adjacent beat points move according to the instruction from the user. Therefore, the user can instruct the movement of the target beat point and the one or more adjacent beat points while anticipating the second beat points that will be estimated by the updated estimation process.
  • In one aspect, the estimation process includes: a first process of generating, for each time point of the acoustic signal, the probability that the time point corresponds to a beat point, by processing the feature amount at that time point using an estimation model that has learned the relationship between the feature amount of a learning acoustic signal and the probability that the time point at which the feature amount is observed corresponds to a beat point; and a second process of identifying the plurality of first beat points from the time series of the probabilities generated by the first process. In the above aspect, a plurality of beat points are estimated from the acoustic signal using the estimation model, which has learned the relationship between the feature amount of the learning acoustic signal and the probability that the time point of the feature amount corresponds to a beat point. Therefore, a plurality of beat points (first beat points/second beat points) can be estimated with high accuracy even for an unknown acoustic signal in which the feature amount varies in various ways.
  • In one aspect, the update processing unit updates the estimation model so that the error between a numerical distribution set on the time axis corresponding to the moved target beat point and the one or more adjacent beat points, and the time series of probabilities estimated by the first process, is reduced. In the above aspect, because the estimation model is updated so that this error is reduced, the movement of the target beat point and the adjacent beat points can be appropriately reflected in the estimation model.
  • A "numerical distribution" is a distribution of numerical values on the time axis. The type and shape of the numerical distribution are arbitrary; for example, a triangular distribution, a normal distribution, or a pulse-like distribution is exemplified as the "numerical distribution."
  • "Set corresponding to a (target/adjacent) beat point" means that the position of the beat point on the time axis and the position of the numerical distribution on the time axis correspond to each other. That is, as the position of the beat point on the time axis changes, the position of the numerical distribution on the time axis also changes. For example, a relationship in which the maximum point of the numerical distribution coincides with the beat point is a typical example of being "set corresponding to a beat point."
  • In one aspect, the acoustic analysis system further includes a section setting unit 26 that sets a specific section that is a partial section of the acoustic signal on the time axis, and the estimation process by the beat point estimation unit is executed for the specific section. In the above aspect, the second beat points can be estimated selectively for a partial section of the acoustic signal.
  • The "specific section" is an arbitrary partial section of the acoustic signal on the time axis. A section designated by the user is an example of the "specific section." Beat points may also be estimated using any one of a plurality of structural sections of the music represented by the acoustic signal as the "specific section." A structural section is a section into which a piece of music is divided on the time axis according to musical meaning; the structural sections are, for example, sections such as an intro, a verse, a bridge, a chorus, and an outro.
  • An acoustic analysis method according to one aspect of the present disclosure estimates a plurality of first beat points by an estimation process performed on an acoustic signal; moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points; and estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal. Each aspect illustrated for the acoustic analysis system applies similarly to the acoustic analysis method according to the present disclosure.
  • A program according to one aspect of the present disclosure causes a computer system to function as: a beat point estimation unit that estimates a plurality of first beat points by an estimation process performed on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal. Each aspect illustrated for the acoustic analysis system applies similarly to the program according to the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

This acoustic analysis system 100 comprises: a beat point estimation unit 21 for estimating a plurality of beat points B by an estimation process performed on an acoustic signal A; a beat point editing unit 24 for moving, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of beat points B and one or more adjacent beat points located around the target beat point from among the plurality of beat points B; and an update processing unit 25 for updating the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit 21 re-estimates the plurality of beat points B by executing the updated estimation process on the acoustic signal A.

Description

Acoustic analysis system, acoustic analysis method, and program
 The present disclosure relates to techniques for analyzing acoustic signals.
 Analysis techniques that estimate the beats of a song by analyzing an acoustic signal representing the performance sound of the song have been proposed in the past. For example, Patent Document 1 discloses a technique for estimating the beat points of a song using a probability model such as a hidden Markov model.
Japanese Patent Application Publication No. 2015-114361
 In conventional techniques for estimating the beat points of a song, there is a possibility that, for example, the backbeats of the song are incorrectly estimated as beat points, or that beat points corresponding to twice the original tempo of the song are incorrectly estimated. There is also a possibility that the beat point estimation result does not match the user's intention, as in the case where the backbeats of a song are estimated in a situation where the user expects the downbeats to be estimated. In view of the above circumstances, a configuration that allows the user to change the positions on the time axis of the plurality of beat points estimated from the acoustic signal is important. In consideration of the above circumstances, one aspect of the present disclosure aims to estimate beat points that appropriately match the user's intention.
 In order to solve the above problems, an acoustic analysis system according to one aspect of the present disclosure includes: a beat point estimation unit that estimates a plurality of first beat points by an estimation process performed on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from the user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
 An acoustic analysis method according to one aspect of the present disclosure estimates a plurality of first beat points by an estimation process performed on an acoustic signal; moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points; and estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
 A program according to one aspect of the present disclosure causes a computer system to function as: a beat point estimation unit that estimates a plurality of first beat points by an estimation process performed on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from the user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points. The beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
FIG. 1 is a block diagram illustrating the configuration of an acoustic analysis system according to a first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the acoustic analysis system.
FIG. 3 is a flowchart of an estimation process.
FIG. 4 is an explanatory diagram of machine learning for establishing an estimation model.
FIG. 5 is a schematic diagram of a confirmation image.
FIG. 6 is an explanatory diagram of beat point movement and an update process.
FIG. 7 is a flowchart of the update process.
FIG. 8 is a flowchart of an acoustic analysis process.
FIG. 9 is a block diagram illustrating the functional configuration of an acoustic analysis system according to a second embodiment.
A: First Embodiment
FIG. 1 is a block diagram illustrating the configuration of an acoustic analysis system 100 according to the first embodiment. The acoustic analysis system 100 is a computer system that estimates a plurality of beat points B of a song by analyzing an acoustic signal A representing the performance sound of that song.
The acoustic analysis system 100 includes a control device 11, a storage device 12, a display device 13, an operating device 14, and a sound emitting device 15. The acoustic analysis system 100 is realized by an information device such as a smartphone, a tablet terminal, or a personal computer. Note that the acoustic analysis system 100 may be realized not only as a single device but also as a plurality of devices configured separately from one another.
The control device 11 is one or more processors that control each element of the acoustic analysis system 100. Specifically, the control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
The storage device 12 is one or more memories that store a program executed by the control device 11 and various data used by the control device 11. For example, a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, is used as the storage device 12. Note that, for example, a portable recording medium attachable to and detachable from the acoustic analysis system 100, or a recording medium accessible by the control device 11 via a communication network (for example, cloud storage), may be used as the storage device 12.
The storage device 12 stores an acoustic signal A. The acoustic signal A is a sample series representing the waveform of the performance sound of a song. Specifically, the acoustic signal A represents at least one of an instrumental sound and a singing sound of the song. The data format of the acoustic signal A is arbitrary. Note that the acoustic signal A may be supplied to the acoustic analysis system 100 from a signal supply device separate from the acoustic analysis system 100. The signal supply device is, for example, a playback device that supplies the acoustic signal A recorded on a recording medium to the acoustic analysis system 100, or a communication device that supplies to the acoustic analysis system 100 the acoustic signal A received from a distribution device (not shown) via a communication network.
The display device 13 displays images under the control of the control device 11. Various display panels, such as a liquid crystal display panel or an organic EL (Electroluminescence) panel, are used as the display device 13. Note that a display device 13 separate from the acoustic analysis system 100 may be connected to the acoustic analysis system 100 by wire or wirelessly. The operating device 14 is an input device that accepts instructions from the user. The operating device 14 is, for example, a set of controls operated by the user, or a touch panel that detects contact by the user.
The sound emitting device 15 reproduces sound under the control of the control device 11. For example, a speaker or headphones are used as the sound emitting device 15. Note that a sound emitting device 15 separate from the acoustic analysis system 100 may be connected to the acoustic analysis system 100 by wire or wirelessly.
FIG. 2 is a block diagram illustrating the functional configuration of the acoustic analysis system 100. By executing the program stored in the storage device 12, the control device 11 realizes a plurality of functions for processing the acoustic signal A (a beat point estimation unit 21, a display control unit 22, a playback control unit 23, a beat point editing unit 24, and an update processing unit 25).
The beat point estimation unit 21 estimates a plurality of beat points B in the song by analyzing the acoustic signal A. Specifically, the beat point estimation unit 21 generates time-series data that specifies the time of each of the plurality of beat points B in the song. The beat point estimation unit 21 of the first embodiment includes a feature extraction unit 30, a first processing unit 31, and a second processing unit 32.
The feature extraction unit 30 calculates a feature amount F(t) of the acoustic signal A for each of a plurality of time points t on the time axis (hereinafter referred to as "analysis time points"). Each analysis time point t is a time point set on the time axis at a predetermined interval. The interval between analysis time points t is sufficiently smaller than the interval between beat points B expected in the song.
The feature amount F(t) is information representing the acoustic characteristics of the acoustic signal A at the analysis time point t. For example, the feature amount F(t) at each analysis time point t is a time series of acoustic information within a period of predetermined length that includes that analysis time point t. The acoustic information is, for example, information regarding the intensity of the acoustic signal A, such as volume or amplitude. Information regarding the frequency characteristics (timbre) of the acoustic signal A is also used as acoustic information. Examples of information regarding frequency characteristics include MFCC (Mel-Frequency Cepstrum Coefficients), MSLS (Mel-Scale Log Spectrum), and the constant-Q transform (CQT). Note that a plurality of pieces of acoustic information corresponding to one analysis time point t may be used as the feature amount F(t). The types of acoustic information are not limited to the above examples; the acoustic information may be a combination of multiple types of acoustic information regarding the acoustic signal A.
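As a concrete illustration of this feature extraction step, the following is a minimal sketch in Python, assuming the librosa library, an illustrative file name, and a hypothetical analysis interval of roughly 10 ms; the embodiment itself fixes none of these choices.

```python
import librosa

# Load the acoustic signal A; the file path and sample rate are illustrative.
y, sr = librosa.load("song.wav", sr=22050)

# Hypothetical analysis interval of roughly 10 ms between analysis time points t.
hop = sr // 100

# MFCC as one possible choice of acoustic information; MSLS or CQT would also fit.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)  # shape: (20, n_frames)
features = mfcc.T  # F(t): one 20-dimensional feature vector per analysis time point t
```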
The first processing unit 31 and the second processing unit 32 estimate the plurality of beat points B from the feature amounts F(t) of the acoustic signal A. FIG. 3 is a flowchart of the process S2 of estimating the plurality of beat points B (hereinafter referred to as the "estimation process"). The estimation process S2 includes a first process S21 and a second process S22. The first processing unit 31 executes the first process S21, and the second processing unit 32 executes the second process S22.
The first process S21 is a process that generates, for each analysis time point t, a probability P(t) that the analysis time point t corresponds to a beat point B of the song. The greater the probability P(t) at an analysis time point t, the higher the likelihood that the analysis time point t corresponds to a beat point B. The first processing unit 31 generates a time series of probabilities P(t) by repeating the first process S21 for every analysis time point t. An estimation model M is used in the first process S21.
There is a correlation between the feature amount F(t) at each analysis time point t of the acoustic signal A and the probability P(t) that the analysis time point t corresponds to a beat point B. The estimation model M is a statistical model that has learned this correlation. That is, the estimation model M is a learned model that has learned the relationship between the feature amount F(t) and the probability P(t) by machine learning. The estimation model M can also be described as a trained model that has acquired the relationship between the feature amount F(t) and the probability P(t) through training (machine learning). The first processing unit 31 generates the probability P(t) by processing the feature amount F(t) of the acoustic signal A at each analysis time point t with the estimation model M. Specifically, the first processing unit 31 generates the probability P(t) by inputting input data including the feature amount F(t) to the estimation model M.
The estimation model M is composed of, for example, a deep neural network (DNN). For example, a deep neural network of any type, such as a recurrent neural network (RNN) or a convolutional neural network (CNN), is used as the estimation model M. The estimation model M may also be configured as a combination of multiple types of deep neural networks. Furthermore, additional elements such as long short-term memory (LSTM) or attention may be incorporated into the estimation model M.
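The following is a minimal sketch of one possible estimation model M, assuming PyTorch; since the embodiment permits any DNN (RNN, CNN, combinations, LSTM or attention), this small bidirectional GRU is illustrative only.

```python
import torch
import torch.nn as nn

class BeatProbabilityModel(nn.Module):
    """Maps a sequence of feature amounts F(t) to beat probabilities P(t)."""

    def __init__(self, feature_dim: int = 20, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, time, feature_dim); returns P(t) in [0, 1] for each analysis point
        h, _ = self.rnn(f)
        return torch.sigmoid(self.head(h)).squeeze(-1)
```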
The estimation model M is realized by a combination of a program that causes the control device 11 to execute an operation for generating the probability P(t) from the feature amount F(t), and a plurality of variables (specifically, weights and biases) applied to that operation. The program and the plurality of variables that realize the estimation model M are stored in the storage device 12. The numerical value of each of the plurality of variables that define the estimation model M is set in advance by machine learning.
The second process S22 in FIG. 3 is a process for estimating the plurality of beat points B in the song from the time series of probabilities P(t) generated by the first process S21. Various state transition models may be used for the second process S22. The state transition model is composed of, for example, a hidden semi-Markov model (HSMM), and the plurality of beat points B are estimated by the Viterbi algorithm, which is an example of dynamic programming. For example, a time point at which the probability P(t) reaches a local maximum is estimated as a beat point B.
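As a simplified stand-in for the HSMM decoded by the Viterbi algorithm, the following sketch realizes only the "local maximum of P(t)" reading of the second process S22; the threshold value is an assumption.

```python
import numpy as np

def pick_beats(p: np.ndarray, hop_seconds: float, threshold: float = 0.5) -> list[float]:
    """Return beat times B where P(t) is a local maximum above a threshold."""
    beats = []
    for t in range(1, len(p) - 1):
        if p[t] >= threshold and p[t] > p[t - 1] and p[t] >= p[t + 1]:
            beats.append(t * hop_seconds)
    return beats
```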
FIG. 4 is an explanatory diagram of the machine learning that establishes the estimation model M. The estimation model M is established, for example, by machine learning in a machine learning system 200 separate from the acoustic analysis system 100. The estimation model M is provided from the machine learning system 200 to the acoustic analysis system 100. Note that the functions of the machine learning system 200 may instead be installed in the acoustic analysis system 100.
A plurality of pieces of teacher data Z are used for the machine learning of the estimation model M. Each piece of teacher data Z is composed of a combination of a feature amount Fm for machine learning and a probability Pm for machine learning. The feature amount Fm is the feature amount F(t) at a specific time point of an acoustic signal Am prepared for machine learning. The acoustic signal Am is a signal recording sound radiated into an acoustic space, or a signal synthesized by a known sound synthesis process. The probability Pm for machine learning corresponding to a specific time point is the probability (that is, the ground-truth value) that the time point corresponds to a beat point B of the song. The plurality of pieces of teacher data Z are prepared for a large number of songs whose beat points B are known. Note that the acoustic signal Am is an example of a "learning acoustic signal."
The machine learning system 200 calculates an error function representing the error between the probability P(t) that an initial or provisional model (hereinafter referred to as the "provisional model") M0 outputs when the feature amount Fm of each piece of teacher data Z is input, and the probability Pm of that piece of teacher data Z. The machine learning system 200 then updates the plurality of variables of the provisional model M0 so that the error function is reduced. The provisional model M0 at the time when the above process has been repeated for each of the plurality of pieces of teacher data Z is finalized as the estimation model M.
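The following is a minimal sketch of this training loop, assuming the PyTorch model above, a data loader yielding (Fm, Pm) pairs, and a squared-error loss as the error function; the embodiment does not prescribe a specific loss or optimizer.

```python
import torch

def train(model: torch.nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for fm, pm in loader:  # teacher data Z: features Fm and target probabilities Pm
            loss = torch.mean((model(fm) - pm) ** 2)  # error between P(t) and Pm
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```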
Accordingly, the estimation model M outputs a statistically valid probability P(t) for an unknown feature amount F(t) under the latent relationship between the feature amounts Fm and the probabilities Pm in the plurality of pieces of teacher data Z. That is, the estimation model M is a learned model that has learned the relationship between the feature amount Fm of the acoustic signal Am for machine learning and the probability Pm that the time point at which that feature amount is observed corresponds to a beat point B. The first processing unit 31 processes the feature amount F(t) of each analysis time point t with the estimation model M established by the above procedure, thereby generating the probability P(t) that the analysis time point t corresponds to a beat point B of the song.
As described above, in the first embodiment, the plurality of beat points B are estimated from the acoustic signal A using the estimation model M, which has learned the relationship between the feature amount Fm of the acoustic signal Am for machine learning and the probability Pm that the analysis time point t at which the feature amount Fm is observed corresponds to a beat point B. Therefore, the plurality of beat points B can be estimated with high accuracy for an unknown acoustic signal A in which the feature amount F(t) varies in diverse ways.
The display control unit 22 in FIG. 2 displays images on the display device 13. Specifically, the display control unit 22 displays the confirmation image G of FIG. 5 on the display device 13. The confirmation image G includes a waveform area Ga and a beat point area Gb. A common time axis is set for the waveform area Ga and the beat point area Gb.
A waveform of the acoustic signal A within a specific range (hereinafter referred to as the "display range") is displayed in the waveform area Ga. The display control unit 22 changes the display range of the acoustic signal A in accordance with an instruction from the user on the operating device 14. The plurality of beat points B that the beat point estimation unit 21 estimated from the acoustic signal A are displayed in the beat point area Gb. Specifically, the plurality of beat points B within the display range of the acoustic signal A are displayed in the beat point area Gb. The beat point area Gb is an example of a "beat point image."
The user can instruct playback of the acoustic signal A by operating the operating device 14. The playback control unit 23 in FIG. 2 reproduces the sound represented by the acoustic signal A by supplying the acoustic signal A to the sound emitting device 15. As illustrated in FIG. 5, the display control unit 22 displays a playback position Gc on the confirmation image G in parallel with the playback of the acoustic signal A. The playback position Gc is the point of the acoustic signal A currently being reproduced by the sound emitting device 15. Accordingly, the playback position Gc advances in the direction of the time axis in parallel with the playback of the acoustic signal A. By viewing the beat point area Gb while listening to the sound reproduced by the sound emitting device 15, the user can confirm the positions of the beat points B estimated by the immediately preceding estimation process S2. If the current positions of the beat points B do not match the user's intention, the user can instruct correction of the estimated beat point positions by operating the operating device 14.
The beat point editing unit 24 in FIG. 2 moves each beat point B on the time axis in accordance with an instruction from the user. Moving a beat point B is a process of changing the position of the beat point B on the time axis. FIG. 6 is an explanatory diagram regarding the movement of a beat point B. State 1 in FIG. 6 is a state in which the plurality of beat points B have been estimated by the estimation process S2 described above. FIG. 6 also shows the time series of probabilities P(t) calculated in the first process S21.
By operating the operating device 14 while viewing the beat point area Gb, the user can select any one of the plurality of beat points B displayed in the beat point area Gb (hereinafter referred to as the "target beat point Bn"). Also, by operating the operating device 14 while viewing the beat point area Gb, the user can instruct movement of the target beat point Bn on the time axis. Specifically, the user can specify the movement direction (forward/backward) and the movement amount δ of the target beat point Bn. For example, the user can instruct that the target beat point Bn be moved to a time point that the user considers appropriate. As illustrated as state 2 in FIG. 6, the beat point editing unit 24 moves the target beat point Bn on the time axis in the direction (forward/backward) specified by the user, by the movement amount δ specified by the user. Note that although FIG. 6 illustrates a case in which the target beat point Bn moves forward, the target beat point Bn may also move backward.
As illustrated as state 3 in FIG. 6, the beat point editing unit 24 moves, in conjunction with the target beat point Bn, the beat point B located immediately before the target beat point Bn among the plurality of beat points B (hereinafter referred to as the "adjacent beat point Bn-1") and the beat point B located immediately after the target beat point Bn (hereinafter referred to as the "adjacent beat point Bn+1") on the time axis. Specifically, the beat point editing unit 24 moves the adjacent beat point Bn-1 and the adjacent beat point Bn+1 on the time axis in the movement direction (forward/backward) that the user specified for the target beat point Bn, by the movement amount δ that the user specified for the target beat point Bn. That is, the three beat points B, namely the target beat point Bn and the preceding and following adjacent beat points Bn±1, move in the same way on the time axis in accordance with the instruction from the user. Accordingly, the temporal relationship between the target beat point Bn and each adjacent beat point Bn±1 is maintained before and after the movement.
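A minimal sketch of this editing step, assuming beat times are held in a sorted Python list and n indexes the target beat point Bn:

```python
def move_target_and_neighbours(beats: list[float], n: int, delta: float) -> list[float]:
    """Shift the target beat Bn and its neighbours Bn-1 and Bn+1 by the same offset delta.

    delta < 0 moves forward (earlier on the time axis); delta > 0 moves backward (later).
    """
    moved = list(beats)
    for i in (n - 1, n, n + 1):
        if 0 <= i < len(moved):  # guard against edits at the ends of the song
            moved[i] += delta
    return moved
```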
The beat point area Gb displayed on the display device 13 includes the target beat point Bn and the adjacent beat points Bn±1. The display control unit 22 reflects the movement of each beat point B by the beat point editing unit 24 in the beat point area Gb displayed on the display device 13. Specifically, the display control unit 22 moves the target beat point Bn and each adjacent beat point Bn±1 in the beat point area Gb on the time axis in accordance with the instruction from the user.
The update processing unit 25 in FIG. 2 updates the estimation model M in accordance with the movement of the target beat point Bn and each adjacent beat point Bn±1. Specifically, the update processing unit 25 updates the estimation model M by machine learning corresponding to the movement of the target beat point Bn and each adjacent beat point Bn±1.
FIG. 7 is a flowchart of the process S8 in which the control device 11 (update processing unit 25) updates the estimation model M (hereinafter referred to as the "update process"). The update process S8 is started upon the movement of the target beat point Bn and each adjacent beat point Bn±1.
When the update process S8 starts, the update processing unit 25 sets, on the time axis, a numerical value sequence C corresponding to the moved target beat point Bn and each adjacent beat point Bn±1 (S81). As illustrated as state 4 in FIG. 6, the numerical value sequence C is a time series of numerical values Q(t) set for each analysis time point t on the time axis.
The numerical value sequence C includes numerical distributions D corresponding to the moved target beat point Bn and each adjacent beat point Bn±1. A numerical distribution D is a distribution of numerical values Q(t) over a specific range on the time axis. The numerical distribution D is expressed by a probability distribution function defined on the time axis with time t as a variable. In the first embodiment, the numerical distribution D is a symmetric triangular distribution over a predetermined distribution width. A numerical distribution D is set individually for each beat point B. The position on the time axis of the numerical distribution D corresponding to each beat point B is determined so that the distribution takes its maximum value at that beat point B. For example, the numerical distribution D corresponding to the target beat point Bn takes its maximum value at the target beat point Bn, and the numerical distribution D corresponding to each adjacent beat point Bn±1 takes its maximum value at that adjacent beat point Bn±1. The numerical values Q(t) at analysis time points t outside the numerical distributions D in the numerical value sequence C are set to zero.
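A minimal sketch of step S81, assuming frame-indexed beat positions and a hypothetical half-width `width` of the triangular distribution:

```python
import numpy as np

def build_target(num_frames: int, beat_frames: list[int], width: int = 4) -> np.ndarray:
    """Numerical sequence C: Q(t) with a triangular peak of height 1 at each beat."""
    q = np.zeros(num_frames)
    for b in beat_frames:  # moved target beat Bn and its neighbours Bn±1
        lo, hi = max(0, b - width), min(num_frames - 1, b + width)
        for t in range(lo, hi + 1):
            q[t] = max(q[t], 1.0 - abs(t - b) / width)  # symmetric triangle around b
    return q
```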
As illustrated as state 5 in FIG. 6, the update processing unit 25 calculates an error e(t) for each analysis time point t within an applied section T on the time axis (S82). The applied section T is a continuous section including the adjacent beat point Bn-1 and the adjacent beat point Bn+1. Specifically, the period on the time axis whose end points are the adjacent beat point Bn-1 and the adjacent beat point Bn+1 is set as the applied section T. The error e(t) at each analysis time point t is a numerical value corresponding to the difference between the probability P(t) at that analysis time point t and the numerical value Q(t) of the numerical value sequence C at that analysis time point t. For example, the square of the difference between the probability P(t) and the numerical value Q(t), that is, e(t) = {P(t) − Q(t)}², is calculated as the error e(t).
The update processing unit 25 calculates an error function E from the plurality of errors e(t) calculated for the different analysis time points t within the applied section T (S83). The error function E is an objective function representing the difference between the probabilities P(t) and the numerical values Q(t) within the applied section T. For example, the sum of the plurality of errors e(t) within the applied section T is calculated as the error function E.
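A minimal sketch of steps S82 and S83, assuming p and q are frame-aligned arrays of P(t) and Q(t), with (t0, t1) the frame indices of the adjacent beat points Bn-1 and Bn+1:

```python
import numpy as np

def error_function(p: np.ndarray, q: np.ndarray, t0: int, t1: int) -> float:
    """E over the applied section T = [t0, t1]: sum of e(t) = (P(t) - Q(t))**2."""
    e = (p[t0:t1 + 1] - q[t0:t1 + 1]) ** 2  # per-frame errors e(t)
    return float(e.sum())                    # error function E
```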
The update processing unit 25 updates the estimation model M so that the error function E is minimized (S84). Any known technique may be adopted for updating the estimation model M. For example, adaptation processing using self-attention is adopted for updating the estimation model M. Adaptation processing for the estimation model M is described, for example, in Kazuhiko Yamamoto, "HUMAN-IN-THE-LOOP ADAPTATION FOR INTERACTIVE MUSICAL BEAT TRACKING," Proceedings of the 22nd ISMIR Conference, Online, November 7-12, 2021.
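As a simplified stand-in for the self-attention-based adaptation cited above, the following sketch performs plain gradient fine-tuning of the PyTorch model on the applied section; the step count and learning rate are assumptions.

```python
import torch

def adapt(model: torch.nn.Module, f_section: torch.Tensor, q_section: torch.Tensor,
          steps: int = 50, lr: float = 1e-4):
    """Update M so the error between P(t) and Q(t) over the applied section shrinks."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        p = model(f_section)                     # P(t) over the applied section T
        loss = torch.mean((p - q_section) ** 2)  # error function E (mean instead of sum)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```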
As understood from the above description, the update processing unit 25 updates the estimation model M so that the error e(t) between the numerical distributions D (numerical values Q(t)) corresponding to the moved target beat point Bn and each adjacent beat point Bn±1 and the time series of probabilities P(t) estimated by the preceding estimation process S2 (first process S21) is reduced. Accordingly, the movement of the target beat point Bn and each adjacent beat point Bn±1 can be appropriately reflected in the estimation model M.
FIG. 8 is a flowchart of the process executed by the control device 11 (hereinafter referred to as the "acoustic analysis process"). The acoustic analysis process is started, for example, in response to an instruction from the user on the operating device 14.
When the acoustic analysis process starts, the control device 11 (feature extraction unit 30) calculates the feature amount F(t) of the acoustic signal A for each analysis time point t on the time axis (S1). The control device 11 (beat point estimation unit 21) estimates the plurality of beat points B from the feature amounts F(t) of the acoustic signal A by the estimation process S2 illustrated in FIG. 3. The estimation model M, which has learned the relationship between the feature amount F(t) and the probability P(t) by machine learning, is used in the first process S21 of the estimation process S2. The control device 11 (display control unit 22) displays the confirmation image G on the display device 13 (S3). The plurality of beat points B estimated by the estimation process S2 are displayed in the beat point area Gb of the confirmation image G.
The control device 11 determines whether a termination condition is satisfied (S4). The termination condition is, for example, that the user has instructed termination of the acoustic analysis process by an operation on the operating device 14. If the termination condition is satisfied (S4: YES), the control device 11 terminates the acoustic analysis process. If the termination condition is not satisfied (S4: NO), the control device 11 (beat point editing unit 24) determines whether an instruction to move a target beat point Bn has been received from the user (S5). If movement of a target beat point Bn has not been instructed (S5: NO), the control device 11 moves the process to step S4. That is, the control device 11 waits for an instruction to terminate the acoustic analysis process or an instruction to move a target beat point Bn.
When an instruction to move the target beat point Bn is received (S5: YES), the control device 11 (beat point editing unit 24) moves the target beat point Bn and the preceding and following adjacent beat points Bn±1 on the time axis in accordance with the instruction from the user (S6). The control device 11 (display control unit 22) also moves the target beat point Bn and each adjacent beat point Bn±1 in the beat point area Gb in accordance with the instruction from the user (S7).
The control device 11 (update processing unit 25) updates the estimation model M by the update process S8 illustrated in FIG. 7. When the estimation model M has been updated, the control device 11 moves the process to the estimation process S2. That is, the control device 11 (beat point estimation unit 21) estimates the plurality of beat points B by executing the estimation process S2 on the acoustic signal A using the updated estimation model M. The feature amounts F(t) calculated immediately after the start of the acoustic analysis process are applied to the second and subsequent executions of the estimation process S2.
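Tying the pieces together, the following hypothetical glue code runs one pass of the edit, update, and re-estimate cycle, reusing the helper sketches above and assuming features is a torch tensor of shape (1, frames, dims); all names are illustrative.

```python
import torch

def on_user_move(model, features, beats, n, delta, hop_seconds, width=4):
    beats = move_target_and_neighbours(beats, n, delta)           # S6
    t0 = int(round(beats[n - 1] / hop_seconds))                   # applied section T
    t1 = int(round(beats[n + 1] / hop_seconds))                   # spans Bn-1 .. Bn+1
    local = [int(round(b / hop_seconds)) - t0 for b in beats[n - 1:n + 2]]
    q = build_target(t1 - t0 + 1, local, width)                   # S81: Q(t) over T
    q_t = torch.tensor(q, dtype=torch.float32).unsqueeze(0)
    adapt(model, features[:, t0:t1 + 1, :], q_t)                  # S82-S84: update M
    with torch.no_grad():
        p = model(features)[0].numpy()                            # updated first process S21
    return pick_beats(p, hop_seconds)                             # simplified second process
```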
As understood from the above description, each time the target beat point Bn is moved, the update process S8 of the estimation model M and the estimation process S2 using the updated estimation model M are repeated. Accordingly, the position of each beat point B estimated by the estimation process S2 approaches, with each repetition of the estimation process S2, a position in which the instructions from the user are reflected. The beat points B estimated by any one execution of the estimation process S2 are an example of the "first beat points," and the beat points B estimated by the next execution of the estimation process S2 after the update of the estimation model M are an example of the "second beat points."
As described above, in the first embodiment, the estimation model M is updated in accordance with the movement of the target beat point Bn selected by the user from among the plurality of beat points B estimated by the estimation process S2 and the adjacent beat points Bn±1 around the target beat point Bn, and the plurality of beat points B are re-estimated by the estimation process S2 applying the updated estimation model M. That is, in updating the estimation model M, not only the movement of the target beat point Bn but also the temporal relationship between the target beat point Bn and each adjacent beat point Bn±1 is reflected in the estimation model M. Accordingly, compared with a configuration in which only the movement of the target beat point Bn is reflected in the estimation model M (hereinafter referred to as the "comparative example"), beat points B that appropriately match the user's intention can be estimated.
Specifically, in the comparative example, the narrowing of the interval between the target beat point Bn and the immediately preceding adjacent beat point Bn-1 and the widening of the interval between the target beat point Bn and the immediately following adjacent beat point Bn+1 are reflected in the estimation model M. Accordingly, a tendency for the performance speed to decrease after the target beat point Bn (a ritardando) is imparted to the estimation model M. However, when the target beat point Bn is moved, the user is more likely to intend a correction of the beat points B throughout the entire song than a change in performance speed. In the first embodiment, the temporal relationship between the target beat point Bn and each adjacent beat point Bn±1 is reflected in the estimation model M, so the problem of the comparative example, in which the performance speed decreases after the target beat point Bn, is resolved. That is, as described above, beat points B that appropriately match the user's intention can be estimated compared with the comparative example. Consequently, the user can be provided with the customer experience of having beat points B estimated in which the user's intention is appropriately reflected.
In the first embodiment, since the beat point area Gb is displayed on the display device 13, the user can visually confirm how the target beat point Bn and each adjacent beat point Bn±1 move in accordance with the instruction from the user. Accordingly, the user can instruct the movement of the target beat point Bn and each adjacent beat point Bn±1 while anticipating the beat points B that will be estimated by the updated estimation model M.
B: Second Embodiment
The second embodiment will now be described. In each of the embodiments exemplified below, elements whose functions are the same as in the first embodiment are given the same reference numerals as used in the description of the first embodiment, and detailed descriptions of them are omitted as appropriate.
FIG. 9 is a block diagram illustrating the functional configuration of the acoustic analysis system 100 in the second embodiment. By executing the program stored in the storage device 12, the control device 11 of the second embodiment functions as a section setting unit 26 in addition to the same elements as in the first embodiment (the beat point estimation unit 21, the display control unit 22, the playback control unit 23, the beat point editing unit 24, and the update processing unit 25).
The section setting unit 26 sets a partial section of the acoustic signal A on the time axis (hereinafter referred to as the "specific section"). Specifically, the section setting unit 26 sets the specific section in accordance with an instruction from the user. For example, by operating the operating device 14, the user can designate a particular section of the acoustic signal A displayed in the waveform area Ga. The section setting unit 26 sets the section designated by the user as the specific section.
The control device 11 of the second embodiment executes the acoustic analysis process of FIG. 8 on the specific section of the acoustic signal A. For example, the estimation process S2 by the beat point estimation unit 21 is executed only for the specific section. That is, the plurality of beat points B are estimated within the specific section of the song.
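A minimal sketch of this restricted estimation, assuming the model and pick_beats helper sketched above, with section boundaries given in seconds:

```python
import torch

def estimate_in_section(model, features, start_s, end_s, hop_seconds):
    t0, t1 = int(start_s / hop_seconds), int(end_s / hop_seconds)
    with torch.no_grad():
        p = model(features[:, t0:t1 + 1, :])[0].numpy()  # P(t) over the specific section only
    return [start_s + b for b in pick_beats(p, hop_seconds)]  # offset back to absolute time
```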
The specific procedure of the acoustic analysis process is the same as in the first embodiment. Accordingly, the same effects as in the first embodiment are also achieved in the second embodiment. In addition, in the second embodiment, the beat points B can be estimated for a limited portion (the specific section) of the acoustic signal A.
Note that although the above description exemplifies a configuration in which the specific section is set in accordance with an instruction from the user, the method of setting the specific section is arbitrary and is not limited to this example. For example, the section setting unit 26 may set the specific section according to a predetermined rule, without requiring an instruction from the user. For example, the section setting unit 26 may set any one of a plurality of structural sections of the song represented by the acoustic signal A as the specific section. A structural section is a section into which a song is divided on the time axis according to musical meaning. The structural sections are sections such as, for example, an intro, a verse, a bridge, a chorus, and an outro. The section setting unit 26 divides the acoustic signal A into a plurality of structural sections by analyzing the acoustic signal A, and sets a particular structural section among the plurality of structural sections as the specific section. According to this configuration, the beat points B can be estimated for a limited, specific structural section.
C: Modifications
Specific modifications that may be added to each of the aspects exemplified above are exemplified below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
(1) In each of the embodiments described above, the adjacent beat point Bn-1 immediately before the target beat point Bn and the adjacent beat point Bn+1 immediately after it are moved together with the target beat point Bn, but a configuration in which only one of the adjacent beat point Bn-1 and the adjacent beat point Bn+1 is moved together with the target beat point Bn is also conceivable. For example, the beat point editing unit 24 may move only the target beat point Bn and the immediately preceding adjacent beat point Bn-1 on the time axis in accordance with the instruction from the user, and the update processing unit 25 may calculate the errors e(t) within an applied section T between the adjacent beat point Bn-1 and the target beat point Bn. Similarly, the beat point editing unit 24 may move only the target beat point Bn and the immediately following adjacent beat point Bn+1 on the time axis in accordance with the instruction from the user, and the update processing unit 25 may calculate the errors e(t) within an applied section T between the target beat point Bn and the adjacent beat point Bn+1. As understood from the above description, the beat point editing unit 24 is expressed as an element that moves, on the time axis, one or more adjacent beat points Bn±1 located around the target beat point Bn among the plurality of beat points B.
Note that in each of the embodiments described above, not only the movement of the target beat point Bn but also the temporal relationship between the target beat point Bn and the immediately preceding adjacent beat point Bn-1 and the temporal relationship between the target beat point Bn and the immediately following adjacent beat point Bn+1 are reflected in the estimation model M. Accordingly, compared with a configuration in which only the target beat point Bn and a single surrounding adjacent beat point B are reflected in the estimation model M, the estimation model M can be updated so that beat points B that appropriately match the user's intention can be estimated.
(2) In each of the embodiments described above, a triangular distribution was exemplified as the numerical distribution D corresponding to the moved target beat point Bn and each adjacent beat point Bn±1, but the type or shape of the numerical distribution D is not limited to this example. For example, a probability distribution such as a normal distribution, or a pulse-like distribution, may also be adopted as the numerical distribution D.
(3) The type of feature amount F(t) that the feature extraction unit 30 calculates from the acoustic signal A is not limited to the examples in the embodiments described above. For example, a time series of a predetermined number of samples constituting the acoustic signal A may be applied to the estimation process S2 as the feature amount F(t). This configuration can be interpreted as the feature extraction unit 30 extracting a time series of samples from the acoustic signal A; on the other hand, from the viewpoint that part of the acoustic signal A itself is applied to the estimation process S2, it can also be interpreted as a configuration in which the feature extraction unit 30 is omitted.
(4) In each of the embodiments described above, the target beat point Bn and each adjacent beat point Bn±1 are moved in accordance with the movement direction (forward/backward) and movement amount δ specified by the user, but the method by which the user instructs the movement of the target beat point Bn and each adjacent beat point Bn±1 is not limited to this example.
For example, the beat point editing unit 24 may move the target beat point Bn and each adjacent beat point Bn±1 in accordance with a sign (±) and a numerical value input by the user. When the user inputs a negative number, the beat point editing unit 24 moves the target beat point Bn and each adjacent beat point Bn±1 forward on the time axis by a movement amount δ corresponding to the absolute value of that negative number. When the user inputs a positive number, the beat point editing unit 24 moves the target beat point Bn and each adjacent beat point Bn±1 backward on the time axis by a movement amount δ corresponding to that positive number.
The beat point editing unit 24 may also move the target beat point Bn and each adjacent beat point Bn±1 on the time axis by a predetermined unit amount, repeated for the number of times instructed by the user. For example, each time a movement instruction is received from the user, the beat point editing unit 24 moves the target beat point Bn and each adjacent beat point Bn±1 by the unit amount in the direction (forward/backward) designated by the user. Accordingly, the target beat point Bn and each adjacent beat point Bn±1 move on the time axis by a movement amount δ corresponding to the product of the predetermined unit amount and the number of movement instructions.
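A minimal sketch of this variant, assuming a hypothetical unit amount of 10 ms and the move_target_and_neighbours helper above:

```python
UNIT_SECONDS = 0.010  # hypothetical unit amount; the embodiment leaves it unspecified

def on_press(beats: list[float], n: int, direction: int) -> list[float]:
    """One press: direction -1 moves forward (earlier), +1 moves backward (later)."""
    return move_target_and_neighbours(beats, n, direction * UNIT_SECONDS)
```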
As understood from the above description, in the present disclosure, moving a beat point "in accordance with an instruction from the user" means that the conditions of the beat point's movement (for example, the movement direction and the movement amount) change in accordance with an instruction from the user; the method of instruction by the user and the matters the user instructs are arbitrary in the present disclosure. Also, "movement of a beat point" means changing the position of the beat point on the time axis.
(5) In each of the embodiments described above, a deep neural network was exemplified as the estimation model M, but the configuration of the estimation model M is not limited to this example. For example, a statistical model such as a hidden Markov model (HMM) or a support vector machine (SVM) may also be used as the estimation model M. Note that in each of the embodiments described above, the estimation model M is updated by the update process S8. Since the estimation model M is applied to the estimation process S2, the update process S8 can also be expressed as a process that updates the estimation process S2.
(6) The acoustic analysis system 100 may be realized, for example, by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, the acoustic analysis system 100 estimates the plurality of beat points B by analyzing an acoustic signal A received from the information device, and transmits data representing the plurality of beat points B to the information device.
(7) As described above, the functions of the acoustic analysis system 100 exemplified above are realized by cooperation between the one or more processors constituting the control device 11 and the program stored in the storage device 12. The program according to the present disclosure may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, is also included. Note that the non-transitory recording medium includes any recording medium excluding transitory, propagating signals, and does not exclude volatile recording media. Furthermore, in a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the above-mentioned non-transitory recording medium.
D: Supplementary Notes
From the forms exemplified above, for example, the following configurations can be understood.
An acoustic analysis system according to one aspect (aspect 1) of the present disclosure includes: a beat point estimation unit that estimates a plurality of first beat points by an estimation process on an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points, wherein the beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
According to the above aspect, the estimation process is updated in accordance with the movement, on the time axis, of the target beat point selected by the user and the one or more adjacent beat points located around the target beat point, and the plurality of second beat points are estimated by the updated estimation process. In updating the estimation process, not only the movement of the target beat point but also the temporal relationship between the target beat point and the one or more adjacent beat points is reflected in the estimation process. Accordingly, compared with a configuration in which only the movement of the target beat point is reflected in the estimation process, second beat points that appropriately match the user's intention can be estimated. Note that the acoustic analysis system may also be expressed as an acoustic analysis device. For both the "acoustic analysis system" and the "acoustic analysis device," it does not matter whether it is composed of a single device or of a plurality of mutually separate devices.
 The "estimation process" is a process for estimating a plurality of beat points (first beat points/second beat points) from an acoustic signal. One example of the "estimation process" is a process that uses an estimation model that has learned the relationship between a feature amount of a training acoustic signal and the probability that the time point at which the feature amount is observed corresponds to a beat point. Specifically, by processing the feature amount of the acoustic signal to be processed at a particular time point with the estimation model, the probability that the time point corresponds to a beat point is output.
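 For illustration only (this sketch is not part of the disclosed embodiments), such an estimation model could be realized along the following lines in Python; the recurrent architecture, the feature dimension, and all identifiers here (BeatProbabilityModel, beat_probabilities, the per-frame feature input) are assumptions introduced purely for explanation:

    import numpy as np
    import torch

    class BeatProbabilityModel(torch.nn.Module):
        """Hypothetical estimation model: maps a per-frame feature
        vector to the probability that the frame is a beat point."""
        def __init__(self, feature_dim: int = 128, hidden: int = 64):
            super().__init__()
            self.rnn = torch.nn.GRU(feature_dim, hidden, batch_first=True)
            self.out = torch.nn.Linear(hidden, 1)

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            # features: (batch, frames, feature_dim)
            h, _ = self.rnn(features)
            # one probability per frame that the frame is a beat point
            return torch.sigmoid(self.out(h)).squeeze(-1)

    def beat_probabilities(model: BeatProbabilityModel,
                           features: np.ndarray) -> np.ndarray:
        """Per-frame beat probabilities for one signal (sketch)."""
        with torch.no_grad():
            x = torch.from_numpy(features).float().unsqueeze(0)
            return model(x).squeeze(0).numpy()

 Any model that maps per-frame features to per-frame beat probabilities would fill the same role; the recurrent network above is merely one plausible choice.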
 "Updating the estimation process" means a process of updating an element applied to the estimation process. For example, assuming an estimation process that uses an estimation model, machine learning that updates the variables defining the estimation model corresponds to "updating the estimation process."
 In a specific example of aspect 1 (aspect 2), the one or more adjacent beat points include a first beat point located immediately before the target beat point among the plurality of first beat points, and a first beat point located immediately after the target beat point among the plurality of first beat points. In this aspect, not only the movement of the target beat point but also the temporal relationship between the target beat point and the immediately preceding first beat point, and the temporal relationship between the target beat point and the immediately following first beat point, are reflected in the estimation process. Therefore, compared with a configuration in which only the target beat point and a single surrounding adjacent beat point are reflected in the estimation process, the estimation process can be updated so as to estimate second beat points that better match the user's intention.
 The acoustic analysis system according to a specific example of aspect 1 or aspect 2 (aspect 3) further includes a display control unit 22 that displays, on a display device 13, a beat point image representing the target beat point and the one or more adjacent beat points, and moves the target beat point and the one or more adjacent beat points included in the beat point image in accordance with an instruction from the user. In this aspect, the user can visually confirm how the target beat point and the one or more adjacent beat points move in accordance with the user's instructions. Therefore, the user can instruct the movement of the target beat point and the one or more adjacent beat points while anticipating the second beat points to be estimated by the updated estimation process.
 In a specific example of any one of aspects 1 to 3 (aspect 4), the estimation process includes: a first process of generating, for each time point of the acoustic signal, the probability that the time point corresponds to a beat point, by processing the feature amount at that time point with an estimation model that has learned the relationship between a feature amount of a training acoustic signal and the probability that the time point at which the feature amount is observed corresponds to a beat point; and a second process of identifying the plurality of first beat points from the time series of probabilities generated by the first process. In this aspect, a plurality of beat points are estimated from the acoustic signal using an estimation model that has learned the relationship between the feature amount of the training acoustic signal and the probability that the time point of the feature amount corresponds to a beat point. Therefore, a plurality of beat points (first beat points/second beat points) can be estimated with high accuracy even for an unknown acoustic signal whose feature amounts vary in diverse ways.
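 For illustration only, the second process could be sketched as threshold-gated peak picking over the probability time series; the threshold and the minimum beat spacing below are assumptions, and a practical implementation might instead use dynamic programming or a probabilistic model such as the hidden Markov model mentioned in the background art:

    import numpy as np

    def pick_beat_points(probs: np.ndarray, frame_rate: float,
                         threshold: float = 0.5,
                         min_gap_s: float = 0.25) -> list:
        """Second process (sketch): identify beat times from the
        per-frame probability time series of the first process."""
        beats = []
        last = -np.inf
        for i in range(1, len(probs) - 1):
            is_peak = probs[i] >= probs[i - 1] and probs[i] >= probs[i + 1]
            t = i / frame_rate  # frame index converted to seconds
            if is_peak and probs[i] >= threshold and t - last >= min_gap_s:
                beats.append(t)
                last = t
        return beats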
 In a specific example of aspect 4 (aspect 5), the update processing unit updates the estimation model so as to reduce the error between a numerical distribution set on the time axis in correspondence with the moved target beat point and the one or more adjacent beat points, and the time series of probabilities estimated by the first process. In this aspect, since the estimation model is updated so that the error between the numerical distribution corresponding to the target beat point and the adjacent beat points and the time series of probabilities estimated by the first process is reduced, the movement of the target beat point and the adjacent beat points can be appropriately reflected in the estimation model.
 The "numerical distribution" is a distribution of numerical values on the time axis. The type and shape of the numerical distribution are arbitrary; for example, a triangular distribution, a normal distribution, or a pulse-like distribution can serve as the "numerical distribution." With regard to the numerical distribution, "set in correspondence with a (target/adjacent) beat point" means that the position of the beat point on the time axis and the position of the numerical distribution on the time axis correspond to each other. That is, as the position of the beat point on the time axis changes, the position of the numerical distribution on the time axis changes accordingly. For example, a relationship in which the maximum of the numerical distribution coincides with the beat point is a typical example of the relationship "set in correspondence with a beat point."
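 For illustration only, the update of aspect 5 could be sketched as follows, assuming a triangular numerical distribution whose maximum coincides with each moved beat point and a mean-squared error as the quantity to be reduced; the distribution width, learning rate, number of steps, and loss function are all assumptions, and the model is the hypothetical one sketched under the note on the "estimation process" above:

    import numpy as np
    import torch

    def target_distribution(beat_times, n_frames, frame_rate,
                            width_s: float = 0.05) -> np.ndarray:
        """Numerical distribution (sketch): a triangular bump whose
        maximum coincides with each moved beat point on the time axis."""
        target = np.zeros(n_frames, dtype=np.float32)
        half = int(width_s * frame_rate)
        for t in beat_times:
            c = int(round(t * frame_rate))
            for k in range(-half, half + 1):
                if 0 <= c + k < n_frames:
                    target[c + k] = max(target[c + k],
                                        1.0 - abs(k) / (half + 1))
        return target

    def update_model(model, features, beat_times, frame_rate,
                     lr: float = 1e-4, steps: int = 20) -> None:
        """Update processing (sketch): reduce the error between the
        model's probability time series and the numerical distribution."""
        x = torch.from_numpy(features).float().unsqueeze(0)
        y = torch.from_numpy(
            target_distribution(beat_times, features.shape[0], frame_rate)
        ).unsqueeze(0)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()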
 The acoustic analysis system according to a specific example of any one of aspects 1 to 5 (aspect 6) further includes a section setting unit 26 that sets a specific section, which is a partial section of the acoustic signal on the time axis, and the estimation process by the beat point estimation unit is executed for the specific section. According to this aspect, the second beat points can be estimated in a limited manner for a partial section of the acoustic signal.
 The "specific section" is any partial section of the acoustic signal on the time axis. For example, a section designated by the user is one example of the "specific section." Alternatively, beat points may be estimated with any one of a plurality of structural sections of the piece of music represented by the acoustic signal serving as the "specific section." A structural section is a section into which a piece of music is divided on the time axis according to its musical meaning, such as an intro, verse, bridge, chorus, or outro.
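 For illustration only, and reusing the hypothetical helpers from the sketches above, restricting the estimation process to a specific section could amount to slicing the feature sequence and shifting the detected beat times back onto the full-signal time axis:

    def estimate_in_section(model, features, frame_rate,
                            start_s: float, end_s: float) -> list:
        """Run the estimation process only on the specific section (sketch)."""
        a, b = int(start_s * frame_rate), int(end_s * frame_rate)
        probs = beat_probabilities(model, features[a:b])
        # shift the detected times back to the full-signal time axis
        return [t + start_s for t in pick_beat_points(probs, frame_rate)]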
 An acoustic analysis method according to one aspect of the present disclosure estimates a plurality of first beat points by an estimation process applied to an acoustic signal; moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points; and estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal. Note that each aspect exemplified for the acoustic analysis system applies similarly to the acoustic analysis method according to the present disclosure.
 A program according to one aspect of the present disclosure causes a computer system to function as: a beat point estimation unit that estimates a plurality of first beat points by an estimation process applied to an acoustic signal; a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and an update processing unit that updates the estimation process in accordance with the movement of the target beat point and the one or more adjacent beat points, wherein the beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal. Note that each aspect exemplified for the acoustic analysis system applies similarly to the program according to the present disclosure.
DESCRIPTION OF REFERENCE SIGNS: 100... acoustic analysis system; 11... control device; 12... storage device; 13... display device; 14... operation device; 15... sound emitting device; 21... beat point estimation unit; 22... display control unit; 23... playback control unit; 24... beat point editing unit; 25... update processing unit; 26... section setting unit; 30... feature extraction unit; 31... first processing unit; 32... second processing unit.

Claims (13)

  1.  An acoustic analysis system comprising:
     a beat point estimation unit that estimates a plurality of first beat points by an estimation process applied to an acoustic signal;
     a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and
     an update processing unit that updates the estimation process in accordance with movement of the target beat point and the one or more adjacent beat points,
     wherein the beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.
  2.  The acoustic analysis system according to claim 1, wherein the one or more adjacent beat points include:
     a first beat point located immediately before the target beat point among the plurality of first beat points; and
     a first beat point located immediately after the target beat point among the plurality of first beat points.
  3.  The acoustic analysis system according to claim 1 or claim 2, further comprising a display control unit that displays, on a display device, a beat point image representing the target beat point and the one or more adjacent beat points, and moves the target beat point and the one or more adjacent beat points included in the beat point image in accordance with an instruction from the user.
  4.  The acoustic analysis system according to claim 1 or claim 2, wherein the estimation process includes:
     a first process of generating, for each time point of the acoustic signal, a probability that the time point corresponds to a beat point, by processing a feature amount at the time point with an estimation model that has learned a relationship between a feature amount of a training acoustic signal and a probability that a time point at which the feature amount is observed corresponds to a beat point; and
     a second process of identifying the plurality of first beat points from a time series of the probabilities generated by the first process.
  5.  The acoustic analysis system according to claim 4, wherein the update processing unit updates the estimation model so as to reduce an error between:
     a numerical distribution set on the time axis in correspondence with the moved target beat point and the one or more adjacent beat points; and
     the time series of the probabilities estimated by the first process.
  6.  The acoustic analysis system according to claim 1, further comprising a section setting unit that sets a specific section that is a partial section of the acoustic signal on the time axis,
     wherein the estimation process by the beat point estimation unit is executed for the specific section.
  7.  An acoustic analysis method realized by a computer system, the method comprising:
     estimating a plurality of first beat points by an estimation process applied to an acoustic signal;
     moving, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points;
     updating the estimation process in accordance with movement of the target beat point and the one or more adjacent beat points; and
     estimating a plurality of second beat points by executing the updated estimation process on the acoustic signal.
  8.  The acoustic analysis method according to claim 7, wherein the one or more adjacent beat points include:
     a first beat point located immediately before the target beat point among the plurality of first beat points; and
     a first beat point located immediately after the target beat point among the plurality of first beat points.
  9.  The acoustic analysis method according to claim 7 or claim 8, further comprising:
     displaying, on a display device, a beat point image representing the target beat point and the one or more adjacent beat points; and
     moving the target beat point and the one or more adjacent beat points included in the beat point image in accordance with an instruction from the user.
  10.  The acoustic analysis method according to claim 7 or claim 8, wherein the estimation process includes:
     a first process of generating, for each time point of the acoustic signal, a probability that the time point corresponds to a beat point, by processing a feature amount at the time point with an estimation model that has learned a relationship between a feature amount of a training acoustic signal and a probability that a time point at which the feature amount is observed corresponds to a beat point; and
     a second process of identifying the plurality of first beat points from a time series of the probabilities generated by the first process.
  11.  The acoustic analysis method according to claim 10, wherein updating the estimation process includes updating the estimation model so as to reduce an error between:
     a numerical distribution set on the time axis in correspondence with the moved target beat point and the one or more adjacent beat points; and
     the time series of the probabilities estimated by the first process.
  12.  The acoustic analysis method according to claim 7, further comprising setting a specific section that is a partial section of the acoustic signal on the time axis,
     wherein the estimation process is executed for the specific section.
  13.  A program that causes a computer system to function as:
     a beat point estimation unit that estimates a plurality of first beat points by an estimation process applied to an acoustic signal;
     a beat point editing unit that moves, on a time axis in accordance with an instruction from a user, a target beat point selected by the user from among the plurality of first beat points and one or more adjacent beat points located around the target beat point among the plurality of first beat points; and
     an update processing unit that updates the estimation process in accordance with movement of the target beat point and the one or more adjacent beat points,
     wherein the beat point estimation unit estimates a plurality of second beat points by executing the updated estimation process on the acoustic signal.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022106820A JP2024006175A (en) 2022-07-01 2022-07-01 Acoustic analysis system, acoustic analysis method, and program
JP2022-106820 2022-07-01

Publications (1)

Publication Number Publication Date
WO2024004564A1

Family

ID=89382817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/021287 WO2024004564A1 (en) 2022-07-01 2023-06-08 Acoustic analysis system, acoustic analysis method, and program

Country Status (2)

Country Link
JP (1) JP2024006175A (en)
WO (1) WO2024004564A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013171070A (en) * 2012-02-17 2013-09-02 Pioneer Electronic Corp Music information processing apparatus and music information processing method

Also Published As

Publication number Publication date
JP2024006175A (en) 2024-01-17

Similar Documents

Publication Publication Date Title
US11727904B2 (en) Network musical instrument
JP4124247B2 (en) Music practice support device, control method and program
JP2019219570A (en) Electronic music instrument, control method of electronic music instrument, and program
JP2019219569A (en) Electronic music instrument, control method of electronic music instrument, and program
JP2012037722A (en) Data generator for sound synthesis and pitch locus generator
JP2014174205A (en) Musical sound information processing device and program
US20230351989A1 (en) Information processing system, electronic musical instrument, and information processing method
US20220238088A1 (en) Electronic musical instrument, control method for electronic musical instrument, and storage medium
US20230016425A1 (en) Sound Signal Generation Method, Estimation Model Training Method, and Sound Signal Generation System
WO2024004564A1 (en) Acoustic analysis system, acoustic analysis method, and program
JP6617784B2 (en) Electronic device, information processing method, and program
US20230395052A1 (en) Audio analysis method, audio analysis system and program
WO2024085175A1 (en) Data processing method and program
US20230419929A1 (en) Signal processing system, signal processing method, and program
JP2019219661A (en) Electronic music instrument, control method of electronic music instrument, and program
US20230419934A1 (en) Responsive live musical sound generation
WO2023170757A1 (en) Reproduction control method, information processing method, reproduction control system, and program
WO2023171522A1 (en) Sound generation method, sound generation system, and program
WO2022074754A1 (en) Information processing method, information processing system, and program
WO2022074753A1 (en) Information processing method, information processing system, and program
WO2023182005A1 (en) Data output method, program, data output device, and electronic musical instrument
US20240087552A1 (en) Sound generation method and sound generation device using a machine learning model
US20230260493A1 (en) Sound synthesizing method and program
WO2023171497A1 (en) Acoustic generation method, acoustic generation system, and program
JP2022129742A (en) Method and system for analyzing audio and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23831014

Country of ref document: EP

Kind code of ref document: A1