WO2021060273A1 - Sound output control method and sound output control device - Google Patents

Sound output control method and sound output control device

Info

Publication number
WO2021060273A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
hand
phoneme
specific
pronunciation
Prior art date
Application number
PCT/JP2020/035785
Other languages
English (en)
Japanese (ja)
Inventor
入山 達也
慶二郎 才野
Original Assignee
ヤマハ株式会社 (Yamaha Corporation)
Priority date
Filing date
Publication date
Application filed by ヤマハ株式会社 (Yamaha Corporation)
Publication of WO2021060273A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00: Details of electrophonic musical instruments
    • G10H 1/18: Selecting circuits
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers

Definitions

  • This disclosure relates to a technique for controlling pronunciation.
  • In conventional techniques, pronunciation is started at the moment the user contacts a surface such as a key.
  • The purpose of this disclosure is to start pronunciation before an object such as a user's finger comes into contact with a surface such as a key.
  • The sound control method according to one aspect of the present disclosure detects that an object is in a specific state while the object is moving toward a surface, sounds a first sound when the specific state is detected, detects a striking event in which the object hits the surface as a result of the movement of the object, and sounds a second sound when the striking event is detected.
  • The sound control device according to one aspect of the present disclosure includes a detection unit that detects that an object is in a specific state while the object is moving toward a surface and detects a striking event in which the object hits the surface as a result of the movement of the object, and a sound control unit that sounds a first sound when the specific state is detected and sounds a second sound when the striking event is detected.
  • FIG. 1 is a block diagram illustrating the configuration of the pronunciation control system 100 according to the embodiment of the present disclosure.
  • The pronunciation control system 100 synthesizes a virtual voice of a specific singer singing a musical piece. Each phoneme constituting the synthesized voice is pronounced at a time instructed by the user.
  • the pronunciation control system 100 includes an operation unit 10 and a pronunciation control device 20.
  • By hitting the operation unit 10 with his or her hand H, the user instructs the pronunciation control device 20 of the time at which the pronunciation of each phoneme should start (hereinafter referred to as the "pronunciation start point").
  • the pronunciation control device 20 synthesizes voice by pronouncing each phoneme according to an instruction from the user.
  • the operation unit 10 includes an operation reception unit 11, a first sensor 13, and a second sensor 15.
  • the operation reception unit 11 includes a surface (hereinafter referred to as “striking surface”) F that is hit by the user's hand H.
  • the hand H is an example of an "object” that hits the striking surface F.
  • the operation receiving unit 11 includes a housing 112 and a light transmitting unit 114.
  • the housing 112 is, for example, a hollow structure having an opening at the top.
  • The light transmitting portion 114 is a flat plate formed of a material that transmits light in a wavelength range detectable by the first sensor 13.
  • the light transmitting portion 114 is installed so as to close the opening of the housing 112.
  • the surface of the light transmitting portion 114 on the side opposite to the internal space of the housing 112 corresponds to the striking surface F.
  • the user hits the striking surface F with the hand H in order to indicate the pronunciation start point of each phoneme. Specifically, the user hits the hitting surface F by moving the hand H from above the hitting surface F toward the hitting surface F. A phoneme is pronounced according to the time when the hand H hits the striking surface F.
  • the first sensor 13 and the second sensor 15 are housed inside the housing 112.
  • the first sensor 13 is a sensor for detecting the state of the user's hand H.
  • a distance image sensor that measures the distance between the subject and the imaging surface for each pixel is used as the first sensor 13.
  • the hand H moving toward the striking surface F is imaged by the first sensor 13.
  • the first sensor 13 is installed, for example, in the central portion of the bottom surface of the housing 112, and images the hand H moving toward the striking surface F from the palm side (inside of the housing 112).
  • The first sensor 13 can detect light in a specific wavelength range; it receives light arriving from the hand H located above the striking surface F through the light transmitting portion 114 and generates data representing an image of the hand H (hereinafter referred to as "image data") D1.
  • the light transmitting portion 114 is formed of a member that transmits light that can be detected by the first sensor 13.
  • the image data D1 is transmitted to the sound control device 20.
  • the first sensor 13 and the sound control device 20 can communicate with each other wirelessly or by wire.
  • the image data D1 is repeatedly generated at predetermined intervals.
  • The second sensor 15 is a sensor for detecting the hit of the hand H on the striking surface F.
  • a sound collecting device that collects ambient sounds and generates a sound signal D2 representing the collected sounds is used as the second sensor 15.
  • the second sensor 15 collects the hitting sound generated when the user's hand H hits the hitting surface F.
  • the sound signal D2 is transmitted to the sound control device 20.
  • the second sensor 15 and the sound control device 20 can communicate with each other wirelessly or by wire.
  • FIG. 2 is a block diagram illustrating the configuration of the sound control device 20.
  • The sound control device 20 synthesizes voice according to the user's action of hitting the striking surface F.
  • the sound control device 20 includes a control device 21, a storage device 23, and a sound emitting device 25.
  • the control device 21 is, for example, a single or a plurality of processors that control each element of the sound control device 20.
  • The control device 21 is composed of one or more types of processors such as a CPU (Central Processing Unit), SPU (Sound Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • The control device 21 executes a program stored in the storage device 23 to generate a signal (hereinafter referred to as a "synthetic signal") V representing the voice of the singer singing the music. By executing the program, the control device 21 functions as the phoneme specifying unit 212, the detection unit 213, and the pronunciation control unit 214 shown in FIG. 2.
  • the storage device 23 is a single or a plurality of memories composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • the storage device 23 stores a program executed by the control device 21 and various data used by the control device 21.
  • the storage device 23 may be configured by combining a plurality of types of recording media.
  • the storage device 23 may be a portable recording medium that can be attached to and detached from the sound control device 20, or an external recording medium (for example, online storage) that the sound control device 20 can communicate with via a communication network.
  • the storage device 23 stores data (hereinafter referred to as “synthetic data”) S representing sounds to be synthesized by the sound control device 20.
  • Synthetic data S is data that specifies the content of the music.
  • the synthetic data S is data for designating the pitch Sx and the phoneme Sy for each of the plurality of notes constituting the music.
  • the pitch Sx is any one of a plurality of pitches (for example, a note number).
  • The phoneme Sy is the pronunciation content to be uttered together with the pronunciation of the note.
  • the phoneme Sy corresponds to one syllable (pronunciation unit) constituting the lyrics of the music.
  • a typical phoneme Sy in Japanese is a combination of a consonant and a vowel immediately after it, or a single vowel.
  • the synthetic signal V is generated by voice synthesis using the synthetic data S.
  • the sounding start point of each note is controlled according to the action of striking the striking surface F by the user.
  • The order of the plurality of notes constituting the music is specified by the synthetic data S, but the pronunciation start point of each note is not specified by the synthetic data S.
  • The phoneme specifying unit 212 determines, for each note, whether the phoneme Sy specified by the synthetic data S is a phoneme composed of a consonant and a vowel (hereinafter referred to as a "specific phoneme"). Specifically, the phoneme specifying unit 212 determines that a phoneme Sy composed of a consonant and the vowel that follows it is a specific phoneme, and that a phoneme Sy composed of a single vowel is not a specific phoneme (see the sketch below).
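  • As a concrete illustration of this determination, the following is a minimal Python sketch; the Note structure, the romaji representation of phonemes, and the vowel test are illustrative assumptions, not the patent's actual data format.

```python
from dataclasses import dataclass

VOWELS = {"a", "i", "u", "e", "o"}

@dataclass
class Note:
    pitch: int    # pitch Sx, e.g. a note number
    phoneme: str  # phoneme Sy in romaji, e.g. "sa", "ka", or a single vowel "a"

def is_specific_phoneme(phoneme: str) -> bool:
    """True if the phoneme is a consonant followed by a vowel (a "specific phoneme")."""
    return len(phoneme) >= 2 and phoneme[0] not in VOWELS and phoneme[-1] in VOWELS

# "sa" is a specific phoneme (consonant + vowel); "a" (single vowel) is not.
for note in [Note(60, "sa"), Note(62, "a")]:
    print(note.phoneme, is_specific_phoneme(note.phoneme))
```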
  • the user takes the rhythm of the music by hitting the hitting surface F in sequence. Specifically, the user hits the hitting surface F at each time when the pronunciation of each note in the music should be started.
  • The pronunciation start point of the vowel following the consonant is audibly perceived as the pronunciation start point of the specific phoneme as a whole. Therefore, in a configuration in which the consonant of a specific phoneme starts to be pronounced at the time the user hits the striking surface F (hereinafter referred to as the "hit time") and the vowel is pronounced after the consonant, the listener perceives the pronunciation of the specific phoneme as starting later than the start point of the note. In the present embodiment, therefore, the pronunciation of the specific phoneme is started before the hit time, which reduces the audible delay of the specific phoneme.
  • FIG. 3 is a graph showing the relationship between the distance P between the hand H and the striking surface F and the time.
  • the distance P is the height of the hand H from the striking surface F.
  • When the hand H hits the striking surface F, the distance P becomes 0.
  • the specific state means that the distance P becomes a specific distance (hereinafter, “specific distance”) Pz in the process of decreasing. That is, the specific state is the state of the hand H before it comes into contact with the striking surface F.
  • the distance P may be, for example, the distance between the reference point (for example, the center point) on the striking surface F and the hand H.
  • FIG. 3 shows the arrival time t1, at which the hand H reaches the specific state (hereinafter referred to as the "arrival time"), and the hit time t2.
  • The consonant of a specific phoneme is sounded at the arrival time t1 (that is, the time when the distance P reaches the specific distance Pz), and the vowel of the specific phoneme is sounded at the hit time t2 (that is, the time when the distance P reaches 0).
  • the detection unit 213 of FIG. 2 includes a first detection unit 31 and a second detection unit 32.
  • the first detection unit 31 detects that the hand H is in a specific state.
  • the first detection unit 31 specifies the distance P by using the image data D1.
  • The first detection unit 31 estimates the region of the hand H from the image data D1 by image recognition such as contour extraction, and specifies the distance P of the hand H from the distances measured by the first sensor 13 for the pixels in that region. Any known technique may be used to specify the distance P.
  • the first detection unit 31 determines whether or not the distance P has reached the specific distance Pz by comparing the distance P with the first threshold value.
  • the first threshold value is set according to, for example, a specific distance Pz.
  • The second detection unit 32 detects that the hand H has hit the striking surface F as a result of the movement of the hand H. Specifically, the second detection unit 32 detects the hit by analyzing the sound signal D2. First, the second detection unit 32 identifies the volume of the sound represented by the sound signal D2 (hereinafter referred to as the "sound collection level"); any known sound analysis technique may be used for this analysis. Next, the second detection unit 32 determines whether the hand H has hit the striking surface F by comparing the sound collection level with a second threshold value. The second threshold value is set assuming the striking sound produced when the hand H hits the striking surface F.
  • When the sound collection level is below the second threshold value, it is determined that the sound signal D2 does not include the striking sound, that is, that the striking surface F has not been struck.
  • When the sound collection level exceeds the second threshold value, it is determined that the sound signal D2 contains the striking sound, that is, that the hand H has hit the striking surface F.
  • A slight time difference inevitably occurs between the hit time t2 at which the hand H hits the striking surface F and the time at which the hit (striking event) is detected. In the following description, the hit time t2 and the time at which the hit is detected are treated as substantially the same time point.
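  • The two determinations above can be summarized in the following minimal sketch; the threshold values and the per-sample interface are assumptions about how the sensor data arrives, not the patent's implementation.

```python
class FirstDetector:
    """Detects the specific state: distance P reaches Pz while decreasing."""

    def __init__(self, first_threshold: float):
        self.first_threshold = first_threshold  # corresponds to the specific distance Pz
        self._prev_p = None

    def specific_state(self, p: float) -> bool:
        # Only a decreasing distance that has crossed the threshold counts;
        # while P is increasing, no determination is made.
        decreasing = self._prev_p is not None and p < self._prev_p
        self._prev_p = p
        return decreasing and p <= self.first_threshold

class SecondDetector:
    """Detects the striking event: the sound collection level exceeds the second threshold."""

    def __init__(self, second_threshold: float):
        self.second_threshold = second_threshold

    def hit(self, level: float) -> bool:
        return level > self.second_threshold
```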
  • The pronunciation control unit 214 generates the synthetic signal V representing the sound specified by the synthetic data S.
  • The synthetic signal V is a signal representing a voice in which the phoneme Sy specified for each note by the synthetic data S is pronounced at the pitch Sx specified for that note.
  • a known technique is arbitrarily adopted for speech synthesis.
  • For example, concatenative speech synthesis, which generates the synthetic signal V by connecting a plurality of speech segments, or statistical speech synthesis, which generates the synthetic signal V using a statistical model such as an HMM (Hidden Markov Model) or a neural network, is used to generate the synthetic signal V.
  • The pronunciation start point of each phoneme Sy designated by the synthetic data S is controlled according to the results of detection by the first detection unit 31 and the second detection unit 32.
  • When the phoneme Sy is other than a specific phoneme, the sound control unit 214 causes the phoneme to be pronounced when the striking surface F is hit. Specifically, the sound control unit 214 causes the phoneme to be pronounced when the second detection unit 32 detects the hit. That is, a synthetic signal V in which the pronunciation start point of the entire phoneme is set at the hit time t2 is generated.
  • When the phoneme Sy specified by the synthetic data S is determined by the phoneme specifying unit 212 to be a specific phoneme, the sound control unit 214 causes the specific phoneme to start being pronounced before the striking surface F is hit.
  • Specifically, the sound control unit 214 causes the consonant of the specific phoneme to be pronounced when the first detection unit 31 detects the specific state, and the vowel of the specific phoneme to be pronounced when the second detection unit 32 detects the hit. That is, a synthetic signal V is generated in which the pronunciation start point of the consonant of the specific phoneme is set at the arrival time t1 and the pronunciation start point of the vowel following the consonant is set at the hit time t2. The synthetic signal V is supplied to the sound emitting device 25.
  • The sound emitting device 25 (for example, a speaker) is a reproduction device that emits the sound represented by the synthetic signal V. A sound in which the pronunciation start point of each phoneme Sy is controlled is thus emitted for the music, which reduces the audible delay of specific phonemes across the music.
  • FIG. 4 is a flowchart of processing of the control device 21.
  • The user hits the striking surface F at each time when the pronunciation of a note in the musical piece should start. That is, the striking surface F is struck by the hand H once for each note.
  • The process of FIG. 4 is executed for each note of the synthetic data S.
  • The note to be processed in FIG. 4 is referred to as the "target note".
  • In parallel with the process of FIG. 4, a process in which the first detection unit 31 specifies the distance P and a process in which the second detection unit 32 specifies the sound collection level are executed.
  • The process of specifying the distance P and the process of specifying the sound collection level are repeatedly executed in a cycle shorter than that of the process of FIG. 4.
  • The phoneme specifying unit 212 determines whether or not the phoneme Sy of the target note in the synthetic data S is a specific phoneme (Sa1).
  • When the phoneme Sy of the target note is a specific phoneme, the first detection unit 31 determines whether the hand H is in the specific state while moving toward the striking surface F (Sa2); that is, whether the distance P has reached the specific distance Pz in the process of decreasing.
  • Specifically, the first detection unit 31 determines whether the distance P is decreasing, and if so, determines whether the hand H is in the specific state by comparing the distance P with the first threshold value. While the distance P is increasing, this determination is not made.
  • When the specific state is detected, the sound control unit 214 causes the consonant of the specific phoneme to be pronounced (Sa3). Specifically, the sound control unit 214 generates a synthetic signal V in which the pronunciation start point of the consonant of the specific phoneme is set at the time the specific state is detected (that is, the arrival time t1), and supplies the synthetic signal V to the sound emitting device 25.
  • The process of step Sa2 is repeatedly executed until the hand H reaches the specific state.
  • the second detection unit 32 determines whether or not the hand H has hit the striking surface F (Sa4). Specifically, by comparing the sound collection level with the second threshold value, it is determined whether or not the hand H has hit the striking surface F.
  • When the hit is detected, the sound control unit 214 causes the vowel following the consonant of the specific phoneme to be pronounced (Sa5). Specifically, the sound control unit 214 generates a synthetic signal V in which the pronunciation start point of the vowel of the specific phoneme is set at the time the hit on the striking surface F is detected, and supplies the synthetic signal V to the sound emitting device 25.
  • The process of step Sa4 is repeatedly executed until the hand H reaches the striking surface F and hits it.
  • the pronunciation of the specific phoneme is started before the hand H hits the striking surface F.
  • When the phoneme Sy of the target note is other than a specific phoneme, steps Sa2 and Sa3 are omitted, and the process of step Sa4 is executed. That is, for phonemes other than specific phonemes, the pronunciation of the phoneme is started at the hit time t2. The overall flow is sketched below.
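  • The per-note flow of FIG. 4 can be sketched as follows, reusing the is_specific_phoneme helper and detector classes from the sketches above; the sound_consonant, sound_vowel, and sound_phoneme callbacks stand in for the synthesis calls of the sound control unit 214 and are hypothetical.

```python
def process_note(note, first_det, second_det, distances, levels,
                 sound_consonant, sound_vowel, sound_phoneme):
    """One pass of the FIG. 4 flow for a single target note."""
    if is_specific_phoneme(note.phoneme):            # Sa1
        for p in distances:                          # Sa2: wait for the specific state
            if first_det.specific_state(p):
                sound_consonant(note)                # Sa3: consonant at arrival time t1
                break
    for level in levels:                             # Sa4: wait for the hit
        if second_det.hit(level):
            if is_specific_phoneme(note.phoneme):
                sound_vowel(note)                    # Sa5: vowel at hit time t2
            else:
                sound_phoneme(note)                  # whole phoneme at hit time t2
            break
```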
  • The duration of the note may be a fixed time length, or may be a time length specified for each note by the synthetic data S.
  • As described above, in the present embodiment, the consonant of a specific phoneme is pronounced when the hand H is detected to be in the specific state, and the vowel of the specific phoneme is pronounced when the hit on the striking surface F is detected. Therefore, the consonant of the specific phoneme can be pronounced before the hand H hits the striking surface F, reducing the perception that the specific phoneme is delayed. Further, since the vowel of the specific phoneme is pronounced upon detection of the hit of the hand H on the striking surface F, the consonant can be pronounced before the vowel while maintaining the feel of the operation for pronouncing the specific phoneme.
  • In the present embodiment, the specific state is the state in which the distance P between the hand H and the striking surface F is at the specific distance Pz; that is, a state partway through the movement of the hand H toward the striking surface F is detected as the specific state. Therefore, the consonant can be pronounced without the user being conscious of any separate operation for pronouncing the consonant of the specific phoneme. Further, since the hit of the hand H on the striking surface F is detected by analyzing the sound signal D2, the vowel of the specific phoneme can be pronounced at the time the striking sound is generated by the hit.
  • In the above embodiment, the sound produced upon detection of the specific state corresponds to the "first sound", and the sound produced upon detection of the hit on the striking surface F corresponds to the "second sound".
  • The consonant of the specific phoneme is an example of the "first sound", and the vowel of the specific phoneme is an example of the "second sound". That is, the sound control unit 214 is comprehensively expressed as an element that sounds the first sound when the specific state is detected and sounds the second sound when the hit on the striking surface F is detected.
  • However, the first sound is not limited to the consonant of a specific phoneme, and the second sound is not limited to the vowel of a specific phoneme.
  • For example, a sound related to a preparatory movement for pronunciation (hereinafter referred to as a "preparatory sound") may be the first sound, and the sound following the preparatory movement (hereinafter referred to as the "target sound") may be the second sound.
  • The target sound is a sound that is defined by a musical note and is the object of singing or playing, while the preparatory sound is a sound produced by the preparatory movement for pronouncing the target sound.
  • For singing, a breath sound is exemplified as the preparatory sound, and the voice sung after the breath sound is exemplified as the target sound.
  • For instrumental performance, the breathing sound generated when playing a wind instrument, the fret noise of a string instrument, or the wind noise of a stick when playing a percussion instrument is exemplified as the preparatory sound, and the performance sound of the instrument following the preparatory sound is exemplified as the target sound. That is, the sound synthesized by the pronunciation control device 20 is not limited to a voice singing the music.
  • According to this configuration, the preparatory sound can be pronounced before the target sound.
  • One entire phoneme may be the first sound, and another entire phoneme following it may be the second sound; that is, the first and second sounds may each be phonemes (for example, vowels or consonants).
  • In the above embodiment, the first phoneme (an example of the first sound) is a consonant and the second phoneme (an example of the second sound) is a vowel, but each of the first phoneme and the second phoneme may be either a vowel or a consonant.
  • For example, a phoneme composed of a consonant followed by another consonant, or of a vowel followed by another vowel, is assumed; the leading phoneme of such a phoneme is exemplified as the first phoneme, and the phoneme following it as the second phoneme.
  • the distance image sensor capable of measuring the distance is illustrated as the first sensor 13, but the function of measuring the distance is not essential in the first sensor 13.
  • an image sensor may be used as the first sensor 13.
  • the first detection unit 31 may calculate the movement amount of the hand H by analyzing the image captured by the image sensor, and may estimate the distance P from the movement amount.
  • the function of capturing the image of the hand H is not essential in the first sensor 13.
  • Further, an infrared sensor that emits infrared light may be used as the first sensor 13. In this configuration, the first sensor 13 specifies the distance between the hand H and the first sensor 13 from the received intensity of the infrared light reflected by the hand H.
  • The first detection unit 31 determines that the hand H is in the specific state when the distance between the hand H and the first sensor 13 falls below a predetermined threshold value, and that it is not in the specific state when the distance exceeds the threshold value. That is, calculating the distance P itself is not essential to determining whether the hand H is in the specific state.
  • The distance between the hand H and the first sensor 13 corresponds to the sum of the distance P between the hand H and the striking surface F and the distance between the striking surface F and the first sensor 13.
  • When the distance P is at the specific distance Pz, the distance between the hand H and the first sensor 13 is also at a specific value; hence, even in this configuration, the state in which the distance P is at the specific distance Pz can be said to be the specific state.
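  • As a rough illustration of this infrared variant, the received intensity can be mapped to a distance under an inverse-square assumption (a simplification; a real sensor would need calibration). The constant K and the functions below are hypothetical.

```python
import math

K = 1.0  # hypothetical calibration constant of the infrared sensor

def distance_from_intensity(intensity: float) -> float:
    """Estimate the hand-to-sensor distance from the received IR intensity."""
    return math.sqrt(K / intensity)  # inverse-square model: intensity ~ K / d**2

def in_specific_state(intensity: float, threshold_distance: float) -> bool:
    # The hand is judged to be in the specific state when the estimated
    # hand-to-sensor distance falls below the predetermined threshold.
    return distance_from_intensity(intensity) < threshold_distance
```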
  • The function of the first detection unit 31 may also be mounted on the first sensor 13 itself. When the first sensor 13 detects the specific state, it instructs the sound control unit 214 to pronounce the consonant of the specific phoneme.
  • In the above embodiment, the hit on the striking surface F is detected by analyzing the sound signal D2, but the method of detecting the hit is not limited to this example.
  • For example, a vibration sensor that detects the vibration produced when the hand H hits the striking surface F may be used as the second sensor 15. The second sensor 15 generates a signal according to, for example, the magnitude of the vibration, and the second detection unit 32 detects the hit in response to the signal and determines that the hand H has hit the striking surface F.
  • Similarly, a pressure sensor that detects the pressure applied to the striking surface F when the hand H comes into contact with it may be used as the second sensor 15. The second sensor 15 generates a signal according to, for example, the magnitude of the pressure applied to the striking surface F, and the second detection unit 32 detects the hit in response to the signal.
  • The second sensor 15 may also be equipped with the function of the second detection unit 32. When the second sensor 15 detects the hit on the striking surface F, it instructs the sound control unit 214 to pronounce the vowel of the specific phoneme.
  • the first sensor 13 and the second sensor 15 are housed in the internal space of the housing 112, but the positions where the first sensor 13 and the second sensor 15 are installed are arbitrary.
  • The first sensor 13 and the second sensor 15 may instead be installed outside the housing 112. In a configuration in which the first sensor 13 is installed outside the housing 112, it is not essential that the upper surface of the operation receiving unit 11 (the housing 112) be formed of a light-transmitting member.
  • the striking surface F is hit by the hand H, but the object that hits the striking surface F is not limited to the hand H.
  • the type of the object is arbitrary as long as it is possible to hit the striking surface F.
  • a striking member such as a stick may be an object. The user moves the stick toward the striking surface F to strike the striking surface F.
  • the object includes both a part of the user's body (typically the hand H) and a striking member operated by the user.
  • When the object is a striking member, the first sensor 13 or the second sensor 15 may be mounted on that member.
  • the specific state is not limited to the above examples.
  • the specific state is arbitrary as long as the hand H is in the middle of moving toward the striking surface F.
  • the change in the moving direction of the hand H may be set as a specific state.
  • For example, the specific state may be the point at which the moving direction of the hand H changes from a direction away from the striking surface F to a direction approaching it, or from a direction parallel to the striking surface F to a direction perpendicular to it.
  • A change in the shape of the hand H (for example, a change from a closed fist ("goo") to an open palm ("par")) may also be set as the specific state.
  • The duration of the consonant of a specific phoneme differs depending on the type of consonant.
  • For example, the time required to pronounce the consonant "s" in the specific phoneme "sa" is about 250 ms, whereas the time required to pronounce the consonant "k" in the specific phoneme "ka" is about 30 ms. That is, the appropriate specific distance Pz differs depending on the type of consonant of the specific phoneme, so a configuration in which the first threshold value is variably set according to the consonant type may also be adopted.
  • In that configuration, the first detection unit 31 sets the first threshold value according to the type of consonant specified by the phoneme specifying unit 212, and then determines whether the hand H is in the specific state by comparing the set first threshold value with the distance P, as sketched below.
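  • A sketch of such a consonant-dependent threshold follows, using the 250 ms and 30 ms figures above; the conversion from consonant duration to a distance threshold via an assumed approach speed, and the default value, are illustrative assumptions.

```python
# Approximate consonant durations in seconds; the values for "s" and "k" are
# from the text, any others would have to be measured.
CONSONANT_DURATION = {"s": 0.250, "k": 0.030}

def first_threshold_for(phoneme: str, approach_speed: float,
                        default_duration: float = 0.050) -> float:
    """Specific distance Pz (m) chosen so the consonant ends near the hit time.

    approach_speed is an assumed estimate of the hand's speed in m/s.
    """
    duration = CONSONANT_DURATION.get(phoneme[0], default_duration)
    return approach_speed * duration  # distance the hand covers while the consonant sounds

# At 0.5 m/s, "sa" needs Pz of about 0.125 m while "ka" needs only about 0.015 m.
print(first_threshold_for("sa", 0.5), first_threshold_for("ka", 0.5))
```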
  • the operation reception unit 11 is composed of the housing 112 and the light transmission unit 114, but the operation reception unit 11 is not limited to the above examples.
  • a flat plate-shaped member may be used as the operation reception unit 11.
  • A keyboard-type controller may also be used as the operation reception unit 11. In that configuration, the synthetic data S need not designate the pitch Sx for each note: the user instructs both the pronunciation start point of each note and the pitch of the note by operating the operation reception unit 11. That is, the pitch of each note may be set according to an instruction from the user.
  • the surface of the operation reception unit 11 that the user comes into contact with when hitting corresponds to the hitting surface F.
  • the state of the user's hand H may be detected and the pronunciation may be controlled according to the detection result.
  • Conditions of the note (for example, pitch, phoneme, or duration) may be controlled according to the state of the hand H.
  • the state of the user's hand H is, for example, the moving speed of the hand H, the moving direction of the hand H, the shape of the hand H, or the like.
  • the combination of the detected hand H state and the note condition is arbitrary.
  • the user can instruct the condition of the note by changing the state of the hand H.
  • a specific configuration for controlling pronunciation according to the state of the user's hand H will be illustrated.
  • For example, the type of phoneme (that is, the pronunciation content) may be set according to the moving speed of the hand H.
  • the first detection unit 31 detects the moving speed of the hand H from the image data D1.
  • the moving speed is detected from the time change of the distance P specified from the image data D1.
  • the first detection unit 31 may detect the moving speed of the hand H by using, for example, the output from the speed sensor that detects the speed.
  • the phoneme specifying unit 212 sets the type of the specific phoneme according to the moving speed.
  • the phoneme specifying unit 212 sets the type of the specific phoneme before the hand H is in the specific state.
  • FIG. 5 is a schematic diagram showing the relationship between the moving speed of the hand H and the type of the specific phoneme.
  • FIG. 5 illustrates the specific phoneme set when the moving speed of the hand H is fast (H1) and the specific phoneme set when the moving speed is slow (H2).
  • When the moving speed of the hand H is fast, a specific phoneme including a consonant with a short duration (for example, "ta", whose consonant is [t]) is set.
  • When the moving speed of the hand H is slow, a specific phoneme including a consonant with a long duration (for example, "sa", whose consonant is [s]) is set.
  • the consonant is started to be pronounced at the arrival time t1 when the distance P becomes the specific distance Pz, and the vowel is started to be pronounced at the hit time t2.
  • When the moving speed of the hand H is fast, the time length from the arrival time t1 to the hit time t2 is shorter than when the moving speed is slow.
  • The duration or pitch of the note may also be set according to the moving speed of the hand H. The above examples set the type of specific phoneme, but phonemes other than specific phonemes may likewise be controlled according to the moving speed; a sketch of the speed-dependent selection follows.
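  • A minimal sketch of this speed-dependent selection; the cutoff speed and the two phoneme choices mirror the "ta"/"sa" example of FIG. 5 but are otherwise assumptions.

```python
FAST_SPEED = 0.8  # m/s; hypothetical boundary between "fast" and "slow" movement

def phoneme_for_speed(speed: float) -> str:
    # Fast approach -> short consonant ("ta"); slow approach -> long consonant ("sa").
    return "ta" if speed >= FAST_SPEED else "sa"
```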
  • the type of phoneme may be set according to the moving direction of hand H.
  • the user hits the hitting surface F by moving the hand H from different directions according to the desired phoneme.
  • the user can hit the hitting surface F by moving the hand H from various directions with respect to the hitting surface F.
  • the first detection unit 31 detects the moving direction of the hand H from the image data D1
  • the phoneme specifying unit 212 sets the type of phoneme according to the moving direction.
  • the phoneme specifying unit 212 sets the phoneme type before the hand H is in a specific state.
  • The duration or pitch of the note may also be set according to the moving direction of the hand H.
  • For example, the type of phoneme may be set according to the shape of the hand H.
  • The user strikes the striking surface F with the hand H in an arbitrary shape, for example by moving the fingers so that the hand H forms a rock ("goo"), scissors ("choki"), or paper ("par") shape.
  • FIG. 6 is a table showing the relationship between the shape of the hand H and the phonology.
  • the type of phoneme may be set in consideration of whether the hand H is the right hand or the left hand.
  • the state of the hand H also includes whether the user's hand H is the right hand or the left hand.
  • the first detection unit 31 detects whether the hand H is the right hand or the left hand and the shape of the hand H from the image data D1.
  • a known image analysis technique is arbitrarily adopted for detecting whether the hand H is the right hand or the left hand and the shape of the hand H.
  • the phoneme specifying unit 212 sets the phoneme type before the hand H is in a specific state.
  • The phoneme specifying unit 212 specifies the phoneme according to whether the hand is the right or left hand and according to the shape of the hand H. As illustrated in FIG. 6, for example, when the striking surface F is struck with the left hand formed into a rock ("goo") shape, the phoneme "ta" is pronounced.
  • The duration or pitch of the note may also be set according to the shape of the hand H.
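  • The mapping of FIG. 6 can be expressed as a simple lookup table, as in the sketch below; only the left-hand rock ("goo") entry is stated in the text, so the remaining assignments are hypothetical placeholders.

```python
# (hand, shape) -> phoneme. Only the left-hand "goo" entry is stated in the
# text (FIG. 6); the other assignments are hypothetical placeholders.
SHAPE_PHONEME = {
    ("left", "goo"): "ta",
    ("left", "choki"): "ka",
    ("left", "par"): "sa",
    ("right", "goo"): "na",
    ("right", "choki"): "ma",
    ("right", "par"): "ra",
}

def phoneme_for_hand(hand: str, shape: str) -> str:
    return SHAPE_PHONEME[(hand, shape)]
```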
  • As described above, at least one of the moving speed of the hand H, the moving direction of the hand H, and the shape of the hand H is detected, and the pronunciation of the phoneme is controlled according to the result of that detection.
  • The user can therefore control the pronunciation of the first sound and the second sound by changing the moving speed, moving direction, or shape of the object.
  • the state of the hand H is not limited to the moving speed of the hand H, the moving direction of the hand H, and the shape of the hand H.
  • the moving angle of the hand H (the angle at which the hand H moves with respect to the striking surface F) may be set as the state of the hand H.
  • For example, the first threshold value may be changed according to the moving speed of the hand H. The first detection unit 31 detects the moving speed of the hand H from, for example, the image data D1; the moving speed is detected before the hand H reaches the specific state. Next, the first detection unit 31 sets the first threshold value according to the moving speed: relatively large when the moving speed is fast, and relatively small when it is slow. The first detection unit 31 then compares the set first threshold value with the distance P and determines whether the distance P has reached the specific distance Pz. According to this configuration, the variation of the consonant duration with the moving speed of the hand H can be reduced.
  • the first threshold value may be changed according to the moving direction of the hand H.
  • the first detection unit 31 detects the moving direction of the hand H from, for example, the image data D1. The moving direction is detected before the hand H is in a specific state.
  • For example, the first detection unit 31 sets the first threshold value to a first value when the moving direction of the hand H is a first direction, and to a second value larger than the first value when the moving direction is a second direction different from the first direction. The first detection unit 31 then compares the set first threshold value with the distance P and determines whether the distance P has reached the specific distance Pz.
  • The duration of the consonant of the specific phoneme changes according to the first threshold value: it becomes longer as the first threshold value becomes larger, and shorter as the first threshold value becomes smaller.
  • In this way, the first threshold value may be set variably.
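  • The rationale behind the speed-dependent threshold can be stated in one line: assuming an approximately constant approach speed v, the consonant sounds from the arrival time t1 to the hit time t2, so its duration is t2 - t1 = Pz / v. Choosing Pz proportional to v therefore keeps the consonant duration Pz / v roughly constant regardless of how fast the hand moves.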
  • the time point at which the pronunciation of the phoneme is finished may be controlled according to the movement of the hand H by the user.
  • the pronunciation of the phoneme may be terminated when the hand H separates from the striking surface F after striking the striking surface F.
  • FIG. 7 is a block diagram illustrating the configuration of the detection unit 213 according to the modified example.
  • the detection unit 213 includes a third detection unit 33 in addition to the first detection unit 31 and the second detection unit 32.
  • the third detection unit 33 detects that the hand H is separated from the striking surface F.
  • For example, the third detection unit 33 detects that the hand H has separated from the striking surface F by analyzing the image data D1.
  • the third detection unit 33 may detect that the hand H is separated from the striking surface F by using the output from the pressure sensor that detects the pressure applied to the striking surface F.
  • When the separation of the hand H from the striking surface F is detected, the pronunciation control unit 214 ends the pronunciation of the phoneme.
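  • A sketch of such release detection follows; the pressure-based press/release transition test and the end_phoneme callback are assumptions, not the patent's implementation.

```python
class ThirdDetector:
    """Detects that the hand H has separated from the striking surface F."""

    def __init__(self, pressure_threshold: float):
        self.pressure_threshold = pressure_threshold
        self._was_pressed = False

    def released(self, pressure: float) -> bool:
        pressed = pressure > self.pressure_threshold
        separated = self._was_pressed and not pressed  # press -> release transition
        self._was_pressed = pressed
        return separated

# Usage sketch: after the hit, poll the pressure sensor and end the phoneme on release.
# if third_det.released(pressure_sample):
#     end_phoneme(note)  # hypothetical call into the pronunciation control unit 214
```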
  • In the above embodiment, the striking surface F is hit by the user's actual hand H, but a configuration in which the user hits a virtual striking surface F using haptics (tactile-feedback) technology may also be adopted.
  • For example, the user strikes a striking surface F prepared in a virtual space by operating a controller that can move a virtual hand in the virtual space displayed on a display device. By mounting on the controller a vibration motor that vibrates when the striking surface F is hit in the virtual space, the user perceives the sensation of actually hitting the striking surface F.
  • When the hand in the virtual space reaches the specific state, the consonant of the specific phoneme is pronounced, and when the striking surface F in the virtual space is hit, the vowel of the specific phoneme is pronounced.
  • the striking surface F may be a surface in the virtual space.
  • the hand H may be a hand in the virtual space.
  • the function of the sound control device 20 illustrated above is realized by the cooperation of one or more processors constituting the control device 21 and the program stored in the storage device 23.
  • the program according to the present disclosure may be provided and installed on a computer in a form stored in a computer-readable recording medium.
  • The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but known recording media of any other format, such as semiconductor recording media and magnetic recording media, are also included.
  • A non-transitory recording medium includes any recording medium other than a transitory, propagating signal; volatile recording media are not excluded. Further, in a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
  • The sound control method according to one aspect of the present disclosure detects that an object is in a specific state while the object is moving toward a surface, sounds a first sound when the specific state is detected, detects a striking event in which the object hits the surface as a result of the movement of the object, and sounds a second sound when the striking event is detected.
  • According to the above aspect, the first sound is pronounced when the object reaches the specific state while moving toward the surface, and the second sound is pronounced when the object hits the surface. Therefore, the first sound can be pronounced before the object hits the surface. Further, since the second sound is pronounced upon detection of the hit of the object on the surface, the first sound can be sounded before the second sound while maintaining the feel of the operation for sounding the second sound.
  • In one example, the first sound is a first phoneme, and the second sound is a second phoneme different from the first phoneme.
  • In a further example, the first phoneme is a consonant and the second phoneme is a vowel.
  • a consonant is pronounced when the object is in a specific state, and a vowel is pronounced following the consonant when the object hits the surface. Therefore, it is possible to reduce the perception that the pronunciation of a phoneme composed of consonants and vowels is delayed.
  • In another example, the first sound is a sound related to a preparatory movement for pronunciation, and the second sound is the sound following the preparatory movement.
  • the specific state is that the distance between the object and the surface is at a specific distance.
  • the first sound is produced when the distance between the object and the surface becomes a specific distance. That is, the first sound is pronounced in the middle of the movement of the object toward the surface. Therefore, the first sound can be pronounced without the user being aware of the operation for pronouncing the first sound.
  • In one example, in the detection of the striking event, the hit of the object is detected by analyzing a sound signal generated by a sound collecting device.
  • the impact of the object on the surface is detected by analyzing the sound signal generated by the sound collecting device. Therefore, the striking sound generated by striking the surface can be used for the pronunciation of the second sound.
  • In one example, at least one of the moving speed of the object, the moving direction of the object, and the shape of the object is detected, and the pronunciation of at least one of the first sound and the second sound is controlled according to the result of that detection.
  • According to the above aspect, the pronunciation of at least one of the first sound and the second sound is controlled according to at least one of the speed at which the object moves, the direction in which the object moves, and the shape of the object. Therefore, the user can control the pronunciation of the first sound and the second sound by changing the moving speed, moving direction, or shape of the object.
  • The sound control device according to one aspect of the present disclosure includes a detection unit that detects that an object is in a specific state while the object is moving toward a surface and detects a striking event in which the object hits the surface as a result of the movement of the object, and a sound control unit that sounds a first sound when the specific state is detected and sounds a second sound when the striking event is detected.
  • the pronunciation control method and the pronunciation control device of the present disclosure can start pronunciation before an object such as a user's finger comes into contact with a surface such as a key.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention relates to a sound output control device (20) comprising: a detection unit (213) for detecting that an object is in a specific state while the object is moving toward a surface and for detecting a striking event in which the object hits the surface as a result of the movement; and a sound output control unit (214) for causing a first sound to be output when the specific state is detected and causing a second sound to be output when the striking event is detected.
PCT/JP2020/035785 2019-09-26 2020-09-23 Sound output control method and sound output control device WO2021060273A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019175253A JP7380008B2 (ja) 2019-09-26 2019-09-26 発音制御方法および発音制御装置 (Pronunciation control method and pronunciation control device)
JP2019-175253 2019-09-26

Publications (1)

Publication Number Publication Date
WO2021060273A1 true WO2021060273A1 (fr) 2021-04-01

Family

ID=75157779

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/035785 WO2021060273A1 (fr) Sound output control method and sound output control device

Country Status (2)

Country Link
JP (1) JP7380008B2 (fr)
WO (1) WO2021060273A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022255052A1 * 2021-06-03 2022-12-08 ヤマハ株式会社 Système à percussion (Percussion system)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004061753A * 2002-07-26 2004-02-26 Yamaha Corp 歌唱音声を合成する方法および装置 (Method and device for synthesizing singing voice)
JP2014098801A * 2012-11-14 2014-05-29 Yamaha Corp 音声合成装置 (Speech synthesis device)
JP2014186307A * 2013-02-22 2014-10-02 Yamaha Corp 音声合成装置 (Speech synthesis device)
JP2017146555A * 2016-02-19 2017-08-24 ヤマハ株式会社 演奏支援のための装置および方法 (Device and method for performance support)

Also Published As

Publication number Publication date
JP7380008B2 (ja) 2023-11-15
JP2021051249A (ja) 2021-04-01

Similar Documents

Publication Publication Date Title
US10490181B2 (en) Technology for responding to remarks using speech synthesis
JP6140579B2 (ja) 音響処理装置、音響処理方法、及び音響処理プログラム (Sound processing device, sound processing method, and sound processing program)
JP5821824B2 (ja) 音声合成装置 (Speech synthesis device)
US6392132B2 Musical score display for musical performance apparatus
JP5162938B2 (ja) 楽音発生装置及び鍵盤楽器 (Musical sound generating device and keyboard instrument)
JP2002268699A (ja) 音声合成装置及び音声合成方法、並びにプログラムおよび記録媒体 (Speech synthesis device and method, program, and recording medium)
US8785761B2 Sound-generation controlling apparatus, a method of controlling the sound-generation controlling apparatus, and a program recording medium
JP7383943B2 (ja) 制御システム、制御方法、及びプログラム (Control system, control method, and program)
JP5040778B2 (ja) 音声合成装置、方法及びプログラム (Speech synthesis device, method, and program)
WO2021060273A1 (fr) Sound output control method and sound output control device
JP4720563B2 (ja) 楽音制御装置 (Musical sound control device)
JP4654513B2 (ja) 楽器 (Musical instrument)
JP5151401B2 (ja) 音声処理装置 (Speech processing device)
JP2017146555A (ja) 演奏支援のための装置および方法 (Device and method for performance support)
JP4644893B2 (ja) 演奏装置 (Performance device)
JP4244338B2 (ja) 音出力制御装置、楽曲再生装置、音出力制御方法、そのプログラム、および、そのプログラムを記録した記録媒体 (Sound output control device, music playback device, sound output control method, program, and recording medium storing the program)
CN112466266B (zh) 控制系统以及控制方法 (Control system and control method)
JP6090043B2 (ja) 情報処理装置、及びプログラム (Information processing device and program)
JP3584585B2 (ja) 電子楽器 (Electronic musical instrument)
JP4544258B2 (ja) 音響変換装置およびプログラム (Acoustic conversion device and program)
JP2002304187A (ja) 音声合成装置および音声合成方法、並びにプログラムおよび記録媒体 (Speech synthesis device and method, program, and recording medium)
JP6260499B2 (ja) 音声合成システム、及び音声合成装置 (Speech synthesis system and speech synthesis device)
JPH11338492A (ja) 話者認識装置 (Speaker recognition device)
JP2004294659A (ja) 音声認識装置 (Speech recognition device)
JP2016177275A (ja) 演奏支援のための装置および方法 (Device and method for performance support)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20867292

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20867292

Country of ref document: EP

Kind code of ref document: A1