WO2021060273A1 - Sound output control method and sound output control device - Google Patents

Sound output control method and sound output control device

Info

Publication number
WO2021060273A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
hand
phoneme
specific
pronunciation
Prior art date
Application number
PCT/JP2020/035785
Other languages
French (fr)
Japanese (ja)
Inventor
Tatsuya IRIYAMA
Keijiro SAINO
Original Assignee
Yamaha Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2021060273A1 publication Critical patent/WO2021060273A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 - Details of electrophonic musical instruments
    • G10H 1/18 - Selecting circuits
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers

Definitions

  • This disclosure relates to a technique for controlling pronunciation.
  • In the prior art (Patent Document 1), pronunciation is started only upon the user's contact; depending on the content to be pronounced, however, it may be desirable to start pronunciation earlier.
  • In view of this, the purpose of the present disclosure is to start pronunciation before an object such as a user's finger comes into contact with a surface such as a key.
  • To that end, the sound control method of the present disclosure detects that an object is in a specific state while the object is moving toward a surface, sounds a first sound when the specific state is detected, detects a striking event in which the object hits the surface as a result of the movement, and sounds a second sound when the striking event is detected.
  • The sound control device of the present disclosure includes a detection unit that detects that an object is in a specific state while the object is moving toward a surface and detects a striking event in which the object hits the surface as a result of the movement, and a sound control unit that sounds a first sound when the specific state is detected and a second sound when the striking event is detected.
  • FIG. 1 is a block diagram illustrating the configuration of the pronunciation control system 100 according to the embodiment of the present disclosure.
  • the pronunciation control system 100 synthesizes a virtual voice in which a specific singer sings a musical piece. Each phoneme that composes the synthesized voice is pronounced at the time instructed by the user.
  • the pronunciation control system 100 includes an operation unit 10 and a pronunciation control device 20.
  • The user indicates to the pronunciation control device 20 the time at which the pronunciation of each phoneme should start (hereinafter the "pronunciation start point") by striking the operation unit 10 with his or her hand H.
  • the pronunciation control device 20 synthesizes voice by pronouncing each phoneme according to an instruction from the user.
  • the operation unit 10 includes an operation reception unit 11, a first sensor 13, and a second sensor 15.
  • the operation reception unit 11 includes a surface (hereinafter referred to as “striking surface”) F that is hit by the user's hand H.
  • the hand H is an example of an "object” that hits the striking surface F.
  • the operation receiving unit 11 includes a housing 112 and a light transmitting unit 114.
  • the housing 112 is, for example, a hollow structure having an opening at the top.
  • The light transmitting portion 114 is a flat plate formed of a material that transmits light in the wavelength range detectable by the first sensor 13.
  • the light transmitting portion 114 is installed so as to close the opening of the housing 112.
  • the surface of the light transmitting portion 114 on the side opposite to the internal space of the housing 112 corresponds to the striking surface F.
  • the user hits the striking surface F with the hand H in order to indicate the pronunciation start point of each phoneme. Specifically, the user hits the hitting surface F by moving the hand H from above the hitting surface F toward the hitting surface F. A phoneme is pronounced according to the time when the hand H hits the striking surface F.
  • the first sensor 13 and the second sensor 15 are housed inside the housing 112.
  • the first sensor 13 is a sensor for detecting the state of the user's hand H.
  • a distance image sensor that measures the distance between the subject and the imaging surface for each pixel is used as the first sensor 13.
  • the hand H moving toward the striking surface F is imaged by the first sensor 13.
  • the first sensor 13 is installed, for example, in the central portion of the bottom surface of the housing 112, and images the hand H moving toward the striking surface F from the palm side (inside of the housing 112).
  • The first sensor 13 can detect light in a specific wavelength range; by receiving light that arrives from the hand H located above the striking surface F through the light transmitting portion 114, it generates data D1 representing an image of the hand H (hereinafter "image data").
  • the light transmitting portion 114 is formed of a member that transmits light that can be detected by the first sensor 13.
  • the image data D1 is transmitted to the sound control device 20.
  • the first sensor 13 and the sound control device 20 can communicate with each other wirelessly or by wire.
  • the image data D1 is repeatedly generated at predetermined intervals.
  • The second sensor 15 is a sensor for detecting the impact of the hand H on the striking surface F.
  • a sound collecting device that collects ambient sounds and generates a sound signal D2 representing the collected sounds is used as the second sensor 15.
  • the second sensor 15 collects the hitting sound generated when the user's hand H hits the hitting surface F.
  • the sound signal D2 is transmitted to the sound control device 20.
  • the second sensor 15 and the sound control device 20 can communicate with each other wirelessly or by wire.
  • FIG. 2 is a block diagram illustrating the configuration of the sound control device 20.
  • The sound control device 20 synthesizes voice according to the user's action of striking the striking surface F.
  • the sound control device 20 includes a control device 21, a storage device 23, and a sound emitting device 25.
  • the control device 21 is, for example, a single or a plurality of processors that control each element of the sound control device 20.
  • The control device 21 is composed of one or more types of processors, such as a CPU (Central Processing Unit), SPU (Sound Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • The control device 21 executes a program stored in the storage device 23 to realize a plurality of functions (the phoneme specifying unit 212, the detection unit 213, and the sound control unit 214) for generating a signal V (hereinafter the "synthetic signal") representing the voice of a singer singing the musical piece.
  • the storage device 23 is a single or a plurality of memories composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • the storage device 23 stores a program executed by the control device 21 and various data used by the control device 21.
  • the storage device 23 may be configured by combining a plurality of types of recording media.
  • the storage device 23 may be a portable recording medium that can be attached to and detached from the sound control device 20, or an external recording medium (for example, online storage) that the sound control device 20 can communicate with via a communication network.
  • the storage device 23 stores data (hereinafter referred to as “synthetic data”) S representing sounds to be synthesized by the sound control device 20.
  • Synthetic data S is data that specifies the content of the music.
  • the synthetic data S is data for designating the pitch Sx and the phoneme Sy for each of the plurality of notes constituting the music.
  • the pitch Sx is any one of a plurality of pitches (for example, a note number).
  • The phoneme Sy is the pronunciation content to be uttered together with the pronunciation of the note.
  • the phoneme Sy corresponds to one syllable (pronunciation unit) constituting the lyrics of the music.
  • a typical phoneme Sy in Japanese is a combination of a consonant and a vowel immediately after it, or a single vowel.
  • the synthetic signal V is generated by voice synthesis using the synthetic data S.
  • the sounding start point of each note is controlled according to the action of striking the striking surface F by the user.
  • The order of the plurality of notes constituting the music is specified in the synthetic data S, but the pronunciation start point of each note is not.
  • The phoneme specifying unit 212 determines whether or not the phoneme Sy specified by the synthetic data S for each note is a phoneme composed of a consonant and a vowel (hereinafter a "specific phoneme"). Specifically, the phoneme specifying unit 212 judges a phoneme Sy composed of a consonant followed by a vowel to be a specific phoneme, and a phoneme Sy composed of a single vowel to be a phoneme other than a specific phoneme.
  • the user takes the rhythm of the music by hitting the hitting surface F in sequence. Specifically, the user hits the hitting surface F at each time when the pronunciation of each note in the music should be started.
  • The pronunciation start point of the vowel following the consonant is what is audibly perceived as the pronunciation start point of the specific phoneme as a whole. Therefore, in a configuration in which the consonant of a specific phoneme begins sounding at the moment the user strikes the striking surface F (hereinafter the "hit time") and the vowel follows the consonant, the specific phoneme is perceived as starting later than the point the user recognizes as the start of the note. In the present embodiment, the pronunciation of a specific phoneme therefore begins before the hit time, which reduces the perceived delay in hearing the specific phoneme.
  • FIG. 3 is a graph showing the relationship between the distance P between the hand H and the striking surface F and the time.
  • The distance P can also be described as the height of the hand H above the striking surface F. As the hand H moves toward the striking surface F, the distance P decreases over time, and when the hand H strikes the striking surface F, the distance P becomes 0.
  • The specific state means that the distance P reaches a specific distance (hereinafter the "specific distance") Pz in the process of decreasing. That is, the specific state is a state of the hand H before it comes into contact with the striking surface F.
  • the distance P may be, for example, the distance between the reference point (for example, the center point) on the striking surface F and the hand H.
  • FIG. 3 shows the time t1 at which the hand H enters the specific state (hereinafter the "arrival time") and the hit time t2.
  • The consonant of the specific phoneme is sounded at the arrival time t1 (that is, the time when the distance P reaches the specific distance Pz), and the vowel of the specific phoneme is sounded at the hit time t2 (that is, the time when the distance P becomes 0).
  • the detection unit 213 of FIG. 2 includes a first detection unit 31 and a second detection unit 32.
  • the first detection unit 31 detects that the hand H is in a specific state.
  • the first detection unit 31 specifies the distance P by using the image data D1.
  • The first detection unit 31 estimates the region of the hand H from the image data D1 by image recognition such as contour extraction, and specifies the distance P of the hand H from the distances measured by the first sensor 13 for the pixels in that region. Any known technique may be used to specify the distance P.
  • the first detection unit 31 determines whether or not the distance P has reached the specific distance Pz by comparing the distance P with the first threshold value.
  • the first threshold value is set according to, for example, a specific distance Pz.
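A minimal sketch of this first-stage detection, assuming a depth frame and a boolean hand mask obtained by image recognition; all names here are hypothetical, since the patent does not prescribe an implementation:

```python
import numpy as np

class FirstDetector:
    """Sketch of the first detection unit (31): flags the specific state when
    the hand-to-surface distance P falls to the first threshold while decreasing."""

    def __init__(self, first_threshold: float):
        self.first_threshold = first_threshold  # corresponds to the specific distance Pz
        self.prev_p = None

    def update(self, p: float) -> bool:
        """Feed the latest distance P; returns True when the specific state is detected."""
        decreasing = self.prev_p is not None and p < self.prev_p
        self.prev_p = p
        # The specific state is only checked while the hand approaches the surface.
        return decreasing and p <= self.first_threshold

def estimate_p(depth_frame: np.ndarray, hand_mask: np.ndarray, sensor_to_surface: float) -> float:
    """Hand height P above the striking surface: mean sensor-to-hand distance
    over the hand pixels, minus the fixed sensor-to-surface distance."""
    return float(depth_frame[hand_mask].mean() - sensor_to_surface)
```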
  • the second detection unit 32 detects that the hand H has hit the striking surface F due to the movement of the hand H. Specifically, the second detection unit 32 detects that the hand H has hit the striking surface F by analyzing the sound signal D2. First, the second detection unit 32 identifies the volume of the sound represented by the sound signal D2 (hereinafter referred to as “sound collection level”) by analyzing the sound signal D2. Any known sound analysis technique is used for the analysis of the sound signal D2. Next, the second detection unit 32 determines whether or not the hand H has hit the striking surface F by comparing the sound collection level with the second threshold value. For example, when the hand H hits the hitting surface F, a hitting sound is generated. The second threshold value is set assuming, for example, the hitting sound when the hand H hits the hitting surface F.
  • When the sound collection level is below the second threshold value, it is determined that the sound signal D2 does not include a striking sound, that is, that the striking surface F has not been struck.
  • When the sound collection level exceeds the second threshold value, it is determined that the sound signal D2 contains a striking sound, that is, that the hand H has struck the striking surface F.
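A similarly minimal sketch of the second-stage detection; frame-based RMS metering is an assumption, since the text only calls for a known sound-analysis technique:

```python
import numpy as np

def hit_detected(frame_d2: np.ndarray, second_threshold: float) -> bool:
    """Sketch of the second detection unit (32): report a hit when the sound
    collection level of the microphone signal D2 exceeds the second threshold."""
    level = float(np.sqrt(np.mean(frame_d2.astype(np.float64) ** 2)))  # RMS level
    return level > second_threshold
```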
  • In practice, a slight time difference inevitably occurs between the hit time t2 at which the hand H strikes the striking surface F and the time at which the hit (striking event) is detected. In the following description, the hit time t2 and the detection time are treated as substantially the same time point.
  • the pronunciation control unit 214 generates a composite signal V representing the sound specified by the composite data S.
  • The synthetic signal V is a signal representing a voice that pronounces, for each note, the phoneme Sy specified by the synthetic data S for that note at the pitch Sx specified for that note.
  • Any known technique may be adopted for the speech synthesis.
  • For example, concatenative speech synthesis, which generates the synthetic signal V by connecting a plurality of voice segments, or statistical-model speech synthesis, which generates the synthetic signal V using a statistical model such as an HMM (Hidden Markov Model) or a neural network, is used to generate the synthetic signal V.
  • The pronunciation start point of each phoneme Sy designated by the synthetic data S is controlled according to the results of detection by the first detection unit 31 and the second detection unit 32.
  • For a phoneme other than a specific phoneme, the sound control unit 214 causes the phoneme to be pronounced when the striking surface F is struck. Specifically, the sound control unit 214 causes the phoneme to be pronounced when the second detection unit 32 detects the hit; that is, a synthetic signal V is generated in which the sounding start point of the entire phoneme is set at the hit time t2.
  • When the phoneme specifying unit 212 determines that the phoneme Sy specified by the synthetic data S is a specific phoneme, the sound control unit 214 causes the specific phoneme to begin sounding before the striking surface F is struck.
  • Specifically, the sound control unit 214 causes the consonant of the specific phoneme to be pronounced when the first detection unit 31 detects the specific state, and the vowel of the specific phoneme to be pronounced when the second detection unit 32 detects the hit. That is, a synthetic signal V is generated in which the pronunciation start point of the consonant of the specific phoneme is set at the arrival time t1 and the pronunciation start point of the following vowel is set at the hit time t2. The synthetic signal V is supplied to the sound emitting device 25.
  • The sound emitting device 25 (for example, a speaker) is a reproduction device that emits the sound represented by the synthetic signal V. Sound in which the pronunciation start point of each phoneme Sy is controlled is therefore emitted for the musical piece, and the perceived delay of specific phonemes is reduced.
  • FIG. 4 is a flowchart of processing of the control device 21.
  • The user strikes the striking surface F whenever the pronunciation of a note in the musical piece should start. That is, the striking surface F is struck by the hand H once for each note.
  • The process of FIG. 4 is executed for each note of the synthetic data S. The note being processed in FIG. 4 is referred to as the "target note".
  • a process of specifying the distance P by the first detection unit 31 and a process of specifying the sound collection level by the second detection unit 32 are executed.
  • the process of specifying the distance P and the process of specifying the sound collection level are repeatedly executed in a cycle shorter than the cycle in which the process of FIG. 4 is executed.
  • The phoneme specifying unit 212 determines whether or not the phoneme Sy of the target note in the synthetic data S is a specific phoneme (Sa1).
  • When the phoneme Sy of the target note is a specific phoneme, the first detection unit 31 determines whether or not the hand H is in the specific state while moving toward the striking surface F (Sa2); that is, whether the distance P has reached the specific distance Pz in the course of decreasing. Specifically, the first detection unit 31 first determines whether the distance P is decreasing, and if so, determines whether the hand H is in the specific state by comparing the distance P with the first threshold value. While the distance P is increasing, no determination of the specific state is made.
  • When the specific state is detected, the sound control unit 214 causes the consonant of the specific phoneme to sound (Sa3). Specifically, the sound control unit 214 generates a synthetic signal V in which the sounding start point of the consonant of the specific phoneme is set at the time the specific state was detected, and supplies it to the sound emitting device 25. That is, the consonant of the specific phoneme is pronounced at the time the specific state is detected (the arrival time t1).
  • the process of step Sa2 is repeatedly executed until the hand H is in the specific state.
  • the second detection unit 32 determines whether or not the hand H has hit the striking surface F (Sa4). Specifically, by comparing the sound collection level with the second threshold value, it is determined whether or not the hand H has hit the striking surface F.
  • When the hit is detected, the sound control unit 214 causes the vowel following the consonant of the specific phoneme to sound (Sa5). Specifically, the sound control unit 214 generates a synthetic signal V in which the sounding start point of that vowel is set at the time the hit on the striking surface F was detected, and supplies it to the sound emitting device 25.
  • The process of step Sa4 is repeatedly executed until the hand H reaches and strikes the striking surface F.
  • the pronunciation of the specific phoneme is started before the hand H hits the striking surface F.
  • For a phoneme other than a specific phoneme, steps Sa2 and Sa3 are omitted and the process of step Sa4 is executed. That is, for phonemes other than specific phonemes, pronunciation of the phoneme starts at the hit time t2.
  • The continuation length of each note may be a fixed time length, or a time length specified for each note by the synthetic data S.
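Putting the two detectors together, the per-note flow of FIG. 4 (steps Sa1 to Sa5) can be sketched as below. `measure_p`, `hit_sensor.poll`, and the `synth` methods are hypothetical stand-ins for the distance measurement, hit detection, and synthetic-signal generation described above, not APIs from the patent:

```python
def process_note(note, first_detector, hit_sensor, synth, measure_p):
    """One pass of the FIG. 4 flow for a single target note."""
    if note.is_specific_phoneme:                        # Sa1: consonant + vowel?
        while not first_detector.update(measure_p()):   # Sa2: wait for the specific state
            pass
        synth.start_consonant(note)                     # Sa3: consonant at arrival time t1
        while not hit_sensor.poll():                    # Sa4: wait for the hit
            pass
        synth.start_vowel(note)                         # Sa5: vowel at hit time t2
    else:
        while not hit_sensor.poll():                    # Sa4 only
            pass
        synth.start_phoneme(note)                       # whole phoneme sounds at hit time t2
```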
  • As described above, the consonant of a specific phoneme is pronounced when the hand H is detected to be in the specific state, and the vowel of the specific phoneme is pronounced when a hit on the striking surface F is detected. The consonant of the specific phoneme can therefore be pronounced before the hand H strikes the striking surface F, which reduces the perception that the specific phoneme is delayed. Further, since the vowel of the specific phoneme is triggered by detecting the impact of the hand H on the striking surface F, the consonant can precede the vowel while preserving the feel of the striking operation that produces the specific phoneme.
  • In this embodiment, the specific state is the state in which the distance P between the hand H and the striking surface F equals the specific distance Pz; that is, a state partway through the hand H's movement toward the striking surface F is detected as the specific state. The consonant can therefore be pronounced without the user being conscious of any separate operation for pronouncing it. Further, since the impact of the hand H on the striking surface F is detected by analyzing the sound signal D2, the vowel of the specific phoneme can be pronounced at the moment the striking sound is produced.
  • The sound produced upon detecting the specific state corresponds to the "first sound", and the sound produced upon detecting the impact on the striking surface F corresponds to the "second sound". In this embodiment, the consonant of the specific phoneme is an example of the first sound, and its vowel is an example of the second sound. That is, the sound control unit 214 can be comprehensively expressed as an element that sounds the first sound when the specific state is detected and sounds the second sound when a hit on the striking surface F is detected.
  • However, the first sound is not limited to the consonant of a specific phoneme, and the second sound is not limited to the vowel of a specific phoneme.
  • For example, a sound related to a preparatory movement for pronunciation (hereinafter a "preparatory sound") may be used as the first sound, and the sound following the preparatory movement (hereinafter the "target sound") as the second sound. A target sound is a sound defined by a musical note and is the object of singing or playing; a preparatory sound is a sound produced by the preparatory movement for pronouncing the target sound.
  • For singing, a breath sound is an example of a preparatory sound, and the voice sung after the breath is an example of the target sound. For instrumental performance, the intake of breath when playing a wind instrument, the fret noise of a string instrument, or the swish of a stick when playing a percussion instrument are examples of preparatory sounds, and the performance sound of the instrument that follows is an example of the target sound. That is, the sound synthesized by the pronunciation control device 20 is not limited to a voice singing the musical piece. With this configuration, the preparatory sound can be pronounced before the target sound itself.
  • Alternatively, an entire phoneme may be used as the first sound, and the entire phoneme that follows it as the second sound.
  • In a configuration in which the first and second sounds are phonemes (for example, vowels or consonants), the embodiment above takes the first phoneme (an example of the first sound) to be a consonant and the second phoneme (an example of the second sound) to be a vowel, but each of the first and second phonemes may be either a vowel or a consonant. For example, for a phoneme composed of a consonant followed by another consonant, or of a vowel followed by another vowel, the leading phoneme is an example of the first phoneme and the following phoneme is an example of the second phoneme.
  • In the embodiment above, a distance image sensor capable of measuring distance is illustrated as the first sensor 13, but a distance-measuring function is not essential to the first sensor 13.
  • an image sensor may be used as the first sensor 13.
  • the first detection unit 31 may calculate the movement amount of the hand H by analyzing the image captured by the image sensor, and may estimate the distance P from the movement amount.
  • the function of capturing the image of the hand H is not essential in the first sensor 13.
  • For example, an infrared sensor that emits infrared light may be used as the first sensor 13. In that configuration, the first sensor 13 specifies the distance between the hand H and the first sensor 13 from the received intensity of the infrared light reflected by the hand H.
  • The first detection unit 31 determines that the hand H is in the specific state when the distance between the hand H and the first sensor 13 falls below a predetermined threshold value, and that it is not in the specific state when the distance exceeds the threshold value. That is, calculating the distance P itself is not essential to determining whether the hand H is in the specific state.
  • the distance between the hand H and the first sensor 13 corresponds to the sum of the distance P between the hand H and the striking surface F and the distance between the striking surface F and the first sensor 13.
  • When the distance P between the hand H and the striking surface F is at the specific distance Pz, the distance between the hand H and the first sensor 13 also takes a specific value, so even in this configuration the specific state can be said to be the distance P being at the specific distance Pz.
  • The function of the first detection unit 31 may also be mounted on the first sensor 13 itself. When the first sensor 13 detects the specific state, it instructs the sound control unit 214 to pronounce the consonant of the specific phoneme.
  • the impact on the striking surface F is detected by analyzing the sound signal D2, but the method for detecting the impact is not limited to the above examples.
  • For example, a vibration sensor that detects the vibration produced when the hand H strikes the striking surface F may be used as the second sensor 15. The second sensor 15 generates a signal according to, for example, the magnitude of the vibration, and the second detection unit 32 detects the hit from that signal and determines that the hand H has struck the striking surface F.
  • a pressure sensor that detects the pressure applied to the striking surface F when the hand H comes into contact with the striking surface F may be used as the second sensor 15.
  • the second sensor 15 generates a signal according to, for example, the magnitude of the pressure applied to the striking surface F.
  • the second detection unit 32 detects the impact in response to the signal.
  • the second sensor 15 may be equipped with the function of the second detection unit 32. When the second sensor 15 detects a hit on the hitting surface F, the second sensor 15 instructs the sound control unit 214 to pronounce a vowel of a specific phoneme.
  • the first sensor 13 and the second sensor 15 are housed in the internal space of the housing 112, but the positions where the first sensor 13 and the second sensor 15 are installed are arbitrary.
  • the first sensor 13 and the second sensor 15 may be installed outside the housing 112. In the configuration in which the first sensor 13 is installed outside the housing 112, it is not essential that the upper surface of the housing 112 is formed of a light-transmitting member in the operation receiving unit 11.
  • the striking surface F is hit by the hand H, but the object that hits the striking surface F is not limited to the hand H.
  • the type of the object is arbitrary as long as it is possible to hit the striking surface F.
  • a striking member such as a stick may be an object. The user moves the stick toward the striking surface F to strike the striking surface F.
  • the object includes both a part of the user's body (typically the hand H) and a striking member operated by the user.
  • the first sensor 13 or the second sensor 15 may be mounted on the member.
  • the specific state is not limited to the above examples.
  • the specific state is arbitrary as long as the hand H is in the middle of moving toward the striking surface F.
  • the change in the moving direction of the hand H may be set as a specific state.
  • For example, the moment at which the moving direction of the hand H changes from moving away from the striking surface F to approaching it, or from horizontal to perpendicular to the striking surface F, may be set as the specific state.
  • A change in the shape of the hand H (for example, from a closed fist to an open hand) may also be set as the specific state.
  • The continuation length of the consonant of a specific phoneme differs depending on the type of the consonant.
  • For example, the time required to pronounce the consonant "s" in the specific phoneme "sa" is about 250 ms, whereas the time required to pronounce the consonant "k" in the specific phoneme "ka" is about 30 ms. That is, the appropriate specific distance Pz differs depending on the type of consonant of the specific phoneme, so a configuration in which the first threshold value is set variably according to the consonant type may also be adopted.
  • Specifically, the first detection unit 31 sets the first threshold value according to the consonant type determined by the phoneme specifying unit 212, and then determines whether the hand H is in the specific state by comparing the distance P with the threshold value thus set.
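A minimal sketch of this consonant-dependent threshold, using the durations quoted above; deriving the threshold as consonant duration times approach speed is an illustrative assumption, not a rule stated in the text:

```python
# Approximate consonant durations from the text (seconds).
CONSONANT_DURATION = {"s": 0.250, "k": 0.030}

def first_threshold_for(consonant: str, approach_speed: float) -> float:
    """Distance the hand covers while the consonant sounds, used as the
    specific distance Pz; 100 ms is a hypothetical default duration."""
    return CONSONANT_DURATION.get(consonant, 0.100) * approach_speed
```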
  • the operation reception unit 11 is composed of the housing 112 and the light transmission unit 114, but the operation reception unit 11 is not limited to the above examples.
  • a flat plate-shaped member may be used as the operation reception unit 11.
  • A keyboard-type operator may also be used as the operation reception unit 11. In that configuration, it is not essential that the synthetic data S specify the pitch Sx for each note: the user may indicate the sounding start point of each note and its pitch by operating the operation reception unit 11. That is, the pitch of each note may be set according to an instruction from the user.
  • the surface of the operation reception unit 11 that the user comes into contact with when hitting corresponds to the hitting surface F.
  • the state of the user's hand H may be detected and the pronunciation may be controlled according to the detection result.
  • Conditions of a note (for example, its pitch, phoneme, or continuation length) may be set according to the detected state of the hand H.
  • the state of the user's hand H is, for example, the moving speed of the hand H, the moving direction of the hand H, the shape of the hand H, or the like.
  • the combination of the detected hand H state and the note condition is arbitrary.
  • the user can instruct the condition of the note by changing the state of the hand H.
  • a specific configuration for controlling pronunciation according to the state of the user's hand H will be illustrated.
  • the type of phoneme (that is, pronunciation content) may be set according to the movement speed of hand H.
  • the first detection unit 31 detects the moving speed of the hand H from the image data D1.
  • the moving speed is detected from the time change of the distance P specified from the image data D1.
  • the first detection unit 31 may detect the moving speed of the hand H by using, for example, the output from the speed sensor that detects the speed.
  • the phoneme specifying unit 212 sets the type of the specific phoneme according to the moving speed.
  • the phoneme specifying unit 212 sets the type of the specific phoneme before the hand H is in the specific state.
  • FIG. 5 is a schematic diagram showing the relationship between the moving speed of the hand H and the type of the specific phoneme.
  • FIG. 5 illustrates the specific phoneme set when the moving speed of the hand H is fast (H1) and the specific phoneme set when the moving speed is slow (H2).
  • When the moving speed is fast, a specific phoneme containing a consonant with a short duration (for example, "ta", whose consonant is [t]) is set; when the moving speed is slow, a specific phoneme containing a consonant with a long duration (for example, "sa") is set.
  • the consonant is started to be pronounced at the arrival time t1 when the distance P becomes the specific distance Pz, and the vowel is started to be pronounced at the hit time t2.
  • When the moving speed of the hand H is fast, the time length from the arrival time t1 to the hit time t2 is shorter than when the moving speed is slow.
  • the continuous length or pitch of the note may be set according to the moving speed of the hand H. In the above examples, the case of setting the specific phoneme type is illustrated, but the phoneme type other than the specific phoneme may be controlled according to the moving speed.
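A minimal sketch of the FIG. 5 mapping from approach speed to specific phoneme; the cutoff value is hypothetical:

```python
def phoneme_for_speed(speed_m_per_s: float) -> str:
    """Fast approach -> short consonant ("ta"); slow approach -> long consonant ("sa")."""
    FAST_CUTOFF = 0.5  # hypothetical boundary between "fast" and "slow" (m/s)
    return "ta" if speed_m_per_s >= FAST_CUTOFF else "sa"
```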
  • the type of phoneme may be set according to the moving direction of hand H.
  • the user hits the hitting surface F by moving the hand H from different directions according to the desired phoneme.
  • the user can hit the hitting surface F by moving the hand H from various directions with respect to the hitting surface F.
  • the first detection unit 31 detects the moving direction of the hand H from the image data D1
  • the phoneme specifying unit 212 sets the type of phoneme according to the moving direction.
  • the phoneme specifying unit 212 sets the phoneme type before the hand H is in a specific state.
  • the continuous length or pitch of the note may be set according to the moving direction of the hand H.
  • For example, the type of phoneme may be set according to the shape of the hand H.
  • The user strikes the striking surface F with the hand H in an arbitrary shape, formed for example by moving the fingers into a closed fist ("rock"), a two-finger "scissors", or an open hand ("paper").
  • FIG. 6 is a table showing the relationship between the shape of the hand H and the phonology.
  • the type of phoneme may be set in consideration of whether the hand H is the right hand or the left hand.
  • the state of the hand H also includes whether the user's hand H is the right hand or the left hand.
  • the first detection unit 31 detects whether the hand H is the right hand or the left hand and the shape of the hand H from the image data D1.
  • a known image analysis technique is arbitrarily adopted for detecting whether the hand H is the right hand or the left hand and the shape of the hand H.
  • the phoneme specifying unit 212 sets the phoneme type before the hand H is in a specific state.
  • The phoneme specifying unit 212 specifies the phoneme according to whether the hand is the right or left hand and according to the shape of the hand H. As illustrated in FIG. 6, for example, when the striking surface F is struck with the left hand formed into a fist, the phoneme "ta" is pronounced.
  • the continuous length or pitch of the note may be set according to the shape of the hand H.
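A minimal sketch of a FIG. 6-style lookup; only the left-hand fist to "ta" pairing is stated in the text, and the other entries are hypothetical placeholders:

```python
# (hand side, hand shape) -> phoneme.
PHONEME_TABLE = {
    ("left", "rock"): "ta",   # stated in the text
    ("left", "paper"): "sa",  # hypothetical
    ("right", "rock"): "ka",  # hypothetical
    ("right", "paper"): "na", # hypothetical
}

def phoneme_for_hand(side: str, shape: str, default: str = "a") -> str:
    return PHONEME_TABLE.get((side, shape), default)
```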
  • As described above, the state of the hand H (for example, its moving speed) is detected, and the pronunciation of the phoneme is controlled according to the result of that detection.
  • the user can control the pronunciation of the first sound and the second sound by changing the moving speed, moving direction, and shape of the object.
  • the state of the hand H is not limited to the moving speed of the hand H, the moving direction of the hand H, and the shape of the hand H.
  • the moving angle of the hand H (the angle at which the hand H moves with respect to the striking surface F) may be set as the state of the hand H.
  • The first threshold value may be changed according to the moving speed of the hand H. The first detection unit 31 detects the moving speed of the hand H from, for example, the image data D1; the speed is detected before the hand H enters the specific state. The first detection unit 31 then sets the first threshold value according to the moving speed: relatively large when the hand H moves fast, and relatively small when it moves slowly. It then compares the distance P with the threshold value thus set to determine whether the distance P has reached the specific distance Pz. With this configuration, variation in the consonant's duration due to the hand H's moving speed can be reduced.
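As a worked illustration with assumed numbers: if the consonant should sound for about 100 ms, a hand approaching at 0.5 m/s should enter the specific state at Pz ≈ 0.5 m/s × 0.1 s = 0.05 m, whereas a hand approaching at 1.0 m/s needs Pz ≈ 0.1 m. Setting the first threshold in proportion to the measured speed in this way keeps the consonant's duration roughly constant.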
  • the first threshold value may be changed according to the moving direction of the hand H.
  • the first detection unit 31 detects the moving direction of the hand H from, for example, the image data D1. The moving direction is detected before the hand H is in a specific state.
  • The first detection unit 31 sets the first threshold value according to the moving direction of the hand H. For example, when the moving direction of the hand H is a first direction, the first detection unit 31 sets the first threshold value to a first value, and when the moving direction is a second direction different from the first direction, it sets the first threshold value to a second value larger than the first value. It then compares the distance P with the threshold value thus set to determine whether the distance P has reached the specific distance Pz.
  • The continuation length of the consonant of the specific phoneme changes according to the first threshold value: it becomes longer as the first threshold value increases and shorter as it decreases.
  • the first threshold value may be set variably.
  • the time point at which the pronunciation of the phoneme is finished may be controlled according to the movement of the hand H by the user.
  • the pronunciation of the phoneme may be terminated when the hand H separates from the striking surface F after striking the striking surface F.
  • FIG. 7 is a block diagram illustrating the configuration of the detection unit 213 according to the modified example.
  • the detection unit 213 includes a third detection unit 33 in addition to the first detection unit 31 and the second detection unit 32.
  • the third detection unit 33 detects that the hand H is separated from the striking surface F.
  • For example, the third detection unit 33 detects that the hand H has separated from the striking surface F by analyzing the image data D1.
  • the third detection unit 33 may detect that the hand H is separated from the striking surface F by using the output from the pressure sensor that detects the pressure applied to the striking surface F.
  • When the third detection unit 33 detects that the hand H has separated from the striking surface F, the pronunciation control unit 214 ends the pronunciation of the phoneme.
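A minimal sketch of this release detection using a pressure reading, assuming a simple threshold with contact-state tracking (the sensor API and threshold are hypothetical):

```python
class ThirdDetector:
    """Sketch of the third detection unit (33): reports the moment the hand
    separates from the striking surface, based on a pressure sensor reading."""

    def __init__(self, contact_threshold: float):
        self.contact_threshold = contact_threshold
        self.in_contact = False

    def update(self, pressure: float) -> bool:
        """Feed the latest pressure; returns True at the moment of release."""
        released = self.in_contact and pressure < self.contact_threshold
        self.in_contact = pressure >= self.contact_threshold
        return released
```

When `update` returns True, the pronunciation control unit 214 would end the phoneme, analogously to a note-off.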
  • In the embodiment above, the striking surface F is struck by the user's physical hand H, but a configuration in which the user strikes a virtual striking surface F using haptic (tactile-feedback) technology may also be adopted.
  • For example, the user strikes a striking surface F prepared in a virtual space by operating a controller that moves a virtual hand in the virtual space shown on a display device. By mounting on the controller a vibration motor that vibrates when the striking surface F is hit in the virtual space, the user perceives the striking surface F as actually being struck.
  • When the hand in the virtual space enters the specific state, the consonant of the specific phoneme is pronounced, and when the striking surface F is struck in the virtual space, the vowel of the specific phoneme is pronounced.
  • the striking surface F may be a surface in the virtual space.
  • the hand H may be a hand in the virtual space.
  • the function of the sound control device 20 illustrated above is realized by the cooperation of one or more processors constituting the control device 21 and the program stored in the storage device 23.
  • the program according to the present disclosure may be provided and installed on a computer in a form stored in a computer-readable recording medium.
  • The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but known recording media of any other format, such as semiconductor or magnetic recording media, are also included.
  • A non-transitory recording medium includes any recording medium other than a transitory propagating signal; volatile recording media are not excluded. Further, in a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
  • The sound control method according to one aspect of the present disclosure detects that an object is in a specific state while the object is moving toward a surface, sounds a first sound when the specific state is detected, detects a striking event in which the object hits the surface as a result of the movement, and sounds a second sound when the striking event is detected.
  • In the above aspect, the first sound is produced when the object reaches the specific state while moving toward the surface, and the second sound is produced when the object hits the surface. The first sound can therefore be pronounced before the object hits the surface. Further, since the second sound is triggered by detecting the object's impact on the surface, the first sound can be sounded before the second sound while preserving the feel of the operation that produces the second sound.
  • In one example, the first sound is a first phoneme, and the second sound is a second phoneme different from the first phoneme.
  • the first phoneme is a consonant and the second phoneme is a vowel.
  • a consonant is pronounced when the object is in a specific state, and a vowel is pronounced following the consonant when the object hits the surface. Therefore, it is possible to reduce the perception that the pronunciation of a phoneme composed of consonants and vowels is delayed.
  • In one example, the first sound is a sound related to a preparatory movement for pronunciation, and the second sound is the sound following the preparatory movement.
  • the specific state is that the distance between the object and the surface is at a specific distance.
  • the first sound is produced when the distance between the object and the surface becomes a specific distance. That is, the first sound is pronounced in the middle of the movement of the object toward the surface. Therefore, the first sound can be pronounced without the user being aware of the operation for pronouncing the first sound.
  • In one example, in detecting the striking event, the impact of the object is detected by analyzing a sound signal generated by a sound collecting device.
  • the impact of the object on the surface is detected by analyzing the sound signal generated by the sound collecting device. Therefore, the striking sound generated by striking the surface can be used for the pronunciation of the second sound.
  • In one example, at least one of the moving speed of the object, the moving direction of the object, and the shape of the object is detected, and the pronunciation of at least one of the first sound and the second sound is controlled according to the result of that detection. The user can therefore control the pronunciation of the first and second sounds by changing the object's moving speed, moving direction, or shape.
  • The sound control device according to one aspect of the present disclosure includes a detection unit that detects that an object is in a specific state while the object is moving toward a surface and detects a striking event in which the object hits the surface as a result of the movement, and a sound control unit that sounds a first sound when the specific state is detected and a second sound when the striking event is detected.
  • the pronunciation control method and the pronunciation control device of the present disclosure can start pronunciation before an object such as a user's finger comes into contact with a surface such as a key.

Abstract

A sound output control device (20) comprises: a detection unit (213) for detecting that an object is in a specific state while the object is moving toward a surface and detecting a striking event in which the object strikes the surface as a result of the movement; and a sound output control unit (214) for causing a first sound to be output when the specific state is detected and causing a second sound to be output when the striking event is detected.

Description

Pronunciation control method and pronunciation control device
This disclosure relates to a technique for controlling pronunciation.
Techniques for synthesizing a singing voice by operating an operator such as a keyboard have been proposed. For example, in Patent Document 1, when the user presses a key corresponding to a desired pitch, the lyrics set for that pitch are pronounced. Specifically, a consonant is sounded when the user's finger is detected touching the key, and the vowel following the consonant is sounded when the key is detected being pressed fully down.
Patent Document 1: Japanese Patent Application Laid-Open No. 2014-98801
In the technique of Patent Document 1, pronunciation is started upon contact by the user. However, depending on the content to be pronounced, it may be desirable to start pronunciation before the finger touches the key. In view of these circumstances, an object of this disclosure is to start pronunciation before an object such as a user's finger comes into contact with a surface such as a key.
To solve the above problems, the sound control method according to one aspect of the present disclosure detects that an object is in a specific state while the object is moving toward a surface, sounds a first sound when the specific state is detected, detects a striking event in which the object hits the surface as a result of the movement, and sounds a second sound when the striking event is detected.
The sound control device according to one aspect of the present disclosure includes a detection unit that detects that an object is in a specific state while the object is moving toward a surface and detects a striking event in which the object hits the surface as a result of the movement, and a sound control unit that sounds a first sound when the specific state is detected and a second sound when the striking event is detected.
FIG. 1 is a block diagram illustrating the configuration of the pronunciation control system according to the embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating the functional configuration of the pronunciation control device.
FIG. 3 is a graph showing the relationship between time and the distance between the hand and the striking surface.
FIG. 4 is a flowchart of the process executed by the control device.
FIG. 5 is a schematic diagram showing the relationship between the moving speed of the hand and the type of specific phoneme.
FIG. 6 is a table showing the relationship between the shape of the hand and the phoneme.
FIG. 7 is a block diagram illustrating the configuration of the detection unit according to a modified example.
<実施形態>
 図1は、本開示の実施形態に係る発音制御システム100の構成を例示するブロック図である。発音制御システム100は、特定の歌唱者が楽曲を歌唱した仮想的な音声を合成する。合成される音声を構成する各音韻は、利用者から指示された時点で発音される。
<Embodiment>
FIG. 1 is a block diagram illustrating the configuration of the pronunciation control system 100 according to the embodiment of the present disclosure. The pronunciation control system 100 synthesizes a virtual voice in which a specific singer sings a musical piece. Each phoneme that composes the synthesized voice is pronounced at the time instructed by the user.
 発音制御システム100は、操作ユニット10と発音制御装置20とを具備する。利用者は、操作ユニット10を自身の手Hで打撃することで各音韻の発音を開始する時点(以下「発音開始点」という)を発音制御装置20に対して指示する。発音制御装置20は、利用者からの指示に応じて各音韻を発音させることで音声を合成する。 The pronunciation control system 100 includes an operation unit 10 and a pronunciation control device 20. The user instructs the pronunciation control device 20 at a time when the operation unit 10 is hit with his / her own hand H to start the pronunciation of each phoneme (hereinafter referred to as “pronunciation start point”). The pronunciation control device 20 synthesizes voice by pronouncing each phoneme according to an instruction from the user.
 操作ユニット10は、操作受付部11と第1センサ13と第2センサ15とを具備する。操作受付部11は、利用者の手Hで打撃される面(以下「打撃面」という)Fを含む。手Hは打撃面Fを打撃する「物体」の例示である。具体的には、操作受付部11は、筐体112と光透過部114とを具備する。筐体112は、例えば上方が開口した中空の構造体である。光透過部114は、第1センサ13が検出可能な波長域の光を透過する部材で形成された平板状の部材である。筐体112の開口を塞ぐように光透過部114が設置される。光透過部114のうち筐体112の内部空間とは反対側の面が打撃面Fに相当する。利用者は、各音韻の発音開始点を指示するために、手Hで打撃面Fを打撃する。具体的には、利用者は、打撃面Fの上方から当該打撃面Fに向けて手Hを移動させることで、当該打撃面Fを打撃する。手Hが打撃面Fを打撃した時点に応じて音韻が発音される。 The operation unit 10 includes an operation reception unit 11, a first sensor 13, and a second sensor 15. The operation reception unit 11 includes a surface (hereinafter referred to as “striking surface”) F that is hit by the user's hand H. The hand H is an example of an "object" that hits the striking surface F. Specifically, the operation receiving unit 11 includes a housing 112 and a light transmitting unit 114. The housing 112 is, for example, a hollow structure having an opening at the top. The light transmitting portion 114 is a flat plate-shaped member formed of a member that transmits light in a wavelength range that can be detected by the first sensor 13. The light transmitting portion 114 is installed so as to close the opening of the housing 112. The surface of the light transmitting portion 114 on the side opposite to the internal space of the housing 112 corresponds to the striking surface F. The user hits the striking surface F with the hand H in order to indicate the pronunciation start point of each phoneme. Specifically, the user hits the hitting surface F by moving the hand H from above the hitting surface F toward the hitting surface F. A phoneme is pronounced according to the time when the hand H hits the striking surface F.
 第1センサ13および第2センサ15は、筐体112の内部に収容される。第1センサ13は、利用者の手Hの状態を検出するためのセンサである。例えば、被写体と撮像面との距離を画素毎に測定する距離画像センサが第1センサ13として利用される。例えば、打撃面Fに向かって移動する手Hが第1センサ13により撮像される。第1センサ13は、例えば筐体112の底面の中心部分に設置され、打撃面Fに向かって移動する手Hを掌側(筐体112の内部側)から撮像する。具体的には、第1センサ13は、特定の波長域の光を検知可能であり、打撃面Fの上方に位置する手Hから光透過部114を介して到来する光を受光することで手Hの画像を表すデータ(以下「画像データ」という)D1を生成する。なお、光透過部114は、第1センサ13が検知可能な光を透過する部材で形成される。画像データD1は、発音制御装置20に送信される。第1センサ13と発音制御装置20とは、無線または有線により通信可能である。なお、画像データD1は所定の期間毎に反復的に生成される。 The first sensor 13 and the second sensor 15 are housed inside the housing 112. The first sensor 13 is a sensor for detecting the state of the user's hand H. For example, a distance image sensor that measures the distance between the subject and the imaging surface for each pixel is used as the first sensor 13. For example, the hand H moving toward the striking surface F is imaged by the first sensor 13. The first sensor 13 is installed, for example, in the central portion of the bottom surface of the housing 112, and images the hand H moving toward the striking surface F from the palm side (inside of the housing 112). Specifically, the first sensor 13 can detect light in a specific wavelength range, and receives light coming from the hand H located above the striking surface F via the light transmitting portion 114 to receive the light. Data representing the image of H (hereinafter referred to as "image data") D1 is generated. The light transmitting portion 114 is formed of a member that transmits light that can be detected by the first sensor 13. The image data D1 is transmitted to the sound control device 20. The first sensor 13 and the sound control device 20 can communicate with each other wirelessly or by wire. The image data D1 is repeatedly generated at predetermined intervals.
 第2センサ15は、打撃面Fに対する手Hの打撃を検出するためのセンサである。例えば周囲の音を収音し、当該収音した音を表す音信号D2を生成する収音装置が第2センサ15として利用される。具体的には、第2センサ15は、利用者の手Hが打撃面Fを打撃したときに発生する打撃音を収音する。音信号D2は、発音制御装置20に送信される。第2センサ15と発音制御装置20とは、無線または有線により通信可能である。 The second sensor 15 is a sensor for detecting the impact of the hand H on the impact surface F. For example, a sound collecting device that collects ambient sounds and generates a sound signal D2 representing the collected sounds is used as the second sensor 15. Specifically, the second sensor 15 collects the hitting sound generated when the user's hand H hits the hitting surface F. The sound signal D2 is transmitted to the sound control device 20. The second sensor 15 and the sound control device 20 can communicate with each other wirelessly or by wire.
 図2は、発音制御装置20の構成を例示するブロック図である。発音制御装置20は、利用者による打撃面Fを打撃する動作に応じて音声を合成する。具体的には、発音制御装置20は、制御装置21と記憶装置23と放音装置25とを具備する。 FIG. 2 is a block diagram illustrating the configuration of the sound control device 20. The sound generation control device 20 synthesizes voice according to the action of hitting the hitting surface F by the user. Specifically, the sound control device 20 includes a control device 21, a storage device 23, and a sound emitting device 25.
 制御装置21は、例えば発音制御装置20の各要素を制御する単数または複数のプロセッサである。例えば、制御装置21は、CPU(Central Processing Unit)、SPU(Sound Processing Unit)、GPU(Graphics Processing Unit)、DSP(Digital Signal Processor)、FPGA(Field Programmable Gate Array)、またはASIC(Application Specific Integrated Circuit)等の1種類以上のプロセッサにより構成される。具体的には、制御装置21は、記憶装置23に記憶されたプログラムを実行することで、歌唱者が楽曲を歌唱した音声を表す信号(以下「合成信号」という)Vを生成するための複数の機能(音韻特定部212、検出部213および発音制御部214)を実現する。 The control device 21 is, for example, a single or a plurality of processors that control each element of the sound control device 20. For example, the control device 21 is a CPU (Central Processing Unit), SPU (Sound Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit). ) Etc., it is composed of one or more types of processors. Specifically, the control device 21 executes a program stored in the storage device 23 to generate a plurality of signals (hereinafter referred to as “synthetic signals”) V representing the voices of the singer singing the music. (Phonology identification unit 212, detection unit 213, and sound control unit 214) are realized.
 記憶装置23は、例えば磁気記録媒体または半導体記録媒体等の公知の記録媒体で構成された単数または複数のメモリである。記憶装置23は、制御装置21が実行するプログラムと制御装置21が使用する各種のデータとを記憶する。なお、記憶装置23は、複数種の記録媒体の組合せにより構成されてもよい。また、記憶装置23は、発音制御装置20に対して着脱可能な可搬型の記録媒体、または、発音制御装置20が通信網を介して通信可能な外部記録媒体(例えばオンラインストレージ)としてもよい。具体的には、記憶装置23は、発音制御装置20が合成すべき音を表すデータ(以下「合成データ」という)Sを記憶する。 The storage device 23 is a single or a plurality of memories composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium. The storage device 23 stores a program executed by the control device 21 and various data used by the control device 21. The storage device 23 may be configured by combining a plurality of types of recording media. Further, the storage device 23 may be a portable recording medium that can be attached to and detached from the sound control device 20, or an external recording medium (for example, online storage) that the sound control device 20 can communicate with via a communication network. Specifically, the storage device 23 stores data (hereinafter referred to as “synthetic data”) S representing sounds to be synthesized by the sound control device 20.
The synthesis data S is data that specifies the content of the musical piece. Specifically, the synthesis data S specifies a pitch Sx and a phoneme Sy for each of the plurality of notes constituting the musical piece. The pitch Sx is any one of a plurality of pitches (for example, a note number). The phoneme Sy is the pronunciation content to be uttered together with the sounding of the note. Specifically, the phoneme Sy corresponds to one syllable (pronunciation unit) of the lyrics of the musical piece. For example, a typical phoneme Sy in Japanese is a combination of a consonant and the vowel immediately following it, or a single vowel. The synthesized signal V is generated by voice synthesis using the synthesis data S. The sounding start point of each note is controlled according to the user's action of striking the striking surface F. The order of the plurality of notes constituting the musical piece is specified by the synthesis data S, but the sounding start point of each note is not specified by the synthesis data S.
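By way of illustration only, the note data described above might be represented as follows; this is a minimal sketch in Python, and the Note class, its field names, and the example values are assumptions of this illustration rather than part of the disclosure.

    from dataclasses import dataclass

    @dataclass
    class Note:
        pitch: int    # pitch Sx, e.g. a note number
        phoneme: str  # phoneme Sy, e.g. "sa" (consonant + vowel) or "a" (vowel only)

    # Synthesis data S as an ordered sequence of notes; sounding start points
    # are deliberately absent and are supplied by the user's strikes at
    # performance time.
    synthesis_data_s = [
        Note(pitch=60, phoneme="sa"),
        Note(pitch=62, phoneme="a"),
        Note(pitch=64, phoneme="ka"),
    ]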
The phoneme specifying unit 212 determines whether or not the phoneme Sy specified by the synthesis data S for each note is a phoneme composed of a consonant and a vowel (hereinafter referred to as a "specific phoneme"). Specifically, the phoneme specifying unit 212 determines that a phoneme Sy composed of a consonant and the vowel following that consonant is a specific phoneme, and that a phoneme Sy composed of a single vowel is a phoneme other than a specific phoneme.
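A minimal sketch of this determination, assuming phonemes are given as romanized strings and that the vowel inventory is the five Japanese vowels (both assumptions of this illustration):

    VOWELS = {"a", "i", "u", "e", "o"}

    def is_specific_phoneme(phoneme: str) -> bool:
        # A specific phoneme begins with a consonant and ends with a vowel
        # (e.g. "sa"); a single-vowel phoneme (e.g. "a") is not specific.
        return phoneme[0] not in VOWELS and phoneme[-1] in VOWELS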
The user keeps the rhythm of the musical piece by striking the striking surface F in sequence. Specifically, the user strikes the striking surface F at each point at which the sounding of a note in the musical piece should start. Meanwhile, the sounding start point of the vowel following a consonant is perceived aurally as the sounding start point of the specific phoneme as a whole. Therefore, in a configuration in which the consonant of a specific phoneme starts sounding at the point at which the user strikes the striking surface F (hereinafter referred to as the "striking point") and the vowel is sounded after the consonant, the sounding of the specific phoneme of a note is perceived as starting with a delay relative to the note start point the user intends. In the present embodiment, therefore, sounding of a specific phoneme is started before the striking point. This reduces the impression that the specific phoneme is heard with a delay.
FIG. 3 is a graph showing the relationship between time and the distance P between the hand H and the striking surface F. As illustrated in FIG. 3, as the hand H moves toward the striking surface F, the distance P between the hand H and the striking surface F decreases over time. In other words, the distance P is the height of the hand H above the striking surface F. When the hand H strikes the striking surface F, the distance P becomes zero. It is assumed that the user's hand H enters a specific state (hereinafter referred to as the "specific state") while moving toward the striking surface F. In the present embodiment, the specific state is the state in which the distance P, in the course of decreasing, reaches a specific distance (hereinafter the "specific distance") Pz. That is, the specific state is a state of the hand H before it contacts the striking surface F. The distance P may be, for example, the distance between the hand H and a reference point (for example, the center point) on the striking surface F.
FIG. 3 shows the point t1 at which the hand H enters the specific state (hereinafter referred to as the "arrival point") and the striking point t2. The consonant of the specific phoneme is sounded at the arrival point t1 (that is, when the distance P reaches the specific distance Pz), and the vowel of the specific phoneme is sounded at the striking point t2 (that is, when the distance P reaches zero). In other words, sounding of the consonant starts when the hand H reaches the position at the specific distance Pz, and sounding of the vowel following the consonant starts when the hand H moves further from that position and strikes the striking surface F.
The detection unit 213 of FIG. 2 includes a first detection unit 31 and a second detection unit 32. The first detection unit 31 detects that the hand H is in the specific state. First, the first detection unit 31 specifies the distance P using the image data D1. For example, the first detection unit 31 estimates the region of the hand H from the image data D1 by image recognition such as contour extraction, and specifies the distance P of the hand H from the distances measured by the first sensor 13 for the pixels in that region. Any known technique may be employed for specifying the distance P. Next, the first detection unit 31 determines whether or not the distance P has reached the specific distance Pz by comparing the distance P with a first threshold. The first threshold is set, for example, according to the specific distance Pz. When the distance P exceeds the first threshold, it is determined that the distance P has not reached the specific distance Pz. Conversely, when the distance P falls below the first threshold, it is determined that the distance P has reached the specific distance Pz. In practice, a slight time difference inevitably occurs between the arrival point t1 at which the hand H enters the specific state and the point at which that specific state is detected; in the following description, however, the arrival point t1 and the point at which the specific state is detected are treated as substantially the same point in time.
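A sketch of this determination, assuming the per-pixel depth map D1 and a hand mask obtained by some known region-estimation technique are already available (the disclosure does not fix how the per-pixel distances are aggregated; taking the nearest hand pixel is one plausible choice):

    import numpy as np

    def detect_specific_state(depth_image: np.ndarray,
                              hand_mask: np.ndarray,
                              first_threshold: float) -> bool:
        # depth_image: per-pixel distances from the first sensor 13 (image data D1)
        # hand_mask:   boolean mask of the estimated hand region
        if not hand_mask.any():
            return False  # no hand visible in this frame
        distance_p = float(depth_image[hand_mask].min())  # one choice for distance P
        return distance_p < first_threshold               # below threshold: specific state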
The second detection unit 32 detects that the hand H has struck the striking surface F as a result of the movement of the hand H. Specifically, the second detection unit 32 detects the strike by analyzing the sound signal D2. First, the second detection unit 32 analyzes the sound signal D2 to specify the volume of the sound represented by the sound signal D2 (hereinafter referred to as the "sound collection level"). Any known sound analysis technique may be employed for analyzing the sound signal D2. Next, the second detection unit 32 determines whether or not the hand H has struck the striking surface F by comparing the sound collection level with a second threshold. When the hand H strikes the striking surface F, a striking sound is generated; the second threshold is set, for example, on the assumption of the striking sound produced when the hand H strikes the striking surface F. When the sound collection level falls below the second threshold, it is determined that the sound signal D2 contains no striking sound, that is, that the striking surface F has not been struck. Conversely, when the sound collection level exceeds the second threshold, it is determined that the sound signal D2 contains a striking sound, that is, that the hand H has struck the striking surface F. In practice, a slight time difference inevitably occurs between the striking point t2 at which the hand H strikes the striking surface F and the point at which that strike (striking event) is detected; in the following description, however, the striking point t2 and the point at which the strike is detected are treated as substantially the same point in time.
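A corresponding sketch for the strike detection, assuming one analysis frame of the sound signal D2 is available as an array of samples, and using the RMS level as one plausible realization of the "sound collection level":

    import numpy as np

    def detect_strike(sound_frame: np.ndarray, second_threshold: float) -> bool:
        # Sound collection level of this frame of D2, here computed as RMS.
        level = float(np.sqrt(np.mean(np.square(sound_frame))))
        return level > second_threshold  # above threshold: striking sound present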
The pronunciation control unit 214 generates the synthesized signal V representing the sound specified by the synthesis data S. The synthesized signal V is a signal representing a voice in which the phoneme Sy specified for each note is pronounced at the pitch Sx specified for that note by the synthesis data S. Any known technique may be employed for the voice synthesis. For example, unit-concatenation voice synthesis, in which the synthesized signal V is generated by concatenating a plurality of voice units, or statistical-model voice synthesis, in which the synthesized signal V is generated using a statistical model such as an HMM (Hidden Markov Model) or a neural network, may be used to generate the synthesized signal V. The sounding start point of each phoneme Sy specified by the synthesis data S is controlled according to the results of detection by the first detection unit 31 and the second detection unit 32.
When the phoneme Sy specified by the synthesis data S is determined by the phoneme specifying unit 212 to be a phoneme other than a specific phoneme, the pronunciation control unit 214 causes that phoneme to be sounded in response to the strike on the striking surface F. Specifically, the pronunciation control unit 214 causes the phoneme to be sounded at the point at which the second detection unit 32 detects the strike. That is, a synthesized signal V is generated in which the sounding start point of the entire phoneme is set to the striking point t2. Conversely, when the phoneme Sy specified by the synthesis data S is determined by the phoneme specifying unit 212 to be a specific phoneme, the pronunciation control unit 214 causes that specific phoneme to start sounding before the striking surface F is struck. Specifically, the pronunciation control unit 214 causes the consonant of the specific phoneme to be sounded at the point at which the first detection unit 31 detects the specific state, and causes the vowel of that specific phoneme to be sounded at the point at which the second detection unit 32 detects the strike. That is, a synthesized signal V is generated in which the sounding start point of the consonant of the specific phoneme is set to the arrival point t1 and the sounding start point of the vowel following that consonant is set to the striking point t2. The synthesized signal V is supplied to the sound emitting device 25.
The sound emitting device 25 (for example, a speaker) is a playback device that emits the sound represented by the synthesized signal V. Accordingly, a voice in which the sounding start point of each phoneme Sy of the musical piece is controlled is emitted. That is, the impression that the specific phonemes of the musical piece as a whole are heard with a delay can be reduced.
FIG. 4 is a flowchart of the processing of the control device 21. The user strikes the striking surface F whenever he or she wants to start the sounding of a note of the musical piece. That is, the striking surface F is struck with the hand H once per note. The processing of FIG. 4 is executed for each note of the synthesis data S. In the following description, the note that is the target of the processing of FIG. 4 among the plurality of notes of the musical piece is referred to as the "target note". In parallel with the processing of FIG. 4, the processing by which the first detection unit 31 specifies the distance P and the processing by which the second detection unit 32 specifies the sound collection level are executed. The processing that specifies the distance P and the processing that specifies the sound collection level are executed repeatedly at a period shorter than the period at which the processing of FIG. 4 is executed.
When the processing of FIG. 4 starts, the phoneme specifying unit 212 determines whether or not the phoneme Sy of the target note in the synthesis data S is a specific phoneme (Sa1). When the phoneme Sy of the target note is determined to be a specific phoneme (Sa1: YES), the first detection unit 31 determines whether or not the hand H is in the specific state while moving toward the striking surface F (Sa2). That is, it is determined whether or not the distance P, in the course of decreasing, is at the specific distance Pz. Specifically, the first detection unit 31 determines whether or not the distance P is decreasing. When the distance P is decreasing, the first detection unit 31 determines whether or not the hand H is in the specific state by comparing the distance P with the first threshold. When the distance P is increasing, whether or not the hand H is in the specific state is not determined.
When the hand H is determined to be in the specific state (Sa2: YES), the pronunciation control unit 214 causes the consonant of the specific phoneme to be sounded (Sa3). Specifically, the pronunciation control unit 214 generates a synthesized signal V in which the sounding start point of the consonant of the specific phoneme is set to the point at which the specific state was detected, and supplies the synthesized signal V to the sound emitting device 25. That is, the consonant of the specific phoneme is sounded at the point at which the specific state is detected (that is, the arrival point t1). Conversely, when the hand H is determined not to be in the specific state (Sa2: NO), the processing of step Sa2 is executed repeatedly until the hand H enters the specific state.
The hand H moves further toward the striking surface F from the position of the specific state. The second detection unit 32 determines whether or not the hand H has struck the striking surface F (Sa4). Specifically, whether or not the hand H has struck the striking surface F is determined by comparing the sound collection level with the second threshold. When the hand H is determined to have struck the striking surface F (Sa4: YES), the pronunciation control unit 214 causes the vowel following the consonant of the specific phoneme to be sounded (Sa5). Specifically, the pronunciation control unit 214 generates a synthesized signal V in which the sounding start point of the vowel of the specific phoneme is set to the point at which the strike on the striking surface F was detected, and supplies the synthesized signal V to the sound emitting device 25. That is, the vowel of the specific phoneme is sounded at the point at which the strike on the striking surface F is detected (that is, the striking point t2). Conversely, when the hand H is determined not to have reached the striking surface F (Sa4: NO), the processing of step Sa4 is executed repeatedly until the hand H moves to the striking surface F and strikes it. Through the above processing, for a specific phoneme, sounding of that specific phoneme is started before the hand H strikes the striking surface F.
Conversely, when the phoneme Sy of the target note is determined to be a phoneme other than a specific phoneme (typically, a single-vowel phoneme) (Sa1: NO), the processing of steps Sa2 and Sa3 is omitted and the processing of step Sa4 is executed. That is, for a phoneme other than a specific phoneme, sounding of that phoneme is started at the striking point t2. The duration of each note may be a fixed length of time, or may be a length of time specified for each note by the synthesis data S.
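The per-note flow of FIG. 4 can be summarized by the following sketch, which reuses the hypothetical is_specific_phoneme helper above; the detector and sounding callables are stand-ins for the detection units and the pronunciation control unit 214, and the polling structure is an assumption of this illustration:

    import time

    def process_target_note(note,
                            specific_state_detected,  # wraps the first detection unit
                            strike_detected,          # wraps the second detection unit
                            start_consonant, start_vowel, start_phoneme,
                            poll_interval=0.001):
        if is_specific_phoneme(note.phoneme):        # Sa1
            while not specific_state_detected():     # Sa2: wait for distance P < Pz
                time.sleep(poll_interval)
            start_consonant(note)                    # Sa3: consonant at arrival point t1
            while not strike_detected():             # Sa4: wait for the striking sound
                time.sleep(poll_interval)
            start_vowel(note)                        # Sa5: vowel at striking point t2
        else:
            while not strike_detected():             # Sa4 only
                time.sleep(poll_interval)
            start_phoneme(note)                      # whole phoneme at striking point t2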
As can be understood from the above description, in the present embodiment the consonant of a specific phoneme is sounded when the hand H is detected to be in the specific state, and the vowel of that specific phoneme is sounded when the strike on the striking surface F is detected. The consonant of the specific phoneme can therefore be sounded before the hand H strikes the striking surface F. That is, the perception that the specific phoneme is delayed can be reduced. Moreover, since the vowel of the specific phoneme is sounded upon detection of the strike of the hand H on the striking surface F, the consonant can be sounded before the vowel while the feel of the operation for sounding the specific phoneme is maintained.
The state in which the distance P between the hand H and the striking surface F is at the specific distance Pz is detected as the specific state. That is, a state of the hand H partway along its approach to the striking surface F is detected as the specific state. The consonant of the specific phoneme can therefore be sounded without the user being conscious of any operation for sounding it. Furthermore, since the strike of the hand H on the striking surface F is detected by analyzing the sound signal D2, the vowel of the specific phoneme can be sounded when a striking sound is produced by the strike on the striking surface F.
<Modification examples>
Specific modifications that may be added to the embodiments exemplified above are exemplified below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict each other.
(1) The sound produced upon detection of the specific state corresponds to the "first sound", and the sound produced upon detection of the strike on the striking surface F corresponds to the "second sound". In the embodiment described above, the consonant of the specific phoneme is an example of the "first sound", and the vowel of the specific phoneme is an example of the "second sound". That is, the pronunciation control unit 214 is expressed comprehensively as an element that sounds the first sound when the specific state is detected and sounds the second sound when the strike on the striking surface F is detected.
The first sound is not limited to the consonant of a specific phoneme, and the second sound is not limited to the vowel of a specific phoneme. For example, a sound related to a preparatory movement for sounding (hereinafter referred to as the "preparatory sound") may be the first sound, and the sound following the preparatory movement (hereinafter referred to as the "target sound") may be the second sound. The target sound is a sound defined by a note and is the object of the singing or the performance. The preparatory sound, on the other hand, is a sound produced as a result of the preparatory movement for sounding the target sound. When a singing voice is synthesized, a breath sound is an example of the preparatory sound, and the voice sung after the breath sound is an example of the target sound. When an instrumental performance sound is synthesized, the breathing sound produced when playing a wind instrument, the fret noise of a string instrument, or the swish of a stick when playing a percussion instrument are examples of the preparatory sound, and the performance sound of the instrument following the preparatory sound is an example of the target sound. That is, the voice synthesized by the sound control device 20 is not limited to a voice singing a musical piece. According to the configuration in which the preparatory sound is produced when the specific state is detected and the target sound is produced when the strike on the striking surface F is detected, the preparatory sound for sounding the target sound can be produced before the target sound, which is the actual objective. Note that an entire phoneme may be the first sound and the entire following phoneme may be the second sound.
Focusing on the pronunciation of speech sounds, a typical example of each of the first sound and the second sound is a phoneme (for example, a vowel or a consonant). The embodiment exemplified a configuration in which the first phoneme, an example of the first sound, is a consonant and the second phoneme, an example of the second sound, is a vowel; however, it does not matter whether each of the first phoneme and the second phoneme is a vowel or a consonant. For example, depending on the language of the lyrics of the musical piece in the voice synthesis, a phonological unit composed of a consonant followed by another consonant, or one composed of a vowel followed by another vowel, is also conceivable. The leading phoneme of the phonological unit is an example of the first phoneme, and the phoneme following the leading phoneme is an example of the second phoneme.
(2) In the embodiment described above, a distance image sensor capable of measuring distance was exemplified as the first sensor 13, but a distance-measuring function is not essential for the first sensor 13. For example, an image sensor may be used as the first sensor 13. The first detection unit 31 may calculate the amount of movement of the hand H by analyzing the images captured by the image sensor, and estimate the distance P from that amount of movement. The function of capturing an image of the hand H is also not essential for the first sensor 13. For example, an infrared sensor that emits infrared light may be used as the first sensor 13. In a configuration in which an infrared sensor is used as the first sensor 13, the first sensor 13 specifies the distance between the hand H and the first sensor 13 from the intensity of the received infrared light reflected by the hand H. The first detection unit 31 then determines that the hand H is in the specific state when the distance between the hand H and the first sensor 13 falls below a predetermined threshold, and determines that the hand H is not in the specific state when that distance exceeds the threshold. That is, calculating the distance P is not essential in the processing for determining whether or not the hand H is in the specific state. The distance between the hand H and the first sensor 13 corresponds to the sum of the distance P between the hand H and the striking surface F and the distance between the striking surface F and the first sensor 13. When the distance P between the hand H and the striking surface F is at the specific distance Pz, the distance between the hand H and the first sensor 13 is also a specific distance, so in this configuration as well, the distance P being at the specific distance Pz can be said to be the specific state. The function of the first detection unit 31 may also be incorporated in the first sensor 13; in that case, when the first sensor 13 detects the specific state, it instructs the pronunciation control unit 214 to sound the consonant of the specific phoneme.
(3) In the embodiment described above, the strike on the striking surface F was detected by analyzing the sound signal D2, but the method of detecting the strike is not limited to this example. For example, the strike of the hand H on the striking surface F may be detected by analyzing the image data D1 generated by the first sensor 13. For example, when it is estimated from the image data D1 that the hand H has contacted the striking surface F, the second detection unit 32 determines that the hand H has struck the striking surface F.
A vibration sensor that detects the vibration produced when the hand H strikes the striking surface F may be used as the second sensor 15. The second sensor 15 generates, for example, a signal corresponding to the magnitude of the vibration, and the second detection unit 32 detects the strike in accordance with that signal. Alternatively, a pressure sensor that detects the pressure applied to the striking surface F when the hand H contacts it may be used as the second sensor 15. The second sensor 15 generates, for example, a signal corresponding to the magnitude of the pressure applied to the striking surface F, and the second detection unit 32 detects the strike in accordance with that signal. The function of the second detection unit 32 may also be incorporated in the second sensor 15; in that case, when the second sensor 15 detects the strike on the striking surface F, it instructs the pronunciation control unit 214 to sound the vowel of the specific phoneme.
(4) In the embodiment described above, the first sensor 13 and the second sensor 15 were housed in the internal space of the housing 112, but the positions at which the first sensor 13 and the second sensor 15 are installed are arbitrary. For example, the first sensor 13 and the second sensor 15 may be installed outside the housing 112. In a configuration in which the first sensor 13 is installed outside the housing 112, it is not essential that the upper surface of the housing 112 of the operation reception unit 11 be formed of a light-transmitting member.
(5) In the embodiment described above, the striking surface F was struck with the hand H, but the object that strikes the striking surface F is not limited to the hand H. The type of object is arbitrary as long as it can strike the striking surface F. For example, a striking member such as a stick may serve as the object; the user moves the stick toward the striking surface F and strikes it. As understood from the above description, the object encompasses both a part of the user's body (typically the hand H) and a striking member operated by the user. In a configuration in which a striking member such as a stick serves as the object, the first sensor 13 or the second sensor 15 may be mounted on that member.
(6) In the embodiment described above, the distance P between the hand H and the striking surface F being at the specific distance Pz was exemplified as the specific state, but the specific state is not limited to this example. Any state of the hand H partway through its movement toward the striking surface F may serve as the specific state. For example, a change in the direction of movement of the hand H may serve as the specific state. Specifically, the direction of movement of the hand H changing from a direction away from the striking surface F to a direction approaching it, or changing from a direction horizontal to the striking surface F to a direction perpendicular to it, are examples of the specific state. A change in the shape of the hand H (for example, from a closed fist to an open palm) may also serve as the specific state.
(7) The duration of the consonant of a specific phoneme differs according to the type of consonant. For example, the length of time required to pronounce the consonant [s] in the specific phoneme "sa" is on the order of 250 ms, whereas the length of time required to pronounce the consonant [k] in the specific phoneme "ka" is on the order of 30 ms. That is, the appropriate specific distance Pz differs according to the type of consonant of the specific phoneme. A configuration in which the first threshold is set variably according to the type of consonant of the specific phoneme may therefore be adopted. Specifically, when the phoneme Sy is determined to be a specific phoneme, the first detection unit 31 sets the first threshold according to the type of consonant identified by the phoneme specifying unit 212. The first detection unit 31 then determines whether or not the hand H is in the specific state by comparing the distance P with the set first threshold.
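A rough sketch of such a consonant-dependent threshold; the duration table follows the [s] and [k] values above, while the expected approach speed and the fallback duration are assumptions of this illustration:

    # Nominal consonant durations in seconds ([s] and [k] from the text above).
    CONSONANT_DURATION = {"s": 0.250, "k": 0.030}
    EXPECTED_SPEED = 0.5  # m/s, an assumed typical approach speed of the hand

    def first_threshold_for(consonant: str) -> float:
        # Choose Pz so that a hand moving at the expected speed covers it in
        # roughly the consonant's duration; 0.1 s is an assumed fallback.
        return CONSONANT_DURATION.get(consonant, 0.1) * EXPECTED_SPEED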
(8) In the embodiment described above, the operation reception unit 11 was composed of the housing 112 and the light transmitting portion 114, but the operation reception unit 11 is not limited to this example. For example, in a configuration in which the first sensor 13 and the second sensor 15 are installed outside the operation reception unit 11, a flat plate-shaped member may serve as the operation reception unit 11. A keyboard-type operator may also serve as the operation reception unit 11. In a configuration in which a keyboard-type operator serves as the operation reception unit 11, the pitch Sx need not be specified for each note of the synthesis data S. By operating the operation reception unit 11, the user indicates the sounding start point of each note and also indicates the pitch of that note. That is, the pitch of each note may be set according to instructions from the user. Regardless of the shape of the operation reception unit 11, the surface of the operation reception unit 11 that the user contacts when striking corresponds to the striking surface F.
(9) In the embodiment described above, when the user strikes the striking surface F with the hand H, the state of the user's hand H may be detected and the sounding controlled according to the detection result. For example, conditions of the note (for example, its pitch, phoneme, or duration) are set according to the detection result. That is, it is not essential to set the pitch Sx and the phoneme Sy for each note of the synthesis data S. The state of the user's hand H is, for example, the movement speed of the hand H, the movement direction of the hand H, or the shape of the hand H. The combination of the detected state of the hand H and the note condition is arbitrary. In the action of striking the hand H against the striking surface F, the user can indicate the note condition by changing the state of the hand H. Specific configurations for controlling the sounding according to the state of the user's hand H are exemplified below.
A. Movement speed of the hand H
For example, the type of phoneme (that is, the pronunciation content) may be set according to the movement speed of the hand H. Specifically, the first detection unit 31 detects the movement speed of the hand H from the image data D1. The movement speed is detected from the change over time in the distance P specified from the image data D1. The first detection unit 31 may also detect the movement speed of the hand H using, for example, the output of a speed sensor. The phoneme specifying unit 212 then sets the type of specific phoneme according to the movement speed, doing so before the hand H enters the specific state. FIG. 5 is a schematic diagram showing the relationship between the movement speed of the hand H and the type of specific phoneme. FIG. 5 illustrates the specific phoneme set when the movement speed of the hand H1 is fast and the specific phoneme set when the movement speed of the hand H2 is slow. For example, when the movement speed of the hand H1 is fast, a specific phoneme (for example, "ta") containing a consonant of short duration (for example, [t]) is set, and when the movement speed of the hand H2 is slow, a specific phoneme (for example, "sa") containing a consonant of long duration (for example, [s]) is set. Regardless of the movement speed, sounding of the consonant starts at the arrival point t1, at which the distance P reaches the specific distance Pz, and sounding of the vowel starts at the striking point t2. When the movement speed of the hand H1 is fast, the length of time from the arrival point t1 to the striking point t2 is shorter than when the movement speed of the hand H2 is slow, so a specific phoneme whose consonant has a short duration is set. The duration or pitch of the note may also be set according to the movement speed of the hand H. Although the above example concerns setting the type of specific phoneme, the type of a phoneme other than a specific phoneme may also be controlled according to the movement speed.
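A sketch of such speed-dependent selection; the speed estimate from successive distance-P samples follows the text, while the 1.0 m/s boundary and the candidate phonemes are assumptions of this illustration:

    def movement_speed(distances, dt):
        # Approach speed from the two most recent distance-P samples taken
        # every dt seconds; positive while the hand approaches the surface.
        return (distances[-2] - distances[-1]) / dt

    def select_phoneme_by_speed(speed: float, fast_threshold: float = 1.0) -> str:
        # Fast approach: short time budget between t1 and t2, so a phoneme
        # with a short consonant ("ta"); slow approach: a long one ("sa").
        return "ta" if speed > fast_threshold else "sa"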
B. Movement direction of the hand H
For example, the type of phoneme may be set according to the movement direction of the hand H. The user moves the hand H from a different direction according to the desired phoneme and strikes the striking surface F. The user can strike the striking surface F by moving the hand H toward it from various directions: for example, moving the hand H from the right or from the left as seen by the user, or moving the hand H in a direction away from or toward the user. Specifically, the first detection unit 31 detects the movement direction of the hand H from the image data D1, and the phoneme specifying unit 212 sets the type of phoneme according to that movement direction, doing so before the hand H enters the specific state. The duration or pitch of the note may also be set according to the movement direction of the hand H.
C. Shape of the hand H
For example, the type of phoneme may be set according to the shape of the hand H. The user strikes the striking surface F with the hand H formed into an arbitrary shape, for example by moving the fingers into the rock, scissors, or paper shape. FIG. 6 is a table showing the relationship between the shape of the hand H and the phoneme. As illustrated in FIG. 6, the type of phoneme may be set taking into account not only the shape of the hand H but also whether the hand H is the right hand or the left hand; the state of the hand H thus also includes whether the user's hand H is the right hand or the left hand. The first detection unit 31 detects from the image data D1 whether the hand H is the right hand or the left hand, together with the shape of the hand H. Any known image analysis technique may be employed for these detections. The phoneme specifying unit 212 sets the type of phoneme before the hand H enters the specific state, specifying the phoneme according to the right/left hand and the shape of the hand H. As illustrated in FIG. 6, for example, when the striking surface F is struck with the left hand in the rock shape, the phoneme "ta" is sounded. The duration or pitch of the note may also be set according to the shape of the hand H.
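In the spirit of the FIG. 6 table, such a mapping might look as follows; only the left-hand rock entry ("ta") is taken from the text, and the remaining entries and the vowel fallback are placeholders of this illustration:

    PHONEME_BY_HAND = {
        ("left", "rock"): "ta",    # from the example in the text
        ("left", "paper"): "sa",   # placeholder
        ("right", "rock"): "ka",   # placeholder
        ("right", "paper"): "ma",  # placeholder
    }

    def phoneme_for(side: str, shape: str) -> str:
        return PHONEME_BY_HAND.get((side, shape), "a")  # assumed fallback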
As understood from the above description, at least one of the movement speed of the hand H, the movement direction of the hand H, and the shape of the hand H is detected, and the sounding of the phoneme is controlled according to the content of the detection. When the sounding of a specific phoneme is controlled, it suffices to control the sounding of at least one of the consonant (an example of the first sound) and the vowel (an example of the second sound). With the above configuration, the user can control the sounding of the first sound and the second sound by changing the movement speed, the movement direction, or the shape of the object. The state of the hand H is not limited to its movement speed, movement direction, and shape. For example, the movement angle of the hand H (the angle at which the hand H moves relative to the striking surface F) may serve as the state of the hand H.
(10) The length of time from the hand H reaching the specific distance Pz until it strikes the striking surface F becomes longer when the movement speed of the hand H is slow and shorter when it is fast. Therefore, in a configuration in which the first threshold is constant (a fixed value) regardless of the movement speed of the hand H, there is the problem that the duration of the consonant of the specific phoneme varies with the movement speed of the hand H: when the movement speed of the hand H is slow the consonant duration becomes long, and when it is fast the consonant duration becomes short. The first threshold may therefore be varied according to the movement speed of the hand H. Specifically, the first detection unit 31 detects the movement speed of the hand H from, for example, the image data D1; the movement speed is detected before the hand H enters the specific state. Next, the first detection unit 31 sets the first threshold according to the movement speed of the hand H, setting it relatively large when the movement speed of the hand H is fast and relatively small when it is slow. The first detection unit 31 then compares the distance P with the set first threshold to determine whether or not the distance P has reached the specific distance Pz. With the above configuration, variation of the consonant duration with the movement speed of the hand H can be reduced.
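A one-line sketch of such speed compensation, assuming an approximately constant approach speed between the arrival point t1 and the striking point t2:

    def speed_compensated_threshold(speed: float, consonant_duration: float) -> float:
        # A faster hand gets a proportionally larger Pz, so the time spent
        # between t1 and t2 (the consonant duration) stays roughly constant.
        return speed * consonant_duration

For example, at 0.5 m/s a consonant of 0.25 s yields Pz = 0.125 m, while at 1.0 m/s the same consonant yields Pz = 0.25 m.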
The first threshold may also be varied according to the movement direction of the hand H. Specifically, the first detection unit 31 detects the movement direction of the hand H from, for example, the image data D1; the movement direction is detected before the hand H enters the specific state. Next, the first detection unit 31 sets the first threshold according to the movement direction of the hand H. For example, the first detection unit 31 sets the first threshold to a first value when the movement direction of the hand H is a first direction, and to a second value larger than the first value when the movement direction of the hand H is a second direction different from the first direction. The first detection unit 31 then compares the distance P with the set first threshold to determine whether or not the distance P has reached the specific distance Pz. When the movement speed of the hand H is constant, the duration of the consonant of the specific phoneme varies with the first threshold: the larger the first threshold, the longer the consonant duration, and the smaller the first threshold, the shorter it is. A user who wants to lengthen the duration of the consonant of a specific phoneme strikes the striking surface F from the second direction; a user who wants to shorten it strikes the striking surface F from the first direction. As understood from the above description, the first threshold may be set variably.
(11) In the embodiment described above, the point at which the sounding of a phoneme ends may be controlled according to the movement of the user's hand H. For example, the sounding of the phoneme may be ended at the point at which the hand H leaves the striking surface F after striking it. FIG. 7 is a block diagram illustrating the configuration of the detection unit 213 according to this modification. The detection unit 213 includes a third detection unit 33 in addition to the first detection unit 31 and the second detection unit 32. The third detection unit 33 detects that the hand H has left the striking surface F; for example, the departure of the hand H from the striking surface F is detected by analyzing the image data D1. The third detection unit 33 may also detect that the hand H has left the striking surface F using the output of a pressure sensor that detects the pressure applied to the striking surface F. When the third detection unit 33 detects that the hand H has left the striking surface F, the pronunciation control unit 214 ends the sounding of the phoneme.
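A sketch of the pressure-sensor variant of this release detection; the threshold value is an assumption of this illustration:

    def detect_release(pressure: float, release_threshold: float = 0.05) -> bool:
        # The hand is judged to have left the striking surface F when the
        # measured pressure falls below a small threshold.
        return pressure < release_threshold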
(12) In the embodiment described above, the striking surface F was struck with the user's hand H, but a configuration in which the user strikes a virtual striking surface F using haptic technology based on tactile feedback may also be adopted. The user strikes a striking surface F prepared in a virtual space by operating an operator capable of manipulating a virtual hand in the virtual space displayed on a display device. By mounting in the operator a vibration motor that vibrates when the striking surface F in the virtual space is struck, the user perceives the striking surface F as actually being struck. When the hand in the virtual space is in the specific state, the consonant of the specific phoneme is sounded, and when the striking surface F in the virtual space is struck, the vowel of that specific phoneme is sounded. As understood from the above description, the striking surface F may be a surface in a virtual space. Likewise, the hand H may be a hand in a virtual space.
(13) As described above, the functions of the sound control device 20 exemplified above are realized by the cooperation of the one or more processors constituting the control device 21 and the program stored in the storage device 23. The program according to the present disclosure may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium (optical disc) such as a CD-ROM, but it encompasses any known form of recording medium such as a semiconductor recording medium or a magnetic recording medium. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and does not exclude volatile recording media. In a configuration in which a distribution device distributes the program via a communication network, the storage device 23 that stores the program in that distribution device corresponds to the non-transitory recording medium described above.
<Additional notes>
For example, the following configurations can be derived from the embodiments exemplified above.
A sound control method according to one aspect (aspect 1) of the present disclosure detects that an object is in a specific state while the object is moving toward a surface, sounds a first sound when the specific state is detected, detects a striking event in which the object strikes the surface as a result of the movement of the object, and sounds a second sound when the striking event is detected. In this aspect, the first sound is produced when the object reaches the specific state while moving toward the surface, and the second sound is produced when the object strikes the surface. The first sound can therefore be produced before the object strikes the surface. Moreover, since the second sound is produced upon detection of the strike of the object on the surface, the first sound can be produced before the second sound while the feel of the operation for producing the second sound is maintained.
In one example of aspect 1 (aspect 2), the first sound is a first phoneme, and the second sound is a second phoneme different from the first phoneme. In this aspect, the first phoneme is produced when the object reaches the specific state, and the second phoneme is produced following the first phoneme when the object strikes the surface. The first phoneme can therefore be produced before the object strikes.
In one example of aspect 2 (aspect 3), the first phoneme is a consonant and the second phoneme is a vowel. In this aspect, the consonant is produced when the object reaches the specific state, and the vowel is produced following the consonant when the object strikes the surface. The perception that the sounding of a phoneme composed of a consonant and a vowel is delayed can therefore be reduced.
In one example of aspect 1 (aspect 4), the first sound is a sound related to a preparatory movement for sounding, and the second sound is a sound following the preparatory movement. In this aspect, the sound related to the preparatory movement is produced when the object reaches the specific state, and the sound following the preparatory movement is produced when the object strikes the surface. The sound related to the preparatory movement for producing the target sound can therefore be produced before that sound.
In one example of any of aspects 1 to 4 (aspect 5), the specific state is the state in which the distance between the object and the surface is at a specific distance. In this aspect, the first sound is produced when the distance between the object and the surface reaches the specific distance. That is, the first sound is produced in a state partway through the movement of the object toward the surface. The first sound can therefore be produced without the user being conscious of any operation for producing it.
 In an example of any of aspects 1 to 5 (aspect 6), the striking event is detected by analyzing a sound signal generated by a sound collecting device. In this aspect, the strike of the object on the surface is detected from the signal picked up by the sound collecting device, so the striking sound produced by the strike on the surface can be used to trigger the second sound.
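 A rough sketch of one way such a strike could be detected from a microphone signal is given below: a percussive hit appears as a sudden jump in short-term energy. The frame size and threshold are assumptions, not values from the disclosure.

    import numpy as np

    FRAME = 256        # samples per analysis frame (assumed)
    ENERGY_JUMP = 8.0  # assumed energy ratio marking a percussive onset

    def detect_strike(samples: np.ndarray) -> bool:
        """Return True if the newest frame of microphone samples looks like a hit."""
        if samples.size < 2 * FRAME:
            return False
        prev = float(np.mean(samples[-2 * FRAME:-FRAME] ** 2)) + 1e-12
        curr = float(np.mean(samples[-FRAME:] ** 2))
        return curr / prev > ENERGY_JUMP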
 In an example of any of aspects 1 to 6 (aspect 7), at least one of the moving speed of the object, the moving direction of the object, and the shape of the object is detected, and the production of at least one of the first sound and the second sound is controlled in accordance with the result of that detection. In this aspect, the user can control how the first sound and the second sound are produced by varying the speed, direction, or shape of the object.
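 The mappings from detected features to pronunciation parameters are left open by the disclosure; the sketch below shows assumed, illustrative mappings on a hypothetical synthesizer interface.

    def apply_features(speed_mm_s, direction_deg, is_fist, synth):
        """Map detected object features to pronunciation parameters (illustrative)."""
        # Faster strokes -> louder second sound, like key velocity on a keyboard.
        synth.second_sound_velocity = min(127, int(0.1 * speed_mm_s))
        # A shallow, glancing approach could soften the consonant attack.
        synth.first_sound_attack = "soft" if direction_deg < 45 else "normal"
        # The shape of the hand could select a timbre variant.
        synth.timbre = "dark" if is_fist else "bright"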
 A pronunciation control device according to one aspect of the present disclosure includes a detection unit that detects that an object is in a specific state while the object is moving toward a surface and detects a striking event in which the object strikes the surface as a result of its movement, and a pronunciation control unit that causes a first sound to be produced when the specific state is detected and causes a second sound to be produced when the striking event is detected.
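 Structurally, this device aspect splits the logic of the earlier loop sketch into the two claimed units. The following class sketch mirrors that split; all sensor and synthesizer interfaces are again hypothetical.

    class DetectionUnit:
        """Detects the specific state and the striking event (sensors assumed)."""
        def __init__(self, distance_sensor, hit_detector, threshold_mm=25.0):
            self.distance_sensor = distance_sensor
            self.hit_detector = hit_detector
            self.threshold_mm = threshold_mm

        def specific_state(self):
            return self.distance_sensor.read() <= self.threshold_mm

        def striking_event(self):
            return self.hit_detector.poll()

    class PronunciationControlUnit:
        """Produces the first and second sounds on the detection unit's cues."""
        def __init__(self, synth):
            self.synth = synth
            self._armed = True

        def update(self, detector):
            if self._armed and detector.specific_state():
                self.synth.play_first_sound()
                self._armed = False
            if detector.striking_event():
                self.synth.play_second_sound()
                self._armed = True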
 This application is based on Japanese Patent Application No. 2019-175253, filed on September 26, 2019, the contents of which are incorporated herein by reference.
 The pronunciation control method and pronunciation control device of the present disclosure can start pronunciation before an object such as a user's finger comes into contact with a surface such as a key.
 100 … Pronunciation control system
 10 … Operation unit
 11 … Operation reception unit
 112 … Housing
 114 … Light transmitting portion
 13 … First sensor
 15 … Second sensor
 20 … Pronunciation control device
 21 … Control device
 212 … Phoneme identification unit
 213 … Detection unit
 214 … Pronunciation control unit
 23 … Storage device
 25 … Sound emitting device
 31 … First detection unit
 32 … Second detection unit
 33 … Third detection unit
 F … Striking surface

Claims (9)

  1.  A pronunciation control method realized by a computer, the method comprising:
     detecting that an object is in a specific state while the object is moving toward a surface;
     causing a first sound to be produced when the specific state is detected;
     detecting a striking event in which the object strikes the surface as a result of the movement of the object; and
     causing a second sound to be produced when the striking event is detected.
  2.  The pronunciation control method according to claim 1, wherein
     the first sound is a first phoneme, and
     the second sound is a second phoneme different from the first phoneme.
  3.  The pronunciation control method according to claim 2, wherein
     the first phoneme is a consonant, and
     the second phoneme is a vowel.
  4.  The pronunciation control method according to claim 1, wherein
     the first sound is a sound associated with a preparatory action for pronunciation, and
     the second sound is a sound following the preparatory action.
  5.  The pronunciation control method according to any one of claims 1 to 4, wherein the specific state is a state in which the distance between the object and the surface is a specific distance.
  6.  The pronunciation control method according to claim 5, wherein the specific distance is changed according to the type of the consonant of the first phoneme.
  7.  The pronunciation control method according to any one of claims 1 to 6, wherein, in detecting the striking event of the object, the striking event is detected by analyzing a sound signal representing the sound of the strike picked up by a sound collecting device.
  8.  The pronunciation control method according to any one of claims 1 to 7, further comprising:
     detecting feature data representing at least one of a moving speed of the object, a moving direction of the object, and a shape of the object; and
     controlling at least one of the production of the first sound and the production of the second sound in accordance with the detected feature data.
  9.  A pronunciation control device comprising:
     a detection unit that detects that an object is in a specific state while the object is moving toward a surface, and detects a striking event in which the object strikes the surface as a result of the movement of the object; and
     a pronunciation control unit that causes a first sound to be produced when the specific state is detected and causes a second sound to be produced when the striking event is detected.
PCT/JP2020/035785 2019-09-26 2020-09-23 Sound output control method and sound output control device WO2021060273A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-175253 2019-09-26
JP2019175253A JP7380008B2 (en) 2019-09-26 2019-09-26 Pronunciation control method and pronunciation control device

Publications (1)

Publication Number Publication Date
WO2021060273A1 2021-04-01

Family

ID=75157779

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/035785 WO2021060273A1 (en) 2019-09-26 2020-09-23 Sound output control method and sound output control device

Country Status (2)

Country Link
JP (1) JP7380008B2 (en)
WO (1) WO2021060273A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004061753A (en) * 2002-07-26 2004-02-26 Yamaha Corp Method and device for synthesizing singing voice
JP2014098801A (en) * 2012-11-14 2014-05-29 Yamaha Corp Voice synthesizing apparatus
JP2014186307A (en) * 2013-02-22 2014-10-02 Yamaha Corp Voice synthesis device
JP2017146555A (en) * 2016-02-19 2017-08-24 ヤマハ株式会社 Performance support device and method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022255052A1 (en) * 2021-06-03 2022-12-08 ヤマハ株式会社 Percussion system

Also Published As

Publication number Publication date
JP2021051249A (en) 2021-04-01
JP7380008B2 (en) 2023-11-15

Similar Documents

Publication Publication Date Title
US10490181B2 (en) Technology for responding to remarks using speech synthesis
JP6140579B2 (en) Sound processing apparatus, sound processing method, and sound processing program
JP5821824B2 (en) Speech synthesizer
JP4457983B2 (en) Performance operation assistance device and program
JP5162938B2 (en) Musical sound generator and keyboard instrument
US20020011143A1 (en) Musical score display for musical performance apparatus
JP2002268699A (en) Device and method for voice synthesis, program, and recording medium
US8785761B2 (en) Sound-generation controlling apparatus, a method of controlling the sound-generation controlling apparatus, and a program recording medium
JP7367641B2 (en) Electronic musical instruments, methods and programs
JP5040778B2 (en) Speech synthesis apparatus, method and program
JP2016080827A (en) Phoneme information synthesis device and voice synthesis device
CN112466266A (en) Control system and control method
WO2021060273A1 (en) Sound output control method and sound output control device
JP4654513B2 (en) Musical instrument
JP5151401B2 (en) Audio processing device
JP2007256412A (en) Musical sound controller
Li et al. Acoustic and articulatory analysis on Mandarin Chinese vowels in emotional speech
JP7024864B2 (en) Signal processing equipment, programs and sound sources
JP2017146555A (en) Performance support device and method
JP4244338B2 (en) SOUND OUTPUT CONTROL DEVICE, MUSIC REPRODUCTION DEVICE, SOUND OUTPUT CONTROL METHOD, PROGRAM THEREOF, AND RECORDING MEDIUM CONTAINING THE PROGRAM
JP6090043B2 (en) Information processing apparatus and program
JP2017146557A (en) Performance support device and method
JP3584585B2 (en) Electronic musical instrument
JP4544258B2 (en) Acoustic conversion device and program
JP2002304187A (en) Device and method for synthesizing voice, program and recording medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20867292

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20867292

Country of ref document: EP

Kind code of ref document: A1