WO2010084830A1 - Voice processing device, chat system, voice processing method, information storage medium, and program - Google Patents


Info

Publication number
WO2010084830A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
unit
data
predetermined
Prior art date
Application number
PCT/JP2010/050442
Other languages
English (en)
Japanese (ja)
Inventor
森 昌二
Original Assignee
株式会社コナミデジタルエンタテインメント (Konami Digital Entertainment Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社コナミデジタルエンタテインメント (Konami Digital Entertainment Co., Ltd.)
Publication of WO2010084830A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315: Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455: Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis

Definitions

  • The present invention relates to a voice processing device, a chat system, a voice processing method, an information storage medium, and a program suitable for preventing inappropriate communication while still allowing some degree of communication between users.
  • Patent Literature 1 proposes a technique for enhancing the realism of voice chat by mixing the environmental sounds around a user in a virtual space with the voice that user utters, and transmitting the result to other users.
  • The present invention solves the problems described above; its object is to provide a voice processing device, a chat system, a voice processing method, an information storage medium, and a program suitable for preventing inappropriate conversations while still enabling some degree of communication between users.
  • The voice processing device according to one aspect of the present invention includes an input reception unit, an extraction unit, a generation unit, and an output unit, and is configured as follows.
  • First, the input reception unit receives an input of a voice uttered by the user.
  • Typically, the waveform data of the voice is acquired with a microphone, A/D (analog/digital) converted at a predetermined sampling frequency, and processed as a numeric sequence.
  • Meanwhile, the extraction unit extracts feature parameters of the received voice.
  • Typical feature parameters are the time variation of the waveform's amplitude or volume, of the fundamental frequency, of the magnitude of the fundamental-frequency component, or of the magnitude of a predetermined representative frequency component.
  • Such information can typically be extracted using a technique such as the discrete fast Fourier transform, as sketched below.
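  • For illustration only (the patent prescribes no implementation), the following is a minimal Python sketch of such an extraction step, assuming the received voice is already a mono PCM buffer held as a NumPy array; the function and field names are hypothetical.

```python
import numpy as np

def extract_features(buf: np.ndarray, rate: int) -> dict:
    """Extract a volume-like and a pitch-like feature parameter from one buffer."""
    volume = float(np.sum(buf ** 2))              # sum of squared displacements
    spectrum = np.abs(np.fft.rfft(buf))           # magnitude spectrum (discrete FFT)
    freqs = np.fft.rfftfreq(len(buf), d=1.0 / rate)
    fundamental = float(freqs[np.argmax(spectrum[1:]) + 1])  # strongest bin, skipping DC
    return {"volume": volume, "fundamental_hz": fundamental}

# Example: one 1/20-second buffer of a 200 Hz tone sampled at 8000 Hz.
rate = 8000
t = np.arange(rate // 20) / rate
params = extract_features(np.sin(2 * np.pi * 200 * t), rate)
```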
  • Meanwhile, the generation unit generates synthesized speech from predetermined audio data.
  • That is, the generation unit generates the synthesized speech by replacing the feature parameters of the predetermined audio data with the values of the extracted feature parameters.
  • As the predetermined audio data, waveforms such as sine waves, or voice and instrument sounds prepared in advance (for example, a voice actor's voice), can be used.
  • The difference between the predetermined audio data and the generated synthesized speech therefore lies only in the values of the feature parameters.
  • In other words, the feature parameters of the synthesized speech are obtained by replacing the feature parameters of the predetermined audio data with the values of the extracted feature parameters.
  • For example, if the amplitude or volume is adopted as the feature parameter, the synthesized speech is generated by changing the amplitude or volume of the predetermined audio data.
  • If the fundamental frequency is adopted, the synthesized speech is generated by changing the key of the predetermined audio data.
  • If the magnitude of a frequency component is adopted, the synthesized speech is generated by changing the magnitude of that component in the predetermined audio data.
  • Since the synthesized speech is thus no longer a "voice uttered by a human being", even if the user utters words or sentences, their content cannot be recovered from the synthesized speech; a sketch of this replacement follows.
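  • As an illustration of that replacement (again not part of the patent), the following sketch overwrites the volume and key of a stored sine wave with the parameters produced by the extract_features sketch above; all names are assumptions.

```python
import numpy as np

def synthesize(params: dict, rate: int, length: int) -> np.ndarray:
    """Impose the extracted feature parameters on predetermined audio data (a sine)."""
    t = np.arange(length) / rate
    tone = np.sin(2 * np.pi * params["fundamental_hz"] * t)   # replace the key
    target_rms = np.sqrt(params["volume"] / length)           # replace the volume
    return target_rms * np.sqrt(2.0) * tone                   # a sine's RMS is amplitude / sqrt(2)
```

No phoneme information enters this function, so words and sentences cannot be reconstructed from its output.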
  • Finally, the output unit outputs the generated synthesized speech.
  • The synthesized speech output here reflects changes in the user's emotions but cannot convey words or sentences. Therefore, even if the user makes a statement that violates privacy or offends public order and morals, the content of that statement is not transmitted to the other user.
  • Although the details of the user's remarks cannot be acquired as linguistic information, users can still communicate their feelings to one another; in particular, trouble arising from the content of messages between users can be suppressed.
  • In the voice processing device, the feature parameter can be configured to be the time variation of the waveform's amplitude or volume, of the fundamental frequency, of the magnitude of the fundamental-frequency component, or of the magnitude of a predetermined representative frequency component.
  • The predetermined representative frequency components may be obtained as the component magnitude at each of a plurality of predetermined frequencies, or as frequency/magnitude pairs for a predetermined number of the highest peaks in the frequency distribution.
  • In the voice processing device, the extraction unit can be configured to extract the feature parameters at a rate of less than 20 times per second.
  • In general, a sampling frequency of 40 kHz or more is required to reconstruct voice waveform data completely. Telephone speech, whose quality lacks frequency components of 4000 Hz and above, requires a sampling frequency of about 8000 Hz. Conversely, if feature parameters are extracted at a rate below 20 Hz, the amount of data to be processed drops drastically, and the linguistic information of the words and sentences carried by the voice is removed entirely; the arithmetic below makes the reduction concrete.
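  • A back-of-the-envelope comparison (the sample and parameter sizes are assumptions, not figures from the patent):

```python
# Telephone-grade PCM versus sub-20 Hz feature parameters.
pcm_bps = 8000 * 16          # 8000 samples/s x 16 bits = 128,000 bit/s
param_bps = 20 * 2 * 32      # 20 updates/s x 2 float32 parameters = 1,280 bit/s
print(pcm_bps // param_bps)  # -> 100-fold reduction, with no phonemes left to decode
```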
  • In the voice processing device, the extraction unit can perform a discrete Fourier transform on the received voice and extract, as feature parameters, the magnitudes of a predetermined plurality of frequency components from the resulting frequency distribution; the generation unit can then amplify waveform data pre-associated with each of those frequency components by the extracted magnitudes and mix the results to generate the synthesized speech.
  • That is, the extraction unit adopts as feature parameters the magnitudes of predetermined frequency components, or of the top predetermined number of peaks, which amounts to a mask that removes all other frequency components; the synthesized speech is generated from what remains.
  • Typically, the fundamental frequency of the waveform data pre-associated with each frequency component matches that component's center frequency, and the waveform data contains harmonics of its fundamental.
  • As the waveform data associated with each frequency component, waveform data having the same fundamental frequency but a different timbre can be adopted.
  • As such waveform data, sounds emitted by musical instruments can be employed: for example, a piano sound assigned to the frequency component of the strongest peak, a guitar sound to the second, and a bass sound to the third, as in the sketch below.
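  • An illustration of that mask-style variant (not part of the patent; instrument timbres are stubbed with a fundamental plus one harmonic): keep the top three spectral peaks and let each drive a pre-associated tone.

```python
import numpy as np

INSTRUMENTS = ["piano", "guitar", "bass"]           # one per peak rank, as in the text

def top_peaks(buf, rate, n=3):
    """Return (frequency, magnitude) for the n strongest spectral peaks."""
    spec = np.abs(np.fft.rfft(buf))
    freqs = np.fft.rfftfreq(len(buf), 1.0 / rate)
    strongest = np.argsort(spec[1:])[::-1][:n] + 1   # strongest bins, skipping DC
    return [(freqs[i], spec[i]) for i in strongest]

def render(peaks, rate, length):
    """Mix one tone per peak; `instrument` would select a real waveform table."""
    t = np.arange(length) / rate
    out = np.zeros(length)
    for (freq, mag), instrument in zip(peaks, INSTRUMENTS):
        # a fundamental plus one harmonic stands in for the instrument's timbre
        out += mag * (np.sin(2 * np.pi * freq * t) + 0.5 * np.sin(4 * np.pi * freq * t))
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out           # normalize to avoid clipping
```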
  • In the voice processing device, the generation unit can be configured to select, from a plurality of audio data candidates, the candidate closest to the extracted feature parameter and use the selected candidate as the predetermined audio data.
  • For example, audio data can be assigned in the order drum, bass, guitar, piano, in ascending order of fundamental frequency; a sketch of such a selection follows.
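  • A sketch of that selection (the candidate names and fundamentals are assumptions):

```python
# Stored audio data candidates, keyed by an assumed fundamental frequency in Hz.
CANDIDATES = {"drum": 80.0, "bass": 160.0, "guitar": 320.0, "piano": 640.0}

def pick_candidate(fundamental_hz: float) -> str:
    """Return the candidate whose fundamental is closest to the extracted one."""
    return min(CANDIDATES, key=lambda name: abs(CANDIDATES[name] - fundamental_hz))

assert pick_candidate(300.0) == "guitar"
```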
  • A chat system according to another aspect of the present invention includes a first voice processing device that receives an input of the voice uttered by a first user and outputs synthesized speech to a second user, and a second voice processing device that receives an input of the voice uttered by the second user and outputs synthesized speech to the first user.
  • Each of the first and second voice processing devices is a voice processing device as described above, including an input reception unit, an extraction unit, a generation unit, and an output unit.
  • That is, the input reception unit receives an input of a voice uttered by the user.
  • Meanwhile, the extraction unit extracts feature parameters of the received voice.
  • Further, the generation unit generates synthesized speech from predetermined audio data.
  • Then, the output unit outputs the generated synthesized speech.
  • Here, the generation unit generates the synthesized speech by replacing the feature parameters of the predetermined audio data with the values of the extracted feature parameters.
  • In addition, the extracted feature parameters are transmitted from the extraction unit to the generation unit via a computer communication network.
  • That is, the voice processing device described above is applied to voice chat, with the extraction unit and the generation unit connected by a computer communication network.
  • According to the present invention, by providing a system similar to voice chat, the details of a user's utterances cannot be obtained as linguistic information, yet users can still communicate their emotions to one another; in particular, trouble arising from the content of messages between users can be suppressed. A sketch of the network leg follows.
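  • For illustration, a sketch of that network leg; UDP and JSON are assumptions (the patent specifies no wire format), and only the tiny feature-parameter record ever crosses the network, never the raw waveform.

```python
import json
import socket

PEER = ("peer.example.invalid", 5005)    # hypothetical address of the other party
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_params(params: dict) -> None:
    # e.g. params == {"volume": 12.3, "fundamental_hz": 200.0}: a few dozen bytes,
    # versus kilobytes per second for raw PCM waveform data
    sock.sendto(json.dumps(params).encode("utf-8"), PEER)
```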
  • A voice processing method according to another aspect of the present invention is executed by a voice processing device having an input reception unit, an extraction unit, a generation unit, and an output unit, and includes an input reception step, an extraction step, a generation step, and an output step, configured as follows.
  • First, in the input reception step, the input reception unit receives an input of a voice uttered by the user.
  • Meanwhile, in the extraction step, the extraction unit extracts feature parameters of the received voice.
  • Further, in the generation step, the generation unit generates synthesized speech from predetermined audio data.
  • Here, the synthesized speech is generated by replacing the feature parameters of the predetermined audio data with the values of the extracted feature parameters.
  • Then, in the output step, the output unit outputs the generated synthesized speech.
  • An information storage medium according to another aspect of the present invention stores a program that causes a computer to function as an input reception unit, an extraction unit, a generation unit, and an output unit.
  • That is, the input reception unit receives an input of a voice uttered by the user.
  • Meanwhile, the extraction unit extracts feature parameters of the received voice.
  • Further, the generation unit generates synthesized speech from predetermined audio data.
  • Then, the output unit outputs the generated synthesized speech. Here, the generation unit generates the synthesized speech by replacing the feature parameters of the predetermined audio data with the values of the extracted feature parameters.
  • The medium thus makes it possible to cause a computer to function as a voice processing device that operates as described above.
  • A program according to another aspect of the present invention causes a computer to function as an input reception unit, an extraction unit, a generation unit, and an output unit.
  • That is, the input reception unit receives an input of a voice uttered by the user.
  • Meanwhile, the extraction unit extracts feature parameters of the received voice.
  • Further, the generation unit generates synthesized speech from predetermined audio data.
  • Then, the output unit outputs the generated synthesized speech. Here, the generation unit generates the synthesized speech by replacing the feature parameters of the predetermined audio data with the values of the extracted feature parameters.
  • The program thus makes it possible to cause a computer to function as a voice processing device that operates as described above.
  • The program of the present invention can be recorded on a computer-readable information storage medium such as a compact disc, flexible disk, hard disk, magneto-optical disk, digital video disc, magnetic tape, or semiconductor memory.
  • The above program can be distributed and sold via a computer communication network independently of the computer on which it is executed.
  • The information storage medium can likewise be distributed and sold independently of any computer.
  • As described above, the present invention can provide a voice processing device, a chat system, a voice processing method, an information storage medium, and a program suitable for preventing inappropriate conversations while enabling some degree of communication between users.
  • FIG. 1 is a schematic diagram showing the general configuration of a typical information processing device. FIG. 2 is an explanatory diagram showing the general configuration of the voice processing device according to the embodiment of the present invention and of a chat system using that device. FIG. 3 is a flowchart showing the flow of control of the transmission process performed by the voice processing device according to the embodiment. FIG. 4 is a flowchart showing the flow of control of the reception process performed by the voice processing device according to the embodiment.
  • FIG. 1 is a schematic diagram showing the general configuration of a typical information processing device that can function as the voice processing device of the present embodiment by executing a program.
  • A description will be given below with reference to FIG. 1.
  • The information processing device 100 includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an interface 104, a controller 105, an external memory 106, an image processing unit 107, a DVD-ROM (Digital Versatile Disc ROM) drive 108, a NIC (Network Interface Card) 109, an audio processing unit 110, and a microphone 111.
  • Various input/output devices can be omitted as appropriate.
  • The voice processing device of the present embodiment is realized by loading a DVD-ROM storing the game program and data into the DVD-ROM drive 108 and turning on the information processing device 100 so that the program executes.
  • Alternatively, the voice processing device can be realized by inserting a flash memory or the like on which the program is recorded and executing that program.
  • A plurality of terminal devices and a server device can also work together to function as a chat system.
  • In that case, the server device typically has essentially the same configuration as the information processing device 100, apart from differences in computing capability and peripheral configuration. The server device may also be responsible only for introducing the terminal devices to one another, after which the terminal devices communicate peer-to-peer to form the chat system.
  • The CPU 101 controls the overall operation of the information processing device 100 and is connected to each component, exchanging control signals and data with them.
  • Using an ALU (Arithmetic Logic Unit) (not shown), the CPU 101 can perform, on a fast-access storage area called a register (not shown), arithmetic operations such as addition, subtraction, multiplication, and division; logical operations such as logical OR, logical AND, and logical NOT; and bit operations such as bitwise OR, bitwise AND, bit inversion, bit shifts, and bit rotations.
  • The CPU 101 itself may also be designed, or provided with a coprocessor, so that saturation arithmetic for multimedia processing (addition, subtraction, multiplication, division, and the like) and operations such as trigonometric functions can be performed at high speed.
  • The ROM 102 records an IPL (Initial Program Loader) that is executed immediately after power-on; when it runs, the program recorded on the DVD-ROM is read into the RAM 103, and execution by the CPU 101 begins.
  • The ROM 102 also stores the operating-system program and various data necessary for controlling the operation of the information processing device 100 as a whole.
  • The RAM 103 temporarily stores data and programs, and holds the programs and data read from the DVD-ROM as well as other data needed for game progress and chat communication.
  • The CPU 101 reserves a variable area in the RAM 103 and either operates on the stored values directly with the ALU, or copies the values from the RAM 103 into registers, operates on the registers, and writes the results back to memory.
  • The controller 105, connected via the interface 104, receives the operation inputs the user makes while playing the game.
  • The controller 105 need not be externally attached to the information processing device 100; it may be formed integrally with it.
  • The controller 105 of a portable terminal device consists of various buttons and switches, and treats their presses as operation inputs.
  • When a touch screen is used, the trace of the user's pen or finger on the touch screen is treated as the operation input.
  • In the external memory 106, detachably connected via the interface 104, data indicating the play status of the game (past results and the like), data indicating the progress of the game, and chat-communication logs from network play are stored rewritably. The user can record such data to the external memory 106 as needed by entering instructions via the controller 105.
  • On a DVD-ROM mounted in the DVD-ROM drive 108, the program for realizing the game and the image and sound data accompanying it are recorded. Under the control of the CPU 101, the DVD-ROM drive 108 reads from the mounted DVD-ROM the necessary programs and data, which are temporarily stored in the RAM 103 or elsewhere.
  • The image processing unit 107 processes data read from the DVD-ROM using the CPU 101 or an image arithmetic processor (not shown) included in the image processing unit 107, and then records the result in a frame memory (not shown) within the image processing unit 107.
  • The image information recorded in the frame memory is converted into a video signal at predetermined synchronization timing and output to a monitor (not shown) connected to the image processing unit 107. This makes various kinds of image display possible.
  • A small liquid crystal display typically serves as the monitor of a portable game device.
  • When a touch screen is used, its display panel functions as the monitor.
  • For a game device played at home or for a server device, a display such as a CRT (Cathode Ray Tube) or a plasma display can be used as the monitor.
  • The image arithmetic processor can execute two-dimensional image overlay operations, transparency operations such as alpha blending, and various saturation operations at high speed.
  • It can also rapidly render polygon information that is placed in a virtual three-dimensional space and annotated with various texture information, using the Z-buffer method, to obtain an image of the polygons as viewed from a predetermined viewpoint position along a predetermined line of sight.
  • Furthermore, the CPU 101 and the image arithmetic processor can cooperate to draw character strings, according to font information defining the character shapes, as two-dimensional images in the frame memory or on the surfaces of individual polygons.
  • The NIC 109 connects the information processing device 100 to a computer communication network (not shown) such as the Internet. It comprises a device conforming to the 10BASE-T/100BASE-T standard used to build LANs, an analog modem for connecting to the Internet over a telephone line, an ISDN (Integrated Services Digital Network) modem, an ADSL (Asymmetric Digital Subscriber Line) modem, a cable modem for connecting to the Internet over a cable-television line, or the like, together with an interface (not shown) that mediates between these and the CPU 101.
  • The audio processing unit 110 converts audio data read from the DVD-ROM into an analog audio signal and outputs it from a speaker (not shown) connected to it. Under the control of the CPU 101, it also generates the sound effects and music to be played during the progress of the game and outputs the corresponding sound from speakers, headphones (not shown), or earphones (not shown).
  • When the audio data recorded on the DVD-ROM is MIDI data, the audio processing unit 110 converts it into PCM data with reference to the sound-source data it holds. When the data is compressed audio in ADPCM or Ogg Vorbis format, it is decompressed into PCM data.
  • The PCM data is output as sound by performing D/A (digital/analog) conversion at timing corresponding to its sampling frequency and feeding the result to a speaker.
  • Furthermore, a microphone 111 can be connected to the information processing device 100 via the interface 104.
  • In that case, the analog signal from the microphone 111 is A/D converted at an appropriate sampling frequency so that it can be handled, as a PCM-format digital signal, in processing such as mixing in the audio processing unit 110.
  • In addition, the information processing device 100 may be configured to use a large-capacity external storage device, such as a hard disk, to perform the same functions as the ROM 102, the RAM 103, the external memory 106, a DVD-ROM mounted in the DVD-ROM drive 108, and so on.
  • A form in which a keyboard for accepting character-string editing input from the user, a mouse for accepting various position designations and selections, and the like are connected can also be adopted.
  • A general-purpose personal computer can also be used in place of the information processing device 100 of the present embodiment.
  • The information processing device 100 described above corresponds to a so-called consumer game device, but the voice processing device of the present invention can be realized on various computers, such as mobile phones, portable game devices, karaoke machines, and ordinary business computers.
  • For example, a general computer, like the information processing device 100, includes a CPU, RAM, ROM, a DVD-ROM drive, and a NIC; it has an image processing unit with simpler functions than that of the information processing device 100, has a hard disk as an external storage device, and can also use flexible disks, magneto-optical disks, magnetic tape, and the like. It uses a keyboard or a mouse, rather than a controller, as its input device.
  • FIG. 2 is an explanatory diagram showing a schematic configuration of a voice processing device according to the present embodiment and a chat system using the voice processing device.
  • The outline of each part of the voice processing device will be described below with reference to FIG. 2.
  • The chat system 211 includes two voice processing devices 201.
  • Each voice processing device 201 includes an input reception unit 202, an extraction unit 203, a generation unit 204, and an output unit 205.
  • The input reception unit 202 receives an input of a voice uttered by the user.
  • Typically, the microphone 111 functions as the input reception unit 202 under the control of the CPU 101.
  • Meanwhile, the extraction unit 203 extracts the feature parameters of the received voice.
  • Typically, the CPU 101 and the audio processing unit 110 work together as the extraction unit 203.
  • Further, the generation unit 204 generates synthesized speech from predetermined audio data.
  • The synthesized speech generated here is obtained by replacing the feature parameters of the predetermined audio data with the feature parameters extracted by the extraction unit 203.
  • Typically, the CPU 101 and the audio processing unit 110 function as the generation unit 204.
  • Then, the output unit 205 outputs the generated synthesized speech.
  • Typically, the audio processing unit 110 drives speakers or headphones to fulfill the function of the output unit 205.
  • In the present embodiment, the chat system 211 and its two voice processing devices 201 are realized by two information processing devices 100 used by the two users A and B; within each voice processing device 201, the feature parameters are transmitted from the extraction unit 203 to the generation unit 204 by communication over a computer communication network.
  • That is, the information processing device 100 used by user A functions as the input reception unit 202 and the extraction unit 203 for the voice uttered by user A, and as the generation unit 204 and the output unit 205 for the voice uttered by user B.
  • Likewise, the information processing device 100 used by user B functions as the input reception unit 202 and the extraction unit 203 for the voice uttered by user B, and as the generation unit 204 and the output unit 205 for the voice uttered by user A.
  • FIG. 3 is a flowchart showing the flow of control of the transmission process performed by the voice processing device 201; it corresponds to the processing performed by the input reception unit 202 and the extraction unit 203.
  • First, the CPU 101 initializes the voice-waveform input function of the microphone 111 and the RAM 103 (step S301).
  • That is, two buffers, each able to record the waveform data input from the microphone 111 for a predetermined time length, are reserved in the RAM 103, and their contents are cleared to zero.
  • The sampling frequency of the waveform data from the microphone 111 can be varied according to the capability and settings of the audio processing unit 110; it is typically set to 44100 Hz, 22050 Hz, or 11025 Hz, and the A/D conversion precision is typically 8-bit or 16-bit monaural.
  • The predetermined time length stored in a buffer is typically an integer multiple of the vertical-synchronization interrupt period of the information processing device 100 realizing the voice processing device 201. For example, when the vertical-synchronization interrupt period is 1/60 second (corresponding to 60 Hz), the buffer time length is typically 1/60 second, 1/30 second, or 1/20 second.
  • The 1/20-second value corresponds to 20 Hz, the lower limit of the humanly audible frequency range; it marks the boundary between variation that a human perceives as a "change in volume (magnitude of vibration)" and variation perceived as a "change in timbre (waveform)". Typically this time length is employed.
  • Next, accumulation of the waveform data from the microphone 111 into one of the buffers in the RAM 103 is started (step S302), and in parallel with this, the following processing is performed on the other buffer in the RAM 103.
  • That is, feature parameters are extracted from the waveform data sequence in that buffer (step S303).
  • For example, letting the waveform data sequence stored in the buffer be a_1, a_2, …, a_L, volume-like statistics of the displacements a_t can be adopted, for example (4) the sum of squares Σ_{t=1}^{L} a_t^2; candidates (1) to (3), referred to below, are statistics of the same kind. These are feature parameters corresponding to the volume of the voice input from the microphone 111. More complicated feature parameters will be described later.
  • The feature parameter is then transmitted to the other party's information processing device 100 via the NIC 109 of this information processing device 100 (step S304), and the process waits until the accumulation into the buffer started in step S302 completes (step S305).
  • While waiting, other processing can be executed in parallel in coroutine fashion; typically, the reception process described later is performed in parallel.
  • When the accumulation into the buffer completes, the roles of the two buffers are exchanged (step S306), and the process returns to step S302. A sketch of this double-buffered loop follows.
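  • A minimal sketch of steps S301 to S306 (not from the patent), assuming a hypothetical record_into() that fills one buffer asynchronously and returns a handle whose wait() blocks until the buffer is full, together with the extract_features() sketch shown earlier:

```python
import numpy as np

RATE = 22050
BUF_LEN = RATE // 20                     # 1/20-second double buffers (S301)
bufs = [np.zeros(BUF_LEN), np.zeros(BUF_LEN)]

def transmit_loop(record_into, send):
    active = 0
    while True:
        pending = record_into(bufs[active])                  # S302: start accumulating
        params = extract_features(bufs[1 - active], RATE)    # S303: process other buffer
        send(params)                                         # S304: transmit parameters
        pending.wait()                                       # S305: wait for the buffer
        active = 1 - active                                  # S306: swap buffer roles
```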
  • FIG. 4 is a flowchart showing the flow of control of the reception process performed by the voice processing device 201; it corresponds to the processing performed by the generation unit 204 and the output unit 205.
  • A description will be given below with reference to FIG. 4.
  • First, the CPU 101 starts outputting the predetermined audio waveform data at volume 0 (step S401).
  • As the predetermined audio waveform data, various data can be adopted: sine waves, square waves, instrument waveform data prepared via MIDI, voice data such as a voice actor's voice, and so on.
  • Next, the NIC 109 is controlled so as to wait until a feature parameter transmitted from the other party's information processing device 100 arrives (step S402).
  • While waiting, other processing can be executed in parallel in coroutine fashion.
  • Typically, the transmission process described above is performed in parallel.
  • When a feature parameter arrives, it is received (step S403).
  • Then, the output volume of the predetermined audio waveform data whose output was started in step S401 is changed to a volume proportional to the received feature parameter (step S404), and the process returns to step S402.
  • As a result, the reception-side user hears a sound whose volume changes in accordance with the volume of the voice uttered by the transmission-side user, as the sketch below illustrates.
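  • A minimal sketch of steps S401 to S404 (not from the patent), assuming a hypothetical player object that loops the predetermined waveform data and exposes a volume control, and a recv() that blocks until a feature parameter arrives:

```python
GAIN = 1e-3   # proportionality constant between the parameter and output volume (assumed)

def receive_loop(player, recv):
    player.set_volume(0.0)                           # S401: start output at volume 0
    while True:
        params = recv()                              # S402-S403: wait for and receive parameters
        player.set_volume(GAIN * params["volume"])   # S404: volume proportional to parameter
```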
  • In general, the correlation between the loudness of the voice and emotion does not depend much on which language is used.
  • Moreover, since the phonemes are by assumption unknown on the receiving side and the exchange is by design not understandable as language, communication can be promoted even when the transmitting and receiving users share no common language, because no language barrier exists.
  • In the above example, the loudness of the voice is extracted as the feature parameter and the volume of the output sound is changed accordingly, but this aspect can be modified in various ways.
  • For example, a method that also adopts the fundamental frequency can be considered.
  • That is, the waveform data sequence a_1, a_2, …, a_L accumulated in the buffer may be subjected to a discrete fast Fourier transform to obtain the peak frequency with the largest component.
  • Then, the fundamental frequency and any one of the above (1) to (4) are combined and transmitted as the feature parameter to the other party's information processing device 100.
  • In step S404, in addition to changing the volume, the pitch (frequency, or key) at which the predetermined waveform data is reproduced is changed to the fundamental frequency in the received feature parameter.
  • Alternatively, the playback frequency of the voice waveform data may be shifted in accordance with the received feature parameter; this corresponds to the "key change" control familiar from karaoke and the like.
  • That is, the key of the voice waveform data may be moved up and down to follow the rise and fall of the reproduction frequency specified by the transmitted feature parameter.
  • In this way, the pitch of the voice as well as its loudness is conveyed to the other party, and the user's emotions can be understood in more detail through intonation, further promoting communication. A sketch of such a key change follows.
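  • A minimal sketch of that key change (not from the patent), using nearest-neighbour resampling to keep it dependency-free: replaying the stored waveform at the ratio of the target to the original fundamental shifts its pitch.

```python
import numpy as np

def repitch(wave: np.ndarray, f_base: float, f_target: float) -> np.ndarray:
    """Shift the key of `wave` from fundamental f_base to f_target by resampling."""
    ratio = f_target / f_base
    idx = (np.arange(int(len(wave) / ratio)) * ratio).astype(int)
    return wave[idx]    # reads faster (higher key) or slower (lower key)
```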
  • Alternatively, the magnitudes of the frequency components at a plurality of predetermined frequencies may be used as the feature parameters.
  • In this case, waveform data is prepared for each of the plurality of frequencies, and the amplification factor of each waveform is made proportional to the magnitude of the corresponding frequency component; typically, it is also proportional to any one of the above (1) to (4).
  • For example, drums, bass, guitar, and piano occupy different pitch ranges, so the representative frequencies of these instruments can serve as the "predetermined plurality of frequencies".
  • Then, the volume of each instrument is changed according to the magnitude of the component extracted at that instrument's representative frequency from the Fourier-transform result.
  • As a result, human speech is reproduced as if performed by a jazz band.
  • Alternatively, a drum band, a bass band, a guitar band, and a piano band of frequencies may be fixed in advance, and the peak within each band selected from the Fourier-transform result.
  • One or more peaks may be selected per band; for example, since the piano can cover a wider frequency band than the other instruments, the number of peaks selected for it can be increased accordingly.
  • In this case, the output pitch of each instrument's waveform data is tuned to the selected peak frequency, and its volume is changed according to the magnitude of that peak's component.
  • When several peaks are selected for one instrument, that instrument may be set to play several notes simultaneously; the sketch below shows such band-based peak picking.
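  • A sketch of that band-based peak picking (the band edges and per-band peak counts are assumptions):

```python
import numpy as np

BANDS = {"drum": (40, 120), "bass": (120, 300), "guitar": (300, 900), "piano": (900, 4000)}
PEAKS_PER_BAND = {"drum": 1, "bass": 1, "guitar": 1, "piano": 2}   # piano covers more range

def band_peaks(buf, rate):
    """Pick the strongest peak(s) inside each instrument's frequency band."""
    spec = np.abs(np.fft.rfft(buf))
    freqs = np.fft.rfftfreq(len(buf), 1.0 / rate)
    picked = {}
    for name, (lo, hi) in BANDS.items():
        bins = np.where((freqs >= lo) & (freqs < hi))[0]
        best = bins[np.argsort(spec[bins])[::-1][:PEAKS_PER_BAND[name]]]
        picked[name] = [(float(freqs[i]), float(spec[i])) for i in best]
    return picked   # each instrument then plays its peaks' pitches at matching volumes
```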
  • As described above, according to the present invention, a voice processing device, a chat system, a voice processing method, an information storage medium, and a program suitable for preventing inappropriate conversations while enabling some degree of communication between users can be provided.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A chat system (211) is composed of two voice processing devices (201). Each voice processing device (201) is provided with: an input reception unit (202) that receives an input of a voice uttered by a user; an extraction unit (203) that extracts the feature parameter of the received voice; a generation unit (204) that generates synthesized speech from predetermined voice data; and an output unit (205) that outputs the generated synthesized speech. Typically, using as parameters the amplitude of the waveform or the sound volume, or the time variation of the magnitude of the fundamental-frequency component or of a predetermined representative frequency component, the feature parameter of the predetermined voice data is replaced with the extracted feature parameter, thereby generating the synthesized speech.
PCT/JP2010/050442 2009-01-23 2010-01-15 Voice processing device, chat system, voice processing method, information storage medium, and program WO2010084830A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009012753A JP2010169925A (ja) 2009-01-23 2009-01-23 音声処理装置、チャットシステム、音声処理方法、ならびに、プログラム
JP2009-012753 2009-06-05

Publications (1)

Publication Number Publication Date
WO2010084830A1 (fr) 2010-07-29

Family

ID=42355884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/050442 WO2010084830A1 (fr) 2009-01-23 2010-01-15 Dispositif de traitement vocal, système de salon de discussion, procédé de traitement vocal, support de mémorisation d'informations, et programme

Country Status (3)

Country Link
JP (1) JP2010169925A (fr)
TW (1) TW201040940A (fr)
WO (1) WO2010084830A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5664480B2 (ja) * 2011-06-30 2015-02-04 富士通株式会社 異常状態検出装置、電話機、異常状態検出方法、及びプログラム
JP6887102B2 (ja) * 2016-02-29 2021-06-16 パナソニックIpマネジメント株式会社 音声処理装置、画像処理装置、マイクアレイシステム、及び音声処理方法
KR102526699B1 (ko) * 2018-09-13 2023-04-27 라인플러스 주식회사 통화 품질 정보를 제공하는 방법 및 장치

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01173098A (ja) * 1987-12-28 1989-07-07 Komunikusu:Kk 電子楽器
JPH0413187A (ja) * 1990-05-02 1992-01-17 Brother Ind Ltd ボイスチェンジャー機能付楽音発生装置
JPH0527771A (ja) * 1991-07-23 1993-02-05 Yamaha Corp 電子楽器
JPH05257467A (ja) * 1992-03-11 1993-10-08 Sony Corp 音声信号処理装置
JPH06102877A (ja) * 1992-09-21 1994-04-15 Sony Corp 音響構成装置
JPH0756589A (ja) * 1993-08-23 1995-03-03 Nippon Telegr & Teleph Corp <Ntt> 音声合成方法
JP2005086707A (ja) * 2003-09-10 2005-03-31 Yamaha Corp 遠隔地の様子を伝達する通信装置およびプログラム

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3806030B2 (ja) * 2001-12-28 2006-08-09 キヤノン電子株式会社 情報処理装置及び方法


Also Published As

Publication number Publication date
TW201040940A (en) 2010-11-16
JP2010169925A (ja) 2010-08-05

Similar Documents

Publication Publication Date Title
JP3949701B1 (ja) 音声処理装置、音声処理方法、ならびに、プログラム
US6826530B1 (en) Speech synthesis for tasks with word and prosody dictionaries
JP5306702B2 (ja) 年齢層推定装置、年齢層推定方法、ならびに、プログラム
Smus Web audio API: advanced sound for games and interactive apps
JP3419754B2 (ja) 入力音声をキャラクタの動作に反映させるエンタテインメント装置、方法および記憶媒体
CN108831437A (zh) 一种歌声生成方法、装置、终端和存储介质
WO2006093145A1 (fr) Dispositif de sortie vocale, méthode de sortie vocale, support d’enregistrement d’informations et programme
CN112289300B (zh) 音频处理方法、装置及电子设备和计算机可读存储介质
WO2010084830A1 (fr) Dispositif de traitement vocal, système de salon de discussion, procédé de traitement vocal, support de mémorisation d&#39;informations, et programme
JP2006189471A (ja) プログラム、歌唱力判定方法、ならびに、判定装置
JP3734801B2 (ja) カラオケ装置、音程判定方法、ならびに、プログラム
WO2022143530A1 (fr) Procédé et appareil de traitement audio, dispositif informatique et support de stockage
JP2002006900A (ja) 音声還元再生システム及び音声還元再生方法
JP2004240065A (ja) カラオケ装置、音声出力制御方法、ならびに、プログラム
JP4468963B2 (ja) 音声画像処理装置、音声画像処理方法、ならびに、プログラム
JP3878180B2 (ja) カラオケ装置、カラオケ方法、ならびに、プログラム
JP5357805B2 (ja) 音声処理装置、音声処理方法、ならびに、プログラム
JP4437993B2 (ja) 音声処理装置、音声処理方法、ならびに、プログラム
JP3892433B2 (ja) カラオケ装置、カラオケ方法、ならびに、プログラム
JP4294712B1 (ja) 音声処理装置、音声処理方法、ならびに、プログラム
JP3854263B2 (ja) カラオケ装置、カラオケ方法、ならびに、プログラム
JP4714230B2 (ja) 音声処理装置、音声処理方法、ならびに、プログラム
JP4563418B2 (ja) 音声処理装置、音声処理方法、ならびに、プログラム
JP3875203B2 (ja) カラオケ装置、歌唱力採点方法、ならびに、プログラム
Chen et al. Conan's Bow Tie: A Streaming Voice Conversion for Real-Time VTuber Livestreaming

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10733437

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10733437

Country of ref document: EP

Kind code of ref document: A1