WO2007015319A1 - Voice output apparatus, voice communication apparatus and voice output method - Google Patents


Info

Publication number
WO2007015319A1
WO2007015319A1 (PCT/JP2006/304390)
Authority
WO
WIPO (PCT)
Prior art keywords
output
voice
speech
phoneme
control means
Prior art date
Application number
PCT/JP2006/304390
Other languages
French (fr)
Japanese (ja)
Inventor
Kouji Hatano
Original Assignee
Matsushita Electric Industrial Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co., Ltd. filed Critical Matsushita Electric Industrial Co., Ltd.
Priority to JP2007503136A priority Critical patent/JPWO2007015319A1/en
Publication of WO2007015319A1 publication Critical patent/WO2007015319A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/66 Substation equipment with means for preventing unauthorised or fraudulent calling
    • H04M1/663 Preventing unauthorised calls to a telephone set

Definitions

  • Voice output apparatus, voice communication apparatus, and voice output method
  • The present invention relates to an apparatus that processes input voice and outputs the result, and more particularly to a voice output apparatus that can be used as a means of preventing unjust or malicious calls, such as prank telephone calls, in telephone communications.
  • Patent Document 1 Japanese Patent Laid-Open No. 2000-78246 (Page 6, Figure 1)
  • The present invention solves the above-described conventional problems by making the other party believe that communication with the user is impossible, without letting the other party learn the user's nationality.
  • An object of the present invention is to provide a voice communication apparatus that can deter further malicious calls.
  • The voice output apparatus of the present invention includes voice analysis means for extracting phoneme information from input voice, and voice output control means for instructing voice output based on the phoneme information.
  • Voice output means that outputs voice based on instructions from the voice output control means is used to output voice composed of random phonemes based on the phoneme information of the input voice.
  • The phoneme information includes information identifying the phonemes or syllables contained in the input voice, and the voice output control means replaces each phoneme or syllable with another according to a predetermined rule before output.
  • The phoneme information also includes information indicating whether or not the input voice is voiced,
  • and the voice output control means instructs the voice output based on this voiced/silent information.
  • The phoneme information further includes information indicating the fundamental frequency of the input voice,
  • and the voice output control means determines the fundamental frequency of the output voice based on this fundamental frequency information.
  • As a result, the change in the fundamental frequency of the output voice fluctuates with statistical properties similar to those of the change in the fundamental frequency of the input voice, which increases the naturalness of the output voice. It is therefore possible to make the other party believe that communication with the user is impossible, without giving them room to suspect that the output voice is a canned message or synthesized speech.
  • The present invention also constitutes a voice communication apparatus that further comprises communication means for executing communication processing and outputting the output voice to the communication destination.
  • The voice output method of the present invention includes a first step of extracting phoneme information from input voice, a second step of instructing output of output voice composed of random phonemes based on the phoneme information, and a third step of outputting the output voice based on the instruction.
  • FIG. 1 is a schematic configuration diagram of an audio output device according to Embodiment 1 of the present invention.
  • FIG. 2 is a configuration diagram of a mobile phone terminal according to Embodiment 1 of the present invention.
  • FIG. 3 is an external view of a mobile phone terminal according to Embodiment 1 of the present invention.
  • FIG. 4 is a diagram for explaining the contents of the phoneme replacement candidate table held by the voice output control means of the mobile phone terminal according to the first embodiment of the present invention.
  • FIG. 5 is a flowchart showing a processing procedure of voice output control means of the mobile phone terminal in the first embodiment of the present invention.
  • FIG. 6 is a diagram for explaining the operation of the mobile phone terminal according to the first embodiment of the present invention.
  • FIG. 7 is a configuration diagram of a mobile phone terminal according to Embodiment 2 of the present invention.
  • FIG. 8 is a diagram for explaining the contents of the syllable string table held by the fixed syllable holding means of the mobile phone terminal in the second embodiment of the present invention.
  • FIG. 9 is a first flowchart showing the processing procedure of the voice output control means of the mobile phone terminal in the second embodiment of the present invention.
  • FIG. 10 is a second flowchart showing the processing procedure of the voice output control means of the mobile phone terminal in the second embodiment of the present invention.
  • FIG. 11 is a third flowchart showing the processing procedure of the voice output control means of the mobile phone terminal in the second embodiment of the present invention.
  • FIG. 12 is a diagram illustrating a first operation of the mobile phone terminal according to the second embodiment of the present invention.
  • FIG. 13 is a diagram illustrating a second operation of the mobile phone terminal according to the second embodiment of the present invention.
  • FIG. 1 is a schematic configuration diagram of an audio output device 1 according to Embodiment 1 of the present invention.
  • The voice output apparatus 1 includes voice analysis means 2, voice output control means 3, and voice output means 4.
  • The voice analysis means 2 receives the input voice 201, executes voice analysis processing, extracts the phoneme information of the input voice 201, and outputs it as the phoneme information 202.
  • The "phoneme information" described in this specification refers to phonological information of speech obtained as a result of voice analysis. For example, a symbol or symbol string identifying the phonemes or syllables that make up the speech, information on the strength of the speech, information on the fundamental frequency, and information distinguishing voiced from silent sections all correspond to the "phoneme information" described in this specification.
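For concreteness, the kinds of "phoneme information" listed above could be carried in a record such as the following sketch. This is an illustration only; the field names and the example values are assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhonemeInfo:
    """One analysis result, in the spirit of the phoneme information 202."""
    voiced: bool            # the "ON"/"OFF" voiced-vs-silent flag
    phoneme: Optional[str]  # phoneme symbol such as "d" or "a"; None if silent
    f0: float               # fundamental frequency in Hz (0 if unavailable)

# Hypothetical example: a voiced /d/ at 120 Hz.
info = PhonemeInfo(voiced=True, phoneme="d", f0=120.0)
```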
  • The voice output control means 3 receives the phoneme information 202 and outputs a voice output instruction 302 instructing output of the output voice 402.
  • The voice output means 4 outputs the output voice 402 based on the voice output instruction 302.
  • The voice output control means 3 outputs the voice output instruction 302 based on the phoneme information 202 so that the output voice 402 is composed of random phonemes.
  • The voice analysis processing in the voice analysis means 2 may be performed using a known voice analysis device or voice analysis method, and its detailed description is therefore omitted in this specification.
  • Likewise, the voice output processing in the voice output means 4 may be performed using a known voice synthesis device or voice synthesis method, and its detailed description is omitted in this specification.
  • FIG. 2 is a configuration diagram when the audio output device 1 according to Embodiment 1 of the present invention is configured as the mobile phone terminal 100.
  • The mobile phone terminal 100 includes, in addition to the components described in FIG. 1, wireless communication means 5, a microphone 61, a speaker 62, signal switching means 7, a signal switching button 91, an on-hook button 92, and an off-hook button 93.
  • The voice analysis means 2 receives the input voice 201, executes voice analysis processing, calculates the short-term average power of the input voice 201, and determines whether the received input voice 201 is in a voiced or silent section. If voiced, the signal "ON" is output as part of the phoneme information 202; if silent, the signal "OFF" is output. Further, the voice analysis means 2 identifies phonemes by performing cepstrum analysis or the like on the input voice 201, and outputs phoneme symbols (for example, /p/, /a/) as part of the phoneme information 202. The voice analysis means 2 also outputs, as part of the phoneme information 202, fundamental frequency information representing the fundamental frequency of the input voice 201 every 30 milliseconds.
  • The voice output control means 3 determines the phonemes to be included in the output voice 402 and the fundamental frequency of the output voice 402 based on the phoneme symbols and the fundamental frequency information included in the phoneme information 202, and outputs them as the voice output instruction 302.
  • The voice output means 4 synthesizes and outputs the output voice 402 based on the phonemes and the fundamental frequency indicated by the voice output instruction 302. The voice output means 4 also performs articulatory coupling at phoneme transitions, interpolation of fundamental-frequency changes, and the like, so that the output voice 402 does not sound unnatural.
  • The wireless communication means 5 is the part that performs communication processing by connecting the mobile phone terminal 100 to a wireless public network (not shown). The wireless communication means 5 outputs a transmission signal 501 to the wireless public network, and outputs the reception signal 502 acquired from the wireless public network to the speaker 62.
  • The wireless communication means 5 starts outputting the transmission signal 501 to, and acquiring the reception signal 502 from, the wireless public network based on operation of the off-hook button 93. It terminates the connection with the wireless public network based on operation of the on-hook button 92.
  • The microphone 61 converts the user's voice into an electrical signal and outputs it as the microphone output signal 612.
  • The speaker 62 converts the reception signal 502 into air vibration and emits sound.
  • The signal switching means 7 is means for switching between two signals. By operating the signal switching button 91, either the output voice 402 or the microphone output signal 612 can be output as the transmission signal 501. That is, the user can select, by operating the signal switching button 91, whether the processed output voice 402 or the unprocessed microphone output signal 612 is output to the wireless public network and heard by the other party.
  • The signal switching means 7 outputs the output voice 402 as the transmission signal 501 in the initial state of the communication processing by the wireless communication means 5. This prevents the inconvenience that the other party, before the user realizes the call is malicious, hears the user's unguarded voice and thereby learns the user's personal information.
  • FIG. 3 is a diagram showing an appearance of the mobile phone terminal 100 in the embodiment of the present invention.
  • The signal switching button 91 is arranged on the side of the lower part of the casing so that the user can operate it without looking at his or her hand during a call.
  • FIG. 4 is a diagram showing the contents of the phoneme replacement candidate table T31 held by the audio output control means 3 of the mobile phone terminal 100 according to Embodiment 1 of the present invention.
  • The phoneme replacement candidate table T31 is a table showing candidates for other phonemes with which a phoneme included in the phoneme information 202 can be replaced.
  • Each record R311 to R318 of the phoneme replacement candidate table T31 consists of a first field F311 representing a phoneme p constituting the phoneme information 202, and a second field F312 representing candidates for another phoneme p' that replaces the phoneme p.
  • The voice output control means 3 generates the voice output instruction 302 by replacing the phoneme p constituting the phoneme information 202 with a phoneme p' according to a predetermined rule (hereinafter referred to as the phoneme replacement rule) based on the phoneme replacement candidate table T31. That is, the voice output control means 3 searches the records R311 to R318 of the phoneme replacement candidate table T31 for the record whose first field (F311) contains the phoneme p, selects a phoneme p' from the candidates indicated in the second field (F312) of that record, replaces the phoneme p with the phoneme p', and generates the voice output instruction 302.
  • FIG. 5 is a flowchart showing a processing procedure of audio output control means 3 of mobile phone terminal 100 according to Embodiment 1 of the present invention.
  • The voice output control means 3 first acquires the phoneme information 202 (step S101) and, based on whether the signal included in the phoneme information 202 is "ON" or "OFF", determines whether the input voice 201 is in a voiced section (step S102). If it is a voiced section (YES), the voice output control means 3 proceeds to step S103; if it is a silent section (NO), it returns to step S101.
  • In step S103, the voice output control means 3 determines whether the phoneme included in the phoneme information 202 acquired this time is the same as the phoneme included in the phoneme information 202 acquired last time. If it is the same (NO), proceed to step S105; if it is different (YES), proceed to step S104.
  • In step S104, the voice output control means 3 replaces the phoneme p constituting the phoneme information 202 with a new phoneme p' in accordance with the phoneme replacement rule, and adds it to the voice output instruction 302.
  • In step S105, the voice output control means 3 replaces the phoneme p constituting the phoneme information 202 with the phoneme p' obtained in the most recent execution of step S104, and adds it to the voice output instruction 302.
  • The determination in step S103 and the processing in step S105 serve to convert a section in which the same phoneme continues in the input voice 201 into a section in which another, likewise constant, phoneme continues.
  • In step S106, the voice output control means 3 calculates the fundamental frequency F0' from the fundamental frequency F0 indicated by the fundamental frequency information included in the phoneme information 202, according to the frequency conversion formula, and adds it to the voice output instruction 302.
  • The frequency conversion formula is as follows:
  • F0' = F0 × r × (random(0.4) + 0.8)
  • The coefficient r may be made changeable by user operation. Here, random(0.4) represents a nonnegative random number less than 0.4.
  • This disturbs the intonation of the output voice 402, preventing the other party from recognizing, from the intonation of the output voice 402, what language the input voice 201 is in.
  • In step S107, the voice output control means 3 outputs the voice output instruction 302 generated by the processing up to step S106 to the voice output means 4, and returns to step S101.
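A minimal sketch of steps S101 to S107 follows. Only the /d/ → {/k/, /g/} replacement entry appears later in this description; the other table entries, the frame representation, and the default value of r are invented for illustration.

```python
import random

# Phoneme replacement candidates, in the spirit of table T31 (FIG. 4).
# Only the "d" -> ["k", "g"] entry is taken from the description;
# the others are illustrative placeholders.
REPLACEMENTS = {"d": ["k", "g"], "a": ["i", "o"], "n": ["m", "r"]}

def convert(frames, r=0.5):
    """frames: list of (voiced, phoneme, f0) tuples from the analysis step.
    Returns (phoneme', f0') pairs following the flowchart of FIG. 5."""
    out, prev, prev_sub = [], None, None
    for voiced, p, f0 in frames:
        if not voiced:                 # S102: skip silent sections
            continue
        if p == prev:                  # S103/S105: same phoneme continues,
            p_sub = prev_sub           # reuse the previous substitute
        else:                          # S104: replace by the phoneme rule
            p_sub = random.choice(REPLACEMENTS.get(p, [p]))
        # S106: frequency conversion formula F0' = F0 * r * (random(0.4) + 0.8)
        f0_out = f0 * r * (random.uniform(0.0, 0.4) + 0.8)
        out.append((p_sub, f0_out))    # S107: pass on as the output instruction
        prev, prev_sub = p, p_sub
    return out
```

Note how a run of identical input phonemes maps to a run of one identical substitute, which is the point of the S103/S105 branch.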
  • FIG. 6 is a diagram showing the contents of the input voice 201 and the output voice 402 when the user of the mobile phone terminal 100 according to Embodiment 1 of the present invention utters the words "Who is this?" during a call.
  • the horizontal axis represents time and the vertical axis represents the fundamental frequency.
  • A line group L101 represents the change in the fundamental frequency of the input voice 201,
  • and a line group L102 represents the change in the fundamental frequency of the output voice 402.
  • The phoneme symbols written immediately above the line groups L101 and L102 indicate, respectively, the phonemes (D101 to D106) included in the phoneme information 202 and the phonemes (D111 to D116) indicated by the voice output instruction 302.
  • The operation of the mobile phone terminal 100 in the first embodiment of the present invention will be described below with reference to FIG. 6.
  • At time t101, the voice analysis means 2 analyzes the input voice 201 and extracts the phoneme information 202.
  • The phoneme information 202 includes information on the phoneme /d/ (D101) and the fundamental frequency F0, and the signal "ON" indicating a voiced section.
  • The voice output control means 3 executes processing according to the procedure shown in the flowchart of FIG. 5.
  • In step S101, the voice output control means 3 acquires the phoneme information 202.
  • In step S102, since the input voice 201 is in a voiced section ("ON") (YES), the voice output control means 3 proceeds to step S103.
  • In step S103, since the phoneme /d/ has been newly received, the determination is "YES", and the flow proceeds to step S104.
  • In step S104, the voice output control means 3 refers to the phoneme replacement candidate table T31 of FIG. 4 and selects a phoneme p' to replace the phoneme /d/ according to the phoneme replacement rule.
  • The voice output control means 3 takes as the phoneme p' a phoneme selected at random from /k/ and /g/. Here, /k/ is selected as the phoneme p'.
  • In step S106 of FIG. 5, the voice output control means 3 calculates the fundamental frequency F0' according to the frequency conversion formula, and in step S107 outputs the phoneme /k/ and the fundamental frequency F0' as the voice output instruction 302.
  • The voice output means 4 synthesizes and outputs the output voice 402 based on the voice output instruction 302.
  • The output voice 402 is voice consisting of the phoneme /k/ (D111), and its fundamental frequency is about half the fundamental frequency of the input voice 201.
  • Similarly, output voice 402 consisting of the phoneme /i/ (D112) is output from the voice output means 4.
  • The output voice 402 is sent to the wireless communication means 5 as the transmission signal 501 through the signal switching means 7, and is further output to the other party's terminal through the wireless public network.
  • In some sections, the voice analysis means 2 cannot extract the fundamental frequency and outputs 0 as the value of the fundamental frequency F0 in the phoneme information 202.
  • In such cases, the voice output control means 3 outputs 0 as the value of the fundamental frequency F0' in step S106, but the voice output means 4 interpolates between the fundamental frequency of the previously accepted phoneme /i/ (D112) and that of the next accepted phoneme /a/ (D114),
  • so that the fundamental frequency of the output voice 402 corresponding to the phoneme /r/ (D113) changes smoothly.
  • The fundamental frequency of the phoneme /i/ (D115) of the output voice 402 corresponding to the phoneme /a/ (D105) of the input voice 201 is, owing to the random-number term in the frequency conversion formula of step S106 in FIG. 5, slightly higher than the fundamental frequency of the phoneme /n/.
  • In this way, the Japanese input voice 201 meaning "Who is this?" is converted into the voice "Kirani johi?", which makes no sense as Japanese, and is output via the wireless public network to the other party's terminal.
  • As the first step, the voice analysis means 2 extracts the phoneme information from the input voice 201 and outputs the phoneme information 202.
  • As the second step, the voice output control means 3 outputs the voice output instruction 302 based on the phoneme information 202 by executing the procedure shown in the flowchart of FIG. 5, and as the third step, the voice output means 4 synthesizes the output voice 402 based on the voice output instruction 302 and outputs it.
  • In this way, random voice in which the phonemes of the input voice are replaced with other phonemes can be output to the other party's terminal.
  • As described above, the mobile phone terminal according to Embodiment 1 of the present invention outputs to the other party's terminal voice in which the phonemes of the input voice are replaced with other phonemes. Since this voice makes no sense to the other party, the other party can be made to believe that communication with the user is impossible, without learning the user's nationality.
  • Furthermore, in the mobile phone terminal according to Embodiment 1 of the present invention, the change in the fundamental frequency of the output voice fluctuates with statistical properties similar to those of the change in the fundamental frequency of the input voice. Since this increases the naturalness of the output voice, the other party can be made to believe that communication with the user is impossible, without being given room to suspect that the output voice is a canned message or synthesized speech.
  • The second embodiment is characterized in that the reception signal, which is the output of the wireless communication means 5, is used as the input voice.
  • FIG. 7 is a configuration diagram of the mobile phone terminal 200 according to Embodiment 2 of the present invention.
  • The mobile phone terminal 200 includes voice analysis means 2, voice output control means 3, voice output means 4, fixed syllable holding means 41, wireless communication means 5, a microphone 61, a speaker 62, signal switching means 7, a signal switching button 91, an on-hook button 92, and an off-hook button 93.
  • The fixed syllable holding means 41 is means for holding fixed syllable strings used to constitute the output voice 402.
  • The voice analysis means 2 accepts the reception signal 502, which is the output of the wireless communication means 5, as the input voice 201, executes voice analysis processing, calculates the short-term average power of the input voice 201, determines whether the received input voice 201 is in a voiced or silent section, and outputs the signal "ON" as the phoneme information 202 when voiced and the signal "OFF" when silent.
  • The voice output control means 3 generates, based on the phoneme information 202, a syllable string ID identifying the syllable string to be output as the output voice 402, and outputs it as the voice output instruction 302.
  • The voice output control means 3 also outputs an on-hook signal 304, which instructs the wireless communication means 5 to terminate the connection with the wireless public network.
  • The voice output means 4 synthesizes and outputs the output voice 402 based on the syllable string data acquired from the fixed syllable holding means 41 according to the syllable string ID included in the voice output instruction 302.
  • The fixed syllable holding means 41 holds a syllable string table, which is a table for obtaining syllable string data from a syllable string ID.
  • FIG. 8 is a diagram showing the contents of the syllable string table T41 held by the fixed syllable holding means 41 of the mobile phone terminal 200 according to Embodiment 2 of the present invention.
  • Each record R411 to R414 of the syllable string table T41 includes a first field F411 representing a syllable string ID and a second field F412 representing syllable string data corresponding to the syllable string ID.
  • The syllable string data is expressed as a sequence of pairs, each consisting of a syllable to be output as the output voice 402 (letters enclosed in square brackets [ ]) and a coefficient for the fundamental frequency of that syllable (a numerical value enclosed in parentheses ( )).
  • Note that the syllable string data is not limited to data that generates speech meaningless to the other party.
  • FIGS. 9 to 11 are flowcharts showing the processing procedure of audio output control means 3 of mobile phone terminal 200 according to Embodiment 2 of the present invention.
  • In step S211, the voice output control means 3 first acquires the phoneme information 202.
  • In step S212, the voice output control means 3 determines, based on whether the signal included in the phoneme information 202 is "ON" or "OFF", whether the input voice 201 is in a voiced section. If it is a voiced section (YES), the voice output control means 3 proceeds to step S221 in FIG. 10; if it is a silent section (NO), it proceeds to step S213.
  • In step S213, the voice output control means 3 determines whether the voice output means 4 is outputting the output voice 402. If it is outputting (YES), the process returns to step S211; if not (NO), it proceeds to step S214.
  • In step S214, the voice output control means 3 determines whether the silent section of the input voice 201 is the first silent section after the start of the call, that is, whether it is immediately after the start of the call. If so (YES), proceed to step S231 in FIG. 11; otherwise (NO), proceed to step S215.
  • In step S215, the voice output control means 3 generates a pseudorandom number d that is greater than or equal to 0 and less than 1, and branches depending on its value. If d is less than 0.2, proceed to step S216; if d is 0.2 or more and less than 0.9, proceed to step S217; if d is 0.9 or more, proceed to step S219.
  • In step S218, the voice output control means 3 outputs the syllable string ID selected in step S216 or S217 as the voice output instruction 302, and returns to step S211.
  • In step S219, the voice output control means 3 outputs the on-hook signal 304, whereby the wireless communication means 5 terminates the connection with the wireless public network.
  • Since the voice output control means 3 thus automatically instructs disconnection of the call after a while, the user need not operate the on-hook button 92 while listening to the voice of a malicious caller, which improves convenience.
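The branch on the pseudorandom number d in steps S215 to S219 can be sketched as follows. Which syllable strings steps S216 and S217 actually select is not specified in the description, so the IDs 2 and 3 below are hypothetical placeholders.

```python
import random

def silent_section_action(d=None):
    """Decide the action for a silent section (steps S215 to S219).
    Returns a syllable-string ID to speak, or "HANG_UP".
    IDs 2 and 3 are hypothetical placeholders."""
    if d is None:
        d = random.random()   # pseudorandom number, 0 <= d < 1
    if d < 0.2:
        return 2              # S216: select one kind of filler utterance
    elif d < 0.9:
        return 3              # S217: select another kind of filler utterance
    else:
        return "HANG_UP"      # S219: output the on-hook signal 304
```

Because the hang-up branch fires with probability 0.1 per silent section, the call is disconnected automatically after a short while, matching the behavior described above.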
  • Steps S221 to S229 in FIG. 10 are the processing procedure of the voice output control means 3 when the other party is speaking.
  • In this procedure, the voice output control means 3 generates a pseudorandom number d that is greater than or equal to 0 and less than 1, and branches depending on its value. If d is less than 0.998, the process proceeds to step S222; if d is 0.998 or more, it proceeds to step S229.
  • Steps S222 to S223 are processes for making the other party think that the user stopped speaking because the other party started speaking while the user was speaking. Through the processing of steps S222 to S223, the other party can be made to believe that the user falls silent and listens while the other party is speaking.
  • In step S229, the voice output control means 3 outputs the on-hook signal 304, whereupon the wireless communication means 5 terminates the connection with the wireless public network.
  • Steps S231 to S239 in FIG. 11 are the processing procedure of the voice output control means 3 when it is determined in step S214 of FIG. 9 that the silent section of the input voice 201 is the first silent section after the start of the call, that is, immediately after the start of the call.
  • In step S231, the voice output control means 3 determines whether the syllable string ID has already been output twice. If not (NO), the process proceeds to step S232; if it has (YES), the process proceeds to step S239 and the on-hook signal 304 is output.
  • In step S232, the voice output control means 3 selects syllable string ID 1, that is, a syllable string, such as a greeting, to be output immediately after the start of a call.
  • Steps S231 to S239 are processes for making the other party feel as though the user greeted them immediately after the start of the call and, receiving no response after greeting twice, hung up.
  • When the voice output control means 3 outputs the voice output instruction 302 in steps S218, S223, and S233, the voice output means 4 receives it and outputs the output voice 402 according to the following procedure.
  • The voice output means 4 obtains the syllable string data corresponding to the syllable string ID included in the voice output instruction 302 by searching the syllable string table T41 (FIG. 8).
  • The voice output means 4 then synthesizes and outputs the output voice 402 based on the syllable string data.
  • The voice output means 4 calculates the fundamental frequency F0' of the output voice 402 for each syllable according to the frequency calculation formula, based on the fundamental frequency coefficient α contained in the syllable string data.
  • The frequency calculation formula is as follows:
  • F0' = F0base × α × (random(0.4) + 0.8)
  • F0base is the initial value of the fundamental frequency of the output voice 402.
  • The value of F0base may be made changeable by user operation.
  • random(0.4) represents a nonnegative random number less than 0.4.
  • The voice output means 4 also interpolates the fundamental frequencies of adjacent syllables at syllable boundaries so that the output voice 402 does not sound unnatural.
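The per-syllable frequency calculation can be sketched as follows. Only the record "[ho](1.0), [la](1.2)" appears in this description; the value of F0base and the table layout are assumptions for illustration.

```python
import random

# Syllable string table in the spirit of T41 (FIG. 8). Only record 1,
# "[ho](1.0), [la](1.2)", appears in the description.
SYLLABLE_TABLE = {1: [("ho", 1.0), ("la", 1.2)]}

F0BASE = 200.0  # initial fundamental frequency of the output voice (assumed value, Hz)

def syllable_f0s(syllable_id, f0base=F0BASE):
    """Per-syllable fundamental frequencies from the frequency calculation
    formula F0' = F0base * alpha * (random(0.4) + 0.8)."""
    return [(syl, f0base * alpha * (random.uniform(0.0, 0.4) + 0.8))
            for syl, alpha in SYLLABLE_TABLE[syllable_id]]
```

Because the random term spans 0.8 to 1.2, a syllable with the smaller coefficient can still come out higher on a given call, which is exactly the fluctuation effect noted for FIG. 12.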
  • The specific operation of Embodiment 2 of the present invention will be described below.
  • In Embodiment 2, synthesized voice based on the result of analyzing the other party's voice as the input voice is output to the wireless public network as the output voice and conveyed to the other party.
  • First, the case where the other party makes a silent call will be described.
  • FIG. 12 is a diagram showing the contents of output sound 402 in mobile phone terminal 200 according to Embodiment 2 of the present invention.
  • the horizontal axis represents time and the vertical axis represents the fundamental frequency.
  • Syllable symbols D211 to D214 represent syllables included in the output speech 402.
  • The voice analysis means 2 analyzes the input voice 201 and extracts the phoneme information 202.
  • the extracted phoneme information 202 includes a signal “OFF” indicating silence.
  • In step S211, the voice output control means 3 acquires the phoneme information 202.
  • In step S212, since the phoneme information indicates a silent section ("OFF") (NO), the process proceeds to step S213.
  • In step S213, since the voice output control means 3 is not outputting the output voice 402 (NO), the process proceeds to step S214.
  • In step S214, since this is the first silent section after the start of the call (YES), the process proceeds to step S231 in FIG. 11.
  • In step S231, since the voice output control means 3 has not yet output the output voice 402 even once (NO), the process proceeds to step S232.
  • In step S232, the voice output control means 3 sets the syllable string ID to 1, and outputs the voice output instruction 302 in step S233.
  • The voice output means 4 receives the voice output instruction 302, refers to the syllable string table T41, obtains the syllable string data “[ho](1.0), [la](1.2)” (the second field F412 of record R411 in FIG. 8), and synthesizes the output voice 402 (D211 and D212 in FIG. 12).
  • The coefficient α in the syllable string data is set to 1.0 for the syllable [ho] and 1.2 for [la], so that the fundamental frequency of the syllable [la] is set higher.
  • In FIG. 12, however, the fundamental frequency of the output voice 402 is higher for the syllable [ho] (D211) than for the syllable [la] (D212). This is because the random-number term in the frequency calculation formula gives fluctuation to the fundamental frequency of the output voice 402.
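The inversion seen in FIG. 12 can be checked numerically: even though the coefficient favors [la] (1.2 vs. 1.0), the random term makes [ho] come out higher in a fraction of utterances. The following sketch assumes F0base = 120 Hz and a standard pseudo-random generator, both assumptions for illustration only.

```python
import random

def realized_f0(alpha, f0_base=120.0):
    # F0' = F0base * alpha * (random(0.4) + 0.8)
    return f0_base * alpha * (random.uniform(0.0, 0.4) + 0.8)

random.seed(0)
inversions = sum(
    realized_f0(1.0) > realized_f0(1.2)  # [ho] realized higher than [la]
    for _ in range(1000)
)
# With uniform fluctuation, about 15% of trials invert the intended order.
```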
  • In step S239, the on-hook signal 304 is output, so that the wireless communication means 5 terminates the connection with the wireless public network at time t202 in FIG. 12.
  • As described above, when the other party makes a silent call, the mobile phone terminal according to Embodiment 2 of the present invention outputs, a specified number of times, a voice that sounds to the other party like a fixed greeting phrase, and then automatically terminates the connection with the public network.
  • FIG. 13 is a diagram showing the contents of the input voice 201 and the output voice 402 in the mobile phone terminal 200 of the present embodiment.
  • the horizontal axis represents time.
  • FIG. 13 (a), FIG. 13 (b), and FIG. 13 (c) show the continuous operation divided into three for convenience.
  • Texts D301 to D305 represent the contents of the input voice 201, that is, the voice of the other party.
  • texts D311 to D315 represent the contents of the output voice 402.
  • information on the fundamental frequencies of the input sound 201 and the output sound 402 is omitted.
  • The operation of the mobile phone terminal 200 according to the second embodiment will be described with reference to FIG. 13.
  • the voice analysis means 2 analyzes the input voice 201 and extracts phonological information 202.
  • the extracted phoneme information 202 includes a signal “ON” indicating sound.
  • In step S211, the voice output control means 3 acquires the phoneme information 202.
  • In step S212, since the phoneme information indicates a voiced section ("ON") (YES), the process proceeds to step S221 in FIG. 10.
  • In step S221, the voice output control means 3 generates a pseudo-random number d.
  • d 0.2
  • d 0.999
  • In step S211, the voice output control means 3 acquires the phoneme information 202.
  • step S213 the phoneme information 202
  • In step S214, since this is not the first silent section (NO), the process proceeds to step S215.
  • In step S217, the voice output control means 3 randomly selects a syllable string ID.
  • Here, the ID 3 is selected.
  • The voice output means 4 receives the voice output instruction 302, refers to the syllable string table T41, obtains the syllable string data “[ki](1.0), [ru](0.9), [mi](0.9), [ji](1.2), [hi](1.1), [go](1.0), [che](1.3), [si](1.5)” (the second field F412 of record R413 in FIG. 8), and outputs the voice D312 (FIG. 13) as the output voice 402.
  • the voice output control means 3 proceeds to the process of FIG. 10 based on the determination in step S212 (YES).
  • the audio output control means 3 executes the processing of steps S222 and S223.
  • the audio output means 4 interrupts the output of the output audio 402 at time t303 in FIG.
  • The voice output control means 3 outputs the on-hook signal 304 in step S229, so that the wireless communication means 5 terminates the connection with the wireless public network at time t304.
  • As described above, the mobile phone terminal according to Embodiment 2 of the present invention outputs a voice that makes no sense to the other party, interjected while the other party is speaking.
  • After thus pretending that the user is talking, it automatically disconnects the call.
  • Since the mobile phone terminal according to Embodiment 2 of the present invention outputs randomly selected syllable strings to the call partner's terminal, the voice the call partner hears makes no sense. As a result, the call partner can be made to think that communication with the user is impossible, without learning the user's nationality.
  • Since the mobile phone terminal according to Embodiment 2 of the present invention can change its voice output depending on whether the other party's voice is present, it can output voice matched to the other party's utterance situation. It therefore gives the other party no room to suspect that the output voice is a standard message or synthesized voice, and reliably makes the other party think that communication with the user is impossible.
  • Since the mobile phone terminal according to Embodiment 2 of the present invention occasionally outputs a fixed-phrase voice, the naturalness of the output voice increases. It therefore gives the other party no room to suspect that the output voice is a standard message or synthesized voice, and can make the other party think that communication with the user is impossible.
  • Since the mobile phone terminal according to Embodiment 2 of the present invention stops its voice output when the other party starts speaking during output, the call partner can be made to think that the user is actually listening to the other party and responding. The call partner can thus be reliably made to think that communication with the user is impossible.
  • Since the mobile phone terminal according to Embodiment 2 of the present invention automatically terminates communication during the call, the caller is made to feel that continuing the call is useless and that an on-hook operation has been performed, and can thus be made to believe that communication with the user is impossible.
  • In the present embodiment, syllable string data prepared in advance is selected, but any means may be used as long as it can output a voice that is meaningless to the other party.
  • a random syllable string or phoneme string may be generated each time a voice is output.
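For instance, a generator of the kind suggested here, producing a fresh meaningless syllable string on every output instead of selecting from a stored table, might look like the following sketch. The syllable inventory and the length range are illustrative assumptions.

```python
import random

# Hypothetical consonant-vowel inventory; any set yielding pronounceable
# but meaningless syllables would serve the same purpose.
CONSONANTS = ["k", "s", "t", "n", "h", "m", "r", "g", "j", "ch"]
VOWELS = ["a", "i", "u", "e", "o"]

def random_syllable_string(n_min=4, n_max=8):
    """Generate a random syllable string each time a voice is output."""
    n = random.randint(n_min, n_max)
    return [random.choice(CONSONANTS) + random.choice(VOWELS) for _ in range(n)]
```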
  • In the present embodiment, the condition for determining whether to output the output voice or to disconnect the communication (the range of the value of d in the branch determinations of steps S215 and S221) is constant, but the condition may be varied, for example by increasing the probability of disconnecting the communication with each incoming call.
  • In the present embodiment, the terminal incorporates the voice output device of the present invention.
  • Alternatively, the voice output device of the present invention may be built into an external device connected to the terminal, such as an exchange, a repeater, or a server, and the voice of the user or the other party may be processed there to output the voice.
  • Although the embodiments have been described for a mobile phone terminal, the present invention is not limited to this; the same effect is obtained with a landline phone, an IP phone, voice chat, an intercom, and the like.
  • the audio output device of the present invention can be used for an apparatus that outputs a sound having randomness with respect to a user's input voice.
  • For example, it can also be used as a voice output device constituting an electronic pet, a pet robot, a toy, a game machine, game software, or the like.
  • As described above, the voice output device and voice output method of the present invention can make the caller of a malicious call think that communication with the user is impossible, without letting the caller learn the user's nationality. They are therefore effective for a voice communication apparatus that can suppress further malicious calls by the caller and that allows calls with unspecified persons.


Abstract

A mobile telephone terminal that, when receiving a telephone solicitation or an ill-disposed telephone call that tries to steal the identity information of its user, causes the caller to think it impossible to communicate with the user, without letting the caller know the nationality of the user. A mobile telephone terminal (100) comprises a voice analyzing means (2) that analyzes a user input voice (201), acquired by a microphone (61), to output phonological information (202); a voice output control means (3) that randomly replaces phonemes or syllables included in the phonological information (202) and outputs them as voice output instructions (302); a voice output means (4) that outputs, based on the voice output instructions (302), an output voice (402); a signal switching means (7) that outputs the output voice (402) as a transmitted telephone signal (501); and a wireless communication means (5) that outputs the transmitted telephone signal (501) to a wireless public network. In this way, the voice of the user is converted to meaningless voice, which is then heard by the caller.

Description

Specification
Audio output device, audio communication device, and audio output method
Technical field
[0001] The present invention relates to an apparatus that processes input speech and outputs speech, and more particularly to a voice output apparatus that can be used as a means of preventing unjust or fraudulent calls, such as prank calls, in telephone communications.
Background art
[0002] In recent years, "transfer fraud", in which people are deceived over the telephone into transferring money to a designated account so that the caller gains illicit profit, has become a social problem. There is also no end to unscrupulous sales practices that use the telephone to push expensive services and products that users do not want. Besides such malicious calls made for fraud or solicitation, it is known that there are also malicious calls that dial random telephone numbers to make the user speak, so as to obtain personal information such as the user's nationality, gender, and age and lead to later fraud or solicitation.
[0003] Conventionally, devices for repelling malicious calls such as prank calls have been proposed. For example, a telephone device has been proposed that converts the pitch period of the voice uttered by the user before the other party hears it. According to this invention, even when a female user speaks, the other party hears a male voice, so the other party can be made to think that the user is male and give up the prank calls (see, for example, Patent Document 1).
[0004] There is also a known method of making a caller give up prank calls by using the automatic answering function of an answering machine to play the caller a standard message such as "I cannot answer the phone right now".
Patent Document 1: Japanese Patent Laid-Open No. 2000-78246 (page 6, FIG. 1)
Disclosure of the invention
Problems to be solved by the invention
[0005] However, although the above conventional apparatus prevents the user's gender from becoming known to the other party, the content of the user's utterances is conveyed to the other party unchanged, so the user's nationality can still be determined. The apparatus is therefore insufficient as a countermeasure against malicious calls that attempt to obtain the user's personal information. Moreover, a person intent on fraud or solicitation learns that communication with the user is possible, and thus continues or repeats the malicious calls.
[0006] In the case of the conventional method of playing a standard message to the other party, the other party can easily tell that the message is not spoken by the user personally, and therefore does not give up until the user answers the phone, repeating the malicious call many times.
[0007] The present invention solves the above conventional problems, and aims to provide a voice communication device that can deter further malicious calls by making the other party think that communication with the user is impossible, without letting the other party learn the user's nationality.
Means for solving the problem
[0008] To solve the above conventional problems, the voice output device of the present invention is configured to output a voice of random phonemes based on the phoneme information of the input voice, using voice analysis means for extracting phoneme information from the input voice, voice output control means for instructing voice output based on the phoneme information, and voice output means for outputting voice based on the instruction of the voice output control means.
[0009] With the above configuration, a voice that makes no sense to the listener can be output based on the input voice, so the other party can be made to think that communication with the user is impossible, without learning the user's nationality.
[0010] In the voice output device of the present invention, the phoneme information includes information identifying the phonemes or syllables contained in the input voice, and the voice output control means determines the phonemes or syllables constituting the output voice by replacing those phonemes or syllables according to a predetermined rule.
[0011] With the above configuration, a voice in which the phonemes or syllables of the input voice are replaced with other phonemes or syllables is output, so a voice that makes no sense to the listener can be produced. The other party can therefore be made to think that communication with the user is impossible, without learning the user's nationality.
[0012] In the voice output device of the present invention, the phoneme information further includes information indicating whether the input voice is voiced, and the voice output control means instructs voice output based on this information.
[0013] With the above configuration, the start and stop of voice output can be controlled according to whether the input voice is voiced, so voice can be output in step with the utterance timing of the user or the other party. This gives the other party no room to suspect that the output voice is a standard message or synthesized voice, and makes the other party think that communication with the user is impossible.
[0014] In the voice output device of the present invention, the phoneme information further includes information indicating the fundamental frequency of the input voice, and the voice output control means determines the fundamental frequency of the output voice based on this fundamental frequency information.
[0015] With the above configuration, the change in the fundamental frequency of the output voice acquires fluctuation with statistically similar properties to the change in the fundamental frequency of the input voice, which increases the naturalness of the output voice. This gives the other party no room to suspect that the output voice is a standard message or synthesized voice, and makes the other party think that communication with the user is impossible.
[0016] The present invention also constitutes a voice communication device further comprising communication means for executing communication processing and outputting the output voice to a communication destination.
[0017] With the above configuration, it is possible to provide a voice communication device that can deter further malicious calls by making the other party think that communication with the user is impossible, without letting the other party learn the user's nationality.
[0018] The voice output method of the present invention comprises a first step of extracting phoneme information from an input voice, a second step of instructing the output of an output voice composed of random phonemes based on the phoneme information, and a third step of outputting the output voice based on the instruction.
[0019] By the above method, a voice that makes no sense to the listener can be output based on the input voice, so the other party can be made to think that communication with the user is impossible, without learning the user's nationality.
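The three steps of the voice output method above can be sketched end to end as a toy pipeline. The analysis and synthesis stubs below are placeholders, and all names are assumptions for illustration, not the patent's implementation.

```python
import random

VOWEL_PHONEMES = ["a", "i", "u", "e", "o"]  # illustrative phoneme set

def extract_phoneme_info(input_speech):
    """Step 1: extract phoneme information (here, just a phoneme list)."""
    return list(input_speech)

def make_output_instruction(phoneme_info):
    """Step 2: instruct output composed of random phonemes."""
    return [random.choice(VOWEL_PHONEMES) for _ in phoneme_info]

def synthesize(instruction):
    """Step 3: output the voice (here, just join the phonemes)."""
    return "".join(instruction)
```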
Effects of the invention
[0020] According to the present invention, by creating random phoneme information based on the phoneme information of the input voice and outputting a voice from it, the other party who makes a malicious call can be made to think that communication with the user is impossible, without learning the user's nationality, so further malicious calls can be deterred.
Brief description of drawings
[0021]
[FIG. 1] Schematic configuration diagram of the voice output device according to Embodiment 1 of the present invention
[FIG. 2] Configuration diagram of the mobile phone terminal according to Embodiment 1 of the present invention
[FIG. 3] External view of the mobile phone terminal according to Embodiment 1 of the present invention
[FIG. 4] Diagram explaining the contents of the phoneme replacement candidate table held by the voice output control means of the mobile phone terminal according to Embodiment 1 of the present invention
[FIG. 5] Flowchart showing the processing procedure of the voice output control means of the mobile phone terminal according to Embodiment 1 of the present invention
[FIG. 6] Diagram explaining the operation of the mobile phone terminal according to Embodiment 1 of the present invention
[FIG. 7] Configuration diagram of the mobile phone terminal according to Embodiment 2 of the present invention
[FIG. 8] Diagram explaining the contents of the syllable string table held by the voice output means of the mobile phone terminal according to Embodiment 2 of the present invention
[FIG. 9] First flowchart showing the processing procedure of the voice output control means of the mobile phone terminal according to Embodiment 2 of the present invention
[FIG. 10] Second flowchart showing the processing procedure of the voice output control means of the mobile phone terminal according to Embodiment 2 of the present invention
[FIG. 11] Third flowchart showing the processing procedure of the voice output control means of the mobile phone terminal according to Embodiment 2 of the present invention
[FIG. 12] Diagram explaining the first operation of the mobile phone terminal according to Embodiment 2 of the present invention
[FIG. 13] Diagram explaining the second operation of the mobile phone terminal according to Embodiment 2 of the present invention
Explanation of reference numerals
[0022]
1 Voice output device
2 Voice analysis means
3 Voice output control means
4 Voice output means
5 Wireless communication means
7 Signal switching means
41 Fixed syllable holding means
61 Microphone
62 Speaker
91 Signal switching button
92 On-hook button
93 Off-hook button
100, 200 Mobile phone terminal
201 Input voice
202 Phoneme information
302 Voice output instruction
304 On-hook signal
402 Output voice
501 Transmission signal
502 Reception signal
612 Microphone output signal
Best mode for carrying out the invention
[0023] The best mode for carrying out the present invention will be described below with reference to the drawings. In all the drawings used to explain the embodiments, the same components are given the same reference numerals, and duplicate explanations are omitted.
[0024] (Embodiment 1)
FIG. 1 is a schematic configuration diagram of the voice output device 1 according to Embodiment 1 of the present invention. In FIG. 1, the voice output device 1 comprises voice analysis means 2, voice output control means 3, and voice output means 4.
[0025] The voice analysis means 2 receives the input voice 201, executes voice analysis processing, extracts the phonological information of the input voice 201, and outputs it as phoneme information 202. "Phoneme information" as used in this specification refers to the phonological information of a voice obtained as a result of voice analysis. For example, symbols or symbol strings identifying the phonemes or syllables constituting the voice, information indicating the strength or pitch of the voice, fundamental frequency information, and information distinguishing voiced from silent segments all correspond to the "phoneme information" described herein.
[0026] The voice output control means 3 receives the phoneme information 202 and outputs a voice output instruction 302 directing the output of the output voice 402. The voice output means 4 outputs the output voice 402 based on the voice output instruction 302. Here, the voice output control means 3 outputs the voice output instruction 302 based on the phoneme information 202 so that the output voice 402 is composed of random phonemes.
[0027] A known voice analysis device or voice analysis method may be used for the voice analysis processing in the voice analysis means 2, so a detailed description is omitted in this specification. Likewise, a known voice synthesis device or voice synthesis method may be used for the voice output processing in the voice output means 4, so a detailed description is omitted in this specification.
[0028] FIG. 2 is a configuration diagram of the voice output device 1 configured as the mobile phone terminal 100 according to Embodiment 1 of the present invention. In FIG. 2, the mobile phone terminal 100 comprises, in addition to the components described in FIG. 1, wireless communication means 5, a microphone 61, a speaker 62, signal switching means 7, a signal switching button 91, an on-hook button 92, and an off-hook button 93.
[0029] The voice analysis means 2 receives the input voice 201, executes voice analysis processing, calculates the short-time average power of the input voice 201, determines whether the received input voice 201 is a voiced section or a silent section, and outputs the signal "ON" as phoneme information 202 when voiced and the signal "OFF" when silent. The voice analysis means 2 also identifies phonemes by applying cepstrum analysis or the like to the input voice 201 and outputs phoneme symbols (for example /p/, /a/) as phoneme information 202. Furthermore, the voice analysis means 2 outputs, as phoneme information 202, fundamental frequency information representing the fundamental frequency of the input voice 201 every 30 milliseconds.
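The voiced/silent decision in paragraph [0029], short-time average power compared against a threshold, can be sketched as follows. The frame handling and the threshold value are assumptions for illustration.

```python
def classify_frame(samples, threshold=1e-3):
    """Return "ON" (voiced) or "OFF" (silent) from the short-time
    average power of one frame, as the voice analysis means 2 does."""
    if not samples:
        return "OFF"
    power = sum(s * s for s in samples) / len(samples)
    return "ON" if power >= threshold else "OFF"
```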
[0030] The voice output control means 3 determines, based on the phoneme symbols and fundamental frequency information contained in the phoneme information 202, the phonemes that the output voice 402 should contain and the fundamental frequency of the output voice 402, and outputs them as the voice output instruction 302.
[0031] The voice output means 4 synthesizes and outputs the output voice 402 based on the phonemes and fundamental frequency indicated by the voice output instruction 302. The voice output means 4 also performs coarticulation processing at phoneme transitions, interpolation of fundamental frequency changes, and the like, so that the output voice 402 does not sound unnatural.
[0032] The wireless communication means 5 is the part that connects the mobile phone terminal 100 to a wireless public network (not shown) and executes communication processing. The wireless communication means 5 outputs the transmission signal 501 to the wireless public network, and outputs the reception signal 502 acquired from the wireless public network to the speaker 62. The wireless communication means 5 starts outputting the transmission signal 501 to the wireless public network and acquiring the reception signal 502 from it in response to operation of the off-hook button 93, and terminates the connection with the wireless public network in response to operation of the on-hook button 92.
[0033] The microphone 61 converts the user's voice into an electrical signal and outputs the microphone output signal 612. The speaker 62 converts the reception signal 502 into air vibration and emits sound. The signal switching means 7 switches between two signals: by operating the signal switching button 91, either the output voice 402 or the microphone output signal 612 can be output as the transmission signal 501. That is, the user can select, by operating the signal switching button 91, which of the two voices — the output voice 402 derived by processing the user's voice, or the unprocessed microphone output signal 612 — is output to the wireless public network for the call partner to hear. In the initial state of communication processing by the wireless communication means 5, the signal switching means 7 outputs the output voice 402 as the transmission signal 501. This prevents the inconvenience of the call partner hearing a voice the user utters inadvertently, not knowing that the call partner is malicious, and thereby learning the user's personal information.
[0034] FIG. 3 is a diagram showing the appearance of the mobile phone terminal 100 according to the embodiment of the present invention.
The signal switching button 91 is arranged on a side face of the lower part of the casing so that the user can operate it during a call without looking at the terminal.
[0035] FIG. 4 shows the contents of the phoneme replacement candidate table T31 held by the voice output control means 3 of the mobile phone terminal 100 according to Embodiment 1 of the present invention. The phoneme replacement candidate table T31 lists, for each phoneme that may be contained in the phoneme information 202, the candidate phonemes with which it may be replaced. Each of the records R311 to R318 of the table T31 consists of a first field F311 representing a phoneme p constituting the phoneme information 202 and a second field F312 representing the candidate phonemes p' that may replace the phoneme p. Based on the table T31, the voice output control means 3 generates the voice output instruction 302 by replacing the phoneme p constituting the phoneme information 202 with a phoneme p' according to a predetermined rule (hereinafter, the phoneme replacement rule): it searches the records R311 to R318 for the record whose first field (F311) contains the phoneme p, selects at random one of the candidate phonemes indicated in the second field (F312) of that record as the phoneme p', and replaces p with p'.
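The table-driven replacement rule can be sketched in Python as follows. Only the /d/ → {/k/, /g/} entry of record R313 is spelled out in the text (paragraph [0046]); the other table entries here are illustrative placeholders, and the function name is ours, not the patent's.

```python
import random

# Hypothetical contents of the phoneme replacement candidate table T31.
# Only the "d" entry (record R313) is given in the text; the rest are
# invented placeholders for illustration.
T31 = {
    "d": ["k", "g"],
    "o": ["i", "e"],
    "a": ["i", "u"],
}

def replace_phoneme(p, rng=random):
    """Phoneme replacement rule: look up phoneme p in table T31 and
    pick one of its candidate replacements p' at random."""
    candidates = T31.get(p)
    if candidates is None:
        return p  # no matching record; leave the phoneme unchanged
    return rng.choice(candidates)
```

For the input phoneme /d/ this always yields /k/ or /g/, matching the worked example of paragraph [0046].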
[0036] FIG. 5 is a flowchart showing the processing procedure of the voice output control means 3 of the mobile phone terminal 100 according to Embodiment 1 of the present invention. The voice output control means 3 first acquires the phoneme information 202 (step S101), and determines whether the input voice 201 is in a voiced section based on whether the signal contained in the phoneme information 202 is "ON" or "OFF" (step S102). In a voiced section (YES) it proceeds to step S103; in a silent section (NO) it returns to step S101.
[0037] In step S103, the voice output control means 3 determines whether the phoneme contained in the phoneme information 202 acquired this time is the same as the phoneme contained in the phoneme information 202 acquired last time; if it is the same (NO) it proceeds to step S105, and if it differs (YES) it proceeds to step S104.
[0038] In step S104, the voice output control means 3 replaces the phoneme p constituting the phoneme information 202 with a new phoneme p' according to the phoneme replacement rule to form the voice output instruction 302. In step S105, by contrast, it replaces the phoneme p with the phoneme p' obtained in the previous execution of step S104. The determination in step S103 and the processing in step S105 serve to convert a run of one identical phoneme in the input voice 201 into a run of another identical phoneme.
[0039] In step S106, the voice output control means 3 calculates the fundamental frequency F0' according to the frequency conversion formula, based on the fundamental frequency F0 indicated by the fundamental-frequency information contained in the phoneme information 202, and includes it in the voice output instruction 302. The frequency conversion formula is as follows.
[0040] F0' = F0 * r * (rand(0.4) + 0.8)
Here r is a predetermined coefficient specifying how high the fundamental frequency of the output voice 402 should be relative to that of the input voice 201. Setting r < 1 makes the fundamental frequency of the output voice 402 lower overall than that of the input voice 201. For example, with r = 0.5 a female input voice 201 is converted into something like a male voice and output as the output voice 402, preventing the call partner from learning the user's sex. The coefficient r may be made changeable by user operation. rand(0.4) denotes a random number in [0, 0.4). Perturbing the fundamental frequency with a random number in this way disrupts the intonation pattern inherited from the input voice 201, preventing the call partner from inferring from the intonation of the output voice 402 what language the input voice 201 is in.
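A minimal Python sketch of this conversion, with function and parameter names of our own choosing and `uniform(0.0, 0.4)` standing in for the rand(0.4) term:

```python
import random

def convert_f0(f0, r=0.5, rng=random):
    """Step S106 frequency conversion: F0' = F0 * r * (rand(0.4) + 0.8).
    The scaled frequency is jittered by a factor in [0.8, 1.2), which
    disrupts the intonation pattern of the input voice."""
    return f0 * r * (rng.uniform(0.0, 0.4) + 0.8)
```

With r = 0.5 an input fundamental frequency of 200 Hz always maps into the range [80, 120) Hz, i.e. roughly an octave lower, consistent with the female-to-male example in the text.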
[0041] In step S107, the voice output control means 3 outputs the voice output instruction 302 generated by the processing up to step S106 to the voice output means 4, and returns to step S101.
[0042] A specific operation of Embodiment 1 of the present invention is described below, namely a specific example in which the user's voice is processed and output to the wireless public network for the call partner to hear. The case where the user is a woman whose native language is Japanese and the coefficient r is 0.5 is taken as an example.
[0043] FIG. 6 shows the contents of the input voice 201 and the output voice 402 when the user utters "dochira-sama?" ("Who is this?") during a call on the mobile phone terminal 100 according to Embodiment 1 of the present invention. In FIG. 6 the horizontal axis represents time and the vertical axis represents fundamental frequency. The line group L101 represents the fundamental-frequency change of the input voice 201, and the line group L102 that of the output voice 402. The phoneme symbols written immediately above the line groups L101 and L102 indicate the phonemes contained in the phoneme information 202 (D101 to D106) and the phonemes indicated by the voice output instruction 302 (D111 to D116), respectively. The operation of the mobile phone terminal 100 according to Embodiment 1 of the present invention is described below with reference mainly to FIG. 6.
[0044] First, when the user starts a call, the voice analysis means 2 analyzes the input voice 201 and extracts the phoneme information 202 at time t101. The phoneme information 202 contains the phoneme /d/ (D101), the fundamental-frequency information F0, and the signal "ON" indicating a voiced section.
[0045] Next, the voice output control means 3 executes processing according to the procedure shown in the flowchart of FIG. 5. In step S101 the voice output control means 3 acquires the phoneme information 202. In step S102, since this is a voiced section ("ON", YES), it proceeds to step S103. In step S103, since the phoneme /d/ has been newly received, the determination is "YES" and it proceeds to step S104.
[0046] In step S104, the voice output control means 3 refers to the phoneme replacement candidate table T31 of FIG. 4 and selects a phoneme p' to replace the phoneme /d/ according to the phoneme replacement rule. Since the record in the table T31 whose first field contains the phoneme /d/ is record R313, the voice output control means 3 randomly selects either /k/ or /g/ as the phoneme p'. Here, assume that /k/ is selected as the phoneme p'.
[0047] In step S106 of FIG. 5, the voice output control means 3 calculates the fundamental frequency F0' according to the frequency conversion formula, and in step S107 it outputs the phoneme /k/ and the fundamental frequency F0' as the voice output instruction 302.
[0048] The voice output means 4 then synthesizes and outputs the output voice 402 based on the voice output instruction 302. As shown in FIG. 6, the output voice 402 is a voice consisting of the phoneme /k/ (D111), and its fundamental frequency is about half that of the input voice 201.
[0049] By performing the same processing on the phoneme /o/ (D102) portion of the input voice 201, an output voice 402 consisting of the phoneme /i/ (D112) is output from the voice output means 4. The output voice 402 is sent through the signal switching means 7 to the wireless communication means 5 as the transmission signal 501, and further output to the call partner's terminal via the wireless public network.
[0050] The next phoneme, /ch/ (D103), is an unvoiced consonant, so the voice analysis means 2 cannot extract a fundamental frequency and outputs 0 as the value of the fundamental frequency F0 in the phoneme information 202. The voice output control means 3 accordingly outputs 0 as the value of the fundamental frequency F0' in step S106, but the voice output means 4 interpolates between the fundamental frequency of the previously received phoneme /i/ (D112) and that of the next received phoneme /a/ (D114), so that the fundamental frequency of the output voice 402 corresponding to the phoneme /r/ (D113) changes smoothly.
[0051] The fundamental frequency of the phoneme /i/ (D115) of the output voice 402, corresponding to the phoneme /a/ (D105) of the input voice 201, is slightly higher than that of the immediately preceding phoneme /n/ owing to the random number in the frequency conversion formula of step S106 in FIG. 5. [0052] By repeating the same processing, the Japanese input voice 201 "dochira-sama?" ("Who is this?") is converted into a male voice "kiranijohi?", which is meaningless as Japanese, and is output from the call partner's terminal via the wireless public network.
[0053] As described above, in the mobile phone terminal 100 according to Embodiment 1 of the present invention, as a first step the voice analysis means 2 extracts phoneme information from the input voice 201 and outputs the phoneme information 202; as a second step the voice output control means 3 executes the procedure shown in the flowchart of FIG. 5 and outputs the voice output instruction 302 based on the phoneme information 202; and as a third step the voice output means 4 synthesizes and outputs the output voice 402 based on the voice output instruction 302. This voice output method makes it possible to output to the call partner's terminal a random voice in which the phonemes of the input voice have been replaced with other phonemes.
[0054] As is clear from the above description, the mobile phone terminal according to Embodiment 1 of the present invention outputs to the call partner's terminal a voice whose phonemes have been replaced with other phonemes, i.e. a voice that makes no sense to the call partner. It can therefore make the call partner believe that communication with the user is impossible, without revealing the user's nationality to the call partner.
[0055] Moreover, in the mobile phone terminal according to Embodiment 1 of the present invention, the fundamental-frequency change of the output voice fluctuates with statistically similar properties to the fundamental-frequency change of the input voice, which increases the naturalness of the output voice. The call partner is thus given no reason to suspect that the output voice is a canned message or synthesized speech, and can reliably be made to believe that communication with the user is impossible.
[0056] (Embodiment 2)
Embodiment 2 of the present invention is described next. Embodiment 2 is characterized in that the reception signal, which is the output of the wireless communication means 5, is used as the input voice.
[0057] FIG. 7 is a configuration diagram of the mobile phone terminal 200 according to Embodiment 2 of the present invention. In FIG. 7, the mobile phone terminal 200 comprises the voice analysis means 2, the voice output control means 3, the voice output means 4, fixed syllable holding means 41, the wireless communication means 5, the microphone 61, the speaker 62, the signal switching means 7, the signal switching button 91, the on-hook button 92, and the off-hook button 93. The fixed syllable holding means 41 is a means that holds fixed syllable strings from which the output voice 402 is constructed.
[0058] The voice analysis means 2 accepts the reception signal 502, which is the output of the wireless communication means 5, as the input voice 201 and executes voice analysis processing: it calculates the short-time average power of the input voice 201, determines whether the accepted input voice 201 is in a voiced or a silent section, and outputs the signal "ON" for a voiced section or "OFF" for a silent section as the phoneme information 202.
[0059] Based on the phoneme information 202, the voice output control means 3 generates a syllable string ID identifying the syllable string to be output as the output voice 402, and outputs it as the voice output instruction 302. The voice output control means 3 also outputs an on-hook signal 304, an instruction for the wireless communication means 5 to terminate the connection with the wireless public network.
[0060] The voice output means 4 synthesizes and outputs the output voice 402 from the syllable string data acquired from the fixed syllable holding means 41 based on the syllable string ID contained in the voice output instruction 302. The fixed syllable holding means 41 holds a syllable string table, a table for obtaining syllable string data from a syllable string ID.
[0061] FIG. 8 shows the contents of the syllable string table T41 held by the fixed syllable holding means 41 of the mobile phone terminal 200 according to Embodiment 2 of the present invention. Each of the records R411 to R414 of the syllable string table T41 consists of a first field F411 representing a syllable string ID and a second field F412 representing the syllable string data corresponding to that ID. The syllable string data is expressed as a sequence of pairs, each consisting of a syllable to be output as the output voice 402 (the letters enclosed in square brackets []) and a coefficient for that syllable's fundamental frequency (the number enclosed in parentheses ()). For example, the syllable string data "[ho](1.0), [la](1.2)" corresponding to syllable string ID = 1, stored in record R411, indicates that the syllables [ho] and [la] are to be output in succession, at a medium and a somewhat higher fundamental frequency respectively.
[0062] In FIG. 8, the syllable string data corresponding to syllable string ID = 1 in record R411 is data intended to make the call partner believe it is a fixed phrase uttered as a greeting immediately after the start of a call (for example, "moshi-moshi", i.e. "hello"). The syllable string data corresponding to syllable string ID = 2 in record R412 is data intended to make the call partner believe it is a fixed phrase uttered when asking the partner to repeat (for example, "hai?", i.e. "yes?"). Although the syllable string data is intended to generate speech that makes no sense to the call partner, preparing in advance syllable strings that sound like some kind of fixed phrase and letting them appear in the output voice 402 from time to time gives the output voice 402 the naturalness of a natural language, so that the call partner can be made to believe that the output voice 402 is uttered by the user personally.
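As a sketch, the syllable string table T41 can be held as a mapping from ID to (syllable, fundamental-frequency coefficient) pairs. Only the ID = 1 entry "[ho](1.0), [la](1.2)" is given verbatim in the text; the ID = 2 entry here is an invented placeholder, and ID = 0 follows the text's convention of meaning "stop output".

```python
# Hypothetical contents of the syllable string table T41. Only the
# ID = 1 entry (record R411) is spelled out in the text; the ID = 2
# entry is a placeholder. ID = 0 means "stop outputting voice".
T41 = {
    1: [("ho", 1.0), ("la", 1.2)],  # greeting-like string (record R411)
    2: [("ha", 1.0), ("i", 1.3)],   # question-like string (record R412)
}

def syllables_for(syllable_id):
    """Return the (syllable, F0 coefficient) pairs to synthesize,
    or an empty list when ID = 0 requests that output stop."""
    if syllable_id == 0:
        return []
    return T41[syllable_id]
```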
[0063] FIGS. 9 to 11 are flowcharts showing the processing procedure of the voice output control means 3 of the mobile phone terminal 200 according to Embodiment 2 of the present invention. The voice output control means 3 first acquires the phoneme information 202 in step S211, and in step S212 determines whether the input voice 201 is in a voiced section based on whether the signal contained in the phoneme information 202 is "ON" or "OFF". In a voiced section (YES) it proceeds to step S221 of FIG. 10; in a silent section (NO) it proceeds to step S213.
[0064] In step S213, the voice output control means 3 determines whether the voice output means 4 is currently outputting the output voice 402; if so (YES) it returns to step S211, and if not (NO) it proceeds to step S214. In step S214, the voice output control means 3 determines whether the silent section of the input voice 201 is the first silent section after the start of the call, that is, whether the call has just started; if it is the first silent section (YES) it proceeds to step S231 of FIG. 11, and otherwise (NO) it proceeds to step S215.
[0065] In step S215, the voice output control means 3 generates a pseudorandom number d in [0, 1) and branches on its value: if d < 0.2 it proceeds to step S216, if 0.2 ≤ d < 0.9 it proceeds to step S217, and if d ≥ 0.9 it proceeds to step S219.
[0066] In step S216, the voice output control means 3 selects syllable string ID = 2, i.e. a syllable string that sounds like asking the call partner to repeat. In step S217, the voice output control means 3 randomly selects a syllable string ID from among the syllable string IDs stored in the syllable string table T41 of FIG. 8 and syllable string ID = 0; syllable string ID = 0 indicates that output of the output voice 402 by the voice output means 4 is to stop. In step S218, the voice output control means 3 outputs the syllable string ID selected in step S216 or S217 as the voice output instruction 302, and returns to step S211. [0067] In step S219, the voice output control means 3 outputs the on-hook signal 304, whereupon the wireless communication means 5 terminates the connection with the wireless public network. By disconnecting the call probabilistically in this way, the terminal can make the call partner believe that the user, unable to understand the partner's speech, judged it pointless to continue the call and performed an on-hook operation. Moreover, since the voice output control means 3 automatically instructs disconnection some time after the start of the call, the user is spared the trouble of operating the on-hook button 92 while listening to the voice of a malicious call partner, which improves convenience.
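The silent-section branching of steps S215 to S219 can be sketched as follows. The thresholds 0.2 and 0.9 are those stated in paragraph [0065]; the set of available table IDs and the returned action tuples are assumptions of this sketch, and the `d` parameter exists only so the branch can be exercised deterministically.

```python
import random

def respond_to_silence(d=None, rng=random):
    """Steps S215-S219 for a non-initial silent section (sketch):
      d < 0.2        -> S216: question-like syllable string (ID = 2)
      0.2 <= d < 0.9 -> S217: random table ID, or 0 meaning "stop output"
      d >= 0.9       -> S219: emit the on-hook signal 304 (disconnect)"""
    available_ids = [0, 1, 2]  # assumed: IDs held in table T41, plus 0
    if d is None:
        d = rng.random()  # pseudorandom number in [0, 1)
    if d < 0.2:
        return ("output", 2)
    elif d < 0.9:
        return ("output", rng.choice(available_ids))
    else:
        return ("on-hook", None)
```

Over many silent sections this yields a question-like utterance about 20% of the time, some other utterance or a pause about 70% of the time, and a disconnection about 10% of the time.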
[0068] Steps S221 to S229 of FIG. 10 are the processing procedure of the voice output control means 3 when the input voice 201 is found to be in a voiced section in step S212 of FIG. 9, that is, when the call partner is speaking. In step S221, the voice output control means 3 generates a pseudorandom number d in [0, 1) and branches on its value: if d < 0.998 it proceeds to step S222, and if d ≥ 0.998 it proceeds to step S229.
[0069] In step S222 the voice output control means 3 sets syllable string ID = 0, and in step S223 it outputs the voice output instruction 302 and returns to step S211. Steps S222 to S223 make the call partner believe that the user broke off mid-utterance because the partner started speaking; through this processing, the call partner can be made to believe that the user is speaking while listening to the partner's voice.
[0070] In step S229, the voice output control means 3 outputs the on-hook signal 304, whereupon the wireless communication means 5 terminates the connection with the wireless public network.
[0071] Steps S231 to S239 of FIG. 11 are the processing procedure of the voice output control means 3 when it is determined in step S214 of FIG. 9 that the silent section of the input voice 201 is the first silent section after the start of the call, that is, immediately after the call has started. In step S231, the voice output control means 3 determines whether it has already output a syllable string ID twice; if not (NO) it proceeds to step S232, and if so (YES) it proceeds to step S239 and outputs the on-hook signal 304.
[0072] In step S232 the voice output control means 3 selects syllable string ID = 1, i.e. a greeting-like syllable string for the period just after the start of a call, and in step S233 it outputs the voice output instruction 302 and returns to step S211 of FIG. 9. Steps S231 to S239 make the call partner believe that the user uttered a greeting immediately after the call started and, receiving no response after two greetings, disconnected the call.
[0073] When the voice output control means 3 outputs the voice output instruction 302 in step S218, S223, or S233, the voice output means 4 accepts it and outputs the output voice 402 in the following procedure. First, the voice output means 4 obtains the syllable string data corresponding to the syllable string ID contained in the voice output instruction 302 by searching the syllable string table T41 (FIG. 8). Next, it synthesizes and outputs the output voice 402 based on the syllable string data. In doing so, the voice output means 4 calculates the fundamental frequency F0' of the output voice 402 for each syllable from the fundamental-frequency coefficient α indicated by the syllable string data, according to the following frequency calculation formula.
[0074] F0' = F0base * α * (rand(0.4) + 0.8)
Here F0base is the initial value of the fundamental frequency of the output voice 402; by adjusting F0base, the output voice 402 can be made a male or a female voice. For example, F0base = 120 Hz yields a male output voice 402. The value of F0base may be made changeable by user operation. rand(0.4) denotes a random number in [0, 0.4). Multiplying the fundamental frequency by a random number gives variation to the intonation of the output voice 402, preventing the call partner from realizing that the output voice 402 is synthesized. The voice output means 4 also interpolates the fundamental frequencies of adjacent syllables at syllable boundaries so that the output voice 402 does not sound unnatural as speech. A specific operation of Embodiment 2 of the present invention is described below. In Embodiment 2, a voice synthesized based on the result of analyzing the call partner's voice as the input voice is output to the wireless public network as the output voice for the call partner to hear. The case where the call partner makes a silent call is described below.
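A sketch of the per-syllable calculation in Python, with F0base = 120 Hz as the male-voice example value from the text (the function and parameter names are ours):

```python
import random

def syllable_f0(alpha, f0_base=120.0, rng=random):
    """Frequency calculation formula of paragraph [0074]:
    F0' = F0base * alpha * (rand(0.4) + 0.8).
    alpha is the per-syllable coefficient from the syllable string
    table; the random factor in [0.8, 1.2) varies the intonation."""
    return f0_base * alpha * (rng.uniform(0.0, 0.4) + 0.8)
```

For the syllable [la] of record R411, with α = 1.2 and F0base = 120 Hz, F0' always falls in the range [115.2, 172.8) Hz.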
[0075] FIG. 12 is a diagram showing the content of the output voice 402 in the mobile phone terminal 200 according to Embodiment 2 of the present invention. In FIG. 12, the horizontal axis represents time and the vertical axis represents fundamental frequency. Syllable symbols D211 to D214 represent the syllables contained in the output voice 402. The operation of the mobile phone terminal 200 according to Embodiment 2 of the present invention is described below with reference mainly to FIG. 12.
[0076] First, when the user operates the off-hook button 93 and starts the call at time t201, the moment of the incoming call, the voice analysis means 2 analyzes the input voice 201 and extracts the phoneme information 202. The extracted phoneme information 202 contains the signal "OFF" indicating silence.

[0077] Next, the voice output control means 3 executes processing according to the procedure shown in the flowcharts of FIGS. 9 to 11. In step S211, the voice output control means 3 acquires the phoneme information 202. In step S212, since this is a silent interval ("OFF") (NO), it proceeds to step S213. In step S213, since the output voice 402 is not being output (NO), it proceeds to step S214. In step S214, since this is the first silent interval after the start of the call (YES), it proceeds to step S231 of FIG. 11.

[0078] In step S231, since the output voice 402 has not yet been output even once (NO), the voice output control means 3 proceeds to step S232. In step S232, the voice output control means 3 sets the syllable-string ID to 1, and in step S233 it outputs the voice output instruction 302.
[0079] Further, the voice output means 4 accepts the voice output instruction 302, refers to the syllable-string table T41 to obtain the syllable-string data "[ho](1.0), [la](1.2)" (the second field F412 of record R411 in FIG. 8), and synthesizes the output voice 402 (the portions D211 and D212 of FIG. 12). Here, the coefficient α of the syllable-string data is set to 1.0 for syllable [ho] and 1.2 for syllable [la], that is, so that [la] has the higher fundamental frequency, yet in the output voice 402 the fundamental frequency of syllable [ho] (D211) is higher than that of syllable [la] (D212). This is because the random-number term of the frequency calculation formula imparted fluctuation to the fundamental frequency of the output voice 402.
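The lookup path of [0079], from the voice output instruction 302 through the syllable-string table T41 to the data handed to the synthesizer, can be sketched as below. Only records R411 (ID 1) and R413 (ID 3) are quoted in the text; the entry for ID 2 is an assumption inferred from the description of voice D313, and the function name is invented for illustration.

```python
# Sketch of syllable-string table T41 (FIG. 8): ID -> [(syllable, alpha), ...]
# IDs 1 and 3 follow records R411/R413 as quoted in the text; ID 2
# ("chegi?") is an assumption based on the description of voice D313.
T41 = {
    1: [("ho", 1.0), ("la", 1.2)],
    2: [("che", 1.3), ("gi", 1.5)],
    3: [("ki", 1.0), ("ru", 0.9), ("mi", 0.9), ("ji", 1.2),
        ("hi", 1.1), ("go", 1.0), ("che", 1.3), ("si", 1.5)],
}

def handle_output_instruction(syllable_string_id):
    """Voice output means 4: accept instruction 302, return syllable data.

    ID 0 is the "no output" instruction, so nothing is synthesized.
    """
    if syllable_string_id == 0:
        return None
    return T41[syllable_string_id]
```

For example, `handle_output_instruction(1)` yields the [ho]/[la] greeting data of record R411, while `handle_output_instruction(0)` returns nothing, matching the behavior described in [0086].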
[0080] Thereafter, by the same processing, the output voice 402 is output a second time (the portions D213 and D214 of FIG. 12). In the third pass through step S231, since the output has already been made twice (YES), the voice output control means 3 proceeds to step S239 and outputs the on-hook signal 304, whereby the wireless communication means 5 terminates the connection with the wireless public network at time t202 of FIG. 12.
[0081] As described above, when the call partner places a silent call, the mobile phone terminal according to Embodiment 2 of the present invention outputs, a predetermined number of times, voice intended to make the call partner take it for a fixed phrase uttered as a greeting, and then automatically terminates the connection with the public network.

[0082] As another specific operation of Embodiment 2 of the present invention, the case of receiving a malicious call aimed at billing fraud is described as an example.
[0083] FIG. 13 is a diagram showing the contents of the input voice 201 and the output voice 402 in the mobile phone terminal 200 according to Embodiment 2. In FIG. 13, the horizontal axis represents time. FIGS. 13(a), 13(b), and 13(c) show one continuous operation divided into three parts for convenience. In FIG. 13, texts D301 to D305 represent the content of the input voice 201, that is, the call partner's voice, and texts D311 to D315 represent the content of the output voice 402. Information on the fundamental frequencies of the input voice 201 and the output voice 402 is omitted from FIG. 13. The operation of the mobile phone terminal 200 according to Embodiment 2 is described below with reference mainly to FIG. 13. The processing from the user operating the off-hook button 93 at time t301 on the incoming call until the first output voice 402 (D311) is output is the same as in the operation example of Embodiment 2 described above, and its description is therefore omitted.
[0084] When the voice D301 is input, the voice analysis means 2 analyzes the input voice 201 and extracts the phoneme information 202. The extracted phoneme information 202 contains the signal "ON" indicating voiced sound.

[0085] Next, the voice output control means 3 executes processing according to the procedure shown in the flowcharts of FIGS. 9 to 11. In step S211, the voice output control means 3 acquires the phoneme information 202. In step S212, since this is a voiced interval ("ON") (YES), it proceeds to step S221 of FIG. 10.
[0086] In step S221, the voice output control means 3 generates a pseudorandom number d. Assuming here that d = 0.2, then d < 0.998, so it proceeds to step S222, sets the syllable-string ID to 0, and outputs this as the voice output instruction 302 in step S223. The voice output means 4 accepts the voice output instruction 302 but, since the syllable-string ID is 0, does not output the output voice 402.

[0087] While the input voice 201 is being input, the above processing repeats, so the output voice 402 is not output.

[0088] When the voice D301 ends, the voice analysis means 2 outputs phoneme information 202 containing the signal "OFF" indicating silence. The voice output control means 3 acquires the phoneme information 202 (step S211), proceeds to step S213 on the determination of step S212 (NO), and, since no voice is being output, proceeds further to step S214. In step S214, since this is not the first silent interval (NO), it proceeds to step S215.

[0089] Assuming that the value of the pseudorandom number d generated by the voice output control means 3 in step S215 is 0.3, then 0.2 ≤ d < 0.9, so it proceeds to step S217. In step S217, the voice output control means 3 selects a syllable-string ID at random; assuming here that ID = 3 is selected, in step S218 the voice output control means 3 outputs the voice output instruction 302 containing syllable-string ID = 3.
[0090] The voice output means 4 accepts the voice output instruction 302, refers to the syllable-string table T41 to obtain the syllable-string data "[ki](1.0), [ru](0.9), [mi](0.9), [ji](1.2), [hi](1.1), [go](1.0), [che](1.3), [si](1.5)" (the second field F412 of record R413 in FIG. 8), and outputs the voice D312 (FIG. 13) as the output voice 402.

[0091] The voice D313 "chegi?" is the output produced either when the value of the pseudorandom number d generated by the voice output control means 3 in step S215 is smaller than 0.2 so that the syllable-string ID becomes 2 in step S216, or when 2 is selected as the syllable-string ID in step S217.
[0092] If the call partner starts speaking at time t302 while the voice D314 is being output, the voice output control means 3 proceeds to the processing of FIG. 10 on the determination of step S212 (YES). Assuming here that the pseudorandom number d generated by the voice output control means 3 in step S221 is 0.7, then d < 0.998, so the voice output control means 3 executes steps S222 and S223, and the voice output means 4 interrupts the output of the output voice 402 at time t303 of FIG. 13.

[0093] If the value of the pseudorandom number d generated by the voice output control means 3 in step S221 while the voice D305 is being input is 0.9984, then d ≥ 0.998, so the voice output control means 3 outputs the on-hook signal 304 in step S229, and the wireless communication means 5 terminates the connection with the wireless public network at time t304.
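The stochastic decisions walked through in [0086] to [0093] reduce to threshold tests on a pseudorandom number d. The sketch below models only that branching; the function names and the ID pool for the random selection of step S217 are assumptions, and the behavior for d ≥ 0.9 at step S215 is inferred from the d-ranges the text gives, not stated explicitly.

```python
import random

def decide_on_silence(d):
    """Step S215: decide what to do at a (non-first) silent interval."""
    if d < 0.2:
        return 2                         # S216: fixed phrase, syllable-string ID 2
    if d < 0.9:
        return random.choice([1, 2, 3])  # S217: pick a syllable-string ID at random
    return "hang_up"                     # assumed: disconnect for d >= 0.9

def decide_on_speech(d):
    """Step S221: decide what to do while the partner is speaking."""
    if d < 0.998:
        return 0                         # S222/S223: ID 0, i.e. stay silent
    return "hang_up"                     # S229: output on-hook signal 304
```

The worked example in the text follows this shape: d = 0.2 during speech keeps the terminal quiet ([0086]), while d = 0.9984 triggers the on-hook signal and ends the call at t304 ([0093]).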
[0094] As described above, the mobile phone terminal according to Embodiment 2 of the present invention outputs voice that is meaningless to the call partner in the gaps between the call partner's utterances, making it appear that the user is holding a conversation, and then automatically disconnects the call.

[0095] As is clear from the above description, because the mobile phone terminal according to Embodiment 2 of the present invention outputs syllable strings at random to the call partner's terminal, it can output voice that is meaningless to the call partner; it can therefore make the call partner believe that communication with the user is impossible, without the user's nationality becoming known to the call partner.

[0096] Furthermore, because the mobile phone terminal according to Embodiment 2 of the present invention can vary its voice output according to whether the call partner's voice is voiced, it can produce voice output matched to the call partner's speaking state; it therefore gives the call partner no room to suspect that the output voice is a canned message or synthesized speech, and can reliably make the call partner believe that communication with the user is impossible.

[0097] Furthermore, because the mobile phone terminal according to Embodiment 2 of the present invention occasionally outputs fixed phrases, the output voice sounds more natural; it therefore gives the call partner no room to suspect that the output voice is a canned message or synthesized speech, and can reliably make the call partner believe that communication with the user is impossible.

[0098] Furthermore, because the mobile phone terminal according to Embodiment 2 of the present invention stops its voice output when the call partner starts speaking during output, it can make the call partner believe that the user is actually speaking while listening to the call partner's voice; it can thus reliably make the call partner believe that communication with the user is impossible.

[0099] Furthermore, because the mobile phone terminal according to Embodiment 2 of the present invention automatically terminates communication during the call, it can make the call partner believe that the user, unable to make out the meaning of the call, judged that continuing it was pointless and went on-hook; it can thus reliably make the call partner believe that communication with the user is impossible.
[0100] In Embodiment 2 of the present invention, the syllable-string data is selected from predetermined data, but any means capable of outputting voice that is meaningless to the call partner may be used; for example, a random syllable string or phoneme string may be generated anew on each voice output.
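The alternative mentioned in [0100], generating a fresh nonsense syllable string on each output instead of selecting from predetermined data, might look like the following sketch. The syllable inventory, length range, and coefficient range are illustrative assumptions, since the patent specifies none of them.

```python
import random

# Illustrative syllable inventory; the patent does not specify one.
# The empty consonant produces bare-vowel syllables.
CONSONANTS = ["k", "s", "t", "n", "h", "m", "r", "g", "ch", ""]
VOWELS = ["a", "i", "u", "e", "o"]

def random_syllable_string(min_len=2, max_len=8):
    """Build a nonsense syllable string with per-syllable pitch coefficients.

    Each entry is (syllable, alpha), the same shape as the rows of
    table T41, so it can feed the same synthesis path.
    """
    n = random.randint(min_len, max_len)
    return [
        (random.choice(CONSONANTS) + random.choice(VOWELS),
         round(random.uniform(0.8, 1.5), 1))  # alpha, as in table T41
        for _ in range(n)
    ]

s = random_syllable_string()
```

Generating strings on the fly removes any risk of the call partner noticing repetition across calls, at the cost of losing the hand-tuned intonation of the predetermined entries.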
[0101] Also, in Embodiment 2 of the present invention, the conditions for deciding whether to output the output voice or to disconnect the communication (the ranges of the value d in the branch decisions of steps S215 and S221) were described as fixed, but this is not a limitation; the conditions may be varied, for example by increasing the probability of disconnecting the communication each time an incoming call is repeated.

[0102] Also, in Embodiments 1 and 2 of the present invention, the terminal incorporates the voice output device of the present invention, but an external device connected to the terminal, such as an exchange, a relay, or a server, may incorporate the voice output device of the present invention and be configured to process the voice of the user or of the call partner and output the voice.

[0103] Also, Embodiments 1 and 2 of the present invention described the example of a mobile phone terminal, but this is not a limitation; the same effects are obtained with fixed-line telephones, IP telephones, voice chat, intercoms, and the like.
[0104] The voice output device of the present invention can also be used in devices that output voice with randomness in response to a user's input voice, and can therefore also serve as a voice output device in, for example, electronic pets, pet robots, toys, game machines, and game software. Although the present invention has been described in detail and with reference to specific embodiments, it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention.

This application is based on Japanese Patent Application No. 2005-223652 filed on August 2, 2005, the contents of which are incorporated herein by reference.
Industrial Applicability

[0105] The voice output device and voice output method of the present invention can make the caller of a malicious call believe that communication with the user is impossible, without the user's nationality becoming known to the caller, and thus have the effect of deterring the caller from placing further malicious calls; they are useful for voice communication devices capable of calls with unspecified parties.

Claims

[1] A voice output device comprising:
voice analysis means for extracting phoneme information from an input voice;
voice output control means for instructing voice output based on the phoneme information; and
voice output means for outputting voice based on the instruction,
wherein the voice output device is configured to output voice of random phonemes based on the phoneme information of the input voice.

[2] The voice output device according to claim 1, wherein the phoneme information includes information identifying phonemes or syllables contained in the input voice, and the voice output control means determines the phonemes or syllables constituting the output voice by substituting those phonemes or syllables according to a predetermined rule.

[3] The voice output device according to claim 1 or 2, wherein the phoneme information includes information indicating whether the input voice is voiced, and the voice output control means instructs voice output based on the information indicating whether the input voice is voiced.

[4] The voice output device according to any one of claims 1 to 3, wherein the phoneme information includes information representing a fundamental frequency of the input voice, and the voice output control means determines a fundamental frequency of the output voice based on that fundamental frequency.

[5] A voice communication device comprising: communication means for executing communication processing and outputting the output voice to a communication destination; and the voice output device according to any one of claims 1 to 4.

[6] A voice output method comprising: a first step of extracting phoneme information from an input voice; a second step of instructing voice output of random phonemes based on the phoneme information; and a third step of outputting the output voice based on the instruction.
PCT/JP2006/304390 2005-08-02 2006-03-07 Voice output apparatus, voice communication apparatus and voice output method WO2007015319A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007503136A JPWO2007015319A1 (en) 2005-08-02 2006-03-07 Audio output device, audio communication device, and audio output method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005223652 2005-08-02
JP2005-223652 2005-08-02

Publications (1)

Publication Number Publication Date
WO2007015319A1 true WO2007015319A1 (en) 2007-02-08

Family

ID=37708602


Country Status (2)

Country Link
JP (1) JPWO2007015319A1 (en)
WO (1) WO2007015319A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013195928A (en) * 2012-03-22 2013-09-30 Yamaha Corp Synthesis unit segmentation device
US8706133B2 (en) 2008-06-30 2014-04-22 Motorola Solutions, Inc. Threshold selection for broadcast signal detection
CN107346107A (en) * 2016-05-04 2017-11-14 深圳光启合众科技有限公司 Diversified motion control method and system and the robot with the system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05114880A (en) * 1991-10-22 1993-05-07 Hitachi Ltd Portable mobile radio terminal
JPH0720894A (en) * 1993-06-17 1995-01-24 Sony Corp Voice information processing device
JPH07273864A (en) * 1994-04-04 1995-10-20 Victor Co Of Japan Ltd Cordless telephone system
JP2002101203A (en) * 2000-09-20 2002-04-05 Ricoh Co Ltd Speech processing system, speech processing method and storage medium storing the method
JP2003110712A (en) * 2001-09-30 2003-04-11 Hiroko Ishikawa Voice modulation call
JP2003157100A (en) * 2001-11-22 2003-05-30 Nippon Telegr & Teleph Corp <Ntt> Voice communication method and equipment, and voice communication program



Also Published As

Publication number Publication date
JPWO2007015319A1 (en) 2009-02-19


Legal Events

WWE (WIPO information: entry into national phase): ref document number 2007503136; country of ref document: JP
121: the EPO has been informed by WIPO that EP was designated in this application
NENP (non-entry into the national phase): ref country code DE
122 (PCT application non-entry in European phase): ref document number 06715360; country of ref document: EP; kind code of ref document: A1