WO2020085323A1

WO2020085323A1 - Speech processing method, speech processing device, and speech processing program

Info

Publication number: WO2020085323A1
Application number: PCT/JP2019/041367
Authority: WO
Inventors: 嘉山　啓
Original assignee: ヤマハ株式会社
Priority date: 2018-10-22
Filing date: 2019-10-21
Publication date: 2020-04-30
Also published as: JP2020067495A

Abstract

The present invention appropriately determines the intent of a speaker even when it is difficult to determine the intent of the speaker only by a pitch transition at the end of a word of a speaking section. A plurality of partial speaking sections included in one speaking section are identified within a speech signal, and the time change of the speech signal is analyzed for each of the partial speaking sections. Specifically, for each period, a speech signal is divided into speaking sections (UP1) divided by a silent section the duration time length of which is longer than a time threshold value (TH4), and each speaking section is divided into one or more partial speaking sections (PUP1-PUP3) divided by a silent section the duration time length of which is shorter than the time threshold value (TH4).

Description

Audio processing method, audio processing device, and audio processing program

The present disclosure relates to, for example, a voice processing method, a voice processing device, and a voice processing program applied to a dialogue device or the like.

To realize a natural dialogue in a dialogue device that responds to a user's utterance, the dialogue device must determine the intention of the speaker from the pitch change of the user's utterance and provide a response corresponding to that intention. One technique that meets this demand is disclosed in Patent Document 1, in which the response is controlled based on the pitch change at the ending of the utterance section.

Patent Document 1: JP-A-2015-69038

When an utterance section is extracted from the voice signal and the intention of the speaker is judged from the pitch change at the end of that utterance section, the judgment can be difficult if a pitch change indicating a question occurs partway through the utterance section.

In the following utterance examples, punctuation marks (commas and periods) represent downward pitch transitions, and question marks represent upward pitch transitions.
Example 1: "hirugohan, ramen de ii?" (romanized Japanese, meaning "Would you like to eat Japanese noodles for lunch?")
Example 2: "hirugohan, ramen de ii? ne." (romanized Japanese, meaning "You would like to eat Japanese noodles for lunch, wouldn't you?" spoken with falling intonation at the ending, that is, a sentence intended as a confirmation)
Example 3: "hirugohan, ramen de ii? ne?" (romanized Japanese, meaning "You would like to eat Japanese noodles for lunch, wouldn't you?" spoken with rising intonation at the ending, that is, a question)
Here, in utterance example 2, the romanized Japanese "ne." expresses confirmation. Following the question "ramen de ii?", it corresponds to "Isn't that right?" in English.
In utterance example 3, the romanized Japanese "ne?" expresses a question. Following the question "ramen de ii?", it corresponds to "Is it OK?" in English. In examples 1 to 3, the romanized Japanese "hirugohan" means "lunch" and the romanized Japanese "ramen" means "Japanese noodles".

In utterance example 1, when the dialogue device analyzes the voice in units of utterance sections, it detects the rising pitch transition at the end "ii?" of the utterance section, judges that the speaker intends a question, and outputs a prerecorded response to a question. In this case, an appropriate dialogue is realized.

In utterance example 2, when the dialogue device analyzes the voice in units of utterance sections, it detects the downward pitch transition at the end "ne." of the utterance section, judges that the speaker intends a confirmation, and outputs a prerecorded response to a confirmation. However, the pitch of "ii?" preceding the end "ne." rises, and this part of the voice expresses a question. The response therefore does not match the intention of the speaker, resulting in an inappropriate dialogue.

In utterance example 3, when the dialogue device analyzes the voice in units of utterance sections, it detects the rising pitch transition at the end "ne?" of the utterance section, judges that the speaker intends a question, and outputs a prerecorded response to a question. However, because the dialogue device does not take the rising pitch transition of "ii?" preceding the end "ne?" into account when judging the intention of the speaker, it fails to judge the strength of the question intent (the question intent of utterance example 3 is stronger than that of utterance example 1). This leads to an inappropriate dialogue.

To give an appropriate response to the speaker, the intention of the speaker could be analyzed by performing speech recognition on the utterance. However, speech recognition makes the device large in scale and lengthens the time from utterance to response.

The present disclosure has been made in view of the above circumstances, and its purpose is to provide a technical means capable of appropriately and easily determining the intention of the speaker even when that intention is difficult to determine from the pitch transition at the ending of the utterance section alone.

In order to solve the above problems, a voice processing method according to an aspect of the present disclosure specifies a plurality of partial utterance sections included in one utterance section in a voice signal, and analyzes a temporal change of the voice signal for each partial utterance section.
A voice processing device according to an aspect of the present disclosure includes one or more computers and one or more data storage devices storing a plurality of instructions that, when executed by the one or more computers, cause the voice processing device to perform the following operations: specifying a plurality of partial utterance sections included in one utterance section in a voice signal, and analyzing a temporal change of the voice signal for each partial utterance section.
A voice processing program according to an aspect of the present disclosure causes a computer to execute a step of specifying a plurality of partial utterance sections included in one utterance section in a voice signal, and a step of analyzing a temporal change of the voice signal for each of the partial utterance sections.

FIG. 1 is a block diagram showing the configuration of a dialogue device according to an embodiment of the present disclosure.
FIG. 2 is a time chart explaining the function of the dialogue device as a voice analysis device.
FIG. 3 is a functional block diagram showing the configuration of the functions realized when the control device in the embodiment executes the voice analysis program.
FIG. 4 is a flowchart showing the processing contents of the voice analysis program.
FIG. 5 is a flowchart showing the processing contents of the utterance section processing of the voice analysis program.
FIG. 6 is a time chart showing a first operation example of the embodiment.
FIG. 7 is a time chart showing a second operation example of the embodiment.
FIG. 8 is a time chart showing a third operation example of the embodiment.

Hereinafter, embodiments of the present disclosure will be described with reference to the drawings.

FIG. 1 is a block diagram showing a configuration of a dialogue device which is an embodiment of a voice analysis device according to the present disclosure. The dialogue device includes a control device 1, a computing device 2, a storage device 3, a display device 4, an operating device 5, a sound collecting device 6, and a sound emitting device 7.

The control device 1 is the control center of the dialogue device and is composed of a CPU. The storage device 3 has a volatile storage unit such as a RAM and a non-volatile storage unit such as a ROM or a hard disk. Various programs are stored in the non-volatile storage unit. These programs include a voice analysis program for analyzing a user's uttered voice and a voice synthesis program for synthesizing a response voice to the user's uttered voice based on the analysis result of the uttered voice. The control device 1 uses the volatile storage unit as a work area and executes each program stored in the non-volatile storage unit. The computing device 2 is, for example, a DSP, and when the control device 1 executes the voice analysis program and the voice synthesis program, it performs the arithmetic processing for voice analysis and voice synthesis under the control of the control device 1. The display device 4 is, for example, a liquid crystal panel, and displays various information to the user. The operating device 5 includes various operators such as a keyboard and a mouse for receiving instructions from the user. The sound collecting device 6 includes a microphone that picks up the voice uttered by the user, and an A/D converter that A/D-converts the analog voice signal output by the microphone and outputs a sample sequence of the voice signal. The control device 1 takes the sample sequence of the voice signal output by the sound collecting device 6 as the processing target, executes the above-mentioned voice analysis program and voice synthesis program, and outputs a sample sequence of the response voice. The sound emitting device 7 includes a D/A converter that D/A-converts the sample sequence of the response voice and outputs an analog voice signal, and a speaker that emits the analog voice signal as sound.

In the present embodiment, the control device 1 functions as a voice analysis device by executing a voice analysis program. FIG. 2 is a time chart explaining the function of the control device 1 as a voice analysis device. In FIG. 2, the horizontal axis represents time and the vertical axis represents the sound level (sound pressure level) of the sound signal to be processed.

In the present embodiment, the control device 1 divides the sample sequence of the voice signal output from the sound collecting device 6 into frames of a fixed time length, and executes the voice analysis program 10 while monitoring the generation time of each frame. FIG. 3 is a functional block diagram showing the configuration of the functions realized by the control device 1 executing the voice analysis program 10 in the present embodiment. As shown in FIG. 3, this functional configuration based on the voice analysis program 10 includes a specifying unit 11 and an analysis unit 12. The specifying unit 11 (control device 1) specifies a plurality of partial utterance sections PUP1 to PUP3 included in one utterance section UP1 in the voice signal. The analysis unit 12 (control device 1) analyzes the change in the voice signal for each partial utterance section.
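
The framing step mentioned above can be illustrated with a minimal sketch. The function name split_into_frames, the parameters sample_rate_hz and frame_ms, and the choice to drop a trailing incomplete frame are assumptions made for illustration; the embodiment only states that frames have a fixed time length.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split a sample sequence into consecutive frames of a fixed time length.

    ASSUMPTION: a trailing frame shorter than frame_ms is dropped.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]
```

For instance, at a hypothetical 1000 Hz sampling rate with 4 ms frames, a 10-sample sequence yields two frames of 4 samples each, and the last 2 samples are left over.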

More specifically, the specifying unit 11 determines the end of an utterance in the voice signal based on a first end determination criterion (hereinafter, the first determination criterion) and specifies the utterance section UP1 whose end is the time t6 at which the end of the utterance is determined. It then determines the ends of utterances within the utterance section UP1 based on a second end determination criterion (hereinafter, the second determination criterion), which allows finer division than the first end determination criterion, and specifies a plurality of partial utterance sections PUP1 to PUP3 whose ends are the times t2, t4, and t6 at which those ends are determined.

Here, the first and second end determination criteria are, for example, criteria relating to the length of the silent section from when the voice level of the voice signal falls below the threshold TH2 until it exceeds the threshold TH1, which is greater than the threshold TH2.

In the example shown in FIG. 2, the audio level does not exceed the threshold TH1 even at time t7 when a time longer than the threshold TH4 elapses after the audio level of the audio signal becomes less than the threshold TH2 at time t6. That is, since the duration (silence time) length of the silent section from time t6 to time t7 exceeds the threshold TH4, it is determined that the time t6 is the end of the utterance section UP1.

On the other hand, in the example shown in FIG. 2, the audio level exceeds the threshold TH1 at time t3, when a time longer than the threshold TH3 (which is shorter than the threshold TH4) has passed since the audio level of the audio signal became less than the threshold TH2 at time t2. That is, the silent period from time t2 to time t3 is shorter than the threshold TH4 and longer than the threshold TH3. Therefore, the time t2 is determined to be the end of the partial utterance section PUP1. The same applies to the partial utterance section PUP2.

Here, the threshold TH3 used for the second determination criterion regarding the end of the partial utterance section is shorter than the threshold TH4 used for the first determination criterion regarding the end of the utterance section. Therefore, by using the second criterion, the utterance section detected by the first criterion can be subdivided into shorter partial utterance sections. That is, as a criterion for the silent section, the second criterion (TH3) can be said to be looser than the first criterion (TH4). "Loose" here means that the second criterion can detect, within one utterance section delimited by the first criterion, the short silent sections that delimit the partial utterance sections.

In the present embodiment, the end of a partial utterance section or an utterance section is determined based on the length of the silent section. Viewed this way, the specifying unit 11 of the functional configuration of the voice analysis program 10 can be said to specify the utterance sections into which the voice signal is divided by silent sections whose duration is longer than the time threshold TH4 and, within each utterance section, to specify one or more partial utterance sections PUP1 to PUP3 separated by the short silent sections t2 to t3 and t4 to t5 whose duration is shorter than the time threshold TH4. The analysis unit 12 of the functional configuration of the voice analysis program 10 can likewise be said to analyze the temporal change of the voice signal for each partial utterance section.
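
The two-level division described above can be sketched offline as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: it works on a precomputed list of per-frame voice levels, measures durations in frames, ignores the minimum-duration threshold TH5, and treats a silence no longer than TH3 as not splitting a section. The function names find_voiced_runs and group_runs are hypothetical.

```python
def find_voiced_runs(levels, th1, th2):
    """Return (start, end) frame pairs of voiced runs.

    A run starts at the first frame whose level exceeds th1 and ends at the
    first later frame whose level falls below th2 (end is exclusive).
    """
    runs = []
    start = None
    for i, level in enumerate(levels):
        if start is None and level > th1:
            start = i
        elif start is not None and level < th2:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(levels)))
    return runs


def group_runs(runs, th3, th4):
    """Group voiced runs into utterance sections, each a list of partial sections.

    gap > th4         -> boundary between utterance sections (first criterion)
    th3 < gap <= th4  -> boundary between partial sections (second criterion)
    gap <= th3        -> ASSUMPTION: too short to split, runs are joined
    """
    utterances = []
    for start, end in runs:
        if not utterances:
            utterances.append([(start, end)])
            continue
        prev_start, prev_end = utterances[-1][-1]
        gap = start - prev_end
        if gap > th4:
            utterances.append([(start, end)])
        elif gap > th3:
            utterances[-1].append((start, end))
        else:
            utterances[-1][-1] = (prev_start, end)
    return utterances
```

Keeping TH1 greater than TH2 gives the level detection hysteresis, so a level hovering around a single threshold does not toggle the state on every frame.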

In the present embodiment, the control device 1 executes the voice synthesis program in parallel with the voice analysis program 10. The voice analysis program 10 analyzes the change in pitch of the voice signal for each partial utterance section constituting the utterance section, and delivers the analysis result to the voice synthesis program. Based on the analysis result, the voice synthesis program determines the content of the response to the user's utterance, synthesizes a sample sequence of the response voice, and supplies it to the sound emitting device 7. That is, the control device 1 functions as a voice synthesizing device that synthesizes a response voice with respect to the voice in the utterance section by executing the voice synthesis program in parallel with the voice analysis program 10.
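
As a rough illustration of analyzing the pitch change for each partial utterance section, the sketch below labels each section as rising or falling. The heuristic (comparing the first and last pitch values of the final few frames of a section), the names pitch_trend and analyze_partials, and the parameter tail are all assumptions made for this example; this passage does not specify how the pitch change is classified.

```python
def pitch_trend(pitches, tail=3):
    """Label a non-empty pitch sequence 'up' or 'down' from its final frames.

    ASSUMPTION: the trend is judged by comparing the first and last values
    of the last `tail` frames; the actual analysis may differ.
    """
    ending = pitches[-tail:]
    return "up" if ending[-1] > ending[0] else "down"


def analyze_partials(pitch_track, partials, tail=3):
    """Return one trend label per partial section given as (start, end) frames."""
    return [pitch_trend(pitch_track[start:end], tail) for start, end in partials]
```

With three partial sections whose pitch falls, rises, and falls again, this produces a down, up, down pattern, which resembles utterance example 2 and keeps the rising middle section visible to whatever logic chooses the response.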
The above is the configuration of the present embodiment.

FIG. 4 is a flowchart showing the processing contents of the voice analysis program 10 in this embodiment. FIG. 5 is a flowchart showing the processing contents of the utterance section processing S4 in the program 10. Of the processes shown in FIG. 5, S42433 is executed by the analysis unit 12 (control device 1) described above, and the other processes are executed by the specifying unit 11 (control device 1) described above. FIGS. 6 to 8 are time charts showing first to third operation examples of this embodiment. In FIGS. 6 to 8, the horizontal axis represents time and the vertical axis represents the audio level of the audio signal to be processed.

First, the first operation example of FIG. 6 will be described with reference to the flowcharts of FIGS. 4 and 5. When a predetermined operation is performed on the operating device 5, the control device 1 starts executing the voice analysis program 10 and the voice synthesis program stored in the storage device 3. Since the feature of this embodiment resides in the voice analysis program 10, the following description focuses on the processing contents of the voice analysis program 10.

In the following description, the temporary (provisional) utterance section is a section that starts when the voice level of the voice signal exceeds the threshold TH1. In the present embodiment, a partial utterance section is such a section whose duration is longer than the threshold TH5. At the moment the voice level exceeds the threshold TH1, it is therefore still unknown whether the section starting at that moment will become a partial utterance section, so the section is first treated as a temporary utterance section. When the duration of the temporary utterance section exceeds the threshold TH5, the temporary utterance section becomes a partial utterance section. Likewise, the temporary silence section is a section that starts when the voice level of the voice signal becomes less than the threshold TH2. In the present embodiment, a silent section is such a section whose duration is longer than the threshold TH3. At the moment the voice level becomes less than the threshold TH2, it is still unknown whether the section starting at that moment will become a silent section, so the section is first treated as a temporary silence section. When the duration of the temporary silence section exceeds the threshold TH3, the temporary silence section becomes a silent section.

In the voice analysis program 10, the control device 1 first executes an initialization process S1. In the initialization process S1, the control device 1 sets the temporary silence time, which is the duration of the temporary silence section, to "0", sets the number of partial utterance sections to "0", sets the temporary utterance time, which is the duration of the temporary utterance section, to "0", and sets the temporary utterance section state flag to OFF.

Next, the control device 1 acquires a sample sequence of the input voice signal for one frame from the sound collecting device 6 and stores it in the buffer area in the storage device 3 (S2). Next, the control device 1 extracts parameters of the input voice, such as pitch and voice level, from the sample sequence of the input voice signal stored in the buffer area (S3). Next, the control device 1 executes the utterance section processing S4 shown in FIG. 5. In the utterance section processing S4, the utterance section and the partial utterance sections are extracted from the voice signal, and the temporal change of the voice signal is analyzed for each of the partial utterance sections constituting the utterance section. Next, the control device 1 determines whether or not an end instruction has been issued by operating the operating device 5 or the like (S5). When this determination result is "YES", the control device 1 ends the voice analysis program 10. On the other hand, when this determination result is "NO", the control device 1 returns to S2 and executes the processes from S2 again. In this way, the processes S2 to S5 are repeated until the end instruction is issued.
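
The overall loop S1 to S5 can be sketched as below. This is only an outline under assumptions: the names run_voice_analysis, frame_level_db, process_section, and end_requested are hypothetical, the voice level is computed as an RMS value in decibels (the embodiment does not state how the level is obtained), and pitch extraction is omitted.

```python
import math


def frame_level_db(frame, floor=1e-12):
    """RMS level of one frame in dB (ASSUMPTION: one possible level measure)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, floor))


def run_voice_analysis(frames, process_section, end_requested=lambda: False):
    """Run the S1..S5 loop over an iterable of frames and return the final state."""
    # S1: initialization
    state = {
        "temp_silence": 0,    # temporary silence time
        "num_partials": 0,    # number of partial utterance sections
        "temp_speech": 0,     # temporary utterance time
        "in_speech": False,   # temporary utterance section state flag
    }
    for frame in frames:                            # S2: acquire one frame
        params = {"level": frame_level_db(frame)}   # S3: extract parameters
        process_section(state, params)              # S4: utterance section processing
        if end_requested():                         # S5: end instruction issued?
            break
    return state
```

Passing the utterance section processing in as a callback keeps the loop independent of how S4 is implemented.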

Next, the processing content of the speech section processing S4 of FIG. 5 will be described.
In the utterance section processing S4 of FIG. 5, first, it is determined whether or not the temporary utterance section state flag is OFF (S41). In the first operation example of FIG. 6, after the initialization process S1, the temporary utterance section state flag is OFF during the period in which the voice level of the voice signal is equal to or lower than the threshold TH1. The determination result of S41 is therefore "YES", and the process of the control device 1 proceeds to the temporary silence section process S42.

In the temporary silence section process S42, the control device 1 first determines whether or not the voice level of the voice signal is higher than the threshold TH1 (S421). In the first operation example of FIG. 6, since the voice level of the voice signal is lower than the threshold TH1 in the period before time t1, the determination result of S421 is “NO”. As a result, the control device 1 executes the temporary silent section continuation process S424.

In the temporary silence section continuation process S424, the control device 1 first updates the temporary silence time (S4241). Specifically, the time elapsed since the most recent execution of the initialization process S1, the reset S42434, the update S4241, or the temporary silence section start process S433 is added to the temporary silence time. The resulting temporary silence time is the elapsed time from the start of the current temporary silence section to that point. Next, the control device 1 determines whether the temporary silence time is longer than the threshold TH4 (S4242). When this determination result is "NO", the control device 1 ends the temporary silence section continuation process S424, the temporary silence section process S42, and the utterance section process S4, and proceeds to S5 of FIG. 4.

In the first operation example of FIG. 6, in the silent section before time t1, the determination result of S41 is "YES", the determination result of S421 is "NO", the determination result of S4242 is "NO", and the update of the temporary silence time (S4241) is repeated. When the temporary silence time then exceeds the threshold TH4, the determination result of S4242 becomes "YES", and the temporary silence time is reset to "0" in the processing from S42431 onward, which is described in detail later.

After that, the voice level of the voice signal rises and exceeds the threshold TH1 at time t1, so the determination result of S421 is "YES", and the control device 1 determines whether the temporary silence time is longer than the threshold TH3 (S422). If this determination result is "YES", the process of the control device 1 proceeds to S423. On the other hand, when the determination result of S422 is "NO", the control device 1 determines whether the number of partial utterance sections is 0 (S425). If this determination result is "YES", the process of the control device 1 proceeds to S423.

At time t1 in the first operation example of FIG. 6, if the temporary silence time exceeds the threshold TH3, the determination result of S422 is “YES” and the process proceeds to S423. On the other hand, if the temporary silence time is less than or equal to the threshold TH3 at time t1, the determination result of S422 is "NO" and the process proceeds to S425, but the number of partial utterance sections is 0 at time t1 immediately after the initialization process S1. Therefore, the determination result of S425 is “YES” and the process proceeds to S423. Thus, at time t1, the process proceeds to S423 regardless of whether the temporary silence time exceeds the threshold TH3.

Next, in S423, the control device 1 executes a temporary utterance section start process. Specifically, the control device 1 turns on the provisional utterance section state flag and initializes the provisional utterance time to zero. When the provisional utterance section start process S423 is finished, the control device 1 finishes the provisional silence section process S42 and the utterance section process S4, and proceeds to S5 of FIG.

After that, in the utterance period processing S4, the temporary utterance period state flag is ON, so the determination result in S41 is “NO”, and the control device 1 executes the temporary utterance period process S43. In the provisional utterance section process S43, the control device 1 first determines whether or not the voice level of the input voice signal is less than the threshold TH2 (S431). In the first operation example of FIG. 6, the audio level of the input audio signal is higher than the threshold TH2 during the period from the time t1 to the time t2. Therefore, during this period, the determination result of S431 is "NO", and the control device 1 executes the temporary utterance section continuation process S434. In this temporary utterance section continuation process S434, the temporary utterance time is updated. Specifically, the elapsed time from the latest timing of the execution timings of S423 and S434 is added to the temporary utterance time. The calculated provisional utterance time is the elapsed time from the start of the current provisional utterance section to that point. When S434 ends, the control device 1 ends the temporary utterance section process S43 and the utterance section process S4, and proceeds to S5 in FIG.

After that, the audio level of the input audio signal drops and becomes less than the threshold TH2 at time t2. Then, in the utterance section process S4, the determination result of S41 becomes "NO", and in the temporary utterance section process S43, the determination result of S431 becomes "YES", and the control device 1 determines whether the temporary utterance time is longer than the threshold TH5 (S432). In the first operation example of FIG. 6, since the temporary utterance time from time t1 to time t2 exceeds the threshold TH5, the determination result of S432 is "YES", and the control device 1 executes the temporary silence section start process S433. In the temporary silence section start process S433, the control device 1 sets the section from time t1 to time t2 in the input voice signal as the unregistered partial utterance section PUP1, sets the temporary utterance section state flag to OFF, and initializes the temporary silence time to 0. At this time, the number of partial utterance sections is 1. When the temporary silence section start process S433 ends, the control device 1 ends the temporary utterance section process S43 and the utterance section process S4, and proceeds to S5 in FIG. 4.

After that, in the utterance section process S4, since the temporary utterance section state flag is OFF, the determination result in S41 becomes "YES", and the process proceeds to the temporary silence section process S42. In the temporary silence section process S42, while the voice level of the input voice signal is less than the threshold TH1, the determination result of S421 is "NO", and the process proceeds to the temporary silence section continuation process S424. In the temporary silence section continuation process S424, the temporary silence time is updated (S4241) and it is determined whether the temporary silence time is longer than the threshold TH4 (S4242). When the determination result of S4242 is "NO", the control device 1 ends the temporary silence section continuation process S424, the temporary silence section process S42, and the utterance section process S4, and proceeds to S5 in FIG. 4. In the first temporary silence section of the first operation example, this processing is repeated until time t3 without the temporary silence time exceeding the threshold TH4.

Then, since the voice level of the input voice signal rises and exceeds the threshold TH1 at time t3, in the temporary silence section process S42 the determination result of S421 becomes "YES", and the control device 1 determines whether the temporary silence time is longer than the threshold TH3 (S422). In the first operation example, since the temporary silence time t3-t2 exceeds the threshold TH3, the determination result of S422 is "YES", and the control device 1 executes the temporary utterance section start process S423, ends the temporary silence section process S42 and the utterance section process S4, and proceeds to S5 in FIG. 4. After that, the control device 1 repeats the processing of S41, S431, and S434 until time t4.

Then, since the voice level of the input voice signal decreases and becomes less than the threshold TH2 at time t4, the determination result of S431 becomes "YES" in the temporary utterance section process S43, and the control device 1 determines whether the temporary utterance time t4-t3 is longer than the threshold TH5 (S432). In the first operation example, the determination result of S432 is "YES". As a result, the control device 1 executes the temporary silence section start process S433, sets the section from time t3 to time t4 in the input voice signal as the unregistered partial utterance section PUP2, sets the temporary utterance section state flag to OFF, and initializes the temporary silence time to 0. At this time, the number of partial utterance sections is 2. When the temporary silence section start process S433 ends, the control device 1 ends the temporary utterance section process S43 and the utterance section process S4, and proceeds to S5 in FIG. 4.

Then, in the first operation example, t5-t4 > TH3 holds at time t5, when the voice level of the input voice signal rises and exceeds the threshold TH1, and t6-t5 > TH5 holds at time t6, when the voice level of the input voice signal drops and falls below the threshold TH2. The operation in this case is similar to the operation performed for the partial utterance sections PUP1 and PUP2.

At time t6, in the temporary utterance period process S43 of the utterance period process S4, the determination result of S431 is “YES”, the determination result of S432 is “YES”, and the control device 1 executes the temporary silence period start process S433, The section from time t5 to time t6 in the input voice signal is set as the unregistered partial utterance section PUP3, the temporary utterance section state flag is set to OFF, and the temporary silence duration is initialized to 0. After that, the control device 1 repeats the processing of S41, S421, S4241, and S4242.

Then, in the first operation example, the temporary silence time exceeds the threshold TH4 at time t7, and this temporary silence section is determined to be a silent section. Therefore, in the temporary silence section continuation process S424, the determination result of S4242 is "YES", and the control device 1 executes the partial utterance section process S4243.

In this partial utterance section process S4243, the control device 1 first determines whether the number of partial utterance sections is 1 or more (S42431). In the first operation example, at time t7, three partial utterance sections PUP1, PUP2, and PUP3 have been detected and the number of partial utterance sections is 3, so the determination result of S42431 is "YES" and the control device 1 executes the utterance section configuration process S42432. Specifically, the control device 1 registers the section from time t1 to time t6, which includes the partial utterance sections PUP1, PUP2, and PUP3, as the utterance section UP1. Next, the control device 1 executes the utterance section analysis process S42433. Details of the utterance section analysis process S42433 will be described later. Next, the control device 1 executes the reset S42434. In this reset S42434, the temporary silence time is reset to "0" and the number of partial utterance sections is reset to "0". Even after the time t7, the update of the temporary silence time continues in S4241, and every time the temporary silence time exceeds the threshold TH4, the determination result of S4242 is "YES". However, since the number of partial utterance sections is "0", the determination result of S42431 is "NO", and the temporary silence time is reset to "0" in S42434. The temporary silence time does not have to be updated after the silent section has been determined.

The above is the first operation example of the present embodiment. The above-described processing contains a plurality of branches based on comparison with a threshold value. Whether each of these branches goes to YES or NO when the compared value is equal to the threshold has little bearing on the essence of the present disclosure and may be changed as needed.

Next, the second operation example of FIG. 7 will be described with reference to the flowcharts of FIGS. 4 and 5. The second operation example differs from the first operation example (FIG. 6) in the following point. In the first operation example, the temporary silence time t3-t2, from the time t2 when the voice level of the input voice signal falls below the threshold TH2 to the time t3 when the voice level exceeds the threshold TH1, is longer than the threshold TH3. In the second operation example, by contrast, the temporary silence time t3-t2 is less than or equal to the threshold TH3.

In this second operation example, at time t3, the determination result of S41 of the utterance section processing S4 is “YES” and the determination result of S421 of the temporary silence section processing S42 is “YES”, so the process proceeds to S422. Since the temporary silence time is equal to or less than the threshold TH3, the determination result of S422 is "NO". Furthermore, at time t3 the section from time t1 to time t2 is a partial utterance section, so the determination result of S425 is “NO”. As a result, the control device 1 executes the temporary utterance section restart processing S426. In this processing, the immediately preceding partial utterance section, which lasted from time t1 to time t2, and the temporary utterance section beginning at time t3 are connected and integrated. Specifically, the temporary utterance section state flag is set to ON, and the elapsed time from time t1 to time t3 is set as the temporary utterance time. As a result, in the second operation example, time t1 is the start of the partial utterance section PUP1, and time t4, the first time after t3 at which the voice level of the input voice signal falls below the threshold TH2, is the end of the same partial utterance section PUP1. Consequently, two partial utterance sections, PUP1 and PUP2, are detected in the second operation example.
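The effect of the restart processing S426 can be illustrated offline on a list of already detected voiced intervals. The following is a simplified sketch, not the frame-by-frame flowchart itself: the function name `merge_short_gaps`, the interval representation, and the numeric values are assumptions made for the example.

```python
def merge_short_gaps(intervals, th3):
    """Join consecutive (start, end) intervals whose silent gap is <= th3.

    Offline approximation of S426: a temporary silence no longer than TH3
    does not end the partial utterance section, so both sides are merged.
    """
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= th3:
            # Short gap: extend the previous partial utterance section.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Second operation example: the gap t3 - t2 is at most TH3 (made-up times).
voiced = [(1.0, 2.0), (2.1, 4.0), (5.0, 6.0)]
partials = merge_short_gaps(voiced, th3=0.2)
```

With these values the first two intervals are joined into a single partial utterance section, leaving two partial utterance sections in total, which matches the outcome described for the second operation example.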

Next, the third operation example of FIG. 8 will be described with reference to the flowcharts of FIGS. 4 and 5. The third operation example differs from the first operation example (FIG. 6) in the following points. In the first operation example, the provisional utterance time t2-t1 from the time t1 when the audio level of the input audio signal exceeds the threshold TH1 to the time t2 when the audio level becomes less than the threshold TH2 exceeds the threshold TH5. On the other hand, in the third operation example, the provisional utterance time t2-t1 is less than or equal to the threshold TH5.

In the third operation example, at time t2, the determination result of S41 of the utterance section processing S4 is “NO” and the determination result of S431 of the temporary utterance section processing S43 is “YES”, so the process proceeds to S432. Since the temporary utterance time is equal to or less than the threshold TH5, the determination result of S432 is "NO". As a result, the control device 1 executes the temporary silence section restart processing S435. In this processing, the temporary silence section within the silent section up to time t1 and the temporary silence section after time t2 are connected and integrated into one temporary silence section. Specifically, the temporary utterance section state flag is set to OFF, and the elapsed time from time 0 to time t3 is set as the temporary silence time. As a result, in the third operation example, the partial utterance section starting from time t3 becomes the first partial utterance section PUP1. That is, in the present embodiment, a section whose temporary utterance time is equal to or less than the threshold TH5 is not regarded as a partial utterance section but is treated as a continuation of the immediately preceding temporary silence section. The third operation example shows a case in which the first temporary utterance section after a silence section is incorporated into the immediately preceding temporary silence section. The same applies to other sections, such as the partial utterance section PUP2 shown in the drawings: when the duration of a temporary utterance section is equal to or less than the threshold TH5, that temporary utterance section is incorporated into the temporary silence section immediately before it. As a result, two partial utterance sections, PUP1 and PUP2, are detected in the third operation example.
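The treatment of short temporary utterance sections can likewise be illustrated offline. The sketch below is an assumption-laden simplification: the function name `drop_short_bursts`, the interval representation, and the numeric values are hypothetical, and the real processing works frame by frame rather than on a finished list.

```python
def drop_short_bursts(intervals, th5):
    """Keep only (start, end) intervals whose duration exceeds th5.

    Offline approximation of S435: a temporary utterance section no longer
    than TH5 is not a partial utterance section. Removing it lets the
    silence on both sides count as one continuous temporary silence.
    """
    return [(start, end) for start, end in intervals if end - start > th5]


# Third operation example: the first burst t2 - t1 is at most TH5 (made-up times).
voiced = [(1.0, 1.05), (3.0, 4.0), (5.0, 6.0)]
partials = drop_short_bursts(voiced, th5=0.1)
```

Because the short burst disappears from the list, the gap seen by any later comparison with TH4 automatically includes the burst's duration, which mirrors the integration of the two temporary silence sections described above.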

Next, the utterance section analysis S42433 executed in the utterance section process S4 will be described. In the following, the utterance section analysis S42433 will be described, taking as an example the case where the above-mentioned utterance examples 1 to 3 are the utterance contents of the utterance section.
Example 1: "hirugohan, ramen de ii?" (in Roman letters; meaning "Would you like to eat Japanese noodles for lunch?")
Example 2: "hirugohan, ramen de ii? ne." (in Roman letters; meaning "You would like to eat Japanese noodles for lunch, wouldn't you?", spoken with falling intonation at the end)
Example 3: "hirugohan, ramen de ii? ne?" (in Roman letters; meaning "You would like to eat Japanese noodles for lunch, wouldn't you?", spoken with rising intonation at the end)

In utterance section analysis S42433, pitch transition of a voice signal is obtained for each partial utterance section that constitutes the utterance section configured in S42432.

In the case of Example 1, in the utterance section analysis S42433, the pitch transitions of the partial utterance section "hirugohan," (in Roman letters) and the partial utterance section "ramen de ii?" (in Roman letters) that form the utterance section are obtained, and a rising pitch transition is observed at the end of the last partial utterance section "ramen de ii?". Therefore, in the utterance section analysis S42433, it is determined that the utterance in the utterance section has a questioning intention.

In the case of Example 2, in the utterance section analysis S42433, the pitch transition is obtained for each of the partial utterance sections "hirugohan," (in Roman letters), "ramen de ii?" (in Roman letters), and "ne." (in Roman letters) that form the utterance section, and a rising pitch transition is observed at the end of the partial utterance section "ramen de ii?" in the middle of the utterance section. Therefore, in the utterance section analysis S42433, it is determined that the utterance in the utterance section has a questioning intention.

In the case of Example 3, in the utterance section analysis S42433, the pitch transition is obtained for each of the partial utterance sections "hirugohan," (in Roman letters), "ramen de ii?" (in Roman letters), and "ne?" (in Roman letters) that constitute the utterance section, and a rising pitch transition is observed both at the end of the second partial utterance section "ramen de ii?" and at the end of the last partial utterance section "ne?". In the utterance section analysis S42433, the number of partial utterance sections whose end shows a rising pitch transition, among the partial utterance sections constituting the utterance section, is used as a measure of how strongly the question is being pressed. Therefore, in the case of Example 3, it is determined in the utterance section analysis S42433 that the speaker is pressing the question.
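The judgments for Examples 1 to 3 may be sketched as follows. Everything concrete in this sketch is an assumption made for illustration: the pitch contours are invented values in Hz, the rise test in `ends_rising` (comparing the mean of the last few frames with the mean of the frames before them) is only one possible criterion, and the label returned when no rise is found is hypothetical, since the text above does not describe that case.

```python
def ends_rising(pitch_hz, tail=3, min_rise_hz=10.0):
    """Hypothetical test for a rising pitch transition at the end of a section."""
    if len(pitch_hz) < 2 * tail:
        return False
    last = sum(pitch_hz[-tail:]) / tail
    previous = sum(pitch_hz[-2 * tail:-tail]) / tail
    return last - previous >= min_rise_hz


def classify_intent(partial_contours):
    """Count partial utterance sections that end with a rise and map to a label."""
    rising = sum(1 for contour in partial_contours if ends_rising(contour))
    if rising == 0:
        return "no_question_detected"   # assumed label, not from the source
    if rising == 1:
        return "question"
    return "pressing_question"


# Invented pitch contours for the three kinds of partial utterance section.
flat = [200, 200, 200, 200, 200, 200]   # e.g. "hirugohan,"
rise = [200, 200, 200, 220, 240, 260]   # e.g. "ramen de ii?" or "ne?"
fall = [220, 220, 220, 200, 190, 180]   # e.g. "ne."

example1 = classify_intent([flat, rise])
example2 = classify_intent([flat, rise, fall])
example3 = classify_intent([flat, rise, rise])
```

Note that Example 2 is still classified as a question even though the utterance section ends with a falling pitch, because the rise is found at the end of a partial utterance section in the middle.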

In the voice analysis program 10, the information indicating the intention of the speaker judged by the utterance section analysis S42433 is delivered to the voice synthesis program. The voice synthesis program determines the content of the response voice to the speaker based on this information. In each of utterance examples 1 to 3, the utterance is determined to have a questioning intention, so the response voice to the utterance is controlled to have characteristics peculiar to a response to a question, as in Patent Document 1. Note that in utterance example 3 the question amounts to pressing for confirmation. Therefore, the response voice to utterance example 3 may be controlled so that its characteristics differ accordingly from those of the responses to the questions in utterance examples 1 and 2.

As described above, according to the present embodiment, the voice signal is divided into utterance sections each including one or a plurality of partial utterance sections, and the time change of the voice signal, specifically the change in pitch, is analyzed for each partial utterance section. Therefore, even when it is difficult to judge the intention of the speaker only from the pitch transition at the end of the utterance section (for example, utterance example 2), the intention of the speaker can be judged appropriately and easily, and the response voice to the utterance can be controlled accordingly.

Although one embodiment of the present disclosure has been described above, the present disclosure may have other embodiments. For example:

(1) In the above embodiment, the input voice signal is divided into partial utterance sections separated by short silence sections, and when a temporary silence section longer than a short silence section (the end of an utterance section) occurs, the one or more partial utterance sections obtained up to that point are combined to form one utterance section. However, the application range of the present disclosure is not limited to such an aspect. For example, the following mode is also possible. First, silent sections whose duration exceeds a first time threshold are found in the voice signal, and one or a plurality of utterance sections separated by those silent sections are extracted from the voice signal. Next, within one utterance section, short silent sections whose duration exceeds a second time threshold (< first time threshold) are found, and one or more partial utterance sections separated by those short silent sections are extracted from the utterance section. Even in such an aspect, the same effect as that of the above embodiment can be obtained.

(2) In the above embodiment, the boundaries of the partial utterance sections (short silence sections) and the boundaries of the utterance sections (silent sections) are determined based on the duration of the temporary silence section (temporary silence time), which in turn is determined based on the voice level. However, for at least one of the first criterion and the second criterion, the short silence section or the silent section may be determined based on factors other than the temporary silence time, such as the voice level, pitch, or spectrum of the section, in addition to or in place of the temporary silence time. For example, a feature of the voice that is likely to appear at the end of an utterance may be made a requirement for ending the partial utterance section or the utterance section. In that case, the ending conditions may be set so that the end of the utterance section has a stronger feeling of being “finished” than the end of the partial utterance section.

(3) In order to analyze the intention of the speaker, the pitch transition analysis and the voice recognition engine or the emotion recognition engine may be used together. By doing so, it is possible to robustly analyze the intention of the speaker.

(4) The partial utterance section may be used not only as a unit of intention analysis but also as a unit of voice recognition or emotion recognition.

(5) The voice analysis program 10 of the above embodiment may be applied to a device other than the dialogue device, such as a voice control device or a voice dialogue evaluation device.

(6) The cloud server may provide a service for using the voice analysis program of the above embodiment.

(7) The voice analysis program of the above embodiment may be provided as a PC application or a smartphone application.

(8) The present disclosure can also be realized as a device that analyzes voice in a toy, a car navigation system, or the like.

(9) In order to make the dialogue natural, the pitch of the response voice may be controlled so as to have a predetermined relationship, for example a consonant relationship, with the pitch of one of the partial utterance sections constituting the utterance section, for example the partial utterance section whose end shows a rising pitch transition and in which the intention of the speaker therefore appears.
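The alternative mode described in (1) above, in which utterance sections are extracted first and each is then subdivided, may be sketched as follows. This is a minimal sketch under stated assumptions: the input is taken to be a list of voiced (start, end) intervals already obtained from the voice level, and the names `split_by_gaps`, `span`, and `two_pass_segment`, the dictionary layout, and the numeric values are hypothetical.

```python
def split_by_gaps(intervals, threshold):
    """Group consecutive (start, end) intervals, splitting where the gap exceeds threshold."""
    groups = []
    for interval in intervals:
        if groups and interval[0] - groups[-1][-1][1] <= threshold:
            groups[-1].append(interval)
        else:
            groups.append([interval])
    return groups


def span(group):
    """Return the (start, end) range covered by a group of intervals."""
    return (group[0][0], group[-1][1])


def two_pass_segment(voiced, first_threshold, second_threshold):
    """First pass: utterance sections. Second pass: partial utterance sections."""
    if not second_threshold < first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    result = []
    for utterance in split_by_gaps(voiced, first_threshold):
        partials = [span(g) for g in split_by_gaps(utterance, second_threshold)]
        result.append({"utterance": span(utterance), "partials": partials})
    return result


# Made-up voiced intervals in seconds.
voiced = [(0.0, 0.5), (0.6, 1.0), (1.5, 2.0), (4.0, 5.0)]
sections = two_pass_segment(voiced, first_threshold=1.0, second_threshold=0.2)
```

With these values the 2.0 s gap separates two utterance sections, and the 0.5 s gap inside the first one separates two partial utterance sections, while the 0.1 s gap is absorbed.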

This application is based on the Japanese patent application (Japanese Patent Application No. 2018-198271) filed on October 22, 2018, which is incorporated herein by reference.

According to the present disclosure, it is possible to provide a voice processing method, a voice processing device, and a voice processing program that can appropriately and easily determine the intention of the speaker even when it is difficult to determine that intention only from the pitch transition at the end of the utterance section.

1 ... control device, 2 ... arithmetic device, 3 ... storage device, 4 ... display device, 5 ... operation device, 6 ... sound collecting device, 7 ... sound emitting device, UP1 ... utterance section, PUP1 to PUP3 ... partial utterance sections, 10 ... voice analysis program, 11 ... specification section, 12 ... analysis section.

Claims

A voice processing method comprising:
specifying, in a voice signal, a plurality of partial utterance sections included in one utterance section; and
analyzing a time change of the voice signal for each of the partial utterance sections.
The voice processing method according to claim 1, wherein
an end of a first utterance in the voice signal is determined according to a first criterion,
the utterance section that ends at the time when the end of the first utterance is determined is specified in the voice signal,
an end of a second utterance in the voice signal is determined according to a second criterion different from the first criterion, and
the plurality of partial utterance sections, each ending at a time when the end of the second utterance is determined, are specified in the voice signal.
The voice processing method according to claim 1, wherein the utterance section delimited by silent sections whose duration is longer than a time threshold is specified in the voice signal, and the plurality of partial utterance sections delimited by silent sections whose duration is shorter than the time threshold are specified in the voice signal.
The voice processing method according to claim 1, wherein the timing at which the voice level of the voice signal exceeds a first voice level threshold is set as the start of the partial utterance section, and the timing at which the voice level of the voice signal falls below a second voice level threshold lower than the first voice level threshold is set as the end of the partial utterance section.
A voice processing method in which, for each utterance section, a response voice to the voice based on the voice signal of that utterance section is synthesized based on the analysis result of the plurality of partial utterance sections included in that utterance section.
A voice processing device comprising:
one or more computers; and
one or more data storage devices that store a plurality of instructions which, when executed by the one or more computers, cause the voice processing device to perform the following operations:
specifying, in a voice signal, a plurality of partial utterance sections included in one utterance section; and
analyzing a time change of the voice signal for each of the partial utterance sections.
A voice processing program that causes a computer to execute:
specifying, in a voice signal, a plurality of partial utterance sections included in one utterance section; and
analyzing a time change of the voice signal for each of the partial utterance sections.