WO2019142231A1 - Voice analysis device, voice analysis method, voice analysis program, and voice analysis system - Google Patents

Voice analysis device, voice analysis method, voice analysis program, and voice analysis system

Info

Publication number
WO2019142231A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
section
unit
amount
participants
Prior art date
Application number
PCT/JP2018/000942
Other languages
French (fr)
Japanese (ja)
Inventor
武志 水本
哲也 菅原
Original Assignee
ハイラブル株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ハイラブル株式会社
Priority to PCT/JP2018/000942
Priority to JP2018502279A (JP6589040B1)
Publication of WO2019142231A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10: Transforming into visible information
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • The present invention relates to a voice analysis device for analyzing voice, a voice analysis method, a voice analysis program, and a voice analysis system.
  • The Harkness method is known as a method for analyzing discussions in group learning and meetings (see, for example, Non-Patent Document 1).
  • In the Harkness method, the transitions of each participant's utterances are recorded as lines. This makes it possible to analyze each participant's contribution to the discussion and their relationships with others.
  • The Harkness method can also be applied effectively to active learning, in which students take the initiative in their own learning.
  • However, because the Harkness method shows the overall tendency of utterances across the whole period from the start to the end of a discussion, it cannot show how each participant's amount of speech changes over time. There is therefore the problem that analyses based on temporal changes in each participant's amount of speech are difficult.
  • The present invention has been made in view of these points, and its object is to provide a voice analysis device, a voice analysis method, a voice analysis program, and a voice analysis system capable of outputting information for performing analyses based on temporal changes in participants' amounts of speech in a discussion.
  • A voice analysis device according to a first aspect of the present invention includes: an acquisition unit that acquires voices uttered by a plurality of participants; an analysis unit that identifies the amount of speech per unit time of each of the plurality of participants in the voices; a section setting unit that sets sections in the voices based on input from a user; and an output unit that outputs a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, together with information indicating the sections in the graph.
  • The output unit may output, as the information indicating the sections, the position on the graph corresponding to the time at which one of two sections switches to the other.
  • The section setting unit may set a section based on at least one of an operation on a communication terminal that communicates with the voice analysis device, an operation on a sound collection device that acquires the voices, and a predetermined sound included in the voices.
  • The output unit may output the graph with the temporal changes in the amounts of speech stacked in ascending order of the degree of variation in the amount of speech calculated for each of the plurality of participants.
  • The output unit may output the graph with the temporal changes in the amounts of speech stacked, for each section, in ascending order of the degree of variation in the amount of speech in that section calculated for each of the plurality of participants.
  • The output unit may output a plurality of graphs for the same section set in a plurality of the voices.
  • In addition to the graph and the information indicating the sections, information indicating an event that occurred within the time span of the voices may be output on the graph.
  • The analysis unit may identify, as the amount of speech, the value obtained by dividing the length of time during which a participant spoke within a predetermined time window by the length of the time window.
  • In a voice analysis method according to a second aspect of the present invention, a processor executes the steps of: acquiring voices uttered by a plurality of participants; identifying the amount of speech per unit time of each of the plurality of participants in the voices; setting sections in the voices based on input from a user; and outputting a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, together with information indicating the sections in the graph.
  • A voice analysis program according to a third aspect of the present invention causes a computer to execute the steps of: acquiring voices uttered by a plurality of participants; identifying the amount of speech per unit time of each of the plurality of participants in the voices; setting sections in the voices based on input from a user; and outputting a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, together with information indicating the sections in the graph.
  • A voice analysis system according to a fourth aspect of the present invention includes a voice analysis device and a communication terminal capable of communicating with the voice analysis device. The communication terminal has a display unit that displays information. The voice analysis device includes: an acquisition unit that acquires voices uttered by a plurality of participants; an analysis unit that identifies the amount of speech per unit time of each of the plurality of participants in the voices; a section setting unit that sets sections in the voices based on input from a user; and an output unit that causes the display unit to display a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, together with information indicating the sections in the graph.
  • FIG. 1 is a schematic view of the voice analysis system S according to the present embodiment.
  • The voice analysis system S includes a voice analysis device 100, a sound collection device 10, and a communication terminal 20.
  • The numbers of sound collection devices 10 and communication terminals 20 included in the voice analysis system S are not limited.
  • The voice analysis system S may also include other devices such as servers and terminals.
  • The voice analysis device 100, the sound collection device 10, and the communication terminal 20 are connected via a network N such as a local area network or the Internet. At least some of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 may be directly connected without the network N.
  • The sound collection device 10 includes a microphone array containing a plurality of sound collection units (microphones) arranged to face in different directions.
  • For example, the microphone array includes eight microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground.
  • The sound collection device 10 transmits the voice acquired using the microphone array to the voice analysis device 100 as data.
  • The communication terminal 20 is a communication device capable of wired or wireless communication.
  • The communication terminal 20 is, for example, a portable terminal such as a smartphone, or a computer terminal such as a personal computer.
  • The communication terminal 20 receives the settings of analysis conditions from the analyst and displays the analysis results produced by the voice analysis device 100.
  • The voice analysis device 100 is a computer that analyzes the voice acquired by the sound collection device 10 using the voice analysis method described later. The voice analysis device 100 also transmits the results of the voice analysis to the communication terminal 20.
  • FIG. 2 is a block diagram of the voice analysis system S according to the present embodiment. The arrows in FIG. 2 indicate the main data flows, and there may be data flows not shown in FIG. 2. Each block in FIG. 2 shows a configuration in units of functions, not in units of hardware (devices). As such, the blocks shown in FIG. 2 may be implemented in a single device or may be implemented separately in a plurality of devices. Data may be exchanged between the blocks via any means, such as a data bus, a network, or a portable storage medium.
  • The communication terminal 20 has a display unit 21 for displaying various types of information, and an operation unit 22 for receiving operations by the analyst.
  • The display unit 21 includes a display device such as a liquid crystal display or an organic light emitting diode (OLED) display.
  • The operation unit 22 includes operation members such as buttons, switches, and dials.
  • The display unit 21 and the operation unit 22 may be configured integrally by using, as the display unit 21, a touch screen capable of detecting the position touched by the analyst.
  • The voice analysis device 100 includes a control unit 110, a communication unit 120, and a storage unit 130.
  • The control unit 110 includes a setting unit 111, a voice acquisition unit 112, a sound source localization unit 113, an analysis unit 114, a section setting unit 115, and an output unit 116.
  • The storage unit 130 includes a setting information storage unit 131, a voice storage unit 132, and an analysis result storage unit 133.
  • The communication unit 120 is a communication interface for communicating with the sound collection device 10 and the communication terminal 20 via the network N.
  • The communication unit 120 includes a processor, connectors, electric circuits, and the like for performing communication.
  • The communication unit 120 performs predetermined processing on communication signals received from the outside to acquire data, and inputs the acquired data to the control unit 110. The communication unit 120 also performs predetermined processing on data input from the control unit 110 to generate communication signals, and transmits the generated signals to the outside.
  • The storage unit 130 is a storage medium including a read only memory (ROM), a random access memory (RAM), a hard disk drive, and the like.
  • The storage unit 130 stores in advance the program to be executed by the control unit 110.
  • The storage unit 130 may be provided outside the voice analysis device 100; in that case, it exchanges data with the control unit 110 via the communication unit 120.
  • The setting information storage unit 131 stores setting information indicating the analysis conditions set by the analyst on the communication terminal 20.
  • The voice storage unit 132 stores the voice acquired by the sound collection device 10.
  • The analysis result storage unit 133 stores analysis results indicating the results of analyzing the voice.
  • The setting information storage unit 131, the voice storage unit 132, and the analysis result storage unit 133 may be storage areas on the storage unit 130, or databases configured on the storage unit 130.
  • The control unit 110 is a processor such as a central processing unit (CPU), and functions as the setting unit 111, the voice acquisition unit 112, the sound source localization unit 113, the analysis unit 114, the section setting unit 115, and the output unit 116 by executing the program stored in the storage unit 130.
  • The functions of the setting unit 111, the voice acquisition unit 112, the sound source localization unit 113, the analysis unit 114, the section setting unit 115, and the output unit 116 will be described later with reference to FIGS. 3 to 9.
  • At least some of the functions of the control unit 110 may be implemented by an electric circuit.
  • At least some of the functions of the control unit 110 may be implemented by a program executed via a network.
  • The voice analysis system S is not limited to the specific configuration shown in FIG. 2.
  • The voice analysis device 100 is not limited to a single device, and may be configured by connecting two or more physically separated devices by wire or wirelessly.
  • FIG. 3 is a schematic view of the voice analysis method performed by the voice analysis system S according to the present embodiment.
  • The analyst sets the analysis conditions by operating the operation unit 22 of the communication terminal 20.
  • The analysis conditions are information indicating the number of participants in the discussion to be analyzed and the direction in which each participant (that is, each of the plurality of participants) is located with respect to the sound collection device 10.
  • The communication terminal 20 receives the settings of the analysis conditions from the analyst and transmits them as setting information to the voice analysis device 100 (a).
  • The setting unit 111 of the voice analysis device 100 acquires the setting information from the communication terminal 20 and causes the setting information storage unit 131 to store it.
  • FIG. 4 is a front view of the display unit 21 of the communication terminal 20 displaying the setting screen A.
  • The communication terminal 20 displays the setting screen A on the display unit 21 and receives the settings of the analysis conditions from the analyst.
  • The setting screen A includes a position setting area A1, a start button A2, and an end button A3.
  • The position setting area A1 is an area for setting the direction in which each participant U is actually located with respect to the sound collection device 10 in the discussion to be analyzed.
  • The position setting area A1 represents a circle centered on the position of the sound collection device 10, as shown in FIG. 4, and further represents angles relative to the sound collection device 10 along the circle.
  • The analyst sets the position of each participant U in the position setting area A1 by operating the operation unit 22 of the communication terminal 20.
  • Identification information (here, U1 to U4) for identifying each participant U is assigned to and displayed at each set position. In the example of FIG. 4, four participants U1 to U4 are set.
  • The portion corresponding to each participant U in the position setting area A1 is displayed in a different color for each participant. This allows the analyst to easily recognize the direction set for each participant U.
  • The start button A2 and the end button A3 are virtual buttons displayed on the display unit 21.
  • The communication terminal 20 transmits a start instruction signal to the voice analysis device 100 when the analyst presses the start button A2.
  • The communication terminal 20 transmits an end instruction signal to the voice analysis device 100 when the analyst presses the end button A3.
  • The period from the analyst's start instruction to the end instruction corresponds to one discussion.
  • When the voice acquisition unit 112 of the voice analysis device 100 receives the start instruction signal from the communication terminal 20, it transmits a signal instructing the acquisition of voice to the sound collection device 10 (b). When the sound collection device 10 receives the signal instructing the acquisition of voice from the voice analysis device 100, it starts collecting voice. When the voice acquisition unit 112 of the voice analysis device 100 receives the end instruction signal from the communication terminal 20, it transmits a signal instructing the end of voice acquisition to the sound collection device 10. When the sound collection device 10 receives the signal instructing the end of voice acquisition from the voice analysis device 100, it ends the acquisition of voice.
  • The sound collection device 10 acquires voice with each of the plurality of sound collection units and internally records it as the voice of the channel corresponding to each sound collection unit. The sound collection device 10 then transmits the acquired voices of the plurality of channels to the voice analysis device 100 (c). The sound collection device 10 may transmit the acquired voice sequentially, or may transmit a predetermined amount or a predetermined duration of voice at a time. The sound collection device 10 may also transmit the voice from the start to the end of acquisition all at once.
  • The voice acquisition unit 112 of the voice analysis device 100 receives the voice from the sound collection device 10 and stores it in the voice storage unit 132.
  • The voice analysis device 100 analyzes the voice acquired from the sound collection device 10 at a predetermined timing.
  • The voice analysis device 100 may analyze the voice when the analyst gives an analysis instruction on the communication terminal 20 by a predetermined operation. In this case, the analyst selects the voice corresponding to the discussion to be analyzed from among the voices stored in the voice storage unit 132.
  • The voice analysis device 100 may instead analyze the voice when the acquisition of the voice ends. In this case, the voice from the start to the end of acquisition corresponds to the discussion to be analyzed. The voice analysis device 100 may also analyze the voice sequentially during its acquisition (that is, in real-time processing). In this case, the voice over a predetermined past time (for example, 30 seconds) going back from the current time corresponds to the discussion to be analyzed.
  • When analyzing the voice, the sound source localization unit 113 first performs sound source localization based on the plurality of channels of voice acquired by the voice acquisition unit 112 (d). Sound source localization is processing that estimates the direction of each sound source included in the voice acquired by the voice acquisition unit 112 for each unit time (for example, every 10 milliseconds to 100 milliseconds). The sound source localization unit 113 associates the direction of the sound source estimated for each unit time with the direction of a participant indicated by the setting information stored in the setting information storage unit 131.
  • As long as the sound source localization unit 113 can identify the direction of a sound source based on the voice acquired from the sound collection device 10, any known sound source localization method, such as the Multiple Signal Classification (MUSIC) method or a beamforming method, can be used.
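The patent leaves the localization algorithm open, naming MUSIC and beamforming only as examples. As a rough illustration of the beamforming family (not the MUSIC method, and with all function names and parameter values being assumptions), a delay-and-sum scan over the eight-microphone circular array described above could look like this:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly room temperature

def mic_positions(n_mics=8, radius=0.05):
    """Microphones equally spaced on a circle in the horizontal plane (m)."""
    angles = 2 * np.pi * np.arange(n_mics) / n_mics
    return np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)

def estimate_direction(frame, sample_rate, n_candidates=72):
    """Return the azimuth (degrees) maximizing the delay-and-sum power.

    frame: (n_mics, n_samples) array holding one analysis frame
    (e.g. 10-100 ms, matching the localization interval in the text).
    """
    mics = mic_positions(frame.shape[0])
    spectra = np.fft.rfft(frame, axis=1)
    freqs = np.fft.rfftfreq(frame.shape[1], d=1.0 / sample_rate)
    candidates = np.linspace(0.0, 2 * np.pi, n_candidates, endpoint=False)
    powers = []
    for theta in candidates:
        direction = np.array([np.cos(theta), np.sin(theta)])
        delays = mics @ direction / SPEED_OF_SOUND  # per-microphone delay (s)
        # Phase-align every channel toward the candidate direction and sum
        # (the delay sign convention depends on the chosen geometry).
        aligned = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
        powers.append(float(np.sum(np.abs(aligned.sum(axis=0)) ** 2)))
    return float(np.degrees(candidates[int(np.argmax(powers))]))
```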
  • The analysis unit 114 analyzes the voice based on the voice acquired by the voice acquisition unit 112 and the directions of the sound sources estimated by the sound source localization unit 113 (e).
  • The analysis unit 114 may analyze an entire completed discussion, or may analyze part of a discussion in the case of real-time processing.
  • The analysis unit 114 first determines, for each unit time (for example, every 10 milliseconds to 100 milliseconds) in the discussion to be analyzed, which participant is speaking.
  • The analysis unit 114 identifies a continuous period from the start to the end of one participant's speech as a speech period, and causes the analysis result storage unit 133 to store it. When a plurality of participants speak at the same time, the analysis unit 114 identifies a speech period for each participant.
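As a minimal sketch of how per-frame speaker decisions could be collapsed into speech periods (the data representation is an assumption; the patent does not specify one):

```python
def speech_periods(frame_speakers, frame_sec=0.1):
    """Collapse per-frame speaker sets into (speaker, start_sec, end_sec).

    frame_speakers: list of sets, one per analysis frame, each holding the
    participants judged to be speaking in that frame; overlapping speech
    simply appears as a participant present in several consecutive sets.
    """
    periods, open_starts = [], {}
    for i, active in enumerate(list(frame_speakers) + [set()]):  # sentinel
        t = i * frame_sec
        for speaker in list(open_starts):
            if speaker not in active:  # this speaker's utterance ended
                periods.append((speaker, open_starts.pop(speaker), t))
        for speaker in active:
            open_starts.setdefault(speaker, t)  # utterance started
    return periods
```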
  • Next, the analysis unit 114 calculates the amount of speech of each participant per unit time and causes the analysis result storage unit 133 to store it. Specifically, the analysis unit 114 calculates, as the amount of speech per unit time (the activity level), the value obtained by dividing the length of time during which a participant spoke within a certain time window (for example, 5 seconds) by the length of the time window. The analysis unit 114 then repeats this calculation for each participant while shifting the time window by a predetermined time (for example, 1 second) from the start time of the discussion to the end time (or to the current time in the case of real-time processing).
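A minimal sketch of the windowed activity calculation described above, using the patent's example values of a 5-second window shifted in 1-second steps (the frame representation is carried over from the previous sketch):

```python
def speech_amounts(frame_speakers, participants, frame_sec=0.1,
                   window_sec=5.0, step_sec=1.0):
    """Amount of speech (activity) per participant at each window position:
    speaking time inside the window divided by the window length."""
    per_window = int(window_sec / frame_sec)
    per_step = int(step_sec / frame_sec)
    amounts = {p: [] for p in participants}
    for start in range(0, len(frame_speakers) - per_window + 1, per_step):
        window = frame_speakers[start:start + per_window]
        for p in participants:
            speaking = sum(1 for active in window if p in active)
            amounts[p].append(speaking / per_window)  # value in 0.0 .. 1.0
    return amounts
```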
  • The section setting unit 115 sets one or more sections for the voice corresponding to the discussion to be analyzed, based on input from a user (a participant or the analyst).
  • The sections may be set, for example, for each subject of discussion, such as "Japanese language", "Science", or "Social studies", or for each stage of a discussion, such as "Discussion", "Idea generation", or "Summary".
  • The section setting unit 115 stores section information indicating the sections in the analysis result storage unit 133 in association with the voice for which they are set.
  • The section information includes the name of each section and its times (that is, the start time and end time of the section within the voice).
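As an illustration only, the section information described here could be represented by a structure along these lines (field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Section:
    name: str         # e.g. "discussion", "idea generation", "summary"
    start_sec: float  # start time of the section within the voice
    end_sec: float    # end time of the section within the voice
```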
  • The section setting unit 115 sets a section based on at least one of (1) an operation on the communication terminal 20, (2) an operation on the sound collection device 10, and (3) a predetermined sound acquired by the sound collection device 10.
  • When a section is set based on an operation on the communication terminal 20, the participant or the analyst inputs the character strings and times included in the section information by operating the operation unit 22 (for example, a touch screen, a mouse, or a keyboard) of the communication terminal 20. The participant or the analyst may input the section information after the discussion ends, or during the discussion. The section setting unit 115 then receives the section information specified on the communication terminal 20 via the communication unit 120 and stores it in the analysis result storage unit 133.
  • When a section is set based on an operation on the sound collection device 10, the participant or the analyst sets the section by operating an operation unit, such as a switch or a touch screen, provided on the sound collection device 10 when the sections switch.
  • Each operation of the operation unit of the sound collection device 10 is associated in advance with the switching of a predetermined section (for example, switching from a "discussion" section to an "idea generation" section).
  • The section setting unit 115 receives information indicating the operation from the operation unit of the sound collection device 10 via the communication unit 120, and identifies the switching of the predetermined section at the timing of the operation. The section setting unit 115 then stores the identified section information in the analysis result storage unit 133.
  • When a section is set based on a predetermined sound, the participant or the analyst uses a device capable of emitting sound (for example, a portable terminal or a music player) to emit a predetermined switching sound indicating the switching of sections.
  • The switching sound may be a sound wave audible to humans, or an ultrasonic wave inaudible to humans.
  • The switching sound indicates the switching of sections by, for example, a predefined frequency or on/off pattern.
  • The switching sound may be emitted only at the timing at which the sections switch, or may be emitted continuously during a section.
  • When the switching sound is emitted continuously during a section, the section setting unit 115 detects the switching sound included in the voice acquired by the sound collection device 10, and identifies, at the timing at which the switching sound changes, a switch from the section corresponding to the switching sound before the change to the section corresponding to the switching sound after the change. The section setting unit 115 then stores the identified section information in the analysis result storage unit 133.
  • When the switching sound is emitted only at the switching timing, the section setting unit 115 detects the switching sound included in the voice acquired by the sound collection device 10, and identifies the switching of the predetermined section at the timing at which the switching sound is emitted. The section setting unit 115 then stores the identified section information in the analysis result storage unit 133.
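As one hedged illustration of detecting a switching sound by a predefined frequency (the patent also allows on/off patterns and audible tones; the frequency, tolerance, and power ratio below are invented tuning values):

```python
import numpy as np

def detect_switch_tone(samples, sample_rate, tone_hz=19000.0,
                       tol_hz=100.0, power_ratio=10.0):
    """True if a narrow tone near tone_hz dominates this audio frame.

    19 kHz stands in for a near-ultrasonic switching sound; tone_hz,
    tol_hz, and power_ratio are illustrative tuning values only.
    """
    windowed = samples * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    band = np.abs(freqs - tone_hz) <= tol_hz
    if not band.any():  # tone lies above Nyquist for this sample rate
        return False
    return bool(spectrum[band].max() > power_ratio * spectrum.mean())
```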
  • The output unit 116 controls the display unit 21 to display the analysis results produced by the analysis unit 114 by transmitting display information to the communication terminal 20 (f).
  • The output unit 116 is not limited to displaying on the display unit 21, and may output the analysis results by other methods, such as printing with a printer or recording data to a storage device. The methods by which the output unit 116 outputs the analysis results are described below with reference to FIGS. 5 to 9.
  • The output unit 116 of the voice analysis device 100 reads, from the analysis result storage unit 133, the analysis results produced by the analysis unit 114 and the section information produced by the section setting unit 115 for the discussion to be displayed.
  • The output unit 116 may display a discussion immediately after the analysis unit 114 finishes analyzing it, or may display a discussion specified by the analyst.
  • FIG. 5 is a front view of the display unit 21 of the communication terminal 20 displaying the speech amount screen B.
  • The speech amount screen B is a screen for displaying information indicating the temporal change in the amount of speech for each section, and includes a graph B1 of the amounts of speech, section names B2, and section switching lines B3.
  • The output unit 116 generates display information for displaying the temporal change in each participant's amount of speech for each section, based on the analysis results and the section information read from the analysis result storage unit 133.
  • The graph B1 shows the temporal change in each participant U's amount of speech.
  • The output unit 116 plots the amount of speech (activity) on the vertical axis and time on the horizontal axis, and displays the amount of speech of each participant U at each time indicated by the analysis results on the display unit 21 as a graph. In doing so, the output unit 116 stacks the participants U's amounts of speech at each point in time; that is, it plots the cumulative sums of the participants U's amounts of speech in order on the vertical axis.
  • For example, the value plotted for participant U4 is the sum of the amounts of speech of participants U3 and U4, the value plotted for participant U2 is the sum of the amounts of speech of participants U2, U3, and U4, and the value plotted for participant U1 is the sum of the amounts of speech of participants U1, U2, U3, and U4.
  • The output unit 116 may determine the order in which the participants U's amounts of speech are stacked (summed) randomly, or according to a predetermined rule.
  • In this way, the output unit 116 can display the amount of speech of the group as a whole in addition to the amount of speech of each participant U.
  • The analyst can thus grasp the temporal change in each participant U's contribution and, at the same time, the temporal change in how active the participant U's group as a whole is.
  • The output unit 116 displays the area or line representing each participant U in the graph B1 in a display mode, such as a color or pattern, that differs for each participant.
  • In FIG. 5, the graph B1 is displayed with a different pattern for each participant U, and a legend associating each participant U with a pattern is displayed near the graph B1. This allows the analyst to easily determine which participant U each part of the graph B1 corresponds to.
  • The section name B2 is a character string representing the name of a section.
  • The section switching line B3 is a line indicating the timing at which two sections switch.
  • The output unit 116 displays, for each section indicated by the section information, the section name near the part of the graph B1 in the time range corresponding to that section. The output unit 116 also identifies the switching timing of two sections based on the section times indicated by the section information, and displays the switching line B3 at the position on the time (horizontal) axis of the graph B1 corresponding to the identified switching timing. In this way, the output unit 116 can show which section the graph B1 of each participant U's amount of speech corresponds to at each time.
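A sketch of how such a stacked graph with section names and switching lines could be rendered with matplotlib; the actual screen B in FIG. 5 is of course richer than this:

```python
import matplotlib.pyplot as plt

def plot_speech_amounts(times, amounts, sections):
    """amounts: {participant: [activity per time point]} in stacking order
    (bottom first); sections: [(name, start_sec, end_sec), ...]."""
    fig, ax = plt.subplots()
    ax.stackplot(times, list(amounts.values()), labels=list(amounts.keys()))
    for name, start, end in sections:
        if start > 0:
            ax.axvline(start, linestyle="--")  # section switching line (B3)
        ax.text((start + end) / 2, 1.01, name, ha="center", va="bottom",
                transform=ax.get_xaxis_transform())  # section name (B2)
    ax.set_xlabel("time [s]")
    ax.set_ylabel("amount of speech (activity)")
    ax.legend(loc="upper right")
    return fig
```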
  • As described above, the output unit 116 displays information indicating the sections set in the discussion superimposed on the temporal change in each participant U's amount of speech. The analyst can therefore grasp the temporal change in each participant U's amount of speech for each section.
  • The output unit 116 can also make the temporal change in each participant U's amount of speech easier to read by determining the order in which the participants U's amounts of speech are stacked in the graph B1 based on each participant U's amount of speech.
  • The output unit 116 may switch between the speech amount screen B of FIG. 5 and the speech amount screen B of FIG. 6 according to the analyst's operation, or may display a predetermined one of them.
  • In the speech amount screen B of FIG. 6, the output unit 116 calculates the degree of variation (for example, the variance or standard deviation) in each participant U's amount of speech in each section, based on the analysis results and the section information read from the analysis result storage unit 133. The output unit 116 then generates the graph B1 by stacking the participants U's amounts of speech in ascending order of the degree of variation in each section. The output unit 116 may instead determine the stacking order based on the degree of variation over all sections rather than for each section.
  • By stacking participants with smaller variation lower, the influence that changes in the amounts of speech of the participants U placed below have on the apparent amounts of speech of the participants U placed above can be reduced. Furthermore, since the tendency of each participant U's amount of speech changes from section to section, changing the stacking order for each section makes the temporal change in the amounts of speech easier to read.
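A sketch of the variation-based stacking order, using variance as the degree of variation (standard deviation would order participants identically):

```python
import numpy as np

def stacking_order(amounts):
    """Participants sorted by ascending variance of their activity, so the
    steadiest speakers end up at the bottom of the stacked graph."""
    return sorted(amounts, key=lambda p: np.var(amounts[p]))

def stacking_order_per_section(amounts, sections, times):
    """Per-section variant: a separate stacking order for each section."""
    orders = {}
    for name, start, end in sections:
        idx = [i for i, t in enumerate(times) if start <= t < end]
        orders[name] = sorted(
            amounts, key=lambda p: np.var([amounts[p][i] for i in idx]))
    return orders
```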
  • The output unit 116 may display, on the graph B1, a predetermined event that occurred during the discussion (that is, within the time span of the voice acquired by the voice acquisition unit 112). This allows the analyst to analyze the influence of the occurrence of the event on each participant U's amount of speech.
  • The event is, for example, (1) the approach of an assistant of the discussion (a teacher, a facilitator, etc.) to the group, or (2) a specific remark (word) made by the assistant.
  • The events shown here are examples, and the output unit 116 may display the occurrence of other events that the voice analysis device 100 can recognize.
  • To detect the approach of an assistant to the group, the output unit 116 uses a signal transmitted and received between the sound collection device 10 and the assistant.
  • For example, the assistant carries a transmitter that emits a predetermined signal by radio waves of wireless communication such as Bluetooth (registered trademark), or by ultrasonic waves, and the sound collection device 10 includes a receiver that receives the signal.
  • The output unit 116 determines that the assistant has approached when the receiver of the sound collection device 10 becomes able to receive the signal from the assistant's transmitter, or when the strength at which the signal is received becomes equal to or higher than a predetermined threshold.
  • The output unit 116 determines that the assistant has left when the receiver of the sound collection device 10 becomes unable to receive the signal from the assistant's transmitter, or when the strength at which the signal is received falls below the predetermined threshold.
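The approach/leave decision described here amounts to a threshold test with state on the received signal strength; a minimal sketch (the threshold value and event labels are assumptions):

```python
def assistant_events(rssi_per_sec, threshold=-60.0):
    """Turn a received-signal-strength series (one reading per second,
    None when nothing was received) into approach/leave events.
    The -60 dBm threshold is an illustrative value, not from the patent."""
    events, near = [], False
    for t, rssi in enumerate(rssi_per_sec):
        receivable = rssi is not None and rssi >= threshold
        if receivable and not near:
            events.append((t, "assistant approached"))
        elif not receivable and near:
            events.append((t, "assistant left"))
        near = receivable
    return events
```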
  • Alternatively, the output unit 116 may use the assistant's voiceprint (that is, the frequency spectrum of the assistant's voice) to detect the approach of the assistant to the group.
  • In this case, the output unit 116 registers the assistant's voiceprint in advance and detects it in the voice acquired by the sound collection device 10 during the discussion. The output unit 116 determines that the assistant has approached when it detects the assistant's voiceprint, and that the assistant has left when it can no longer detect the voiceprint.
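As a deliberately crude sketch of voiceprint matching by frequency spectrum (real speaker identification uses far more robust features; this only makes the registered-profile-versus-live-audio flow concrete, and all names and thresholds are assumptions):

```python
import numpy as np

def spectrum_profile(samples, frame=2048):
    """Average log-magnitude spectrum used as a crude 'voiceprint'."""
    frames = [samples[i:i + frame]
              for i in range(0, len(samples) - frame + 1, frame)]
    mean_spec = np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)
    return np.log1p(mean_spec)

def matches_voiceprint(registered, samples, threshold=0.9):
    """Cosine similarity between the registered profile and live audio."""
    current = spectrum_profile(samples)
    sim = float(np.dot(registered, current) /
                (np.linalg.norm(registered) * np.linalg.norm(current)))
    return sim >= threshold
```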
  • To detect a specific remark by the assistant, the output unit 116 performs speech recognition on the assistant's voice.
  • For example, the assistant carries a sound collection device (for example, a pin microphone), and the output unit 116 receives the assistant's voice acquired by that device.
  • By using a sound collection device carried by the assistant separately from the sound collection device 10, the participants U's voices and the assistant's voice can be clearly distinguished.
  • The output unit 116 converts the voice acquired from the sound collection device carried by the assistant into a character string.
  • A known speech recognition method can be used to convert the voice into a character string. The output unit 116 then detects specific words in the converted character string (for example, words related to the progress of the discussion, such as "first", "summary", and "last", or words such as "good" and "bad").
  • The words to be detected are set in the voice analysis device 100 in advance. When a specific word is detected, the output unit 116 determines that the specific word was uttered.
  • The output unit 116 may perform speech recognition only before and after timings at which the change in the participants U's amounts of speech is large. In this case, the output unit 116 calculates the degree of change in the amount of speech per unit time (for example, the amount or rate of change per unit time) based on the analysis results read from the analysis result storage unit 133. The degree of change in the amount of speech may be calculated for each participant U, or as the sum over all participants U.
  • The output unit 116 then performs speech recognition on the voice acquired by the sound collection device carried by the assistant within a predetermined time range including each timing at which the degree of change is equal to or higher than a predetermined threshold (for example, from 5 seconds before to 5 seconds after that timing).
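A sketch of this selective-recognition logic: compute the change in the summed activity per unit time and keep only the time ranges around large changes (the 5-second margin is the patent's example; the threshold value is an assumption):

```python
def recognition_ranges(total_amounts, step_sec=1.0, threshold=0.3,
                       margin_sec=5.0):
    """(start_sec, end_sec) ranges around each sharp change in the activity
    summed over all participants; only these ranges are then passed to
    speech recognition."""
    ranges = []
    for i in range(1, len(total_amounts)):
        change = abs(total_amounts[i] - total_amounts[i - 1]) / step_sec
        if change >= threshold:
            t = i * step_sec
            ranges.append((max(0.0, t - margin_sec), t + margin_sec))
    return ranges
```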
  • FIG. 7 is a front view of the display unit 21 of the communication terminal 20 displaying the speech amount screen B.
  • In FIG. 7, event information B4 is displayed on the graph B1; otherwise the screen is the same as the speech amount screen B of FIG. 5.
  • The output unit 116 may switch between the speech amount screen B of FIG. 5 and the speech amount screen B of FIG. 7 according to the analyst's operation, or may display a predetermined one of them.
  • The event information B4 is information indicating the content and timing of an event.
  • The event information B4 indicates the content of the event by, for example, a character string indicating that the assistant approached or left, or a character string indicating the assistant's remark detected by speech recognition. The event information B4 also indicates the timing of the event by an arrow pointing to the time at which the event occurred on the graph B1.
  • In this way, the output unit 116 displays information indicating the content and timing of an event that occurred during the discussion, superimposed on the temporal change in each participant U's amount of speech. The analyst can therefore analyze how the event that occurred during the discussion influenced the temporal change in each participant U's amount of speech.
  • For example, when the amount of speech increases as the teacher approaches the group, the analyst can conclude that the teacher activated the discussion.
  • Likewise, when the amount of speech increases after a specific word is uttered by the teacher, the analyst can conclude that the word is effective for activating the discussion.
  • The output unit 116 can also extract and display graphs of the amounts of speech of a plurality of discussions for the same section.
  • FIG. 8 is a front view of the display unit 21 of the communication terminal 20 displaying the section extraction screen C.
  • The output unit 116 displays the section extraction screen C for a specified section when, for example, the analyst specifies the name B2 of a section on the speech amount screen B of FIGS. 5 to 7.
  • The section extraction screen C is a screen that displays the result of extracting the graphs of the amounts of speech for the same section, and includes graphs C1 of the amounts of speech, a section name C2, and group names C3.
  • When displaying the section extraction screen C, the output unit 116 extracts the analysis results and section information of a plurality of groups for the specified section from the analysis result storage unit 133.
  • The groups to be displayed may be different groups that held discussions at the same time, or the same or different groups that held discussions in the past. The output unit 116 then generates, based on the extracted analysis results and section information, display information for displaying the temporal change in each participant's amount of speech for the plurality of groups in the specified section.
  • Each graph C1 shows the temporal change in each participant U's amount of speech in the specified section for one of two or more groups.
  • The display mode of the graphs C1 is the same as that of the graph B1.
  • The section name C2 is a character string indicating the name of the specified section.
  • The group name C3 is a name for identifying a displayed group, and may be set by the analyst or determined automatically by the voice analysis device 100.
  • In FIG. 8, the output unit 116 displays the graphs C1 of two groups, but graphs C1 of three or more groups may be displayed. The output unit 116 may also display the names of one or more participants U belonging to a group instead of, or in addition to, the group name C3.
  • In this way, the output unit 116 displays a plurality of graphs showing the temporal change in each participant's amount of speech for different groups in the same section.
  • This allows the analyst to compare and analyze the temporal changes in the amounts of speech of different groups for the same section (for example, the same subject, or the same stage of a discussion). For example, the analyst can grasp each group's tendency in the amount of speech by comparing different groups that held discussions at the same time. The analyst can also grasp changes in the tendency of the same group's amount of speech by comparing a plurality of past discussions of the same section for that group.
  • The output unit 116 is not limited to a stacked graph as illustrated in FIG. 5, and may display a heat map indicating the temporal change in each participant U's amount of speech.
  • FIG. 9 is a front view of the display unit 21 of the communication terminal 20 displaying the speech amount screen D.
  • The speech amount screen D includes a heat map D1 of the amounts of speech, section names D2, and section switching lines D3.
  • The section names D2 and the section switching lines D3 are the same as the section names B2 and the section switching lines B3 in FIG. 5.
  • The heat map D1 represents the amount of speech along the time axis by color.
  • FIG. 9 represents the color differences by the density of dots; for example, the higher the density of the dots, the darker the color, and the lower the density, the lighter the color.
  • The output unit 116 takes time along a predetermined direction (for example, the horizontal direction in FIG. 9) and causes the display unit 21 to display, for each participant U, areas colored according to the amount of speech per unit time.
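A sketch of the heat-map view with matplotlib, one row per participant and color following the activity per unit time, as in FIG. 9:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_heatmap(times, amounts):
    """amounts: {participant: [activity per time point]}; one row per
    participant, time on the horizontal axis, color showing the activity."""
    names = list(amounts)
    grid = np.array([amounts[p] for p in names])
    fig, ax = plt.subplots()
    image = ax.imshow(grid, aspect="auto", origin="lower",
                      extent=(times[0], times[-1], 0, len(names)))
    ax.set_yticks(np.arange(len(names)) + 0.5)
    ax.set_yticklabels(names)
    ax.set_xlabel("time [s]")
    fig.colorbar(image, label="amount of speech (activity)")
    return fig
```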
  • By displaying a heat map instead of a graph in this way, the analyst can likewise grasp the temporal change in each participant U's amount of speech for each section.
  • The output unit 116 may switch between the graph of FIG. 5 and the heat map of FIG. 9 according to the analyst's operation, or may display a predetermined one of them.
  • FIG. 10 is a sequence diagram of the voice analysis method performed by the voice analysis system S according to the present embodiment.
  • First, the communication terminal 20 receives the settings of the analysis conditions from the analyst and transmits them as setting information to the voice analysis device 100 (S11).
  • The setting unit 111 of the voice analysis device 100 acquires the setting information from the communication terminal 20 and causes the setting information storage unit 131 to store it.
  • The voice acquisition unit 112 of the voice analysis device 100 transmits a signal instructing the acquisition of voice to the sound collection device 10 (S12).
  • When the sound collection device 10 receives the signal instructing the acquisition of voice from the voice analysis device 100, it starts recording voice using the plurality of sound collection units and transmits the recorded voices of the plurality of channels to the voice analysis device 100.
  • The voice acquisition unit 112 of the voice analysis device 100 receives the voice from the sound collection device 10 and stores it in the voice storage unit 132.
  • The voice analysis device 100 starts the voice analysis at one of the following timings: when the analyst gives an instruction, when the acquisition of voice ends, or during the acquisition of voice (that is, in real-time processing).
  • The sound source localization unit 113 performs sound source localization based on the voice acquired by the voice acquisition unit 112 (S14).
  • The analysis unit 114 determines, based on the voice acquired by the voice acquisition unit 112 and the directions of the sound sources estimated by the sound source localization unit 113, which participant spoke at each time, and identifies the speech periods and the amounts of speech (S15).
  • The analysis unit 114 causes the analysis result storage unit 133 to store the speech period and the amount of speech of each participant.
  • The section setting unit 115 sets one or more sections for the voice corresponding to the discussion to be analyzed (S16). The section setting unit 115 sets a section based on at least one of an operation on the communication terminal 20, an operation on the sound collection device 10, and a predetermined sound acquired by the sound collection device 10.
  • The section setting unit 115 stores section information indicating the sections in the analysis result storage unit 133 in association with the voice for which they are set.
  • The output unit 116 controls the display unit 21 of the communication terminal 20 to display the analysis results (S17). Specifically, the output unit 116 generates display information for the speech amount screen B, the section extraction screen C, or the speech amount screen D described above, based on the analysis results produced by the analysis unit 114 and the section information produced by the section setting unit 115, and transmits it to the communication terminal 20.
  • The communication terminal 20 causes the display unit 21 to display the analysis results according to the display information received from the voice analysis device 100 (S18).
  • As described above, the voice analysis device 100 according to the present embodiment displays the temporal change in each participant's amount of speech for each section. The analyst can thereby grasp the temporal change in each participant's amount of speech for each section.
  • Furthermore, the voice analysis device 100 automatically analyzes the discussions of a plurality of participants based on the voice acquired using the sound collection device 10, which has a plurality of sound collection units. It is therefore unnecessary to have a recorder observe the discussion, as in the Harkness method described in Non-Patent Document 1, and unnecessary to assign a recorder to each group, which keeps costs low.
  • The processors of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 are the subjects of the steps (processes) included in the voice analysis method shown in FIG. 10. That is, the processors of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 read a program for executing the voice analysis method shown in FIG. 10 from the storage unit and execute it, controlling the respective units of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 to perform the voice analysis method shown in FIG. 10.
  • Some of the steps included in the voice analysis method shown in FIG. 10 may be omitted, the order of the steps may be changed, and a plurality of steps may be performed in parallel.


Abstract

The purpose of the present invention is to provide a voice analysis device, a voice analysis method, a voice analysis program, and a voice analysis system that can output information for analysis based on temporal changes in the amounts of speech of participants in a discussion. A voice analysis device 100 according to one embodiment of the present invention includes: a voice acquisition unit 112 that acquires voices uttered by a plurality of participants; an analysis unit 114 that identifies the amount of speech per unit time of each of the plurality of participants in the voices; a section setting unit 115 that sets sections in the voices based on input from a user; and an output unit 116 that outputs a graph in which the temporal changes in the participants' amounts of speech are stacked on one another, together with information indicating the sections in the graph.

Description

Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
According to the present invention, it is possible to output the change in each participant's amount of speech along the time series of a discussion.
FIG. 1 is a schematic diagram of the voice analysis system according to the present embodiment. FIG. 2 is a block diagram of the voice analysis system according to the present embodiment. FIG. 3 is a schematic diagram of the voice analysis method performed by the voice analysis system according to the present embodiment. FIG. 4 is a front view of the display unit of the communication terminal displaying the setting screen. FIG. 5 is a front view of the display unit of the communication terminal displaying the speech amount screen. FIG. 6 is a front view of the display unit of the communication terminal displaying the speech amount screen. FIG. 7 is a front view of the display unit of the communication terminal displaying the speech amount screen. FIG. 8 is a front view of the display unit of the communication terminal displaying the section extraction screen. FIG. 9 is a front view of the display unit of the communication terminal displaying the speech amount screen. FIG. 10 is a sequence diagram of the voice analysis method performed by the voice analysis system according to the present embodiment.
[Overview of the voice analysis system S]
 FIG. 1 is a schematic view of the voice analysis system S according to the present embodiment. The voice analysis system S includes a voice analysis device 100, a sound collection device 10, and a communication terminal 20. The numbers of sound collection devices 10 and communication terminals 20 included in the voice analysis system S are not limited. The voice analysis system S may also include other devices such as servers and terminals.
 The voice analysis device 100, the sound collection device 10, and the communication terminal 20 are connected via a network N such as a local area network or the Internet. At least some of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 may be directly connected to one another without going through the network N.
 The sound collection device 10 includes a microphone array made up of a plurality of sound collection units (microphones) arranged to face in different directions. For example, the microphone array includes eight microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground. The sound collection device 10 transmits the voice acquired with the microphone array to the voice analysis device 100 as data.
 The communication terminal 20 is a communication device capable of wired or wireless communication, for example a portable terminal such as a smartphone or a computer terminal such as a personal computer. The communication terminal 20 accepts the setting of analysis conditions from an analyst and displays the analysis results produced by the voice analysis device 100.
 The voice analysis device 100 is a computer that analyzes the voice acquired by the sound collection device 10 using the voice analysis method described later, and transmits the results of the analysis to the communication terminal 20.
[Configuration of the voice analysis system S]
 FIG. 2 is a block diagram of the voice analysis system S according to the present embodiment. The arrows in FIG. 2 indicate the main data flows; there may be data flows not shown in FIG. 2. Each block in FIG. 2 represents a functional unit rather than a hardware (device) unit. The blocks shown in FIG. 2 may therefore be implemented within a single device, or implemented separately across a plurality of devices. Data may be exchanged between the blocks via any means, such as a data bus, a network, or a portable storage medium.
 The communication terminal 20 has a display unit 21 for displaying various kinds of information and an operation unit 22 for accepting operations by the analyst. The display unit 21 includes a display device such as a liquid crystal display or an organic light emitting diode (OLED) display. The operation unit 22 includes operation members such as buttons, switches, and dials. The display unit 21 and the operation unit 22 may be integrated by using, as the display unit 21, a touch screen capable of detecting the position touched by the analyst.
 The voice analysis device 100 has a control unit 110, a communication unit 120, and a storage unit 130. The control unit 110 has a setting unit 111, a voice acquisition unit 112, a sound source localization unit 113, an analysis unit 114, a section setting unit 115, and an output unit 116. The storage unit 130 has a setting information storage unit 131, a voice storage unit 132, and an analysis result storage unit 133.
 The communication unit 120 is a communication interface for communicating with the sound collection device 10 and the communication terminal 20 via the network N. The communication unit 120 includes a processor, connectors, electric circuits, and the like for performing communication. The communication unit 120 performs predetermined processing on communication signals received from the outside to acquire data, and inputs the acquired data to the control unit 110. The communication unit 120 also performs predetermined processing on data input from the control unit 110 to generate communication signals, and transmits the generated communication signals to the outside.
 The storage unit 130 is a storage medium including a read-only memory (ROM), a random access memory (RAM), a hard disk drive, and the like. The storage unit 130 stores in advance the programs to be executed by the control unit 110. The storage unit 130 may be provided outside the voice analysis device 100, in which case it may exchange data with the control unit 110 via the communication unit 120.
 The setting information storage unit 131 stores setting information indicating the analysis conditions set by the analyst on the communication terminal 20. The voice storage unit 132 stores the voice acquired by the sound collection device 10. The analysis result storage unit 133 stores analysis results indicating the results of analyzing the voice. The setting information storage unit 131, the voice storage unit 132, and the analysis result storage unit 133 may each be a storage area on the storage unit 130, or a database configured on the storage unit 130.
 The control unit 110 is a processor such as a central processing unit (CPU), and functions as the setting unit 111, the voice acquisition unit 112, the sound source localization unit 113, the analysis unit 114, the section setting unit 115, and the output unit 116 by executing the programs stored in the storage unit 130. The functions of these units are described later with reference to FIGS. 3 to 9. At least some of the functions of the control unit 110 may be implemented by electric circuits, or by programs executed via a network.
 The voice analysis system S according to the present embodiment is not limited to the specific configuration shown in FIG. 2. For example, the voice analysis device 100 is not limited to a single device, and may be configured by connecting two or more physically separate devices by wire or wirelessly.
[Description of the voice analysis method]
 FIG. 3 is a schematic view of the voice analysis method performed by the voice analysis system S according to the present embodiment. First, the analyst sets the analysis conditions by operating the operation unit 22 of the communication terminal 20. The analysis conditions are, for example, information indicating the number of participants in the discussion to be analyzed and the direction in which each participant (that is, each of the plurality of participants) is located relative to the sound collection device 10. The communication terminal 20 accepts the setting of the analysis conditions from the analyst and transmits it to the voice analysis device 100 as setting information (a). The setting unit 111 of the voice analysis device 100 acquires the setting information from the communication terminal 20 and stores it in the setting information storage unit 131.
 FIG. 4 is a front view of the display unit 21 of the communication terminal 20 displaying a setting screen A. The communication terminal 20 displays the setting screen A on the display unit 21 and accepts the setting of the analysis conditions by the analyst. The setting screen A includes a position setting area A1, a start button A2, and an end button A3. The position setting area A1 is an area for setting the direction in which each participant U is actually located relative to the sound collection device 10 in the discussion to be analyzed. For example, as shown in FIG. 4, the position setting area A1 represents a circle centered on the position of the sound collection device 10, with angles relative to the sound collection device 10 marked along the circle.
 The analyst sets the position of each participant U in the position setting area A1 by operating the operation unit 22 of the communication terminal 20. Identification information identifying each participant U (here, U1 to U4) is assigned and displayed near the position set for that participant U. In the example of FIG. 4, four participants U1 to U4 are set. The portion of the position setting area A1 corresponding to each participant U is displayed in a different color for each participant, so that the analyst can easily recognize the direction in which each participant U is set.
 The start button A2 and the end button A3 are virtual buttons displayed on the display unit 21. When the analyst presses the start button A2, the communication terminal 20 transmits a start instruction signal to the voice analysis device 100. When the analyst presses the end button A3, the communication terminal 20 transmits an end instruction signal to the voice analysis device 100. In the present embodiment, the period from the analyst's start instruction to the end instruction is treated as one discussion.
 When the voice acquisition unit 112 of the voice analysis device 100 receives the start instruction signal from the communication terminal 20, it transmits a signal instructing the acquisition of voice to the sound collection device 10 (b). When the sound collection device 10 receives the signal instructing the acquisition of voice from the voice analysis device 100, it starts acquiring voice. Similarly, when the voice acquisition unit 112 of the voice analysis device 100 receives the end instruction signal from the communication terminal 20, it transmits a signal instructing the end of voice acquisition to the sound collection device 10, and the sound collection device 10 ends the acquisition of voice upon receiving that signal.
 The sound collection device 10 acquires voice at each of the plurality of sound collection units and records it internally as the voice of the channel corresponding to each sound collection unit. The sound collection device 10 then transmits the acquired voices of the plurality of channels to the voice analysis device 100 (c). The sound collection device 10 may transmit the acquired voice sequentially, may transmit a predetermined amount or a predetermined duration of voice at a time, or may transmit the voice from the start to the end of the acquisition all at once. The voice acquisition unit 112 of the voice analysis device 100 receives the voice from the sound collection device 10 and stores it in the voice storage unit 132.
 The voice analysis device 100 analyzes the voice acquired from the sound collection device 10 at a predetermined timing. The voice analysis device 100 may analyze the voice when the analyst issues an analysis instruction through a predetermined operation on the communication terminal 20. In that case, the analyst selects the voice corresponding to the discussion to be analyzed from the voices stored in the voice storage unit 132.
 The voice analysis device 100 may also analyze the voice when the acquisition of voice ends, in which case the voice from the start to the end of the acquisition corresponds to the discussion to be analyzed. Alternatively, the voice analysis device 100 may analyze the voice sequentially during its acquisition (that is, by real-time processing), in which case the voice of a predetermined past period (for example, the last 30 seconds) counting back from the current time corresponds to the discussion to be analyzed.
 When analyzing the voice, the sound source localization unit 113 first performs sound source localization based on the voices of the plurality of channels acquired by the voice acquisition unit 112 (d). Sound source localization is a process of estimating the direction of the sound source contained in the voice acquired by the voice acquisition unit 112 for each time slice (for example, every 10 to 100 milliseconds). The sound source localization unit 113 associates the direction of the sound source estimated for each time slice with the direction of a participant indicated by the setting information stored in the setting information storage unit 131.
 The sound source localization unit 113 can use a known sound source localization method, such as the multiple signal classification (MUSIC) method or a beamforming method, as long as the method can identify the direction of the sound source from the voice acquired from the sound collection device 10.
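 By way of illustration, the association of an estimated source direction with a registered participant direction can be sketched as a nearest-angle match. This is a minimal Python sketch, not the claimed implementation; the participant angles and the 30-degree tolerance are assumed values.

```python
# Minimal sketch: map an estimated source direction to the nearest
# registered participant direction. Angles and tolerance are assumptions.
PARTICIPANT_DIRECTIONS = {"U1": 0.0, "U2": 90.0, "U3": 180.0, "U4": 270.0}

def angular_distance(a, b):
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)  # distance around the circle

def nearest_participant(estimated_angle, tolerance=30.0):
    """Return the participant whose set direction is closest to the
    estimated direction, or None if nobody is within the tolerance."""
    name = min(PARTICIPANT_DIRECTIONS,
               key=lambda u: angular_distance(estimated_angle, PARTICIPANT_DIRECTIONS[u]))
    if angular_distance(estimated_angle, PARTICIPANT_DIRECTIONS[name]) <= tolerance:
        return name
    return None

print(nearest_participant(78.0))   # -> 'U2'
print(nearest_participant(135.0))  # -> None (midway between U2 and U3)
```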
 Next, the analysis unit 114 analyzes the voice based on the voice acquired by the voice acquisition unit 112 and the direction of the sound source estimated by the sound source localization unit 113 (e). The analysis unit 114 may analyze a completed discussion in its entirety, or, in the case of real-time processing, may analyze part of an ongoing discussion.
 Specifically, the analysis unit 114 first determines, for each time slice (for example, every 10 to 100 milliseconds) of the discussion under analysis, which participant spoke (uttered), based on the voice acquired by the voice acquisition unit 112 and the direction of the sound source estimated by the sound source localization unit 113. The analysis unit 114 identifies a continuous period from the moment one participant starts speaking until that participant stops as a speech period, and stores it in the analysis result storage unit 133. When a plurality of participants speak at the same time, the analysis unit 114 identifies a speech period for each participant.
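 As a rough sketch of how per-slice decisions become speech periods, the following groups consecutive time slices attributed to the same participant into (start, end) intervals; overlapping speech simply yields overlapping periods. The 100-millisecond slice length and the input format are assumptions for illustration.

```python
def speech_periods(slice_speakers, slice_sec=0.1):
    """slice_speakers: per-slice sets of active speakers, e.g.
    [{'U1'}, {'U1'}, set(), {'U1', 'U2'}, ...].
    Returns {participant: [(start_sec, end_sec), ...]} with one entry per
    continuous run of slices in which that participant was speaking."""
    periods, open_start = {}, {}
    for i, speakers in enumerate(slice_speakers + [set()]):  # sentinel closes runs
        t = i * slice_sec
        for u in speakers:
            open_start.setdefault(u, t)        # run starts at first active slice
        for u in list(open_start):
            if u not in speakers:              # run ended on the previous slice
                periods.setdefault(u, []).append((open_start.pop(u), t))
    return periods

print(speech_periods([{'U1'}, {'U1'}, set(), {'U1', 'U2'}, {'U2'}]))
# -> {'U1': [(0.0, 0.2), (0.3, 0.4)], 'U2': [(0.3, 0.5)]}
```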
 The analysis unit 114 also calculates the amount of speech of each participant for each time and stores it in the analysis result storage unit 133. Specifically, the analysis unit 114 calculates, as the amount of speech for each time (also called the activity level), the length of time a participant spoke within a certain time window (for example, 5 seconds) divided by the length of the time window. The analysis unit 114 then repeats this calculation for each participant while sliding the time window by a predetermined step (for example, 1 second) from the start time of the discussion to its end time (or to the present, in the case of real-time processing).
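 A minimal sketch of this windowed calculation, assuming a boolean speaking flag per 100-millisecond slice and the example values of a 5-second window slid in 1-second steps:

```python
def activity_series(speaking, window_sec=5, step_sec=1, slice_sec=0.1):
    """speaking: list of booleans, one per time slice, True while the
    participant is speaking. Returns one activity value per window position:
    speaking time within the window divided by the window length."""
    per_window = int(window_sec / slice_sec)   # slices per window
    per_step = int(step_sec / slice_sec)       # slices per step
    values = []
    for start in range(0, len(speaking) - per_window + 1, per_step):
        window = speaking[start:start + per_window]
        values.append(sum(window) / per_window)  # fraction of window spent talking
    return values

# 10 s of data at 100 ms slices: talking for the first 3 s only.
speaking = [True] * 30 + [False] * 70
print(activity_series(speaking)[:4])  # -> [0.6, 0.4, 0.2, 0.0]
```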
 The section setting unit 115 sets one or more sections for the voice corresponding to the discussion under analysis, based on input from a user (a participant or the analyst). Sections may be set, for example, per subject under discussion, such as "Japanese", "science", or "social studies", or per stage of the discussion, such as "discussion", "idea generation", or "summary". The section setting unit 115 stores section information indicating the sections in the analysis result storage unit 133 in association with the voice being set.
 The section information includes the name of the section and the times of the section (that is, the start time and end time of the section within the voice). The section setting unit 115 sets sections based on at least one of (1) an operation on the communication terminal 20, (2) an operation on the sound collection device 10, and (3) a predetermined sound acquired by the sound collection device 10.
 When setting a section based on an operation on the communication terminal 20, the participant or the analyst inputs the character string and the times included in the section information by operating the operation unit 22 of the communication terminal 20 (for example, a touch screen, mouse, or keyboard). The participant or the analyst may input the section information after the discussion ends, or during the discussion. The section setting unit 115 then receives the section information specified on the communication terminal 20 via the communication unit 120 and stores it in the analysis result storage unit 133.
 When setting a section based on an operation on the sound collection device 10, the participant or the analyst operates an operation unit such as a switch or touch screen provided on the sound collection device 10 at the moment the section changes. Each operation of the operation unit of the sound collection device 10 is associated in advance with a predetermined section switch (for example, the switch from the "discussion" section to the "idea generation" section). The section setting unit 115 receives information indicating the operation from the operation unit of the sound collection device 10 via the communication unit 120, identifies the predetermined section switch at the timing of the operation, and stores the identified section information in the analysis result storage unit 133.
 When setting a section based on a predetermined sound acquired by the sound collection device 10, the participant or the analyst uses a device capable of producing sound (for example, a portable terminal or a music player) to emit a predetermined switching sound indicating the change of section. The switching sound may be an audible sound wave or an ultrasonic wave inaudible to humans. The switching sound indicates the change of section by, for example, a predefined frequency or an on/off pattern. The switching sound may be emitted only at the moment the section changes, or continuously throughout a section.
 A different sound can be used as the switching sound for each section. In this case, the section setting unit 115 detects the switching sound contained in the voice acquired by the sound collection device 10, identifies, at the timing when the switching sound changes, the switch from the section corresponding to the sound before the change to the section corresponding to the sound after the change, and stores the identified section information in the analysis result storage unit 133.
 Alternatively, a sound indicating a predetermined section switch (for example, the switch from the "discussion" section to the "idea generation" section) can be used as the switching sound. In this case, the section setting unit 115 detects the switching sound contained in the voice acquired by the sound collection device 10, identifies the predetermined section switch at the timing when the switching sound is emitted, and stores the identified section information in the analysis result storage unit 133.
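 One plausible way to detect a fixed-frequency switching sound is a Goertzel filter, which measures the signal power at a single frequency cheaply enough to run continuously. This is a sketch under assumptions: the 18 kHz tone, the 48 kHz sampling rate, and the detection threshold are illustrative values, not values from this description.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Relative power of one frequency in a block of samples (Goertzel algorithm)."""
    k = round(len(samples) * target_hz / sample_rate)
    w = 2.0 * math.pi * k / len(samples)
    coeff = 2.0 * math.cos(w)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def tone_present(samples, sample_rate=48000, target_hz=18000, threshold=1e4):
    # Hypothetical threshold; in practice it would be calibrated against
    # the background level of the room.
    return goertzel_power(samples, sample_rate, target_hz) > threshold

# 10 ms block containing a full-scale 18 kHz tone.
block = [math.sin(2 * math.pi * 18000 * n / 48000) for n in range(480)]
print(tone_present(block))  # -> True
```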
 The output unit 116 performs control to display the analysis result produced by the analysis unit 114 on the display unit 21 by transmitting display information to the communication terminal 20 (f). The output unit 116 is not limited to display on the display unit 21, and may output the analysis result by other methods, such as printing on a printer or recording data in a storage device. The methods by which the output unit 116 outputs the analysis result are described below with reference to FIGS. 5 to 9.
[Displaying the amount of speech per section]
 When displaying an analysis result, the output unit 116 of the voice analysis device 100 reads, from the analysis result storage unit 133, the analysis result produced by the analysis unit 114 and the section information produced by the section setting unit 115 for the discussion to be displayed. The output unit 116 may display the discussion immediately after its analysis by the analysis unit 114 is completed, or may display a discussion designated by the analyst.
 FIG. 5 is a front view of the display unit 21 of the communication terminal 20 displaying a speech amount screen B. The speech amount screen B is a screen displaying information indicating the temporal change in the amount of speech for each section, and includes a speech amount graph B1, section names B2, and section switching lines B3.
 When displaying the speech amount screen B, the output unit 116 generates display information for displaying the temporal change in each participant's amount of speech for each section, based on the analysis result and the section information read from the analysis result storage unit 133.
 The graph B1 shows the temporal change in the amount of speech of each participant U. The output unit 116 causes the display unit 21 to display the amount of speech for each time indicated by the analysis result for each participant U as a line graph, with the amount of speech (activity level) on the vertical axis and time on the horizontal axis. In doing so, the output unit 116 stacks the amounts of speech of the participants U on one another at each point in time; that is, it plots the successive running totals of the participants' amounts of speech on the vertical axis.
 In the example of FIG. 5, the value plotted for participant U4 is the sum of the amounts of speech of participants U3 and U4, the value plotted for participant U2 is the sum of the amounts of speech of participants U2, U3, and U4, and the value plotted for participant U1 is the sum of the amounts of speech of participants U1, U2, U3, and U4. The output unit 116 may determine the order in which the participants' amounts of speech are stacked (summed) at random, or according to a predetermined rule.
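 The plotting itself amounts to drawing running totals of the per-time activity series. A minimal matplotlib sketch with made-up activity values (stackplot draws each series on top of the previous ones, which matches the stacking described above):

```python
import matplotlib.pyplot as plt

times = list(range(6))  # window positions, 1 s apart
activity = {            # made-up per-participant activity series
    "U3": [0.2, 0.3, 0.1, 0.0, 0.2, 0.4],
    "U4": [0.1, 0.0, 0.2, 0.3, 0.1, 0.0],
    "U2": [0.4, 0.2, 0.3, 0.1, 0.0, 0.2],
    "U1": [0.0, 0.1, 0.2, 0.4, 0.3, 0.1],
}

# The insertion order of the dict is the stacking order, bottom first.
plt.stackplot(times, activity.values(), labels=activity.keys())
plt.xlabel("time (s)")
plt.ylabel("activity (stacked)")
plt.legend(loc="upper right")
plt.show()
```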
 By stacking in this way, the output unit 116 can display the amount of speech of the discussion group as a whole in addition to the amount of speech of each participant U. The analyst can grasp the temporal change in each participant U's contribution and, at the same time, the temporal change in how animated the group of participants U is as a whole.
 The output unit 116 displays the area or line of the graph B1 for each participant U in a display mode, such as a color or pattern, that differs for each participant. In the example of FIG. 5, the graph B1 is displayed with a different pattern for each participant U, and a legend associating the participants U with the patterns is displayed near the graph B1. This allows the analyst to easily determine which participant U each part of the graph B1 corresponds to.
 The section name B2 is a character string representing the name of a section. The section switching line B3 is a line indicating the timing of the switch between two sections. For each section indicated by the section information, the output unit 116 displays the section's name near the portion of the graph B1 covering the time range corresponding to that section. The output unit 116 also identifies the timing of the switch between two sections based on the section times indicated by the section information, and displays a switching line B3 at the position on the time (horizontal) axis of the graph B1 corresponding to the identified switching timing. In this way, the output unit 116 can display which section the graph B1 of each participant U's amount of speech corresponds to at each time.
 The output unit 116 thus displays information indicating the sections set in the discussion superimposed on the temporal change in each participant U's amount of speech, so the analyst can grasp the temporal change in each participant U's amount of speech for each section.
 Because the graph B1 displays the participants' amounts of speech stacked (summed), when the amount of speech of a participant U placed lower in the stack changes, the amounts of speech of the participants U placed above appear to change as well. As a result, the temporal change in each participant U's amount of speech can be hard to read at a glance. The output unit 116 can therefore make the temporal change in each participant U's amount of speech easier to read by determining the order in which the amounts of speech are stacked in the graph B1 based on each participant U's amount of speech.
 FIG. 6 is a front view of the display unit 21 of the communication terminal 20 displaying the speech amount screen B. In the speech amount screen B of FIG. 6, the order in which the amounts of speech are stacked is changed for each section; it is otherwise the same as the speech amount screen B of FIG. 5. The output unit 116 may switch between the speech amount screens B of FIG. 5 and FIG. 6 according to the analyst's operation, or may display at least one predetermined screen of the two.
 When changing the stacking order, the output unit 116 calculates the degree of variation (for example, the variance or standard deviation) of each participant U's amount of speech in each section, based on the analysis result and the section information read from the analysis result storage unit 133. The output unit 116 then generates the graph B1 by stacking the participants U's amounts of speech in ascending order of their degree of variation in each section. The output unit 116 may determine the stacking order based on the degree of variation over all sections instead of per section.
 Stacking from the bottom of the graph B1 in ascending order of variation in the amount of speech in this way reduces the influence that changes in the amount of speech of the participants U placed lower in the stack have on the apparent amounts of speech of the participants U placed above. Moreover, since the tendency of each participant U's amount of speech changes from section to section, changing the stacking order for each section makes the temporal change in the amounts of speech easier to read.
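 The ordering rule can be sketched as follows: for each section, compute the standard deviation of each participant's activity over that section and stack in ascending order. The data layout (per-participant activity lists plus a window-index range per section) is an assumption for illustration.

```python
from statistics import pstdev

def stacking_order(activity, section_range):
    """activity: {participant: [activity per window position]};
    section_range: (first_index, last_index) of the section's windows.
    Returns participants in ascending order of standard deviation, so the
    steadiest speaker ends up at the bottom of the stack."""
    lo, hi = section_range
    return sorted(activity, key=lambda u: pstdev(activity[u][lo:hi]))

activity = {
    "U1": [0.0, 0.1, 0.6, 0.7],
    "U2": [0.3, 0.3, 0.3, 0.3],
}
print(stacking_order(activity, (0, 4)))  # -> ['U2', 'U1'] (U2 varies least)
```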
[Displaying events]
 The output unit 116 may display, on the graph B1, predetermined events that occurred during the discussion (that is, within the time span of the voice acquired by the voice acquisition unit 112). This allows the analyst to analyze the influence that the occurrence of an event had on each participant U's amount of speech. An event is, for example, (1) a discussion assistant (a teacher, facilitator, or the like) approaching the group, or (2) a specific remark (word) by the assistant. These events are examples, and the output unit 116 may display the occurrence of any other event that the voice analysis device 100 can recognize.
 To detect the assistant's approach to the group, the output unit 116 uses signals exchanged between the sound collection device 10 and the assistant. In this case, the assistant carries a transmitter that emits a predetermined signal, for example by radio waves of wireless communication such as Bluetooth (registered trademark) or by ultrasonic waves, and the sound collection device 10 includes a receiver that receives the signal. The output unit 116 determines that the assistant has approached when the receiver of the sound collection device 10 becomes able to receive the signal from the assistant's transmitter, or when the received signal strength becomes equal to or greater than a predetermined threshold. Conversely, the output unit 116 determines that the assistant has left when the receiver of the sound collection device 10 can no longer receive the signal from the assistant's transmitter, or when the received signal strength falls below the predetermined threshold.
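 The decision logic is a simple threshold test on the received signal strength. Below is a sketch over a hypothetical stream of RSSI readings; the threshold and the hysteresis margin are assumptions (some hysteresis is usually wanted so a signal hovering near the threshold does not toggle the state on every reading).

```python
def track_presence(rssi_readings, threshold=-60.0, hysteresis=5.0):
    """Yield ('approached' | 'left', index) events from RSSI values in dBm.
    None stands for 'no signal received'."""
    present = False
    for i, rssi in enumerate(rssi_readings):
        if not present and rssi is not None and rssi >= threshold:
            present = True
            yield ("approached", i)
        elif present and (rssi is None or rssi < threshold - hysteresis):
            present = False
            yield ("left", i)

readings = [None, -75.0, -58.0, -62.0, -70.0, None]
print(list(track_presence(readings)))  # -> [('approached', 2), ('left', 4)]
```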
 Alternatively, to detect the assistant's approach to the group, the output unit 116 may use the assistant's voiceprint (that is, the frequency spectrum of the assistant's voice). In this case, the output unit 116 registers the assistant's voiceprint in advance and detects the assistant's voiceprint in the voice acquired by the sound collection device 10 during the discussion. The output unit 116 determines that the assistant has approached when it detects the assistant's voiceprint, and determines that the assistant has left when it can no longer detect it.
 To detect specific words of the assistant, the output unit 116 performs speech recognition on the assistant's voice. In this case, the assistant carries a sound collection device (for example, a pin microphone), and the output unit 116 receives the assistant's voice acquired by the sound collection device carried by the assistant. Using a sound collection device carried by the assistant, separate from the sound collection device 10, makes it possible to clearly distinguish the voices of the participants U from the voice of the assistant.
 The output unit 116 converts the voice acquired from the sound collection device carried by the assistant into a character string, using a known speech recognition method. The output unit 116 then detects specific words in the converted character string (for example, words related to the progress of the discussion, such as "first", "summary", and "last", or words such as "good" and "bad"). The words to be detected are set in the voice analysis device 100 in advance. When the output unit 116 detects a specific word, it determines that the specific word was uttered.
 The output unit 116 may perform speech recognition only around the timings at which the change in each participant U's amount of speech is large. In this case, the output unit 116 calculates the degree of change in the amount of speech over time (for example, the amount or rate of change per unit time) based on the analysis result read from the analysis result storage unit 133. The degree of change in the amount of speech may be calculated for each participant U, or as the total over all participants U.
 The output unit 116 then performs speech recognition on the voice acquired by the sound collection device carried by the assistant within a predetermined time range including each timing at which the degree of change is equal to or greater than a predetermined threshold (for example, from 5 seconds before to 5 seconds after that timing). Speech recognition is generally computationally expensive, so performing it only around the timings at which the amount of speech changes substantially makes it possible to analyze the words that caused the change while reducing the processing load.
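 Selecting the ranges to recognize can be sketched as follows: compute the step-to-step change in the group's total activity and emit a window of ±5 seconds around every step where the change reaches a threshold, merging overlaps. The 5-second margin follows the example in the text; the 0.3 threshold and the input format are assumptions.

```python
def recognition_windows(total_activity, step_sec=1.0, threshold=0.3, margin_sec=5.0):
    """total_activity: group activity per window position, step_sec apart.
    Returns merged (start_sec, end_sec) ranges around large changes; speech
    recognition would then be restricted to these ranges."""
    windows = []
    for i in range(1, len(total_activity)):
        if abs(total_activity[i] - total_activity[i - 1]) >= threshold:
            t = i * step_sec
            lo, hi = max(0.0, t - margin_sec), t + margin_sec
            if windows and lo <= windows[-1][1]:  # merge overlapping ranges
                windows[-1] = (windows[-1][0], hi)
            else:
                windows.append((lo, hi))
    return windows

series = [0.1, 0.1, 0.6, 0.6, 0.6, 0.1, 0.1]
print(recognition_windows(series))  # -> [(0.0, 10.0)]
```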
 The output unit 116 then generates display information in which information indicating the events detected by the above methods is associated with times in the voice. FIG. 7 is a front view of the display unit 21 of the communication terminal 20 displaying the speech amount screen B. In the speech amount screen B of FIG. 7, event information B4 is displayed on the graph B1; it is otherwise the same as the speech amount screen B of FIG. 5. The output unit 116 may switch between the speech amount screens B of FIG. 5 and FIG. 7 according to the analyst's operation, or may display at least one predetermined screen of the two.
 The event information B4 is information indicating the content and timing of an event. The event information B4 indicates the content of an event by, for example, a character string indicating that the assistant approached or left, or a character string representing an utterance of the assistant detected by speech recognition. The event information B4 indicates the timing of an event by an arrow pointing to the timing at which the event occurred on the graph B1.
 The output unit 116 thus displays information indicating the content and timing of the events that occurred in the discussion superimposed on the temporal change in each participant U's amount of speech. The analyst can therefore analyze how events that occurred during the discussion affected the temporal change in each participant U's amount of speech. For example, if the amount of speech increased when a teacher approached the group, the analyst can judge that the teacher succeeded in stimulating the discussion; if the amount of speech increased when the teacher uttered a specific word, the analyst can judge that the word is effective for stimulating the discussion.
[Displaying the amounts of speech of the same section]
 The output unit 116 can extract and display graphs of the amounts of speech of a plurality of groups over the same section. FIG. 8 is a front view of the display unit 21 of the communication terminal 20 displaying a section extraction screen C. The output unit 116 displays the section extraction screen C for a designated section when, for example, the analyst designates the name B2 of one of the sections on the speech amount screen B of FIGS. 5 to 7. The section extraction screen C is a screen displaying the result of extracting the speech amount graphs of the same section, and includes speech amount graphs C1, a section name C2, and group names C3.
 When displaying the section extraction screen C, the output unit 116 extracts the analysis results and the section information of a plurality of groups for the designated section from the analysis result storage unit 133. The groups to be displayed may be different groups that held discussions at the same time, or the same or different groups that held discussions in the past. Based on the extracted analysis results and section information, the output unit 116 generates display information for displaying the temporal change in each participant's amount of speech for the plurality of groups in the designated section.
 The speech amount graph C1 is a graph showing, for each of two or more groups, the temporal change in each participant U's amount of speech in the designated section. The display mode of the graph C1 is the same as that of the graph B1. The section name C2 is a character string indicating the name of the designated section.
 The group name C3 is a name for identifying a group to be displayed, and may be set by the analyst or determined automatically by the voice analysis device 100. In the example of FIG. 8, the output unit 116 displays the graphs C1 of two groups, but it may display the graphs C1 of three or more groups. The output unit 116 may also display the names of one or more participants U belonging to a group instead of, or in addition to, the group name C3.
 The output unit 116 thus displays, for the same section, a plurality of graphs showing the temporal change in each participant's amount of speech in different groups. This allows the analyst to compare and analyze the temporal changes in the amounts of speech of different groups for the same section (for example, the same subject, or the same stage of a discussion). For example, by comparing different groups that held discussions at the same time, the analyst can grasp each group's tendency in amount of speech; by comparing a plurality of past discussions of the same section by the same group, the analyst can grasp how that group's tendency in amount of speech has changed.
[Displaying a heat map of the amount of speech]
 The output unit 116 is not limited to stacked graphs such as that of FIG. 5, and may display a heat map showing the temporal change in each participant U's amount of speech. FIG. 9 is a front view of the display unit 21 of the communication terminal 20 displaying a speech amount screen D. The speech amount screen D includes a speech amount heat map D1, section names D2, and section switching lines D3. The section names D2 and the section switching lines D3 are the same as the section names B2 and the section switching lines B3 in FIG. 5.
 The speech amount heat map D1 displays the amount of speech along time by color. FIG. 9 represents the differences in color by the density of dots; for example, the higher the dot density, the darker the color, and the lower the dot density, the lighter the color. The output unit 116 takes time in a predetermined direction (for example, the horizontal direction in FIG. 9) and causes the display unit 21 to display, for each participant U, areas colored according to the amount of speech at each time.
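 Such a heat map can be sketched with matplotlib's imshow, one row per participant and one column per window position (the activity values are again made-up sample data):

```python
import matplotlib.pyplot as plt

participants = ["U1", "U2", "U3", "U4"]
activity = [  # rows: participants; columns: window positions (made-up data)
    [0.0, 0.1, 0.2, 0.4, 0.3, 0.1],
    [0.4, 0.2, 0.3, 0.1, 0.0, 0.2],
    [0.2, 0.3, 0.1, 0.0, 0.2, 0.4],
    [0.1, 0.0, 0.2, 0.3, 0.1, 0.0],
]

fig, ax = plt.subplots()
# Darker cells mean more speech, mirroring the dot-density rendering of FIG. 9.
im = ax.imshow(activity, aspect="auto", cmap="Greys", vmin=0.0, vmax=1.0)
ax.set_yticks(range(len(participants)))
ax.set_yticklabels(participants)
ax.set_xlabel("time (s)")
fig.colorbar(im, label="activity")
plt.show()
```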
 In this way, displaying a heat map instead of a graph also allows the analyst to grasp the temporal change in each participant U's amount of speech for each section. The output unit 116 may switch between the graph of FIG. 5 and the heat map of FIG. 9 according to the analyst's operation, or may display at least one predetermined form of the two.
[Sequence of the voice analysis method]
 FIG. 10 is a sequence diagram of the voice analysis method performed by the voice analysis system S according to the present embodiment. First, the communication terminal 20 accepts the setting of the analysis conditions from the analyst and transmits it to the voice analysis device 100 as setting information (S11). The setting unit 111 of the voice analysis device 100 acquires the setting information from the communication terminal 20 and stores it in the setting information storage unit 131.
 Next, the voice acquisition unit 112 of the voice analysis device 100 transmits a signal instructing the acquisition of voice to the sound collection device 10 (S12). When the sound collection device 10 receives the signal instructing the acquisition of voice from the voice analysis device 100, it starts recording voice using the plurality of sound collection units and transmits the recorded voices of the plurality of channels to the voice analysis device 100 (S13). The voice acquisition unit 112 of the voice analysis device 100 receives the voice from the sound collection device 10 and stores it in the voice storage unit 132.
 The voice analysis device 100 starts analyzing the voice at one of the following timings: when instructed by the analyst, when the acquisition of voice ends, or during the acquisition of voice (that is, real-time processing). When analyzing the voice, the sound source localization unit 113 first performs sound source localization based on the voice acquired by the voice acquisition unit 112 (S14).
 Next, the analysis unit 114 identifies the speech periods and the amount of speech for each participant by determining which participant spoke at each time, based on the voice acquired by the voice acquisition unit 112 and the direction of the sound source estimated by the sound source localization unit 113 (S15). The analysis unit 114 stores the speech periods and amounts of speech for each participant in the analysis result storage unit 133.
 The section setting unit 115 sets one or more sections for the voice corresponding to the discussion under analysis (S16). At this point, the section setting unit 115 sets the sections based on at least one of an operation on the communication terminal 20, an operation on the sound collection device 10, and a predetermined sound acquired by the sound collection device 10. The section setting unit 115 stores section information indicating the sections in the analysis result storage unit 133 in association with the voice being set.
 The output unit 116 performs control to display the analysis result on the display unit 21 of the communication terminal 20 (S17). Specifically, based on the analysis result produced by the analysis unit 114 and the section information produced by the section setting unit 115, the output unit 116 generates display information for displaying the speech amount screen B, the section extraction screen C, or the speech amount screen D described above, and transmits it to the communication terminal 20.
 The communication terminal 20 causes the display unit 21 to display the analysis result in accordance with the display information received from the voice analysis device 100 (S18).
[Effects of the present embodiment]
 Because the Harkness method shows the tendency of utterances over the entire period from the start to the end of a discussion, it cannot show the change in each participant's amount of speech along the timeline of the discussion, which makes analysis based on the temporal change in each participant's amount of speech difficult. In contrast, the voice analysis device 100 according to the present embodiment displays the temporal change in each participant's amount of speech for each section, allowing the analyst to grasp the temporal change in each participant's amount of speech for each section.
 Furthermore, the voice analysis device 100 automatically analyzes the discussion of a plurality of participants based on the voice acquired using the sound collection device 10, which has a plurality of sound collection units. Unlike the Harkness method described in Non-Patent Document 1, no recorder needs to monitor the discussion, and no recorder needs to be assigned to each group, which keeps costs low.
 Although the present invention has been described above using an embodiment, the technical scope of the present invention is not limited to the scope described in the above embodiment, and various modifications and changes are possible within the scope of its gist. For example, specific embodiments of the distribution and integration of devices are not limited to the above embodiment, and all or part of them may be functionally or physically distributed or integrated in arbitrary units. New embodiments arising from any combination of a plurality of embodiments are also included in the embodiments of the present invention, and the effects of a new embodiment arising from such a combination include the effects of the original embodiments.
 The processors of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 are the agents of the steps (processes) included in the voice analysis method shown in FIG. 10. That is, the processors of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 read a program for executing the voice analysis method shown in FIG. 10 from their storage units and execute the program to control the units of the voice analysis device 100, the sound collection device 10, and the communication terminal 20, thereby performing the voice analysis method shown in FIG. 10. Some of the steps included in the voice analysis method shown in FIG. 10 may be omitted, the order of the steps may be changed, and a plurality of steps may be performed in parallel.
S voice analysis system
100 voice analysis device
110 control unit
112 voice acquisition unit
114 analysis unit
115 section setting unit
116 output unit
10 sound collection device
20 communication terminal
21 display unit

Claims (11)

  1.  A voice analysis device comprising:
      an acquisition unit that acquires voice uttered by a plurality of participants;
      an analysis unit that specifies, in the voice, the amount of speech of each of the plurality of participants for each time;
      a section setting unit that sets a section in the voice based on an input from a user; and
      an output unit that outputs a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, and information indicating the section in the graph.
  2.  The voice analysis device according to claim 1, wherein the output unit outputs, as the information indicating the section, a position on the graph corresponding to the time at which one of two sections switched to the other.
  3.  The voice analysis device according to claim 1 or 2, wherein the section setting unit sets the section based on at least one of an operation on a communication terminal that communicates with the voice analysis device, an operation on a sound collection device that acquires the voice, and a predetermined sound included in the voice.
  4.  The voice analysis device according to any one of claims 1 to 3, wherein the output unit outputs the graph in which the temporal changes in the amounts of speech are stacked on one another in ascending order of the degree of variation in the amount of speech calculated for each of the plurality of participants.
  5.  The voice analysis device according to claim 4, wherein the output unit outputs the graph in which, for each section, the temporal changes in the amounts of speech are stacked on one another in ascending order of the degree of variation in the amount of speech in that section calculated for each of the plurality of participants.
  6.  The voice analysis device according to any one of claims 1 to 5, wherein the output unit outputs a plurality of the graphs for the same section set in a plurality of the voices.
  7.  The voice analysis device according to any one of claims 1 to 6, wherein, in addition to the graph and the information indicating the section, the output unit outputs, on the graph, information indicating an event that occurred within the time of the voice.
  8.  The voice analysis device according to any one of claims 1 to 7, wherein the analysis unit specifies, as the amount of speech, a value obtained by dividing the length of time a participant spoke within a predetermined time window by the length of the time window.
  9.  A voice analysis method in which a processor executes the steps of:
      acquiring voice uttered by a plurality of participants;
      specifying, in the voice, the amount of speech of each of the plurality of participants for each time;
      setting a section in the voice based on an input from a user; and
      outputting a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, and information indicating the section in the graph.
  10.  A voice analysis program that causes a computer to execute the steps of:
      acquiring voice uttered by a plurality of participants;
      specifying, in the voice, the amount of speech of each of the plurality of participants for each time;
      setting a section in the voice based on an input from a user; and
      outputting a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, and information indicating the section in the graph.
  11.  A voice analysis system comprising a voice analysis device and a communication terminal capable of communicating with the voice analysis device, wherein
      the communication terminal has a display unit that displays information, and
      the voice analysis device has:
      an acquisition unit that acquires voice uttered by a plurality of participants;
      an analysis unit that specifies, in the voice, the amount of speech of each of the plurality of participants for each time;
      a section setting unit that sets a section in the voice based on an input from a user; and
      an output unit that causes the display unit to display a graph in which the temporal changes in the amounts of speech of the plurality of participants are stacked on one another, and information indicating the section in the graph.
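To make the stacking order of claims 4 and 5 concrete, the following is a minimal sketch, with hypothetical speech-amount series, that sorts participants by the population variance of their series in ascending order, so the steadiest speakers sit at the bottom of the stack. The variance is one plausible measure of the "degree of variation"; the claims do not fix a specific statistic:

```python
# Sketch of the stacking order of claims 4-5: participants whose speech
# amount varies least are stacked first (hypothetical series).
from statistics import pvariance

series = {
    "A": [0.4, 0.5, 0.3, 0.4],  # fairly steady speaker
    "B": [0.0, 0.8, 0.1, 0.7],  # bursty speaker
    "C": [0.2, 0.2, 0.3, 0.2],  # steadiest speaker
}

order = sorted(series, key=lambda p: pvariance(series[p]))
print(order)  # ['C', 'A', 'B'] -- ascending degree of variation
```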
PCT/JP2018/000942 2018-01-16 2018-01-16 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system WO2019142231A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2018/000942 WO2019142231A1 (en) 2018-01-16 2018-01-16 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
JP2018502279A JP6589040B1 (en) 2018-01-16 2018-01-16 Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2018/000942 WO2019142231A1 (en) 2018-01-16 2018-01-16 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system

Publications (1)

Publication Number Publication Date
WO2019142231A1 true WO2019142231A1 (en) 2019-07-25

Family

ID=67300990

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/000942 WO2019142231A1 (en) 2018-01-16 2018-01-16 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system

Country Status (2)

Country Link
JP (1) JP6589040B1 (en)
WO (1) WO2019142231A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008139654A (en) * 2006-12-04 2008-06-19 Nec Corp Method of estimating interaction, separation, and method, system and program for estimating interaction
JP2015028625A (en) * 2013-06-28 2015-02-12 キヤノンマーケティングジャパン株式会社 Information processing apparatus, control method of information processing apparatus, and program
JP2016206355A (en) * 2015-04-20 2016-12-08 本田技研工業株式会社 Conversation analysis device, conversation analysis method, and program
JP2017033443A (en) * 2015-08-05 2017-02-09 日本電気株式会社 Data processing device, data processing method, and program
JP2017161731A (en) * 2016-03-09 2017-09-14 本田技研工業株式会社 Conversation analyzer, conversation analysis method and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUMAN INTERFACE 2015, 1 September 2015 (2015-09-01), pages 939 - 943 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021245759A1 (en) * 2020-06-01 2021-12-09 ハイラブル株式会社 Voice conference device, voice conference system, and voice conference method

Also Published As

Publication number Publication date
JP6589040B1 (en) 2019-10-09
JPWO2019142231A1 (en) 2020-01-23

Similar Documents

Publication Publication Date Title
CN101213589B (en) Object sound analysis device, object sound analysis method
WO2007139040A1 (en) Speech situation data creating device, speech situation visualizing device, speech situation data editing device, speech data reproducing device, and speech communication system
CN110782962A (en) Hearing language rehabilitation device, method, electronic equipment and storage medium
US20240153483A1 (en) Systems and methods for generating synthesized speech responses to voice inputs
US20230317095A1 (en) Systems and methods for pre-filtering audio content based on prominence of frequency content
Ramsay et al. The intrinsic memorability of everyday sounds
WO2019142231A1 (en) Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
JP6589042B1 (en) Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system
JP6646134B2 (en) Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
CN109377806B (en) Test question distribution method based on learning level and learning client
KR102077642B1 (en) Sight-singing evaluation system and Sight-singing evaluation method using the same
KR101243766B1 (en) System and method for deciding user’s personality using voice signal
JP2020173415A (en) Teaching material presentation system and teaching material presentation method
JP7427274B2 (en) Speech analysis device, speech analysis method, speech analysis program and speech analysis system
JP6975755B2 (en) Voice analyzer, voice analysis method, voice analysis program and voice analysis system
JP7414319B2 (en) Speech analysis device, speech analysis method, speech analysis program and speech analysis system
JP6589041B1 (en) Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system
CN110727883A (en) Method and system for analyzing personalized growth map of child
JP6975756B2 (en) Voice analyzer, voice analysis method, voice analysis program and voice analysis system
Altaf et al. Perceptually motivated temporal modeling of footsteps in a cross-environmental detection task
KR20230064870A (en) Psychoanalysis server for people with low vision through online music activity and psychological analysis method using the same
KR20200018859A (en) Web service system for speech feedback
Becker et al. Comparing automatic forensic voice comparison systems under forensic conditions
JP2022017527A (en) Speech analysis device, speech analysis method, voice analysis program, and speech analysis system
CN112887490A (en) Telephone robot pressure test system based on collection scene

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018502279

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18900614

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18900614

Country of ref document: EP

Kind code of ref document: A1