WO2019142232A1 - Voice analysis device, voice analysis method, voice analysis program, and voice analysis system - Google Patents
- Publication number
- WO2019142232A1 (PCT/JP2018/000943)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- unit
- participant
- participants
- analysis
- Prior art date
Links
Images
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Definitions
- the present invention relates to a voice analysis device for analyzing voice, a voice analysis method, a voice analysis program and a voice analysis system.
- the Harkness method is known as a method for analyzing discussions in group learning and meetings (see, for example, Non-Patent Document 1).
- in the Harkness method, the transitions of each participant's utterances are recorded as lines. In this way, it is possible to analyze each participant's contribution to the discussion and their relationships with others.
- the Harkness method can also be effectively applied to active learning, in which students take the initiative in their own learning.
- the present invention has been made in view of these points, and it is an object of the present invention to provide a voice analysis device, a voice analysis method, a voice analysis program, and a voice analysis system capable of reducing the time and effort required to set the positions of participants when analyzing the voices of a discussion.
- the voice analysis device includes a setting unit configured to acquire information on a plurality of participants from a sound collection device and to set a position of each of the plurality of participants based on the acquired information on the participants, an acquisition unit configured to acquire voice from the sound collection device, and an analysis unit configured to analyze the voice uttered by each of the plurality of participants based on the positions set by the setting unit.
- the setting unit may set the position of each of the plurality of participants by acquiring voice from the sound collection device as the information on the participants and specifying the direction from which the acquired voice is emitted.
- the setting unit may set the position of each of the plurality of participants by acquiring an image captured by an imaging unit provided on the sound collection device as the information on the participants and recognizing the faces of the plurality of participants included in the acquired image.
- the setting unit may set the position of each of the plurality of participants by acquiring information of a card read by a reading unit provided on the sound collection device as the information on the participants, according to the direction in which each card is presented to the reading unit.
- the setting unit may set the position of each of the plurality of participants based on the information input in the communication terminal in addition to the information on the participants.
- the voice analysis device may further include a tracking unit that updates the position set by the setting unit while the voice is being analyzed by the analysis unit.
- the tracking unit may update the position set by the setting unit when the direction from which the voice analyzed by the analysis unit is emitted does not correspond to the position set by the setting unit.
- the tracking unit may update the position set by the setting unit to the direction from which the voice analyzed by the analysis unit is emitted.
- in the voice analysis method, a processor executes the steps of: acquiring information on a plurality of participants from a sound collection device; setting the position of each of the plurality of participants based on the acquired information on the participants; acquiring voice from the sound collection device; and analyzing the voice emitted by each of the plurality of participants based on the positions set in the setting step.
- the voice analysis program causes a computer to execute the steps of: acquiring information on a plurality of participants from a sound collection device; setting the positions of the plurality of participants based on the acquired information on the participants; acquiring voice from the sound collection device; and analyzing the voice emitted by each of the plurality of participants based on the positions set in the setting step.
- a voice analysis system includes a voice analysis device and a sound collection device capable of communicating with the voice analysis device, wherein the sound collection device acquires voice and information on a plurality of participants, and the voice analysis device acquires the information on the participants from the sound collection device and sets the positions of the plurality of participants based on the acquired information on the participants.
- FIG. 1 is a schematic view of a speech analysis system S according to the present embodiment.
- the voice analysis system S includes a voice analysis device 100, a sound collection device 10, and a communication terminal 20.
- the number of sound collection devices 10 and communication terminals 20 included in the voice analysis system S is not limited.
- the voice analysis system S may include devices such as other servers and terminals.
- the voice analysis device 100, the sound collection device 10, and the communication terminal 20 are connected via a network N such as a local area network or the Internet. At least a part of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 may be directly connected without the network N.
- the sound collection device 10 includes a microphone array having a plurality of sound collection units (microphones) arranged in different orientations.
- the microphone array includes eight microphones equally spaced on the same circumference in the horizontal plane with respect to the ground.
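As an illustration only (the patent does not disclose code), the geometry of such an equally spaced circular array can be sketched as follows; the radius value is an assumed placeholder, since the patent does not specify one.

```python
import math

def mic_positions(n=8, radius=0.05):
    """Angles and horizontal-plane coordinates of n microphones equally
    spaced on a circle of `radius` meters, as in the described array.
    Returns a list of (angle_degrees, x, y) tuples."""
    positions = []
    for i in range(n):
        angle = 360.0 * i / n  # equal spacing around the circumference
        positions.append((angle,
                          radius * math.cos(math.radians(angle)),
                          radius * math.sin(math.radians(angle))))
    return positions
```

With the defaults this yields eight microphones at 45-degree intervals, matching the arrangement described above.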
- the sound collection device 10 transmits the voice acquired using the microphone array to the voice analysis device 100 as data.
- the communication terminal 20 is a communication device capable of performing wired or wireless communication.
- the communication terminal 20 is, for example, a portable terminal such as a smart phone terminal or a computer terminal such as a personal computer.
- the communication terminal 20 receives the setting of analysis conditions from the analyst and displays the analysis result by the voice analysis device 100.
- the voice analysis device 100 is a computer that analyzes the voice acquired by the sound collection device 10 by a voice analysis method described later. Further, the voice analysis device 100 transmits the result of the voice analysis to the communication terminal 20.
- FIG. 2 is a block diagram of the speech analysis system S according to the present embodiment. Arrows in FIG. 2 indicate the main data flow, and there may be data flows not shown in FIG. In FIG. 2, each block is not a hardware (apparatus) unit configuration but a function unit configuration. As such, the blocks shown in FIG. 2 may be implemented in a single device or may be implemented separately in multiple devices. Transfer of data between the blocks may be performed via any means such as a data bus, a network, a portable storage medium, and the like.
- the sound collection device 10 includes an imaging unit 11 for imaging a participant in the discussion, and a reading unit 12 for reading information such as a card presented by the participant in the discussion.
- the imaging unit 11 is an imaging device capable of imaging a predetermined imaging range including the face of each participant (that is, each of a plurality of participants).
- the imaging unit 11 includes imaging elements of the number and arrangement capable of imaging the faces of all the participants surrounding the sound collection device 10. For example, the imaging unit 11 includes two imaging elements arranged in different orientations of 180 degrees in a horizontal plane with respect to the ground.
- the imaging part 11 may image the face of all the participants who surround the sound collection apparatus 10 by rotating in the horizontal surface with respect to the ground.
- the imaging unit 11 may perform imaging at a timing (for example, every 10 seconds) set in advance in the sound collection device 10, or may perform imaging in accordance with an imaging instruction received from the voice analysis device 100.
- the imaging unit 11 transmits an image indicating the imaged content to the voice analysis device 100.
- the reading unit 12 has a reader (card reader) that reads, by a contact or non-contact method, information recorded on an IC (Integrated Circuit) card or a magnetic card (hereinafter collectively referred to as a card) presented by a participant.
- An IC chip incorporated in a smartphone or the like may be used as an IC card.
- the reading unit 12 is configured to be able to specify the orientation of the participant who presented the card.
- the reading unit 12 includes twelve reading devices arranged in different directions every 30 degrees in a horizontal plane with respect to the ground.
- the reading unit 12 may be provided with a button for specifying the orientation of the participant in addition to the reading device.
- when a card is presented by a participant, the reading unit 12 reads the information on the card with a reading device and identifies the orientation of the participant based on which reading device read the card. Then, the reading unit 12 associates the read information with the direction of the participant and transmits them to the voice analysis device 100.
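The mapping from reading device to participant direction described above can be sketched as follows. This is a minimal illustration assuming reader index 0 faces 0 degrees and indices increase around the device at equal 30-degree steps; the patent does not specify the indexing convention.

```python
def reader_angle(reader_index, n_readers=12):
    """Angle (degrees) of the participant who presented a card, derived
    from which of the n equally spaced reading devices read it."""
    return (360 // n_readers) * reader_index
```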
- the communication terminal 20 has a display unit 21 for displaying various information, and an operation unit 22 for receiving an operation by an analyst.
- the display unit 21 includes a display device such as a liquid crystal display or an organic light emitting diode (OLED) display.
- the operation unit 22 includes operation members such as a button, a switch, and a dial.
- the display unit 21 and the operation unit 22 may be integrally configured by using a touch screen capable of detecting the position of contact by the analyst as the display unit 21.
- the voice analysis device 100 includes a control unit 110, a communication unit 120, and a storage unit 130.
- the control unit 110 includes a position setting unit 111, an audio acquisition unit 112, a sound source localization unit 113, a tracking unit 114, an analysis unit 115, and an output unit 116.
- the storage unit 130 includes a position storage unit 131, a voice storage unit 132, and an analysis result storage unit 133.
- the communication unit 120 is a communication interface for communicating with the sound collection device 10 and the communication terminal 20 via the network N.
- the communication unit 120 includes a processor for performing communication, a connector, an electric circuit, and the like.
- the communication unit 120 performs predetermined processing on a communication signal received from the outside to acquire data, and inputs the acquired data to the control unit 110. Further, the communication unit 120 performs predetermined processing on the data input from the control unit 110 to generate a communication signal, and transmits the generated communication signal to the outside.
- the storage unit 130 is a storage medium including a read only memory (ROM), a random access memory (RAM), a hard disk drive, and the like.
- the storage unit 130 stores in advance a program to be executed by the control unit 110.
- the storage unit 130 may be provided outside the voice analysis device 100, and in this case, data may be exchanged with the control unit 110 via the communication unit 120.
- the position storage unit 131 stores information indicating the positions of participants in the discussion.
- the voice storage unit 132 stores the voice acquired by the sound collection device 10.
- the analysis result storage unit 133 stores an analysis result indicating the result of analyzing the voice.
- the position storage unit 131, the voice storage unit 132, and the analysis result storage unit 133 may be storage areas on the storage unit 130, or may be databases configured on the storage unit 130.
- the control unit 110 is, for example, a processor such as a central processing unit (CPU), and executes the program stored in the storage unit 130 to obtain the position setting unit 111, the sound acquisition unit 112, the sound source localization unit 113, and the tracking unit 114. Functions as an analysis unit 115 and an output unit 116.
- the functions of the position setting unit 111, the sound acquisition unit 112, the sound source localization unit 113, the tracking unit 114, the analysis unit 115, and the output unit 116 will be described later with reference to FIGS.
- At least a part of the functions of the control unit 110 may be performed by an electrical circuit.
- at least a part of the functions of the control unit 110 may be executed by a program executed via a network.
- the speech analysis system S is not limited to the specific configuration shown in FIG.
- the voice analysis device 100 is not limited to one device, and may be configured by connecting two or more physically separated devices in a wired or wireless manner.
- FIG. 3 is a schematic view of the speech analysis method performed by the speech analysis system S according to the present embodiment.
- the position setting unit 111 of the voice analysis device 100 sets the position of each participant in the discussion to be analyzed by the position setting processing described later (a).
- the position setting unit 111 sets the position of each participant by storing the position of each participant specified in the position setting processing described later in the position storage unit 131.
- the voice acquisition unit 112 of the voice analysis device 100 transmits a signal instructing acquisition of voice to the sound collection device 10 when starting acquisition of voice (b).
- when the sound collection device 10 receives the signal instructing acquisition of voice from the voice analysis device 100, it starts collecting voice.
- when the voice acquisition unit 112 of the voice analysis device 100 ends the voice acquisition, it transmits a signal instructing the end of the voice acquisition to the sound collection device 10.
- when the sound collection device 10 receives the signal instructing the end of the acquisition of voice from the voice analysis device 100, it ends the acquisition of voice.
- the sound collection device 10 acquires voice with each of the plurality of sound collection units and internally records it as the voice of the channel corresponding to each sound collection unit. Then, the sound collection device 10 transmits the acquired voices of the plurality of channels to the voice analysis device 100 (c). The sound collection device 10 may transmit the acquired voice sequentially, or may transmit a predetermined amount or a predetermined time of voice at a time. Further, the sound collection device 10 may collectively transmit the voice from the start to the end of the acquisition.
- the voice acquisition unit 112 of the voice analysis device 100 receives voice from the sound collection device 10 and stores the voice in the voice storage unit 132.
- the voice analysis device 100 analyzes voice at predetermined timing using the voice acquired from the sound collection device 10.
- the voice analysis device 100 may analyze the voice when the analyst gives an analysis instruction at the communication terminal 20 by a predetermined operation. In this case, the analyst selects the voice corresponding to the discussion to be analyzed from the voices stored in the voice storage unit 132.
- the voice analysis device 100 may analyze the voice when the voice acquisition ends. In this case, the voice from the start to the end of the acquisition corresponds to the discussion to be analyzed. In addition, the voice analysis device 100 may analyze voice sequentially (that is, by real-time processing) during acquisition of voice. In this case, the voice for a predetermined time in the past (for example, 30 seconds), going back from the current time, corresponds to the discussion to be analyzed.
- when analyzing voice, the sound source localization unit 113 first performs sound source localization based on the plurality of channels of voice acquired by the voice acquisition unit 112 (d). Sound source localization is processing for estimating, for each time (for example, every 10 to 100 milliseconds), the direction of the sound source included in the voice acquired by the voice acquisition unit 112. The sound source localization unit 113 associates the direction of the sound source estimated for each time with the positions of the participants stored in the position storage unit 131.
- as long as the sound source localization unit 113 can identify the direction of the sound source from the voice acquired from the sound collection device 10, a known sound source localization method such as the Multiple Signal Classification (MUSIC) method or a beamforming method can be used.
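Associating an estimated sound-source direction with the stored participant positions, as described above, can be sketched as follows. This is a minimal illustration; the 30-degree tolerance is an assumed value and `match_direction` is a hypothetical helper name, neither of which appears in the patent.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def match_direction(source_angle, participant_angles, tolerance=30.0):
    """Return the participant whose stored angle is closest to the
    estimated source direction, or None if no stored position lies
    within `tolerance` degrees.  participant_angles: dict name -> angle."""
    best = min(participant_angles,
               key=lambda p: angular_distance(source_angle, participant_angles[p]))
    if angular_distance(source_angle, participant_angles[best]) <= tolerance:
        return best
    return None
```

The modulo arithmetic handles wrap-around, so a source at 355 degrees still matches a participant registered at 0 degrees.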
- the analysis unit 115 analyzes the voice based on the voice acquired by the voice acquisition unit 112, the direction of the sound source estimated by the sound source localization unit 113, and the positions of the participants stored in the position storage unit 131 (e).
- the analysis unit 115 may analyze the entire completed discussion as an analysis target, or may analyze a part of the discussion in the case of real-time processing.
- the analysis unit 115 first determines, based on the voice acquired by the voice acquisition unit 112, the direction of the sound source estimated by the sound source localization unit 113, and the positions of the participants stored in the position storage unit 131, which participant spoke at each time interval (for example, every 10 to 100 milliseconds) in the discussion.
- the analysis unit 115 specifies, as a speech period, a continuous period from when one participant starts speaking to when that participant stops, and stores it in the analysis result storage unit 133. When a plurality of participants speak at the same time, the analysis unit 115 specifies a speech period for each participant.
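Extracting speech periods from a per-frame sequence of speaker decisions, as described above, can be sketched as follows. This is an illustrative simplification that handles one speaker per frame; the patent also covers simultaneous speakers, which would require one such sequence per participant.

```python
def speech_periods(frames, frame_sec=0.01):
    """frames: list of per-frame speaker labels (None = silence).
    Returns a list of (speaker, start_time, end_time) tuples, one per
    contiguous run of frames attributed to the same speaker."""
    periods = []
    current = None  # speaker of the run in progress
    start = None    # frame index where that run began
    for i, label in enumerate(frames):
        if label != current:
            if current is not None:  # close the previous speech run
                periods.append((current, start * frame_sec, i * frame_sec))
            current = label
            start = i
    if current is not None:          # close a run that reaches the end
        periods.append((current, start * frame_sec, len(frames) * frame_sec))
    return periods
```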
- the analysis unit 115 calculates the amount of speech of each participant for each time, and stores it in the analysis result storage unit 133. Specifically, within a certain time window (for example, 5 seconds), the analysis unit 115 calculates the amount of speech per time as the value obtained by dividing the length of time during which the participant speaks by the length of the time window. Then, the analysis unit 115 repeats this calculation for each participant while shifting the time window by a predetermined time (for example, one second) from the start time of the discussion to its end time (the current time in the case of real-time processing).
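The sliding-window calculation of speech amount described above can be sketched as follows, assuming a per-second 0/1 speaking indicator for one participant; the window and step defaults follow the 5-second and 1-second examples in the text.

```python
def speech_amount(speaking, window=5, step=1):
    """speaking: list of per-second 0/1 values (1 = participant speaking).
    Returns, for each window position shifted by `step` seconds, the
    fraction of the window during which the participant spoke."""
    amounts = []
    for start in range(0, len(speaking) - window + 1, step):
        amounts.append(sum(speaking[start:start + window]) / window)
    return amounts
```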
- the tracking unit 114 acquires the latest positions of the participants at predetermined time intervals in the voice to be analyzed by the sound source localization unit 113 and the analysis unit 115, using the tracking processing described later, and updates the positions of the participants stored in the position storage unit 131. Thereby, even if a participant moves from the previously set position during acquisition of the voice, the sound source localization unit 113 and the analysis unit 115 can follow the movement and analyze the voice.
- the output unit 116 performs control to display the analysis result by the analysis unit 115 on the display unit 21 by transmitting the display information to the communication terminal 20 (f).
- the output unit 116 is not limited to the display on the display unit 21 and may output the analysis result by other methods such as printing by a printer, data recording to a storage device, and the like.
- FIG. 4 is a diagram showing a flowchart of the entire speech analysis method performed by the speech analysis apparatus 100 according to the present embodiment.
- the position setting unit 111 specifies the position of each participant in the discussion to be analyzed by the position setting processing described later, and stores the position in the position storage unit 131 (S1).
- the voice acquisition unit 112 obtains a voice from the sound collection device 10 and stores the voice in the voice storage unit 132 (S2).
- the voice analysis device 100 analyzes the voice acquired by the voice acquisition unit 112 in step S2 for each predetermined time range (time window) from the start time to the end time.
- the sound source localization unit 113 executes sound source localization in the time range of the voice to be analyzed, and associates the estimated direction of the sound source with the position of each participant stored in the position storage unit 131 (S3).
- the tracking unit 114 acquires the latest positions of the participants in the time range of the voice to be analyzed by the tracking processing described later, and updates the positions of the participants stored in the position storage unit 131 (S4).
- the analysis unit 115 analyzes the voice based on the voice acquired by the voice acquisition unit 112 in step S2, the direction of the sound source estimated in step S3 by the sound source localization unit 113, and the position of the participant stored in the position storage unit 131. To do (S5).
- the analysis unit 115 stores the analysis result in the analysis result storage unit 133.
- if the analysis has not yet reached the end time (NO in S6), the voice analysis device 100 repeats steps S3 to S5 for the next time range in the voice to be analyzed. When the analysis is completed up to the end time of the voice acquired by the voice acquisition unit 112 in step S2 (YES in S6), the output unit 116 outputs the analysis result of step S5 according to a predetermined method (S7).
- FIG. 5 is a front view of the display unit 21 of the communication terminal 20 displaying the setting screen A.
- the position setting processing includes manual setting processing, in which the analyst sets the position of each participant by operating the communication terminal 20, and automatic setting processing, in which each participant inputs information for specifying his or her position to the sound collection device 10.
- the communication terminal 20 displays the setting screen A on the display unit 21 and receives the setting of the analysis condition by the analyst.
- the setting screen A includes a position setting area A1, a start button A2, an end button A3, and an automatic setting button A4.
- the position setting area A1 is an area for setting the direction in which each participant U is actually positioned, with the sound collection device 10 as a reference, in the discussion to be analyzed.
- the position setting area A1 represents a circle centered on the position of the sound collector 10, and further represents an angle based on the sound collector 10 along the circle.
- the analyst who desires the manual setting process sets the position of each participant U in the position setting area A1 by operating the operation unit 22 of the communication terminal 20.
- identification information (here, U1 to U4) for identifying each participant U is allocated and displayed.
- in the example of FIG. 5, four participants U1 to U4 are set.
- the portion corresponding to each participant U in the position setting area A1 is displayed in a different color for each participant. Thereby, the analyst can easily recognize the direction in which each participant U is set.
- the start button A2, the end button A3 and the automatic setting button A4 are virtual buttons displayed on the display unit 21 respectively.
- the communication terminal 20 transmits a signal of a start instruction to the voice analysis device 100 when the analyst presses the start button A2.
- the communication terminal 20 transmits a signal of a termination instruction to the voice analysis device 100 when the analyst presses the termination button A3.
- the period from the analyst's start instruction to the end instruction constitutes one discussion.
- An analyst who desires the automatic setting process causes the voice analysis device 100 to start the automatic setting process by pressing the automatic setting button A4.
- the communication terminal 20 transmits an automatic setting instruction signal to the voice analysis device 100.
- FIGS. 6A to 6C are schematic views of the automatic setting process performed by the voice analysis device 100 according to the present embodiment.
- the voice analysis device 100 sets the position of the participant U by at least one of the processes shown in FIGS. 6A to 6C.
- FIG. 6A shows a process of setting the position of the participant U based on the voice uttered by the participant U.
- the position setting unit 111 of the voice analysis device 100 causes the sound collection unit of the sound collection device 10 to acquire the voice emitted by each participant U.
- the position setting unit 111 acquires the sound acquired by the sound collection device 10.
- the position setting unit 111 specifies the direction of each participant U based on the direction in which the acquired voice is emitted.
- the position setting unit 111 uses the result of sound source localization by the above-described sound source localization unit 113 in order to specify the direction of the participant from the voice. Then, the position setting unit 111 causes the position storage unit 131 to store the positions of the identified participants U.
- the position setting unit 111 may specify the individual of the participant U by comparing the acquired voice of each participant U with the voice of the individual stored in advance in the voice analysis device 100. For example, the position setting unit 111 identifies the individual by comparing the voiceprints of the voices of the participants U (that is, the frequency spectrum of the voice). As a result, personal information of the participant U can be displayed together with the analysis result, and a plurality of analysis results of the same participant U can be displayed.
- FIG. 6B shows a process of setting the position of the participant U based on the image of the face of the participant U.
- the position setting unit 111 of the voice analysis device 100 causes the imaging unit 11 provided in the sound collection device 10 to capture an area including the faces of all the participants U surrounding the sound collection device 10.
- the position setting unit 111 acquires an image captured by the imaging unit 11.
- the position setting unit 111 recognizes the face of each participant U in the acquired image.
- the position setting unit 111 can use a known face recognition technology to recognize a human face from an image. Then, the position setting unit 111 identifies the position of each participant U based on the sound collection device 10 based on the position of the face of each participant U recognized from the image, and stores the position in the position storage unit 131.
- the relationship between a position in the image (for example, the coordinates of a pixel in the image) and a position relative to the sound collection device 10 (for example, an angle with respect to the sound collection device 10) is set in advance in the voice analysis device 100.
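The pre-set relationship between image coordinates and device-relative angles can be sketched as follows. This is a minimal linear mapping assuming each imaging element covers a 180-degree field of view, matching the two-element example above; `pixel_to_angle` and its parameters are illustrative, and a real mapping would depend on the lens.

```python
def pixel_to_angle(x, image_width, fov_deg=180.0, offset_deg=0.0):
    """Map a horizontal pixel coordinate to an angle around the sound
    collection device, assuming the imaging element covers `fov_deg`
    degrees linearly and faces `offset_deg` (e.g. 0 or 180 for the two
    elements described above)."""
    return (offset_deg + x / image_width * fov_deg) % 360
```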
- the position setting unit 111 may specify the individual of the participant U by comparing the face of each participant U recognized from the image with the face of the individual stored in the voice analysis device 100 in advance. As a result, personal information of the participant U can be displayed together with the analysis result, and a plurality of analysis results of the same participant U can be displayed.
- FIG. 6C shows a process of setting the position of the participant U based on the information of the card C presented by the participant U.
- the position setting unit 111 of the voice analysis device 100 causes the reading unit 12 provided in the sound collection device 10 to read the information of the card C presented by each participant U.
- the position setting unit 111 acquires the information on the card C read by the reading unit 12 and the direction of the participant U who presented the card C.
- the position setting unit 111 identifies the position of each participant U with respect to the sound collection device 10 based on the acquired information of the card C and the direction of the participant U, and causes the position storage unit 131 to store the position.
- the position setting unit 111 may specify the individual of the participant U by acquiring personal information stored in advance in the voice analysis device 100 using the acquired information of the card C. As a result, personal information of the participant U can be displayed together with the analysis result, and a plurality of analysis results of the same participant U can be displayed.
- the position setting unit 111 may execute the automatic setting process and the manual setting process in combination.
- the position setting unit 111 displays the positions of the participants U set by the automatic setting processing of FIGS. 6A to 6C in the position setting area A1 of FIG. 5, and accepts manual corrections by the analyst.
- the position of each participant U set by the automatic setting process can be corrected by the manual setting process, and the position of each participant U can be set more reliably.
- since the voice analysis device 100 can automatically set the position of each participant U based on the information on the participant U acquired by the sound collection device 10, it is possible to save the analyst the trouble of setting the position of each participant U for every group on the communication terminal 20.
- the information on the participant U that can be acquired by the sound collection device 10 (that is, information for specifying the position of the participant U) is not limited to voice, an image, or a card; other information from which the orientation of the participant U can be identified may be used.
- FIG. 7 is a diagram showing a flowchart of position setting processing performed by the voice analysis device 100 according to the present embodiment.
- the position setting unit 111 determines whether the automatic setting processing was instructed by the analyst on the setting screen A of FIG. 5. When the automatic setting processing is not instructed (that is, in the case of manual setting) (NO in S11), the position setting unit 111 specifies the position of each participant according to the contents input on the setting screen A displayed on the communication terminal 20, and sets it in the position storage unit 131 (S12).
- when the automatic setting process is instructed (YES in S11), the position setting unit 111 acquires information on the participants (that is, information for specifying the positions of the participants) at the sound collection device 10 (S13).
- the position setting unit 111 uses at least one of the voice of the participant, the image of the face of the participant, and the information of the card presented by the participant as the information on the participant.
- the position setting unit 111 specifies the position of each participant U with respect to the sound collection device 10 based on the acquired information on the participants (S14). Then, the position setting unit 111 sets the positions of the participants by storing the positions of the identified participants in the position storage unit 131 (S15).
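As a rough sketch of steps S13 to S15 above, the automatic setting process can be thought of as storing, for each participant, the estimated direction of arrival of that participant's enrollment input. This is an illustrative sketch only: the names `PositionStore` and `set_positions`, and the use of plain degrees, are assumptions, not taken from the patent.

```python
# Minimal sketch of the automatic position setting (S13-S15), assuming the
# sound collection device reports an estimated direction of arrival (in
# degrees, 0-359) for each participant's enrollment utterance, face, or card.

class PositionStore:
    """Stands in for the position storage unit 131: participant -> direction."""
    def __init__(self):
        self._positions = {}

    def set(self, participant_id, direction_deg):
        # Normalize to 0-359 so downstream comparisons are consistent.
        self._positions[participant_id] = direction_deg % 360

    def get(self, participant_id):
        return self._positions[participant_id]

def set_positions(enrollment_directions, store):
    """enrollment_directions: {participant_id: estimated direction in degrees}."""
    for participant_id, direction in enrollment_directions.items():
        store.set(participant_id, direction)
    return store

store = set_positions({"U1": 0, "U2": 90, "U3": 180, "U4": 270}, PositionStore())
print(store.get("U2"))  # 90
```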
- FIG. 8 is a schematic view of the follow-up process performed by the voice analysis device 100 according to the present embodiment.
- the follow-up process updates the position of each participant U stored in the position storage unit 131 while the sound source localization unit 113 and the analysis unit 115 are analyzing the voice.
- the upper part of FIG. 8 shows the position of each participant U before the update
- the lower part of FIG. 8 shows the position of each participant U after the update.
- the upper view of FIG. 8 shows a state in which the participant U1 has moved from the position P1 set in the position storage unit 131 to another position P2. In this state, the voice emitted by the participant U1 reaches the sound collection device 10 from the position P2, which differs from the set position P1 of the participant U1. Therefore, the analysis unit 115 cannot detect the utterance of the participant U1 from the voice.
- the tracking unit 114 updates the position of the participant U1 from the position P1 to the position P2 in the position storage unit 131, as illustrated in the lower part of FIG. 8.
- the analysis unit 115 can correctly detect the utterance of the participant U1.
- the tracking unit 114 acquires the direction of the sound source estimated by the sound source localization unit 113 at predetermined intervals (for example, every minute). If the estimated direction of the sound source does not correspond to any of the participants' positions stored in the position storage unit 131, the tracking unit 114 judges that some participant U has moved in the direction of the sound source. The tracking unit 114 then identifies the participant U who moved, and updates that participant's position stored in the position storage unit 131 to the position corresponding to the direction of the sound source.
- for example, the tracking unit 114 determines that the participant U set at the position closest to the direction of the sound source estimated from the sound acquired by the sound collection device 10 has moved to the position corresponding to that direction. In this case, the tracking unit 114 may select the moved participant U only from among the participants U set at positions within a predetermined range of the sound-source direction (for example, within -30 degrees to +30 degrees). By limiting the range of movement in this way, the tracking unit 114 can prevent, for example, a participant U's position from being moved to the wrong place.
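The nearest-participant selection with an angular window can be sketched as follows. The function names and the data layout are illustrative assumptions; only the +/-30 degree window comes from the text above.

```python
# Hypothetical sketch of how the tracking unit 114 might pick the moved
# participant: the participant whose stored direction is closest to the
# estimated sound-source direction, restricted to a +/-30 degree window.

def angular_diff(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def nearest_participant(source_dir, stored_positions, window_deg=30):
    """stored_positions: {participant_id: direction in degrees}.
    Returns the closest participant within the window, or None."""
    candidates = [
        (angular_diff(source_dir, pos), pid)
        for pid, pos in stored_positions.items()
        if angular_diff(source_dir, pos) <= window_deg
    ]
    return min(candidates)[1] if candidates else None

positions = {"U1": 0, "U2": 90, "U3": 180, "U4": 270}
print(nearest_participant(110, positions))  # U2 (20 degrees away)
print(nearest_participant(135, positions))  # None (no one within 30 degrees)
```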
- alternatively, the tracking unit 114 may compare the voiceprint of the sound source with the voiceprint of each participant U, and determine that the participant U whose voiceprint is similar to that of the sound source has moved to the position corresponding to the direction of the sound source. In this case, the tracking unit 114 may acquire each participant U's voiceprint from his or her voice at the start of analysis, or may use voiceprints stored in advance in the storage unit 130. The tracking unit 114 calculates the degree of similarity between the voiceprint of the sound source and the voiceprint of each participant U.
- the tracking unit 114 then selects the participant U whose voiceprint similarity is the highest in the group, or selects a participant U whose voiceprint similarity is equal to or higher than a predetermined threshold.
- the tracking accuracy can be improved by specifying the moved participant U using the voiceprint.
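The voiceprint comparison above might be sketched as follows, assuming each voiceprint is represented as a fixed-length embedding vector compared by cosine similarity. The embedding representation and the 0.7 threshold are illustrative assumptions; the patent does not specify how voiceprints are encoded or compared.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_voiceprint(source_vp, participant_vps, threshold=0.7):
    """Return the participant whose voiceprint is most similar to the
    sound source's, if that similarity clears the threshold."""
    best_id = max(
        participant_vps,
        key=lambda pid: cosine_similarity(source_vp, participant_vps[pid]),
    )
    best_sim = cosine_similarity(source_vp, participant_vps[best_id])
    return best_id if best_sim >= threshold else None

vps = {"U1": [1.0, 0.0, 0.0], "U2": [0.0, 1.0, 0.0]}
print(match_voiceprint([0.9, 0.1, 0.0], vps))  # U1
```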
- alternatively, the tracking unit 114 may acquire the face located in the direction of the sound source from the image captured by the imaging unit 11 of the sound collection device 10, and determine that the participant U whose face is similar to the acquired face has moved to the position corresponding to the direction of the sound source.
- in this case, the tracking unit 114 may acquire each participant U's face from the image captured by the imaging unit 11 at the start of analysis, or may use face information stored in advance in the storage unit 130.
- the tracking unit 114 calculates the similarity of the face between the face located in the direction of the sound source and the face of each participant U.
- the tracking unit 114 then selects the participant U whose face similarity is the highest in the group, or selects a participant U whose face similarity is equal to or higher than a predetermined threshold. By identifying the moved participant U using the face, the tracking accuracy can be improved.
- furthermore, the tracking unit 114 may weight the voiceprint or face similarity of each participant U based on the difference between the direction of the sound source and the position (direction) stored in the position storage unit 131. The closer a participant U's set position is to the direction of the sound source, the higher the probability that that participant moved to the sound source's position; conversely, the farther apart they are, the lower that probability.
- accordingly, the tracking unit 114 weights the voiceprint or face similarity higher as the difference between the participant U's set position and the direction of the sound source decreases, and lower as that difference increases. This further improves the tracking accuracy.
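The weighting described above can be sketched as scaling the raw voiceprint (or face) similarity down as the angular difference between a participant's stored position and the sound-source direction grows. The linear falloff below is one possible weighting chosen for illustration; the patent does not fix a formula.

```python
def angular_diff(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def weighted_similarity(raw_similarity, participant_dir, source_dir):
    """Scale similarity by a weight that falls linearly from 1.0
    (same direction) to 0.0 (opposite direction, 180 degrees apart)."""
    weight = 1.0 - angular_diff(participant_dir, source_dir) / 180.0
    return raw_similarity * weight

# A nearby participant with a slightly lower raw similarity can outrank a
# distant one once position is taken into account.
print(weighted_similarity(0.80, participant_dir=100, source_dir=110))  # ~0.756
print(weighted_similarity(0.85, participant_dir=300, source_dir=110))  # ~0.047
```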
- FIG. 9 is a diagram showing a flowchart of the follow-up process performed by the voice analysis device 100 according to the present embodiment.
- the tracking unit 114 acquires the direction of the sound source estimated by the sound source localization unit 113. If the direction of the sound source corresponds to one of the participants' positions stored in the position storage unit 131 (YES in S41), the tracking unit 114 ends the process without updating any position.
- otherwise (NO in S41), the tracking unit 114 acquires information on the participants (that is, information for specifying the positions of the participants) at the sound collection device 10 (S42).
- the tracking unit 114 uses at least one of the voice of the participant and the image of the face of the participant as the information on the participant.
- the tracking unit 114 identifies which participant has moved based on the acquired information on the participants (S43). Then, for the participant U identified as having moved, the tracking unit 114 updates the position stored in the position storage unit 131 to the position corresponding to the direction of the sound source (S44).
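One iteration of the follow-up process (S41 to S44) can be sketched as follows. The `identify` callback stands in for the voiceprint- or face-based identification of S43, and the 10-degree matching tolerance is an assumption for illustration.

```python
def angular_diff(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def follow_up_step(source_dir, positions, identify, tolerance_deg=10):
    """positions: {participant_id: direction}. identify(source_dir) names the
    moved participant (e.g. via voiceprint or face). Returns updated positions."""
    # S41: if the source direction already matches a stored position, do nothing.
    if any(angular_diff(source_dir, p) <= tolerance_deg for p in positions.values()):
        return positions
    # S42-S43: acquire participant information and identify who moved.
    moved = identify(source_dir)
    if moved is not None:
        # S44: update the stored position to the sound-source direction.
        positions[moved] = source_dir
    return positions

positions = {"U1": 0, "U2": 90}
positions = follow_up_step(150, positions, identify=lambda d: "U1")
print(positions)  # {'U1': 150, 'U2': 90}
```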
- as described above, the voice analysis device 100 acquires, at the sound collection device 10 arranged in each group, information on each participant, such as the voice the participant emits, an image of the participant's face, or the information of a card the participant presents, and automatically sets each participant's position based on the acquired information. Therefore, when analyzing the voice of a discussion, the effort of setting each participant's position for every group can be reduced.
- in addition, the voice analysis device 100 updates each participant's position based on the information on the participants during voice analysis. Therefore, even if a participant moves while the voice is being acquired, the device can follow the participant and continue the analysis.
- in the present embodiment, the voice analysis device 100 is used to analyze the voice of a discussion, but it can be applied to other uses as well.
- for example, the voice analysis device 100 can also analyze the voices of passengers sitting in a car.
- the processors of the voice analysis device 100, the sound collection device 10, and the communication terminal 20 are the agents of the steps (processes) included in the voice analysis method shown in FIGS. 4, 7, and 9. That is, each processor reads a program for executing the voice analysis method shown in FIGS. 4, 7, and 9 from the storage unit and executes it, and by controlling the respective units of the voice analysis device 100, the sound collection device 10, and the communication terminal 20, carries out the voice analysis method shown in FIGS. 4, 7, and 9.
- some of the steps included in the voice analysis method shown in FIGS. 4, 7, and 9 may be omitted, the order of the steps may be changed, and multiple steps may be performed in parallel.
- S: speech analysis system
- 100: speech analysis device
- 110: control unit
- 111: position setting unit
- 112: speech acquisition unit
- 114: tracking unit
- 115: analysis unit
- 10: sound collection device
- 20: communication terminal
Description
[Overview of speech analysis system S]
FIG. 1 is a schematic view of the speech analysis system S according to the present embodiment. The speech analysis system S includes the voice analysis device 100, the sound collection device 10, and the communication terminal 20. The numbers of sound collection devices 10 and communication terminals 20 included in the speech analysis system S are not limited. The speech analysis system S may also include other devices such as servers and terminals.
[Configuration of speech analysis system S]
FIG. 2 is a block diagram of the speech analysis system S according to the present embodiment. Arrows in FIG. 2 indicate the main data flows, and there may be data flows not shown in FIG. 2. Each block in FIG. 2 represents a functional unit rather than a hardware (device) unit. The blocks shown in FIG. 2 may therefore be implemented in a single device or divided among multiple devices. Data may be exchanged between the blocks via any means, such as a data bus, a network, or a portable storage medium.
[Description of voice analysis method]
FIG. 3 is a schematic view of the voice analysis method performed by the speech analysis system S according to the present embodiment. First, the position setting unit 111 of the voice analysis device 100 sets the position of each participant in the discussion to be analyzed, by the position setting process described later (a). The position setting unit 111 sets each participant's position by storing the position identified in the position setting process in the position storage unit 131.
[Description of position setting process]
First, the position setting process shown in step S1 of FIG. 4 will be described. FIG. 5 is a front view of the display unit 21 of the communication terminal 20 displaying the setting screen A. The position setting process includes a manual setting process, in which the analyst sets each participant's position by operating the communication terminal 20, and an automatic setting process, in which each participant inputs information for specifying his or her own position into the sound collection device 10.
[Description of automatic setting process]
FIGS. 6A to 6C are schematic views of the automatic setting process performed by the voice analysis device 100 according to the present embodiment. When the automatic setting process is instructed, the voice analysis device 100 sets the position of each participant U by at least one of the processes shown in FIGS. 6A to 6C.
[Description of follow-up process]
Next, the follow-up process shown in step S4 of FIG. 4 will be described. FIG. 8 is a schematic view of the follow-up process performed by the voice analysis device 100 according to the present embodiment. The follow-up process updates the position of each participant U stored in the position storage unit 131 while the sound source localization unit 113 and the analysis unit 115 are analyzing the voice.
[Effect of this embodiment]
The voice analysis device 100 according to the present embodiment acquires, at the sound collection device 10 arranged in each group, information on each participant, such as the voice the participant emits, an image of the participant's face, or the information of a card the participant presents, and automatically sets each participant's position based on the acquired information. This reduces the effort of setting each participant's position for every group when analyzing the voice of a discussion.
Claims (11)
- A voice analysis device comprising: a setting unit that acquires information on a plurality of participants from a sound collection device and sets the position of each of the plurality of participants based on the acquired information on the participants; an acquisition unit that acquires voice from the sound collection device; and an analysis unit that analyzes the voice emitted by each of the plurality of participants based on the positions set by the setting unit.
- The voice analysis device according to claim 1, wherein the setting unit sets the position of each of the plurality of participants by acquiring voice from the sound collection device as the information on the participants and specifying the direction from which the acquired voice was emitted.
- The voice analysis device according to claim 1 or 2, wherein the setting unit acquires, as the information on the participants, an image captured by an imaging unit provided on the sound collection device, and sets the position of each of the plurality of participants by recognizing the faces of the plurality of participants included in the acquired image.
- The voice analysis device according to any one of claims 1 to 3, wherein the setting unit acquires, as the information on the participants, the information of a card read by a reading unit provided on the sound collection device, and sets the position of each of the plurality of participants according to the direction in which the card was presented to the reading unit.
- The voice analysis device according to any one of claims 1 to 4, wherein the setting unit sets the position of each of the plurality of participants based on information input on a communication terminal in addition to the information on the participants.
- The voice analysis device according to any one of claims 1 to 5, further comprising a tracking unit that updates the positions set by the setting unit while the analysis unit is analyzing the voice.
- The voice analysis device according to claim 6, wherein the tracking unit updates the position set by the setting unit when the direction from which the voice being analyzed by the analysis unit was emitted does not correspond to that position.
- The voice analysis device according to claim 6 or 7, wherein the tracking unit updates the position set by the setting unit to the direction from which the voice being analyzed by the analysis unit was emitted.
- A voice analysis method in which a processor executes the steps of: acquiring information on a plurality of participants from a sound collection device and setting the position of each of the plurality of participants based on the acquired information on the participants; acquiring voice from the sound collection device; and analyzing the voice emitted by each of the plurality of participants based on the positions set in the setting step.
- A voice analysis program that causes a computer to execute the steps of: acquiring information on a plurality of participants from a sound collection device and setting the position of each of the plurality of participants based on the acquired information on the participants; acquiring voice from the sound collection device; and analyzing the voice emitted by each of the plurality of participants based on the positions set in the setting step.
- A voice analysis system comprising a voice analysis device and a sound collection device capable of communicating with the voice analysis device, wherein the sound collection device is configured to acquire voice and to acquire information on a plurality of participants, and the voice analysis device comprises: a setting unit that acquires the information on the participants from the sound collection device and sets the position of each of the plurality of participants based on the acquired information on the participants; an acquisition unit that acquires the voice from the sound collection device; and an analysis unit that analyzes the voice emitted by each of the plurality of participants based on the positions set by the setting unit.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2018/000943 WO2019142232A1 (en) | 2018-01-16 | 2018-01-16 | Voice analysis device, voice analysis method, voice analysis program, and voice analysis system |
JP2018502280A JP6589041B1 (en) | 2018-01-16 | 2018-01-16 | Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019142232A1 true WO2019142232A1 (en) | 2019-07-25 |
Family
ID=67301394
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6589041B1 (en) |
WO (1) | WO2019142232A1 (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05161190A (en) * | 1991-12-09 | 1993-06-25 | Toda Constr Co Ltd | Sound response microphone |
JP2000356674A (en) * | 1999-06-11 | 2000-12-26 | Japan Science & Technology Corp | Sound source identification device and its identification method |
JP2005274707A (en) * | 2004-03-23 | 2005-10-06 | Sony Corp | Information processing apparatus and method, program, and recording medium |
JP2006189626A (en) * | 2005-01-06 | 2006-07-20 | Fuji Photo Film Co Ltd | Recording device and voice recording program |
JP2017129873A (en) * | 2017-03-06 | 2017-07-27 | 本田技研工業株式会社 | Conversation assist device, method for controlling conversation assist device, and program for conversation assist device |
JP2017173768A (en) * | 2016-03-25 | 2017-09-28 | グローリー株式会社 | Minutes creation system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20160026317A (en) * | 2014-08-29 | 2016-03-09 | 삼성전자주식회사 | Method and apparatus for voice recording |
- 2018-01-16: JP JP2018502280A filed; granted as patent JP6589041B1 (active)
- 2018-01-16: WO PCT/JP2018/000943 filed as WO2019142232A1 (application filing)
Also Published As
Publication number | Publication date |
---|---|
JP6589041B1 (en) | 2019-10-09 |
JPWO2019142232A1 (en) | 2020-01-23 |
Legal Events

- ENP (Entry into the national phase): Ref document number 2018502280; Country of ref document: JP; Kind code of ref document: A
- 121 (EP: the EPO has been informed by WIPO that EP was designated in this application): Ref document number 18900852; Country of ref document: EP; Kind code of ref document: A1
- NENP (Non-entry into the national phase): Ref country code: DE
- 32PN (EP: public notification in the EP bulletin as address of the addressee cannot be established): Noting of loss of rights pursuant to Rule 112(1) EPC (EPO Form 1205A dated 02.10.2020)
- 122 (EP: PCT application non-entry in European phase): Ref document number 18900852; Country of ref document: EP; Kind code of ref document: A1