WO2023210052A1 - Voice analysis device, voice analysis method, and voice analysis program - Google Patents


Info

Publication number
WO2023210052A1
Authority
WO
WIPO (PCT)
Prior art keywords
utterance
information
unit
activity
area
Application number
PCT/JP2022/045694
Other languages
French (fr)
Japanese (ja)
Inventor
武志 水本
直希 安良岡
浩平 柳楽
Original Assignee
Hylable Inc. (ハイラブル株式会社)
Application filed by Hylable Inc. (ハイラブル株式会社)
Publication of WO2023210052A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination

Description

  • The present invention relates to a voice analysis device, a voice analysis method, and a voice analysis program for analyzing voice.
  • Patent Document 1 discloses a system that extracts sounds meeting predetermined conditions from a spectrogram representing the acoustics of a space and displays the sound pressure for each direction in which the extracted sounds are present.
  • The present invention has been made in view of these points, and its purpose is to make it easier to analyze whether or not voice communication is actively taking place.
  • A voice analysis device includes: a voice acquisition unit that acquires the direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; a specifying unit that identifies the utterance position where an utterance was made using the relationship among the directions of arrival, and identifies the length of utterance per unit time at each position using the relationship between the utterance position and each position in the area; and an output control unit that causes an information terminal to display map information associating each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
  • The voice analysis device may further include a reception unit that receives a setting of an object area in which an object is located within the area, and the specifying unit may identify the position where the utterance was made by excluding the part of the direction of arrival that lies beyond the object area as seen from the position of the sound collection device.
  • The map information may be information in which information corresponding to the activity level is superimposed on a map representing the area.
  • The map information may be information in which information corresponding to the activity level and information indicating the positions of one or more call terminals arranged in the area are superimposed on a map representing the area, and the voice analysis device may further include a call control unit that starts sending and receiving audio between a selected call terminal and the information terminal in response to the selection of one of the call terminals in the map information displayed on the information terminal.
  • The output control unit may output intervention information associated with a condition to the information terminal in response to the activity level at a position within the area satisfying the predetermined condition.
  • The voice analysis device may further include a reception unit that receives settings of the condition and of the intervention information associated with the condition from the information terminal.
  • The specifying unit may identify the temporal change of the utterance position as a locus of movement of the utterance position, and the output control unit may cause the information terminal to display information including the locus of movement.
  • The output control unit may cause the information terminal to display, in association with each other, the activity level in a first period in a sub-area that is at least part of the area and the activity level in a second period in the same sub-area.
  • The specifying unit may identify the number of people who made utterances at each position within the area by recognizing one or more speakers who uttered each of the plurality of voices acquired from the plurality of sound collection devices, and the voice analysis device may further include an activity level determination unit that calculates a provisional activity level using the length of utterance per unit time and determines the activity level by correcting the provisional activity level according to the number of people.
  • The activity level determination unit may set the activity level when the number of people is plural to be larger than the activity level when the number of people is one.
  • The output control unit may cause the information terminal to repeatedly display the map information including the activity level determined at predetermined time intervals.
  • The voice analysis device may further include an activity level determination unit that determines to which of a plurality of types the voice used to determine the activity level belongs, and the output control unit may cause the information terminal to display the map information in which each position in the area is associated with the activity level determined using one of the plurality of types of voice.
  • The output control unit may output, to an imaging device that images part of the area, control information for directing the imaging device toward a position within the area where the activity level satisfies a predetermined condition.
  • A voice analysis method, executed by a processor, includes the steps of: acquiring the direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying the utterance position where an utterance was made using the relationship among the directions of arrival; identifying the length of utterance per unit time at each position using the relationship between the utterance position and each position in the area; and displaying, on an information terminal, map information in which each position in the area is associated with an activity level corresponding to the length of utterance per unit time at that position.
  • A voice analysis program causes a processor to execute the steps of: acquiring the direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying the utterance position where an utterance was made using the relationship among the directions of arrival; identifying the length of utterance per unit time at each position using the relationship between the utterance position and each position in the area; and displaying, on an information terminal, map information that associates each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
  • FIG. 1 is a schematic diagram of a voice analysis system according to an embodiment.
  • FIG. 2 is a block diagram of the voice analysis system according to the embodiment.
  • FIG. 3 is a schematic diagram for explaining the relationship among an analysis target area, sound collection devices, and a local terminal.
  • FIG. 4A is a schematic diagram for explaining how the voice acquisition unit acquires the direction of arrival of voice.
  • FIG. 4B is a schematic diagram for explaining how the specifying unit identifies the utterance position.
  • FIG. 5 is a schematic diagram for explaining the relationship between a direction of arrival and an object area.
  • FIG. 6A is a schematic diagram of a local terminal outputting map information and intervention information.
  • FIG. 6B is a schematic diagram of a local terminal outputting intervention information.
  • FIG. 7A is a schematic diagram of an external terminal displaying comparison information.
  • FIG. 7B is a schematic diagram of an external terminal displaying comparison information.
  • FIG. 8 is a schematic diagram of an external terminal displaying movement information.
  • FIG. 9 is a flowchart of an exemplary voice analysis method executed by the voice analysis device according to the embodiment.
  • FIG. 10A is a schematic diagram of map information representing the activity level determined using one of a plurality of voice types.
  • FIG. 10B is a schematic diagram of map information representing the activity level determined using one of a plurality of voice types.
  • FIG. 11 is a block diagram of the voice analysis system according to a third modification.
  • FIG. 12 is a schematic diagram illustrating processing for changing the orientation of the imaging device based on the activity level.
  • FIG. 1 is a schematic diagram of the voice analysis system S according to this embodiment.
  • The voice analysis system S includes a voice analysis device 1, sound collection devices 2, a local terminal 3, and an external terminal 4.
  • The voice analysis system S may include a plurality of sound collection devices 2, a plurality of local terminals 3, and a plurality of external terminals 4.
  • The voice analysis system S may also include other devices such as servers and terminals.
  • The voice analysis device 1 is a computer that analyzes the voice uttered by users in a predetermined analysis target area R and provides the analysis results to the users or to an external user.
  • The analysis target area R is, for example, a room in a company or public facility, a library, a classroom in a school or cram school, an event venue, a park, or the like.
  • The user is a person who stays in the analysis target area R and produces voice for the purpose of conversation or the like.
  • The external user is a person outside the analysis target area R, for example an analyst.
  • The voice analysis device 1 analyzes the voice acquired by the sound collection devices 2 and outputs the analysis results to the local terminal 3 or the external terminal 4.
  • The voice analysis device 1 is connected to the sound collection devices 2, the local terminal 3, and the external terminal 4 by wire or wirelessly via a network such as a local area network or the Internet.
  • The sound collection device 2 is a device that is placed in the analysis target area R and captures the voice uttered by users.
  • The sound collection device 2 includes, for example, a microphone array made up of a plurality of microphones oriented in different directions.
  • The microphone array includes, for example, a plurality of (e.g., eight) microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground.
  • The voice analysis device 1 identifies the position where an utterance was made by estimating the direction of arrival of the voice at each of the plurality of sound collection devices 2 based on the voice collected by the microphone arrays.
  • The sound collection device 2 transmits the voice acquired by the microphone array to the voice analysis device 1 as voice data.
  • The sound collection device 2 may include a single microphone instead of a microphone array.
  • In that case, a plurality of sound collection devices 2 are arranged at predetermined intervals in the analysis target area R.
  • The voice analysis device 1 then identifies the position where an utterance was made by comparing the intensities of the voices acquired by each of the plurality of sound collection devices 2.
  • The local terminal 3 is an information terminal that is installed in the analysis target area R and outputs information.
  • The local terminal 3 is, for example, a tablet terminal, a personal computer, or digital signage.
  • The local terminal 3 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone.
  • The local terminal 3 displays the information received from the voice analysis device 1 on the display unit or outputs it from the audio output unit.
  • The local terminal 3 may function as a call terminal for making calls with the external terminal 4.
  • The external terminal 4 is an information terminal that receives settings related to the analysis and outputs information.
  • The external terminal 4 is, for example, a smartphone, a tablet terminal, or a personal computer.
  • The external terminal 4 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone.
  • The external terminal 4 displays the information received from the voice analysis device 1 on the display unit.
  • The voice analysis device 1 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R.
  • The voice analysis device 1 uses the acquired voices to identify the positions where utterances were made.
  • The voice analysis device 1 identifies the length of utterance per unit time at each position in the analysis target area R by counting, for each time, where in the analysis target area R the utterance position was located.
  • The voice analysis device 1 then calculates an activity level corresponding to the identified length of utterance per unit time. For example, the activity level takes a larger value as the length of utterance per unit time becomes longer, and a smaller value as it becomes shorter.
  • The voice analysis device 1 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target area R with the activity level.
  • In this way, the voice analysis system S identifies the length of utterance per unit time at each position within the analysis target area R based on the voices acquired by the sound collection devices 2 placed in the analysis target area R, and outputs the activity level corresponding to the length of utterance in association with each position within the analysis target area R.
  • The voice analysis system S can thereby visualize the length of utterances at each position within the analysis target area R, rather than the loudness of the voice, making it easier to analyze whether voice communication is actively taking place.
  • FIG. 2 is a block diagram of the voice analysis system S according to this embodiment.
  • Arrows indicate the main data flows; data flows other than those shown in FIG. 2 may exist.
  • Each block shows the configuration of a functional unit rather than a hardware (device) unit. The blocks shown in FIG. 2 may therefore be implemented within a single device or implemented separately across multiple devices. Data may be exchanged between blocks via any means, such as a data bus, a network, or a portable storage medium.
  • The voice analysis device 1 includes a storage unit 11 and a control unit 12.
  • The voice analysis device 1 may be configured from two or more physically separate devices connected by wire or wirelessly, or may be configured as a cloud, that is, a collection of computer resources.
  • The storage unit 11 is a computer-readable, non-transitory storage medium including a ROM (Read Only Memory), a RAM (Random Access Memory), a hard disk drive, and the like.
  • The storage unit 11 stores in advance the programs to be executed by the control unit 12.
  • The storage unit 11 may be provided outside the voice analysis device 1, in which case it may exchange data with the control unit 12 via a network.
  • The control unit 12 includes a reception unit 121, a voice acquisition unit 122, a specifying unit 123, an activity level determination unit 124, an output control unit 125, and a call control unit 126.
  • The control unit 12 is, for example, a processor such as a CPU (Central Processing Unit), and functions as the reception unit 121, the voice acquisition unit 122, the specifying unit 123, the activity level determination unit 124, the output control unit 125, and the call control unit 126 by executing the programs stored in the storage unit 11. At least part of the functions of the control unit 12 may be realized by the control unit 12 executing a program provided via a network.
  • FIG. 3 is a schematic diagram for explaining the relationship among the analysis target area R, the sound collection devices 2, and the local terminal 3.
  • In the analysis target area R, a plurality of sound collection devices 2 and one or more local terminals 3 are arranged.
  • The reception unit 121 accepts settings for the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3 within the analysis target area R, and an object area where an object (obstacle) such as a wall is located within the analysis target area R.
  • The external terminal 4 receives from an external user an operation specifying the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3, and the object area, and transmits information indicating the specified contents to the voice analysis device 1.
  • The reception unit 121 stores in the storage unit 11 information that associates the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3, and the object area, based on the information received from the external terminal 4.
  • The reception unit 121 may also accept settings for sub-areas included in the analysis target area R.
  • A sub-area is a region constituting at least part of the analysis target area R that is of interest during analysis, for example:
  • a coffee area, which is an area that includes a coffee machine;
  • a desk area, which is an area that includes desks;
  • a sofa area, which is an area that includes a sofa; and so on.
  • The external terminal 4 receives from an external user an operation specifying a sub-area within the analysis target area R and the name of the sub-area, and transmits information indicating the specified contents to the voice analysis device 1.
  • The reception unit 121 stores in the storage unit 11 information associating the sub-areas with their names, based on the information received from the external terminal 4.
  • The reception unit 121 may accept settings for intervention conditions used to determine whether or not to output intervention information.
  • An intervention condition is, for example, that the activity level corresponding to the length of utterance per unit time determined by the activity level determination unit 124 is equal to or greater than a predetermined threshold.
  • The intervention information is, for example, a message containing the name of a sub-area that satisfies the intervention condition.
  • The external terminal 4 receives from an external user an operation specifying intervention conditions and intervention information, and transmits information indicating the specified contents to the voice analysis device 1.
  • The reception unit 121 stores in the storage unit 11 information associating the intervention conditions with the intervention information, based on the information received from the external terminal 4.
  • The voice acquisition unit 122 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R.
  • The sound collection device 2 transmits, for example, voice data representing the sounds collected by its microphone array to the voice analysis device 1.
  • The sound collection device 2 either transmits voice data to the voice analysis device 1 continuously, or transmits voice data for a predetermined period (one hour, one day, etc.) to the voice analysis device 1 in bulk.
  • The voice acquisition unit 122 stores the voice data received from the sound collection devices 2 in the storage unit 11 and acquires the voice indicated by the voice data.
  • The voice acquisition unit 122 may apply predetermined filtering to the acquired voice. For example, it may remove voice collected outside a period associated in advance with the analysis target area R (such as the business hours of a company or public facility). It may also remove sounds other than human voice, for example by keeping only the frequency band corresponding to human voices. The voice analysis device 1 can thereby exclude audio that is unimportant for the analysis and improve the accuracy of the analysis results.
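As one concrete illustration of the frequency-band filtering mentioned above, the sketch below band-passes collected audio to a rough human-voice band. The 300-3400 Hz bounds, the function name, and the use of SciPy are assumptions for illustration, not details given in the publication.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def keep_voice_band(audio: np.ndarray, rate: int,
                    low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Band-pass the signal to a rough human-voice band (assumed bounds)."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=rate, output="sos")
    return sosfilt(sos, audio)

# One second of synthetic audio at 16 kHz: a voice-band tone plus mains hum.
rate = 16000
t = np.arange(rate) / rate
audio = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
filtered = keep_voice_band(audio, rate)  # the 50 Hz hum is strongly attenuated
```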
  • The voice acquisition unit 122 acquires the direction of arrival of the voice collected by each of the plurality of sound collection devices 2 for each time step (for example, every 10 to 1000 milliseconds). For example, the voice acquisition unit 122 performs known sound source localization processing on the multi-channel audio collected by the microphone array included in the sound collection device 2.
  • Sound source localization is a process for estimating the position of a sound source contained in the audio acquired by the voice acquisition unit 122.
  • The voice acquisition unit 122 performs sound source localization to acquire a reliability distribution indicating, relative to the position of the sound collection device 2, the reliability that a sound source exists at each position.
  • The reliability is a value corresponding to the likelihood that a sound source exists at that position, and may be, for example, a probability.
  • The reliability distribution thus represents the direction of arrival of the sound with respect to the sound collection device 2.
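A minimal sketch of what such a reliability distribution could look like over a grid of the analysis target area R, given a device position and an estimated direction of arrival. The Gaussian angular falloff and the grid resolution are assumptions; the publication only states that reliability is distributed along the direction of arrival.

```python
import numpy as np

def doa_reliability(grid_w: int, grid_h: int, mic_xy, doa_rad: float,
                    angular_sigma: float = 0.1) -> np.ndarray:
    """Reliability that a sound source exists at each grid cell: highest along
    the estimated direction of arrival from the device, falling off with
    angular distance (the Gaussian falloff is an assumed model)."""
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    angles = np.arctan2(ys - mic_xy[1], xs - mic_xy[0])
    diff = np.angle(np.exp(1j * (angles - doa_rad)))  # wrap to [-pi, pi]
    return np.exp(-0.5 * (diff / angular_sigma) ** 2)

# Distribution P for one device at (10, 40) hearing a source at 15 degrees.
p = doa_reliability(100, 80, mic_xy=(10, 40), doa_rad=np.deg2rad(15))
```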
  • FIG. 4A is a schematic diagram for explaining how the voice acquisition unit 122 acquires the direction of arrival of voice.
  • The example in FIG. 4A represents the reliability distributions P acquired by the voice acquisition unit 122 based on the voices collected by each of three sound collection devices 2.
  • The vertical and horizontal axes of a reliability distribution P correspond to coordinates within the analysis target area R.
  • Because the microphone array cannot determine the distance from the sound collection device 2 to the sound source, regions of equal reliability in the reliability distribution P are distributed linearly (radially) from the sound collection device 2. Since the reliability that the sound source exists is high along the straight line connecting the sound collection device 2 and the sound source, the linear region in the reliability distribution P where the reliability is equal to or higher than a predetermined value indicates the direction of arrival D of the sound at the sound collection device 2.
  • The direction of arrival D is not limited to a straight line through the position of the sound collection device 2; it may also be expressed as a region having a width of a predetermined angle or length relative to the position of the sound collection device 2.
  • Although the voice analysis device 1 estimates the direction of arrival D here, the direction of arrival D may instead be estimated by each of the plurality of sound collection devices 2 based on the voice it acquires with its microphone array.
  • In that case, the voice acquisition unit 122 receives from each of the plurality of sound collection devices 2 information indicating the direction of arrival D estimated by that sound collection device 2.
  • Based on the plurality of directions of arrival D for the plurality of sound collection devices 2, the specifying unit 123 identifies, for each time step (for example, every 10 to 1000 milliseconds), the utterance position, that is, the position within the analysis target area R where an utterance was made.
  • FIG. 4B is a schematic diagram for explaining how the specifying unit 123 identifies the utterance position.
  • The specifying unit 123 superimposes the plurality of reliability distributions P generated from the sounds collected by the plurality of sound collection devices 2.
  • The specifying unit 123 superimposes the plurality of reliability distributions P by, for example, calculating the sum or product of the reliabilities indicated by the distributions at each position in the analysis target area R.
  • FIG. 4B represents the reliability distribution P1 generated by superimposing the three reliability distributions P illustrated in FIG. 4A.
  • The specifying unit 123 identifies the utterance position using the reliability distribution P1 obtained by superimposing the plurality of reliability distributions P.
  • The utterance position may be represented by a single point within the analysis target area R or by a region within the analysis target area R.
  • The specifying unit 123 identifies, for example, a position or region whose reliability in the reliability distribution P1 is equal to or greater than a predetermined value as the utterance position.
  • The position where the plurality of directions of arrival D indicated by the plurality of reliability distributions P intersect has high reliability in the superimposed distribution P1. The specifying unit 123 may therefore identify the intersection position D1, where straight lines along the plurality of directions of arrival D intersect, as the utterance position.
  • The intersection position D1 may also be a region where a plurality of regions extending along the plurality of directions of arrival D overlap.
  • Because the voice analysis device 1 identifies the utterance position from the directions of arrival D at a plurality of sound collection devices 2, it can pinpoint the utterance position with high accuracy even though the distance from a single sound collection device 2 to the sound source cannot be determined.
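Following this description, one hedged way to superimpose the per-device reliability distributions (the text permits either a sum or a product) and read off the utterance position as the most reliable cell:

```python
import numpy as np

def utterance_position(distributions, threshold: float = 0.5):
    """Superimpose per-device reliability distributions (product form here;
    the text also allows a sum) and return the most reliable cell, or None
    if no cell clears the assumed threshold."""
    combined = np.ones_like(distributions[0])
    for p in distributions:
        combined = combined * p
    if combined.max() < threshold:
        return None
    iy, ix = np.unravel_index(np.argmax(combined), combined.shape)
    return (ix, iy)  # cell where the directions of arrival intersect

# e.g. utterance_position([p1, p2, p3]) with distributions built as above.
```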
  • The specifying unit 123 may identify the utterance position while taking into account the object area, received by the reception unit 121, where an object is located within the analysis target area R.
  • FIG. 5 is a schematic diagram for explaining the relationship between the direction of arrival D and the object area R2. The example in FIG. 5 shows a state in which the object area R2 lies partway along the direction of arrival D.
  • In this case, the specifying unit 123 identifies the utterance position by excluding the part of the direction of arrival D that lies beyond the object area R2 as seen from the position of the sound collection device 2.
  • The specifying unit 123 identifies as the utterance position, for example, the intersection position D1 where the line segment between the sound collection device 2 and the object area R2 along a first direction of arrival intersects the line segment between the sound collection device 2 and the object area R2 along a second direction of arrival.
  • The voice analysis device 1 can thereby suppress the erroneous recognition of a sound source lying beyond an obstacle such as a wall, and can improve the accuracy of the utterance position.
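A sketch of the exclusion step: walking along the direction of arrival from the device and discarding everything beyond the object area. The axis-aligned rectangle for R2 and the step size are assumptions for illustration.

```python
import numpy as np

def candidate_positions(mic_xy, doa_rad: float, obj_rect,
                        max_range: float = 100.0, step: float = 0.5):
    """Walk along the direction of arrival from the device and stop at the
    first point inside the object area, so positions beyond a wall are never
    considered. obj_rect = (xmin, ymin, xmax, ymax)."""
    x0, y0 = mic_xy
    xmin, ymin, xmax, ymax = obj_rect
    points = []
    for r in np.arange(step, max_range, step):
        x, y = x0 + r * np.cos(doa_rad), y0 + r * np.sin(doa_rad)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            break  # reached the obstacle: exclude everything farther away
        points.append((x, y))
    return points  # candidate source positions between device and obstacle
```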
  • The specifying unit 123 may also estimate the number of users who made utterances at the utterance position.
  • In this case, the specifying unit 123 applies processing that emphasizes the sound in the direction of arrival D to each of the plurality of voices that the voice acquisition unit 122 acquired from the plurality of sound collection devices 2.
  • The specifying unit 123 emphasizes the sound in the direction of arrival D by, for example, suppressing the sound entering the microphone array of the sound collection device 2 from directions other than the direction of arrival D.
  • The specifying unit 123 then performs known speaker recognition processing on each of the plurality of voices in which the sound in the direction of arrival D has been emphasized, thereby recognizing one or more speakers who uttered each of the voices.
  • The specifying unit 123 recognizes one or more speakers corresponding to one or more generated clusters, for example, by clustering the voice divided at predetermined intervals using deep learning.
  • The specifying unit 123 estimates the one or more speakers common to all of the voices, among the speakers who uttered each of the plurality of voices, as the users who made utterances at the utterance position.
  • The specifying unit 123 stores in the storage unit 11, for each time step, information associating the utterance position with the users who spoke there. The voice analysis device 1 can thereby exclude speakers who spoke at positions other than the utterance position and estimate the users who spoke at the utterance position with high accuracy.
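Once each device's direction-enhanced channel has been run through speaker recognition, the "common speaker" step reduces to a set intersection; a sketch, with plain labels standing in for whatever recognizer output is actually used:

```python
def common_speakers(per_device_speakers):
    """Speakers recognized in every device's direction-enhanced audio are
    taken to be the users actually talking at the utterance position."""
    return set.intersection(*(set(s) for s in per_device_speakers))

# Device A hears speakers {1, 2}, device B {1, 3}, device C {1, 2}:
print(common_speakers([[1, 2], [1, 3], [1, 2]]))  # {1}
```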
  • The specifying unit 123 may identify the utterance position using the sounds collected by a plurality of sound collection devices 2 each including a single microphone, instead of the sounds collected by sound collection devices 2 including microphone arrays.
  • In this case, the plurality of sound collection devices 2 are arranged at predetermined intervals. When a user produces voice within the analysis target area R, each sound collection device 2 acquires the voice with higher intensity the closer it is to the user, and with lower intensity the farther it is from the user.
  • The specifying unit 123 compares the intensities of the sounds acquired by the plurality of sound collection devices 2 at the same time, selects the sound collection device 2 with the highest sound intensity, or a plurality of sound collection devices 2 whose sound intensity is above a threshold, and identifies the utterance position based on the positions of the selected sound collection devices 2. The voice analysis device 1 can thereby identify the utterance position even when using sound collection devices 2 that do not include a microphone array.
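For the single-microphone variant, a hedged sketch: the loudest device(s) are taken as closest to the speaker, and averaging the above-threshold device positions is one assumed reading of "based on the positions of the sound collection devices".

```python
import numpy as np

def position_from_intensity(device_xy, intensities, ratio: float = 0.6):
    """Average the positions of the device(s) whose measured intensity is
    within an assumed ratio of the loudest device."""
    device_xy = np.asarray(device_xy, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    loud = intensities >= ratio * intensities.max()
    return tuple(device_xy[loud].mean(axis=0))

# The two left devices hear the speaker loudly, the third barely at all:
print(position_from_intensity([(0, 0), (4, 0), (2, 3)], [0.9, 0.8, 0.1]))
# -> (2.0, 0.0)
```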
  • The specifying unit 123 identifies the length of utterance per unit time at each position within the analysis target area R based on the utterance positions identified for each time step. For example, for each position within the analysis target area R (for example, each rectangular cell obtained by dividing the analysis target area R), the specifying unit 123 totals the time during which an utterance position was present at that position per unit time (for example, one minute). If an utterance position existed at a certain position for 30 seconds of the minute preceding the current time, the length of utterance per unit time at that position is 30 seconds.
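A sketch of this aggregation step, binning per-frame utterance positions into cells and totaling seconds over the most recent unit time. The 0.1-second frame and 60-second window follow the ranges and the example given above.

```python
from collections import defaultdict

def utterance_seconds_per_cell(events, window_s: float = 60.0,
                               frame_s: float = 0.1):
    """events: (t_seconds, cell) pairs, one per analysis frame in which the
    utterance position fell in that cell. Totals utterance time per cell
    over the most recent window."""
    if not events:
        return {}
    now = max(t for t, _ in events)
    totals = defaultdict(float)
    for t, cell in events:
        if now - t <= window_s:
            totals[cell] += frame_s
    return dict(totals)

# 300 frames (30 s) of speech in cell (2, 5) within the last minute:
events = [(i * 0.1, (2, 5)) for i in range(300)]
print(utterance_seconds_per_cell(events))  # {(2, 5): ~30.0}
```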
  • The activity level determination unit 124 determines, for each position within the analysis target area R, the activity level corresponding to the length of utterance per unit time identified by the specifying unit 123. For example, the activity level determination unit 124 sets the activity level to a larger value the longer the length of utterance per unit time, and to a smaller value the shorter it is.
  • The activity level determination unit 124 may adopt the length of utterance per unit time itself as the activity level, or may adopt that value converted according to a predetermined rule.
  • The activity level determination unit 124 may also determine the activity level taking into account the number of users who spoke at the utterance position. In this case, it first calculates a provisional activity level whose value is larger the longer the length of utterance per unit time identified by the specifying unit 123, and smaller the shorter it is.
  • The activity level determination unit 124 then calculates the activity level by correcting the provisional activity level according to the number of people identified by the specifying unit 123.
  • For example, the activity level determination unit 124 corrects the provisional activity level so that the activity level when there are multiple people is greater than the activity level when there is one person.
  • The voice analysis device 1 can thereby reflect the number of people estimated from the voice in the activity level.
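The activity level computation and headcount correction might look like the following; the 1.5x boost for multiple speakers is an assumed value, since the text only requires plural speakers to score higher than a single speaker.

```python
def activity_level(utterance_seconds: float, window_s: float = 60.0,
                   n_speakers: int = 1, multi_speaker_boost: float = 1.5):
    """Provisional activity grows with utterance length per unit time; the
    boost for plural speakers is an assumed correction."""
    provisional = utterance_seconds / window_s  # in [0, 1]
    if n_speakers > 1:
        provisional *= multi_speaker_boost
    return min(provisional, 1.0)

print(activity_level(30.0))                # 0.5
print(activity_level(30.0, n_speakers=3))  # 0.75
```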
  • The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information in which each position within the analysis target area R is associated with the activity level determined by the activity level determination unit 124. For example, the output control unit 125 generates, as map information, a heat map in which information corresponding to the activity level at each position (color, pattern, etc.) is superimposed on a map representing the analysis target area R. The output control unit 125 may also generate map information that indicates the positions of the plurality of sound collection devices 2 arranged in the analysis target area R in addition to the activity level at each position. The output control unit 125 transmits the generated map information to at least one of the local terminal 3 and the external terminal 4.
  • The output control unit 125 may repeatedly display, on at least one of the local terminal 3 and the external terminal 4, map information reflecting the activity level determined by the activity level determination unit 124 at predetermined time intervals. The voice analysis system S can thereby notify users or an external user of the latest communication status in the analysis target area R.
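A minimal rendering sketch using matplotlib: activity levels as colors with device positions superimposed, as in the described heat map. The grid shape, colormap, and marker style are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def render_activity_map(activity_grid, device_xy, path="heatmap.png"):
    """Superimpose activity levels (color) and sound collection device
    positions (markers) on a map of the analysis target area."""
    fig, ax = plt.subplots()
    ax.imshow(activity_grid, origin="lower", cmap="hot", vmin=0.0, vmax=1.0)
    xs, ys = zip(*device_xy)
    ax.scatter(xs, ys, marker="^", label="sound collection device")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)

grid = np.random.rand(20, 30)  # placeholder activity levels
render_activity_map(grid, [(5, 5), (25, 5), (15, 15)])
```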
  • Whether a situation with a high activity level is regarded as a positive or a negative element, and likewise a situation with a low activity level, depends on the type of the analysis target area R. For example, in an analysis target area R where quietness is desirable (such as a library, or a classroom at a school or cram school where students should be quiet during classes or tests), a situation with a high activity level may be regarded as a negative element, and a situation with a low activity level as a positive element.
  • In response to the activity level at any position within the analysis target area R satisfying a predetermined intervention condition, the output control unit 125 may output the intervention information associated with that intervention condition from at least one of the local terminal 3 and the external terminal 4. For example, the output control unit 125 acquires the intervention conditions and intervention information accepted by the reception unit 121 from the storage unit 11, and determines whether the activity level at each position within the analysis target area R satisfies an intervention condition (for example, whether it is equal to or greater than the threshold indicated by the intervention condition).
  • The output control unit 125 generates the intervention information associated with the intervention condition in response to the activity level at any position within the analysis target area R satisfying that condition.
  • The output control unit 125 generates as intervention information, for example, a message including the name of the sub-area containing the position whose activity level satisfies the intervention condition ("The coffee area is busy", "Please be quiet in the library", etc.).
  • Whether intervention information with positive content (e.g., information to promote communication) or with negative content (e.g., information to suppress communication) is generated when the activity level is high may be determined according to the type of the analysis target area R. As described above, in an analysis target area R where quietness is desirable, for example, a situation with a high activity level may be regarded as a negative element.
  • The output control unit 125 may also generate, as the intervention information, a message instructing a store employee to move to the relevant location (such as "Please move near the hat section") in response to the activity level at a position within the analysis target area R satisfying the intervention condition.
  • In this way, the voice analysis system S can automatically direct store staff to areas of the store where many conversations are taking place, making it easier to explain products and promote sales.
  • The intervention condition is not limited to the activity level being equal to or greater than a predetermined threshold; instead of or in addition to this, the condition may be that the activity level is equal to or less than a predetermined threshold.
  • Whether intervention information with positive content (e.g., praise for being quiet) or with negative content is generated when the activity level is low may likewise be determined according to the type of the analysis target area R.
  • The output control unit 125 may output the intervention information from all of the local terminals 3, or only from those of the plurality of local terminals 3 located within the sub-area containing the position that satisfies the intervention condition. The voice analysis system S can thereby notify the intervention information to the users around the position whose activity level satisfies the intervention condition.
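The intervention check itself is a simple rule evaluation over the per-cell activity levels; a sketch, with the rule and message structures assumed purely for illustration:

```python
def interventions(cell_activity, rules, cell_to_subarea):
    """rules: (predicate, message_template) pairs set via the reception unit;
    yields a message for every cell whose activity level satisfies a rule."""
    out = []
    for cell, level in cell_activity.items():
        for predicate, template in rules:
            if predicate(level):
                name = cell_to_subarea.get(cell, "this area")
                out.append((cell, template.format(area=name)))
    return out

rules = [(lambda a: a >= 0.8, "The {area} is busy."),
         (lambda a: a <= 0.1, "It is quiet in the {area}.")]
print(interventions({(2, 5): 0.9}, rules, {(2, 5): "coffee area"}))
# [((2, 5), 'The coffee area is busy.')]
```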
  • FIG. 6A is a schematic diagram of the local terminal 3 displaying map information and intervention information.
  • The local terminal 3 displays the map information and intervention information received from the voice analysis device 1 on its display unit.
  • Specifically, the local terminal 3 displays a heat map H, which is the map information, and a message M, which represents the intervention information, on the display unit.
  • The external terminal 4 may similarly display the heat map H and the message M on its display unit.
  • FIG. 6B is a schematic diagram of the local terminal 3 outputting intervention information as audio.
  • The local terminal 3 outputs a voice V representing the intervention information received from the voice analysis device 1 from its audio output unit.
  • The voice V may be generated by the output control unit 125 of the voice analysis device 1, or by the local terminal 3.
  • By visualizing the length of utterances at each position within the analysis target area R as map information, the voice analysis system S makes it easier to analyze whether voice communication is actively taking place within the analysis target area R. Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target area R so as to promote or suppress it.
  • The output control unit 125 may change the content of the intervention information depending on the person within the analysis target area R.
  • In this case, intervention information is associated in advance with, for example, a person or an attribute of a person (age, gender, clothing, etc.).
  • The output control unit 125 recognizes a person within the analysis target area R, for example, by performing known person recognition processing on a captured image of the surroundings of the local terminal 3 acquired by a camera included in the local terminal 3.
  • The output control unit 125 may recognize a person located anywhere in the analysis target area R, or only a person located in a specific sub-area.
  • The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with the recognized person or with the person's attribute in response to the intervention condition being satisfied.
  • The voice analysis system S can thereby output intervention information suited to the person within the analysis target area R.
  • The output control unit 125 may cause the external terminal 4 to display comparison information for comparing the activity levels of different periods within the analysis target area R.
  • In this case, the reception unit 121 receives from the external terminal 4 the designation of the sub-area to be compared, and may also receive the designation of the periods to be compared.
  • The output control unit 125 generates comparison information that associates the activity level in a first period in the designated sub-area with the activity level in a second period in the same sub-area.
  • The output control unit 125 transmits the generated comparison information to at least one of the local terminal 3 and the external terminal 4.
  • FIGS. 7A and 7B are schematic diagrams of the external terminal 4 displaying comparison information.
  • The external terminal 4 displays the comparison information received from the voice analysis device 1.
  • In FIG. 7A, the external terminal 4 displays as comparison information a heat map H for each of the first period and the second period, together with a message M representing the comparison result of the activity levels in the first and second periods in the designated sub-area.
  • The designated sub-area within the overall analysis target area R is highlighted.
  • The message M is, for example, a message representing the amount or rate of increase or decrease in the activity level between the first period and the second period in the sub-area.
  • In FIG. 7B, the external terminal 4 displays as comparison information a heat map H1 of the designated sub-area or of the entire analysis target area R, together with a message M representing the comparison result of the activity levels over multiple periods in the designated sub-area or the entire analysis target area R.
  • Unlike the heat map H illustrated in FIGS. 6A and 7A, which shows the activity level itself on the map, the heat map H1 shows information (color, pattern, etc.) corresponding to the difference in activity level between periods. The heat map H1 thus visualizes the difference in activity level between multiple time periods in the same area.
  • The message M here is, for example, a message representing a time period in which the activity level is high or low in the sub-area or in the entire analysis target area R.
  • By associating and visualizing the activity levels of different periods, the voice analysis system S makes it easier to analyze increases and decreases in the activity level and trends in the activity level for each time period.
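A sketch of the period comparison behind heat map H1 and message M: per-cell differences between two periods plus a one-line summary. The averaging and the message wording are assumptions.

```python
import numpy as np

def activity_difference(grid_p1, grid_p2):
    """Per-cell change in activity level between two periods (second minus
    first), plus a one-line summary message."""
    diff = np.asarray(grid_p2) - np.asarray(grid_p1)
    change = diff.mean()
    message = (f"Activity {'rose' if change >= 0 else 'fell'} by "
               f"{abs(change):.0%} on average between the two periods.")
    return diff, message

p1 = np.full((4, 4), 0.30)
p2 = np.full((4, 4), 0.45)
_, msg = activity_difference(p1, p2)
print(msg)  # Activity rose by 15% on average between the two periods.
```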
  • The output control unit 125 may output past audio at a specified position from the external terminal 4.
  • In this case, the reception unit 121 receives the designation of a position within the analysis target area R and of a past period on the external terminal 4 displaying the map information or comparison information.
  • The output control unit 125 retrieves from the storage unit 11 the audio for the specified position and period from among the audio acquired by the voice acquisition unit 122, and outputs it from the audio output unit of the external terminal 4.
  • The voice analysis system S thereby makes it easy to analyze the relationship between the activity level and the actual speech content.
  • The output control unit 125 may display movement information including the locus of movement of the utterance position on at least one of the local terminal 3 and the external terminal 4.
  • The specifying unit 123 identifies, for example, the temporal change of the utterance position identified at each time step as the locus of movement of the utterance position.
  • The specifying unit 123 acquires from the storage unit 11 the information, generated by the speaker recognition processing described above, that associates the utterance position with the users who spoke there for each time step, and identifies from it the locus of movement of the utterance position corresponding to a specific user (speaker).
  • The output control unit 125 transmits movement information including the locus of movement identified by the specifying unit 123 to at least one of the local terminal 3 and the external terminal 4.
  • FIG. 8 is a schematic diagram of the external terminal 4 displaying movement information.
  • The external terminal 4 displays the movement information received from the voice analysis device 1 on its display unit.
  • Specifically, the external terminal 4 displays the movement trajectory T indicated by the movement information on the display unit.
  • The local terminal 3 may similarly display the movement trajectory T on its display unit.
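Given the stored (time, position, speakers) records described above, extracting the movement trajectory T for one speaker is a filter-and-sort; a minimal sketch:

```python
def movement_trajectory(records, speaker):
    """records: (time, position, speakers_at_position) tuples stored by the
    specifying unit; returns the time-ordered positions where the given
    speaker was talking, i.e. the movement trajectory."""
    ordered = sorted(records, key=lambda r: r[0])
    return [pos for _, pos, who in ordered if speaker in who]

records = [(0, (1, 1), {"A"}), (1, (2, 1), {"A", "B"}), (2, (3, 2), {"B"})]
print(movement_trajectory(records, "A"))  # [(1, 1), (2, 1)]
```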
  • The call control unit 126 may start a call between a local terminal 3 selected on the map information and the external terminal 4 after the output control unit 125 has displayed the map information on the external terminal 4.
  • In this case, the output control unit 125 displays on the external terminal 4, as map information, a heat map in which information corresponding to the activity level at each position within the analysis target area R (color, pattern, etc.) and information indicating the positions of the one or more local terminals 3 placed in the analysis target area R (such as icons) are superimposed on a map representing the analysis target area R.
  • The reception unit 121 receives the selection of the local terminal 3 to be called from among the one or more local terminals 3 in the map information displayed on the external terminal 4. For example, in order to support communication in the analysis target area R from outside, the external user selects a local terminal 3 located where the activity level is low in the map information.
  • The call control unit 126 starts transmitting and receiving audio between the selected local terminal 3 and the external terminal 4 in response to the selection of one of the one or more local terminals 3.
  • The local terminal 3 then functions as a call terminal for the call with the external terminal 4: it outputs the voice received from the external terminal 4 from an audio output unit such as a speaker, and transmits the voice input to a sound collection unit such as its microphone to the external terminal 4.
  • The call control unit 126 may allow audio to be exchanged bidirectionally between the selected local terminal 3 and the external terminal 4, or may output audio one-way from the external terminal 4 to the local terminal 3.
  • The voice analysis system S thereby makes it easier for an external user who wishes to call a local terminal 3 to select which local terminal 3 to call based on the activity level.
  • Moreover, the external user can support the activation of communication in the analysis target area R by intervening by voice, from outside, through a local terminal 3 in the analysis target area R.
  • FIG. 9 is a flowchart of an exemplary voice analysis method executed by the voice analysis device 1 according to this embodiment.
  • The reception unit 121 accepts from the external terminal 4 the settings for the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3 within the analysis target area R, and the object area where an object such as a wall is located within the analysis target area R (S11).
  • The voice acquisition unit 122 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R (S12).
  • The voice acquisition unit 122 acquires the direction of arrival D, for each time step, of the voice collected by each of the plurality of sound collection devices 2 (S13).
  • The direction of arrival D may be estimated by the voice analysis device 1 or by each of the plurality of sound collection devices 2.
  • The specifying unit 123 identifies, for each time step, the utterance position, that is, the position within the analysis target area R where an utterance was made, based on the plurality of directions of arrival D at the plurality of sound collection devices 2 (S14).
  • The specifying unit 123 may identify the utterance position taking into account the object area, received by the reception unit 121, where an object is located within the analysis target area R.
  • The specifying unit 123 identifies the length of utterance per unit time at each position within the analysis target area R based on the utterance positions identified for each time step (S15).
  • The activity level determination unit 124 determines, for each position within the analysis target area R, the activity level corresponding to the length of utterance per unit time identified by the specifying unit 123 (S16). For example, the activity level is set larger the longer the length of utterance per unit time, and smaller the shorter it is.
  • The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output map information in which each position within the analysis target area R is associated with the activity level determined by the activity level determination unit 124 (S17). The output control unit 125 may also cause the external terminal 4 to display comparison information for comparing the activity levels of different periods within the analysis target area R.
  • If the activity level determined by the activity level determination unit 124 satisfies a predetermined intervention condition (YES in S18), the output control unit 125 outputs the intervention information associated with that intervention condition to at least one of the local terminal 3 and the external terminal 4 (S19).
  • If the activity level determined by the activity level determination unit 124 does not satisfy the predetermined intervention condition (NO in S18), the voice analysis device 1 ends the process.
  • As described above, the voice analysis system S identifies the length of utterance per unit time at each position in the analysis target area R based on the sounds acquired by the sound collection devices 2 placed in the analysis target area R, and outputs the activity level corresponding to the length of utterance in association with each position within the analysis target area R. As a result, the voice analysis system S can visualize the length of utterances at each position within the analysis target area R, rather than the loudness of the voice, making it easier to analyze whether voice communication is actively taking place.
  • Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target area R so as to promote or suppress it. By associating and visualizing the activity levels of different periods, the voice analysis system S also makes it easier to analyze increases and decreases in the activity level and trends in the activity level for each time period.
  • As a modification, the reception unit 121 may accept a setting that designates an open space as the analysis target area R.
  • In this case, the voice acquisition unit 122 acquires the sounds emitted by animals, collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R, which is an open space. The voice analysis device 1 then identifies the length of the utterances at each position, as in the embodiment described above, and outputs information corresponding to the activity level that corresponds to the length of the utterances.
  • The voice analysis system S can thereby easily analyze whether communication among animals, not only humans, is actively occurring in an open space.
  • In another modification, the voice analysis system S displays a heat map of the activity level for each type of voice, such as laughter or screaming, and outputs intervention information based on the activity level of each voice type.
  • the activation level determining unit 124 determines which of the plurality of audio types the audio used to determine the activation level is. Determine the audio type.
  • the types of sounds include, for example, laughter, screams, voices of specific emotions (anger, etc.), voices of specific animals (deer, bear, etc.), strange noises from machinery, and the like.
  • the activity determination unit 124 obtains, for example, a machine learning model that is stored in advance in the storage unit 11 and outputs information indicating which of a plurality of voice types the input voice is.
  • the machine learning model is configured by, for example, a neural network, and is generated by performing known machine learning processing using a plurality of voices and voice types as training data.
  • the activation level determining unit 124 determines the type of voice output by the machine learning model, and determines whether the voice used to determine the level of activity corresponds to the type of voice output by the machine learning model. Determine the type of voice to be used.
  • the sound collection device 2 may determine the voice type.
  • the sound collection device 2 inputs the collected sound into the above-mentioned machine learning model stored in advance in the storage unit of the sound collection device 2, thereby determining the type of sound output by the machine learning model. get.
  • the sound collection device 2 transmits to the voice analysis device 1, along with voice data indicating the collected voice, information indicating the type of the voice.
  • the activity level determination unit 124 determines the voice type to which the voice used to determine the activity level corresponds, based on the information indicating the voice type received from the sound collection device 2.
  • the output control unit 125 displays map information that associates each position within the analysis target region R with the degree of activity determined using one of the plurality of voice types on at least one of the local terminal 3 and the external terminal 4. Map information such as a heat map corresponding to one voice type is called a layer.
  • the output control unit 125 displays, on at least one of the local terminal 3 and the external terminal 4, map information representing the degree of activity determined using the voice of the voice type specified at that terminal. In this case, at least one of the local terminal 3 and the external terminal 4 displays a layer corresponding to any one voice type based on the information received from the voice analysis device 1.
  • the output control unit 125 may display a list of map information representing the degree of activity determined using the voices of each of the plurality of voice types on at least one of the local terminal 3 and the external terminal 4, for example.
  • at least one of the local terminal 3 and the external terminal 4 displays a plurality of layers corresponding to a plurality of voice types based on the information received from the speech analysis device 1.
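The layer concept above can be pictured with a small sketch, assuming a grid representation of the analysis target region R; the grid size and voice types are hypothetical.

```python
import numpy as np

# One activity grid (one "layer") per voice type over an assumed 40x30-cell
# division of the analysis target region R.
layers: dict[str, np.ndarray] = {t: np.zeros((30, 40)) for t in ("laughter", "scream")}

def add_utterance(voice_type: str, cell_xy: tuple[int, int], seconds: float) -> None:
    x, y = cell_xy
    layers[voice_type][y, x] += seconds      # accumulate utterance length per cell

def layer_for_display(voice_type: str) -> np.ndarray:
    return layers[voice_type]                # a single layer, e.g. the "laughter" map

def all_layers() -> list[tuple[str, np.ndarray]]:
    return list(layers.items())              # for a list view of every voice type
```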
  • FIGS. 10A and 10B are schematic diagrams of map information representing the degree of activity determined using one of a plurality of voice types.
  • FIG. 10A shows, as map information, a heat map H2 corresponding to the activity level determined using voices whose voice type is "laughter".
  • FIG. 10B shows, as map information, a heat map H3 corresponding to the activity level determined using voices whose voice type is "scream".
  • by visualizing the degree of activity of each voice type as map information, the speech analysis system S makes it easier to analyze what types of communication are actively occurring within the analysis target region R.
  • the output control unit 125 may output the intervention information associated with an intervention condition from at least one of the local terminal 3 and the external terminal 4 in response to the activation level of a voice type determined by the activation level determination unit 124 satisfying that intervention condition.
  • the output control unit 125 obtains, for example, intervention conditions and intervention information for each voice type that are stored in advance in the storage unit 11. Intervention conditions may be set for some of the plurality of voice types and not for others.
  • for example, the output control unit 125 may output no intervention information when the voice type is "laughter" but output intervention information when the voice type is "scream". Alternatively, the output control unit 125 may output intervention information for promoting communication in response to the activation level for the voice type "laughter" falling below a predetermined threshold, and output intervention information for suppressing communication in response to the activation level for the voice type "scream" being equal to or higher than a predetermined threshold.
  • the output control unit 125 determines whether the activity level of each of the plurality of voice types at each position in the analysis target region R satisfies the intervention condition for that voice type (for example, whether it is equal to or greater than the threshold value indicated by the intervention condition).
  • the output control unit 125 generates intervention information associated with the intervention condition in response to the degree of activity of each voice type at any position within the analysis target region R satisfying the intervention condition.
  • the output control unit 125 causes the generated intervention information to be output from at least one of the local terminal 3 and the external terminal 4.
  • the speech analysis system S can switch the presence or absence and content of intervention for each speech type by outputting intervention information in response to the degree of activity satisfying the condition set for each speech type.
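A minimal sketch of per-voice-type intervention conditions, assuming threshold rules and example messages that are not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class InterventionRule:
    threshold: float
    above: bool      # True: fire when activity >= threshold; False: when below
    message: str

# Assumed example rules; no rule at all may be set for some voice types.
RULES = {
    "laughter": InterventionRule(0.2, above=False, message="Conversation is quiet; consider an icebreaker."),
    "scream":   InterventionRule(0.8, above=True,  message="High scream activity; please check the area."),
}

def interventions(activity_by_type: dict[str, float]) -> list[str]:
    """Return the intervention messages triggered by the given activity levels."""
    fired = []
    for voice_type, activity in activity_by_type.items():
        rule = RULES.get(voice_type)
        if rule is None:
            continue  # no intervention condition set for this voice type
        if (activity >= rule.threshold) == rule.above:
            fired.append(rule.message)
    return fired
```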
  • the audio analysis system S determines the degree of activity using a captured image generated by imaging the analysis target region R with an imaging device, and also controls the orientation of the imaging device.
  • the following description focuses mainly on the parts that differ from the above-described embodiment.
  • FIG. 11 is a block diagram of the speech analysis system S according to this modification.
  • the control unit 12 of the speech analysis device 1 according to this modification further includes an image acquisition unit 127 in addition to the units shown in FIG. 2.
  • the image acquisition unit 127 acquires a captured image generated by capturing a part of the analysis target region R from the imaging device 5.
  • the imaging device 5 is arranged at a position where it can image a part of the analysis target region R, for example.
  • the imaging device 5 has a drive unit for changing the direction of the imaging device 5.
  • the imaging device 5 can image any position within the analysis target region R by changing the direction using the drive unit.
  • the imaging device 5 may change its orientation according to control information output from the output control unit 125, described later, or may change its orientation automatically.
  • the imaging device 5, periodically or in response to receiving an imaging instruction from the voice analysis device 1, generates a captured image whose imaging range is the direction in which the imaging device 5 is facing by capturing a part of the analysis target region R, and transmits the generated captured image to the voice analysis device 1 via wired or wireless communication.
  • the image acquisition unit 127 receives the captured image from the imaging device 5 and stores the received captured image in the storage unit 11.
  • the output control unit 125 outputs control information for changing the orientation of the imaging device 5 based on the activity level determined by the activity level determination unit 124.
  • FIG. 12 is a schematic diagram for explaining a process for changing the direction of the imaging device 5 based on the degree of activity.
  • FIG. 12 schematically shows the position and orientation of the imaging device 5 in the heat map H of the analysis target region R.
  • the output control unit 125 identifies a position within the analysis target region R where the degree of activity satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold) as a target position L.
  • the output control unit 125 transmits control information for directing the imaging device 5 to the target position L to the imaging device 5.
  • the control information is, for example, information indicating the relative position of the target position L with respect to the position of the imaging device 5.
  • the imaging device 5 changes its direction so as to face the target position L using the drive unit according to the control information received from the voice analysis device 1.
  • the imaging device 5 generates a captured image by capturing the direction in which the imaging device 5 is facing as an imaging range, and transmits the generated captured image to the voice analysis device 1 via wired or wireless communication.
  • the voice analysis device 1 may transmit such control information sequentially.
  • the audio analysis system S can point the imaging device 5 at a position where the degree of activity satisfies a predetermined condition and capture an image there, making it possible to preferentially acquire captured images of positions within the analysis target region R that are likely to be of high importance or urgency.
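For illustration, a sketch of how the control information could be derived from an activity grid, assuming the target position L is the highest-activity cell and the control information is a relative position plus a pan angle (the cell size and threshold are hypothetical):

```python
import math
import numpy as np

def control_info(activity: np.ndarray, camera_xy: tuple[float, float],
                 cell_size: float, threshold: float):
    """Return (dx, dy, pan_deg) pointing the camera at the highest-activity
    cell whose activity meets the threshold, or None if no cell qualifies."""
    y, x = np.unravel_index(np.argmax(activity), activity.shape)
    if activity[y, x] < threshold:
        return None
    # Target position L in region coordinates (cell centers).
    target = ((x + 0.5) * cell_size, (y + 0.5) * cell_size)
    dx, dy = target[0] - camera_xy[0], target[1] - camera_xy[1]
    return dx, dy, math.degrees(math.atan2(dy, dx))  # relative position and pan angle
```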
  • the output control unit 125 may output control information for moving a movable device such as a robot or a drone to the target position L instead of, or in addition to, the control information for directing the imaging device 5 toward the target position L.
  • the movable device moves to the target position L according to the control information received from the speech analysis device 1.
  • the voice analysis system S can thus automatically dispatch robots, drones, and the like to positions within the analysis target region R that are likely to be of high importance or urgency.
  • the activity determination unit 124 may identify the number of users at each position within the analysis target region R based on the captured image acquired by the image acquisition unit 127, and reflect the identified number of users in the information output by the output control unit 125.
  • the activity level determination unit 124 calculates the activity level per user, for example.
  • the activity determining unit 124 specifies the number of users at each position within the analysis target region R, for example, by performing known image recognition processing on the captured image.
  • the activity determination unit 124 may specify only the number of users at a position where the activity satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold).
  • the activity determining unit 124 may identify the number of users at each position within the analysis target region R using the positions of communication devices, such as smartphones, owned by the users instead of the captured image. In this case, the activity determining unit 124 acquires from a user's communication device, for example, information indicating the reception strength of the signal emitted by each of a plurality of transmitters, such as beacons, placed in the analysis target region R. The activity determining unit 124 identifies the user's position using the relationship between the positions of the plurality of transmitters and the reception strength, at the user's communication device, of the signal emitted by each of the plurality of transmitters. The activity determining unit 124 then specifies the number of users at each position within the analysis target region R by totaling the positions of the one or more identified users.
  • the method is not limited to those shown here; the activity determining unit 124 may specify the number of users at each position within the analysis target region R using other methods.
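As one concrete stand-in for the beacon-based localization described above (the disclosure does not fix a method), an RSSI-weighted centroid can be sketched as follows; the dBm-to-weight conversion is an assumption:

```python
import numpy as np

def estimate_position(beacons: dict[str, tuple[float, float]],
                      rssi_dbm: dict[str, float]) -> tuple[float, float]:
    """Estimate a device position as an RSSI-weighted centroid of beacon
    positions; assumes at least one beacon was heard."""
    positions, weights = [], []
    for beacon_id, pos in beacons.items():
        if beacon_id not in rssi_dbm:
            continue
        positions.append(pos)
        # Convert dBm to a linear weight; -40 dBm -> 1e-4 mW, -80 dBm -> 1e-8 mW,
        # so stronger (less negative) RSSI pulls the estimate toward that beacon.
        weights.append(10 ** (rssi_dbm[beacon_id] / 10))
    p, w = np.array(positions), np.array(weights)
    return tuple(np.average(p, axis=0, weights=w))
```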
  • the activity determining unit 124 calculates the activity per person by dividing the activity at each position by the number of people specified for that position.
  • the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the degree of activity per person. Thereby, the speech analysis system S can visualize the level of excitement per user instead of the level of excitement for the entire group including a plurality of users.
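A minimal sketch of the per-person activity computation, assuming the activity and headcount are held in equally shaped grids:

```python
import numpy as np

def activity_per_person(activity: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Divide each cell's activity by its headcount; cells with no users get 0."""
    out = np.zeros_like(activity, dtype=float)
    np.divide(activity, counts, out=out, where=counts > 0)
    return out
```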
  • the output control unit 125 may display, on at least one of the local terminal 3 and the external terminal 4, map information that associates the activity determined by the activity determination unit 124 with each position in the analysis target region R, excluding positions where the number of users is 0 or 1.
  • the voice analysis system S can thereby display the users' activity levels as map information while excluding users' soliloquies, mechanical sounds, natural sounds, and the like, making the map information easier to view.
  • the processor of the speech analysis device 1 executes each step (process) included in the speech analysis method shown in FIG. 9 by executing a program for the speech analysis method shown in FIG. 9. Some of the steps included in the speech analysis method shown in FIG. 9 may be omitted, the order of the steps may be changed, or a plurality of steps may be performed in parallel.
  • S Speech analysis system
  • 1 Speech analysis device
  • 11 Storage section
  • 12 Control section
  • 121 Reception section
  • 122 Voice acquisition section
  • 123 Identification section
  • 124 Activity level determination section
  • 125 Output control section
  • 126 Call control section
  • 127 Image acquisition section
  • 2 Sound collection device
  • 3 Local terminal
  • 4 External terminal
  • 5 Imaging device

Abstract

A voice analysis device 1 according to an embodiment of the present invention has: a voice acquisition unit 122 for acquiring the arrival direction of a voice for each of a plurality of sound collecting devices arranged in a prescribed region; an identifying unit 123 for identifying an utterance position at which an utterance was made using a relationship of a plurality of arrival directions to the plurality of sound collecting devices, and identifying the utterance length per unit time at the position thereof using a relationship between the utterance position and each position in a region; and an output control unit 125 for causing an information terminal to display map information in which each position in the region and a degree of activity corresponding to the utterance length per unit time at the position thereof are associated.

Description

Speech analysis device, speech analysis method, and speech analysis program
The present invention relates to a speech analysis device, a speech analysis method, and a speech analysis program for analyzing speech.
Patent Document 1 discloses a system that extracts sounds that meet predetermined conditions from a spectrogram representing acoustics in a space and displays the sound pressure for each direction in which the extracted sounds are present.
Patent Document 1: JP 2021-152573 A
In companies and schools, there is a need to analyze whether communication between people is taking place actively. The loudness of human voices varies from person to person and also changes depending on the location and situation, so it may be difficult for an analyst to determine whether communication is active by referring to sound pressure or volume.
The present invention has been made in view of these points, and its purpose is to make it easier to analyze whether or not voice communication is actively taking place.
A voice analysis device according to a first aspect of the present invention includes: a voice acquisition unit that acquires a direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; an identification unit that identifies an utterance position where an utterance was made using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and identifies the length of the utterance per unit time at each position using the relationship between the utterance position and each position in the area; and an output control unit that causes an information terminal to display map information that associates each position in the area with a degree of activity corresponding to the length of the utterance per unit time at that position.
The voice analysis device may further include a reception unit that receives a setting of an object area in which an object is located within the area, and when a straight line along the arrival direction intersects the object area, the identification unit may identify the position where the utterance was made by excluding the portion of the arrival direction that is farther than the object area with reference to the position of the sound collection device.
The map information may be information in which information corresponding to the degree of activity is superimposed on a map representing the area.
The map information may be information in which information corresponding to the degree of activity and information indicating the positions of one or more call terminals arranged in the area are superimposed on a map representing the area, and the voice analysis device may further include a call control unit that starts the exchange of audio between the selected call terminal and the information terminal in response to any of the one or more call terminals being selected in the map information displayed on the information terminal.
The output control unit may output intervention information associated with a predetermined condition to the information terminal in response to the degree of activity at a position within the area satisfying the condition.
The voice analysis device may further include a reception unit that receives, from the information terminal, settings of the condition and the intervention information associated with the condition.
The identification unit may identify a temporal change in the position where the utterance was made as a locus of movement of the position where the utterance was made, and the output control unit may cause the information terminal to display information including the locus of movement.
The output control unit may cause the information terminal to display, in association with each other, the degree of activity in a first period in a sub-area that is at least a part of the area and the degree of activity in a second period in the sub-area.
The identification unit may estimate the number of people who made the utterance at each position within the area by recognizing one or more speakers who uttered each of the plurality of voices acquired from the plurality of sound collection devices, and the voice analysis device may further include an activity level determination unit that calculates a provisional degree of activity using the length of the utterance per unit time and determines the degree of activity by correcting the provisional degree of activity according to the number of people.
The activity level determination unit may make the degree of activity when the number of people is two or more larger than the degree of activity when the number of people is one.
The output control unit may repeatedly cause the information terminal to display the map information including the degree of activity determined at predetermined time intervals.
The voice analysis device may further include an activity level determination unit that determines which of a plurality of types the voice used to determine the degree of activity belongs to, and the output control unit may cause the information terminal to display the map information in which each position in the area is associated with the degree of activity determined using the voice of one of the plurality of types.
The output control unit may output, to an imaging device that images a part of the area, control information for directing the imaging device to a position within the area where the degree of activity satisfies a predetermined condition.
A voice analysis method according to a second aspect of the present invention includes the steps, executed by a processor, of: acquiring a direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying an utterance position where an utterance was made using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and identifying the length of the utterance per unit time at each position using the relationship between the utterance position and each position in the area; and causing an information terminal to display map information that associates each position in the area with a degree of activity corresponding to the length of the utterance per unit time at that position.
A voice analysis program according to a third aspect of the present invention causes a processor to execute the steps of: acquiring a direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying an utterance position where an utterance was made using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and identifying the length of the utterance per unit time at each position using the relationship between the utterance position and each position in the area; and causing an information terminal to display map information that associates each position in the area with a degree of activity corresponding to the length of the utterance per unit time at that position.
According to the present invention, it is possible to more easily analyze whether or not voice communication is actively taking place.
  • A schematic diagram of the speech analysis system according to the embodiment.
  • A block diagram of the speech analysis system according to the embodiment.
  • A schematic diagram for explaining the relationship between the analysis target area, the sound collection devices, and the local terminal.
  • A schematic diagram for explaining the method by which the voice acquisition unit acquires the direction of arrival of voice.
  • A schematic diagram for explaining the method by which the identification unit identifies the utterance position.
  • A schematic diagram for explaining the relationship between the direction of arrival and the object area.
  • A schematic diagram of a local terminal outputting map information and intervention information.
  • A schematic diagram of a local terminal outputting intervention information.
  • A schematic diagram of an external terminal displaying comparison information.
  • A schematic diagram of an external terminal displaying comparison information.
  • A schematic diagram of an external terminal displaying movement information.
  • A flowchart of an exemplary speech analysis method executed by the speech analysis device according to the embodiment.
  • A schematic diagram of map information representing the degree of activity determined using one of a plurality of voice types.
  • A schematic diagram of map information representing the degree of activity determined using one of a plurality of voice types.
  • A block diagram of the speech analysis system according to a third modification.
  • A schematic diagram for explaining processing for changing the orientation of the imaging device based on the degree of activity.
[Overview of speech analysis system S]
FIG. 1 is a schematic diagram of the speech analysis system S according to this embodiment. The speech analysis system S includes a speech analysis device 1, a sound collection device 2, a local terminal 3, and an external terminal 4. The speech analysis system S may include a plurality of sound collection devices 2, a plurality of local terminals 3, and a plurality of external terminals 4. The speech analysis system S may also include other devices such as servers and terminals.
The voice analysis device 1 is a computer that analyzes the voice uttered by users in a predetermined analysis target region R and provides the analysis results to the users or to external users. The analysis target region R is, for example, a room in a company or public facility, a library, a classroom in a school or cram school, an event venue, a park, or the like. A user is a person who stays in the analysis target region R and speaks for the purpose of conversation or the like. An external user is a person outside the analysis target region R, for example an analyst. The voice analysis device 1 analyzes the voice acquired by the sound collection devices 2 and outputs the analysis results to the local terminal 3 or the external terminal 4. The voice analysis device 1 is connected to the sound collection devices 2, the local terminal 3, and the external terminal 4 by wire or wirelessly via a network such as a local area network or the Internet.
The sound collection device 2 is a device that is placed in the analysis target region R and acquires the voices uttered by users. The sound collection device 2 includes, for example, a microphone array including a plurality of sound collecting units, such as microphones, arranged to face in different directions. The microphone array includes, for example, a plurality of (e.g., eight) microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground. The voice analysis device 1 identifies the position where an utterance was made by estimating the direction of arrival of the voice at each of the plurality of sound collection devices 2 based on the voices collected using the microphone arrays. The sound collection device 2 transmits the voice acquired using the microphone array to the voice analysis device 1 as voice data.
The sound collection device 2 may include a single microphone instead of a microphone array. In this case, a plurality of sound collection devices 2 are arranged at predetermined intervals in the analysis target region R. The voice analysis device 1 identifies the position where an utterance was made by comparing the intensities of the voices acquired by the plurality of sound collection devices 2.
The local terminal 3 is an information terminal that is installed in the analysis target region R and outputs information. The local terminal 3 is, for example, a tablet terminal, a personal computer, or a digital signage device. The local terminal 3 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone. The local terminal 3 displays the information received from the voice analysis device 1 on the display unit or outputs it from the audio output unit. The local terminal 3 may function as a call terminal for making calls with the external terminal 4.
The external terminal 4 is an information terminal that receives settings related to the analysis and outputs information. The external terminal 4 is, for example, a smartphone, a tablet terminal, or a personal computer. The external terminal 4 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone. The external terminal 4 displays the information received from the voice analysis device 1 on its display unit.
An overview of the process by which the voice analysis system S according to this embodiment analyzes voice is described below. The voice analysis device 1 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R. The voice analysis device 1 uses the acquired voices to identify the positions where utterances were made. The voice analysis device 1 identifies the length of utterance per unit time at each position in the analysis target region R by aggregating, for each time interval, where in the analysis target region R the utterance positions were located.
The voice analysis device 1 calculates the degree of activity corresponding to the identified length of utterance per unit time. The degree of activity is, for example, a value that is larger as the length of utterance per unit time is longer and smaller as the length of utterance per unit time is shorter. The voice analysis device 1 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the degree of activity.
In this way, the voice analysis system S identifies the length of utterance per unit time at each position within the analysis target region R based on the voices acquired by the sound collection devices 2 placed in the analysis target region R, and outputs the degree of activity corresponding to the length of utterance in association with each position within the analysis target region R. Since the voice analysis system S can thereby visualize the length of utterances at each position within the analysis target region R rather than the loudness of the voices, it makes it easier to analyze whether vocal communication is actively taking place.
[Configuration of speech analysis system S]
FIG. 2 is a block diagram of the speech analysis system S according to this embodiment. In FIG. 2, the arrows indicate the main data flows, and there may be data flows other than those shown in FIG. 2. FIG. 2 shows the configuration in functional units rather than in hardware (device) units. Therefore, the blocks shown in FIG. 2 may be implemented within a single device, or may be implemented separately in a plurality of devices. Data may be exchanged between the blocks via any means, such as a data bus, a network, or a portable storage medium.
The speech analysis device 1 includes a storage unit 11 and a control unit 12. The speech analysis device 1 may be configured by two or more physically separate devices connected by wire or wirelessly. The speech analysis device 1 may also be configured as a cloud, which is a collection of computer resources.
The storage unit 11 is a computer-readable, non-transitory storage medium including a ROM (Read Only Memory), a RAM (Random Access Memory), a hard disk drive, and the like. The storage unit 11 stores in advance the programs to be executed by the control unit 12. The storage unit 11 may be provided outside the speech analysis device 1, in which case it may exchange data with the control unit 12 via a network.
The control unit 12 includes a reception unit 121, a voice acquisition unit 122, an identification unit 123, an activity level determination unit 124, an output control unit 125, and a call control unit 126. The control unit 12 is a processor such as a CPU (Central Processing Unit), and functions as the reception unit 121, the voice acquisition unit 122, the identification unit 123, the activity level determination unit 124, the output control unit 125, and the call control unit 126 by executing the programs stored in the storage unit 11. At least some of the functions of the control unit 12 may be realized by the control unit 12 executing a program received via a network.
The processing executed by the speech analysis system S is described in detail below. FIG. 3 is a schematic diagram for explaining the relationship between the analysis target region R, the sound collection devices 2, and the local terminal 3. A plurality of sound collection devices 2 and one or more local terminals 3 are arranged in the analysis target region R.
The reception unit 121 receives settings of the analysis target region R, the positions of the sound collection devices 2 and the local terminal 3 within the analysis target region R, and the object region where an object (obstacle) such as a wall is located within the analysis target region R. The external terminal 4 receives, for example, an operation from an external user specifying the analysis target region R, the positions of the sound collection devices 2 and the local terminal 3, and the object region, and transmits information indicating the specified contents to the voice analysis device 1. In the voice analysis device 1, the reception unit 121 causes the storage unit 11 to store information associating the analysis target region R, the positions of the sound collection devices 2 and the local terminal 3, and the object region, based on the information received from the external terminal 4.
The reception unit 121 may also receive settings of sub-regions included in the analysis target region R. A sub-region is a region that is at least a part of the analysis target region R and is of interest during the analysis. In the example of FIG. 3, a coffee area including a coffee machine, a desk area including desks, a sofa area including sofas, and the like may be set as sub-regions. The external terminal 4 receives, for example, an operation from an external user specifying a sub-region within the analysis target region R and the name of the sub-region, and transmits information indicating the specified contents to the voice analysis device 1. In the voice analysis device 1, the reception unit 121 causes the storage unit 11 to store information associating the sub-region with the name of the sub-region, based on the information received from the external terminal 4.
The reception unit 121 may also receive settings of intervention conditions used to determine whether or not to output intervention information. An intervention condition is, for example, that the degree of activity corresponding to the length of utterance per unit time, determined by the activity level determination unit 124, is equal to or greater than a predetermined threshold. The intervention information is, for example, a message containing the name of the sub-region that satisfied the intervention condition. The external terminal 4 receives, for example, an operation from an external user specifying the intervention condition and the intervention information, and transmits information indicating the specified contents to the voice analysis device 1. In the voice analysis device 1, the reception unit 121 causes the storage unit 11 to store information associating the intervention condition with the intervention information, based on the information received from the external terminal 4.
The voice acquisition unit 122 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R. The sound collection device 2 transmits, for example, voice data representing the voices collected using the microphone array to the voice analysis device 1. The sound collection device 2 transmits the voice data to the voice analysis device 1 continuously, or transmits the voice data for a predetermined period (one hour, one day, etc.) to the voice analysis device 1 in a batch. In the voice analysis device 1, the voice acquisition unit 122 causes the storage unit 11 to store the voice data received from the sound collection devices 2 and acquires the voices indicated by the voice data.
The voice acquisition unit 122 may perform predetermined filtering processing on the acquired voices. For example, the voice acquisition unit 122 may remove, from the acquired voices, voices collected during a period different from the period associated in advance with the analysis target region R (such as the business hours of a company or public facility). The voice acquisition unit 122 may also remove, from the acquired voices, sounds other than those emitted by humans (such as sounds outside the frequency band corresponding to human voices). This allows the voice analysis device 1 to perform the analysis while excluding sounds that are not important to the analysis, improving the accuracy of the analysis results.
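For illustration, the two filters described above might be sketched as follows, assuming business hours of 9:00-18:00 and a 300-3400 Hz human-voice band, neither of which is specified in the disclosure:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def keep_business_hours(t_start_h: float, open_h: float = 9.0, close_h: float = 18.0) -> bool:
    """Keep a segment only if it was recorded inside the assumed business hours."""
    return open_h <= t_start_h < close_h

def bandpass_voice(x: np.ndarray, fs: float) -> np.ndarray:
    """Attenuate energy outside an assumed human-voice band (300-3400 Hz)."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, x)
```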
The voice acquisition unit 122 acquires the direction of arrival of the voice collected by each of the plurality of sound collection devices 2 for each time interval (for example, every 10 to 1000 milliseconds). The voice acquisition unit 122 performs, for example, known sound source localization processing on the multi-channel voice collected by the microphone array of the sound collection device 2. Sound source localization processing is processing that estimates the position of the sound source included in the voice acquired by the voice acquisition unit 122. Through the sound source localization processing, the voice acquisition unit 122 acquires a reliability distribution indicating, with reference to the position of the sound collection device 2, the distribution of the reliability that a sound source is present. The reliability is a value corresponding to the likelihood that a sound source is present at that position, and may be, for example, a probability. The reliability distribution represents the direction of arrival of the voice with respect to the sound collection device 2.
FIG. 4A is a schematic diagram for explaining how the voice acquisition unit 122 acquires the direction of arrival of voice. The example in FIG. 4A shows the reliability distributions P acquired by the voice acquisition unit 122 based on the voices collected by each of three sound collection devices 2.
The vertical and horizontal axes of the reliability distribution P correspond to coordinates within the analysis target region R. In the reliability distribution P, the brighter the color of a position (the closer to white), the higher the reliability that a sound source is present there, and the darker the color of a position (the closer to black), the lower the reliability that a sound source is present there.
Since a microphone array cannot determine the distance from the sound collection device 2 to the sound source, regions of equal reliability are distributed linearly (radially) from the sound collection device 2 in the reliability distribution P. Because the reliability that a sound source is present is high along the straight line connecting the sound collection device 2 and the sound source, a linear region in the reliability distribution P where the reliability is equal to or greater than a predetermined value indicates the direction of arrival D of the voice with respect to the sound collection device 2. The direction of arrival D is not limited to a straight line passing through the position of the sound collection device 2, and may be expressed as a region having a width of a predetermined angle or length with reference to the position of the sound collection device 2.
In this embodiment, the voice analysis device 1 estimates the direction of arrival D, but each of the plurality of sound collection devices 2 may instead estimate the direction of arrival D based on the voice it acquires using its microphone array. In this case, in the voice analysis device 1, the voice acquisition unit 122 receives, from each of the plurality of sound collection devices 2, information indicating the direction of arrival D estimated by that sound collection device 2.
The identification unit 123 identifies, for each time interval (for example, every 10 to 1000 milliseconds), the utterance position, which is the position within the analysis target region R where an utterance was made, based on the plurality of directions of arrival D with respect to the plurality of sound collection devices 2. FIG. 4B is a schematic diagram for explaining how the identification unit 123 identifies the utterance position.
The identification unit 123 superimposes the plurality of reliability distributions P generated from the voices collected by the plurality of sound collection devices 2. The identification unit 123 superimposes the plurality of reliability distributions P by, for example, calculating the sum or product of the reliabilities indicated by the plurality of reliability distributions P at each position within the analysis target region R. FIG. 4B shows a reliability distribution P1 generated by superimposing the three reliability distributions P illustrated in FIG. 4A.
The identification unit 123 identifies the utterance position using the reliability distribution P1 obtained by superimposing the plurality of reliability distributions P. The utterance position may be represented by a single point within the analysis target region R, or by a region within the analysis target region R. The identification unit 123 identifies, for example, a position or region in the reliability distribution P1 where the reliability is equal to or greater than a predetermined value as the utterance position.
A position where the plurality of directions of arrival D indicated by the plurality of reliability distributions P intersect is a position of high reliability in the reliability distribution P1 obtained by superimposing the plurality of reliability distributions P. Therefore, the identification unit 123 may identify the intersection position D1 where a plurality of straight lines along the plurality of directions of arrival D intersect as the utterance position. When the direction of arrival D is a region having a width, the intersection position D1 may be a region where a plurality of regions extending along the plurality of directions of arrival D intersect.
In this way, since the voice analysis device 1 identifies the utterance position based on the directions of arrival D of the voice with respect to the plurality of sound collection devices 2, it can identify the utterance position with high accuracy even though the distance from a single sound collection device 2 to the sound source cannot be determined.
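A sketch of the superimposition described above, assuming each direction of arrival D is rasterized as a Gaussian ridge of reliability along a ray and the distributions are combined by their product; the rasterization itself is an assumption:

```python
import numpy as np

def ray_reliability(shape, mic_xy, angle_rad, sigma=1.5) -> np.ndarray:
    """Rasterize one direction of arrival D as a grid of reliabilities that
    decay with distance from the ray through the sound collection device."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - mic_xy[0], ys - mic_xy[1]
    # Distance along the ray (forward side only) and perpendicular to it.
    along = dx * np.cos(angle_rad) + dy * np.sin(angle_rad)
    perp = -dx * np.sin(angle_rad) + dy * np.cos(angle_rad)
    rel = np.exp(-(perp ** 2) / (2 * sigma ** 2))
    return np.where(along > 0, rel, 0.0)

def utterance_position(reliability_maps: list[np.ndarray]) -> tuple[int, int]:
    """Superimpose per-device maps (here by product) and return the peak cell."""
    combined = np.prod(np.stack(reliability_maps), axis=0)
    y, x = np.unravel_index(np.argmax(combined), combined.shape)
    return x, y
```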
The identification unit 123 may identify the utterance position in consideration of the object region where an object is located within the analysis target region R, as received by the reception unit 121. FIG. 5 is a schematic diagram for explaining the relationship between the direction of arrival D and the object region R2. The example in FIG. 5 shows a state in which the object region R2 lies partway along the direction of arrival D.
When a straight line along the direction of arrival D intersects the object region R2, the identification unit 123 identifies the utterance position while excluding the portion of the direction of arrival D that is farther than the object region R2 with reference to the position of the sound collection device 2. The identification unit 123 identifies as the utterance position, for example, the intersection position D1 where a line segment between the sound collection device 2 and the object region R2 along a first direction of arrival among the plurality of directions of arrival D intersects a line segment between another sound collection device 2 and the object region R2 along a second direction of arrival among the plurality of directions of arrival D. This prevents the voice analysis device 1 from erroneously recognizing that a sound source is located beyond an obstacle such as a wall, improving the accuracy of the utterance position.
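The exclusion of the portion beyond the object region R2 can be sketched as clipping each ray at the first object cell, assuming a boolean grid `objects` marking the object region:

```python
import numpy as np

def clip_ray_at_object(reliability: np.ndarray, mic_xy: tuple[float, float],
                       angle_rad: float, objects: np.ndarray) -> np.ndarray:
    """Zero the reliability of cells farther along the ray than the first
    object cell, walking outward from the sound collection device."""
    out = reliability.copy()
    h, w = out.shape
    x, y = mic_xy
    blocked = False
    for _ in range(max(h, w) * 2):               # step cell by cell along the ray
        x += np.cos(angle_rad) * 0.5
        y += np.sin(angle_rad) * 0.5
        ix, iy = int(round(x)), int(round(y))
        if not (0 <= ix < w and 0 <= iy < h):
            break
        if blocked:
            out[iy, ix] = 0.0                    # beyond the object region: exclude
        elif objects[iy, ix]:
            blocked = True                       # hit the object region R2
    return out
```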
In addition to identifying the utterance position, the identification unit 123 may estimate the number of users who spoke at the utterance position. The identification unit 123 performs processing to emphasize the voice from the direction of arrival D on each of the plurality of voices that the voice acquisition unit 122 acquired from the plurality of sound collection devices 2. The identification unit 123 emphasizes the voice from the direction of arrival D, for example, by suppressing sound input to the microphone array of the sound collection device 2 from directions different from the direction of arrival D.
The identification unit 123 recognizes one or more speakers who uttered each of the plurality of voices by performing known speaker recognition processing on each of the plurality of voices in which the voice from the direction of arrival D has been emphasized. The identification unit 123 recognizes one or more speakers corresponding to one or more generated clusters, for example, by clustering the voice divided into predetermined periods using deep learning.
The identification unit 123 then estimates, among the one or more speakers who uttered each of the plurality of voices, the one or more speakers common to all of the voices as the users who spoke at the utterance position. The identification unit 123 causes the storage unit 11 to store, for each time interval, information associating the utterance position with the users who spoke at the utterance position. This allows the voice analysis device 1 to exclude speakers who spoke at positions different from the utterance position and to estimate the users who spoke at the utterance position with high accuracy.
The identification unit 123 may identify the utterance position using voices collected by a plurality of sound collection devices 2 each having a single microphone, instead of voices collected by a plurality of sound collection devices 2 each having a microphone array. In this case, the plurality of sound collection devices 2 are arranged at predetermined intervals in the analysis target region R. When a user speaks within the analysis target region R, each sound collection device 2 acquires the voice with higher intensity the closer it is to the user and with lower intensity the farther it is from the user.
The identification unit 123 compares the intensities of the voices acquired at the same time by the plurality of sound collection devices 2, and identifies the utterance position based on the position of the sound collection device 2 that acquired the voice with the highest intensity, or the positions of the plurality of sound collection devices 2 that acquired the voice with an intensity equal to or greater than a threshold. This allows the voice analysis device 1 to identify the utterance position even when using sound collection devices 2 that do not have a microphone array.
The identification unit 123 identifies the length of utterance per unit time at each position within the analysis target region R based on the utterance positions identified for each time interval. For example, for each position within the analysis target region R (for example, each rectangular region obtained by dividing the analysis target region R), the identification unit 123 totals the time during which an utterance position was present at that position within a unit time (for example, one minute). For example, if an utterance position was present at a certain position for 30 seconds of the minute preceding the current time, the length of utterance per unit time at that position is 30 seconds.
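A minimal sketch of this aggregation, assuming utterance positions are identified every 0.5 seconds and mapped to grid cells:

```python
from collections import defaultdict

FRAME_SEC = 0.5        # assumed interval at which utterance positions are identified
UNIT_TIME_SEC = 60.0   # unit time over which utterance length is totalled

def utterance_length_per_cell(events, now):
    """events: iterable of (timestamp_sec, cell) pairs, one per analysis frame.

    Returns {cell: seconds of utterance within the last unit time}."""
    totals = defaultdict(float)
    for t, cell in events:
        if now - UNIT_TIME_SEC <= t <= now:
            totals[cell] += FRAME_SEC   # each frame contributes its duration
    return dict(totals)
```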
 The activity determination unit 124 determines, for each position within the analysis target region R, an activity level corresponding to the length of utterance per unit time identified by the identification unit 123. For example, the activity determination unit 124 determines as the activity level a value that is larger the longer, and smaller the shorter, the length of utterance per unit time identified by the identification unit 123. The activity determination unit 124 may, for example, determine the value of the length of utterance per unit time itself as the activity level, or may determine as the activity level a value obtained by converting the length of utterance per unit time according to a predetermined rule.
 The activity determination unit 124 may also determine the activity level in consideration of the number of users who spoke at the utterance position. In this case, the activity determination unit 124 first calculates a provisional activity level, that is, a value that is larger the longer, and smaller the shorter, the length of utterance per unit time identified by the identification unit 123.
 The activity determination unit 124 then calculates the activity level by correcting the provisional activity level according to the number of people identified by the identification unit 123. For example, the activity determination unit 124 corrects the provisional activity level so that the activity level when there are multiple people is greater than the activity level when there is one person. The voice analysis device 1 can thereby reflect the number of people estimated from the voices in the activity level.
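 A minimal sketch of the headcount correction follows; the boost factor is an illustrative assumption, since the disclosure only requires that multiple speakers yield a larger activity level than a single speaker.

```python
# Provisional activity is the utterance length per unit time, boosted when
# more than one speaker is present at the position.
def activity_level(utterance_sec_per_min, n_speakers, multi_speaker_boost=1.5):
    provisional = utterance_sec_per_min          # longer speech -> larger value
    if n_speakers >= 2:
        return provisional * multi_speaker_boost  # conversations rank above monologues
    return provisional

print(activity_level(30.0, 1))  # 30.0
print(activity_level(30.0, 3))  # 45.0
```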
 The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the activity level determined by the activity determination unit 124. For example, the output control unit 125 generates, as the map information, a heat map in which information (color, pattern, etc.) corresponding to the activity level at each position within the analysis target region R is superimposed on a map representing the analysis target region R. In addition to the activity level at each position within the analysis target region R, the output control unit 125 may generate map information that also indicates the positions of the plurality of sound collection devices 2 arranged in the analysis target region R. The output control unit 125 transmits the generated map information to at least one of the local terminal 3 and the external terminal 4.
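 A minimal rendering sketch follows, assuming the activity levels are already aggregated onto a grid; the random data, grid size, and color map are illustrative assumptions, not the disclosed format.

```python
import numpy as np
import matplotlib.pyplot as plt

# Heat map of activity per grid cell, with sound collection device
# positions overlaid as markers.
activity = np.random.rand(10, 15)               # activity per grid cell
device_positions = [(2, 3), (7, 11), (5, 6)]    # (row, col) of each device

fig, ax = plt.subplots()
im = ax.imshow(activity, cmap="hot", origin="lower")  # color encodes activity
ax.scatter([c for _, c in device_positions],
           [r for r, _ in device_positions],
           marker="^", c="cyan", label="sound collection device")
fig.colorbar(im, label="activity level")
ax.legend()
plt.show()
```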
 It is desirable that the output control unit 125 repeatedly cause at least one of the local terminal 3 and the external terminal 4 to display map information indicating the activity levels determined by the activity determination unit 124 at predetermined time intervals. The voice analysis system S can thereby notify users or external users of the latest communication status in the analysis target region R.
 Whether a high activity level is regarded as positive or negative, and whether a low activity level is regarded as positive or negative, depends on the type of the analysis target region R. For example, in an analysis target region R where quietness is desirable (such as a library, or a school or cram school classroom where students should be quiet during classes or tests), a high activity level may be regarded as negative, or a low activity level may be regarded as positive.
 The output control unit 125 may also, in response to an activity level determined by the activity determination unit 124 satisfying a predetermined intervention condition, cause at least one of the local terminal 3 and the external terminal 4 to output intervention information associated with that intervention condition. For example, the output control unit 125 acquires from the storage unit 11 the intervention conditions and intervention information received by the reception unit 121. The output control unit 125 determines whether the activity level at each position within the analysis target region R satisfies an intervention condition (for example, whether it is equal to or greater than a threshold indicated by the intervention condition).
 In response to the activity level at any position within the analysis target region R satisfying an intervention condition, the output control unit 125 generates the intervention information associated with that intervention condition. For example, the output control unit 125 generates as intervention information a message including the name of the sub-region containing the position whose activity level satisfied the intervention condition ("The coffee area is lively", "Please be quiet in the library", etc.). Whether to generate intervention information with positive content (for example, information for promoting communication) or with negative content (for example, information for suppressing communication) when the activity level is high may be determined according to the type of the analysis target region R. As described above, in an analysis target region R where quietness is desirable, for example, a high activity level may be regarded as negative, and intervention information with negative content may be generated.
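 A minimal sketch of checking intervention conditions per sub-region and producing the associated message follows; the condition structure and message strings are illustrative assumptions.

```python
# Each rule pairs a sub-region with a threshold condition and the
# intervention message to emit when the condition is satisfied.
conditions = [
    {"sub_region": "coffee area", "op": "ge", "threshold": 40.0,
     "message": "The coffee area is lively"},
    {"sub_region": "library",     "op": "ge", "threshold": 10.0,
     "message": "Please be quiet in the library"},
]

def check_interventions(activity_by_sub_region):
    out = []
    for c in conditions:
        level = activity_by_sub_region.get(c["sub_region"], 0.0)
        if (c["op"] == "ge" and level >= c["threshold"]) or \
           (c["op"] == "le" and level <= c["threshold"]):
            out.append(c["message"])
    return out

print(check_interventions({"coffee area": 55.0, "library": 3.0}))
# ['The coffee area is lively']
```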
 When the analysis target region R is the inside of a store, the output control unit 125 may, for example, in response to the activity level at a position within the analysis target region R satisfying an intervention condition, generate as intervention information a message instructing an employee of the store to move to that position ("Please move near the hat section", etc.). The voice analysis system S can thereby automatically instruct staff to be placed where many conversations are taking place in the store, making it easier to explain products and promote sales.
 The intervention condition is not limited to the activity level becoming equal to or greater than a predetermined threshold; instead of or in addition to this, the activity level becoming equal to or less than a predetermined threshold may be used. In this case too, whether to generate intervention information with positive content (for example, praise for staying quiet) or with negative content (for example, a reminder that quiet has not been maintained) when the activity level is low can be determined according to the type of the analysis target region R. The output control unit 125 then transmits the generated intervention information to at least one of the local terminal 3 and the external terminal 4.
 The output control unit 125 may cause all of the local terminals 3 to output the intervention information. Alternatively, the output control unit 125 may cause only those local terminals 3, among the plurality of local terminals 3, that are located within the sub-region containing the position that satisfied the intervention condition to output the intervention information. The voice analysis system S can thereby direct the intervention information to users near the position whose activity level satisfied the intervention condition.
 FIG. 6A is a schematic diagram of the local terminal 3 displaying map information and intervention information. The local terminal 3 displays the map information and intervention information received from the voice analysis device 1 on its display unit. In the example of FIG. 6A, the local terminal 3 displays a heat map H, which is the map information, and a message M, which represents the intervention information, on the display unit. The external terminal 4 may similarly display the heat map H and the message M on its display unit.
 FIG. 6B is a schematic diagram of the local terminal 3 outputting intervention information as audio. The local terminal 3 outputs, from its audio output unit, a voice V representing the intervention information received from the voice analysis device 1. The voice V may be generated by the output control unit 125 of the voice analysis device 1, or may be generated by the local terminal 3.
 In this way, by visualizing the length of utterance at each position within the analysis target region R as map information, the voice analysis system S can make it easier to analyze whether voice communication is actively taking place within the analysis target region R. Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target region R so as to promote or suppress it.
 The output control unit 125 may change the content of the intervention information depending on the person present within the analysis target region R. In this case, the intervention information is associated in advance with, for example, a person or an attribute of a person (age, gender, clothing, etc.). The output control unit 125 recognizes a person within the analysis target region R, for example, by performing known person recognition processing on a captured image of the surroundings of the local terminal 3 acquired by a camera of the local terminal 3. The output control unit 125 may recognize a person anywhere in the analysis target region R, or may recognize only persons in a specific sub-region. Then, in response to an intervention condition being satisfied, the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with the recognized person or the person's attribute. The voice analysis system S can thereby output intervention information suited to the persons within the analysis target region R.
 The output control unit 125 may cause the external terminal 4 to display comparison information for comparing activity levels in different periods within the analysis target region R. In this case, the reception unit 121 receives from the external terminal 4 a designation of the sub-region to be compared. The reception unit 121 may also receive from the external terminal 4 a designation of the periods to be compared. The output control unit 125 generates comparison information that associates the activity level of the designated sub-region in a first period with the activity level of that sub-region in a second period. The output control unit 125 transmits the generated comparison information to at least one of the local terminal 3 and the external terminal 4.
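 A minimal sketch of building the comparison result, such as the message M described for FIG. 7A below, follows; the per-period data and message format are illustrative assumptions.

```python
# Compare a designated sub-region's mean activity in two periods and
# format the change as a message.
def comparison_message(sub_region, activity_period1, activity_period2):
    a1 = sum(activity_period1) / len(activity_period1)
    a2 = sum(activity_period2) / len(activity_period2)
    change_pct = (a2 - a1) / a1 * 100.0
    direction = "increased" if change_pct >= 0 else "decreased"
    return f"Activity in {sub_region} {direction} by {abs(change_pct):.0f}%"

print(comparison_message("coffee area", [20.0, 25.0], [30.0, 36.0]))
# Activity in coffee area increased by 47%
```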
 FIGS. 7A and 7B are schematic diagrams of the external terminal 4 displaying comparison information. The external terminal 4 displays the comparison information received from the voice analysis device 1. In the example of FIG. 7A, the external terminal 4 displays, as the comparison information, heat maps H for the first period and the second period, and a message M representing the result of comparing the activity levels of the designated sub-region in the first and second periods. In the heat maps H, the designated sub-region is highlighted within the entire analysis target region R. The message M is, for example, a message representing the amount or rate of increase or decrease in activity level between the first period and the second period in the sub-region.
 In the example of FIG. 7B, the external terminal 4 displays, as the comparison information, a heat map H1 for the designated sub-region or the entire analysis target region R, and a message M representing the result of comparing activity levels across multiple periods in the designated sub-region or the entire analysis target region R.
 Unlike the heat map H illustrated in FIGS. 6A and 7A, which represents activity levels on a map, the heat map H1 represents, for each time period, information (color, pattern, etc.) corresponding to the activity level of the sub-region or the entire analysis target region R. The heat map H1 thus visualizes differences in activity level between multiple time periods in the same area. The message M is, for example, a message representing the time periods in which the activity level of the sub-region or the entire analysis target region R is high or low.
 In this way, by associating and visualizing the activity levels of different periods, the voice analysis system S can make it easier to analyze increases and decreases in activity level, and trends in activity level by time period.
 The output control unit 125 may cause the external terminal 4 to output past audio at a designated position. In this case, the reception unit 121 receives, on the external terminal 4 displaying the map information or comparison information, a designation of a position within the analysis target region R and a past period. The output control unit 125 acquires from the storage unit 11 the audio, among the audio acquired by the voice acquisition unit 122, corresponding to the designated position and period, and causes the audio output unit of the external terminal 4 to output it. The voice analysis system S can thereby make it easier to analyze the relationship between the activity level and the actual content of the speech.
 The output control unit 125 may cause at least one of the local terminal 3 and the external terminal 4 to display movement information including the trajectory of movement of the utterance position. In this case, the identification unit 123, for example, identifies the temporal change of the utterance positions identified for each time as the trajectory of movement of the utterance position. The identification unit 123 acquires from the storage unit 11 the information, generated by the speaker recognition processing described above, that associates, for each time, the utterance position with the user who spoke at that position. The identification unit 123 then identifies, based on the acquired information, the trajectory of movement of the utterance position corresponding to a specific user (speaker). The output control unit 125 transmits the movement information including the trajectory of movement identified by the identification unit 123 to at least one of the local terminal 3 and the external terminal 4.
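 A minimal sketch of extracting one speaker's trajectory from the stored (time, position, user) associations follows; the record format is an illustrative assumption.

```python
# Stored associations of time, utterance position, and recognized user.
records = [
    (0, (1.0, 2.0), "A"), (1, (1.5, 2.0), "A"),
    (1, (4.0, 4.0), "B"), (2, (2.0, 2.5), "A"),
]

def trajectory_of(user, records):
    # Sort by time and keep only the designated speaker's utterance positions.
    return [pos for t, pos, u in sorted(records) if u == user]

print(trajectory_of("A", records))  # [(1.0, 2.0), (1.5, 2.0), (2.0, 2.5)]
```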
 FIG. 8 is a schematic diagram of the external terminal 4 displaying movement information. The external terminal 4 displays the movement information received from the voice analysis device 1 on its display unit. In the example of FIG. 8, the external terminal 4 displays the movement trajectory T indicated by the movement information on the display unit. The local terminal 3 may similarly display the movement trajectory T on its display unit. The voice analysis system S can thereby make it easier to analyze how speakers move and communicate within the analysis target region R.
 After the output control unit 125 has caused the external terminal 4 to display the map information, the call control unit 126 may start a call between the external terminal 4 and a local terminal 3 selected on the map information. In this case, the output control unit 125 causes the external terminal 4 to display, as the map information, for example, a heat map in which information (color, pattern, etc.) corresponding to the activity level at each position within the analysis target region R, and information (icons, etc.) indicating the positions of one or more local terminals 3 arranged in the analysis target region R, are superimposed on a map representing the analysis target region R.
 The reception unit 121 receives, in the map information displayed on the external terminal 4, a selection of one of the one or more local terminals 3 as the call destination. For example, in order to support communication in the analysis target region R from outside it, the external user selects a local terminal 3 placed at a location with a low activity level in the map information. In response to one of the local terminals 3 being selected, the call control unit 126 starts the exchange of audio between the selected local terminal 3 and the external terminal 4. The local terminal 3 functions as a call terminal for conducting a call with the external terminal 4: it outputs the audio received from the external terminal 4 from an audio output unit such as a speaker, and transmits the audio input to a sound collection unit such as the microphone of the local terminal 3 to the external terminal 4. The call control unit 126 may have audio exchanged bidirectionally between the selected local terminal 3 and the external terminal 4, or may have audio output unidirectionally from the external terminal 4 to the local terminal 3.
 The voice analysis system S can thereby make it easier for an external user who wishes to talk with a local terminal 3 to select the call destination based on the activity level. By intervening by voice from outside through a local terminal 3 in the analysis target region R, the external user can help invigorate communication in the analysis target region R.
[Flowchart of the voice analysis method]
 FIG. 9 is a flowchart of an exemplary voice analysis method executed by the voice analysis device 1 according to the present embodiment. The reception unit 121 receives from the external terminal 4 the settings of the analysis target region R, the positions of the sound collection devices 2 and local terminals 3 within the analysis target region R, and the object region where an object such as a wall is located within the analysis target region R (S11).
 The voice acquisition unit 122 acquires the audio collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R (S12). The voice acquisition unit 122 acquires, for each time, the direction of arrival D of the audio collected by each of the plurality of sound collection devices 2 (S13). The direction of arrival D may be estimated by the voice analysis device 1, or by each of the plurality of sound collection devices 2.
 The identification unit 123 identifies, for each time, the utterance position, which is the position where an utterance was made within the analysis target region R, based on the plurality of directions of arrival D with respect to the plurality of sound collection devices 2 (S14). Here, the identification unit 123 may identify the utterance position in consideration of the object region, received by the reception unit 121, where an object is located within the analysis target region R.
 The identification unit 123 identifies the length of utterance per unit time at each position within the analysis target region R, based on the utterance positions it identified for each time (S15). The activity determination unit 124 determines, for each position within the analysis target region R, the activity level corresponding to the length of utterance per unit time identified by the identification unit 123 (S16). The activity level is, for example, a value that is larger the longer, and smaller the shorter, the length of utterance per unit time identified by the identification unit 123.
 The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output map information that associates each position within the analysis target region R with the activity level determined by the activity determination unit 124 (S17). The output control unit 125 may also cause the external terminal 4 to display comparison information for comparing activity levels in different periods within the analysis target region R.
 When an activity level determined by the activity determination unit 124 satisfies a predetermined intervention condition (YES in S18), the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with that intervention condition (S19). When the activity levels determined by the activity determination unit 124 satisfy no predetermined intervention condition (NO in S18), the voice analysis device 1 ends the process.
[Effects of the present embodiment]
 The voice analysis system S according to the present embodiment identifies the length of utterance per unit time at each position within the analysis target region R based on the audio acquired by the sound collection devices 2 arranged in the analysis target region R, and outputs the activity level corresponding to the length of utterance in association with each position within the analysis target region R. Because the voice analysis system S can thereby visualize not the loudness of the voices but the length of utterance at each position within the analysis target region R, it can make it easier to analyze whether voice communication is actively taking place.
 Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target region R so as to promote or suppress it. In addition, by associating and visualizing the activity levels of different periods, the voice analysis system S can make it easier to analyze increases and decreases in activity level, and trends in activity level by time period.
[First modification]
 In the embodiment described above, the voice analysis system S analyzes voices uttered by humans in closed spaces such as companies and public facilities; however, the voice analysis system S may also analyze sounds uttered not only by humans but also by animals such as monkeys and birds in open spaces such as parks.
 In this case, in the voice analysis device 1, the reception unit 121 receives a setting designating the open space as the analysis target region R. The voice acquisition unit 122 acquires the sounds uttered by the animals, collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R, which is an open space. The voice analysis device 1 then identifies the length of utterance at each position in the same manner as in the embodiment described above, and outputs information corresponding to the activity level corresponding to the length of utterance.
 In this way, the voice analysis system S can also make it easier to analyze whether communication among animals, not limited to humans, is actively taking place in an open space.
[Second modification]
 The voice analysis system S according to this modification displays a heat map of the activity level for each type of voice, such as laughter or screaming, and outputs intervention information based on the activity level for each type of voice. The parts that differ from the embodiment described above are mainly described below.
 After determining the activity level at each position within the analysis target region R, the activity determination unit 124 determines which of a plurality of voice types the voice used to determine that activity level belongs to. The voice types are, for example, laughter, screaming, voices expressing a specific emotion (such as anger), calls of a specific animal (such as deer or bears), abnormal machine noises, and the like.
 For example, the activity determination unit 124 acquires a machine learning model, stored in advance in the storage unit 11, that outputs information indicating which of the plurality of voice types an input voice belongs to. The machine learning model is configured by, for example, a neural network, and is generated by performing known machine learning processing using a plurality of voices and their voice types as training data.
 The activity determination unit 124 inputs the voice acquired by the voice acquisition unit 122 into the acquired machine learning model, and determines the voice type output by the machine learning model as the voice type of the voice used to determine the activity level.
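 A minimal classification sketch follows; the feature extraction and the model interface (a `predict_proba` over a fixed class list) are illustrative assumptions rather than a specific library's API or the disclosed model.

```python
import numpy as np

VOICE_TYPES = ["laughter", "scream", "anger", "animal", "machine_noise"]

def classify_voice(waveform, model):
    # Crude spectral features; a real system would use a proper front end.
    features = np.abs(np.fft.rfft(waveform))[:128]
    probs = model.predict_proba(features[np.newaxis, :])[0]
    return VOICE_TYPES[int(np.argmax(probs))]

# Usage, assuming any trained classifier exposing predict_proba over the
# five classes above:
# voice_type = classify_voice(waveform, trained_model)
```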
 Instead of the voice analysis device 1, the sound collection device 2 may determine the voice type. In this case, the sound collection device 2 obtains the voice type output by the machine learning model, for example, by inputting the collected sound into the machine learning model described above, stored in advance in the storage unit of the sound collection device 2. The sound collection device 2 transmits, to the voice analysis device 1, information indicating the voice type of the sound together with the audio data representing the collected sound. In the voice analysis device 1, the activity determination unit 124 determines the voice type of the voice used to determine the activity level based on the information indicating the voice type received from the sound collection device 2.
 The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the activity level determined using voices of one of the plurality of voice types. Map information such as a heat map corresponding to one voice type is referred to as one layer.
 For example, the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information representing the activity levels determined using voices of the voice type designated on that terminal. In this case, at least one of the local terminal 3 and the external terminal 4 displays the layer corresponding to the single designated voice type, based on the information received from the voice analysis device 1.
 The output control unit 125 may also, for example, cause at least one of the local terminal 3 and the external terminal 4 to display a list of map information representing the activity levels determined using the voices of each of the plurality of voice types. In this case, at least one of the local terminal 3 and the external terminal 4 displays a plurality of layers corresponding to the plurality of voice types, based on the information received from the voice analysis device 1.
 FIGS. 10A and 10B are schematic diagrams of map information representing activity levels determined using voices of one of a plurality of voice types. FIG. 10A shows, as map information, a heat map H2 corresponding to activity levels determined using voices whose voice type is "laughter". FIG. 10B shows, as map information, a heat map H3 corresponding to activity levels determined using voices whose voice type is "scream".
 In this way, by visualizing the activity level for each voice type as map information, the voice analysis system S can make it easier to analyze what types of communication are actively taking place within the analysis target region R.
 The output control unit 125 may, in response to an activity level for a voice type determined by the activity determination unit 124 satisfying a predetermined intervention condition, cause at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with that intervention condition. In this case, the output control unit 125 acquires from the storage unit 11, for example, the intervention conditions and intervention information stored in advance for each voice type. An intervention condition may or may not be set for each of the plurality of voice types.
 For example, the output control unit 125 may output no intervention information when the voice type is "laughter", and output intervention information when the voice type is "scream". The output control unit 125 may also, for example, output intervention information for promoting communication in response to the activity level being equal to or less than a predetermined threshold when the voice type is "laughter", and output intervention information for suppressing communication in response to the activity level being equal to or greater than a predetermined threshold when the voice type is "scream".
 The output control unit 125 determines whether the activity level of each of the plurality of voice types at each position within the analysis target region R satisfies the intervention condition for that voice type (for example, whether it is equal to or greater than the threshold indicated by the intervention condition). In response to the activity level for a voice type at any position within the analysis target region R satisfying an intervention condition, the output control unit 125 generates the intervention information associated with that intervention condition. The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the generated intervention information.
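 A minimal sketch of per-voice-type intervention rules follows; the rule contents and messages are illustrative assumptions, echoing the laughter/scream example above, and a type may have no rule at all.

```python
rules = {
    "laughter": {"op": "le", "threshold": 5.0,
                 "message": "Things are quiet - how about a chat?"},
    "scream":   {"op": "ge", "threshold": 20.0,
                 "message": "Please calm down in this area"},
    # "animal": no intervention condition set for this voice type
}

def intervention_for(voice_type, level):
    rule = rules.get(voice_type)
    if rule is None:
        return None  # no condition set for this type
    hit = level >= rule["threshold"] if rule["op"] == "ge" else level <= rule["threshold"]
    return rule["message"] if hit else None

print(intervention_for("scream", 25.0))    # Please calm down in this area
print(intervention_for("laughter", 30.0))  # None
```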
 The voice analysis system S can thereby switch whether and how to intervene for each voice type, by outputting intervention information in response to the activity level satisfying the condition set for that voice type.
[Third modification]
 The voice analysis system S according to this modification determines the activity level using a captured image generated by an imaging device imaging the analysis target region R, and also controls the orientation of the imaging device. The parts that differ from the embodiment described above are mainly described below.
 FIG. 11 is a block diagram of the voice analysis system S according to this modification. The control unit 12 of the voice analysis device 1 according to this modification further includes an image acquisition unit 127 in addition to the units shown in FIG. 2.
 The image acquisition unit 127 acquires, from the imaging device 5, a captured image generated by imaging a part of the analysis target region R. The imaging device 5 is arranged, for example, at a position from which it can image a part of the analysis target region R. The imaging device 5 has a drive unit for changing the orientation of the imaging device 5. By changing its orientation using the drive unit, the imaging device 5 can image any position within the analysis target region R. The imaging device 5 may change its orientation according to control information output from the output control unit 125 described below, or may change its orientation automatically.
 The imaging device 5 generates a captured image by imaging a part of the analysis target region R (that is, a captured image whose imaging range is the direction in which the imaging device 5 is facing), periodically or in response to receiving an imaging instruction from the voice analysis device 1, and transmits the generated captured image to the voice analysis device 1 by wired or wireless communication. In the voice analysis device 1, the image acquisition unit 127 receives the captured image from the imaging device 5 and stores the received captured image in the storage unit 11.
 The output control unit 125 outputs control information for changing the orientation of the imaging device 5 based on the activity levels determined by the activity determination unit 124. FIG. 12 is a schematic diagram for explaining the process of changing the orientation of the imaging device 5 based on the activity levels. FIG. 12 schematically shows the position and orientation of the imaging device 5 on the heat map H of the analysis target region R.
 The output control unit 125 identifies, as a target position L, a position within the analysis target region R whose activity level satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold). The output control unit 125 transmits to the imaging device 5 control information for pointing the imaging device 5 at the target position L. The control information is, for example, information indicating the position of the target position L relative to the position of the imaging device 5. According to the control information received from the voice analysis device 1, the imaging device 5 changes its orientation toward the target position L using the drive unit. The imaging device 5 generates a captured image by imaging the direction in which it is facing as the imaging range, and transmits the generated captured image to the voice analysis device 1 by wired or wireless communication.
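 A minimal sketch of turning a target position into a pan command follows, assuming 2-D coordinates and a camera heading in degrees; the names and angle conventions are illustrative assumptions.

```python
import math

# Compute the rotation needed to point the camera at the target position L,
# given the camera's position and current heading.
def pan_command(camera_pos, camera_heading_deg, target_pos):
    dx = target_pos[0] - camera_pos[0]
    dy = target_pos[1] - camera_pos[1]
    bearing = math.degrees(math.atan2(dy, dx))                 # bearing of target
    delta = (bearing - camera_heading_deg + 180) % 360 - 180   # shortest rotation
    return delta  # degrees to rotate (positive = counterclockwise)

print(pan_command(camera_pos=(0, 0), camera_heading_deg=90, target_pos=(3, 3)))
# -45.0 -> rotate 45 degrees clockwise toward the target position L
```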
 When there are multiple target positions L within the analysis target region R that satisfy the predetermined condition, the output control unit 125 sequentially transmits control information for pointing the imaging device 5 at the next target position L after the imaging device 5 has finished imaging one target position L.
 Thus, even when the imaging device 5 cannot image the entire analysis target region R at once, the voice analysis system S can point the imaging device 5 at positions whose activity level satisfies the predetermined condition and have it image them, and can therefore preferentially acquire captured images of positions within the analysis target region R that are likely to be of high importance or urgency.
 Instead of or in addition to the control information for pointing the imaging device 5 at the target position L, the output control unit 125 may output control information for moving a mobile device such as a robot or drone to the target position L. In this case, the mobile device moves to the target position L according to the control information received from the voice analysis device 1. The voice analysis system S can thereby automatically dispatch robots, drones, and the like to positions within the analysis target region R that are likely to be of high importance or urgency.
 The activity determination unit 124 may identify the number of users at each position within the analysis target region R based on the captured image acquired by the image acquisition unit 127, and reflect the identified number in the information output by the output control unit 125.
 For example, the activity determination unit 124 calculates the activity level per user. In this case, the activity determination unit 124 identifies the number of users at each position within the analysis target region R, for example, by performing known image recognition processing on the captured image. The activity determination unit 124 may identify only the number of users at positions whose activity level satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold).
 The activity determination unit 124 may also identify the number of users at each position within the analysis target region R using the positions of communication devices, such as smartphones, carried by the users, instead of the captured image. In this case, the activity determination unit 124 acquires, for example, from a communication device carried by a user, information indicating the reception strengths of the signals emitted by each of a plurality of transmitters, such as beacons, arranged in the analysis target region R. The activity determination unit 124 identifies the position of the user using the relationship between the positions of the plurality of transmitters and the reception strengths, at the user's communication device, of the signals emitted by each of those transmitters. The activity determination unit 124 then identifies the number of users at each position within the analysis target region R by aggregating the identified positions of the one or more users.
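 One simple way to realize this position estimate is a centroid of beacon positions weighted by received strength; the sketch below assumes normalized strength weights, whereas real systems often fit a path-loss model instead.

```python
import numpy as np

# Estimate a user's position from beacon signal strengths as a weighted
# centroid of the beacon positions.
def estimate_user_position(beacon_positions, rssi_weights):
    pos = np.asarray(beacon_positions, dtype=float)  # (n_beacons, 2)
    w = np.asarray(rssi_weights, dtype=float)
    w = w / w.sum()
    return pos.T @ w  # weighted centroid

print(estimate_user_position([(0, 0), (10, 0), (0, 10)], [0.7, 0.2, 0.1]))
# [2. 1.] -> close to the strongest beacon at (0, 0)
```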
 The activity determination unit 124 is not limited to the specific methods shown here, and may identify the number of users at each position within the analysis target region R by other methods.
 The activity determination unit 124 calculates the activity level per person by dividing the activity level at each position by the number of people identified for that position. The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the activity level per person. The voice analysis system S can thereby visualize the degree of liveliness per user, instead of the degree of liveliness of an entire group including multiple users.
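 A minimal sketch of the per-person activity follows, which also applies the exclusion of positions with zero or one user described in the next paragraph; the data values are illustrative assumptions.

```python
# Divide each cell's activity by its headcount, dropping cells with fewer
# than two people (soliloquies, machine sounds, natural sounds, etc.).
activity_by_cell = {(0, 0): 40.0, (0, 1): 12.0, (1, 1): 9.0}
headcount_by_cell = {(0, 0): 4, (0, 1): 1, (1, 1): 0}

per_person = {
    cell: level / headcount_by_cell[cell]
    for cell, level in activity_by_cell.items()
    if headcount_by_cell.get(cell, 0) >= 2
}
print(per_person)  # {(0, 0): 10.0}
```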
 The output control unit 125 may also cause at least one of the local terminal 3 and the external terminal 4 to display map information that associates the activity levels determined by the activity determination unit 124 with the positions within the analysis target region R, excluding positions where the number of users is zero or one. The voice analysis system S can thereby display as map information the activity levels excluding users talking to themselves, mechanical sounds, natural sounds, and the like, making the map information easier to view.
 Although the present invention has been described above using the embodiments, the technical scope of the present invention is not limited to the scope described in the above embodiments, and various modifications and changes are possible within the scope of its gist. For example, all or part of the device can be functionally or physically distributed or integrated in arbitrary units. New embodiments arising from arbitrary combinations of multiple embodiments are also included in the embodiments of the present invention. The effects of a new embodiment arising from a combination include the effects of the original embodiments.
 The processor of the voice analysis device 1 executes each step (process) included in the voice analysis method shown in FIG. 9. That is, the processor of the voice analysis device 1 executes the voice analysis method shown in FIG. 9 by executing a program for executing that method. Some of the steps included in the voice analysis method shown in FIG. 9 may be omitted, the order of the steps may be changed, and multiple steps may be performed in parallel.
S Voice analysis system
1 Voice analysis device
11 Storage unit
12 Control unit
121 Reception unit
122 Voice acquisition unit
123 Identification unit
124 Activity determination unit
125 Output control unit
126 Call control unit
127 Image acquisition unit
2 Sound collection device
3 Local terminal
4 External terminal
5 Imaging device

Claims (15)

  1.  A voice analysis device comprising:
     a voice acquisition unit that acquires the direction of arrival of a voice at each of a plurality of sound collection devices arranged in a predetermined region;
     an identification unit that identifies an utterance position where an utterance was made using the relationship among the plurality of directions of arrival with respect to the plurality of sound collection devices, and identifies the length of utterance per unit time at each position within the region using the relationship between the utterance position and that position; and
     an output control unit that causes an information terminal to display map information associating each position within the region with an activity level corresponding to the length of utterance per unit time at that position.
  2.  The voice analysis device according to claim 1, further comprising a reception unit that receives a setting of an object region where an object is located within the region,
     wherein, when a straight line along the direction of arrival intersects the object region, the identification unit identifies the position where the utterance was made by excluding the portion of the direction of arrival farther than the object region with respect to the position of the sound collection device.
  3.  The voice analysis device according to claim 1 or 2, wherein the map information is information in which information corresponding to the activity level is superimposed on a map representing the region.
  4.  The voice analysis device according to claim 3, wherein the map information is information in which information corresponding to the activity level and information indicating the positions of one or more call terminals arranged in the region are superimposed on a map representing the region,
     the voice analysis device further comprising a call control unit that, in response to one of the one or more call terminals being selected in the map information displayed on the information terminal, starts the exchange of audio between the selected call terminal and the information terminal.
  5.  The voice analysis device according to claim 1, wherein the output control unit outputs, to the information terminal, in response to the activity level at a position within the region satisfying a predetermined condition, intervention information associated with that condition.
  6.  The voice analysis device according to claim 5, further comprising a reception unit that receives, from the information terminal, settings of the condition and of the intervention information associated with the condition.
  7.  The voice analysis device according to claim 1 or 2, wherein the identification unit identifies the temporal change of the position where the utterance was made as a trajectory of movement of the position where the utterance was made, and
     the output control unit causes the information terminal to display information including the trajectory of movement.
  8.  The voice analysis device according to claim 1 or 2, wherein the output control unit causes the information terminal to display, in association with each other, the activity level in a first period of a sub-region that is at least a part of the region, and the activity level of the sub-region in a second period.
  9.  The specifying unit estimates the number of people who spoke at each position within the area by recognizing one or more speakers who uttered each of the plurality of voices acquired from the plurality of sound collection devices,
     and the device further comprises an activity determination unit that calculates a provisional activity level using the length of utterance per unit time and determines the activity level by correcting the provisional activity level according to the number of people.
     The voice analysis device according to claim 1 or 2.
  10.  The activity determination unit makes the activity level when the number of people is two or more larger than the activity level when the number of people is one.
     The voice analysis device according to claim 9.
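Claims 9 and 10 leave the exact correction unspecified; they only require that the activity level grow with the number of speakers. One possible correction, shown purely as an assumption, is a logarithmic bonus applied to the speech-time ratio:

```python
import math

def activity_level(speech_seconds, window_seconds, num_speakers):
    """Provisional activity = fraction of the window filled with speech,
    then boosted when more than one speaker took part."""
    provisional = speech_seconds / window_seconds
    # log(1) == 0, so a single speaker keeps the provisional value,
    # and each additional speaker raises the level (claim 10).
    return provisional * (1.0 + math.log(max(num_speakers, 1)))
```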
  11.  The output control unit repeatedly causes the information terminal to display the map information including the activity level determined at predetermined time intervals.
     The voice analysis device according to claim 1 or 2.
  12.  Further comprising an activity determination unit that determines to which of a plurality of types the voice used to determine the activity level belongs,
     wherein the output control unit causes the information terminal to display the map information associating each position in the area with the activity level determined using the voice of any one of the plurality of types.
     The voice analysis device according to claim 1 or 2.
  13.  The output control unit outputs, to an imaging device that images a part of the area, control information for directing the imaging device toward a position within the area where the activity level satisfies a predetermined condition.
     The voice analysis device according to claim 1 or 2.
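For claim 13, the control information can be as simple as a pan angle toward the grid cell whose activity exceeds a threshold. A sketch under assumed room-grid coordinates (all names and the threshold are hypothetical):

```python
import math

def camera_pan_angle(camera_pos, activity_grid, cell_size, threshold=0.8):
    """Return a pan angle (radians) toward the most active grid cell,
    or None when no cell meets the threshold.

    activity_grid: mapping from integer cell index (ix, iy) to activity level.
    """
    if not activity_grid:
        return None
    cell, level = max(activity_grid.items(), key=lambda item: item[1])
    if level < threshold:
        return None
    # Centre of the winning cell in room coordinates.
    cx = (cell[0] + 0.5) * cell_size
    cy = (cell[1] + 0.5) * cell_size
    return math.atan2(cy - camera_pos[1], cx - camera_pos[0])
```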
  14.  A voice analysis method executed by a processor, the method comprising:
     acquiring a direction of arrival of voice to each of a plurality of sound collection devices arranged in a predetermined area;
     identifying an utterance position where an utterance was made, using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and specifying, using the relationship between the utterance position and each position in the area, the length of utterance per unit time at that position; and
     causing an information terminal to display map information associating each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
  15.  A voice analysis program for causing a processor to execute:
     acquiring a direction of arrival of voice to each of a plurality of sound collection devices arranged in a predetermined area;
     identifying an utterance position where an utterance was made, using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and specifying, using the relationship between the utterance position and each position in the area, the length of utterance per unit time at that position; and
     causing an information terminal to display map information associating each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
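Read together, method claim 14 (and program claim 15) describe a three-step loop: acquire directions of arrival, localize speech and accumulate per-cell speech time, then render an activity map. A schematic sketch of that loop, with every callback standing in for a step the claims leave abstract (all names and the frame/window sizes are assumptions):

```python
from collections import defaultdict

def analysis_cycle(get_arrival_directions, localize, display,
                   window_seconds=60.0, frame_seconds=0.1, cell_size=0.5):
    """One analysis window: accumulate speech time per grid cell, then display.

    get_arrival_directions(): yields per-frame DOA observations (assumed API).
    localize(observations): returns an (x, y) utterance position or None.
    display(map_info): pushes the activity map to the information terminal.
    """
    speech_time = defaultdict(float)
    frames = int(window_seconds / frame_seconds)
    for _ in range(frames):
        observations = get_arrival_directions()
        position = localize(observations)
        if position is not None:
            cell = (int(position[0] // cell_size), int(position[1] // cell_size))
            speech_time[cell] += frame_seconds
    # Activity level = speech time per unit time at each position.
    map_info = {cell: seconds / window_seconds
                for cell, seconds in speech_time.items()}
    display(map_info)
```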
PCT/JP2022/045694 2022-04-27 2022-12-12 Voice analysis device, voice analysis method, and voice analysis program WO2023210052A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PCT/JP2022/019170 WO2023209898A1 (en) 2022-04-27 2022-04-27 Voice analysis device, voice analysis method, and voice analysis program
JPPCT/JP2022/019170 2022-04-27

Publications (1)

Publication Number Publication Date
WO2023210052A1 true WO2023210052A1 (en) 2023-11-02

Family

ID=88518226

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/JP2022/019170 WO2023209898A1 (en) 2022-04-27 2022-04-27 Voice analysis device, voice analysis method, and voice analysis program
PCT/JP2022/045694 WO2023210052A1 (en) 2022-04-27 2022-12-12 Voice analysis device, voice analysis method, and voice analysis program

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/019170 WO2023209898A1 (en) 2022-04-27 2022-04-27 Voice analysis device, voice analysis method, and voice analysis program

Country Status (1)

Country Link
WO (2) WO2023209898A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007025859A (en) * 2005-07-13 2007-02-01 Sharp Corp Real world communication management device
JP2013058221A (en) * 2012-10-18 2013-03-28 Hitachi Ltd Conference analysis system
JP2016082356A (en) * 2014-10-15 2016-05-16 株式会社ニコン Electronic apparatus and program
JP2018036690A (en) * 2016-08-29 2018-03-08 米澤 朋子 One-versus-many communication system, and program
WO2019142233A1 (en) * 2018-01-16 2019-07-25 ハイラブル株式会社 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
WO2019142230A1 (en) * 2018-01-16 2019-07-25 ハイラブル株式会社 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
WO2021245759A1 (en) * 2020-06-01 2021-12-09 ハイラブル株式会社 Voice conference device, voice conference system, and voice conference method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HORI TAKAAKI, ATSUSHI NAKAMURA, AKIKO ARAKI, TOMOHIRO NAKATANI: "Aiming for a computer that can listen to everyone's conversations", NTT TECHNOLOGY JOURNAL, vol. 25, no. 9, 1 September 2013 (2013-09-01), pages 18 - 21, XP093105262 *

Also Published As

Publication number Publication date
WO2023209898A1 (en) 2023-11-02

Similar Documents

Publication Publication Date Title
US10453443B2 (en) Providing an indication of the suitability of speech recognition
US20220012470A1 (en) Multi-user intelligent assistance
US11290826B2 (en) Separating and recombining audio for intelligibility and comfort
US11010601B2 (en) Intelligent assistant device communicating non-verbal cues
US9293133B2 (en) Improving voice communication over a network
CN114556972A (en) System and method for assisting selective hearing
JP6710562B2 (en) Reception system and reception method
CN112352441B (en) Enhanced environmental awareness system
US11602287B2 (en) Automatically aiding individuals with developing auditory attention abilities
JPWO2008139717A1 (en) Display device, display method, display program
EP2503545A1 (en) Arrangement and method relating to audio recognition
US11164341B2 (en) Identifying objects of interest in augmented reality
US11460927B2 (en) Auto-framing through speech and video localizations
JP7400364B2 (en) Speech recognition system and information processing method
WO2023210052A1 (en) Voice analysis device, voice analysis method, and voice analysis program
US20190189088A1 (en) Information processing device, information processing method, and program
JP6589042B1 (en) Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system
US20230005488A1 (en) Signal processing device, signal processing method, program, and signal processing system
JP2009060220A (en) Communication system and communication program
CN115516553A (en) System and method for multi-microphone automated clinical documentation
WO2019190812A1 (en) Intelligent assistant device communicating non-verbal cues
WO2023112668A1 (en) Sound analysis device, sound analysis method, and recording medium
US20230421983A1 (en) Systems and methods for orientation-responsive audio enhancement
US20230421984A1 (en) Systems and methods for dynamic spatial separation of sound objects

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22940323

Country of ref document: EP

Kind code of ref document: A1