WO2023210052A1 - Voice analysis device, voice analysis method, and voice analysis program - Google Patents


Info

Publication number
WO2023210052A1
Authority
WO
WIPO (PCT)
Prior art keywords
utterance
information
unit
activity
area
Application number
PCT/JP2022/045694
Other languages
French (fr)
Japanese (ja)
Inventor
武志 水本
直希 安良岡
浩平 柳楽
Original Assignee
Hylable Inc. (ハイラブル株式会社)
Application filed by Hylable Inc. (ハイラブル株式会社)
Publication of WO2023210052A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination

Description

  • The present invention relates to a voice analysis device, a voice analysis method, and a voice analysis program for analyzing voice.
  • Patent Document 1 discloses a system that extracts sounds meeting predetermined conditions from a spectrogram representing the acoustics of a space and displays the sound pressure for each direction in which the extracted sounds are present.
  • The present invention has been made in view of these points, and its purpose is to make it easier to analyze whether or not voice communication is actively taking place.
  • A voice analysis device includes: a voice acquisition unit that acquires the direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; a specifying unit that identifies the utterance position where an utterance was made using the relationship among the directions of arrival, and identifies the length of utterance per unit time at each position using the relationship between the utterance position and each position in the area; and an output control unit that causes an information terminal to display map information associating each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
  • The voice analysis device may further include a reception unit that receives a setting of an object area in which an object is located within the area, and the specifying unit may identify the position where the utterance was made by excluding the part of the direction of arrival that lies beyond the object area as seen from the position of the sound collection device.
  • The map information may be information in which information corresponding to the activity level is superimposed on a map representing the area.
  • The map information may be information in which information corresponding to the activity level and information indicating the positions of one or more call terminals arranged in the area are superimposed on a map representing the area, and the voice analysis device may further include a call control unit that starts sending and receiving audio between a selected call terminal and the information terminal in response to the selection of one of the call terminals in the map information displayed on the information terminal.
  • The output control unit may output intervention information associated with a condition to the information terminal in response to the activity level at a position within the area satisfying the predetermined condition.
  • The voice analysis device may further include a reception unit that receives settings of the condition and of the intervention information associated with the condition from the information terminal.
  • The specifying unit may identify the temporal change of the utterance position as a locus of movement of the utterance position, and the output control unit may cause the information terminal to display information including the locus of movement.
  • The output control unit may cause the information terminal to display, in association with each other, the activity level in a first period in a sub-area that is at least part of the area and the activity level in a second period in the same sub-area.
  • The specifying unit may identify the number of people who made utterances at each position within the area by recognizing one or more speakers who uttered each of the plurality of voices acquired from the plurality of sound collection devices, and the voice analysis device may further include an activity level determination unit that calculates a provisional activity level using the length of utterance per unit time and determines the activity level by correcting the provisional activity level according to the number of people.
  • The activity level determination unit may set the activity level when the number of people is plural to be larger than the activity level when the number of people is one.
  • The output control unit may cause the information terminal to repeatedly display the map information including the activity level determined at predetermined time intervals.
  • The voice analysis device may further include an activity level determination unit that determines to which of a plurality of types the voice used to determine the activity level belongs, and the output control unit may cause the information terminal to display the map information in which each position in the area is associated with the activity level determined using one of the plurality of types of voice.
  • The output control unit may output, to an imaging device that images part of the area, control information for directing the imaging device toward a position within the area where the activity level satisfies a predetermined condition.
  • A voice analysis method, executed by a processor, includes the steps of: acquiring the direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying the utterance position where an utterance was made using the relationship among the directions of arrival; identifying the length of utterance per unit time at each position using the relationship between the utterance position and each position in the area; and displaying, on an information terminal, map information in which each position in the area is associated with an activity level corresponding to the length of utterance per unit time at that position.
  • A voice analysis program causes a processor to execute the steps of: acquiring the direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying the utterance position where an utterance was made using the relationship among the directions of arrival; identifying the length of utterance per unit time at each position using the relationship between the utterance position and each position in the area; and displaying, on an information terminal, map information that associates each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
  • FIG. 1 is a schematic diagram of a voice analysis system according to an embodiment.
  • FIG. 2 is a block diagram of the voice analysis system according to the embodiment.
  • FIG. 3 is a schematic diagram for explaining the relationship among an analysis target area, sound collection devices, and a local terminal.
  • FIG. 4A is a schematic diagram for explaining how the voice acquisition unit acquires the direction of arrival of voice.
  • FIG. 4B is a schematic diagram for explaining how the specifying unit identifies the utterance position.
  • FIG. 5 is a schematic diagram for explaining the relationship between a direction of arrival and an object area.
  • FIG. 6A is a schematic diagram of a local terminal outputting map information and intervention information.
  • FIG. 6B is a schematic diagram of a local terminal outputting intervention information.
  • FIG. 7A is a schematic diagram of an external terminal displaying comparison information.
  • FIG. 7B is a schematic diagram of an external terminal displaying comparison information.
  • FIG. 8 is a schematic diagram of an external terminal displaying movement information.
  • FIG. 9 is a flowchart of an exemplary voice analysis method executed by the voice analysis device according to the embodiment.
  • FIG. 10A is a schematic diagram of map information representing the activity level determined using one of a plurality of voice types.
  • FIG. 10B is a schematic diagram of map information representing the activity level determined using one of a plurality of voice types.
  • FIG. 11 is a block diagram of the voice analysis system according to a third modification.
  • FIG. 12 is a schematic diagram illustrating processing for changing the orientation of the imaging device based on the activity level.
  • FIG. 1 is a schematic diagram of the voice analysis system S according to this embodiment.
  • The voice analysis system S includes a voice analysis device 1, sound collection devices 2, a local terminal 3, and an external terminal 4.
  • The voice analysis system S may include a plurality of sound collection devices 2, a plurality of local terminals 3, and a plurality of external terminals 4.
  • The voice analysis system S may also include other devices such as servers and terminals.
  • The voice analysis device 1 is a computer that analyzes the voice uttered by users in a predetermined analysis target area R and provides the analysis results to the users or to an external user.
  • The analysis target area R is, for example, a room in a company or public facility, a library, a classroom in a school or cram school, an event venue, a park, or the like.
  • The user is a person who stays in the analysis target area R and produces voice for the purpose of conversation or the like.
  • The external user is a person outside the analysis target area R, for example an analyst.
  • The voice analysis device 1 analyzes the voice acquired by the sound collection devices 2 and outputs the analysis results to the local terminal 3 or the external terminal 4.
  • The voice analysis device 1 is connected to the sound collection devices 2, the local terminal 3, and the external terminal 4 by wire or wirelessly via a network such as a local area network or the Internet.
  • The sound collection device 2 is a device that is placed in the analysis target area R and captures the voice uttered by users.
  • The sound collection device 2 includes, for example, a microphone array made up of a plurality of microphones oriented in different directions.
  • The microphone array includes, for example, a plurality of (e.g., eight) microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground.
  • The voice analysis device 1 identifies the position where an utterance was made by estimating the direction of arrival of the voice at each of the plurality of sound collection devices 2 based on the voice collected by the microphone arrays.
  • The sound collection device 2 transmits the voice acquired by the microphone array to the voice analysis device 1 as voice data.
  • The sound collection device 2 may include a single microphone instead of a microphone array.
  • In that case, a plurality of sound collection devices 2 are arranged at predetermined intervals in the analysis target area R.
  • The voice analysis device 1 then identifies the position where an utterance was made by comparing the intensities of the voices acquired by each of the plurality of sound collection devices 2.
  • The local terminal 3 is an information terminal that is installed in the analysis target area R and outputs information.
  • The local terminal 3 is, for example, a tablet terminal, a personal computer, or digital signage.
  • The local terminal 3 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone.
  • The local terminal 3 displays the information received from the voice analysis device 1 on the display unit or outputs it from the audio output unit.
  • The local terminal 3 may function as a call terminal for making calls with the external terminal 4.
  • The external terminal 4 is an information terminal that receives settings related to the analysis and outputs information.
  • The external terminal 4 is, for example, a smartphone, a tablet terminal, or a personal computer.
  • The external terminal 4 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone.
  • The external terminal 4 displays the information received from the voice analysis device 1 on the display unit.
  • The voice analysis device 1 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R.
  • The voice analysis device 1 uses the acquired voices to identify the positions where utterances were made.
  • The voice analysis device 1 identifies the length of utterance per unit time at each position in the analysis target area R by counting, for each time, where in the analysis target area R the utterance position was located.
  • The voice analysis device 1 then calculates an activity level corresponding to the identified length of utterance per unit time. For example, the activity level takes a larger value as the length of utterance per unit time becomes longer, and a smaller value as it becomes shorter.
  • The voice analysis device 1 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target area R with the activity level.
  • In this way, the voice analysis system S identifies the length of utterance per unit time at each position within the analysis target area R based on the voices acquired by the sound collection devices 2 placed in the analysis target area R, and outputs the activity level corresponding to the length of utterance in association with each position within the analysis target area R.
  • The voice analysis system S can thereby visualize the length of utterances at each position within the analysis target area R, rather than the loudness of the voice, making it easier to analyze whether voice communication is actively taking place.
  • FIG. 2 is a block diagram of the voice analysis system S according to this embodiment.
  • Arrows indicate the main data flows; data flows other than those shown in FIG. 2 may exist.
  • Each block shows the configuration of a functional unit rather than a hardware (device) unit. The blocks shown in FIG. 2 may therefore be implemented within a single device or implemented separately across multiple devices. Data may be exchanged between blocks via any means, such as a data bus, a network, or a portable storage medium.
  • The voice analysis device 1 includes a storage unit 11 and a control unit 12.
  • The voice analysis device 1 may be configured from two or more physically separate devices connected by wire or wirelessly, or may be configured as a cloud, that is, a collection of computer resources.
  • The storage unit 11 is a computer-readable, non-transitory storage medium including a ROM (Read Only Memory), a RAM (Random Access Memory), a hard disk drive, and the like.
  • The storage unit 11 stores in advance the programs to be executed by the control unit 12.
  • The storage unit 11 may be provided outside the voice analysis device 1, in which case it may exchange data with the control unit 12 via a network.
  • The control unit 12 includes a reception unit 121, a voice acquisition unit 122, a specifying unit 123, an activity level determination unit 124, an output control unit 125, and a call control unit 126.
  • The control unit 12 is, for example, a processor such as a CPU (Central Processing Unit), and functions as the reception unit 121, the voice acquisition unit 122, the specifying unit 123, the activity level determination unit 124, the output control unit 125, and the call control unit 126 by executing the programs stored in the storage unit 11. At least part of the functions of the control unit 12 may be realized by the control unit 12 executing a program provided via a network.
  • FIG. 3 is a schematic diagram for explaining the relationship among the analysis target area R, the sound collection devices 2, and the local terminal 3.
  • In the analysis target area R, a plurality of sound collection devices 2 and one or more local terminals 3 are arranged.
  • The reception unit 121 accepts settings for the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3 within the analysis target area R, and an object area where an object (obstacle) such as a wall is located within the analysis target area R.
  • The external terminal 4 receives from an external user an operation specifying the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3, and the object area, and transmits information indicating the specified contents to the voice analysis device 1.
  • The reception unit 121 stores in the storage unit 11 information that associates the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3, and the object area, based on the information received from the external terminal 4.
  • The reception unit 121 may also accept settings for sub-areas included in the analysis target area R.
  • A sub-area is a region constituting at least part of the analysis target area R that is of interest during analysis, for example:
  • a coffee area, which is an area that includes a coffee machine;
  • a desk area, which is an area that includes desks;
  • a sofa area, which is an area that includes a sofa; and so on.
  • The external terminal 4 receives from an external user an operation specifying a sub-area within the analysis target area R and the name of the sub-area, and transmits information indicating the specified contents to the voice analysis device 1.
  • The reception unit 121 stores in the storage unit 11 information associating the sub-areas with their names, based on the information received from the external terminal 4.
  • The reception unit 121 may accept settings for intervention conditions used to determine whether or not to output intervention information.
  • An intervention condition is, for example, that the activity level corresponding to the length of utterance per unit time determined by the activity level determination unit 124 is equal to or greater than a predetermined threshold.
  • The intervention information is, for example, a message containing the name of a sub-area that satisfies the intervention condition.
  • The external terminal 4 receives from an external user an operation specifying intervention conditions and intervention information, and transmits information indicating the specified contents to the voice analysis device 1.
  • The reception unit 121 stores in the storage unit 11 information associating the intervention conditions with the intervention information, based on the information received from the external terminal 4.
  • The voice acquisition unit 122 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R.
  • The sound collection device 2 transmits, for example, voice data representing the sounds collected by its microphone array to the voice analysis device 1.
  • The sound collection device 2 either transmits voice data to the voice analysis device 1 continuously, or transmits voice data for a predetermined period (one hour, one day, etc.) to the voice analysis device 1 in bulk.
  • The voice acquisition unit 122 stores the voice data received from the sound collection devices 2 in the storage unit 11 and acquires the voice indicated by the voice data.
  • The voice acquisition unit 122 may apply predetermined filtering to the acquired voice. For example, it may remove voice collected outside a period associated in advance with the analysis target area R (such as the business hours of a company or public facility). It may also remove sounds other than human voice, for example by keeping only the frequency band corresponding to human voices. The voice analysis device 1 can thereby exclude audio that is unimportant for the analysis and improve the accuracy of the analysis results.
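As one concrete illustration of the frequency-band filtering mentioned above, the sketch below band-passes collected audio to a rough human-voice band. The 300-3400 Hz bounds, the function name, and the use of SciPy are assumptions for illustration, not details given in the publication.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def keep_voice_band(audio: np.ndarray, rate: int,
                    low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Band-pass the signal to a rough human-voice band (assumed bounds)."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=rate, output="sos")
    return sosfilt(sos, audio)

# One second of synthetic audio at 16 kHz: a voice-band tone plus mains hum.
rate = 16000
t = np.arange(rate) / rate
audio = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
filtered = keep_voice_band(audio, rate)  # the 50 Hz hum is strongly attenuated
```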
  • The voice acquisition unit 122 acquires the direction of arrival of the voice collected by each of the plurality of sound collection devices 2 for each time step (for example, every 10 to 1000 milliseconds). For example, the voice acquisition unit 122 performs known sound source localization processing on the multi-channel audio collected by the microphone array included in the sound collection device 2.
  • Sound source localization is a process for estimating the position of a sound source contained in the audio acquired by the voice acquisition unit 122.
  • The voice acquisition unit 122 performs sound source localization to acquire a reliability distribution indicating, relative to the position of the sound collection device 2, the reliability that a sound source exists at each position.
  • The reliability is a value corresponding to the likelihood that a sound source exists at that position, and may be, for example, a probability.
  • The reliability distribution thus represents the direction of arrival of the sound with respect to the sound collection device 2.
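A minimal sketch of what such a reliability distribution could look like over a grid of the analysis target area R, given a device position and an estimated direction of arrival. The Gaussian angular falloff and the grid resolution are assumptions; the publication only states that reliability is distributed along the direction of arrival.

```python
import numpy as np

def doa_reliability(grid_w: int, grid_h: int, mic_xy, doa_rad: float,
                    angular_sigma: float = 0.1) -> np.ndarray:
    """Reliability that a sound source exists at each grid cell: highest along
    the estimated direction of arrival from the device, falling off with
    angular distance (the Gaussian falloff is an assumed model)."""
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    angles = np.arctan2(ys - mic_xy[1], xs - mic_xy[0])
    diff = np.angle(np.exp(1j * (angles - doa_rad)))  # wrap to [-pi, pi]
    return np.exp(-0.5 * (diff / angular_sigma) ** 2)

# Distribution P for one device at (10, 40) hearing a source at 15 degrees.
p = doa_reliability(100, 80, mic_xy=(10, 40), doa_rad=np.deg2rad(15))
```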
  • FIG. 4A is a schematic diagram for explaining how the voice acquisition unit 122 acquires the direction of arrival of voice.
  • The example in FIG. 4A represents the reliability distributions P acquired by the voice acquisition unit 122 based on the voices collected by each of three sound collection devices 2.
  • The vertical and horizontal axes of a reliability distribution P correspond to coordinates within the analysis target area R.
  • Because the microphone array cannot determine the distance from the sound collection device 2 to the sound source, regions of equal reliability in the reliability distribution P are distributed linearly (radially) from the sound collection device 2. Since the reliability that the sound source exists is high along the straight line connecting the sound collection device 2 and the sound source, the linear region in the reliability distribution P where the reliability is equal to or higher than a predetermined value indicates the direction of arrival D of the sound at the sound collection device 2.
  • The direction of arrival D is not limited to a straight line through the position of the sound collection device 2; it may also be expressed as a region having a width of a predetermined angle or length relative to the position of the sound collection device 2.
  • Although the voice analysis device 1 estimates the direction of arrival D here, the direction of arrival D may instead be estimated by each of the plurality of sound collection devices 2 based on the voice it acquires with its microphone array.
  • In that case, the voice acquisition unit 122 receives from each of the plurality of sound collection devices 2 information indicating the direction of arrival D estimated by that sound collection device 2.
  • Based on the plurality of directions of arrival D for the plurality of sound collection devices 2, the specifying unit 123 identifies, for each time step (for example, every 10 to 1000 milliseconds), the utterance position, that is, the position within the analysis target area R where an utterance was made.
  • FIG. 4B is a schematic diagram for explaining how the specifying unit 123 identifies the utterance position.
  • The specifying unit 123 superimposes the plurality of reliability distributions P generated from the sounds collected by the plurality of sound collection devices 2.
  • The specifying unit 123 superimposes the plurality of reliability distributions P by, for example, calculating the sum or product of the reliabilities indicated by the distributions at each position in the analysis target area R.
  • FIG. 4B represents the reliability distribution P1 generated by superimposing the three reliability distributions P illustrated in FIG. 4A.
  • The specifying unit 123 identifies the utterance position using the reliability distribution P1 obtained by superimposing the plurality of reliability distributions P.
  • The utterance position may be represented by a single point within the analysis target area R or by a region within the analysis target area R.
  • The specifying unit 123 identifies, for example, a position or region whose reliability in the reliability distribution P1 is equal to or greater than a predetermined value as the utterance position.
  • The position where the plurality of directions of arrival D indicated by the plurality of reliability distributions P intersect has high reliability in the superimposed distribution P1. The specifying unit 123 may therefore identify the intersection position D1, where straight lines along the plurality of directions of arrival D intersect, as the utterance position.
  • The intersection position D1 may also be a region where a plurality of regions extending along the plurality of directions of arrival D overlap.
  • Because the voice analysis device 1 identifies the utterance position from the directions of arrival D at a plurality of sound collection devices 2, it can pinpoint the utterance position with high accuracy even though the distance from a single sound collection device 2 to the sound source cannot be determined.
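Following this description, one hedged way to superimpose the per-device reliability distributions (the text permits either a sum or a product) and read off the utterance position as the most reliable cell:

```python
import numpy as np

def utterance_position(distributions, threshold: float = 0.5):
    """Superimpose per-device reliability distributions (product form here;
    the text also allows a sum) and return the most reliable cell, or None
    if no cell clears the assumed threshold."""
    combined = np.ones_like(distributions[0])
    for p in distributions:
        combined = combined * p
    if combined.max() < threshold:
        return None
    iy, ix = np.unravel_index(np.argmax(combined), combined.shape)
    return (ix, iy)  # cell where the directions of arrival intersect

# e.g. utterance_position([p1, p2, p3]) with distributions built as above.
```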
  • The specifying unit 123 may identify the utterance position while taking into account the object area, received by the reception unit 121, where an object is located within the analysis target area R.
  • FIG. 5 is a schematic diagram for explaining the relationship between the direction of arrival D and the object area R2. The example in FIG. 5 shows a state in which the object area R2 lies partway along the direction of arrival D.
  • In this case, the specifying unit 123 identifies the utterance position by excluding the part of the direction of arrival D that lies beyond the object area R2 as seen from the position of the sound collection device 2.
  • The specifying unit 123 identifies as the utterance position, for example, the intersection position D1 where the line segment between the sound collection device 2 and the object area R2 along a first direction of arrival intersects the line segment between the sound collection device 2 and the object area R2 along a second direction of arrival.
  • The voice analysis device 1 can thereby suppress the erroneous recognition of a sound source lying beyond an obstacle such as a wall, and can improve the accuracy of the utterance position.
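A sketch of the exclusion step: walking along the direction of arrival from the device and discarding everything beyond the object area. The axis-aligned rectangle for R2 and the step size are assumptions for illustration.

```python
import numpy as np

def candidate_positions(mic_xy, doa_rad: float, obj_rect,
                        max_range: float = 100.0, step: float = 0.5):
    """Walk along the direction of arrival from the device and stop at the
    first point inside the object area, so positions beyond a wall are never
    considered. obj_rect = (xmin, ymin, xmax, ymax)."""
    x0, y0 = mic_xy
    xmin, ymin, xmax, ymax = obj_rect
    points = []
    for r in np.arange(step, max_range, step):
        x, y = x0 + r * np.cos(doa_rad), y0 + r * np.sin(doa_rad)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            break  # reached the obstacle: exclude everything farther away
        points.append((x, y))
    return points  # candidate source positions between device and obstacle
```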
  • The specifying unit 123 may also estimate the number of users who made utterances at the utterance position.
  • In this case, the specifying unit 123 applies processing that emphasizes the sound in the direction of arrival D to each of the plurality of voices that the voice acquisition unit 122 acquired from the plurality of sound collection devices 2.
  • The specifying unit 123 emphasizes the sound in the direction of arrival D by, for example, suppressing the sound entering the microphone array of the sound collection device 2 from directions other than the direction of arrival D.
  • The specifying unit 123 then performs known speaker recognition processing on each of the plurality of voices in which the sound in the direction of arrival D has been emphasized, thereby recognizing one or more speakers who uttered each of the voices.
  • The specifying unit 123 recognizes one or more speakers corresponding to one or more generated clusters, for example, by clustering the voice divided at predetermined intervals using deep learning.
  • The specifying unit 123 estimates the one or more speakers common to all of the voices, among the speakers who uttered each of the plurality of voices, as the users who made utterances at the utterance position.
  • The specifying unit 123 stores in the storage unit 11, for each time step, information associating the utterance position with the users who spoke there. The voice analysis device 1 can thereby exclude speakers who spoke at positions other than the utterance position and estimate the users who spoke at the utterance position with high accuracy.
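Once each device's direction-enhanced channel has been run through speaker recognition, the "common speaker" step reduces to a set intersection; a sketch, with plain labels standing in for whatever recognizer output is actually used:

```python
def common_speakers(per_device_speakers):
    """Speakers recognized in every device's direction-enhanced audio are
    taken to be the users actually talking at the utterance position."""
    return set.intersection(*(set(s) for s in per_device_speakers))

# Device A hears speakers {1, 2}, device B {1, 3}, device C {1, 2}:
print(common_speakers([[1, 2], [1, 3], [1, 2]]))  # {1}
```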
  • The specifying unit 123 may identify the utterance position using the sounds collected by a plurality of sound collection devices 2 each including a single microphone, instead of the sounds collected by sound collection devices 2 including microphone arrays.
  • In this case, the plurality of sound collection devices 2 are arranged at predetermined intervals. When a user produces voice within the analysis target area R, each sound collection device 2 acquires the voice with higher intensity the closer it is to the user, and with lower intensity the farther it is from the user.
  • The specifying unit 123 compares the intensities of the sounds acquired by the plurality of sound collection devices 2 at the same time, selects the sound collection device 2 with the highest sound intensity, or a plurality of sound collection devices 2 whose sound intensity is above a threshold, and identifies the utterance position based on the positions of the selected sound collection devices 2. The voice analysis device 1 can thereby identify the utterance position even when using sound collection devices 2 that do not include a microphone array.
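For the single-microphone variant, a hedged sketch: the loudest device(s) are taken as closest to the speaker, and averaging the above-threshold device positions is one assumed reading of "based on the positions of the sound collection devices".

```python
import numpy as np

def position_from_intensity(device_xy, intensities, ratio: float = 0.6):
    """Average the positions of the device(s) whose measured intensity is
    within an assumed ratio of the loudest device."""
    device_xy = np.asarray(device_xy, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    loud = intensities >= ratio * intensities.max()
    return tuple(device_xy[loud].mean(axis=0))

# The two left devices hear the speaker loudly, the third barely at all:
print(position_from_intensity([(0, 0), (4, 0), (2, 3)], [0.9, 0.8, 0.1]))
# -> (2.0, 0.0)
```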
  • The specifying unit 123 identifies the length of utterance per unit time at each position within the analysis target area R based on the utterance positions identified for each time step. For example, for each position within the analysis target area R (for example, each rectangular cell obtained by dividing the analysis target area R), the specifying unit 123 totals the time during which an utterance position was present at that position per unit time (for example, one minute). If an utterance position existed at a certain position for 30 seconds of the minute preceding the current time, the length of utterance per unit time at that position is 30 seconds.
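A sketch of this aggregation step, binning per-frame utterance positions into cells and totaling seconds over the most recent unit time. The 0.1-second frame and 60-second window follow the ranges and the example given above.

```python
from collections import defaultdict

def utterance_seconds_per_cell(events, window_s: float = 60.0,
                               frame_s: float = 0.1):
    """events: (t_seconds, cell) pairs, one per analysis frame in which the
    utterance position fell in that cell. Totals utterance time per cell
    over the most recent window."""
    if not events:
        return {}
    now = max(t for t, _ in events)
    totals = defaultdict(float)
    for t, cell in events:
        if now - t <= window_s:
            totals[cell] += frame_s
    return dict(totals)

# 300 frames (30 s) of speech in cell (2, 5) within the last minute:
events = [(i * 0.1, (2, 5)) for i in range(300)]
print(utterance_seconds_per_cell(events))  # {(2, 5): ~30.0}
```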
  • The activity level determination unit 124 determines, for each position within the analysis target area R, the activity level corresponding to the length of utterance per unit time identified by the specifying unit 123. For example, the activity level determination unit 124 sets the activity level to a larger value the longer the length of utterance per unit time, and to a smaller value the shorter it is.
  • The activity level determination unit 124 may adopt the length of utterance per unit time itself as the activity level, or may adopt that value converted according to a predetermined rule.
  • The activity level determination unit 124 may also determine the activity level taking into account the number of users who spoke at the utterance position. In this case, it first calculates a provisional activity level whose value is larger the longer the length of utterance per unit time identified by the specifying unit 123, and smaller the shorter it is.
  • The activity level determination unit 124 then calculates the activity level by correcting the provisional activity level according to the number of people identified by the specifying unit 123.
  • For example, the activity level determination unit 124 corrects the provisional activity level so that the activity level when there are multiple people is greater than the activity level when there is one person.
  • The voice analysis device 1 can thereby reflect the number of people estimated from the voice in the activity level.
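The activity level computation and headcount correction might look like the following; the 1.5x boost for multiple speakers is an assumed value, since the text only requires plural speakers to score higher than a single speaker.

```python
def activity_level(utterance_seconds: float, window_s: float = 60.0,
                   n_speakers: int = 1, multi_speaker_boost: float = 1.5):
    """Provisional activity grows with utterance length per unit time; the
    boost for plural speakers is an assumed correction."""
    provisional = utterance_seconds / window_s  # in [0, 1]
    if n_speakers > 1:
        provisional *= multi_speaker_boost
    return min(provisional, 1.0)

print(activity_level(30.0))                # 0.5
print(activity_level(30.0, n_speakers=3))  # 0.75
```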
  • The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information in which each position within the analysis target area R is associated with the activity level determined by the activity level determination unit 124. For example, the output control unit 125 generates, as map information, a heat map in which information corresponding to the activity level at each position (color, pattern, etc.) is superimposed on a map representing the analysis target area R. The output control unit 125 may also generate map information that indicates the positions of the plurality of sound collection devices 2 arranged in the analysis target area R in addition to the activity level at each position. The output control unit 125 transmits the generated map information to at least one of the local terminal 3 and the external terminal 4.
  • The output control unit 125 may repeatedly display, on at least one of the local terminal 3 and the external terminal 4, map information reflecting the activity level determined by the activity level determination unit 124 at predetermined time intervals. The voice analysis system S can thereby notify users or an external user of the latest communication status in the analysis target area R.
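A minimal rendering sketch using matplotlib: activity levels as colors with device positions superimposed, as in the described heat map. The grid shape, colormap, and marker style are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def render_activity_map(activity_grid, device_xy, path="heatmap.png"):
    """Superimpose activity levels (color) and sound collection device
    positions (markers) on a map of the analysis target area."""
    fig, ax = plt.subplots()
    ax.imshow(activity_grid, origin="lower", cmap="hot", vmin=0.0, vmax=1.0)
    xs, ys = zip(*device_xy)
    ax.scatter(xs, ys, marker="^", label="sound collection device")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)

grid = np.random.rand(20, 30)  # placeholder activity levels
render_activity_map(grid, [(5, 5), (25, 5), (15, 15)])
```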
  • Whether a situation with a high activity level is regarded as a positive or a negative element, and likewise a situation with a low activity level, depends on the type of the analysis target area R. For example, in an analysis target area R where quietness is desirable (such as a library, or a classroom at a school or cram school where students should be quiet during classes or tests), a situation with a high activity level may be regarded as a negative element, and a situation with a low activity level as a positive element.
  • In response to the activity level at any position within the analysis target area R satisfying a predetermined intervention condition, the output control unit 125 may output the intervention information associated with that intervention condition from at least one of the local terminal 3 and the external terminal 4. For example, the output control unit 125 acquires the intervention conditions and intervention information accepted by the reception unit 121 from the storage unit 11, and determines whether the activity level at each position within the analysis target area R satisfies an intervention condition (for example, whether it is equal to or greater than the threshold indicated by the intervention condition).
  • The output control unit 125 generates the intervention information associated with the intervention condition in response to the activity level at any position within the analysis target area R satisfying that condition.
  • The output control unit 125 generates as intervention information, for example, a message including the name of the sub-area containing the position whose activity level satisfies the intervention condition ("The coffee area is busy", "Please be quiet in the library", etc.).
  • Whether intervention information with positive content (e.g., information to promote communication) or with negative content (e.g., information to suppress communication) is generated when the activity level is high may be determined according to the type of the analysis target area R. As described above, in an analysis target area R where quietness is desirable, for example, a situation with a high activity level may be regarded as a negative element.
  • The output control unit 125 may also generate, as the intervention information, a message instructing a store employee to move to the relevant location (such as "Please move near the hat section") in response to the activity level at a position within the analysis target area R satisfying the intervention condition.
  • In this way, the voice analysis system S can automatically direct store staff to areas of the store where many conversations are taking place, making it easier to explain products and promote sales.
  • The intervention condition is not limited to the activity level being equal to or greater than a predetermined threshold; instead of or in addition to this, the condition may be that the activity level is equal to or less than a predetermined threshold.
  • Whether intervention information with positive content (e.g., praise for being quiet) or with negative content is generated when the activity level is low may likewise be determined according to the type of the analysis target area R.
  • The output control unit 125 may output the intervention information from all of the local terminals 3, or only from those of the plurality of local terminals 3 located within the sub-area containing the position that satisfies the intervention condition. The voice analysis system S can thereby notify the intervention information to the users around the position whose activity level satisfies the intervention condition.
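The intervention check itself is a simple rule evaluation over the per-cell activity levels; a sketch, with the rule and message structures assumed purely for illustration:

```python
def interventions(cell_activity, rules, cell_to_subarea):
    """rules: (predicate, message_template) pairs set via the reception unit;
    yields a message for every cell whose activity level satisfies a rule."""
    out = []
    for cell, level in cell_activity.items():
        for predicate, template in rules:
            if predicate(level):
                name = cell_to_subarea.get(cell, "this area")
                out.append((cell, template.format(area=name)))
    return out

rules = [(lambda a: a >= 0.8, "The {area} is busy."),
         (lambda a: a <= 0.1, "It is quiet in the {area}.")]
print(interventions({(2, 5): 0.9}, rules, {(2, 5): "coffee area"}))
# [((2, 5), 'The coffee area is busy.')]
```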
  • FIG. 6A is a schematic diagram of the local terminal 3 displaying map information and intervention information.
  • The local terminal 3 displays the map information and intervention information received from the voice analysis device 1 on its display unit.
  • Specifically, the local terminal 3 displays a heat map H, which is the map information, and a message M, which represents the intervention information, on the display unit.
  • The external terminal 4 may similarly display the heat map H and the message M on its display unit.
  • FIG. 6B is a schematic diagram of the local terminal 3 outputting intervention information as audio.
  • The local terminal 3 outputs a voice V representing the intervention information received from the voice analysis device 1 from its audio output unit.
  • The voice V may be generated by the output control unit 125 of the voice analysis device 1, or by the local terminal 3.
  • By visualizing the length of utterances at each position within the analysis target area R as map information, the voice analysis system S makes it easier to analyze whether voice communication is actively taking place within the analysis target area R. Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target area R so as to promote or suppress it.
  • The output control unit 125 may change the content of the intervention information depending on the person within the analysis target area R.
  • In this case, intervention information is associated in advance with, for example, a person or an attribute of a person (age, gender, clothing, etc.).
  • The output control unit 125 recognizes a person within the analysis target area R, for example, by performing known person recognition processing on a captured image of the surroundings of the local terminal 3 acquired by a camera included in the local terminal 3.
  • The output control unit 125 may recognize a person located anywhere in the analysis target area R, or only a person located in a specific sub-area.
  • The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with the recognized person or with the person's attribute in response to the intervention condition being satisfied.
  • The voice analysis system S can thereby output intervention information suited to the person within the analysis target area R.
  • The output control unit 125 may cause the external terminal 4 to display comparison information for comparing the activity levels of different periods within the analysis target area R.
  • In this case, the reception unit 121 receives from the external terminal 4 the designation of the sub-area to be compared, and may also receive the designation of the periods to be compared.
  • The output control unit 125 generates comparison information that associates the activity level in a first period in the designated sub-area with the activity level in a second period in the same sub-area.
  • The output control unit 125 transmits the generated comparison information to at least one of the local terminal 3 and the external terminal 4.
  • FIGS. 7A and 7B are schematic diagrams of the external terminal 4 displaying comparison information.
  • The external terminal 4 displays the comparison information received from the voice analysis device 1.
  • In FIG. 7A, the external terminal 4 displays as comparison information a heat map H for each of the first period and the second period, together with a message M representing the comparison result of the activity levels in the first and second periods in the designated sub-area.
  • The designated sub-area within the overall analysis target area R is highlighted.
  • The message M is, for example, a message representing the amount or rate of increase or decrease in the activity level between the first period and the second period in the sub-area.
  • In FIG. 7B, the external terminal 4 displays as comparison information a heat map H1 of the designated sub-area or of the entire analysis target area R, together with a message M representing the comparison result of the activity levels over multiple periods in the designated sub-area or the entire analysis target area R.
  • Unlike the heat map H illustrated in FIGS. 6A and 7A, which shows the activity level itself on the map, the heat map H1 shows information (color, pattern, etc.) corresponding to the difference in activity level between periods. The heat map H1 thus visualizes the difference in activity level between multiple time periods in the same area.
  • The message M here is, for example, a message representing a time period in which the activity level is high or low in the sub-area or in the entire analysis target area R.
  • By associating and visualizing the activity levels of different periods, the voice analysis system S makes it easier to analyze increases and decreases in the activity level and trends in the activity level for each time period.
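A sketch of the period comparison behind heat map H1 and message M: per-cell differences between two periods plus a one-line summary. The averaging and the message wording are assumptions.

```python
import numpy as np

def activity_difference(grid_p1, grid_p2):
    """Per-cell change in activity level between two periods (second minus
    first), plus a one-line summary message."""
    diff = np.asarray(grid_p2) - np.asarray(grid_p1)
    change = diff.mean()
    message = (f"Activity {'rose' if change >= 0 else 'fell'} by "
               f"{abs(change):.0%} on average between the two periods.")
    return diff, message

p1 = np.full((4, 4), 0.30)
p2 = np.full((4, 4), 0.45)
_, msg = activity_difference(p1, p2)
print(msg)  # Activity rose by 15% on average between the two periods.
```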
  • The output control unit 125 may output past audio at a specified position from the external terminal 4.
  • In this case, the reception unit 121 receives the designation of a position within the analysis target area R and of a past period on the external terminal 4 displaying the map information or comparison information.
  • The output control unit 125 retrieves from the storage unit 11 the audio for the specified position and period from among the audio acquired by the voice acquisition unit 122, and outputs it from the audio output unit of the external terminal 4.
  • The voice analysis system S thereby makes it easy to analyze the relationship between the activity level and the actual speech content.
  • The output control unit 125 may display movement information including the locus of movement of the utterance position on at least one of the local terminal 3 and the external terminal 4.
  • The specifying unit 123 identifies, for example, the temporal change of the utterance position identified at each time step as the locus of movement of the utterance position.
  • The specifying unit 123 acquires from the storage unit 11 the information, generated by the speaker recognition processing described above, that associates the utterance position with the users who spoke there for each time step, and identifies from it the locus of movement of the utterance position corresponding to a specific user (speaker).
  • The output control unit 125 transmits movement information including the locus of movement identified by the specifying unit 123 to at least one of the local terminal 3 and the external terminal 4.
  • FIG. 8 is a schematic diagram of the external terminal 4 displaying movement information.
  • The external terminal 4 displays the movement information received from the voice analysis device 1 on its display unit.
  • Specifically, the external terminal 4 displays the movement trajectory T indicated by the movement information on the display unit.
  • The local terminal 3 may similarly display the movement trajectory T on its display unit.
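Given the stored (time, position, speakers) records described above, extracting the movement trajectory T for one speaker is a filter-and-sort; a minimal sketch:

```python
def movement_trajectory(records, speaker):
    """records: (time, position, speakers_at_position) tuples stored by the
    specifying unit; returns the time-ordered positions where the given
    speaker was talking, i.e. the movement trajectory."""
    ordered = sorted(records, key=lambda r: r[0])
    return [pos for _, pos, who in ordered if speaker in who]

records = [(0, (1, 1), {"A"}), (1, (2, 1), {"A", "B"}), (2, (3, 2), {"B"})]
print(movement_trajectory(records, "A"))  # [(1, 1), (2, 1)]
```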
  • The call control unit 126 may start a call between a local terminal 3 selected on the map information and the external terminal 4 after the output control unit 125 has displayed the map information on the external terminal 4.
  • In this case, the output control unit 125 displays on the external terminal 4, as map information, a heat map in which information corresponding to the activity level at each position within the analysis target area R (color, pattern, etc.) and information indicating the positions of the one or more local terminals 3 placed in the analysis target area R (such as icons) are superimposed on a map representing the analysis target area R.
  • The reception unit 121 receives the selection of the local terminal 3 to be called from among the one or more local terminals 3 in the map information displayed on the external terminal 4. For example, in order to support communication in the analysis target area R from outside, the external user selects a local terminal 3 located where the activity level is low in the map information.
  • The call control unit 126 starts transmitting and receiving audio between the selected local terminal 3 and the external terminal 4 in response to the selection of one of the one or more local terminals 3.
  • The local terminal 3 then functions as a call terminal for the call with the external terminal 4: it outputs the voice received from the external terminal 4 from an audio output unit such as a speaker, and transmits the voice input to a sound collection unit such as its microphone to the external terminal 4.
  • The call control unit 126 may allow audio to be exchanged bidirectionally between the selected local terminal 3 and the external terminal 4, or may output audio one-way from the external terminal 4 to the local terminal 3.
  • The voice analysis system S thereby makes it easier for an external user who wishes to call a local terminal 3 to select which local terminal 3 to call based on the activity level.
  • Moreover, the external user can support the activation of communication in the analysis target area R by intervening by voice, from outside, through a local terminal 3 in the analysis target area R.
  • FIG. 9 is a flowchart of an exemplary voice analysis method executed by the voice analysis device 1 according to this embodiment.
  • The reception unit 121 accepts from the external terminal 4 the settings for the analysis target area R, the positions of the sound collection devices 2 and the local terminals 3 within the analysis target area R, and the object area where an object such as a wall is located within the analysis target area R (S11).
  • The voice acquisition unit 122 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R (S12).
  • The voice acquisition unit 122 acquires the direction of arrival D, for each time step, of the voice collected by each of the plurality of sound collection devices 2 (S13).
  • The direction of arrival D may be estimated by the voice analysis device 1 or by each of the plurality of sound collection devices 2.
  • The specifying unit 123 identifies, for each time step, the utterance position, that is, the position within the analysis target area R where an utterance was made, based on the plurality of directions of arrival D at the plurality of sound collection devices 2 (S14).
  • The specifying unit 123 may identify the utterance position taking into account the object area, received by the reception unit 121, where an object is located within the analysis target area R.
  • The specifying unit 123 identifies the length of utterance per unit time at each position within the analysis target area R based on the utterance positions identified for each time step (S15).
  • The activity level determination unit 124 determines, for each position within the analysis target area R, the activity level corresponding to the length of utterance per unit time identified by the specifying unit 123 (S16). For example, the activity level is set larger the longer the length of utterance per unit time, and smaller the shorter it is.
  • The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output map information in which each position within the analysis target area R is associated with the activity level determined by the activity level determination unit 124 (S17). The output control unit 125 may also cause the external terminal 4 to display comparison information for comparing the activity levels of different periods within the analysis target area R.
  • If the activity level determined by the activity level determination unit 124 satisfies a predetermined intervention condition (YES in S18), the output control unit 125 outputs the intervention information associated with that intervention condition to at least one of the local terminal 3 and the external terminal 4 (S19).
  • If the activity level determined by the activity level determination unit 124 does not satisfy the predetermined intervention condition (NO in S18), the voice analysis device 1 ends the process.
  • As described above, the voice analysis system S identifies the length of utterance per unit time at each position in the analysis target area R based on the sounds acquired by the sound collection devices 2 placed in the analysis target area R, and outputs the activity level corresponding to the length of utterance in association with each position within the analysis target area R. As a result, the voice analysis system S can visualize the length of utterances at each position within the analysis target area R, rather than the loudness of the voice, making it easier to analyze whether voice communication is actively taking place.
  • Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target area R so as to promote or suppress it. By associating and visualizing the activity levels of different periods, the voice analysis system S also makes it easier to analyze increases and decreases in the activity level and trends in the activity level for each time period.
  • As a modification, the reception unit 121 may accept a setting that designates an open space as the analysis target area R.
  • In this case, the voice acquisition unit 122 acquires the sounds emitted by animals, collected by each of the plurality of sound collection devices 2 arranged in the analysis target area R, which is an open space. The voice analysis device 1 then identifies the length of the utterances at each position, as in the embodiment described above, and outputs information corresponding to the activity level that corresponds to the length of the utterances.
  • The voice analysis system S can thereby easily analyze whether communication among animals, not only humans, is actively occurring in an open space.
  • In another modification, the voice analysis system S displays a heat map of the activity level for each type of voice, such as laughter or screaming, and outputs intervention information based on the activity level of each voice type.
  • the activation level determining unit 124 determines which of the plurality of audio types the audio used to determine the activation level is. Determine the audio type.
  • the types of sounds include, for example, laughter, screams, voices of specific emotions (anger, etc.), voices of specific animals (deer, bear, etc.), strange noises from machinery, and the like.
  • the activity determination unit 124 obtains, for example, a machine learning model that is stored in advance in the storage unit 11 and outputs information indicating which of a plurality of voice types the input voice is.
  • the machine learning model is configured by, for example, a neural network, and is generated by performing known machine learning processing using a plurality of voices and voice types as training data.
  • the activation level determining unit 124 determines the type of voice output by the machine learning model, and determines whether the voice used to determine the level of activity corresponds to the type of voice output by the machine learning model. Determine the type of voice to be used.
  • the sound collection device 2 may determine the voice type.
  • the sound collection device 2 inputs the collected sound into the above-mentioned machine learning model stored in advance in the storage unit of the sound collection device 2, thereby determining the type of sound output by the machine learning model. get.
  • the sound collection device 2 transmits to the voice analysis device 1, along with voice data indicating the collected voice, information indicating the type of the voice.
  • the activity level determination unit 124 determines the voice type to which the voice used to determine the activity level corresponds, based on the information indicating the voice type received from the sound collection device 2.
  • the output control unit 125 displays map information that associates each position within the analysis target region R with the degree of activity determined using one of the plurality of voice types on at least one of the local terminal 3 and the external terminal 4. Map information such as a heat map corresponding to one voice type is called a layer.
  • the output control unit 125 displays, on at least one of the local terminal 3 and the external terminal 4, map information representing the degree of activity determined using the voice of the voice type specified at that terminal. In this case, at least one of the local terminal 3 and the external terminal 4 displays a layer corresponding to any one voice type based on the information received from the voice analysis device 1.
  • the output control unit 125 may display a list of map information representing the degree of activity determined using the voices of each of the plurality of voice types on at least one of the local terminal 3 and the external terminal 4, for example.
  • at least one of the local terminal 3 and the external terminal 4 displays a plurality of layers corresponding to a plurality of voice types based on the information received from the speech analysis device 1.
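The layer concept above can be pictured with a small sketch, assuming a grid representation of the analysis target region R; the grid size and voice types are hypothetical.

```python
import numpy as np

# One activity grid (one "layer") per voice type over an assumed 40x30-cell
# division of the analysis target region R.
layers: dict[str, np.ndarray] = {t: np.zeros((30, 40)) for t in ("laughter", "scream")}

def add_utterance(voice_type: str, cell_xy: tuple[int, int], seconds: float) -> None:
    x, y = cell_xy
    layers[voice_type][y, x] += seconds      # accumulate utterance length per cell

def layer_for_display(voice_type: str) -> np.ndarray:
    return layers[voice_type]                # a single layer, e.g. the "laughter" map

def all_layers() -> list[tuple[str, np.ndarray]]:
    return list(layers.items())              # for a list view of every voice type
```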
  • FIGS. 10A and 10B are schematic diagrams of map information representing the degree of activity determined using one of a plurality of voice types.
  • FIG. 10A shows, as map information, a heat map H2 corresponding to the activity level determined using voices whose voice type is "laughter".
  • FIG. 10B shows, as map information, a heat map H3 corresponding to the activity level determined using voices whose voice type is "scream".
  • by visualizing the degree of activity of each voice type as map information, the speech analysis system S makes it easier to analyze what types of communication are actively occurring within the analysis target region R.
  • the output control unit 125 may output the intervention information associated with an intervention condition from at least one of the local terminal 3 and the external terminal 4 in response to the activation level of a voice type determined by the activation level determination unit 124 satisfying that intervention condition.
  • the output control unit 125 obtains, for example, intervention conditions and intervention information for each voice type that are stored in advance in the storage unit 11. Intervention conditions may be set for some of the plurality of voice types and not for others.
  • for example, the output control unit 125 may output no intervention information when the voice type is "laughter" but output intervention information when the voice type is "scream". Alternatively, the output control unit 125 may output intervention information for promoting communication in response to the activation level for the voice type "laughter" falling below a predetermined threshold, and output intervention information for suppressing communication in response to the activation level for the voice type "scream" being equal to or higher than a predetermined threshold.
  • the output control unit 125 determines whether the activity level of each of the plurality of voice types at each position in the analysis target region R satisfies the intervention condition for that voice type (for example, whether it is equal to or greater than the threshold value indicated by the intervention condition).
  • the output control unit 125 generates intervention information associated with the intervention condition in response to the degree of activity of each voice type at any position within the analysis target region R satisfying the intervention condition.
  • the output control unit 125 causes the generated intervention information to be output from at least one of the local terminal 3 and the external terminal 4.
  • the speech analysis system S can switch the presence or absence and content of intervention for each speech type by outputting intervention information in response to the degree of activity satisfying the condition set for each speech type.
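A minimal sketch of per-voice-type intervention conditions, assuming threshold rules and example messages that are not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class InterventionRule:
    threshold: float
    above: bool      # True: fire when activity >= threshold; False: when below
    message: str

# Assumed example rules; no rule at all may be set for some voice types.
RULES = {
    "laughter": InterventionRule(0.2, above=False, message="Conversation is quiet; consider an icebreaker."),
    "scream":   InterventionRule(0.8, above=True,  message="High scream activity; please check the area."),
}

def interventions(activity_by_type: dict[str, float]) -> list[str]:
    """Return the intervention messages triggered by the given activity levels."""
    fired = []
    for voice_type, activity in activity_by_type.items():
        rule = RULES.get(voice_type)
        if rule is None:
            continue  # no intervention condition set for this voice type
        if (activity >= rule.threshold) == rule.above:
            fired.append(rule.message)
    return fired
```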
  • the audio analysis system S determines the degree of activity using a captured image generated by imaging the analysis target region R with an imaging device, and also controls the orientation of the imaging device.
  • the following description focuses mainly on the parts that differ from the above-described embodiment.
  • FIG. 11 is a block diagram of the speech analysis system S according to this modification.
  • the control unit 12 of the speech analysis device 1 according to this modification further includes an image acquisition unit 127 in addition to the units shown in FIG. 2.
  • the image acquisition unit 127 acquires a captured image generated by capturing a part of the analysis target region R from the imaging device 5.
  • the imaging device 5 is arranged at a position where it can image a part of the analysis target region R, for example.
  • the imaging device 5 has a drive unit for changing the direction of the imaging device 5.
  • the imaging device 5 can image any position within the analysis target region R by changing the direction using the drive unit.
  • the imaging device 5 may change its orientation according to control information output from the output control unit 125, described later, or may change its orientation automatically.
  • the imaging device 5, periodically or in response to receiving an imaging instruction from the voice analysis device 1, generates a captured image whose imaging range is the direction in which the imaging device 5 is facing by capturing a part of the analysis target region R, and transmits the generated captured image to the voice analysis device 1 via wired or wireless communication.
  • the image acquisition unit 127 receives the captured image from the imaging device 5 and stores the received captured image in the storage unit 11.
  • the output control unit 125 outputs control information for changing the orientation of the imaging device 5 based on the activity level determined by the activity level determination unit 124.
  • FIG. 12 is a schematic diagram for explaining a process for changing the direction of the imaging device 5 based on the degree of activity.
  • FIG. 12 schematically shows the position and orientation of the imaging device 5 in the heat map H of the analysis target region R.
  • the output control unit 125 identifies a position within the analysis target region R where the degree of activity satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold) as a target position L.
  • the output control unit 125 transmits control information for directing the imaging device 5 to the target position L to the imaging device 5.
  • the control information is, for example, information indicating the relative position of the target position L with respect to the position of the imaging device 5.
  • the imaging device 5 changes its direction so as to face the target position L using the drive unit according to the control information received from the voice analysis device 1.
  • the imaging device 5 generates a captured image by capturing the direction in which the imaging device 5 is facing as an imaging range, and transmits the generated captured image to the voice analysis device 1 via wired or wireless communication.
  • the voice analysis device 1 may transmit such control information sequentially.
  • the audio analysis system S can point the imaging device 5 at a position where the degree of activity satisfies a predetermined condition and capture an image there, making it possible to preferentially acquire captured images of positions within the analysis target region R that are likely to be of high importance or urgency.
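For illustration, a sketch of how the control information could be derived from an activity grid, assuming the target position L is the highest-activity cell and the control information is a relative position plus a pan angle (the cell size and threshold are hypothetical):

```python
import math
import numpy as np

def control_info(activity: np.ndarray, camera_xy: tuple[float, float],
                 cell_size: float, threshold: float):
    """Return (dx, dy, pan_deg) pointing the camera at the highest-activity
    cell whose activity meets the threshold, or None if no cell qualifies."""
    y, x = np.unravel_index(np.argmax(activity), activity.shape)
    if activity[y, x] < threshold:
        return None
    # Target position L in region coordinates (cell centers).
    target = ((x + 0.5) * cell_size, (y + 0.5) * cell_size)
    dx, dy = target[0] - camera_xy[0], target[1] - camera_xy[1]
    return dx, dy, math.degrees(math.atan2(dy, dx))  # relative position and pan angle
```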
  • the output control unit 125 may output control information for moving a movable device such as a robot or a drone to the target position L instead of, or in addition to, the control information for directing the imaging device 5 toward the target position L.
  • the movable device moves to the target position L according to the control information received from the speech analysis device 1.
  • the voice analysis system S can thus automatically dispatch robots, drones, and the like to positions within the analysis target region R that are likely to be of high importance or urgency.
  • the activity determination unit 124 may identify the number of users at each position within the analysis target region R based on the captured image acquired by the image acquisition unit 127, and reflect the identified number of users in the information output by the output control unit 125.
  • the activity level determination unit 124 calculates the activity level per user, for example.
  • the activity determining unit 124 specifies the number of users at each position within the analysis target region R, for example, by performing known image recognition processing on the captured image.
  • the activity determination unit 124 may specify only the number of users at a position where the activity satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold).
  • the activity determining unit 124 may identify the number of users at each position within the analysis target region R using the positions of communication devices, such as smartphones, owned by the users instead of the captured image. In this case, the activity determining unit 124 acquires from a user's communication device, for example, information indicating the reception strength of the signal emitted by each of a plurality of transmitters, such as beacons, placed in the analysis target region R. The activity determining unit 124 identifies the user's position using the relationship between the positions of the plurality of transmitters and the reception strength, at the user's communication device, of the signal emitted by each of the plurality of transmitters. The activity determining unit 124 then specifies the number of users at each position within the analysis target region R by totaling the positions of the one or more identified users.
  • the method is not limited to those shown here; the activity determining unit 124 may specify the number of users at each position within the analysis target region R using other methods.
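As one concrete stand-in for the beacon-based localization described above (the disclosure does not fix a method), an RSSI-weighted centroid can be sketched as follows; the dBm-to-weight conversion is an assumption:

```python
import numpy as np

def estimate_position(beacons: dict[str, tuple[float, float]],
                      rssi_dbm: dict[str, float]) -> tuple[float, float]:
    """Estimate a device position as an RSSI-weighted centroid of beacon
    positions; assumes at least one beacon was heard."""
    positions, weights = [], []
    for beacon_id, pos in beacons.items():
        if beacon_id not in rssi_dbm:
            continue
        positions.append(pos)
        # Convert dBm to a linear weight; -40 dBm -> 1e-4 mW, -80 dBm -> 1e-8 mW,
        # so stronger (less negative) RSSI pulls the estimate toward that beacon.
        weights.append(10 ** (rssi_dbm[beacon_id] / 10))
    p, w = np.array(positions), np.array(weights)
    return tuple(np.average(p, axis=0, weights=w))
```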
  • the activity determining unit 124 calculates the activity per person by dividing the activity at each position by the number of people specified for that position.
  • the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the degree of activity per person. Thereby, the speech analysis system S can visualize the level of excitement per user instead of the level of excitement for the entire group including a plurality of users.
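A minimal sketch of the per-person activity computation, assuming the activity and headcount are held in equally shaped grids:

```python
import numpy as np

def activity_per_person(activity: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Divide each cell's activity by its headcount; cells with no users get 0."""
    out = np.zeros_like(activity, dtype=float)
    np.divide(activity, counts, out=out, where=counts > 0)
    return out
```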
  • the output control unit 125 may display, on at least one of the local terminal 3 and the external terminal 4, map information that associates the activity determined by the activity determination unit 124 with each position in the analysis target region R, excluding positions where the number of users is 0 or 1.
  • the voice analysis system S can thereby display the users' activity levels as map information while excluding users' soliloquies, mechanical sounds, natural sounds, and the like, making the map information easier to view.
  • the processor of the speech analysis device 1 executes each step (process) included in the speech analysis method shown in FIG. 9 by executing a program for the speech analysis method shown in FIG. 9. Some of the steps included in the speech analysis method shown in FIG. 9 may be omitted, the order of the steps may be changed, or a plurality of steps may be performed in parallel.
  • S Speech analysis system
  • 1 Speech analysis device
  • 11 Storage section
  • 12 Control section
  • 121 Reception section
  • 122 Voice acquisition section
  • 123 Identification section
  • 124 Activity level determination section
  • 125 Output control section
  • 126 Call control section
  • 127 Image acquisition section
  • 2 Sound collection device
  • 3 Local terminal
  • 4 External terminal
  • 5 Imaging device

Abstract

A voice analysis device 1 according to an embodiment of the present invention has: a voice acquisition unit 122 for acquiring the arrival direction of a voice for each of a plurality of sound collecting devices arranged in a prescribed region; an identifying unit 123 for identifying an utterance position at which an utterance was made using a relationship of a plurality of arrival directions to the plurality of sound collecting devices, and identifying the utterance length per unit time at the position thereof using a relationship between the utterance position and each position in a region; and an output control unit 125 for causing an information terminal to display map information in which each position in the region and a degree of activity corresponding to the utterance length per unit time at the position thereof are associated.

Description

Speech analysis device, speech analysis method, and speech analysis program
The present invention relates to a speech analysis device, a speech analysis method, and a speech analysis program for analyzing speech.
Patent Document 1 discloses a system that extracts sounds that meet predetermined conditions from a spectrogram representing acoustics in a space and displays the sound pressure for each direction in which the extracted sounds are present.
Patent Document 1: JP 2021-152573 A
In companies and schools, there is a need to analyze whether communication between people is taking place actively. The loudness of human voices varies from person to person and also changes depending on the location and situation, so it may be difficult for an analyst to determine whether communication is active by referring to sound pressure or volume.
The present invention has been made in view of these points, and its purpose is to make it easier to analyze whether or not voice communication is actively taking place.
A voice analysis device according to a first aspect of the present invention includes: a voice acquisition unit that acquires a direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; an identification unit that identifies an utterance position where an utterance was made using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and identifies the length of the utterance per unit time at each position using the relationship between the utterance position and each position in the area; and an output control unit that causes an information terminal to display map information that associates each position in the area with a degree of activity corresponding to the length of the utterance per unit time at that position.
The voice analysis device may further include a reception unit that receives a setting of an object area in which an object is located within the area, and when a straight line along the arrival direction intersects the object area, the identification unit may identify the position where the utterance was made by excluding the portion of the arrival direction that is farther than the object area with reference to the position of the sound collection device.
The map information may be information in which information corresponding to the degree of activity is superimposed on a map representing the area.
The map information may be information in which information corresponding to the degree of activity and information indicating the positions of one or more call terminals arranged in the area are superimposed on a map representing the area, and the voice analysis device may further include a call control unit that starts the exchange of audio between the selected call terminal and the information terminal in response to any of the one or more call terminals being selected in the map information displayed on the information terminal.
The output control unit may output intervention information associated with a predetermined condition to the information terminal in response to the degree of activity at a position within the area satisfying the condition.
The voice analysis device may further include a reception unit that receives, from the information terminal, settings of the condition and the intervention information associated with the condition.
The identification unit may identify a temporal change in the position where the utterance was made as a locus of movement of the position where the utterance was made, and the output control unit may cause the information terminal to display information including the locus of movement.
The output control unit may cause the information terminal to display, in association with each other, the degree of activity in a first period in a sub-area that is at least a part of the area and the degree of activity in a second period in the sub-area.
The identification unit may estimate the number of people who made the utterance at each position within the area by recognizing one or more speakers who uttered each of the plurality of voices acquired from the plurality of sound collection devices, and the voice analysis device may further include an activity level determination unit that calculates a provisional degree of activity using the length of the utterance per unit time and determines the degree of activity by correcting the provisional degree of activity according to the number of people.
The activity level determination unit may make the degree of activity when the number of people is two or more larger than the degree of activity when the number of people is one.
The output control unit may repeatedly cause the information terminal to display the map information including the degree of activity determined at predetermined time intervals.
The voice analysis device may further include an activity level determination unit that determines which of a plurality of types the voice used to determine the degree of activity belongs to, and the output control unit may cause the information terminal to display the map information in which each position in the area is associated with the degree of activity determined using the voice of one of the plurality of types.
The output control unit may output, to an imaging device that images a part of the area, control information for directing the imaging device to a position within the area where the degree of activity satisfies a predetermined condition.
A voice analysis method according to a second aspect of the present invention includes the steps, executed by a processor, of: acquiring a direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying an utterance position where an utterance was made using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and identifying the length of the utterance per unit time at each position using the relationship between the utterance position and each position in the area; and causing an information terminal to display map information that associates each position in the area with a degree of activity corresponding to the length of the utterance per unit time at that position.
A voice analysis program according to a third aspect of the present invention causes a processor to execute the steps of: acquiring a direction of arrival of voice at each of a plurality of sound collection devices arranged in a predetermined area; identifying an utterance position where an utterance was made using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and identifying the length of the utterance per unit time at each position using the relationship between the utterance position and each position in the area; and causing an information terminal to display map information that associates each position in the area with a degree of activity corresponding to the length of the utterance per unit time at that position.
According to the present invention, it is possible to more easily analyze whether or not voice communication is actively taking place.
  • A schematic diagram of the speech analysis system according to the embodiment.
  • A block diagram of the speech analysis system according to the embodiment.
  • A schematic diagram for explaining the relationship between the analysis target area, the sound collection devices, and the local terminal.
  • A schematic diagram for explaining the method by which the voice acquisition unit acquires the direction of arrival of voice.
  • A schematic diagram for explaining the method by which the identification unit identifies the utterance position.
  • A schematic diagram for explaining the relationship between the direction of arrival and the object area.
  • A schematic diagram of a local terminal outputting map information and intervention information.
  • A schematic diagram of a local terminal outputting intervention information.
  • A schematic diagram of an external terminal displaying comparison information.
  • A schematic diagram of an external terminal displaying comparison information.
  • A schematic diagram of an external terminal displaying movement information.
  • A flowchart of an exemplary speech analysis method executed by the speech analysis device according to the embodiment.
  • A schematic diagram of map information representing the degree of activity determined using one of a plurality of voice types.
  • A schematic diagram of map information representing the degree of activity determined using one of a plurality of voice types.
  • A block diagram of the speech analysis system according to a third modification.
  • A schematic diagram for explaining processing for changing the orientation of the imaging device based on the degree of activity.
[Overview of speech analysis system S]
FIG. 1 is a schematic diagram of the speech analysis system S according to this embodiment. The speech analysis system S includes a speech analysis device 1, a sound collection device 2, a local terminal 3, and an external terminal 4. The speech analysis system S may include a plurality of sound collection devices 2, a plurality of local terminals 3, and a plurality of external terminals 4. The speech analysis system S may also include other devices such as servers and terminals.
The voice analysis device 1 is a computer that analyzes the voice uttered by users in a predetermined analysis target region R and provides the analysis results to the users or to external users. The analysis target region R is, for example, a room in a company or public facility, a library, a classroom in a school or cram school, an event venue, a park, or the like. A user is a person who stays in the analysis target region R and speaks for the purpose of conversation or the like. An external user is a person outside the analysis target region R, for example an analyst. The voice analysis device 1 analyzes the voice acquired by the sound collection devices 2 and outputs the analysis results to the local terminal 3 or the external terminal 4. The voice analysis device 1 is connected to the sound collection devices 2, the local terminal 3, and the external terminal 4 by wire or wirelessly via a network such as a local area network or the Internet.
The sound collection device 2 is a device that is placed in the analysis target region R and acquires the voices uttered by users. The sound collection device 2 includes, for example, a microphone array including a plurality of sound collecting units, such as microphones, arranged to face in different directions. The microphone array includes, for example, a plurality of (e.g., eight) microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground. The voice analysis device 1 identifies the position where an utterance was made by estimating the direction of arrival of the voice at each of the plurality of sound collection devices 2 based on the voices collected using the microphone arrays. The sound collection device 2 transmits the voice acquired using the microphone array to the voice analysis device 1 as voice data.
The sound collection device 2 may include a single microphone instead of a microphone array. In this case, a plurality of sound collection devices 2 are arranged at predetermined intervals in the analysis target region R. The voice analysis device 1 identifies the position where an utterance was made by comparing the intensities of the voices acquired by the plurality of sound collection devices 2.
The local terminal 3 is an information terminal that is installed in the analysis target region R and outputs information. The local terminal 3 is, for example, a tablet terminal, a personal computer, or a digital signage device. The local terminal 3 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone. The local terminal 3 displays the information received from the voice analysis device 1 on the display unit or outputs it from the audio output unit. The local terminal 3 may function as a call terminal for making calls with the external terminal 4.
The external terminal 4 is an information terminal that receives settings related to the analysis and outputs information. The external terminal 4 is, for example, a smartphone, a tablet terminal, or a personal computer. The external terminal 4 includes, for example, a display unit such as a liquid crystal display, an audio output unit such as a speaker, and a sound collection unit such as a microphone. The external terminal 4 displays the information received from the voice analysis device 1 on its display unit.
An overview of the process by which the voice analysis system S according to this embodiment analyzes voice is described below. The voice analysis device 1 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R. The voice analysis device 1 uses the acquired voices to identify the positions where utterances were made. The voice analysis device 1 identifies the length of utterance per unit time at each position in the analysis target region R by aggregating, for each time interval, where in the analysis target region R the utterance positions were located.
The voice analysis device 1 calculates the degree of activity corresponding to the identified length of utterance per unit time. The degree of activity is, for example, a value that is larger as the length of utterance per unit time is longer and smaller as the length of utterance per unit time is shorter. The voice analysis device 1 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the degree of activity.
In this way, the voice analysis system S identifies the length of utterance per unit time at each position within the analysis target region R based on the voices acquired by the sound collection devices 2 placed in the analysis target region R, and outputs the degree of activity corresponding to the length of utterance in association with each position within the analysis target region R. Since the voice analysis system S can thereby visualize the length of utterances at each position within the analysis target region R rather than the loudness of the voices, it makes it easier to analyze whether vocal communication is actively taking place.
[Configuration of speech analysis system S]
FIG. 2 is a block diagram of the speech analysis system S according to this embodiment. In FIG. 2, the arrows indicate the main data flows, and there may be data flows other than those shown in FIG. 2. FIG. 2 shows the configuration in functional units rather than in hardware (device) units. Therefore, the blocks shown in FIG. 2 may be implemented within a single device, or may be implemented separately in a plurality of devices. Data may be exchanged between the blocks via any means, such as a data bus, a network, or a portable storage medium.
The speech analysis device 1 includes a storage unit 11 and a control unit 12. The speech analysis device 1 may be configured by two or more physically separate devices connected by wire or wirelessly. The speech analysis device 1 may also be configured as a cloud, which is a collection of computer resources.
The storage unit 11 is a computer-readable, non-transitory storage medium including a ROM (Read Only Memory), a RAM (Random Access Memory), a hard disk drive, and the like. The storage unit 11 stores in advance the programs to be executed by the control unit 12. The storage unit 11 may be provided outside the speech analysis device 1, in which case it may exchange data with the control unit 12 via a network.
The control unit 12 includes a reception unit 121, a voice acquisition unit 122, an identification unit 123, an activity level determination unit 124, an output control unit 125, and a call control unit 126. The control unit 12 is a processor such as a CPU (Central Processing Unit), and functions as the reception unit 121, the voice acquisition unit 122, the identification unit 123, the activity level determination unit 124, the output control unit 125, and the call control unit 126 by executing the programs stored in the storage unit 11. At least some of the functions of the control unit 12 may be realized by the control unit 12 executing a program received via a network.
The processing executed by the speech analysis system S is described in detail below. FIG. 3 is a schematic diagram for explaining the relationship between the analysis target region R, the sound collection devices 2, and the local terminal 3. A plurality of sound collection devices 2 and one or more local terminals 3 are arranged in the analysis target region R.
The reception unit 121 receives settings of the analysis target region R, the positions of the sound collection devices 2 and the local terminal 3 within the analysis target region R, and the object region where an object (obstacle) such as a wall is located within the analysis target region R. The external terminal 4 receives, for example, an operation from an external user specifying the analysis target region R, the positions of the sound collection devices 2 and the local terminal 3, and the object region, and transmits information indicating the specified contents to the voice analysis device 1. In the voice analysis device 1, the reception unit 121 causes the storage unit 11 to store information associating the analysis target region R, the positions of the sound collection devices 2 and the local terminal 3, and the object region, based on the information received from the external terminal 4.
The reception unit 121 may also receive settings of sub-regions included in the analysis target region R. A sub-region is a region that is at least a part of the analysis target region R and is of interest during the analysis. In the example of FIG. 3, a coffee area including a coffee machine, a desk area including desks, a sofa area including sofas, and the like may be set as sub-regions. The external terminal 4 receives, for example, an operation from an external user specifying a sub-region within the analysis target region R and the name of the sub-region, and transmits information indicating the specified contents to the voice analysis device 1. In the voice analysis device 1, the reception unit 121 causes the storage unit 11 to store information associating the sub-region with the name of the sub-region, based on the information received from the external terminal 4.
The reception unit 121 may also receive settings of intervention conditions used to determine whether or not to output intervention information. An intervention condition is, for example, that the degree of activity corresponding to the length of utterance per unit time, determined by the activity level determination unit 124, is equal to or greater than a predetermined threshold. The intervention information is, for example, a message containing the name of the sub-region that satisfied the intervention condition. The external terminal 4 receives, for example, an operation from an external user specifying the intervention condition and the intervention information, and transmits information indicating the specified contents to the voice analysis device 1. In the voice analysis device 1, the reception unit 121 causes the storage unit 11 to store information associating the intervention condition with the intervention information, based on the information received from the external terminal 4.
The voice acquisition unit 122 acquires the voices collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R. The sound collection device 2 transmits, for example, voice data representing the voices collected using the microphone array to the voice analysis device 1. The sound collection device 2 transmits the voice data to the voice analysis device 1 continuously, or transmits the voice data for a predetermined period (one hour, one day, etc.) to the voice analysis device 1 in a batch. In the voice analysis device 1, the voice acquisition unit 122 causes the storage unit 11 to store the voice data received from the sound collection devices 2 and acquires the voices indicated by the voice data.
The voice acquisition unit 122 may perform predetermined filtering processing on the acquired voices. For example, the voice acquisition unit 122 may remove, from the acquired voices, voices collected during a period different from the period associated in advance with the analysis target region R (such as the business hours of a company or public facility). The voice acquisition unit 122 may also remove, from the acquired voices, sounds other than those emitted by humans (such as sounds outside the frequency band corresponding to human voices). This allows the voice analysis device 1 to perform the analysis while excluding sounds that are not important to the analysis, improving the accuracy of the analysis results.
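For illustration, the two filters described above might be sketched as follows, assuming business hours of 9:00-18:00 and a 300-3400 Hz human-voice band, neither of which is specified in the disclosure:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def keep_business_hours(t_start_h: float, open_h: float = 9.0, close_h: float = 18.0) -> bool:
    """Keep a segment only if it was recorded inside the assumed business hours."""
    return open_h <= t_start_h < close_h

def bandpass_voice(x: np.ndarray, fs: float) -> np.ndarray:
    """Attenuate energy outside an assumed human-voice band (300-3400 Hz)."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, x)
```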
The voice acquisition unit 122 acquires the direction of arrival of the voice collected by each of the plurality of sound collection devices 2 for each time interval (for example, every 10 to 1000 milliseconds). The voice acquisition unit 122 performs, for example, known sound source localization processing on the multi-channel voice collected by the microphone array of the sound collection device 2. Sound source localization processing is processing that estimates the position of the sound source included in the voice acquired by the voice acquisition unit 122. Through the sound source localization processing, the voice acquisition unit 122 acquires a reliability distribution indicating, with reference to the position of the sound collection device 2, the distribution of the reliability that a sound source is present. The reliability is a value corresponding to the likelihood that a sound source is present at that position, and may be, for example, a probability. The reliability distribution represents the direction of arrival of the voice with respect to the sound collection device 2.
FIG. 4A is a schematic diagram for explaining how the voice acquisition unit 122 acquires the direction of arrival of voice. The example in FIG. 4A shows the reliability distributions P acquired by the voice acquisition unit 122 based on the voices collected by each of three sound collection devices 2.
The vertical and horizontal axes of the reliability distribution P correspond to coordinates within the analysis target region R. In the reliability distribution P, the brighter the color of a position (the closer to white), the higher the reliability that a sound source is present there, and the darker the color of a position (the closer to black), the lower the reliability that a sound source is present there.
Since a microphone array cannot determine the distance from the sound collection device 2 to the sound source, regions of equal reliability are distributed linearly (radially) from the sound collection device 2 in the reliability distribution P. Because the reliability that a sound source is present is high along the straight line connecting the sound collection device 2 and the sound source, a linear region in the reliability distribution P where the reliability is equal to or greater than a predetermined value indicates the direction of arrival D of the voice with respect to the sound collection device 2. The direction of arrival D is not limited to a straight line passing through the position of the sound collection device 2, and may be expressed as a region having a width of a predetermined angle or length with reference to the position of the sound collection device 2.
In this embodiment, the voice analysis device 1 estimates the direction of arrival D, but each of the plurality of sound collection devices 2 may instead estimate the direction of arrival D based on the voice it acquires using its microphone array. In this case, in the voice analysis device 1, the voice acquisition unit 122 receives, from each of the plurality of sound collection devices 2, information indicating the direction of arrival D estimated by that sound collection device 2.
The identification unit 123 identifies, for each time interval (for example, every 10 to 1000 milliseconds), the utterance position, which is the position within the analysis target region R where an utterance was made, based on the plurality of directions of arrival D with respect to the plurality of sound collection devices 2. FIG. 4B is a schematic diagram for explaining how the identification unit 123 identifies the utterance position.
The identification unit 123 superimposes the plurality of reliability distributions P generated from the voices collected by the plurality of sound collection devices 2. The identification unit 123 superimposes the plurality of reliability distributions P by, for example, calculating the sum or product of the reliabilities indicated by the plurality of reliability distributions P at each position within the analysis target region R. FIG. 4B shows a reliability distribution P1 generated by superimposing the three reliability distributions P illustrated in FIG. 4A.
The identification unit 123 identifies the utterance position using the reliability distribution P1 obtained by superimposing the plurality of reliability distributions P. The utterance position may be represented by a single point within the analysis target region R, or by a region within the analysis target region R. The identification unit 123 identifies, for example, a position or region in the reliability distribution P1 where the reliability is equal to or greater than a predetermined value as the utterance position.
A position where the plurality of directions of arrival D indicated by the plurality of reliability distributions P intersect is a position of high reliability in the reliability distribution P1 obtained by superimposing the plurality of reliability distributions P. Therefore, the identification unit 123 may identify the intersection position D1 where a plurality of straight lines along the plurality of directions of arrival D intersect as the utterance position. When the direction of arrival D is a region having a width, the intersection position D1 may be a region where a plurality of regions extending along the plurality of directions of arrival D intersect.
In this way, since the voice analysis device 1 identifies the utterance position based on the directions of arrival D of the voice with respect to the plurality of sound collection devices 2, it can identify the utterance position with high accuracy even though the distance from a single sound collection device 2 to the sound source cannot be determined.
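A sketch of the superimposition described above, assuming each direction of arrival D is rasterized as a Gaussian ridge of reliability along a ray and the distributions are combined by their product; the rasterization itself is an assumption:

```python
import numpy as np

def ray_reliability(shape, mic_xy, angle_rad, sigma=1.5) -> np.ndarray:
    """Rasterize one direction of arrival D as a grid of reliabilities that
    decay with distance from the ray through the sound collection device."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - mic_xy[0], ys - mic_xy[1]
    # Distance along the ray (forward side only) and perpendicular to it.
    along = dx * np.cos(angle_rad) + dy * np.sin(angle_rad)
    perp = -dx * np.sin(angle_rad) + dy * np.cos(angle_rad)
    rel = np.exp(-(perp ** 2) / (2 * sigma ** 2))
    return np.where(along > 0, rel, 0.0)

def utterance_position(reliability_maps: list[np.ndarray]) -> tuple[int, int]:
    """Superimpose per-device maps (here by product) and return the peak cell."""
    combined = np.prod(np.stack(reliability_maps), axis=0)
    y, x = np.unravel_index(np.argmax(combined), combined.shape)
    return x, y
```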
The identification unit 123 may identify the utterance position in consideration of the object region where an object is located within the analysis target region R, as received by the reception unit 121. FIG. 5 is a schematic diagram for explaining the relationship between the direction of arrival D and the object region R2. The example in FIG. 5 shows a state in which the object region R2 lies partway along the direction of arrival D.
When a straight line along the direction of arrival D intersects the object region R2, the identification unit 123 identifies the utterance position while excluding the portion of the direction of arrival D that is farther than the object region R2 with reference to the position of the sound collection device 2. The identification unit 123 identifies as the utterance position, for example, the intersection position D1 where a line segment between the sound collection device 2 and the object region R2 along a first direction of arrival among the plurality of directions of arrival D intersects a line segment between another sound collection device 2 and the object region R2 along a second direction of arrival among the plurality of directions of arrival D. This prevents the voice analysis device 1 from erroneously recognizing that a sound source is located beyond an obstacle such as a wall, improving the accuracy of the utterance position.
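The exclusion of the portion beyond the object region R2 can be sketched as clipping each ray at the first object cell, assuming a boolean grid `objects` marking the object region:

```python
import numpy as np

def clip_ray_at_object(reliability: np.ndarray, mic_xy: tuple[float, float],
                       angle_rad: float, objects: np.ndarray) -> np.ndarray:
    """Zero the reliability of cells farther along the ray than the first
    object cell, walking outward from the sound collection device."""
    out = reliability.copy()
    h, w = out.shape
    x, y = mic_xy
    blocked = False
    for _ in range(max(h, w) * 2):               # step cell by cell along the ray
        x += np.cos(angle_rad) * 0.5
        y += np.sin(angle_rad) * 0.5
        ix, iy = int(round(x)), int(round(y))
        if not (0 <= ix < w and 0 <= iy < h):
            break
        if blocked:
            out[iy, ix] = 0.0                    # beyond the object region: exclude
        elif objects[iy, ix]:
            blocked = True                       # hit the object region R2
    return out
```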
In addition to identifying the utterance position, the identification unit 123 may estimate the number of users who spoke at the utterance position. The identification unit 123 performs processing to emphasize the voice from the direction of arrival D on each of the plurality of voices that the voice acquisition unit 122 acquired from the plurality of sound collection devices 2. The identification unit 123 emphasizes the voice from the direction of arrival D, for example, by suppressing sound input to the microphone array of the sound collection device 2 from directions different from the direction of arrival D.
The identification unit 123 recognizes one or more speakers who uttered each of the plurality of voices by performing known speaker recognition processing on each of the plurality of voices in which the voice from the direction of arrival D has been emphasized. The identification unit 123 recognizes one or more speakers corresponding to one or more generated clusters, for example, by clustering the voice divided into predetermined periods using deep learning.
The identification unit 123 then estimates, among the one or more speakers who uttered each of the plurality of voices, the one or more speakers common to all of the voices as the users who spoke at the utterance position. The identification unit 123 causes the storage unit 11 to store, for each time interval, information associating the utterance position with the users who spoke at the utterance position. This allows the voice analysis device 1 to exclude speakers who spoke at positions different from the utterance position and to estimate the users who spoke at the utterance position with high accuracy.
The identification unit 123 may identify the utterance position using voices collected by a plurality of sound collection devices 2 each having a single microphone, instead of voices collected by a plurality of sound collection devices 2 each having a microphone array. In this case, the plurality of sound collection devices 2 are arranged at predetermined intervals in the analysis target region R. When a user speaks within the analysis target region R, each sound collection device 2 acquires the voice with higher intensity the closer it is to the user and with lower intensity the farther it is from the user.
The identification unit 123 compares the intensities of the voices acquired at the same time by the plurality of sound collection devices 2, and identifies the utterance position based on the position of the sound collection device 2 that acquired the voice with the highest intensity, or the positions of the plurality of sound collection devices 2 that acquired the voice with an intensity equal to or greater than a threshold. This allows the voice analysis device 1 to identify the utterance position even when using sound collection devices 2 that do not have a microphone array.
The identification unit 123 identifies the length of utterance per unit time at each position within the analysis target region R based on the utterance positions identified for each time interval. For example, for each position within the analysis target region R (for example, each rectangular region obtained by dividing the analysis target region R), the identification unit 123 totals the time during which an utterance position was present at that position within a unit time (for example, one minute). For example, if an utterance position was present at a certain position for 30 seconds of the minute preceding the current time, the length of utterance per unit time at that position is 30 seconds.
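A minimal sketch of this aggregation, assuming utterance positions are identified every 0.5 seconds and mapped to grid cells:

```python
from collections import defaultdict

FRAME_SEC = 0.5        # assumed interval at which utterance positions are identified
UNIT_TIME_SEC = 60.0   # unit time over which utterance length is totalled

def utterance_length_per_cell(events, now):
    """events: iterable of (timestamp_sec, cell) pairs, one per analysis frame.

    Returns {cell: seconds of utterance within the last unit time}."""
    totals = defaultdict(float)
    for t, cell in events:
        if now - UNIT_TIME_SEC <= t <= now:
            totals[cell] += FRAME_SEC   # each frame contributes its duration
    return dict(totals)
```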
 The activity determination unit 124 determines, for each position within the analysis target region R, an activity level corresponding to the length of utterance per unit time identified by the identification unit 123. For example, the activity determination unit 124 determines as the activity level a value that is larger the longer, and smaller the shorter, the length of utterance per unit time identified by the identification unit 123. The activity determination unit 124 may, for example, determine the value of the length of utterance per unit time itself as the activity level, or may determine as the activity level a value obtained by converting the length of utterance per unit time according to a predetermined rule.
 The activity determination unit 124 may also determine the activity level in consideration of the number of users who spoke at the utterance position. In this case, the activity determination unit 124 first calculates a provisional activity level, that is, a value that is larger the longer, and smaller the shorter, the length of utterance per unit time identified by the identification unit 123.
 The activity determination unit 124 then calculates the activity level by correcting the provisional activity level according to the number of people identified by the identification unit 123. For example, the activity determination unit 124 corrects the provisional activity level so that the activity level when there are multiple people is greater than the activity level when there is one person. The voice analysis device 1 can thereby reflect the number of people estimated from the voices in the activity level.
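 A minimal sketch of the headcount correction follows; the boost factor is an illustrative assumption, since the disclosure only requires that multiple speakers yield a larger activity level than a single speaker.

```python
# Provisional activity is the utterance length per unit time, boosted when
# more than one speaker is present at the position.
def activity_level(utterance_sec_per_min, n_speakers, multi_speaker_boost=1.5):
    provisional = utterance_sec_per_min          # longer speech -> larger value
    if n_speakers >= 2:
        return provisional * multi_speaker_boost  # conversations rank above monologues
    return provisional

print(activity_level(30.0, 1))  # 30.0
print(activity_level(30.0, 3))  # 45.0
```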
 The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the activity level determined by the activity determination unit 124. For example, the output control unit 125 generates, as the map information, a heat map in which information (color, pattern, etc.) corresponding to the activity level at each position within the analysis target region R is superimposed on a map representing the analysis target region R. In addition to the activity level at each position within the analysis target region R, the output control unit 125 may generate map information that also indicates the positions of the plurality of sound collection devices 2 arranged in the analysis target region R. The output control unit 125 transmits the generated map information to at least one of the local terminal 3 and the external terminal 4.
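 A minimal rendering sketch follows, assuming the activity levels are already aggregated onto a grid; the random data, grid size, and color map are illustrative assumptions, not the disclosed format.

```python
import numpy as np
import matplotlib.pyplot as plt

# Heat map of activity per grid cell, with sound collection device
# positions overlaid as markers.
activity = np.random.rand(10, 15)               # activity per grid cell
device_positions = [(2, 3), (7, 11), (5, 6)]    # (row, col) of each device

fig, ax = plt.subplots()
im = ax.imshow(activity, cmap="hot", origin="lower")  # color encodes activity
ax.scatter([c for _, c in device_positions],
           [r for r, _ in device_positions],
           marker="^", c="cyan", label="sound collection device")
fig.colorbar(im, label="activity level")
ax.legend()
plt.show()
```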
 It is desirable that the output control unit 125 repeatedly cause at least one of the local terminal 3 and the external terminal 4 to display map information indicating the activity levels determined by the activity determination unit 124 at predetermined time intervals. The voice analysis system S can thereby notify users or external users of the latest communication status in the analysis target region R.
 Whether a high activity level is regarded as positive or negative, and whether a low activity level is regarded as positive or negative, depends on the type of the analysis target region R. For example, in an analysis target region R where quietness is desirable (such as a library, or a school or cram school classroom where students should be quiet during classes or tests), a high activity level may be regarded as negative, or a low activity level may be regarded as positive.
 The output control unit 125 may also, in response to an activity level determined by the activity determination unit 124 satisfying a predetermined intervention condition, cause at least one of the local terminal 3 and the external terminal 4 to output intervention information associated with that intervention condition. For example, the output control unit 125 acquires from the storage unit 11 the intervention conditions and intervention information received by the reception unit 121. The output control unit 125 determines whether the activity level at each position within the analysis target region R satisfies an intervention condition (for example, whether it is equal to or greater than a threshold indicated by the intervention condition).
 In response to the activity level at any position within the analysis target region R satisfying an intervention condition, the output control unit 125 generates the intervention information associated with that intervention condition. For example, the output control unit 125 generates as intervention information a message including the name of the sub-region containing the position whose activity level satisfied the intervention condition ("The coffee area is lively", "Please be quiet in the library", etc.). Whether to generate intervention information with positive content (for example, information for promoting communication) or with negative content (for example, information for suppressing communication) when the activity level is high may be determined according to the type of the analysis target region R. As described above, in an analysis target region R where quietness is desirable, for example, a high activity level may be regarded as negative, and intervention information with negative content may be generated.
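 A minimal sketch of checking intervention conditions per sub-region and producing the associated message follows; the condition structure and message strings are illustrative assumptions.

```python
# Each rule pairs a sub-region with a threshold condition and the
# intervention message to emit when the condition is satisfied.
conditions = [
    {"sub_region": "coffee area", "op": "ge", "threshold": 40.0,
     "message": "The coffee area is lively"},
    {"sub_region": "library",     "op": "ge", "threshold": 10.0,
     "message": "Please be quiet in the library"},
]

def check_interventions(activity_by_sub_region):
    out = []
    for c in conditions:
        level = activity_by_sub_region.get(c["sub_region"], 0.0)
        if (c["op"] == "ge" and level >= c["threshold"]) or \
           (c["op"] == "le" and level <= c["threshold"]):
            out.append(c["message"])
    return out

print(check_interventions({"coffee area": 55.0, "library": 3.0}))
# ['The coffee area is lively']
```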
 When the analysis target region R is the inside of a store, the output control unit 125 may, for example, in response to the activity level at a position within the analysis target region R satisfying an intervention condition, generate as intervention information a message instructing an employee of the store to move to that position ("Please move near the hat section", etc.). The voice analysis system S can thereby automatically instruct staff to be placed where many conversations are taking place in the store, making it easier to explain products and promote sales.
 The intervention condition is not limited to the activity level becoming equal to or greater than a predetermined threshold; instead of or in addition to this, the activity level becoming equal to or less than a predetermined threshold may be used. In this case too, whether to generate intervention information with positive content (for example, praise for staying quiet) or with negative content (for example, a reminder that quiet has not been maintained) when the activity level is low can be determined according to the type of the analysis target region R. The output control unit 125 then transmits the generated intervention information to at least one of the local terminal 3 and the external terminal 4.
 The output control unit 125 may cause all of the local terminals 3 to output the intervention information. Alternatively, the output control unit 125 may cause only those local terminals 3, among the plurality of local terminals 3, that are located within the sub-region containing the position that satisfied the intervention condition to output the intervention information. The voice analysis system S can thereby direct the intervention information to users near the position whose activity level satisfied the intervention condition.
 FIG. 6A is a schematic diagram of the local terminal 3 displaying map information and intervention information. The local terminal 3 displays the map information and intervention information received from the voice analysis device 1 on its display unit. In the example of FIG. 6A, the local terminal 3 displays a heat map H, which is the map information, and a message M, which represents the intervention information, on the display unit. The external terminal 4 may similarly display the heat map H and the message M on its display unit.
 FIG. 6B is a schematic diagram of the local terminal 3 outputting intervention information as audio. The local terminal 3 outputs, from its audio output unit, a voice V representing the intervention information received from the voice analysis device 1. The voice V may be generated by the output control unit 125 of the voice analysis device 1, or may be generated by the local terminal 3.
 In this way, by visualizing the length of utterance at each position within the analysis target region R as map information, the voice analysis system S can make it easier to analyze whether voice communication is actively taking place within the analysis target region R. Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target region R so as to promote or suppress it.
 The output control unit 125 may change the content of the intervention information depending on the person present within the analysis target region R. In this case, the intervention information is associated in advance with, for example, a person or an attribute of a person (age, gender, clothing, etc.). The output control unit 125 recognizes a person within the analysis target region R, for example, by performing known person recognition processing on a captured image of the surroundings of the local terminal 3 acquired by a camera of the local terminal 3. The output control unit 125 may recognize a person anywhere in the analysis target region R, or may recognize only persons in a specific sub-region. Then, in response to an intervention condition being satisfied, the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with the recognized person or the person's attribute. The voice analysis system S can thereby output intervention information suited to the persons within the analysis target region R.
 The output control unit 125 may cause the external terminal 4 to display comparison information for comparing activity levels in different periods within the analysis target region R. In this case, the reception unit 121 receives from the external terminal 4 a designation of the sub-region to be compared. The reception unit 121 may also receive from the external terminal 4 a designation of the periods to be compared. The output control unit 125 generates comparison information that associates the activity level of the designated sub-region in a first period with the activity level of that sub-region in a second period. The output control unit 125 transmits the generated comparison information to at least one of the local terminal 3 and the external terminal 4.
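 A minimal sketch of building the comparison result, such as the message M described for FIG. 7A below, follows; the per-period data and message format are illustrative assumptions.

```python
# Compare a designated sub-region's mean activity in two periods and
# format the change as a message.
def comparison_message(sub_region, activity_period1, activity_period2):
    a1 = sum(activity_period1) / len(activity_period1)
    a2 = sum(activity_period2) / len(activity_period2)
    change_pct = (a2 - a1) / a1 * 100.0
    direction = "increased" if change_pct >= 0 else "decreased"
    return f"Activity in {sub_region} {direction} by {abs(change_pct):.0f}%"

print(comparison_message("coffee area", [20.0, 25.0], [30.0, 36.0]))
# Activity in coffee area increased by 47%
```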
 FIGS. 7A and 7B are schematic diagrams of the external terminal 4 displaying comparison information. The external terminal 4 displays the comparison information received from the voice analysis device 1. In the example of FIG. 7A, the external terminal 4 displays, as the comparison information, heat maps H for the first period and the second period, and a message M representing the result of comparing the activity levels of the designated sub-region in the first and second periods. In the heat maps H, the designated sub-region is highlighted within the entire analysis target region R. The message M is, for example, a message representing the amount or rate of increase or decrease in activity level between the first period and the second period in the sub-region.
 In the example of FIG. 7B, the external terminal 4 displays, as the comparison information, a heat map H1 for the designated sub-region or the entire analysis target region R, and a message M representing the result of comparing activity levels across multiple periods in the designated sub-region or the entire analysis target region R.
 Unlike the heat map H illustrated in FIGS. 6A and 7A, which represents activity levels on a map, the heat map H1 represents, for each time period, information (color, pattern, etc.) corresponding to the activity level of the sub-region or the entire analysis target region R. The heat map H1 thus visualizes differences in activity level between multiple time periods in the same area. The message M is, for example, a message representing the time periods in which the activity level of the sub-region or the entire analysis target region R is high or low.
 In this way, by associating and visualizing the activity levels of different periods, the voice analysis system S can make it easier to analyze increases and decreases in activity level, and trends in activity level by time period.
 The output control unit 125 may cause the external terminal 4 to output past audio at a designated position. In this case, the reception unit 121 receives, on the external terminal 4 displaying the map information or comparison information, a designation of a position within the analysis target region R and a past period. The output control unit 125 acquires from the storage unit 11 the audio, among the audio acquired by the voice acquisition unit 122, corresponding to the designated position and period, and causes the audio output unit of the external terminal 4 to output it. The voice analysis system S can thereby make it easier to analyze the relationship between the activity level and the actual content of the speech.
 The output control unit 125 may cause at least one of the local terminal 3 and the external terminal 4 to display movement information including the trajectory of movement of the utterance position. In this case, the identification unit 123, for example, identifies the temporal change of the utterance positions identified for each time as the trajectory of movement of the utterance position. The identification unit 123 acquires from the storage unit 11 the information, generated by the speaker recognition processing described above, that associates, for each time, the utterance position with the user who spoke at that position. The identification unit 123 then identifies, based on the acquired information, the trajectory of movement of the utterance position corresponding to a specific user (speaker). The output control unit 125 transmits the movement information including the trajectory of movement identified by the identification unit 123 to at least one of the local terminal 3 and the external terminal 4.
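 A minimal sketch of extracting one speaker's trajectory from the stored (time, position, user) associations follows; the record format is an illustrative assumption.

```python
# Stored associations of time, utterance position, and recognized user.
records = [
    (0, (1.0, 2.0), "A"), (1, (1.5, 2.0), "A"),
    (1, (4.0, 4.0), "B"), (2, (2.0, 2.5), "A"),
]

def trajectory_of(user, records):
    # Sort by time and keep only the designated speaker's utterance positions.
    return [pos for t, pos, u in sorted(records) if u == user]

print(trajectory_of("A", records))  # [(1.0, 2.0), (1.5, 2.0), (2.0, 2.5)]
```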
 FIG. 8 is a schematic diagram of the external terminal 4 displaying movement information. The external terminal 4 displays the movement information received from the voice analysis device 1 on its display unit. In the example of FIG. 8, the external terminal 4 displays the movement trajectory T indicated by the movement information on the display unit. The local terminal 3 may similarly display the movement trajectory T on its display unit. The voice analysis system S can thereby make it easier to analyze how speakers move and communicate within the analysis target region R.
 After the output control unit 125 has caused the external terminal 4 to display the map information, the call control unit 126 may start a call between the external terminal 4 and a local terminal 3 selected on the map information. In this case, the output control unit 125 causes the external terminal 4 to display, as the map information, for example, a heat map in which information (color, pattern, etc.) corresponding to the activity level at each position within the analysis target region R, and information (icons, etc.) indicating the positions of one or more local terminals 3 arranged in the analysis target region R, are superimposed on a map representing the analysis target region R.
 The reception unit 121 receives, in the map information displayed on the external terminal 4, a selection of one of the one or more local terminals 3 as the call destination. For example, in order to support communication in the analysis target region R from outside it, the external user selects a local terminal 3 placed at a location with a low activity level in the map information. In response to one of the local terminals 3 being selected, the call control unit 126 starts the exchange of audio between the selected local terminal 3 and the external terminal 4. The local terminal 3 functions as a call terminal for conducting a call with the external terminal 4: it outputs the audio received from the external terminal 4 from an audio output unit such as a speaker, and transmits the audio input to a sound collection unit such as the microphone of the local terminal 3 to the external terminal 4. The call control unit 126 may have audio exchanged bidirectionally between the selected local terminal 3 and the external terminal 4, or may have audio output unidirectionally from the external terminal 4 to the local terminal 3.
 The voice analysis system S can thereby make it easier for an external user who wishes to talk with a local terminal 3 to select the call destination based on the activity level. By intervening by voice from outside through a local terminal 3 in the analysis target region R, the external user can help invigorate communication in the analysis target region R.
[Flowchart of the voice analysis method]
 FIG. 9 is a flowchart of an exemplary voice analysis method executed by the voice analysis device 1 according to the present embodiment. The reception unit 121 receives from the external terminal 4 the settings of the analysis target region R, the positions of the sound collection devices 2 and local terminals 3 within the analysis target region R, and the object region where an object such as a wall is located within the analysis target region R (S11).
 The voice acquisition unit 122 acquires the audio collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R (S12). The voice acquisition unit 122 acquires, for each time, the direction of arrival D of the audio collected by each of the plurality of sound collection devices 2 (S13). The direction of arrival D may be estimated by the voice analysis device 1, or by each of the plurality of sound collection devices 2.
 The identification unit 123 identifies, for each time, the utterance position, which is the position where an utterance was made within the analysis target region R, based on the plurality of directions of arrival D with respect to the plurality of sound collection devices 2 (S14). Here, the identification unit 123 may identify the utterance position in consideration of the object region, received by the reception unit 121, where an object is located within the analysis target region R.
 The identification unit 123 identifies the length of utterance per unit time at each position within the analysis target region R, based on the utterance positions it identified for each time (S15). The activity determination unit 124 determines, for each position within the analysis target region R, the activity level corresponding to the length of utterance per unit time identified by the identification unit 123 (S16). The activity level is, for example, a value that is larger the longer, and smaller the shorter, the length of utterance per unit time identified by the identification unit 123.
 The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output map information that associates each position within the analysis target region R with the activity level determined by the activity determination unit 124 (S17). The output control unit 125 may also cause the external terminal 4 to display comparison information for comparing activity levels in different periods within the analysis target region R.
 When an activity level determined by the activity determination unit 124 satisfies a predetermined intervention condition (YES in S18), the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with that intervention condition (S19). When the activity levels determined by the activity determination unit 124 satisfy no predetermined intervention condition (NO in S18), the voice analysis device 1 ends the process.
[Effects of the present embodiment]
 The voice analysis system S according to the present embodiment identifies the length of utterance per unit time at each position within the analysis target region R based on the audio acquired by the sound collection devices 2 arranged in the analysis target region R, and outputs the activity level corresponding to the length of utterance in association with each position within the analysis target region R. Because the voice analysis system S can thereby visualize not the loudness of the voices but the length of utterance at each position within the analysis target region R, it can make it easier to analyze whether voice communication is actively taking place.
 Furthermore, by outputting intervention information in response to the activity level satisfying a predetermined condition, the voice analysis system S can adjust communication within the analysis target region R so as to promote or suppress it. In addition, by associating and visualizing the activity levels of different periods, the voice analysis system S can make it easier to analyze increases and decreases in activity level, and trends in activity level by time period.
[First modification]
 In the embodiment described above, the voice analysis system S analyzes voices uttered by humans in closed spaces such as companies and public facilities; however, the voice analysis system S may also analyze sounds uttered not only by humans but also by animals such as monkeys and birds in open spaces such as parks.
 In this case, in the voice analysis device 1, the reception unit 121 receives a setting designating the open space as the analysis target region R. The voice acquisition unit 122 acquires the sounds uttered by the animals, collected by each of the plurality of sound collection devices 2 arranged in the analysis target region R, which is an open space. The voice analysis device 1 then identifies the length of utterance at each position in the same manner as in the embodiment described above, and outputs information corresponding to the activity level corresponding to the length of utterance.
 In this way, the voice analysis system S can also make it easier to analyze whether communication among animals, not limited to humans, is actively taking place in an open space.
[Second modification]
 The voice analysis system S according to this modification displays a heat map of the activity level for each type of voice, such as laughter or screaming, and outputs intervention information based on the activity level for each type of voice. The parts that differ from the embodiment described above are mainly described below.
 After determining the activity level at each position within the analysis target region R, the activity determination unit 124 determines which of a plurality of voice types the voice used to determine that activity level belongs to. The voice types are, for example, laughter, screaming, voices expressing a specific emotion (such as anger), calls of a specific animal (such as deer or bears), abnormal machine noises, and the like.
 For example, the activity determination unit 124 acquires a machine learning model, stored in advance in the storage unit 11, that outputs information indicating which of the plurality of voice types an input voice belongs to. The machine learning model is configured by, for example, a neural network, and is generated by performing known machine learning processing using a plurality of voices and their voice types as training data.
 The activity determination unit 124 inputs the voice acquired by the voice acquisition unit 122 into the acquired machine learning model, and determines the voice type output by the machine learning model as the voice type of the voice used to determine the activity level.
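 A minimal classification sketch follows; the feature extraction and the model interface (a `predict_proba` over a fixed class list) are illustrative assumptions rather than a specific library's API or the disclosed model.

```python
import numpy as np

VOICE_TYPES = ["laughter", "scream", "anger", "animal", "machine_noise"]

def classify_voice(waveform, model):
    # Crude spectral features; a real system would use a proper front end.
    features = np.abs(np.fft.rfft(waveform))[:128]
    probs = model.predict_proba(features[np.newaxis, :])[0]
    return VOICE_TYPES[int(np.argmax(probs))]

# Usage, assuming any trained classifier exposing predict_proba over the
# five classes above:
# voice_type = classify_voice(waveform, trained_model)
```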
 Instead of the voice analysis device 1, the sound collection device 2 may determine the voice type. In this case, the sound collection device 2 obtains the voice type output by the machine learning model, for example, by inputting the collected sound into the machine learning model described above, stored in advance in the storage unit of the sound collection device 2. The sound collection device 2 transmits, to the voice analysis device 1, information indicating the voice type of the sound together with the audio data representing the collected sound. In the voice analysis device 1, the activity determination unit 124 determines the voice type of the voice used to determine the activity level based on the information indicating the voice type received from the sound collection device 2.
 The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the activity level determined using voices of one of the plurality of voice types. Map information such as a heat map corresponding to one voice type is referred to as one layer.
 For example, the output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information representing the activity levels determined using voices of the voice type designated on that terminal. In this case, at least one of the local terminal 3 and the external terminal 4 displays the layer corresponding to the single designated voice type, based on the information received from the voice analysis device 1.
 The output control unit 125 may also, for example, cause at least one of the local terminal 3 and the external terminal 4 to display a list of map information representing the activity levels determined using the voices of each of the plurality of voice types. In this case, at least one of the local terminal 3 and the external terminal 4 displays a plurality of layers corresponding to the plurality of voice types, based on the information received from the voice analysis device 1.
 FIGS. 10A and 10B are schematic diagrams of map information representing activity levels determined using voices of one of a plurality of voice types. FIG. 10A shows, as map information, a heat map H2 corresponding to activity levels determined using voices whose voice type is "laughter". FIG. 10B shows, as map information, a heat map H3 corresponding to activity levels determined using voices whose voice type is "scream".
 In this way, by visualizing the activity level for each voice type as map information, the voice analysis system S can make it easier to analyze what types of communication are actively taking place within the analysis target region R.
 The output control unit 125 may, in response to an activity level for a voice type determined by the activity determination unit 124 satisfying a predetermined intervention condition, cause at least one of the local terminal 3 and the external terminal 4 to output the intervention information associated with that intervention condition. In this case, the output control unit 125 acquires from the storage unit 11, for example, the intervention conditions and intervention information stored in advance for each voice type. An intervention condition may or may not be set for each of the plurality of voice types.
 For example, the output control unit 125 may output no intervention information when the voice type is "laughter", and output intervention information when the voice type is "scream". The output control unit 125 may also, for example, output intervention information for promoting communication in response to the activity level being equal to or less than a predetermined threshold when the voice type is "laughter", and output intervention information for suppressing communication in response to the activity level being equal to or greater than a predetermined threshold when the voice type is "scream".
 The output control unit 125 determines whether the activity level of each of the plurality of voice types at each position within the analysis target region R satisfies the intervention condition for that voice type (for example, whether it is equal to or greater than the threshold indicated by the intervention condition). In response to the activity level for a voice type at any position within the analysis target region R satisfying an intervention condition, the output control unit 125 generates the intervention information associated with that intervention condition. The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to output the generated intervention information.
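 A minimal sketch of per-voice-type intervention rules follows; the rule contents and messages are illustrative assumptions, echoing the laughter/scream example above, and a type may have no rule at all.

```python
rules = {
    "laughter": {"op": "le", "threshold": 5.0,
                 "message": "Things are quiet - how about a chat?"},
    "scream":   {"op": "ge", "threshold": 20.0,
                 "message": "Please calm down in this area"},
    # "animal": no intervention condition set for this voice type
}

def intervention_for(voice_type, level):
    rule = rules.get(voice_type)
    if rule is None:
        return None  # no condition set for this type
    hit = level >= rule["threshold"] if rule["op"] == "ge" else level <= rule["threshold"]
    return rule["message"] if hit else None

print(intervention_for("scream", 25.0))    # Please calm down in this area
print(intervention_for("laughter", 30.0))  # None
```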
 The voice analysis system S can thereby switch whether and how to intervene for each voice type, by outputting intervention information in response to the activity level satisfying the condition set for that voice type.
[Third modification]
 The voice analysis system S according to this modification determines the activity level using a captured image generated by an imaging device imaging the analysis target region R, and also controls the orientation of the imaging device. The parts that differ from the embodiment described above are mainly described below.
 FIG. 11 is a block diagram of the voice analysis system S according to this modification. The control unit 12 of the voice analysis device 1 according to this modification further includes an image acquisition unit 127 in addition to the units shown in FIG. 2.
 The image acquisition unit 127 acquires, from the imaging device 5, a captured image generated by imaging a part of the analysis target region R. The imaging device 5 is arranged, for example, at a position from which it can image a part of the analysis target region R. The imaging device 5 has a drive unit for changing the orientation of the imaging device 5. By changing its orientation using the drive unit, the imaging device 5 can image any position within the analysis target region R. The imaging device 5 may change its orientation according to control information output from the output control unit 125 described below, or may change its orientation automatically.
 The imaging device 5 generates a captured image by imaging a part of the analysis target region R (that is, a captured image whose imaging range is the direction in which the imaging device 5 is facing), periodically or in response to receiving an imaging instruction from the voice analysis device 1, and transmits the generated captured image to the voice analysis device 1 by wired or wireless communication. In the voice analysis device 1, the image acquisition unit 127 receives the captured image from the imaging device 5 and stores the received captured image in the storage unit 11.
 The output control unit 125 outputs control information for changing the orientation of the imaging device 5 based on the activity levels determined by the activity determination unit 124. FIG. 12 is a schematic diagram for explaining the process of changing the orientation of the imaging device 5 based on the activity levels. FIG. 12 schematically shows the position and orientation of the imaging device 5 on the heat map H of the analysis target region R.
 The output control unit 125 identifies, as a target position L, a position within the analysis target region R whose activity level satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold). The output control unit 125 transmits to the imaging device 5 control information for pointing the imaging device 5 at the target position L. The control information is, for example, information indicating the position of the target position L relative to the position of the imaging device 5. According to the control information received from the voice analysis device 1, the imaging device 5 changes its orientation toward the target position L using the drive unit. The imaging device 5 generates a captured image by imaging the direction in which it is facing as the imaging range, and transmits the generated captured image to the voice analysis device 1 by wired or wireless communication.
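 A minimal sketch of turning a target position into a pan command follows, assuming 2-D coordinates and a camera heading in degrees; the names and angle conventions are illustrative assumptions.

```python
import math

# Compute the rotation needed to point the camera at the target position L,
# given the camera's position and current heading.
def pan_command(camera_pos, camera_heading_deg, target_pos):
    dx = target_pos[0] - camera_pos[0]
    dy = target_pos[1] - camera_pos[1]
    bearing = math.degrees(math.atan2(dy, dx))                 # bearing of target
    delta = (bearing - camera_heading_deg + 180) % 360 - 180   # shortest rotation
    return delta  # degrees to rotate (positive = counterclockwise)

print(pan_command(camera_pos=(0, 0), camera_heading_deg=90, target_pos=(3, 3)))
# -45.0 -> rotate 45 degrees clockwise toward the target position L
```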
 When there are multiple target positions L within the analysis target region R that satisfy the predetermined condition, the output control unit 125 sequentially transmits control information for pointing the imaging device 5 at the next target position L after the imaging device 5 has finished imaging one target position L.
 Thus, even when the imaging device 5 cannot image the entire analysis target region R at once, the voice analysis system S can point the imaging device 5 at positions whose activity level satisfies the predetermined condition and have it image them, and can therefore preferentially acquire captured images of positions within the analysis target region R that are likely to be of high importance or urgency.
 Instead of or in addition to the control information for pointing the imaging device 5 at the target position L, the output control unit 125 may output control information for moving a mobile device such as a robot or drone to the target position L. In this case, the mobile device moves to the target position L according to the control information received from the voice analysis device 1. The voice analysis system S can thereby automatically dispatch robots, drones, and the like to positions within the analysis target region R that are likely to be of high importance or urgency.
 The activity determination unit 124 may identify the number of users at each position within the analysis target region R based on the captured image acquired by the image acquisition unit 127, and reflect the identified number in the information output by the output control unit 125.
 For example, the activity determination unit 124 calculates the activity level per user. In this case, the activity determination unit 124 identifies the number of users at each position within the analysis target region R, for example, by performing known image recognition processing on the captured image. The activity determination unit 124 may identify only the number of users at positions whose activity level satisfies a predetermined condition (for example, being equal to or greater than a predetermined threshold).
 The activity determination unit 124 may also identify the number of users at each position within the analysis target region R using the positions of communication devices, such as smartphones, carried by the users, instead of the captured image. In this case, the activity determination unit 124 acquires, for example, from a communication device carried by a user, information indicating the reception strengths of the signals emitted by each of a plurality of transmitters, such as beacons, arranged in the analysis target region R. The activity determination unit 124 identifies the position of the user using the relationship between the positions of the plurality of transmitters and the reception strengths, at the user's communication device, of the signals emitted by each of those transmitters. The activity determination unit 124 then identifies the number of users at each position within the analysis target region R by aggregating the identified positions of the one or more users.
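 One simple way to realize this position estimate is a centroid of beacon positions weighted by received strength; the sketch below assumes normalized strength weights, whereas real systems often fit a path-loss model instead.

```python
import numpy as np

# Estimate a user's position from beacon signal strengths as a weighted
# centroid of the beacon positions.
def estimate_user_position(beacon_positions, rssi_weights):
    pos = np.asarray(beacon_positions, dtype=float)  # (n_beacons, 2)
    w = np.asarray(rssi_weights, dtype=float)
    w = w / w.sum()
    return pos.T @ w  # weighted centroid

print(estimate_user_position([(0, 0), (10, 0), (0, 10)], [0.7, 0.2, 0.1]))
# [2. 1.] -> close to the strongest beacon at (0, 0)
```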
 The activity determination unit 124 is not limited to the specific methods shown here, and may identify the number of users at each position within the analysis target region R by other methods.
 The activity determination unit 124 calculates the activity level per person by dividing the activity level at each position by the number of people identified for that position. The output control unit 125 causes at least one of the local terminal 3 and the external terminal 4 to display map information that associates each position within the analysis target region R with the activity level per person. The voice analysis system S can thereby visualize the degree of liveliness per user, instead of the degree of liveliness of an entire group including multiple users.
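 A minimal sketch of the per-person activity follows, which also applies the exclusion of positions with zero or one user described in the next paragraph; the data values are illustrative assumptions.

```python
# Divide each cell's activity by its headcount, dropping cells with fewer
# than two people (soliloquies, machine sounds, natural sounds, etc.).
activity_by_cell = {(0, 0): 40.0, (0, 1): 12.0, (1, 1): 9.0}
headcount_by_cell = {(0, 0): 4, (0, 1): 1, (1, 1): 0}

per_person = {
    cell: level / headcount_by_cell[cell]
    for cell, level in activity_by_cell.items()
    if headcount_by_cell.get(cell, 0) >= 2
}
print(per_person)  # {(0, 0): 10.0}
```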
 The output control unit 125 may also cause at least one of the local terminal 3 and the external terminal 4 to display map information that associates the activity levels determined by the activity determination unit 124 with the positions within the analysis target region R, excluding positions where the number of users is zero or one. The voice analysis system S can thereby display as map information the activity levels excluding users talking to themselves, mechanical sounds, natural sounds, and the like, making the map information easier to view.
 Although the present invention has been described above using the embodiments, the technical scope of the present invention is not limited to the scope described in the above embodiments, and various modifications and changes are possible within the scope of its gist. For example, all or part of the device can be functionally or physically distributed or integrated in arbitrary units. New embodiments arising from arbitrary combinations of multiple embodiments are also included in the embodiments of the present invention. The effects of a new embodiment arising from a combination include the effects of the original embodiments.
 The processor of the voice analysis device 1 executes each step (process) included in the voice analysis method shown in FIG. 9. That is, the processor of the voice analysis device 1 executes the voice analysis method shown in FIG. 9 by executing a program for executing that method. Some of the steps included in the voice analysis method shown in FIG. 9 may be omitted, the order of the steps may be changed, and multiple steps may be performed in parallel.
S Voice analysis system
1 Voice analysis device
11 Storage unit
12 Control unit
121 Reception unit
122 Voice acquisition unit
123 Identification unit
124 Activity determination unit
125 Output control unit
126 Call control unit
127 Image acquisition unit
2 Sound collection device
3 Local terminal
4 External terminal
5 Imaging device

Claims (15)

  1.  A voice analysis device comprising:
     a voice acquisition unit that acquires the direction of arrival of a voice at each of a plurality of sound collection devices arranged in a predetermined region;
     an identification unit that identifies an utterance position where an utterance was made using the relationship among the plurality of directions of arrival with respect to the plurality of sound collection devices, and identifies the length of utterance per unit time at each position within the region using the relationship between the utterance position and that position; and
     an output control unit that causes an information terminal to display map information associating each position within the region with an activity level corresponding to the length of utterance per unit time at that position.
  2.  The voice analysis device according to claim 1, further comprising a reception unit that receives a setting of an object region where an object is located within the region,
     wherein, when a straight line along the direction of arrival intersects the object region, the identification unit identifies the position where the utterance was made by excluding the portion of the direction of arrival farther than the object region with respect to the position of the sound collection device.
  3.  The voice analysis device according to claim 1 or 2, wherein the map information is information in which information corresponding to the activity level is superimposed on a map representing the region.
  4.  The voice analysis device according to claim 3, wherein the map information is information in which information corresponding to the activity level and information indicating the positions of one or more call terminals arranged in the region are superimposed on a map representing the region,
     the voice analysis device further comprising a call control unit that, in response to one of the one or more call terminals being selected in the map information displayed on the information terminal, starts the exchange of audio between the selected call terminal and the information terminal.
  5.  The voice analysis device according to claim 1, wherein the output control unit outputs, to the information terminal, in response to the activity level at a position within the region satisfying a predetermined condition, intervention information associated with that condition.
  6.  The voice analysis device according to claim 5, further comprising a reception unit that receives, from the information terminal, settings of the condition and of the intervention information associated with the condition.
  7.  The voice analysis device according to claim 1 or 2, wherein the identification unit identifies the temporal change of the position where the utterance was made as a trajectory of movement of the position where the utterance was made, and
     the output control unit causes the information terminal to display information including the trajectory of movement.
  8.  The voice analysis device according to claim 1 or 2, wherein the output control unit causes the information terminal to display, in association with each other, the activity level in a first period of a sub-region that is at least a part of the region, and the activity level of the sub-region in a second period.
  9.  The specifying unit estimates the number of people who spoke at each position within the area by recognizing one or more speakers who uttered each of the plurality of voices acquired from the plurality of sound collection devices,
     and the device further comprises an activity determination unit that calculates a provisional activity level using the length of utterance per unit time and determines the activity level by correcting the provisional activity level according to the number of people.
     The voice analysis device according to claim 1 or 2.
  10.  The activity determination unit makes the activity level when the number of people is two or more larger than the activity level when the number of people is one.
     The voice analysis device according to claim 9.
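Claims 9 and 10 leave the exact correction unspecified; they only require that the activity level grow with the number of speakers. One possible correction, shown purely as an assumption, is a logarithmic bonus applied to the speech-time ratio:

```python
import math

def activity_level(speech_seconds, window_seconds, num_speakers):
    """Provisional activity = fraction of the window filled with speech,
    then boosted when more than one speaker took part."""
    provisional = speech_seconds / window_seconds
    # log(1) == 0, so a single speaker keeps the provisional value,
    # and each additional speaker raises the level (claim 10).
    return provisional * (1.0 + math.log(max(num_speakers, 1)))
```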
  11.  The output control unit repeatedly causes the information terminal to display the map information including the activity level determined at predetermined time intervals.
     The voice analysis device according to claim 1 or 2.
  12.  Further comprising an activity determination unit that determines to which of a plurality of types the voice used to determine the activity level belongs,
     wherein the output control unit causes the information terminal to display the map information associating each position in the area with the activity level determined using the voice of any one of the plurality of types.
     The voice analysis device according to claim 1 or 2.
  13.  The output control unit outputs, to an imaging device that images a part of the area, control information for directing the imaging device toward a position within the area where the activity level satisfies a predetermined condition.
     The voice analysis device according to claim 1 or 2.
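For claim 13, the control information can be as simple as a pan angle toward the grid cell whose activity exceeds a threshold. A sketch under assumed room-grid coordinates (all names and the threshold are hypothetical):

```python
import math

def camera_pan_angle(camera_pos, activity_grid, cell_size, threshold=0.8):
    """Return a pan angle (radians) toward the most active grid cell,
    or None when no cell meets the threshold.

    activity_grid: mapping from integer cell index (ix, iy) to activity level.
    """
    if not activity_grid:
        return None
    cell, level = max(activity_grid.items(), key=lambda item: item[1])
    if level < threshold:
        return None
    # Centre of the winning cell in room coordinates.
    cx = (cell[0] + 0.5) * cell_size
    cy = (cell[1] + 0.5) * cell_size
    return math.atan2(cy - camera_pos[1], cx - camera_pos[0])
```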
  14.  A voice analysis method executed by a processor, the method comprising:
     acquiring a direction of arrival of voice to each of a plurality of sound collection devices arranged in a predetermined area;
     identifying an utterance position where an utterance was made, using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and specifying, using the relationship between the utterance position and each position in the area, the length of utterance per unit time at that position; and
     causing an information terminal to display map information associating each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
  15.  A voice analysis program for causing a processor to execute:
     acquiring a direction of arrival of voice to each of a plurality of sound collection devices arranged in a predetermined area;
     identifying an utterance position where an utterance was made, using the relationship among the plurality of arrival directions with respect to the plurality of sound collection devices, and specifying, using the relationship between the utterance position and each position in the area, the length of utterance per unit time at that position; and
     causing an information terminal to display map information associating each position in the area with an activity level corresponding to the length of utterance per unit time at that position.
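Read together, method claim 14 (and program claim 15) describe a three-step loop: acquire directions of arrival, localize speech and accumulate per-cell speech time, then render an activity map. A schematic sketch of that loop, with every callback standing in for a step the claims leave abstract (all names and the frame/window sizes are assumptions):

```python
from collections import defaultdict

def analysis_cycle(get_arrival_directions, localize, display,
                   window_seconds=60.0, frame_seconds=0.1, cell_size=0.5):
    """One analysis window: accumulate speech time per grid cell, then display.

    get_arrival_directions(): yields per-frame DOA observations (assumed API).
    localize(observations): returns an (x, y) utterance position or None.
    display(map_info): pushes the activity map to the information terminal.
    """
    speech_time = defaultdict(float)
    frames = int(window_seconds / frame_seconds)
    for _ in range(frames):
        observations = get_arrival_directions()
        position = localize(observations)
        if position is not None:
            cell = (int(position[0] // cell_size), int(position[1] // cell_size))
            speech_time[cell] += frame_seconds
    # Activity level = speech time per unit time at each position.
    map_info = {cell: seconds / window_seconds
                for cell, seconds in speech_time.items()}
    display(map_info)
```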
PCT/JP2022/045694 2022-04-27 2022-12-12 Voice analysis device, voice analysis method, and voice analysis program WO2023210052A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PCT/JP2022/019170 WO2023209898A1 (en) 2022-04-27 2022-04-27 Voice analysis device, voice analysis method, and voice analysis program
JPPCT/JP2022/019170 2022-04-27

Publications (1)

Publication Number Publication Date
WO2023210052A1 true WO2023210052A1 (en) 2023-11-02

Family

ID=88518226

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/JP2022/019170 WO2023209898A1 (en) 2022-04-27 2022-04-27 Voice analysis device, voice analysis method, and voice analysis program
PCT/JP2022/045694 WO2023210052A1 (en) 2022-04-27 2022-12-12 Voice analysis device, voice analysis method, and voice analysis program

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/019170 WO2023209898A1 (en) 2022-04-27 2022-04-27 Voice analysis device, voice analysis method, and voice analysis program

Country Status (1)

Country Link
WO (2) WO2023209898A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007025859A (en) * 2005-07-13 2007-02-01 Sharp Corp Real world communication management device
JP2013058221A (en) * 2012-10-18 2013-03-28 Hitachi Ltd Conference analysis system
JP2016082356A (en) * 2014-10-15 2016-05-16 株式会社ニコン Electronic apparatus and program
JP2018036690A (en) * 2016-08-29 2018-03-08 米澤 朋子 One-versus-many communication system, and program
WO2019142233A1 (en) * 2018-01-16 2019-07-25 ハイラブル株式会社 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
WO2019142230A1 (en) * 2018-01-16 2019-07-25 ハイラブル株式会社 Voice analysis device, voice analysis method, voice analysis program, and voice analysis system
WO2021245759A1 (en) * 2020-06-01 2021-12-09 ハイラブル株式会社 Voice conference device, voice conference system, and voice conference method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HORI TAKAAKI, ATSUSHI NAKAMURA, AKIKO ARAKI, TOMOHIRO NAKATANI: "Aiming for a computer that can listen to everyone's conversations", NTT TECHNOLOGY JOURNAL, vol. 25, no. 9, 1 September 2013 (2013-09-01), pages 18 - 21, XP093105262 *

Also Published As

Publication number Publication date
WO2023209898A1 (en) 2023-11-02

Similar Documents

Publication Publication Date Title
US10453443B2 (en) Providing an indication of the suitability of speech recognition
US20220012470A1 (en) Multi-user intelligent assistance
US11290826B2 (en) Separating and recombining audio for intelligibility and comfort
US11010601B2 (en) Intelligent assistant device communicating non-verbal cues
US9293133B2 (en) Improving voice communication over a network
CN114556972A (en) System and method for assisting selective hearing
JP6710562B2 (en) Reception system and reception method
CN112352441B (en) Enhanced environmental awareness system
US11602287B2 (en) Automatically aiding individuals with developing auditory attention abilities
JPWO2008139717A1 (en) Display device, display method, display program
EP2503545A1 (en) Arrangement and method relating to audio recognition
US11164341B2 (en) Identifying objects of interest in augmented reality
US11460927B2 (en) Auto-framing through speech and video localizations
JP7400364B2 (en) Speech recognition system and information processing method
WO2023210052A1 (en) Voice analysis device, voice analysis method, and voice analysis program
US20190189088A1 (en) Information processing device, information processing method, and program
JP6589042B1 (en) Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system
US20230005488A1 (en) Signal processing device, signal processing method, program, and signal processing system
JP2009060220A (en) Communication system and communication program
CN115516553A (en) System and method for multi-microphone automated clinical documentation
WO2019190812A1 (en) Intelligent assistant device communicating non-verbal cues
WO2023112668A1 (en) Sound analysis device, sound analysis method, and recording medium
US20230421983A1 (en) Systems and methods for orientation-responsive audio enhancement
US20230421984A1 (en) Systems and methods for dynamic spatial separation of sound objects

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22940323

Country of ref document: EP

Kind code of ref document: A1