WO2023079602A1 - Voice analysis device and voice analysis method - Google Patents

Voice analysis device and voice analysis method Download PDF

Info

Publication number
WO2023079602A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
section
discussion
information
participants
Prior art date
Application number
PCT/JP2021/040443
Other languages
French (fr)
Japanese (ja)
Inventor
浩平 柳楽
武志 水本
Original Assignee
ハイラブル株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ハイラブル株式会社
Priority to PCT/JP2021/040443
Publication of WO2023079602A1

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present invention relates to a speech analysis device and a speech analysis method for analyzing speech uttered in a discussion.
  • Patent Literature 1 (Japanese Patent Application Laid-Open No. 2006-208482) discloses a system that acquires, using a microphone, voices uttered by participants during a conference, identifies the speaking participant on the basis of voiceprint data extracted from the voices, and displays the utterance status of each of a plurality of participants on a display.
  • discussions between two groups, such as a discussion between learners and a teacher or a discussion between two teams, are sometimes analyzed. Because the system of Patent Literature 1 displays each participant's utterances time by time, it is difficult for an analyst to grasp the utterance tendencies of two groups into which the participants are divided, and it was therefore difficult to analyze a discussion between the two groups.
  • the present invention was made in view of these points, and aims to make it easier to analyze the tendency of utterances in discussions between two groups.
  • the speech analysis device according to a first aspect of the present invention includes: an acquisition unit that acquires time-series information indicating, for each time, the utterance status of each of a first group and a second group in the voices uttered in a discussion by participants belonging to the first group and participants belonging to the second group; a generation unit that generates, on the basis of the time-series information, section information that associates each of a plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly uttered in that section, over all or part of the discussion; and an output unit that outputs the section information.
  • the section tendency may indicate which of the first group and the second group mainly uttered, or that the utterances of the first group and the second group are competing.
  • the generation unit may determine each of the plurality of sections so that it is equal to or longer than a predetermined time, and may determine the section tendency of a section by comparing the utterance statuses of the first group and the second group in that section.
  • the time-series information may be information indicating which of the first group and the second group has a larger amount of speech for each predetermined time frame during the period from the start point to the end point of the discussion.
  • the speech analysis device may further include a classification unit that classifies a plurality of participants into the first group and the second group.
  • the classification unit may change the participants belonging to each of the first group and the second group during a plurality of periods in the discussion.
  • the classification unit may generate a first parent group including the first group and the second group into which some of the plurality of participants are classified, and a second parent group including the first group and the second group into which some of the plurality of participants not belonging to the first parent group are classified, and the output unit may output the section information of the first parent group and the section information of the second parent group at the same time.
  • the classification unit may generate a first parent group including the first group and the second group into which some of the plurality of participants are classified, a second parent group including the first group and the second group into which some of the plurality of participants not belonging to the first parent group are classified, and a third parent group including a first group and a second group into which the participants belonging to the first parent group and the second parent group are classified, and the output unit may output the section information of at least one of the first parent group and the second parent group together with the section information of the third parent group.
  • the classification unit may change the participants belonging to the first group to which the specific participant belongs and the participants belonging to the second group based on the position of the specific participant.
  • the output unit may output the words included in the utterance of each of the plurality of sections, which are extracted by performing speech recognition processing on the speech, in association with the section.
  • the output unit may output the characteristics of the entire discussion based on the section trends of the plurality of sections forming the discussion.
  • the speech analysis device may further include a selection unit that selects reference section information to be compared with the section information, and the output unit may output a result of comparing the section information with the reference section information.
  • the output unit may output information corresponding to a difference between the section information and the reference section information as the comparison result during the discussion.
  • the output unit may output the section tendency of each of the plurality of sections indicated by the section information in association with the section tendency of each of the plurality of sections indicated by the reference section information.
  • a speech analysis method according to a second aspect of the present invention is executed by a processor and includes: a step of acquiring time-series information indicating, for each time, the utterance status of each of a first group and a second group in the voices uttered in a discussion by participants belonging to the first group and participants belonging to the second group; a step of generating, on the basis of the time-series information, section information that associates each of a plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly uttered in that section, over all or part of the discussion; and a step of outputting the section information.
  • FIG. 1 is a schematic diagram of a speech analysis system S according to an embodiment.
  • FIG. 2 is a block diagram of the speech analysis system S according to the embodiment.
  • FIG. 3 is a schematic diagram for explaining a method by which a selection unit selects reference section information.
  • FIG. 4 is a schematic diagram for explaining a method by which an acquisition unit acquires time-series information.
  • FIG. 5 is a schematic diagram for explaining a method by which a generation unit determines a section tendency.
  • FIG. 6 is a schematic diagram for explaining a method by which an output unit outputs section information in real time.
  • FIG. 7 is a schematic diagram for explaining a method by which the output unit outputs section information after the fact.
  • FIG. 8 is a diagram showing a flowchart of an exemplary speech analysis method executed by the speech analysis device according to the embodiment.
  • FIG. 9 is a diagram showing a flowchart of section information generation processing in the exemplary speech analysis method executed by the speech analysis device according to the embodiment.
  • FIG. 10 is a schematic diagram for explaining a method by which the output unit outputs section information in a modification.
  • FIG. 11 is a schematic diagram for explaining a method by which the generation unit generates section information in a modification.
  • FIG. 1 is a schematic diagram of a speech analysis system S according to this embodiment.
  • a speech analysis system S includes a speech analysis device 1, a sound collector 2, and an information terminal 3.
  • the number of sound collectors 2 and information terminals 3 included in the speech analysis system S is not limited.
  • the speech analysis system S may include other devices such as servers and terminals.
  • the speech analysis device 1 is a computer that analyzes the speech uttered in discussions in which multiple participants participate and provides the analysis results to the analyst.
  • the analyst may be some of the participants, or may be a different person from the participants.
  • the voice analysis device 1 analyzes the voice acquired by the sound collection device 2 and outputs the analysis result to the sound collection device 2 or the information terminal 3 .
  • the voice analysis device 1 is connected to the sound collector 2 and the information terminal 3 by wire or wirelessly via a network such as a local area network or the Internet.
  • the voice analysis device 1 analyzes the voices of discussions conducted by a plurality of participants divided into at least two groups. Discussions to be analyzed are, for example, classes, group discussions, debates, meetings, and the like. A plurality of participants are classified into either a first group or a second group. Participants belonging to the first group are instructors such as teachers and tutors, for example. Participants belonging to the second group are learners such as pupils and students, for example. Also, a plurality of learners may be classified into the first group and the second group. Multiple participants may be classified according to other criteria.
  • a plurality of participants may be classified into a plurality of parent groups, and a plurality of participants may be classified into the first group or the second group in each of the plurality of parent groups.
  • each parent group includes a first group and a second group.
  • one parent group corresponds to one table surrounded by a plurality of participants, and the sound collector 2 is arranged at the table.
  • a plurality of participants surrounding one table are classified into a first group and a second group.
  • the speech analysis device 1 may analyze the speech of discussions (for example, web conferences) held over a network.
  • a sound collector 2 is arranged in each space where a plurality of participants are present during the discussion, and the sound collector 2 is associated with one of the plurality of participants.
  • the sound collecting device 2 is a device that acquires the voice uttered in the discussion.
  • the sound collector 2 includes, for example, a microphone array including sound collectors such as a plurality of microphones arranged in different directions.
  • a microphone array includes, for example, a plurality of (e.g., eight) microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground.
  • by using the microphone array, the speech analysis device 1 can identify which participant is the speaker (sound source) on the basis of the voices uttered by the plurality of participants surrounding the sound collector 2.
  • the sound collector 2 transmits the sound acquired using the microphone array to the sound analysis device 1 as sound data.
  • the sound collecting device 2 may include an audio output unit such as a speaker.
  • the information terminal 3 is a computer that outputs information, such as a smartphone, tablet terminal, or personal computer.
  • the information terminal 3 is used, for example, by at least some of the participants. Also, the information terminal 3 may be used by an analyst different from the plurality of participants.
  • the information terminal 3 has, for example, a display such as a liquid crystal display.
  • the information terminal 3 causes the display unit to display the information received from the speech analysis device 1 .
  • the information terminal 3 may function as the sound collector 2 by having a sound collector such as a microphone.
  • the information terminal 3 used by each of the plurality of participants transmits the sound acquired using the sound collecting unit to the sound analysis device 1 as sound data.
  • the speech analysis device 1 classifies a plurality of participants into a first group or a second group. For example, the speech analysis device 1 receives, on the information terminal 3, a setting as to which of the first group and the second group each of the plurality of participants belongs, or automatically classifies the plurality of participants into the first group or the second group based on their attributes.
  • the voice analysis device 1 acquires voices uttered by multiple participants in the discussion from the sound collection device 2 .
  • the speech analysis apparatus 1 acquires time-series information indicating the utterance status of each of the first group and the second group by specifying the utterance period of each of the plurality of participants in the acquired voice.
  • the time-series information is, for example, information indicating which of the first group and the second group has a larger amount of speech for each predetermined time frame.
  • based on the acquired time-series information, the speech analysis device 1 generates section information that associates each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly uttered in that section.
  • the section tendency may indicate which of the first group and the second group mainly uttered, or that the utterances of the first group and the second group are competing.
  • the speech analysis device 1 outputs the generated segment information to at least one of the sound collector 2 and the information terminal 3 .
  • in this way, the speech analysis system S determines, based on the voices of the discussion, a section tendency indicating which of the two groups mainly uttered in each section of the discussion, associates each section with its section tendency, and notifies the analyst.
  • the speech analysis system S makes it easy for the analyst to grasp the utterance tendencies of the two groups, and facilitates analysis of the utterance tendencies in the discussion between the two groups.
  • FIG. 2 is a block diagram of the speech analysis system S according to this embodiment.
  • in FIG. 2, arrows indicate the main data flows, and there may be data flows other than those shown.
  • each block in FIG. 2 shows the configuration in units of functions, not in units of hardware (devices).
  • the blocks shown in FIG. 2 may be implemented within a single device, or may be implemented separately within multiple devices. Data exchange between blocks may be performed via any means such as a data bus, network, or portable storage medium.
  • the speech analysis device 1 has a storage unit 11 and a control unit 12.
  • the speech analysis device 1 may be configured by connecting two or more physically separated devices by wire or wirelessly. Also, the speech analysis device 1 may be configured by a cloud that is a collection of computer resources.
  • the storage unit 11 is a storage medium including ROM (Read Only Memory), RAM (Random Access Memory), hard disk drive, and the like.
  • the storage unit 11 stores programs executed by the control unit 12 in advance.
  • the storage unit 11 may be provided outside the speech analysis device 1, in which case data may be exchanged with the control unit 12 via a network.
  • the control unit 12 has a selection unit 121 , a classification unit 122 , an acquisition unit 123 , a generation unit 124 and an output unit 125 .
  • the control unit 12 is, for example, a processor such as a CPU (Central Processing Unit), and functions as the selection unit 121, the classification unit 122, the acquisition unit 123, the generation unit 124, and the output unit 125 by executing a program stored in the storage unit 11. At least part of the functions of the control unit 12 may be implemented by an electric circuit. Moreover, at least part of the functions of the control unit 12 may be realized by executing a program received via a network.
  • the selection unit 121 selects reference section information to be compared with section information.
  • the section information is information that associates each of a plurality of sections constituting a discussion with the section tendency of that section.
  • the interval tendency indicates which of the utterances of the first group and the second group is dominant, or that the utterances of the first group and the second group are competing with each other.
  • the section information is generated by the generation unit 124, which will be described later, based on the speech of the discussion to be analyzed.
  • the reference section information is section information that has been generated in advance and stored in the storage unit 11.
  • the selection unit 121 selects the reference section information based on the content specified by the analyst on the information terminal 3, for example.
  • An analyst is either one of a plurality of participants participating in a discussion to be analyzed, or a person different from the plurality of participants.
  • FIGS. 3(a) and 3(b) are schematic diagrams for explaining how the selection unit 121 selects reference section information.
  • the storage unit 11 stores in advance discussion information indicating at least one of the attributes of a past discussion (subject, type, time, topic, format, etc.), the date and time of the discussion, and the attributes of the participants of the discussion (teacher in charge, grade, etc.), in association with section information generated by the generation unit 124 by a method described later.
  • the selection unit 121 receives designation of search conditions for past discussions by the analyst.
  • the search condition is, for example, at least one of the attributes of the discussion, the date and time of the discussion, and the attributes of the participants.
  • the selection unit 121 extracts discussion information that matches the designated search condition from the storage unit 11, and displays it on the information terminal 3 together with the section information 31 associated with each of the extracted one or more pieces of discussion information.
  • the selection unit 121 receives, on the information terminal 3, designation of any piece of section information 31 by the analyst, and selects the designated section information 31 as the reference section information. Thereby, the speech analysis device 1 can compare the discussion to be analyzed with a past discussion that matches the search condition specified by the analyst.
  • the storage unit 11 pre-stores a template of section information.
  • a section information template is, for example, information indicating an order of a section in which the utterances of the first group are dominant, a section in which the utterances of the second group are dominant, and a section in which the utterances of the first group and the second group are competing.
  • the selection unit 121 causes the information terminal 3 to display the plurality of templates 32 stored in the storage unit 11 .
  • FIG. 3(b) shows an example in which a plurality of participants are classified into a T (teacher) group as the first group and an S (student) group as the second group, but the participants may be classified into two groups by other criteria.
  • the selection unit 121 accepts designation of one of the templates 32 by the analyst, and selects the order of the sections indicated by the designated template 32 as the reference section information.
  • the selection unit 121 may also receive, from the analyst, an input of an order of a section in which the utterances of the first group are dominant, a section in which the utterances of the second group are dominant, and a section in which the utterances of the first group and the second group are competing, and select the input order of sections as the reference section information. Thereby, the speech analysis device 1 can compare the discussion to be analyzed with the order of sections specified by the analyst.
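  • as a rough illustration only (not part of this publication), a template and a generated sequence of section tendencies could both be represented as ordered lists of labels and matched with a subsequence check; the labels "T", "S", and "=" and the matching rule below are assumptions:

```python
# Hypothetical sketch: a section-information template as an ordered list of
# labels ("T" = first group dominant, "S" = second group dominant,
# "=" = competing), checked against a generated section sequence.
from typing import List

def matches_template(sections: List[str], template: List[str]) -> bool:
    """True if the template's labels appear in the sections in order."""
    it = iter(sections)
    return all(label in it for label in template)  # subsequence test

generated = ["T", "=", "S", "T", "S"]  # section tendencies of a discussion
template = ["T", "S", "S"]             # e.g. lecture, then two exercise blocks
print(matches_template(generated, template))  # True
```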
  • the classification unit 122 classifies a plurality of participants participating in the analysis target discussion into the first group or the second group. For example, in the information terminal 3, the classification unit 122 may receive a setting by the analyst as to which of the first group and the second group each of the plurality of participants belongs to. In this case, the classification unit 122 causes the storage unit 11 to store information indicating to which of the first group and the second group each of the plurality of participants belongs, according to the content set in the information terminal 3 .
  • the classification unit 122 may automatically classify a plurality of participants into the first group or the second group, for example, based on the attributes of each of the participants.
  • the storage unit 11 stores in advance information indicating attributes of each of the plurality of participants. Attributes used for classification are, for example, the roles of participants (instructor, learner, etc.).
  • the classifying unit 122 classifies the participant into the first group if the participant's attribute satisfies a predetermined condition, and classifies the participant into the second group otherwise.
  • the classification unit 122 causes the storage unit 11 to store information indicating to which of the first group and the second group each of the plurality of participants belongs according to the classification result.
  • the acquisition unit 123 acquires, from the sound collection device 2, voices uttered by a plurality of participants in the discussion.
  • the acquisition unit 123 acquires a part of the speech of the discussion at predetermined time intervals during the discussion, or acquires the speech of the entire discussion after the discussion ends.
  • the acquisition unit 123 identifies the utterance period of each of the multiple participants based on the sound acquired from the sound collector 2 .
  • the acquisition unit 123 performs known sound source localization on multi-channel sounds received from the sound collector 2, for example.
  • the sound source localization is a process of estimating the direction of the sound source included in the sound acquired by the acquisition unit 123 for each time (for example, every 10 ms to 100 ms).
  • the acquisition unit 123 identifies the speaker by associating the direction of the sound source estimated for each time with the direction of each of the plurality of participants preset in the information terminal 3.
  • the acquisition unit 123 can use any sound source localization method, such as the MUSIC (Multiple Signal Classification) method or the beamforming method, as long as the direction of the sound source can be specified based on the acquired sound.
  • the acquisition unit 123 determines which participant was speaking at each predetermined time (for example, every 10 milliseconds to 100 milliseconds) in the discussion. The acquisition unit 123 identifies a continuous period from when one participant starts speaking to when that participant stops as an utterance period. When multiple participants speak at the same time, the utterance periods of the multiple participants may at least partially overlap.
  • in the case of a discussion held over a network, the acquisition unit 123 estimates, as the sound source, the participant associated with the sound collector 2 from which the acquired sound was transmitted, and identifies the utterance period of each of the plurality of participants based on the acquired sound and the estimated sound source.
  • the acquisition unit 123 is not limited to the specific method shown here, and may identify the utterance period of each of the multiple participants by other methods.
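  • purely as an illustration of the kind of processing involved (not the publication's implementation), per-frame speaker decisions can be collapsed into utterance periods as sketched below, under the simplifying assumption of one speaker decision per 100 ms frame; all names are hypothetical:

```python
# Illustrative sketch: derive utterance periods from per-frame speaker
# decisions (one speaker ID or None per 100 ms frame, e.g. from sound source
# localization). The publication notes that utterance periods may overlap
# when participants speak simultaneously; this sketch simplifies to one
# speaker per frame.
from typing import List, Optional, Tuple

FRAME_SEC = 0.1  # 100 ms decision interval

def utterance_periods(frame_speakers: List[Optional[str]]) -> List[Tuple[str, float, float]]:
    """Collapse consecutive frames with the same speaker into (speaker, start, end)."""
    periods: List[Tuple[str, float, float]] = []
    for i, speaker in enumerate(frame_speakers):
        t = i * FRAME_SEC
        if speaker is None:
            continue  # silence in this frame
        if periods and periods[-1][0] == speaker and abs(periods[-1][2] - t) < 1e-9:
            periods[-1] = (speaker, periods[-1][1], t + FRAME_SEC)  # extend period
        else:
            periods.append((speaker, t, t + FRAME_SEC))  # new period starts
    return periods

print(utterance_periods(["A", "A", None, "B", "B", "B"]))
# [('A', 0.0, 0.2), ('B', 0.3, 0.6)]
```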
  • the acquisition unit 123 acquires time-series information indicating the utterance status of each of the first group and the second group based on the utterance period of each of the specified participants.
  • FIG. 4 is a schematic diagram for explaining how the acquisition unit 123 acquires time-series information.
  • the acquisition unit 123 calculates the speech volume of each of the multiple participants based on the identified speech period. For example, for each predetermined time frame (5 seconds, 10 seconds, 30 seconds, etc.), the acquisition unit 123 calculates, as the amount of speech, a value corresponding to the length of the speech period of the participant within that time frame. Instead of or in addition to the length of the speech period, the acquisition unit 123 may calculate a value corresponding to the number of times of speech or the volume of speech as the amount of speech. The acquisition unit 123 calculates the amount of speech for each time frame in the period from the start point (start time) to the end point (end time) of the discussion for each of the plurality of participants.
  • the acquisition unit 123 classifies the speech volume of each of the multiple participants for each time frame into the first group or the second group according to the classification result of the multiple participants by the classifying unit 122 .
  • in the example of FIG. 4, a plurality of participants are classified into a T (teacher) group as the first group and an S (student) group as the second group.
  • the acquisition unit 123 calculates a statistical value (average value, median value, etc.) of the amount of speech of the participants belonging to each of the first group and the second group for each time frame.
  • the acquisition unit 123 determines, for each time frame, which of the first group and the second group has a larger statistical value of the amount of speech.
  • the acquisition unit 123 acquires, as the time-series information, information indicating which of the first group and the second group has the larger amount of speech in each predetermined time frame in the period from the start point to the end point of the discussion, according to the determination result. When a plurality of consecutive time frames in which the same group has the larger amount of speech continue, the acquisition unit 123 may integrate those time frames in the time-series information.
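  • a minimal sketch of this step, assuming per-group totals of speaking time per frame (the text above mentions statistical values such as averages or medians; a sum is used here for brevity) and hypothetical names:

```python
# Illustrative sketch, not the publication's implementation: compute, for
# each time frame, which group has the larger total speech amount.
from typing import Dict, List, Tuple

FRAME = 10.0  # time frame length in seconds (e.g. 5 s, 10 s, 30 s)

def dominant_groups(periods: List[Tuple[str, float, float]],
                    group_of: Dict[str, str],
                    duration: float) -> List[str]:
    n_frames = int(duration // FRAME)
    result = []
    for f in range(n_frames):
        f_start, f_end = f * FRAME, (f + 1) * FRAME
        amount = {"T": 0.0, "S": 0.0}
        for speaker, start, end in periods:
            # speaking time of this utterance period that falls in the frame
            overlap = max(0.0, min(end, f_end) - max(start, f_start))
            amount[group_of[speaker]] += overlap
        if amount["T"] > amount["S"]:
            result.append("T")
        elif amount["S"] > amount["T"]:
            result.append("S")
        else:
            result.append("=")  # equal speech amounts
    return result

groups = {"teacher": "T", "alice": "S", "bob": "S"}
periods = [("teacher", 0.0, 8.0), ("alice", 8.0, 14.0), ("bob", 14.0, 20.0)]
print(dominant_groups(periods, groups, 20.0))  # ['T', 'S']
```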
  • the generation unit 124 determines a plurality of sections that constitute the discussion to be analyzed, and determines, for each section, a section tendency indicating which of the first group and the second group mainly uttered. The generation unit 124 may determine sections and section tendencies for part of the discussion while the discussion is in progress, and may determine sections and section tendencies for the entire discussion after the discussion ends.
  • FIG. 5 is a schematic diagram for explaining how the generation unit 124 determines the section tendency. Based on the time-series information acquired by the acquisition unit 123, the generation unit 124 generates a transition graph showing the transition of which of the first group and the second group has the larger amount of speech.
  • FIG. 5 shows an example in which the speech analysis system S according to the present embodiment is applied to generate an ST analysis graph described in Non-Patent Document 1.
  • the horizontal axis represents time for the first group (T group), and the vertical axis represents time for the second group (S group).
  • the generation unit 124 takes the origin as the starting point (the start of the discussion), draws a line to the right along the horizontal axis for each period in which the first group has the larger amount of speech in the time-series information, and draws a line upward along the vertical axis for each period in which the second group has the larger amount of speech. The generation unit 124 repeats this from the start to the end of the time-series information, that is, from the start point to the end point of the discussion, thereby generating a transition graph showing the transition of which of the first group and the second group has the larger amount of speech.
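  • the transition graph can be pictured as a polyline built from the time-series labels; the following sketch is an illustration under assumed inputs, not the publication's code:

```python
# Illustrative sketch: build the ST-analysis-style transition graph as a
# polyline. Each "T" frame moves one step right (x axis: first-group time),
# each other frame moves one step up (y axis: second-group time; ties are
# folded into the upward step here for simplicity).
from typing import List, Tuple

def transition_path(timeseries: List[str]) -> List[Tuple[int, int]]:
    x, y = 0, 0
    path = [(x, y)]  # origin = start of the discussion
    for label in timeseries:
        if label == "T":
            x += 1  # first group dominated this frame
        else:
            y += 1  # second group dominated this frame
        path.append((x, y))
    return path

print(transition_path(["T", "T", "S", "T", "S", "S"]))
# [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (3, 3)]
```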
  • the generation unit 124 uses the generated transition graph to divide the discussion to be analyzed into a plurality of sections, and determines, for each section, a section tendency indicating which of the first group and the second group mainly uttered. First, the generation unit 124 sets the origin (the starting point of the transition graph) as the starting point of a section. The generation unit 124 then extracts, in chronological order, one predetermined period (for example, 5 seconds) from the transition graph as a unit of attention.
  • the generation unit 124 determines whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than a predetermined time.
  • the predetermined time is, for example, a value set in advance as the minimum duration of a section, such as 5 minutes or 10 minutes. The predetermined time may also be determined according to the reference section information selected by the selection unit 121. If the elapsed time from the start point of the section to the end point of the unit of attention is less than the predetermined time, the generation unit 124 extracts the next predetermined period as the unit of attention and repeats the determination of whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than the predetermined time.
  • when the elapsed time is equal to or longer than the predetermined time, the generation unit 124 determines the section tendency in the section by comparing the utterance statuses of the first group and the second group. For example, to compare the utterance statuses, the generation unit 124 calculates the slope (broken line in FIG. 5) between the coordinates of the start point of the section and the coordinates of the end point of the unit of attention on the transition graph.
  • the generation unit 124 determines the section tendency based on the calculated slope. For example, when the slope is equal to or less than a first reference value, the generation unit 124 determines that the utterances of the first group are dominant. When the slope is greater than the first reference value and equal to or less than a second reference value, the generation unit 124 determines that the utterances of the first group and the second group are competing. When the slope is greater than the second reference value, the generation unit 124 determines that the utterances of the second group are dominant.
  • the first reference value and the second reference value are stored in advance in the storage unit 11 or set in the information terminal 3 .
  • the generation unit 124 sets the determination result as the section tendency of the section.
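  • a compact sketch of this slope-based determination; the reference values below are hypothetical stand-ins for the stored or user-set values described next:

```python
# Illustrative sketch: classify a section's tendency from the slope between
# the section start and the end of the current unit of attention on the
# transition graph. REF1 and REF2 are hypothetical reference values.
from typing import Tuple

REF1 = 0.5  # slope <= REF1        -> first group (T) dominant
REF2 = 2.0  # REF1 < slope <= REF2 -> competing
            # slope > REF2         -> second group (S) dominant

def section_tendency(start: Tuple[int, int], end: Tuple[int, int]) -> str:
    dx, dy = end[0] - start[0], end[1] - start[1]
    if dx == 0:
        return "S"  # purely vertical segment: only the second group spoke
    slope = dy / dx
    if slope <= REF1:
        return "T"
    if slope <= REF2:
        return "="
    return "S"

print(section_tendency((0, 0), (3, 1)))  # 'T' (slope 0.33)
print(section_tendency((0, 0), (2, 3)))  # '=' (slope 1.5)
```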
  • when the section tendency of the preceding section and the section tendency of the current section are the same, the generation unit 124 combines the unit of attention with the preceding section, extracts the next predetermined period as the unit of attention, and repeats the determination of whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than the predetermined time.
  • when the section tendencies differ, the generation unit 124 finalizes the preceding section and its section tendency.
  • the generation unit 124 then sets the start point of the unit of attention as the start point of a new section, and repeats the determination of sections and section tendencies by the above processing until the end point of the transition graph.
  • the generation unit 124 is not limited to the specific method shown here, and may use other methods to determine, based on the time-series information, a plurality of sections that constitute the discussion and, for each section, a section tendency indicating which of the first group and the second group mainly uttered or that the utterances of the first group and the second group are competing.
  • the generation unit 124 generates section information that associates each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly uttered in that section, over all or part of the discussion, and stores the section information in the storage unit 11.
  • the section tendency in part of the discussion may indicate that the utterances of the first group and the second group are competing (that is, neither group's utterances are dominant).
  • as described above, the generation unit 124 divides the discussion to be analyzed into a plurality of sections and determines, for each section, a section tendency indicating which of the first group and the second group mainly uttered. Because the time-series information expresses the relative speech amounts of the two groups in detail for each time, it is difficult for the analyst to read utterance tendencies directly from it. In contrast, because the generation unit 124 collects each period in which the same utterance tendency continues into one section, the analyst can easily grasp the transition of utterance tendencies across the whole discussion, which makes it easier to analyze the discussion between the two groups.
  • the output unit 125 outputs the section information generated by the generation unit 124 during the discussion and/or after the discussion ends.
  • hereinafter, the output of the section information by the output unit 125 during the discussion is called real-time output, and the output of the section information by the output unit 125 after the discussion ends is called post-output.
  • FIGS. 6(a) and 6(b) are schematic diagrams for explaining how the output unit 125 outputs section information in real time.
  • in the real-time output, the output unit 125 performs control for displaying information corresponding to the section information generated by the generation unit 124 on the display unit of the information terminal 3 as shown in FIG. 6, or for outputting it as sound from the sound output unit of the sound collector 2.
  • the output unit 125 transmits information corresponding to the time-series information acquired by the acquisition unit 123 and the section information generated by the generation unit 124 to the information terminal 3 .
  • the output unit 125 causes the information terminal 3 to display a transition graph 33 that corresponds to the time-series information acquired by the acquisition unit 123 and shows the transition of which of the first group and the second group has the larger amount of speech.
  • the output unit 125 may display the section tendency on the transition graph 33 by generating the transition graph 33 using a line of a color corresponding to the section tendency for each section.
  • the output unit 125 causes the information terminal 3 to display a bar graph 34 corresponding to the section information generated by the generation unit 124 and showing the length of the section and the section tendency of the section.
  • the output unit 125 also causes the information terminal 3 to display the transition graph 33 and the bar graph 34 corresponding to the reference section information selected by the selection unit 121. Thereby, the speech analysis device 1 makes it easy to compare the section information of the discussion to be analyzed with the reference section information specified by the analyst.
  • the output unit 125 transmits, for example, information corresponding to the difference between the section information generated by the generation unit 124 and the reference section information selected by the selection unit 121 to the information terminal 3 or the sound collector 2 as the comparison result.
  • the output unit 125 outputs, as the difference between the section information and the reference section information, whether or not the difference in the length, number, order, etc. of sections satisfies a predetermined condition.
  • for example, the output unit 125 causes the information terminal 3 to display a message 35 indicating that the section of the first group (T group) in the section information is longer than the section of the first group in the reference section information.
  • for example, the output unit 125 causes the sound collector 2 to output a voice indicating that the section of the second group (S group) in the section information is longer than the section of the second group in the reference section information.
  • thereby, the speech analysis device 1 sequentially notifies the analyst of differences between the section information of the discussion to be analyzed and the reference section information specified by the analyst, so that the analyst can easily reflect those differences in the ongoing discussion.
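  • as an illustration of this real-time comparison (hypothetical data model and threshold; not the publication's implementation):

```python
# Illustrative sketch: compare generated section information against the
# reference section information and emit a message when a section runs
# longer than its reference counterpart by more than a threshold.
from typing import List, Tuple

Section = Tuple[str, float]  # (tendency label, length in minutes)

def realtime_differences(sections: List[Section],
                         reference: List[Section],
                         threshold: float = 2.0) -> List[str]:
    messages = []
    for i, (label, length) in enumerate(sections):
        if i >= len(reference):
            break  # no reference counterpart yet
        ref_label, ref_length = reference[i]
        if label == ref_label and length - ref_length > threshold:
            messages.append(
                f"Section {i + 1} ({label}) is {length - ref_length:.0f} min "
                f"longer than the reference section.")
    return messages

sections = [("T", 15.0), ("S", 5.0)]
reference = [("T", 10.0), ("S", 10.0)]
print(realtime_differences(sections, reference))
# ['Section 1 (T) is 5 min longer than the reference section.']
```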
  • FIG. 7 is a schematic diagram for explaining how the output unit 125 outputs the section information after the fact.
  • in the post-output, the output unit 125 performs control for displaying information corresponding to the section information generated by the generation unit 124 on the display unit of the information terminal 3 as shown in FIG. 7.
  • the output unit 125 transmits information corresponding to the time-series information acquired by the acquisition unit 123 and the section information generated by the generation unit 124 to the information terminal 3 .
  • the output unit 125 causes the information terminal 3 to display a transition graph 36 that corresponds to the time-series information acquired by the acquisition unit 123 and shows the transition of which of the first group and the second group has the larger amount of speech.
  • the output unit 125 may display the section tendency on the transition graph 36 by generating the transition graph 36 using a line of a color corresponding to the section tendency for each section.
  • the output unit 125 causes the information terminal 3 to display a bar graph 37 corresponding to the section information generated by the generation unit 124 and showing the length of the section and the section tendency of the section.
  • the output unit 125 also causes the information terminal 3 to display the transition graph 36 and the bar graph 37 corresponding to the reference section information selected by the selection unit 121. Thereby, the speech analysis device 1 makes it easy to compare the section information of the discussion to be analyzed with the reference section information specified by the analyst.
  • the output unit 125 transmits to the information terminal 3, for example, information that associates the section tendency of each of the plurality of sections indicated by the section information generated by the generation unit 124 with the section tendency of each of the plurality of sections indicated by the reference section information selected by the selection unit 121.
  • the output unit 125 displays, on the information terminal 3, the transitions 38 of the section tendencies of the plurality of sections for each of the section information generated by the generation unit 124 and the reference section information selected by the selection unit 121.
  • for example, the output unit 125 detects the correspondence between the sections indicated by the section information generated by the generation unit 124 and the sections indicated by the reference section information selected by the selection unit 121, and displays increases, decreases, differences in order, and the like of sections as the transitions 38 of the section tendencies. This makes it easy to understand how the transition of section tendencies in the discussion to be analyzed differs from that of the reference.
  • the output unit 125 may output the words uttered in each of the plurality of sections determined by the generation unit 124 in association with that section.
  • the output unit 125 extracts the words included in the utterance in each of the sections by performing known speech recognition processing on the speech in each of the sections, for example.
  • the output unit 125 causes the information terminal 3 to display, for example, each of the plurality of sections in association with some or all of the words extracted for the section.
  • the speech analysis device 1 can make it easier for the analyst to understand the content of each of the plurality of sections.
  • the output unit 125 may output the characteristics of the entire discussion based on the section information generated by the generation unit 124. In this case, the output unit 125 determines the characteristics of the entire discussion based on the section tendencies of the plurality of sections constituting the discussion. For example, the output unit 125 determines the characteristics of the entire discussion based on the proportions, among the plurality of sections constituting the discussion, of sections in which the utterances of the first group are dominant, sections in which the utterances of the second group are dominant, and sections in which the utterances of the first group and the second group are competing.
  • for example, the output unit 125 determines that the discussion is lecture-based when the proportion of sections in which the T group (instructor group) mainly uttered is equal to or greater than a predetermined value, and determines that the discussion is exercise-based when the proportion of sections in which the S group (learner group) mainly uttered is equal to or greater than a predetermined value.
  • the output unit 125 causes the information terminal 3 to display a message (report) representing the determined characteristics of the entire discussion.
  • the speech analysis device 1 makes it easier for the analyst to grasp the trend of the entire discussion determined based on the segment trends of the plurality of segments.
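  • a sketch of such a characterization, assuming time-weighted proportions and a hypothetical 0.6 threshold standing in for "a predetermined value":

```python
# Illustrative sketch: characterize the whole discussion from the proportion
# of time spent in each section tendency. The threshold and labels are
# hypothetical; the publication only says "a predetermined value".
from typing import List, Tuple

def characterize(sections: List[Tuple[str, float]], threshold: float = 0.6) -> str:
    total = sum(length for _, length in sections)
    share = {"T": 0.0, "S": 0.0, "=": 0.0}
    for label, length in sections:
        share[label] += length / total
    if share["T"] >= threshold:
        return "lecture-based discussion"
    if share["S"] >= threshold:
        return "exercise-based discussion"
    return "mixed discussion"

print(characterize([("T", 30.0), ("S", 10.0), ("=", 5.0)]))
# lecture-based discussion (T share is 30/45, about 0.67)
```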
  • FIG. 8 is a diagram showing a flowchart of an exemplary speech analysis method executed by the speech analysis device 1 according to this embodiment.
  • the selection unit 121 selects reference section information to be compared with section information (S11).
  • the classification unit 122 classifies a plurality of participants participating in the discussion to be analyzed into the first group or the second group (S12). For example, the classification unit 122 receives, on the information terminal 3, a setting as to which of the first group and the second group each of the plurality of participants belongs, or automatically classifies the plurality of participants into the first group or the second group based on their attributes.
  • the acquisition unit 123 acquires voices uttered by a plurality of participants in the discussion from the sound collector 2 .
  • the acquisition unit 123 identifies the utterance period of each of the multiple participants based on the sound acquired from the sound collector 2 (S13).
  • the acquisition unit 123 acquires time-series information indicating the utterance status of each of the first group and the second group for each time based on the utterance period of each of the identified participants (S14).
  • based on the time-series information acquired by the acquisition unit 123, the generation unit 124 performs section information generation processing that generates section information associating each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly uttered in that section (S2). The section information generation processing in step S2 will be described using FIG. 9.
  • the output unit 125 outputs the section information generated by the generation unit 124 to at least one of the sound collector 2 and the information terminal 3 (S15).
  • FIG. 9 is a diagram showing a flowchart of section information generation processing in an exemplary speech analysis method executed by the speech analysis device 1 according to this embodiment.
  • the generation unit 124 generates a transition graph showing transitions indicating which of the first group and the second group has a larger amount of speech based on the time-series information acquired by the acquisition unit 123 (S21).
  • the generating unit 124 sets the origin (the starting point of the transition graph) as the starting point of the section (S22).
  • the generation unit 124 extracts, in chronological order, one predetermined period (for example, 5 seconds) from the transition graph as a unit of attention (S23).
  • the generation unit 124 determines whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than a predetermined time (S24). If it is not (NO in S25), the generation unit 124 returns to step S23 and repeats the processing for the next unit of attention.
  • if it is (YES in S25), the generation unit 124 calculates the slope between the coordinates of the start point of the section and the coordinates of the end point of the unit of attention on the transition graph (S26).
  • the generation unit 124 determines the section tendency based on the calculated slope (S27). For example, when the slope is equal to or less than the first reference value, the generation unit 124 determines that the utterances of the first group are dominant. When the slope is greater than the first reference value and equal to or less than the second reference value, the generation unit 124 determines that the utterances of the first group and the second group are competing. When the slope is greater than the second reference value, the generation unit 124 determines that the utterances of the second group are dominant. The generation unit 124 sets the determination result as the section tendency of the section.
  • when the section tendency of the preceding section and that of the current determination are the same (YES in S28), the generation unit 124 combines the unit of attention with the preceding section (S29). The generation unit 124 returns to step S23 and repeats the processing for the next unit of attention.
  • when the section tendencies differ (NO in S28), the generation unit 124 finalizes the preceding section and its section tendency (S30). When the time-series information has not ended (NO in S31), the generation unit 124 sets the start point of the unit of attention as the start point of a new section (S32), returns to step S23, and repeats the processing for the next unit of attention.
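  • the following consolidates steps S21 to S32 into one runnable sketch, under the same assumptions as the earlier snippets ("T"/"S" time-series labels, hypothetical slope reference values, and a two-frame stand-in for the minimum section length):

```python
# Consolidated illustrative sketch of the section information generation loop.
from typing import List, Tuple

REF1, REF2 = 0.5, 2.0  # hypothetical slope reference values
MIN_FRAMES = 2         # stand-in for the minimum section length

def tendency(start: Tuple[int, int], end: Tuple[int, int]) -> str:
    dx, dy = end[0] - start[0], end[1] - start[1]
    slope = dy / dx if dx else float("inf")
    return "T" if slope <= REF1 else "=" if slope <= REF2 else "S"

def generate_sections(timeseries: List[str]) -> List[Tuple[str, int, int]]:
    # S21: build the transition graph as cumulative (T-frames, S-frames) points
    path = [(0, 0)]
    for label in timeseries:
        x, y = path[-1]
        path.append((x + 1, y) if label == "T" else (x, y + 1))

    sections: List[Tuple[str, int, int]] = []  # (tendency, start idx, end idx)
    sec_start = 0   # S22: the first section starts at the origin
    current = None  # tendency of the section under construction
    for i in range(1, len(path)):          # S23: advance one unit of attention
        if i - sec_start < MIN_FRAMES:     # S24/S25: minimum length not reached
            continue
        t = tendency(path[sec_start], path[i])   # S26/S27: slope-based tendency
        if current is None or current == t:      # S28/S29: same tendency, absorb
            current = t
        else:                                    # S30/S32: finalize, start anew
            sections.append((current, sec_start, i - 1))
            sec_start, current = i - 1, None
    if current is not None:                      # S31: time-series ended
        sections.append((current, sec_start, len(path) - 1))
    return sections

print(generate_sections(["T", "T", "T", "T", "S", "S", "S", "S"]))
# [('T', 0, 6), ('S', 6, 8)]  -> indices into the transition-graph path
```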
  • the generation unit 124 generates section information that associates each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly uttered in that section, over all or part of the discussion, and stores the section information in the storage unit 11.
  • the section tendency in part of the discussion may indicate that the utterances of the first group and the second group are competing.
  • as described above, the speech analysis device 1 determines, based on the voices of the discussion, a section tendency indicating which of the two groups mainly uttered in each section of the discussion, associates each section with its section tendency, and reports them to the analyst.
  • the speech analysis system S makes it easy for the analyst to grasp the utterance tendencies of the two groups, and facilitates analysis of the utterance tendencies in the discussion between the two groups.
  • the grouping of multiple participants may change during the discussion.
  • in this modification, the speech analysis device 1 changes the participants belonging to the first group and the second group between a plurality of periods in the discussion, and generates section information based on the changed grouping.
  • the classification unit 122 changes the participants who belong to the first group and the second group between a plurality of periods in the discussion. For example, the classification unit 122 receives in advance, on the information terminal 3, settings of the discussion structure (explanation period, exercise period, etc.), and changes the participants belonging to the first group and the second group at the timing when the structure changes. For example, the classification unit 122 classifies the teacher into the first group and the students into the second group during the explanation period, while classifying some students into the first group and the other students into the second group during the exercise period.
  • the classification unit 122 may detect the arrangement of each of the plurality of participants by, for example, performing known image recognition processing on a captured image acquired by a camera or the like, and change the participants belonging to the first group and the second group at the timing when the arrangement changes. For example, the classification unit 122 classifies participants sitting in specific seats into the first group and participants sitting in the other seats into the second group.
  • the classification unit 122 may also change the parent groups, each of which includes a first group and a second group, according to grouping that differs for each period. That is, for each of a plurality of periods in the discussion, the classification unit 122 generates a first parent group including a first group and a second group into which some of the plurality of participants are classified, and a second parent group including a first group and a second group into which some of the plurality of participants not belonging to the first parent group are classified.
  • the classification unit 122 may generate three or more parent groups.
  • the classification unit 122 may generate a parent group including all participants (for example, the entire classroom) during at least one period of the discussion. That is, the classification unit 122 may generate a first parent group including a first group and a second group into which some of the participants are classified, a second parent group including a first group and a second group into which some of the participants not belonging to the first parent group are classified, and a third parent group including a first group and a second group into which the participants belonging to the first parent group and the second parent group are classified.
  • the generation unit 124 generates section information for each of a plurality of parent groups.
  • the output unit 125 outputs section information for each of the multiple parent groups.
  • FIG. 10 is a schematic diagram for explaining how the output unit 125 outputs section information in this modification.
  • the output unit 125 performs control for displaying information corresponding to the section information of each of the plurality of parent groups generated by the generation unit 124 on the display unit of the information terminal 3 as shown in FIG. 10.
  • the output unit 125 outputs, for example, the section information of the first parent group and the section information of the second parent group at the same time.
  • in the example of FIG. 10, the output unit 125 displays the section information of three parent groups, including the first parent group and the second parent group, side by side for the period from 10 minutes to 50 minutes.
  • the speech analysis device 1 allows the analyst to view the section information of multiple parent groups (multiple tables, etc.).
  • the output unit 125 outputs, for example, section information of at least one of the first parent group and the second parent group, and section information of the third parent group.
  • in the example of FIG. 10, the output unit 125 displays the section information of the three parent groups including the first parent group and the second parent group side by side for the period from 10 minutes to 50 minutes, and the section information of the third parent group corresponding to all of the participants for the period from 50 minutes to 60 minutes.
  • thereby, the speech analysis device 1 makes it possible to hierarchically and easily analyze a parent group corresponding to all of the participants (a whole classroom, etc.) and a plurality of parent groups corresponding to the groups (a plurality of tables, etc.) into which the participants are divided.
  • a specific participant, such as an instructor, may move between a plurality of tables and join different groups.
  • in this modification, the speech analysis device 1 changes the grouping based on the position of the specific participant, and generates section information based on the changed grouping.
  • FIG. 11 is a schematic diagram for explaining how the generation unit 124 generates section information in this modified example.
  • the classification unit 122 estimates the position of the instructor, who is the specific participant, during the discussion. For example, the classification unit 122 may estimate which sound collector 2 the instructor is close to by comparing the voices acquired by the acquisition unit 123 from the plurality of sound collectors 2 with a pre-registered feature amount (voiceprint, etc.) of the instructor's voice.
  • alternatively, the classification unit 122 may estimate which sound collector 2 the instructor is close to based on the strength of short-range wireless communication performed between a communication device (smartphone, etc.) carried by the instructor and each of the plurality of sound collectors 2.
  • the classification unit 122 is not limited to the specific methods shown here, and may use other methods to estimate the position of the instructor during the discussion.
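  • as one concrete (assumed) reading of the signal-strength approach, the nearest sound collector can be taken to be the one with the strongest received signal:

```python
# Illustrative sketch, not the publication's code: pick the sound collector
# nearest the instructor from the received signal strength (RSSI, in dBm) of
# short-range wireless communication; higher (less negative) means closer.
from typing import Dict

def nearest_collector(rssi_by_collector: Dict[str, float]) -> str:
    return max(rssi_by_collector, key=rssi_by_collector.get)

rssi = {"table1": -72.0, "table2": -48.0, "table3": -80.0}
print(nearest_collector(rssi))  # 'table2'
```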
  • the example of FIG. 11 represents that the instructor moved to table 1, table 2, and table 3 in order.
  • based on the estimated position, the classification unit 122 changes the participants belonging to the first group, to which the instructor belongs, and the participants belonging to the second group between a plurality of periods in the discussion.
  • the classification unit 122 generates a first group including the instructor and a second group including the students at table 1 during the period when the instructor is positioned at table 1. Further, the classification unit 122 generates a first group including the instructor and a second group including the students at table 2 during the period when the instructor is positioned at table 2. Further, the classification unit 122 generates a first group including the instructor and a second group including the students at table 3 during the period when the instructor is positioned at table 3.
  • the generation unit 124 generates section information for each of the plurality of periods whose grouping has been changed, and combines the generated section information.
  • in this way, when a specific participant such as an instructor moves during the discussion, the speech analysis device 1 generates section information according to the grouping changed in accordance with the position of the specific participant, making it possible to easily analyze utterance tendencies centered on that participant.
  • the processor of the speech analysis device 1 is the subject of each step (process) included in the speech analysis method shown in FIGS. 8 and 9. That is, the processor of the speech analysis device 1 reads a program for executing the speech analysis method shown in FIGS. 8 and 9 from the storage unit 11 and executes the program, thereby executing the speech analysis method shown in FIGS. 8 and 9. Some steps included in the speech analysis method shown in FIGS. 8 and 9 may be omitted, the order of steps may be changed, and a plurality of steps may be performed in parallel.
  • Reference signs: S: speech analysis system; 1: speech analysis device; 11: storage unit; 12: control unit; 121: selection unit; 122: classification unit; 123: acquisition unit; 124: generation unit; 125: output unit; 2: sound collector; 3: information terminal


Abstract

A voice analysis device 1 according to an embodiment of the present invention has: an acquisition unit 123 that acquires time-series information representing, for each time, the respective utterance situations of a first group and a second group in voices uttered in a discussion by participants belonging to the first group and participants belonging to the second group; a generation unit 124 that, on the basis of the time-series information, generates section information in which each of a plurality of sections constituting the discussion is associated, in all or a portion of the discussion, with a section tendency representing which of the first group and the second group is the main speaker in that section; and an output unit 125 that outputs the section information.

Description

Speech analysis device and speech analysis method
The present invention relates to a speech analysis device and a speech analysis method for analyzing speech uttered in a discussion.
Patent Literature 1 discloses a system that uses a microphone to acquire voices uttered by participants during a conference, identifies the participant who is speaking on the basis of voiceprint data extracted from the voices, and displays the utterance status of each of a plurality of participants on a display.
Patent Literature 1: Japanese Patent Application Laid-Open No. 2006-208482
Discussions between two groups, such as a discussion between learners and a teacher or a discussion between two teams, may be analyzed. The system disclosed in Patent Literature 1 displays, for each time, that each of a plurality of participants has spoken, so it is difficult for an analyst to grasp the utterance tendencies of two groups into which the participants are divided, and there is thus a problem that it is difficult to analyze a discussion between two groups.
The present invention was made in view of these points, and aims to make it easier to analyze the tendency of utterances in discussions between two groups.
A speech analysis device according to a first aspect of the present invention includes: an acquisition unit that acquires time-series information indicating, for each time, the respective utterance situations of a first group and a second group in voices uttered in a discussion by participants belonging to the first group and participants belonging to the second group; a generation unit that, on the basis of the time-series information, generates section information in which each of a plurality of sections constituting the discussion is associated, in all or part of the discussion, with a section tendency indicating which of the first group and the second group is the main speaker in that section; and an output unit that outputs the section information.
The section tendency may indicate which of the first group and the second group mainly speaks, or may indicate that the utterances of the first group and the second group are antagonistic.
The generation unit may determine each of the plurality of sections so as to be equal to or longer than a predetermined time, and may determine the section tendency of a section by comparing the utterance situations of the first group and the second group in that section.
The time-series information may be information indicating which of the first group and the second group has a larger amount of speech for each predetermined time frame during the period from the start point to the end point of the discussion.
The speech analysis device may further include a classification unit that classifies a plurality of participants into the first group and the second group.
The classification unit may change the participants belonging to each of the first group and the second group between a plurality of periods in the discussion.
The classification unit may generate a first parent group including the first group and the second group into which some of the plurality of participants are classified, and a second parent group including the first group and the second group into which some of the plurality of participants not belonging to the first parent group are classified, and the output unit may output the section information of the first parent group and the section information of the second parent group at the same time.
The classification unit may generate a first parent group including the first group and the second group into which some of the plurality of participants are classified, and a second parent group including the first group and the second group into which some of the plurality of participants not belonging to the first parent group are classified, and may further generate a third parent group including the first group and the second group into which the participants belonging to the first parent group and the second parent group are classified, and the output unit may output the section information of at least one of the first parent group and the second parent group and the section information of the third parent group.
The classification unit may change the participants belonging to the first group, to which a specific participant belongs, and the participants belonging to the second group, based on the position of the specific participant.
The output unit may output words that are included in the utterances of each of the plurality of sections and extracted by performing speech recognition processing on the speech, in association with the corresponding section.
The output unit may output a characteristic of the entire discussion based on the section tendencies of the plurality of sections constituting the discussion.
The speech analysis device may further include a selection unit that selects reference section information to be compared with the section information, and the output unit may output a result of a comparison between the section information and the reference section information.
The output unit may output, during the discussion, information corresponding to a difference between the section information and the reference section information as the comparison result.
After the discussion, the output unit may output the section tendency of each of the plurality of sections indicated by the section information in association with the section tendency of each of the plurality of sections indicated by the reference section information.
A speech analysis method according to a second aspect of the present invention is executed by a processor and includes the steps of: acquiring time-series information indicating, for each time, the respective utterance situations of a first group and a second group in voices uttered in a discussion by participants belonging to the first group and participants belonging to the second group; generating, on the basis of the time-series information, section information in which each of a plurality of sections constituting the discussion is associated, in all or part of the discussion, with a section tendency indicating which of the first group and the second group is the main speaker in that section; and outputting the section information.
According to the present invention, it is possible to easily analyze the tendency of utterances in discussions between two groups.
FIG. 1 is a schematic diagram of a speech analysis system S according to an embodiment.
FIG. 2 is a block diagram of the speech analysis system S according to the embodiment.
FIG. 3 is a schematic diagram for explaining a method by which a selection unit selects reference section information.
FIG. 4 is a schematic diagram for explaining a method by which an acquisition unit acquires time-series information.
FIG. 5 is a schematic diagram for explaining a method by which a generation unit determines a section tendency.
FIG. 6 is a schematic diagram for explaining a method by which an output unit outputs section information in real time.
FIG. 7 is a schematic diagram for explaining a method by which the output unit performs post-hoc output of section information.
FIG. 8 is a flowchart of an exemplary speech analysis method executed by the speech analysis device according to the embodiment.
FIG. 9 is a flowchart of section information generation processing in the exemplary speech analysis method executed by the speech analysis device according to the embodiment.
FIG. 10 is a schematic diagram for explaining a method by which the output unit outputs section information in a modified example.
FIG. 11 is a schematic diagram for explaining a method by which the generation unit generates section information in a modified example.
[Overview of speech analysis system S]
FIG. 1 is a schematic diagram of the speech analysis system S according to this embodiment. The speech analysis system S includes a speech analysis device 1, a sound collector 2, and an information terminal 3. The numbers of sound collectors 2 and information terminals 3 included in the speech analysis system S are not limited. The speech analysis system S may include other devices such as servers and terminals.
The speech analysis device 1 is a computer that analyzes speech uttered in a discussion in which a plurality of participants participate and provides the analysis results to an analyst. The analyst may be one of the participants or a person different from the participants. The speech analysis device 1 analyzes the speech acquired by the sound collector 2 and outputs the analysis result to the sound collector 2 or the information terminal 3. The speech analysis device 1 is connected to the sound collector 2 and the information terminal 3 by wire or wirelessly via a network such as a local area network or the Internet.
The speech analysis device 1 analyzes the speech of a discussion conducted by a plurality of participants divided into at least two groups. Discussions to be analyzed are, for example, classes, group discussions, debates, meetings, and the like. The plurality of participants are classified into either a first group or a second group. Participants belonging to the first group are, for example, instructors such as teachers and tutors. Participants belonging to the second group are, for example, learners such as pupils and students. A plurality of learners may also be classified into the first group and the second group. The plurality of participants may be classified according to other criteria.
The plurality of participants as a whole may be classified into a plurality of parent groups, and within each of the parent groups the participants may further be classified into the first group or the second group. In this case, each parent group includes a first group and a second group. For example, one parent group corresponds to one table surrounded by a plurality of participants, and a sound collector 2 is arranged at that table. The plurality of participants surrounding one table are classified into a first group and a second group.
The speech analysis device 1 may also analyze the speech of a discussion held over a network (for example, a web conference). In this case, a sound collector 2 is arranged in each space where participants are present during the discussion, and each sound collector 2 is associated with one of the plurality of participants.
The sound collector 2 is a device that acquires the speech uttered in the discussion. The sound collector 2 includes, for example, a microphone array including a plurality of sound collecting units, such as microphones, arranged in different directions. The microphone array includes, for example, a plurality of (e.g., eight) microphones arranged at equal intervals on the same circumference in a plane horizontal to the ground. By using such a microphone array, the speech analysis device 1 can identify which participant is the speaker (sound source) based on the speech uttered by the plurality of participants surrounding the sound collector 2. The sound collector 2 transmits the speech acquired using the microphone array to the speech analysis device 1 as speech data. The sound collector 2 may also include a sound output unit such as a speaker.
The information terminal 3 is a computer that outputs information, such as a smartphone, a tablet terminal, or a personal computer. The information terminal 3 is used, for example, by at least some of the participants. The information terminal 3 may also be used by an analyst different from the participants. The information terminal 3 has a display unit such as a liquid crystal display, and displays the information received from the speech analysis device 1 on the display unit.
The information terminal 3 may also function as the sound collector 2 by having a sound collecting unit such as a microphone. In this case, the information terminal 3 used by each of the participants transmits the speech acquired using the sound collecting unit to the speech analysis device 1 as speech data.
An overview of the process by which the speech analysis system S according to this embodiment analyzes speech is described below. The speech analysis device 1 classifies the plurality of participants into the first group or the second group. For example, the speech analysis device 1 receives, at the information terminal 3, a setting of whether each of the participants belongs to the first group or the second group, or automatically classifies the participants into the first group or the second group based on the attributes of the participants.
The speech analysis device 1 acquires, from the sound collector 2, the speech uttered by the participants in the discussion. By identifying the utterance period of each of the participants in the acquired speech, the speech analysis device 1 acquires time-series information indicating the utterance situations of the first group and the second group for each time. The time-series information is, for example, information indicating which of the first group and the second group has a larger amount of speech for each predetermined time frame.
Based on the acquired time-series information, the speech analysis device 1 generates section information that associates each of a plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group is the main speaker in that section. In addition to indicating which of the first group and the second group is the main speaker, the section information may indicate that the utterances of the first group and the second group are antagonistic. The speech analysis device 1 outputs the generated section information to at least one of the sound collector 2 and the information terminal 3.
In this way, the speech analysis system S determines, based on the speech of the discussion, a section tendency indicating which of the two groups is the main speaker for each section of the discussion, and notifies the analyst of the sections in association with their section tendencies. This makes it easy for the analyst to grasp the utterance tendencies of the two groups and facilitates analysis of the utterance tendencies in the discussion between the two groups.
[Structure of speech analysis system S]
FIG. 2 is a block diagram of the speech analysis system S according to this embodiment. In FIG. 2, the arrows indicate the main data flows, and there may be data flows other than those shown in FIG. 2. In FIG. 2, each block shows a configuration in units of functions, not in units of hardware (devices). As such, the blocks shown in FIG. 2 may be implemented within a single device or may be implemented separately in a plurality of devices. Data may be exchanged between the blocks via any means, such as a data bus, a network, or a portable storage medium.
The speech analysis device 1 has a storage unit 11 and a control unit 12. The speech analysis device 1 may be configured by connecting two or more physically separate devices by wire or wirelessly. The speech analysis device 1 may also be configured as a cloud, which is a collection of computer resources.
The storage unit 11 is a storage medium including a ROM (Read Only Memory), a RAM (Random Access Memory), a hard disk drive, and the like. The storage unit 11 stores in advance the programs executed by the control unit 12. The storage unit 11 may be provided outside the speech analysis device 1, in which case it may exchange data with the control unit 12 via a network.
The control unit 12 has a selection unit 121, a classification unit 122, an acquisition unit 123, a generation unit 124, and an output unit 125. The control unit 12 is a processor such as a CPU (Central Processing Unit), and functions as the selection unit 121, the classification unit 122, the acquisition unit 123, the generation unit 124, and the output unit 125 by executing the programs stored in the storage unit 11. At least some of the functions of the control unit 12 may be executed by an electric circuit. At least some of the functions of the control unit 12 may also be realized by the control unit 12 executing a program via a network.
The processing executed by the speech analysis device 1 is described in detail below. The selection unit 121 selects reference section information to be compared with the section information. The section information is information that associates each of the plurality of sections constituting the discussion with the section tendency of that section. The section tendency indicates which of the first group and the second group is the main speaker, or that the utterances of the first group and the second group are antagonistic. The section information is generated by the generation unit 124, described later, based on the speech of the discussion to be analyzed. The reference section information is section information generated in advance and stored in advance in the storage unit 11.
The selection unit 121 selects the reference section information based on, for example, content specified by the analyst at the information terminal 3. The analyst is one of the participants in the discussion to be analyzed or a person different from the participants.
FIGS. 3(a) and 3(b) are schematic diagrams for explaining how the selection unit 121 selects the reference section information. In the example of FIG. 3(a), the storage unit 11 stores in advance discussion information indicating at least one of the attributes of a past discussion (subject, type, time, topic, format, etc.), the date and time at which the discussion was held, and the attributes of the participants in the discussion (teacher in charge, grade, etc.), in association with section information generated by the generation unit 124 by the method described later. The selection unit 121 receives, for example at the information terminal 3, a specification of search conditions for past discussions by the analyst. The search conditions are, for example, at least one of the attributes of a discussion, the date and time at which a discussion was held, and the attributes of participants. The selection unit 121 extracts discussion information matching the specified search conditions from the storage unit 11 and causes the information terminal 3 to display it together with the section information 31 associated with each of the one or more extracted pieces of discussion information.
The selection unit 121 then receives, at the information terminal 3, a specification of one of the pieces of section information 31 by the analyst and selects the specified section information 31 as the reference section information. This allows the speech analysis device 1 to compare the discussion to be analyzed with a discussion matching the search conditions specified by the analyst.
In the example of FIG. 3(b), the storage unit 11 stores templates of section information in advance. A template of section information is, for example, information indicating an order of sections in which the utterances of the first group are dominant, sections in which the utterances of the second group are dominant, and sections in which the utterances of the first group and the second group are antagonistic. The selection unit 121 causes the information terminal 3 to display the plurality of templates 32 stored in the storage unit 11. FIG. 3(b) shows an example in which the participants are classified into a T (teacher) group as the first group and an S (student) group as the second group, but the participants may be classified into two groups according to other criteria.
The selection unit 121 then receives, at the information terminal 3, a specification of one of the templates 32 by the analyst and selects the order of sections indicated by the specified template 32 as the reference section information. The selection unit 121 may also receive, at the information terminal 3, an input of an order of sections in which the utterances of the first group are dominant, sections in which the utterances of the second group are dominant, and sections in which the utterances of the first group and the second group are antagonistic, and select the input order of sections as the reference section information. This allows the speech analysis device 1 to compare the discussion to be analyzed with the order of sections specified by the analyst.
The classification unit 122 classifies the plurality of participants in the discussion to be analyzed into the first group or the second group. The classification unit 122 may, for example, receive at the information terminal 3 a setting by the analyst of whether each of the participants belongs to the first group or the second group. In this case, the classification unit 122 causes the storage unit 11 to store information indicating whether each of the participants belongs to the first group or the second group, according to the content set at the information terminal 3.
The classification unit 122 may also automatically classify the participants into the first group or the second group based on, for example, the attributes of each participant. In this case, the storage unit 11 stores in advance information indicating the attributes of each of the participants. An attribute used for the classification is, for example, a participant's role (instructor, learner, etc.). For example, the classification unit 122 classifies a participant into the first group if the participant's attributes satisfy a predetermined condition, and into the second group otherwise. The classification unit 122 causes the storage unit 11 to store information indicating whether each of the participants belongs to the first group or the second group according to the classification result.
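As a concrete illustration, a minimal Python sketch of this attribute-based classification follows; the attribute layout and the "role" condition are assumptions for illustration, not taken from the patent.

```python
# Minimal sketch of the automatic classification: the predetermined
# condition is assumed here to be role == "instructor".
def classify_participants(attributes):
    """attributes: {participant_id: {"role": "instructor" or "learner"}}.
    Returns (first_group, second_group) as lists of participant IDs."""
    first_group, second_group = [], []
    for participant, attrs in attributes.items():
        if attrs.get("role") == "instructor":
            first_group.append(participant)
        else:
            second_group.append(participant)
    return first_group, second_group
```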
The acquisition unit 123 acquires, from the sound collector 2, the speech uttered by the participants in the discussion. The acquisition unit 123 acquires part of the speech of the discussion at predetermined time intervals during the discussion, or acquires the speech of the entire discussion after the discussion ends.
The acquisition unit 123 identifies the utterance period of each of the participants based on the speech acquired from the sound collector 2. In the case of a discussion held around a sound collector 2 having a microphone array, the acquisition unit 123 performs, for example, known sound source localization on the multi-channel speech received from the sound collector 2. Sound source localization is a process of estimating the direction of the sound source included in the acquired speech for each time (for example, every 10 to 100 milliseconds). The acquisition unit 123 associates the direction of the sound source estimated for each time with the directions of the participants preset at the information terminal 3.
The acquisition unit 123 can use other sound source localization methods, such as the MUSIC (Multiple Signal Classification) method or the beamforming method, as long as the direction of the sound source can be identified based on the acquired speech.
Next, based on the acquired speech and the estimated direction of the sound source, the acquisition unit 123 determines which participant spoke in the discussion for each predetermined time (for example, every 10 to 100 milliseconds). The acquisition unit 123 identifies a continuous period from when one participant starts speaking to when the participant finishes as an utterance period. When a plurality of participants speak at the same time, their utterance periods may at least partially overlap.
In the case of a discussion held over a network, the acquisition unit 123, for example, estimates the participant associated with the sound collector 2 that transmitted the acquired speech as the sound source, and identifies the utterance period of each of the participants based on the acquired speech and the estimated sound source.
The acquisition unit 123 is not limited to the specific methods shown here and may identify the utterance period of each of the participants by other methods.
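To make the preceding steps concrete, here is a hedged Python sketch of one way per-frame speaker decisions could be merged into utterance periods; the data layout and the frame length are assumptions for illustration.

```python
# Sketch of deriving utterance periods: per-frame speaker decisions (one
# set of speakers per 10-100 ms frame, as produced by the sound source
# localization above) are merged into continuous periods.
def utterance_periods(frame_speakers, frame_sec=0.05):
    """frame_speakers: list of sets of participants speaking in each frame.
    Returns {participant: [(start_sec, end_sec), ...]}; periods of
    different participants may overlap, as noted above."""
    periods, open_start = {}, {}
    for i, speakers in enumerate(frame_speakers + [set()]):  # sentinel frame
        t = i * frame_sec
        for p in speakers - set(open_start):
            open_start[p] = t                      # an utterance begins
        for p in set(open_start) - speakers:
            periods.setdefault(p, []).append((open_start.pop(p), t))
    return periods
```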
The acquisition unit 123 acquires time-series information indicating the utterance situations of the first group and the second group for each time, based on the identified utterance periods of the participants. FIG. 4 is a schematic diagram for explaining how the acquisition unit 123 acquires the time-series information.
The acquisition unit 123 calculates the amount of speech of each of the participants based on the identified utterance periods. For example, for each predetermined time frame (5 seconds, 10 seconds, 30 seconds, etc.), the acquisition unit 123 calculates, as the amount of speech, a value corresponding to the length of the participant's utterance periods within that time frame. Instead of or in addition to the length of the utterance periods, the acquisition unit 123 may calculate a value corresponding to the number of utterances or the utterance volume as the amount of speech. The acquisition unit 123 calculates the amount of speech for each time frame for each of the participants in the period from the start point (start time) to the end point (end time) of the discussion.
The acquisition unit 123 classifies the amount of speech of each of the participants for each time frame into the first group or the second group according to the result of the classification of the participants by the classification unit 122. In the example of FIG. 4, the participants are classified into a T (teacher) group as the first group and an S (student) group as the second group. The acquisition unit 123 calculates, for each time frame, a statistical value (average value, median value, etc.) of the amounts of speech of the participants belonging to each of the first group and the second group.
The acquisition unit 123 determines, for each time frame, which of the first group and the second group has the larger statistical value of the amount of speech. According to the determination results, the acquisition unit 123 acquires, as the time-series information, information indicating which of the first group and the second group has the larger amount of speech for each predetermined time frame in the period from the start point to the end point of the discussion. When the time-series information contains a run of consecutive time frames in which the same group has the larger amount of speech, the acquisition unit 123 may merge those time frames.
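A minimal sketch of this per-frame determination follows, assuming the mean as the statistic and hypothetical data structures; neither is prescribed by the patent.

```python
# Sketch of deriving the time-series information: per time frame, the mean
# amount of speech of each group is compared and the dominant group is
# recorded. Both groups are assumed to be non-empty.
from statistics import mean

def dominant_group_per_frame(speech_amounts, group_of):
    """speech_amounts: {participant: [amount in frame 0, frame 1, ...]}.
    group_of: {participant: "T" or "S"} per the classification unit.
    Returns e.g. ["T", "T", "S", ...], one label per time frame."""
    n_frames = len(next(iter(speech_amounts.values())))
    series = []
    for f in range(n_frames):
        t_vals = [a[f] for p, a in speech_amounts.items() if group_of[p] == "T"]
        s_vals = [a[f] for p, a in speech_amounts.items() if group_of[p] == "S"]
        series.append("T" if mean(t_vals) >= mean(s_vals) else "S")
    return series
```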
Based on the time-series information acquired by the acquisition unit 123, the generation unit 124 determines a plurality of sections constituting the discussion to be analyzed, and determines, for each section, a section tendency indicating which of the first group and the second group is the main speaker. The generation unit 124 may determine the sections and section tendencies of part of the discussion during the discussion, or may determine the sections and section tendencies of the entire discussion after the discussion ends.
FIG. 5 is a schematic diagram for explaining how the generation unit 124 determines the section tendencies. Based on the time-series information acquired by the acquisition unit 123, the generation unit 124 generates a transition graph showing the transition of which of the first group and the second group has the larger amount of speech. FIG. 5 shows an example in which the speech analysis system S according to this embodiment is applied to the generation of an S-T analysis graph described in Non-Patent Literature 1.
In the transition graph illustrated in FIG. 5, the horizontal axis represents the time of the first group (T group), and the vertical axis represents the time of the second group (S group). Taking the origin as the start point (the start point of the discussion), the acquisition unit 123 draws a line to the right along the horizontal axis for a period in which the amount of speech of the first group is larger in the time-series information, and draws a line upward along the vertical axis for a period in which the amount of speech of the second group is larger. By repeating this from the start to the end of the time-series information, that is, from the start point to the end point of the discussion, the acquisition unit 123 generates a transition graph showing the transition of which of the first group and the second group has the larger amount of speech.
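Under the same assumptions as the previous sketch, the transition graph path could be accumulated as follows.

```python
# Sketch of building the S-T transition graph path from the series above:
# each T-dominant frame moves right, each S-dominant frame moves up.
def transition_path(series, frame_sec=10):
    x, y = 0.0, 0.0            # elapsed T-group time, elapsed S-group time
    path = [(x, y)]            # starts at the origin (start of discussion)
    for label in series:
        if label == "T":
            x += frame_sec
        else:
            y += frame_sec
        path.append((x, y))
    return path
```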
Using the generated transition graph, the generation unit 124 divides the discussion to be analyzed into a plurality of sections and determines, for each section, a section tendency indicating which of the first group and the second group is the main speaker. First, the generation unit 124 sets the origin (the start point of the transition graph) as the start point of a section. The generation unit 124 extracts, in chronological order, one predetermined period (for example, 5 seconds) of the transition graph as a unit of attention.
The generation unit 124 determines whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than a predetermined time. The predetermined time is a value preset as the minimum duration of a section, for example, 5 minutes or 10 minutes. The predetermined time may also be determined according to the reference section information selected by the selection unit 121. If the elapsed time from the start point of the section to the end point of the unit of attention is not equal to or longer than the predetermined time, the generation unit 124 extracts the next predetermined period as the unit of attention and repeats the determination of whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than the predetermined time.
If the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than the predetermined time, the generation unit 124 determines the section tendency of the section by comparing the utterance situations of the first group and the second group. To compare the utterance situations, the generation unit 124 calculates, for example, the slope between the coordinates of the start point of the section and the coordinates of the end point of the unit of attention on the transition graph (the broken line in FIG. 5).
The generation unit 124 determines the section tendency based on the calculated slope. For example, the generation unit 124 determines that the utterances of the first group are dominant when the slope is equal to or less than a first reference value, determines that the utterances of the first group and the second group are antagonistic when the slope is greater than the first reference value and equal to or less than a second reference value, and determines that the utterances of the second group are dominant when the slope is greater than the second reference value. The first reference value and the second reference value are stored in advance in the storage unit 11 or set at the information terminal 3. The generation unit 124 sets the determination result as the section tendency of the section.
If the section tendency of the immediately preceding section and the section tendency of the current section are the same, the generation unit 124 joins the unit of attention to the immediately preceding section, extracts the next predetermined period as the unit of attention, and repeats the determination of whether the elapsed time from the start point of the section to the end point of the unit of attention is equal to or longer than the predetermined time.
If the section tendency of the immediately preceding section and the section tendency of the current section are different, the generation unit 124 finalizes the immediately preceding section and its section tendency. The generation unit 124 then sets the start point of the unit of attention as the start point of a new section and repeats the determination of sections and section tendencies by the above processing until the end point of the transition graph.
The generation unit 124 is not limited to the specific method shown here, and may use other methods to determine, based on the time-series information, the plurality of sections constituting the discussion and, for each section, a section tendency indicating which of the first group and the second group is the main speaker or that the utterances of the first group and the second group are antagonistic.
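As one possible concrete form of the processing described above, the following simplified sketch determines sections and their tendencies from the transition path; the reference values k1 and k2, the minimum section length, and the merge rule are assumed example choices, not the patent's exact procedure.

```python
# Simplified sketch of the section/trend determination.
import math

def classify_slope(slope, k1=0.5, k2=2.0):
    if slope <= k1:
        return "T"       # first-group utterances dominant
    if slope <= k2:
        return "even"    # the two groups' utterances are antagonistic
    return "S"           # second-group utterances dominant

def split_into_sections(path, frame_sec=10, min_len=300):
    """path: output of transition_path(). Returns a list of
    (start_idx, end_idx, trend) tuples over the path indices."""
    sections, start, prev_trend = [], 0, None
    for end in range(1, len(path)):
        if (end - start) * frame_sec < min_len:
            continue                       # section not yet long enough
        dt = path[end][0] - path[start][0]
        ds = path[end][1] - path[start][1]
        trend = classify_slope(ds / dt if dt > 0 else math.inf)
        if prev_trend is None or trend == prev_trend:
            prev_trend = trend             # extend the current section
        else:                              # trend changed: close the section
            sections.append((start, end - 1, prev_trend))
            start, prev_trend = end - 1, None
    if prev_trend is not None:
        sections.append((start, len(path) - 1, prev_trend))
    return sections
```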
The generation unit 124 generates section information that associates, in all or part of the discussion, each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group is the main speaker in that section, and causes the storage unit 11 to store the section information. In the section information, the section tendency in part of the discussion may indicate that the utterances of the first group and the second group are antagonistic (that is, that neither the utterances of the first group nor those of the second group are dominant).
In this way, the generation unit 124 divides the discussion to be analyzed into a plurality of sections and determines, for each section, a section tendency indicating which of the first group and the second group is the main speaker. Because the time-series information represents the relative amounts of speech of the two groups in fine detail for each time, it is difficult for the analyst to analyze the utterance tendencies of the two groups by looking at the time-series information as it is. In contrast, by treating each period of the discussion in which the same utterance tendency continues as one section, the generation unit 124 makes it easy for the analyst to grasp the transition of utterance tendencies over the entire discussion and facilitates analysis of the discussion between the two groups.
The output unit 125 outputs the section information generated by the generation unit 124 during the discussion, after the discussion ends, or both. Hereinafter, outputting the section information during the discussion is referred to as real-time output, and outputting the section information after the discussion ends is referred to as post-hoc output.
FIGS. 6(a) and 6(b) are schematic diagrams for explaining how the output unit 125 outputs the section information in real time. The output unit 125 performs control for displaying information corresponding to the section information generated by the generation unit 124 on the display unit of the information terminal 3 as shown in FIG. 6(a), or for outputting it from the sound output unit of the sound collector 2 as shown in FIG. 6(b).
For example, the output unit 125 transmits information corresponding to the time-series information acquired by the acquisition unit 123 and the section information generated by the generation unit 124 to the information terminal 3. In the example of FIG. 6(a), the output unit 125 causes the information terminal 3 to display a transition graph 33 corresponding to the time-series information acquired by the acquisition unit 123 and showing the transition of which of the first group and the second group has the larger amount of speech. The output unit 125 may display the section tendencies on the transition graph 33 by drawing each section with a line whose color corresponds to its section tendency.
The output unit 125 also causes the information terminal 3 to display a bar graph 34 corresponding to the section information generated by the generation unit 124 and showing the length of each section and its section tendency. This makes it easy for the analyst to grasp the utterance tendencies of the two groups during the discussion and facilitates analysis of the utterance tendencies in the discussion between the two groups.
The output unit 125 also causes the information terminal 3 to display a transition graph 33 and a bar graph 34 corresponding to the reference section information selected by the selection unit 121. This allows the speech analysis device 1 to make it easy to compare the section information of the discussion to be analyzed with the reference section information specified by the analyst.
The output unit 125 also transmits, for example, information corresponding to the difference between the section information generated by the generation unit 124 and the reference section information selected by the selection unit 121 to the information terminal 3 or the sound collector 2 as a comparison result. As the difference between the section information and the reference section information, the output unit 125 outputs whether differences in the lengths, number, order, and the like of the sections satisfy predetermined conditions.
In the example of FIG. 6(a), the output unit 125 causes the information terminal 3 to display a message 35 indicating that the length of a first-group (T group) section in the section information is longer than the length of the corresponding first-group section in the reference section information. In the example of FIG. 6(b), the output unit 125 causes the sound collector 2 to output a voice indicating that the length of a second-group (S group) section in the section information is longer than the length of the corresponding second-group section in the reference section information. In this way, the speech analysis device 1 sequentially notifies the analyst of the differences between the section information of the discussion to be analyzed and the reference section information specified by the analyst, so that they can easily be reflected in the ongoing discussion.
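A small sketch of such a real-time difference check follows; the tolerance and the message text are assumptions for illustration.

```python
# Sketch of the real-time difference check: the length of the ongoing
# section is compared with the corresponding reference section and a
# notice is produced when it runs long.
def realtime_notice(current_len_sec, reference_len_sec, trend, tolerance=60):
    if current_len_sec > reference_len_sec + tolerance:
        return f"The current {trend} section is longer than in the model discussion."
    return None  # no notification needed yet
```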
FIG. 7 is a schematic diagram for explaining how the output unit 125 performs post-hoc output of the section information. The output unit 125 performs control for displaying information corresponding to the section information generated by the generation unit 124 on the display unit of the information terminal 3 as shown in FIG. 7.
For example, the output unit 125 transmits information corresponding to the time-series information acquired by the acquisition unit 123 and the section information generated by the generation unit 124 to the information terminal 3. In the example of FIG. 7, the output unit 125 causes the information terminal 3 to display a transition graph 36 corresponding to the time-series information acquired by the acquisition unit 123 and showing the transition of which of the first group and the second group has the larger amount of speech. The output unit 125 may display the section tendencies on the transition graph 36 by drawing each section with a line whose color corresponds to its section tendency.
The output unit 125 also causes the information terminal 3 to display a bar graph 37 corresponding to the section information generated by the generation unit 124 and showing the length of each section and its section tendency. This makes it easy for the analyst to grasp the utterance tendencies of the two groups over the entire discussion and facilitates analysis of the utterance tendencies in the discussion between the two groups.
The output unit 125 also causes the information terminal 3 to display a transition graph 36 and a bar graph 37 corresponding to the reference section information selected by the selection unit 121. This allows the speech analysis device 1 to make it easy to compare the section information of the discussion to be analyzed with the reference section information specified by the analyst.
The output unit 125 also transmits to the information terminal 3, for example, information that associates the section tendency of each of the plurality of sections indicated by the section information generated by the generation unit 124 with the section tendency of each of the plurality of sections indicated by the reference section information selected by the selection unit 121.
In the example of FIG. 7, the output unit 125 causes the information terminal 3 to display transitions 38 of the section tendencies of the plurality of sections for each of the section information generated by the generation unit 124 and the reference section information selected by the selection unit 121. The output unit 125 detects, for example by dynamic programming, the correspondence between the sections indicated by the section information generated by the generation unit 124 and the sections indicated by the reference section information selected by the selection unit 121, and displays increases and decreases in sections, differences in their order, and the like as the transitions 38 of the section tendencies. This allows the speech analysis device 1 to make it easy to understand the differences between the transition of section tendencies in the section information of the discussion to be analyzed and the transition of section tendencies in the reference section information specified by the analyst.
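The patent does not specify the dynamic programming formulation; as an illustrative stand-in, the standard-library sequence matcher below (itself a dynamic-programming-style algorithm) surfaces matching, extra, missing, and differing sections between the two trend sequences.

```python
# Hedged sketch of aligning the analyzed trend sequence with the reference.
import difflib

def align_trends(analyzed, reference):
    """analyzed/reference: trend sequences such as ["T", "even", "S"]."""
    matcher = difflib.SequenceMatcher(a=reference, b=analyzed, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            print(f"matching sections: reference[{i1}:{i2}] <-> analyzed[{j1}:{j2}]")
        elif op == "insert":
            print(f"extra sections in the analyzed discussion: {analyzed[j1:j2]}")
        elif op == "delete":
            print(f"sections missing vs the reference: {reference[i1:i2]}")
        else:  # "replace"
            print(f"differing trends: {reference[i1:i2]} -> {analyzed[j1:j2]}")
```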
The output unit 125 may also output the words uttered in each of the plurality of sections determined by the generation unit 124, in association with that section. In this case, the output unit 125 extracts the words included in the utterances of each section by, for example, applying known speech recognition processing to the speech of that section. The output unit 125 then causes the information terminal 3 to display, for example, each of the plurality of sections in association with some or all of the words extracted for that section. This makes it easier for the analyst to grasp the content of each of the plurality of sections.
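As a rough sketch of this word extraction, assuming a speech-to-text callable `transcribe(audio)` is available (the patent only refers to "known speech recognition processing", so this interface is hypothetical), the most frequent words of each section could be picked as follows:

```python
from collections import Counter

def keywords_per_section(sections, transcribe, top_n=5):
    """Return the most frequent words of each (start, end, audio) section.

    transcribe: hypothetical speech-to-text callable returning a string.
    """
    result = []
    for start, end, audio in sections:
        words = transcribe(audio).split()
        common = [w for w, _ in Counter(words).most_common(top_n)]
        result.append({"start": start, "end": end, "words": common})
    return result
```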
The output unit 125 may also output a characteristic of the entire discussion based on the section information generated by the generation unit 124. In this case, the output unit 125 determines the characteristic of the entire discussion based on the section tendencies of the plurality of sections indicated by that section information. For example, the output unit 125 determines the characteristic of the entire discussion based on the proportions, among the plurality of sections constituting the discussion, of sections in which the first group's utterances are dominant, sections in which the second group's utterances are dominant, and sections in which the utterances of the first group and the second group are antagonistic.
For example, the output unit 125 determines that the discussion is lecture-centered when the proportion of sections in which the T group (the instructor group) mainly speaks is at or above a predetermined value, and determines that the discussion is exercise-centered when the proportion of sections in which the S group (the learner group) mainly speaks is at or above a predetermined value. The output unit 125 causes the information terminal 3 to display a message (report) describing the determined characteristic of the entire discussion. This makes it easier for the analyst to grasp the overall tendency of the discussion as judged from the section tendencies of the plurality of sections.
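A minimal sketch of this judgment, assuming sections arrive as (duration, tendency) pairs and taking 60% as the predetermined value (the labels and the threshold are assumptions; the patent leaves both unspecified):

```python
def characterize_discussion(sections, threshold=0.6):
    """Classify a discussion from the durations and tendencies of its sections.

    sections: list of (duration_seconds, tendency); tendency is "T"
    (instructor group dominant), "S" (learner group dominant), or "EVEN".
    """
    total = sum(d for d, _ in sections)
    if total == 0:
        return "mixed"
    share = lambda label: sum(d for d, t in sections if t == label) / total
    if share("T") >= threshold:
        return "lecture-centered"
    if share("S") >= threshold:
        return "exercise-centered"
    return "mixed"

# S sections cover 2/3 of the time, so this prints "exercise-centered".
print(characterize_discussion([(300, "T"), (1200, "S"), (300, "EVEN")]))
```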
[Flowchart of the speech analysis method]
FIG. 8 is a flowchart of an exemplary speech analysis method executed by the speech analysis device 1 according to the present embodiment. The selection unit 121 selects the reference section information to be compared with the section information (S11). The classification unit 122 classifies the plurality of participants in the discussion under analysis into the first group or the second group (S12). For example, the classification unit 122 receives, on the information terminal 3, a setting of whether each participant belongs to the first group or the second group, or automatically classifies the participants into the first group or the second group based on their attributes.
The subsequent processing is performed either sequentially during the discussion or after the discussion ends. The acquisition unit 123 acquires, from the sound collector 2, the speech uttered by the plurality of participants in the discussion. Based on the speech acquired from the sound collector 2, the acquisition unit 123 identifies the utterance period of each of the plurality of participants (S13). Based on the identified utterance periods, the acquisition unit 123 acquires time-series information indicating the utterance status of each of the first group and the second group for each time (S14).
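One way to picture steps S13 and S14: given per-participant utterance intervals and a group assignment, the per-frame speech amounts of the two groups can be accumulated as below (the one-second frame and the data layout are assumptions; the patent does not fix them):

```python
def build_time_series(utterances, group_of, duration, frame=1.0):
    """Accumulate per-frame speech amounts for the two groups.

    utterances: list of (participant, start_sec, end_sec);
    group_of: dict mapping participant -> 1 or 2.
    Returns one (group1_seconds, group2_seconds) pair per frame.
    """
    n_frames = int(duration / frame)
    series = [[0.0, 0.0] for _ in range(n_frames)]
    for participant, start, end in utterances:
        g = group_of[participant] - 1
        first, last = int(start / frame), min(int(end / frame), n_frames - 1)
        for k in range(first, last + 1):
            f_start, f_end = k * frame, (k + 1) * frame
            # Add the overlap of this utterance with frame k.
            series[k][g] += max(0.0, min(end, f_end) - max(start, f_start))
    return [tuple(s) for s in series]
```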
Based on the time-series information acquired by the acquisition unit 123, the generation unit 124 performs section information generation processing to generate section information that associates each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly speaks in that section (S2). The section information generation processing of step S2 will be described later with reference to FIG. 9.
The output unit 125 outputs the section information generated by the generation unit 124 to at least one of the sound collector 2 and the information terminal 3 (S15).
FIG. 9 is a flowchart of the section information generation processing in the exemplary speech analysis method executed by the speech analysis device 1 according to the present embodiment. Based on the time-series information acquired by the acquisition unit 123, the generation unit 124 generates a transition graph showing transitions of whether the first group or the second group has the larger amount of speech (S21). The generation unit 124 sets the origin (the starting point of the transition graph) as the starting point of a section (S22). The generation unit 124 extracts from the transition graph, in chronological order, one predetermined period (for example, five seconds) as a unit of interest (S23).
The generation unit 124 determines whether the elapsed time from the starting point of the section to the end point of the unit of interest is at or above a predetermined time (S24). If it is not (NO in S25), the generation unit 124 returns to step S23 and repeats the processing for the next unit of interest.
If the elapsed time from the starting point of the section to the end point of the unit of interest is at or above the predetermined time (YES in S25), the generation unit 124 calculates the slope between the coordinates of the starting point of the section and the coordinates of the end point of the unit of interest on the transition graph (S26).
The generation unit 124 determines the section tendency based on the calculated slope (S27). For example, the generation unit 124 determines that the first group's utterances are dominant when the slope is at or below a first reference value, that the utterances of the first group and the second group are antagonistic when the slope is greater than the first reference value and at or below a second reference value, and that the second group's utterances are dominant when the slope is greater than the second reference value. The generation unit 124 adopts this determination result as the section tendency of the section.
If the section tendency of the immediately preceding section and the section tendency of the current section are the same (YES in S28), the generation unit 124 merges the unit of interest into the immediately preceding section (S29), then returns to step S23 and repeats the processing for the next unit of interest.
If the section tendency of the immediately preceding section and the section tendency of the current section differ (NO in S28), the generation unit 124 finalizes the immediately preceding section and its section tendency (S30). If the time-series information has not ended (NO in S31), the generation unit 124 sets the starting point of the unit of interest as the starting point of a new section (S32), returns to step S23, and repeats the processing for the next unit of interest.
When the time-series information has ended (YES in S31), the generation unit 124 generates section information that associates, over all or part of the discussion, each of the plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly speaks in that section, and stores it in the storage unit 11. In the section information, the section tendency in part of the discussion may indicate that the utterances of the first group and the second group are antagonistic.
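Pulling steps S21 through S32 together, the following sketch implements one plausible version of the loop. It assumes the transition graph is a cumulative curve that rises with the second group's share of speech in each second, a 5-second unit of interest, a 60-second minimum section length, and reference values of 0.4 and 0.6; all of these concrete numbers, and the exact bookkeeping at the section boundaries, are assumptions rather than values fixed by the patent.

```python
def generate_sections(series, unit=5, min_len=60, ref1=0.4, ref2=0.6):
    """Segment a discussion into sections labeled by section tendency.

    series: per-second (group1_amount, group2_amount) pairs.
    Returns a list of (start_sec, end_sec, tendency) triples.
    """
    # S21: cumulative transition curve that rises with group 2's speech share.
    y = [0.0]
    for g1, g2 in series:
        total = g1 + g2
        y.append(y[-1] + (g2 / total if total else 0.5))

    def tendency(a, b):                        # S26-S27: slope -> label
        slope = (y[b] - y[a]) / (b - a)
        if slope <= ref1:
            return "G1"                        # first group dominant
        if slope <= ref2:
            return "EVEN"                      # utterances antagonistic
        return "G2"                            # second group dominant

    sections, sec_start = [], 0                # S22: section starts at origin
    prev_label, prev_end, t = None, None, 0
    while t + unit <= len(series):
        t += unit                              # S23: next unit of interest
        if t - sec_start < min_len:            # S24/S25: section still too short
            continue
        label = tendency(sec_start, t)
        if prev_label is None or label == prev_label:
            prev_label, prev_end = label, t    # S29: merge the unit
        else:
            sections.append((sec_start, prev_end, prev_label))  # S30: finalize
            sec_start = t - unit               # S32: new section at unit start
            prev_label, prev_end = None, None
    if prev_label is not None:                 # S31: time series ended
        sections.append((sec_start, prev_end, prev_label))
    elif sec_start < len(series):
        sections.append((sec_start, len(series), tendency(sec_start, len(series))))
    return sections
```

A design note on this sketch: because the slope is always measured from the section's starting point, a long section reacts slowly to a change in speaker balance; the minimum section length trades responsiveness for stability, which matches the flowchart's requirement that every section last at least the predetermined time.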
[Effects of the present embodiment]
According to the speech analysis system S of the present embodiment, the speech analysis device 1 determines, based on the speech of the discussion, a section tendency indicating which of the two groups mainly speaks in each section of the discussion, and notifies the analyst of the sections and their section tendencies in association with each other. In this way, the speech analysis system S makes it easy for the analyst to grasp the utterance tendencies of the two groups, which facilitates analysis of the utterance tendencies in the discussion between the two groups.
[First modification]
The grouping of the plurality of participants may be changed in the middle of a discussion. In this modification, the speech analysis device 1 changes the participants belonging to the first group and the second group between a plurality of periods in the discussion, and generates section information based on the changed grouping.
The classification unit 122 changes the participants belonging to the first group and the second group between a plurality of periods in the discussion. For example, the classification unit 122 may receive in advance, on the information terminal 3, a setting of the structure of the discussion (a commentary period, an exercise period, and so on) and change the participants belonging to the first group and the second group at the timing when that structure changes. For example, the classification unit 122 places the teacher in the first group and the students in the second group during the commentary period, while during the exercise period it places some students in the first group and other students in the second group.
The classification unit 122 may also detect the positions of the plurality of participants by, for example, applying known image recognition processing to captured images acquired by a camera or the like, and change the participants belonging to the first group and the second group at the timing when those positions change. For example, the classification unit 122 places participants sitting in specific seats in the first group and participants sitting in other seats in the second group.
The classification unit 122 may also change the parent groups, each containing a first group and a second group, according to the grouping that differs from period to period. That is, for each of the plurality of periods in the discussion, the classification unit 122 generates a first parent group containing a first group and a second group into which some of the plurality of participants are classified, and a second parent group containing a first group and a second group into which some of the participants not belonging to the first parent group are classified. The classification unit 122 may generate three or more parent groups.
The classification unit 122 may also generate, for at least one period in the discussion, a parent group containing all of the plurality of participants (for example, the entire classroom). That is, the classification unit 122 may generate a first parent group containing a first group and a second group into which some of the plurality of participants are classified, and a second parent group containing a first group and a second group into which some of the participants not belonging to the first parent group are classified, and may further generate a third parent group containing a first group and a second group into which the participants belonging to the first parent group and the second parent group are classified.
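One way to represent such period-dependent, hierarchical groupings is a per-period mapping like the one below; the structure, identifiers, and times are assumptions for illustration, not a format defined in the patent.

```python
# Hypothetical grouping schedule for a 60-minute class (times in minutes).
grouping_schedule = [
    {   # 0-10 min: the whole classroom forms one parent group.
        "period": (0, 10),
        "parents": {
            "classroom": {"group1": ["teacher"], "group2": ["s1", "s2", "s3", "s4"]},
        },
    },
    {   # 10-50 min: two tables, each its own parent group.
        "period": (10, 50),
        "parents": {
            "table1": {"group1": ["s1"], "group2": ["s2"]},
            "table2": {"group1": ["s3"], "group2": ["s4"]},
        },
    },
    {   # 50-60 min: back to the whole classroom.
        "period": (50, 60),
        "parents": {
            "classroom": {"group1": ["teacher"], "group2": ["s1", "s2", "s3", "s4"]},
        },
    },
]
```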
The generation unit 124 generates section information for each of the plurality of parent groups, and the output unit 125 outputs the section information of each of the plurality of parent groups. FIG. 10 is a schematic diagram for explaining how the output unit 125 outputs the section information in this modification. The output unit 125 performs control to display the information corresponding to the section information of each of the plurality of parent groups generated by the generation unit 124 on the display unit of the information terminal 3, as shown in FIG. 10.
For example, the output unit 125 outputs the section information of the first parent group and the section information of the second parent group at the same time. In the example of FIG. 10, the output unit 125 displays the section information of three parent groups, including the first parent group and the second parent group, side by side for the period from 10 to 50 minutes. This gives the analyst an overview of the section information of the plurality of parent groups (a plurality of tables and the like).
The output unit 125 also outputs, for example, the section information of at least one of the first parent group and the second parent group together with the section information of the third parent group. In the example of FIG. 10, the output unit 125 displays the section information of the three parent groups, including the first parent group and the second parent group, side by side for the period from 10 to 50 minutes, and further displays the section information of the third parent group, which corresponds to all of the participants, for the periods from 0 to 10 minutes and from 50 to 60 minutes. In this way, when the grouping of the plurality of participants is changed in the middle of the discussion, the speech analysis device 1 can provide the analyst with section information that follows the grouping of each period. The speech analysis device 1 also makes it easy to analyze, hierarchically, a parent group corresponding to all of the participants (the entire classroom and the like) and a plurality of parent groups corresponding to subdivisions of the participants (a plurality of tables and the like).
[Second modification]
In the middle of a discussion, a specific participant such as an instructor may move between a plurality of tables and join different groups. In this modification, the speech analysis device 1 changes the grouping based on the position of the specific participant and generates section information based on the changed grouping.
FIG. 11 is a schematic diagram for explaining how the generation unit 124 generates the section information in this modification. The classification unit 122 estimates the position, during the discussion, of the instructor, who is a specific participant. The classification unit 122 may estimate which sound collector 2 the instructor is near by, for example, comparing the speech that the acquisition unit 123 acquired from the plurality of sound collectors 2 with pre-registered features of the instructor's voice (a voiceprint or the like). The classification unit 122 may also estimate which sound collector 2 the instructor is near based on the strength of short-range wireless communication between a communication device (a smartphone or the like) carried by the instructor and the plurality of sound collectors 2.
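As a rough sketch of the signal-strength variant, assuming each sound collector reports one RSSI reading for the instructor's device (this measurement interface is an assumption; the patent does not specify one):

```python
def nearest_collector(rssi_by_collector):
    """Pick the sound collector with the strongest signal (closest to 0 dBm).

    rssi_by_collector: dict mapping collector id -> RSSI in dBm (negative).
    """
    return max(rssi_by_collector, key=rssi_by_collector.get)

# The instructor's phone is heard loudest by table 2's collector.
print(nearest_collector({"table1": -72, "table2": -48, "table3": -80}))
```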
The classification unit 122 is not limited to the specific methods described here, and may estimate the position of the instructor during the discussion using other methods. The example of FIG. 11 shows the instructor moving to table 1, table 2, and table 3 in that order.
Based on the estimated position of the instructor, the classification unit 122 changes, between a plurality of periods in the discussion, the participants belonging to the first group, to which the instructor belongs, and the participants belonging to the second group. In the example of FIG. 11, while the instructor is at table 1, the classification unit 122 generates a first group containing the instructor and a second group containing the students at table 1; while the instructor is at table 2, it generates a first group containing the instructor and a second group containing the students at table 2; and while the instructor is at table 3, it generates a first group containing the instructor and a second group containing the students at table 3.
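Continuing the sketch, the estimated position timeline can be turned into per-period groupings as below; the table and student identifiers are hypothetical.

```python
def groups_from_positions(position_timeline, students_at_table, instructor="teacher"):
    """Turn the instructor's (start, end, table) timeline into per-period groups.

    students_at_table: dict mapping table id -> list of student ids.
    """
    return [
        {"period": (start, end),
         "group1": [instructor],               # the group the instructor joins
         "group2": students_at_table[table]}   # the students at that table
        for start, end, table in position_timeline
    ]

print(groups_from_positions(
    [(0, 20, "table1"), (20, 40, "table2"), (40, 60, "table3")],
    {"table1": ["s1", "s2"], "table2": ["s3", "s4"], "table3": ["s5", "s6"]},
))
```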
The generation unit 124 generates section information for each of the plurality of periods whose grouping was changed, and combines the generated section information. In this way, when a specific participant such as an instructor moves during the discussion, the speech analysis device 1 generates section information that follows the grouping changed according to the position of that specific participant, making it easy to analyze the utterance tendencies centered on that participant.
Although the present invention has been described above using embodiments, the technical scope of the present invention is not limited to the scope described in the above embodiments, and various modifications and changes are possible within the scope of its gist. For example, all or part of the device can be functionally or physically distributed or integrated in arbitrary units. New embodiments arising from arbitrary combinations of a plurality of embodiments are also included in the embodiments of the present invention. The effect of a new embodiment arising from such a combination has the effects of the original embodiments together.
The processor of the speech analysis device 1 is the agent of each step (process) included in the speech analysis methods shown in FIGS. 8 and 9. That is, the processor of the speech analysis device 1 reads a program for executing the speech analysis methods shown in FIGS. 8 and 9 from the storage unit 11 and executes the program to control each unit of the speech analysis device 1, thereby executing the speech analysis methods shown in FIGS. 8 and 9. Some of the steps included in the speech analysis methods shown in FIGS. 8 and 9 may be omitted, the order of the steps may be changed, and a plurality of steps may be performed in parallel.
S Speech analysis system
1 Speech analysis device
11 Storage unit
12 Control unit
121 Selection unit
122 Classification unit
123 Acquisition unit
124 Generation unit
125 Output unit
2 Sound collector
3 Information terminal

Claims (15)

1. A speech analysis device comprising:
an acquisition unit that acquires time-series information indicating, for each time, the utterance status of each of a first group and a second group in speech uttered in a discussion by participants belonging to the first group and participants belonging to the second group;
a generation unit that generates, for all or part of the discussion, section information that associates each of a plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly speaks in that section; and
an output unit that outputs the section information.

2. The speech analysis device according to claim 1, wherein the section tendency indicates which of the first group and the second group mainly speaks, or that the utterances of the first group and the second group are antagonistic.

3. The speech analysis device according to claim 1 or 2, wherein the generation unit determines each of the plurality of sections so as to last at least a predetermined time, and determines the section tendency of each section by comparing the utterance statuses of the first group and the second group in that section.

4. The speech analysis device according to any one of claims 1 to 3, wherein the time-series information is information indicating, for each predetermined time frame in the period from the start to the end of the discussion, which of the first group and the second group has the larger amount of speech.

5. The speech analysis device according to any one of claims 1 to 4, further comprising a classification unit that classifies a plurality of participants into the first group and the second group.

6. The speech analysis device according to claim 5, wherein the classification unit changes the participants belonging to the first group and the second group between a plurality of periods in the discussion.

7. The speech analysis device according to claim 5 or 6, wherein the classification unit generates a first parent group containing the first group and the second group into which some of the plurality of participants are classified, and a second parent group containing the first group and the second group into which some of the plurality of participants not belonging to the first parent group are classified, and
the output unit simultaneously outputs the section information of the first parent group and the section information of the second parent group.

8. The speech analysis device according to claim 5 or 6, wherein the classification unit generates a first parent group containing the first group and the second group into which some of the plurality of participants are classified, and a second parent group containing the first group and the second group into which some of the plurality of participants not belonging to the first parent group are classified, and further generates a third parent group containing the first group and the second group into which the participants belonging to the first parent group and the second parent group are classified, and
the output unit outputs the section information of at least one of the first parent group and the second parent group, and the section information of the third parent group.

9. The speech analysis device according to any one of claims 5 to 8, wherein the classification unit changes, based on the position of a specific participant, the participants belonging to the first group, to which the specific participant belongs, and the participants belonging to the second group.

10. The speech analysis device according to any one of claims 1 to 9, wherein the output unit outputs the words included in the utterances of each of the plurality of sections, extracted by performing speech recognition processing on the speech, in association with that section.

11. The speech analysis device according to any one of claims 1 to 10, wherein the output unit outputs a characteristic of the entire discussion based on the section tendencies of the plurality of sections constituting the discussion.

12. The speech analysis device according to any one of claims 1 to 11, further comprising a selection unit that selects reference section information to be compared with the section information,
wherein the output unit outputs a result of comparing the section information with the reference section information.

13. The speech analysis device according to claim 12, wherein the output unit outputs, during the discussion, information corresponding to the difference between the section information and the reference section information as the comparison result.

14. The speech analysis device according to claim 12 or 13, wherein the output unit outputs, after the discussion, the section tendency of each of the plurality of sections indicated by the section information in association with the section tendency of each of the plurality of sections indicated by the reference section information.

15. A speech analysis method executed by a processor, comprising:
a step of acquiring time-series information indicating, for each time, the utterance status of each of a first group and a second group in speech uttered in a discussion by participants belonging to the first group and participants belonging to the second group;
a step of generating, for all or part of the discussion, section information that associates each of a plurality of sections constituting the discussion with a section tendency indicating which of the first group and the second group mainly speaks in that section; and
a step of outputting the section information.
PCT/JP2021/040443 2021-11-02 2021-11-02 Voice analysis device and voice analysis method WO2023079602A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/040443 WO2023079602A1 (en) 2021-11-02 2021-11-02 Voice analysis device and voice analysis method


Publications (1)

Publication Number Publication Date
WO2023079602A1 true WO2023079602A1 (en) 2023-05-11

Family

ID=86240796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/040443 WO2023079602A1 (en) 2021-11-02 2021-11-02 Voice analysis device and voice analysis method

Country Status (1)

Country Link
WO (1) WO2023079602A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015207806A (en) * 2014-04-17 2015-11-19 コニカミノルタ株式会社 Remote conference support system and remote conference support program
JP2018073096A (en) * 2016-10-28 2018-05-10 シャープ株式会社 Information display device


Similar Documents

Publication Publication Date Title
CN106847263B (en) Speech level evaluation method, device and system
CN110991381A (en) Real-time classroom student state analysis and indication reminding system and method based on behavior and voice intelligent recognition
CN110910691B (en) Personalized course generation method and system
JP2018205638A (en) Concentration ratio evaluation mechanism
Ahrens et al. Listening and conversational quality of spatial audio conferencing
Dias et al. Visual influences on interactive speech alignment
Raveh et al. Three's a Crowd? Effects of a Second Human on Vocal Accommodation with a Voice Assistant.
US10580434B2 (en) Information presentation apparatus, information presentation method, and non-transitory computer readable medium
CN110930781A (en) Recording and broadcasting system
JP2020148931A (en) Discussion analysis device and discussion analysis method
Niebuhr et al. Virtual reality as a digital learning tool in entrepreneurship: How virtual environments help entrepreneurs give more charismatic investor pitches
CN111479124A (en) Real-time playing method and device
US11132913B1 (en) Computer-implemented systems and methods for acquiring and assessing physical-world data indicative of avatar interactions
WO2023079602A1 (en) Voice analysis device and voice analysis method
US20230093298A1 (en) Voice conference apparatus, voice conference system and voice conference method
JP6589042B1 (en) Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system
Niebuhr et al. PASCAL and DPA: A pilot study on using prosodic competence scores to predict communicative skills for team working and public speaking
DE602004004824T2 (en) Automatic treatment of conversation groups
JP2020173415A (en) Teaching material presentation system and teaching material presentation method
CN111698452A (en) Online group state feedback method, system and device
JP6589040B1 (en) Speech analysis apparatus, speech analysis method, speech analysis program, and speech analysis system
Urbain et al. AVLaughterCycle: An audiovisual laughing machine
Liu et al. Design of Voice Style Detection of Lecture Archives
Van Helvert et al. Observing, coaching and reflecting: A multi-modal natural language-based dialogue system in a learning context
Hahm Convergence research on the speaker's voice perceived by listener, and suggestions for future research application

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21963200

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023557870

Country of ref document: JP