WO2023243009A1

WO2023243009A1 - Information presenting device, information presenting method, and program

Info

Publication number: WO2023243009A1
Application number: PCT/JP2022/023998
Authority: WO
Inventors: 充裕後藤; 聡一郎内田
Original assignee: 日本電信電話株式会社
Priority date: 2022-06-15
Filing date: 2022-06-15
Publication date: 2023-12-21

Abstract

An information presenting device according to an embodiment of the present invention comprises a user information acquiring unit, a determination unit, and an output unit. The user information acquiring unit acquires uttered speech data of a first participant and uttered speech data of a second participant in an environment of bidirectional telecommunication by at least the first participant and the second participant, each of whom is wearing an acoustic device. The determination unit determines the state of communication between the first participant and the second participant on the basis of the uttered speech data of the first participant and the uttered speech data of the second participant. The output unit causes auditory information that corresponds to the determined state to be output to the acoustic device of the first participant.

Description

Information presentation device, information presentation method, and program

One aspect of the present invention relates to, for example, an information presentation device, an information presentation method, and a program in an online communication environment using a network.

When people communicate with each other, they unconsciously gauge the physical or psychological distance between themselves and others. This sense of distance is sometimes referred to simply as "distance," and people with good communication skills are good at gauging it.
Due to recent social conditions, voice communication in online environments such as web conferencing is becoming mainstream. In such environments, the transmission of nonverbal cues is more restricted than in direct interaction, and communication tends to take place in a uniform information presentation environment, which makes communication difficult. In a face-to-face meeting, it is possible to grasp the other person's condition and adjust the distance accordingly, but in an online environment it is difficult to control the sense of distance and to maintain a distance that feels comfortable for both oneself and the other person.

Incidentally, Non-Patent Document 1 reports the results of a psychological experiment in which the space near the body expands forward when there is a sensation of forward movement (expansion of the space near the body). Furthermore, Non-Patent Document 2 reports that presenting sounds from the front can induce the sensation of moving forward, and presenting sounds from the rear can induce the sensation of moving backward (self-motion sensation caused by sound).

It is known that by devising the way information is provided, it is possible to expand the perceived space near the body and induce a sense of self-motion. By exploiting this phenomenon, it is possible to make online communication smoother.
This invention was made in view of the above circumstances, and its purpose is to provide a technology that can encourage comfortable communication even in a remote environment.

An information presentation device according to one aspect of the present invention includes a user information acquisition unit, a determination unit, and an output unit. The user information acquisition unit acquires utterance audio data of a first participant and utterance audio data of a second participant in a two-way telecommunication environment between at least the first participant and the second participant, each wearing an acoustic device. The determination unit determines the state of communication between the first participant and the second participant based on the utterance audio data of the first participant and the utterance audio data of the second participant. The output unit causes the first participant's acoustic device to output auditory information according to the determined state.

According to one aspect of the present invention, it is possible to provide a technology that can encourage comfortable communication even in a remote environment.

FIG. 1 is a diagram for explaining the elemental technology of a web conference system according to an embodiment.
FIG. 2 is a functional block diagram showing an example of the information presentation device 1 shown in FIG. 1.
FIG. 3 is a diagram for explaining the status data 12a shown in FIG. 2.
FIG. 4 is a diagram for explaining the threshold value 12b shown in FIG. 2.
FIG. 5 is a diagram for explaining the presentation content 12c shown in FIG. 2.
FIG. 6 is a flowchart showing an example of the processing procedure of the information presentation device 1 having the above configuration.
FIG. 7 is a flowchart showing an example of the processing procedure in step S10 of FIG. 6.
FIG. 8 is a flowchart showing an example of the processing procedure in step S11 of FIG. 6.
FIG. 9 is a diagram illustrating an example of the association between determined communication states and presentation contents.
FIG. 10 is a diagram for explaining that psychological distance can be controlled by auditory information.
FIG. 11 is a flowchart showing an example of the processing procedure in step S13 of FIG. 6.
FIG. 12 is a diagram for explaining a series of processing procedures in the information presentation device 1 of the embodiment.
FIG. 13 is a diagram for explaining the effects obtained by the embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the embodiment, a technique for creating a space in which participants (users) can easily interact with each other in two-way telecommunications (online communication) using a network will be described.

A web conference is established by the participation of at least two users (referred to as a first user and a second user), so for the sake of simplicity, the following description assumes only the first user and the second user. Of course, the same description applies to a web conference in which three or more users participate.

FIG. 1 is a diagram for explaining the elemental technology of the web conference system according to the embodiment. In FIG. 1, conference equipment 2 of participants in a web conference communicates with an information presentation device 1 via a network 100, which is the so-called Internet, for example, via a VPN (Virtual Private Network). The information presentation device 1 acquires the user's utterance audio data acquired by the microphone 30 via the network 100, and determines the state of communication between the participants of the web conference. The information presentation device 1 transmits auditory information according to the determination result to the audio device worn by the participant, and causes the audio device to output it. The participant may wear only regular earphones 41 as an acoustic device, or may wear both earphones 41 and bone conduction earphones 42. Here, the earphone 41 is an example of a first device that reproduces conversational audio, and the bone conduction earphone 42 is an example of a second device that reproduces acoustic information different from the conversational audio.

<Configuration>
FIG. 2 is a functional block diagram showing an example of the information presentation device 1 shown in FIG. 1. Conference equipment 2 of a plurality of participants is connected to the network 100, and each piece of conference equipment communicates with the information presentation device 1 using a common protocol.

The information presentation device 1 is a computer that includes an interface section 13, a processor 11, a storage 12, and a memory 14. The interface section 13 sets up a communication link between the network 100 and each conference facility 2, and exchanges various data.

The processor 11 is an arithmetic device such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), and implements the processing functions of the embodiment according to a program 14a loaded from the storage 12 to the memory 14. The memory 14 is a semiconductor memory such as ROM (Read Only Memory) or RAM (Random Access Memory).

The storage 12 is a nonvolatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores basic software such as an OS (Operating System) and programs for realizing the processing according to the embodiment. That is, the program can be installed on the information presentation device 1.
The storage 12 also stores state data 12a, threshold values 12b, and presentation contents 12c.

The processor 11 includes a user information acquisition section 111, a state data calculation section 112, a state determination section 113, a presentation content acquisition section 114, and an output section 115 as processing functions according to an embodiment of the present invention. The user information acquisition section 111, the state data calculation section 112, the state determination section 113, the presentation content acquisition section 114, and the output section 115 are realized by the processor 11 executing the program 14a loaded into the memory 14.

In other words, the program 14a includes instructions for causing the processor 11 to function as the user information acquisition section 111, instructions for causing the processor 11 to function as the state data calculation section 112, instructions for causing the processor 11 to function as the state determination section 113, instructions for causing the processor 11 to function as the presentation content acquisition section 114, and instructions for causing the processor 11 to function as the output section 115.

The user information acquisition unit 111 acquires utterance audio data of a first user participating in the web conference and utterance audio data of a second user via the network 100.
The state data calculation unit 112 calculates state data that reflects the state of telecommunications based on the acquired uttered audio data.

FIG. 3 is a diagram for explaining the status data 12a. The state data 12a includes, for example, the time ratio of silent periods in a conversation, the time ratio of one's (the first user) utterance period, and the time ratio of the other party's (second user) utterance period. The status data 12a is stored in the storage 12 in association with the numerical value of each entry, for example, for each record ID (IDentification) corresponding to the time. The record ID may increase over time, but when the number of entries reaches a predetermined value, the oldest entries may be deleted and new rows may be added.
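The bounded table described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `StateRecord` and `StateDataStore` and the use of an in-memory `deque` are assumptions introduced here, whereas the embodiment stores the status data 12a in the storage 12.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class StateRecord:
    record_id: int  # record ID, increasing over time
    r_s: float      # time ratio of silent periods
    r_us: float     # time ratio of own (first user) utterance periods
    r_ur: float     # time ratio of the other party's (second user) utterance periods


class StateDataStore:
    """Hypothetical bounded table: the oldest row is dropped once the limit is reached."""

    def __init__(self, max_entries: int):
        self._rows = deque(maxlen=max_entries)
        self._ids = count(1)

    def append(self, r_s: float, r_us: float, r_ur: float) -> StateRecord:
        record = StateRecord(next(self._ids), r_s, r_us, r_ur)
        self._rows.append(record)  # deque(maxlen=...) discards the oldest row automatically
        return record

    def latest(self, m: int) -> list:
        """Return the most recent m rows (fewer if the table holds fewer)."""
        if m <= 0:
            return []
        return list(self._rows)[-m:]
```

With `max_entries=3`, appending five rows leaves records 3 to 5 in the table, which corresponds to deleting the oldest entries as new rows are added.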

Returning to FIG. 2, the explanation will be continued again.
The state determining unit 113 determines the state of communication between the first user and the second user based on the state data 12a calculated from the acquired uttered audio data. The state determining unit 113 calculates an index indicating the psychological distance between the first user and the second user based on the state data 12a.

Here, examples of the index indicating a sense of psychological distance include the average silent time ratio, the average own speech time ratio, and the average partner's speech time ratio. These indices can be regarded as reflecting the state of telecommunications. Their calculation will be described later.

The state determining unit 113 further determines the state of communication between the first user and the second user based on a comparison between the calculated index and a predetermined threshold.

FIG. 4 is a diagram for explaining the threshold value 12b. For example, a threshold value S_th for determining a silent state and a threshold value B_th for determining a biased state are defined in advance and stored in the storage 12. Using these thresholds together with the average silent time ratio, the average own speech time ratio, and the average partner's speech time ratio, it is possible to determine the mutual communication state of the participants in the web conference. The determination procedure will be described later.

Returning to FIG. 2, the explanation will be continued again.
The presentation content acquisition unit 114 acquires auditory information according to the communication state determined by the state determination unit 113 from the storage 12. The output unit 115 transmits the auditory information acquired by the presentation content acquisition unit 114 to the destination user's acoustic device and causes the acoustic device to output it. The presentation content acquisition unit 114 transmits auditory information to the second user's earphones 41 and 42, for example, depending on the psychological distance between the first user and the second user. The auditory information is stored in the storage 12 as specific content (presentation content) to be presented to each participant.

FIG. 5 is a diagram for explaining the presentation content 12c. The presentation content 12c is a table in which types of sound sources, sound files, and designated playback times for improving the communication states are associated with each entry corresponding to a plurality of communication states. Here, the type of sound source and the sound file are examples of auditory information that can induce a sense of self-motion, and for example, noise that does not interfere with conversation can be used.

The presentation content 12c is a table for managing auditory information that induces a sense of self-motion and is associated in advance with a communication state. In the presentation content 12c, the sound source to be presented according to the communication state, the actual audio file, and the designated playback time are recorded. For example, in the "bias improvement" state, auditory information is presented to the speaker who is determined to be biased based on the threshold value. Note that a plurality of audio files may be prepared for one state, and in that case, one is randomly selected from the plurality of audio files of the presentation content 12c.
As is already known, by listening to forward movement sounds (sounds that seem to move from behind to the front), listeners can feel a sense of self-motion as if they were moving forward, and a sensation as if the space in front of them were expanding. Conversely, backward movement sounds (sounds that seem to move from the front to behind) create the illusion of having moved backward. In the embodiment, this effect is utilized to shorten or lengthen the psychological distance between users participating in a web conference.
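As a rough illustration of the table lookup and random file selection described above, the following sketch may help. The state keys, file names, and playback times are placeholders invented for this example and are not taken from FIG. 5; only the pairing of the silent state with a forward moving sound and of the biased state with a backward moving sound follows the description.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PresentationEntry:
    sound_type: str          # "forward" or "backward" moving sound
    sound_files: tuple       # one or more candidate audio files
    playback_seconds: float  # designated playback time


# Hypothetical table contents (placeholder file names and durations).
PRESENTATION_CONTENT = {
    "silence_improvement": PresentationEntry(
        "forward", ("forward_a.wav", "forward_b.wav"), 10.0),
    "bias_improvement": PresentationEntry(
        "backward", ("backward_a.wav",), 10.0),
}


def pick_presentation(state, rng=random):
    """Return (sound_type, file, playback_seconds), or None if no entry exists."""
    entry = PRESENTATION_CONTENT.get(state)
    if entry is None:
        return None
    # When several files are registered for one state, pick one at random.
    return entry.sound_type, rng.choice(entry.sound_files), entry.playback_seconds
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can be convenient when testing.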

<Operation>
Next, the operation of the above configuration will be explained.
FIG. 6 is a flowchart showing an example of the processing procedure of the information presentation device 1 having the above configuration. In FIG. 6, the processor 11 acquires speech data of each user at regular intervals, for example, and stores it in the storage 12 (step S10). Next, the processor 11 calculates state data from the acquired speech sound data, and calculates an index indicating the psychological distance between the first user and the second user. Based on this index, the processor 11 determines the state of communication between the first user and the second user (step S11).

Next, the processor 11 selectively acquires presentation content (auditory information) from the storage 12 based on the determined communication state so that the index falls within a certain range. That is, for example, when the index indicates that the psychological distance is too far, the processor 11 acquires a sound file of forward moving sound. Further, when the index indicates that the psychological distance is too close, the processor 11 acquires a sound file of backward moving sound. Then, the processor 11 transmits the acquired sound information to the second user's acoustic device and presents the auditory information (step S13).

FIG. 7 is a flowchart illustrating an example of the processing procedure in step S10 of FIG. 6. In FIG. 7, the processor 11 records the user's utterances at regular intervals (T) (step S21), analyzes the recorded data (step S22), and performs speaker separation (step S23). As is well known, in order to separate speakers from voice data, data classification processing using, for example, AI (Artificial Intelligence) technology may be applied.

Next, the processor 11 calculates the silent interval time l_s, the own speaking interval time l_us, and the other party's speaking interval time l_ur from the spoken voice data (step S24). The subscript s indicates (silent), the subscript us indicates (utterance_sender), and the subscript ur indicates (utterance_receiver).

Next, the processor 11 calculates the silent interval time ratio R_s, the own utterance interval time ratio R_us, and the other party's utterance interval time ratio R_ur using, for example, equation (1) (step S25).

Then, the processor 11 records these calculated amounts in the storage 12 (step S26).
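Steps S24 and S25 can be sketched as follows. Equation (1) is not reproduced in this text, so the sketch assumes that each ratio is simply the corresponding interval time divided by the recording interval T, and that the speaker-separated segments do not overlap. The function names and the segment format are likewise assumptions made for illustration.

```python
def interval_times(segments, window_t):
    """Sum interval times from speaker-separated segments within one window.

    segments: iterable of (start, end, speaker), where speaker is "us" (own
    utterance) or "ur" (the other party's utterance). Segments are assumed
    not to overlap; everything else in the window is treated as silence.
    """
    segments = list(segments)
    l_us = sum(end - start for start, end, who in segments if who == "us")
    l_ur = sum(end - start for start, end, who in segments if who == "ur")
    l_s = max(window_t - l_us - l_ur, 0.0)
    return l_s, l_us, l_ur


def interval_ratios(l_s, l_us, l_ur, window_t):
    """Assumed form of equation (1): each interval time divided by T."""
    if window_t <= 0:
        raise ValueError("window_t must be positive")
    return l_s / window_t, l_us / window_t, l_ur / window_t
```

For example, with T = 10 seconds, 3 seconds of own speech, and 2 seconds of the other party's speech, the three ratios under this assumption are 0.5, 0.3, and 0.2.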

FIG. 8 is a flowchart showing an example of the processing procedure in step S11 of FIG. 6. In FIG. 8, the processor 11 reads the latest M rows of R_us, R_ur, and R_s (FIG. 3) recorded in the state data 12a of the storage 12 (step S31). Note that the reading range M is preferably set in advance.

Next, the processor 11 calculates the average silent time ratio R_{s_ave}, the average own speech time ratio R_{us_ave}, and the average partner's speech time ratio R_{ur_ave} from the read information using equation (2) (step S32).

Next, the processor 11 reads the silent state threshold value S_th and the biased state threshold value B_th from the storage 12 (step S33).
Next, the processor 11 compares the average silent time ratio R_{s_ave} with S_th (step S34), and if R_{s_ave} > S_th, determines that the communication state between the user and the other party is the [silent state] (step S35). If No in step S34, the processor 11 compares the average own speech time ratio R_{us_ave} with B_th (step S36), and if R_{us_ave} > B_th, determines that the communication state between the user and the other party is the [biased state] (step S37).

If No in step S36, the processor 11 compares the average partner's speech time ratio R_{ur_ave} with B_th (step S38), and if R_{ur_ave} > B_th, determines that the communication state between the user and the other party is the [biased state] (step S39). If No in step S38, the processor 11 determines that the communication state between the user and the other party does not require improvement (step S40).
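The determination flow of steps S31 to S40 can be sketched as follows. Equation (2) is not reproduced in this text, so the sketch assumes it is the arithmetic mean over the latest M rows. The returned labels and the extra field identifying which speaker is biased are hypothetical names introduced here.

```python
from statistics import fmean


def determine_state(rows, s_th, b_th):
    """Determine the communication state from the latest M rows.

    rows: sequence of (r_s, r_us, r_ur) tuples read in step S31.
    Returns (state, biased_speaker), where biased_speaker is "us", "ur", or None.
    """
    rows = list(rows)
    if not rows:
        return "no_improvement_needed", None

    # Assumed form of equation (2): arithmetic mean of each ratio (step S32).
    r_s_ave = fmean([r[0] for r in rows])
    r_us_ave = fmean([r[1] for r in rows])
    r_ur_ave = fmean([r[2] for r in rows])

    if r_s_ave > s_th:    # steps S34 and S35
        return "silent", None
    if r_us_ave > b_th:   # steps S36 and S37
        return "biased", "us"
    if r_ur_ave > b_th:   # steps S38 and S39
        return "biased", "ur"
    return "no_improvement_needed", None  # step S40
```

The silent-state check comes first, as in FIG. 8, so a conversation that is mostly silent is reported as silent even if one speaker dominates the remaining speech.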

FIG. 9 is a diagram showing an example of the association between the determined communication state and the presentation content. In the embodiment, the auditory information to be presented to the user is selected according to the aspect of the speaking behavior to be improved. That is, when the silent state is determined, auditory information that gives the sensation of moving forward (forward moving sound) is selected. When the biased state is determined, auditory information that gives the sensation of moving backward (backward moving sound) is selected for the speaker whose speaking is biased.

FIG. 10 is a diagram for explaining that psychological distance can be controlled using auditory information. As shown in FIG. 10, in the embodiment, the psychological distance between the user and the communication partner is involuntarily controlled by presenting auditory information. That is, by using the finding of Non-Patent Document 1 (expansion of the space near the body in the direction of the self-motion sensation), it is possible to control the sense of distance between oneself and the other party. Furthermore, with the finding of Non-Patent Document 2 (self-motion sensation based on auditory information), it is possible to induce a sense of self-motion by presenting a noise sound to the auditory sense, thereby controlling the space near the body. By combining these effects, psychological distance can be controlled.

That is, by expanding the space near the body forward as shown in FIG. 9(a), it is possible to reduce the psychological distance from the other party. Further, as shown in FIG. 9(b), by contracting the space near the body backward, it is possible to distance the user from the other party. Here, the space near the body (peripersonal space) means the space around the body that is within reach.

If there is actual movement, the nearby space will expand or contract in the direction of the movement. In the embodiment, the auditory effect induces a sense of self-motion just by presenting information to the auditory sense, making the user feel the sensation of moving forward or backward. Utilizing this, it is possible to reduce the distance between the user and the other party or to move them away from the other party involuntarily, regardless of the user's intention.

FIG. 11 is a flowchart illustrating an example of the processing procedure in step S13 of FIG. 6. In FIG. 11, the processor 11 communicates with the user's conference equipment 2 to obtain the number of acoustic devices worn by the target user (step S51). If the user is wearing two devices, for example the earphones 41 and the bone conduction earphones 42, it is determined that the user is wearing a plurality of devices (Yes in step S52). In this case, the processor 11 merges (superimposes) the presentation information (auditory information) onto the other speaker's voice and reproduces it (step S53).

If the answer in step S52 is No, that is, if the user is wearing only one of the earphones 41 or the bone conduction earphones 42, the processor 11 causes the device that is not playing the other speaker's voice to play the presentation information (step S54). That is, the output unit 115 transmits the auditory information to the bone conduction earphone 42 as the second device. In the case of IP (Internet Protocol)-based communication, transmission destinations may be distinguished by, for example, port numbers.

Steps S52, S53, and S54 are repeated until the specified playback time is finished so that even short audio files can be played back for a certain period of time. That is, the reproduction of the audio information is repeated until the reproduction for the designated reproduction time is completed (Yes in step S55).
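The repetition until the designated playback time elapses can be sketched as a simple schedule. The function name is hypothetical, and cutting the final repetition at the designated playback time is an assumption of this sketch; the description states only that reproduction is repeated until the designated playback time is completed.

```python
def playback_schedule(file_seconds, designated_seconds):
    """Return (start, end) offsets for repeating one audio file.

    The file is looped from offset 0 until designated_seconds is reached.
    The last repetition is truncated at the designated playback time
    (an assumption made for this sketch).
    """
    if file_seconds <= 0:
        raise ValueError("file_seconds must be positive")
    schedule = []
    start = 0.0
    while start < designated_seconds:
        end = min(start + file_seconds, designated_seconds)
        schedule.append((start, end))
        start += file_seconds
    return schedule
```

For a 3-second file and a designated playback time of 10 seconds, this yields three full repetitions followed by a final 1-second portion.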

FIG. 12 is a diagram for explaining a series of processing procedures in the information presentation device 1 of the embodiment. The processor 11 of the embodiment determines the current state of communication based on the utterances made so far (step S100: communication state determination), and selects appropriate presentation information according to the determined state (step S200: selection of presentation information according to the state). Then, the processor 11 presents the selected information to the auditory sense and urges the user to improve the behavior (step S300: auditory presentation of information encouraging improvement).

<Effect>
As described above, in the embodiment, the information presentation device 1 acquires the user's speaking behavior in a web conference, and determines whether or not the behavior needs to be improved while taking the past acquisition history into account. To this end, the speaking behavior is recorded and the duration of the speech is determined. Furthermore, for each user, the occurrence of utterances by the user and by each conversation partner is determined. These processes are executed at arbitrary sampling intervals, and the results are compared with preset threshold values to determine whether an inappropriate state such as "silence" or "biased speaker" exists.

FIG. 13 is a diagram for explaining the effects obtained by the embodiment. As shown in FIG. 13(a), the sense of distance between speakers is controlled by auditory information so that the distance is shortened when the psychological distance is far and increased when the psychological distance is close. In this way, by feeding back auditory information according to the state of communication, it is possible to guide each person to the ideal state shown in FIG. 13(b), and to promote improved interaction satisfaction while maintaining a sense of distance that makes communication easy.

That is, in the embodiment, attention is paid to the fact that the sense of distance in a conversation changes depending on auditory information, and the sense of distance is made appropriate by presenting the auditory information. Furthermore, depending on the state of the conversation, it is determined whether the sense of distance is close or far, and auditory information is presented so that the sense of distance is appropriate. For these reasons, according to the embodiment, it is possible to provide a technology that can encourage comfortable communication even in a remote environment.

Note that this invention is not limited to the description of the embodiments. For example, the embodiment assumes general communication via online communication tools that involve audio, such as web conferences. The technology disclosed in the embodiments is not limited to this, and can also be implemented in the form of a smartphone app that can record and analyze the user's speaking behavior during face-to-face communication.

Further, the number of speakers in the web conference may be two or more, and in such a case, the utterance interval time of each speaker may be calculated to determine which speakers are biased. In addition, rather than simply ending the presentation of auditory information after it has been played for a certain period, the state determination may proceed in real time and the auditory information may be presented as appropriate whenever the user's state requires improvement.
Furthermore, the number of utterances per unit time and the duration of silence can also be used as indicators of psychological distance.

Furthermore, the present invention can be embodied by modifying the constituent elements within the scope of the invention at the implementation stage. Furthermore, various inventions can be formed by appropriately combining the plurality of components disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiments. Furthermore, components from different embodiments may be combined as appropriate.

1 ... Information presentation device
2 ... Conference equipment
11 ... Processor
12 ... Storage
12a ... Status data
12b ... Threshold value
12c ... Presentation content
13 ... Interface section
14 ... Memory
14a ... Program
30 ... Microphone
41 ... Earphone
42 ... Bone conduction earphone
100 ... Network
111 ... User information acquisition section
112 ... Status data calculation section
113 ... Status determination section
114 ... Presentation content acquisition section
115 ... Output section

Claims

1. An information presentation device comprising:
a user information acquisition unit that acquires utterance audio data of a first participant and utterance audio data of a second participant in a two-way telecommunication environment between at least the first participant and the second participant, each wearing an acoustic device;
a determination unit that determines the state of communication between the first participant and the second participant based on the utterance audio data of the first participant and the utterance audio data of the second participant; and
an output unit that causes the acoustic device of the first participant to output auditory information according to the determined state.

2. The information presentation device according to claim 1, further comprising:
a storage unit that stores auditory information that can induce a sense of self-motion; and
an auditory information acquisition unit that acquires the auditory information from the storage unit,
wherein the determination unit determines the state based on an index indicating a psychological distance between the first participant and the second participant,
the auditory information acquisition unit acquires, from the storage unit, auditory information that induces a sense of self-motion and is associated in advance with the state, and
the output unit transmits the acquired auditory information to the acoustic device of the first participant.

3. The information presentation device according to claim 2, wherein the auditory information acquisition unit selectively acquires the auditory information from the storage unit so that the index falls within a certain range.

4. The information presentation device according to claim 3, wherein
the storage unit stores a forward moving sound and a backward moving sound, and
the auditory information acquisition unit acquires the forward moving sound when the index indicates that the psychological distance is too far, and acquires the backward moving sound when the index indicates that the psychological distance is too close.

5. The information presentation device according to any one of claims 2 to 4, wherein the determination unit uses either the number of utterances per unit time or the duration of a silent state as the index indicating the psychological distance.

6. The information presentation device according to claim 1, wherein, when the acoustic device includes a first device that reproduces conversational audio and a second device that reproduces acoustic information different from the conversational audio, the output unit transmits the auditory information to the second device.

7. An information presentation method executed by a processor of a computer comprising the processor and a storage unit, the method comprising:
a step in which the processor acquires utterance audio data of a first participant and utterance audio data of a second participant in a two-way telecommunication environment between at least the first participant and the second participant, each wearing an acoustic device;
a step in which the processor determines the state of communication between the first participant and the second participant based on the utterance audio data of the first participant and the utterance audio data of the second participant; and
a step in which the processor causes the acoustic device of the first participant to output auditory information according to the determined state.

8. A program including instructions to be executed by a processor of a computer comprising the processor and a storage unit, the instructions comprising:
an instruction for causing the processor to execute a step of acquiring utterance audio data of a first participant and utterance audio data of a second participant in a two-way telecommunication environment between at least the first participant and the second participant, each wearing an acoustic device;
an instruction for causing the processor to execute a step of determining the state of communication between the first participant and the second participant based on the utterance audio data of the first participant and the utterance audio data of the second participant; and
an instruction for causing the processor to cause the acoustic device of the first participant to output auditory information according to the determined state.