WO2021230180A1 - Information processing device, display device, presentation method, and program

Info

Publication number: WO2021230180A1
Application number: PCT/JP2021/017640
Authority: WO
Inventors: 新高橋; 卓見飯野
Original assignee: ピクシーダストテクノロジーズ株式会社; 大日本住友製薬株式会社
Priority date: 2020-05-11
Filing date: 2021-05-10
Publication date: 2021-11-18
Also published as: JPWO2021230180A1

Abstract

An information processing device according to the present invention is equipped with a means for acquiring audio collected by a plurality of microphones. The information processing device is equipped with a means for estimating an arrival direction of the acquired audio. The information processing device is equipped with a means for generating a text image corresponding to the acquired audio. The information processing device is equipped with a means for referencing the estimated arrival direction and determining a presentation mode for the text image. The information processing device is equipped with a means for presenting the text image in the determined presentation mode.

Description

Information processing equipment, display devices, presentation methods, and programs

This disclosure relates to information processing devices, display devices, presentation methods, and programs.

Hearing aids are widely used as a device to assist hearing.

Patent Document 1: Japanese Unexamined Patent Publication No. 2013-236396

Hearing aid wearers may have diminished ability to grasp the direction of arrival of sound due to diminished auditory function. When such a wearer tries to have a conversation with a plurality of people, the direction of arrival of the voice cannot be grasped, and it is difficult to establish the conversation.

For example, as in Patent Document 1, a hearing aid that reproduces the direction of arrival of voice and enhances the clarity of the voice spoken by the speaker (hereinafter referred to as “spoken sound”) has been proposed. However, the reproduction of the direction of arrival by voice alone is not sufficient for the wearer of the hearing aid to recognize the direction of arrival. In particular, when a plurality of speakers speak at the same time, it is difficult for the wearer to recognize the arrival direction of the utterance sound of each speaker only by reproducing the arrival direction by voice.

The purpose of this disclosure is to make the direction of arrival of voice easy to recognize.

According to one aspect of the present disclosure, an information processing device is provided. The information processing device includes means for acquiring voice collected by a plurality of microphones. The information processing device includes means for estimating the arrival direction of the acquired voice. The information processing device includes means for generating a text image corresponding to the acquired voice. The information processing device includes means for determining the presentation mode of the text image with reference to the estimated arrival direction. The information processing device includes means for presenting the text image in the determined presentation mode.

A schematic diagram showing the configuration of the display device of the present embodiment.
A schematic diagram of the glass-type display device, which is an example of the display device shown in FIG.
An explanatory diagram of the outline of the present embodiment.
A flowchart showing an example of the presentation process of the present embodiment.
A diagram for explaining the collection of the utterance sound emitted from a speaker.
A diagram for explaining the arrival direction of an utterance sound.
A schematic diagram showing a presentation example of the glass-type display device.
A diagram for explaining the field of view of a wearer.
A schematic diagram showing the configuration of the display device of Modification 1.
A schematic diagram showing the display device of Modification 2 and a presentation example of the display device.
A schematic diagram showing the configuration of the display device of Modification 3.
A schematic diagram showing the first example of the display device of Modification 3 and a presentation example of the display device.
A schematic diagram showing the shooting range of the camera shown in FIG.
A schematic diagram showing the second example of the display device of Modification 3 and a presentation example of the display device.
A schematic diagram showing the third example of the display device of Modification 3 and a presentation example of the display device.
A schematic diagram showing the fourth example of the display device of Modification 3 and a presentation example of the display device.
A schematic diagram showing the configuration of the display device of Modification 4.
A schematic diagram showing the configuration of the display device of Modification 5.
A schematic diagram of the conference system, which is an example of the display device shown in FIG.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. In the drawings for explaining the embodiments, the same components are designated by the same reference numerals in principle, and the repeated description thereof will be omitted.

(1) Configuration of Information Processing Device
The configuration of the display device 1 of the present embodiment will be described. FIG. 1 is a schematic view showing the configuration of the display device of the present embodiment. FIG. 2 is a schematic diagram of a glass-type display device which is an example of the display device shown in FIG.

The display device 1 shown in FIG. 1 is configured to collect sound and to display (an example of "presentation") a text image corresponding to the collected sound in a presentation mode according to the direction of arrival of the sound.
The form of the display device 1 includes, for example, at least one of the following.
・ Glass-type display device
・ Mobile terminal
・ Conference system

As shown in FIG. 1, the display device 1 includes a plurality of microphones 101, a display 102, and a controller 10.
The microphones 101 are arranged at a predetermined distance from each other.

As shown in FIG. 2, when the display device 1 is a glass-type display device, the display device 1 has a right temple 21, a right twist 22, a bridge 23, a left twist 24, a left temple 25, and a rim 26.

The microphone 101-1 is arranged on the right temple 21.
The microphone 101-2 is arranged on the right twist 22.
The microphone 101-3 is arranged on the bridge 23.
The microphone 101-4 is arranged on the left twist 24.
The microphone 101-5 is arranged on the left temple 25.
The microphone 101 collects, for example, at least one of the following sounds.
- Sound of speech by a person
- Sound of the environment in which the display device 1 is used (hereinafter referred to as "environmental sound")

When the display device 1 is a glass-type display device, the display 102 is a transparent member (for example, at least one of glass, plastic, and a half mirror). In this case, the display 102 is arranged at a position visible to the user wearing the glass-type display device.

The displays 102-1 to 102-2 are supported by the rim 26. The display 102-1 is arranged so as to be located in front of the user's right eye when the user wears the display device 1. The display 102-2 is arranged so as to be located in front of the user's left eye when the user wears the display device 1.

The display 102 presents (for example, displays) an image according to the control from the controller 10. The method by which the display 102 presents an image is not limited, and any existing method may be used.

For example, as shown in FIG. 2, an image corresponding to the image light is projected onto the display 102-1 from a projector (not shown) arranged behind the right temple 21. An image corresponding to the image light is projected onto the display 102-2 from a projector (not shown) arranged on the back side of the left temple 25.
The display 102-1 and the display 102-2 present an image. The user can visually recognize the image and at the same time visually recognize the scenery transmitted through the display 102-1 and the display 102-2.

The controller 10 is an information processing device that controls the display device 1. The controller 10 is connected to the microphone 101 and the display 102 by wire or wirelessly.
When the display device 1 is a glass-type display device as shown in FIG. 2, the controller 10 is arranged, for example, inside the right temple 21.

As shown in FIG. 1, the controller 10 includes a storage device 11, a processor 12, an input / output interface 13, and a communication interface 14.

The storage device 11 is configured to store programs and data. The storage device 11 is, for example, a combination of a ROM (Read Only Memory), a RAM (Random Access Memory), and a storage (for example, a flash memory or a hard disk).

The programs include, for example, the following.
・ OS (Operating System) program
・ Application program that executes information processing

The data includes, for example, the following.
- Database referenced in information processing
- Data obtained by executing information processing (that is, the execution result of information processing)

The processor 12 is configured to realize the function of the controller 10 by activating the program stored in the storage device 11. The processor 12 is an example of a computer. For example, by activating a program stored in the storage device 11, the processor 12 realizes a function of presenting an image representing the text corresponding to the utterance sound collected by the microphone 101 (hereinafter referred to as “text image”) at a predetermined position on the display 102.

The input / output interface 13 acquires at least one of the following.
- Voice signal collected by the microphone 101
- User's instruction input from an input device connected to the glass-type display device 1
The input device is, for example, a drive button, a keyboard, a pointing device, a touch panel, a remote controller, a switch, or a combination thereof.
Further, the input / output interface 13 is configured to output information to an output device connected to the display device 1. The output device is, for example, a display 102.

The communication interface 14 is configured to control communication between the display device 1 and an external device (for example, a server or a mobile terminal) (not shown).

(2) Outline of the embodiment
An outline of the present embodiment will be described. FIG. 3 is an explanatory diagram of an outline of the present embodiment.

In FIG. 3, the wearer P1 who wears the display device 1 has a conversation with the speakers P2 to P4.
The microphone 101 collects the utterance sounds of the speakers P2 to P4.
The controller 10 estimates the arrival direction of the collected utterance sound.
The controller 10 determines the text corresponding to the utterance sound by analyzing the audio signal corresponding to the collected utterance sound.
The controller 10 generates text images T1 to T3 corresponding to the determined text.
The controller 10 determines the presentation mode of each of the text images T1 to T3 according to the arrival direction of the utterance sound.
The controller 10 presents the text images T1 to T3 on the displays 102-1 to 102-2 in the determined presentation mode.
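
The flow outlined above can be sketched as a small pipeline. This is only an illustrative sketch, not the implementation of the controller 10: the names `TextImage`, `run_presentation`, and the four callables are hypothetical, and the stub functions passed in at the bottom stand in for the estimation, extraction, recognition, and presentation-mode steps, which the embodiment leaves open.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# One sequence of samples per microphone (hypothetical representation).
Signals = Sequence[Sequence[float]]


@dataclass
class TextImage:
    text: str                 # recognized text to be rendered
    arrival_angle_deg: float  # estimated arrival direction
    presentation_mode: str    # how/where the text image is presented


def run_presentation(
    signals: Signals,
    estimate_directions: Callable[[Signals], List[float]],
    extract: Callable[[Signals, float], object],
    recognize: Callable[[object], str],
    decide_mode: Callable[[float], str],
) -> List[TextImage]:
    """Chain the steps: estimate directions, extract, recognize, decide mode."""
    images = []
    for angle in estimate_directions(signals):
        extracted = extract(signals, angle)   # signal for this direction
        text = recognize(extracted)           # text for the extracted signal
        images.append(TextImage(text, angle, decide_mode(angle)))
    return images


# Stub callables (assumptions for illustration only).
images = run_presentation(
    signals=[[0.0, 0.1], [0.0, 0.2]],
    estimate_directions=lambda s: [30.0, -20.0],
    extract=lambda s, angle: (angle, s),
    recognize=lambda extracted: f"utterance from {extracted[0]:+.0f} deg",
    decide_mode=lambda angle: "right display" if angle >= 0 else "left display",
)
for image in images:
    print(image)
```

Passing each step in as a callable keeps the sketch independent of any particular estimation or recognition method.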

(3) Presentation processing
The presentation processing of the present embodiment will be described. FIG. 4 is a flowchart showing an example of the presentation process of the present embodiment. FIG. 5 is a diagram for explaining the collection of utterance sounds emitted from the speaker. FIG. 6 is a diagram for explaining the arrival direction of the utterance sound. FIG. 7 is a schematic diagram showing a presentation example of the glass-type display device of FIG. FIG. 8 is a diagram for explaining the field of view of the wearer.

Each microphone 101 collects the utterance sound emitted from the speaker. For example, the microphones 101-1 to 101-5, which in the example shown in FIG. 2 are arranged on the right temple 21, the right twist 22, the bridge 23, the left twist 24, and the left temple 25 of the display device 1, collect the utterance sounds that arrive through the paths shown in FIG. The microphones 101-1 to 101-5 convert the collected utterance sound into an audio signal.

The controller 10 executes acquisition (S110) of the audio signal converted by the microphone 101.

Specifically, the processor 12 acquires an audio signal including an utterance sound emitted from at least one of the speakers P2, P3, and P4 transmitted from the microphones 101-1 to 101-5. The audio signals transmitted from the microphones 101-1 to 101-5 include spatial information based on the path through which the utterance sound has progressed.

After step S110, the controller 10 executes estimation of the arrival direction (S111).

Specifically, the storage device 11 stores the arrival direction estimation model. The arrival direction estimation model describes the correlation between the spatial information contained in the voice signal and the arrival direction of the utterance sound.

Any existing method may be used as the arrival direction estimation method used in the arrival direction estimation model. For example, as the arrival direction estimation method, MUSIC (Multiple Signal Classification) using the eigenvalue decomposition of the input correlation matrix, the minimum norm method, or ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques) is used.

The processor 12 estimates the direction of arrival of the utterance sound collected by the microphones 101-1 to 101-5 by inputting the audio signals received from the microphones 101-1 to 101-5 into the arrival direction estimation model stored in the storage device 11. At this time, the processor 12 estimates, as the arrival direction of the utterance sound, for example, the angle of deviation from an axis that takes the frontal direction as zero degrees. In the example shown in FIG. 6, the processor 12 estimates the arrival direction of the utterance sound emitted from the speaker P2 as an angle A1 to the right from the axis. The processor 12 estimates the arrival direction of the utterance sound emitted from the speaker P3 as an angle A2 to the left from the axis. The processor 12 estimates the arrival direction of the utterance sound emitted from the speaker P4 as an angle A3 to the left from the axis.
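
As a rough illustration of estimating an angle from the axis, the sketch below scans candidate angles with a conventional steered-response (delay-and-sum) search. This is a deliberately simpler technique swapped in for illustration; it is not MUSIC, the minimum norm method, or ESPRIT, and it is not the arrival direction estimation model of the embodiment. It assumes a narrowband signal, a uniform linear array with half-wavelength spacing, a single noiseless snapshot, and positive angles meaning "to the right of the axis". All function names and constants are hypothetical.

```python
import cmath
import math


def steering_vector(angle_deg, num_mics, spacing_m, wavelength_m):
    """Per-microphone phase factors for a plane wave arriving from angle_deg."""
    phase_step = (2.0 * math.pi * spacing_m
                  * math.sin(math.radians(angle_deg)) / wavelength_m)
    return [cmath.exp(-1j * phase_step * m) for m in range(num_mics)]


def estimate_arrival_direction(snapshot, spacing_m, wavelength_m, step_deg=1):
    """Return the scanned angle (degrees) with the largest steered power."""
    best_angle, best_power = None, -1.0
    for angle in range(-90, 91, step_deg):
        a = steering_vector(angle, len(snapshot), spacing_m, wavelength_m)
        # Align the phases for this candidate angle and sum across microphones.
        response = sum(w.conjugate() * x for w, x in zip(a, snapshot))
        power = abs(response) ** 2
        if power > best_power:
            best_angle, best_power = angle, power
    return best_angle


# Assumed values: 5 microphones, roughly 1 kHz sound in air (340 m/s).
WAVELENGTH_M = 0.34
SPACING_M = WAVELENGTH_M / 2

# Simulate one noiseless snapshot of a source 20 degrees to the right.
simulated = steering_vector(20, 5, SPACING_M, WAVELENGTH_M)
estimated = estimate_arrival_direction(simulated, SPACING_M, WAVELENGTH_M)
print(estimated)
```

Microphones mounted on a glasses frame do not form a uniform linear array, so a practical model would need steering vectors that match the actual microphone geometry.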

After step S111, the controller 10 executes audio signal extraction (S112).

Specifically, the beamforming model is stored in the storage device 11. The beamforming model describes the correlation between a given direction and the parameters for forming a directivity with a beam in this direction. Here, the parameter for forming the directivity is a parameter related to amplifying or attenuating a plurality of audio signals.

The processor 12 inputs the estimated arrival direction into the beamforming model stored in the storage device 11 to calculate the parameters for forming the directivity having the beam in the arrival direction.

In the example shown in FIG. 6, the processor 12 inputs the calculated angle A1 into the beamforming model and calculates the parameters for forming the directivity having the beam in the direction of the angle A1 to the right from the axis. The processor 12 inputs the calculated angle A2 into the beamforming model and calculates the parameters for forming the directivity having the beam in the direction of the angle A2 to the left from the axis. The processor 12 inputs the calculated angle A3 into the beamforming model and calculates the parameters for forming the directivity having the beam in the direction of the angle A3 to the left from the axis.

The processor 12 amplifies or attenuates the audio signal transmitted from the microphones 101-1 to 101-5 with the parameters calculated for the angle A1. The processor 12 synthesizes the amplified or attenuated audio signal to extract the audio signal for the utterance sound coming from the angle A1 from the received audio signal.

The processor 12 amplifies or attenuates the audio signal transmitted from the microphones 101-1 to 101-5 with the parameters calculated for the angle A2. The processor 12 synthesizes the amplified or attenuated audio signal to extract the audio signal for the utterance sound coming from the angle A2 from the received audio signal.

The processor 12 amplifies or attenuates the audio signal transmitted from the microphones 101-1 to 101-5 with the parameters calculated for the angle A3. The processor 12 synthesizes the amplified or attenuated audio signal to extract the audio signal for the utterance sound coming from the angle A3 from the received audio signal.
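
The amplify-or-attenuate-then-synthesize step can be illustrated with narrowband delay-and-sum weights. This is a minimal sketch under stated assumptions (uniform linear array, half-wavelength spacing, single-frequency snapshot, positive angles to the right); the embodiment's beamforming model is not specified, and the names `beamforming_weights` and `extract` are hypothetical.

```python
import cmath
import math


def steering_vector(angle_deg, num_mics, spacing_m, wavelength_m):
    """Per-microphone phase factors for a plane wave arriving from angle_deg."""
    phase_step = (2.0 * math.pi * spacing_m
                  * math.sin(math.radians(angle_deg)) / wavelength_m)
    return [cmath.exp(-1j * phase_step * m) for m in range(num_mics)]


def beamforming_weights(angle_deg, num_mics, spacing_m, wavelength_m):
    """Delay-and-sum weights: one complex gain per microphone signal."""
    a = steering_vector(angle_deg, num_mics, spacing_m, wavelength_m)
    return [v.conjugate() / num_mics for v in a]


def extract(snapshot, weights):
    """Apply the per-microphone gains and sum (synthesize) the result."""
    return sum(w * x for w, x in zip(weights, snapshot))


NUM_MICS = 5
WAVELENGTH_M = 0.34          # assumed: roughly 1 kHz in air
SPACING_M = WAVELENGTH_M / 2

# Two unit-amplitude sources: 20 degrees right and 40 degrees left.
src_right = steering_vector(20, NUM_MICS, SPACING_M, WAVELENGTH_M)
src_left = steering_vector(-40, NUM_MICS, SPACING_M, WAVELENGTH_M)

weights_right = beamforming_weights(20, NUM_MICS, SPACING_M, WAVELENGTH_M)
gain_target = abs(extract(src_right, weights_right))  # source in the beam
gain_other = abs(extract(src_left, weights_right))    # source off the beam
print(gain_target, gain_other)
```

With these weights the source in the beam direction passes with unit gain while the off-beam source is attenuated, which is the sense in which the signal "for" an angle is extracted from the mixture.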

After step S112, the controller 10 executes voice recognition (S113).

Specifically, the voice recognition model is stored in the storage device 11. The speech recognition model describes the correlation between the speech signal and the text for the speech signal. The speech recognition model is, for example, a trained model learned by machine learning.

The processor 12 inputs the extracted voice signal into the voice recognition model stored in the storage device 11, and determines the text corresponding to the input voice signal.

In the example shown in FIG. 6, the processor 12 determines the text corresponding to the input voice signal by inputting the voice signals extracted for the angles A1 to A3 into the voice recognition model.

After step S113, the controller 10 executes image generation (S114).

Specifically, the processor 12 generates a text image based on the determined text.

After step S114, the controller 10 executes the determination of the presentation mode (S115).

Specifically, the processor 12 determines how the text image is presented on the display 102.

In the first example of step S115, the processor 12 determines the position corresponding to the arrival direction of the audio signal related to the text image as the presentation position of the text image.
The processor 12 determines the type of the text image to be presented (an example of the "presentation mode") according to the arrival direction.

More specifically, the processor 12 determines the presentation position of the text image T1, generated based on the voice signal extracted for the direction of the angle A1 to the right from the axis, as the position in the direction corresponding to the angle A1 and in a predetermined elevation angle direction. In the example applied to the glass-type display device, the processor 12 sets, as the presentation position of the text image T1, the position on the display 102-1 on the right side of the glass-type display device that lies in the direction corresponding to the angle A1 and in the predetermined elevation angle direction. Further, the processor 12 determines to present the text image T1 so that the text image T1 is formed at a predetermined distance from the wearer P1.
The processor 12 determines the presentation position of the text image T2, generated based on the voice signal extracted for the direction of the angle A2 to the left from the axis, as the position in the direction corresponding to the angle A2 and in the predetermined elevation angle direction. In the example applied to the glass-type display device, the processor 12 sets, as the presentation position of the text image T2, the position on the display 102-2 on the left side of the glass-type display device that lies in the direction corresponding to the angle A2 and in the predetermined elevation angle direction. Further, the processor 12 determines to present the text image T2 so that the text image T2 is formed at a predetermined distance from the wearer P1.
The processor 12 determines the presentation position of the text image T3, generated based on the voice signal extracted for the direction of the angle A3 to the left from the axis, as the position in the direction corresponding to the angle A3 and in the predetermined elevation angle direction. In the example applied to the glass-type display device, the processor 12 sets, as the presentation position of the text image T3, the position on the display 102-2 on the left side of the glass-type display device that lies in the direction corresponding to the angle A3 and in the predetermined elevation angle direction. Further, the processor 12 determines to present the text image T3 so that the text image T3 is formed at a predetermined distance from the wearer P1.
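
One way to picture the first example is a function that maps an arrival angle to a display and a pixel position. The sketch below is an assumption-laden illustration, not the mapping used by the processor 12: the display resolution, the 45-degree maximum angle, the fixed vertical fraction standing in for the "predetermined elevation angle direction", and the rule that zero degrees falls on the inner edge of the right display are all hypothetical. Only the assignment of right-side angles to the display 102-1 and left-side angles to the display 102-2 follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresentationPosition:
    display_id: str  # "102-1" (right eye) or "102-2" (left eye)
    x_px: int        # horizontal pixel position on that display
    y_px: int        # vertical pixel position on that display


def presentation_position(angle_deg, display_width_px=640,
                          display_height_px=360, max_angle_deg=45.0,
                          elevation_fraction=0.25):
    """Map an arrival angle (positive = right of the axis) to a position."""
    clamped = max(-max_angle_deg, min(max_angle_deg, angle_deg))
    fraction = abs(clamped) / max_angle_deg
    if clamped >= 0:
        # Right display: 0 deg at the inner (left) edge, max at the outer edge.
        display_id = "102-1"
        x_px = round(fraction * (display_width_px - 1))
    else:
        # Left display: 0 deg at the inner (right) edge, max at the outer edge.
        display_id = "102-2"
        x_px = round((1.0 - fraction) * (display_width_px - 1))
    # Fixed row standing in for the predetermined elevation angle direction.
    y_px = round(elevation_fraction * (display_height_px - 1))
    return PresentationPosition(display_id, x_px, y_px)


print(presentation_position(45.0))
print(presentation_position(-45.0))
```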

In the second example of step S115, the processor 12 determines a predetermined position as the presentation position of the text images T1 to T3.
The processor 12 determines to present the text images T1 to T3 in a format including at least one of a character string and a symbol corresponding to the direction of arrival of the voice signal relating to the text image (an example of the "presentation mode").

After step S115, the controller 10 executes image presentation (S116).

Specifically, the processor 12 presents the text image on the display 102 in the determined presentation mode.

According to the first example (FIG. 7) of step S115, the processor 12 presents the text image T1 at a position on the display 102-1 in the direction corresponding to the angle A1 and in the predetermined elevation angle direction. The processor 12 presents the text image T2 at a position on the display 102-2 in the direction corresponding to the angle A2 and in the predetermined elevation angle direction. The processor 12 presents the text image T3 at a position on the display 102-2 in the direction corresponding to the angle A3 and in the predetermined elevation angle direction. The humanoid figures shown by broken lines on the displays 102-1 to 102-2 in FIG. 7 are a supplementary representation of the speakers whom the wearer P1 can see through the displays 102-1 to 102-2; they are not presented on the displays 102-1 to 102-2.

According to the second example of step S115, the processor 12 presents the text image T1 at a predetermined position on the display 102-1 in a format that includes at least one of a character string and a symbol corresponding to the direction corresponding to the angle A1. The processor 12 presents the text image T2 at a predetermined position on the display 102-2 in a format that includes at least one of a character string and a symbol corresponding to the direction corresponding to the angle A2. The processor 12 presents the text image T3 at a predetermined position on the display 102-2 in a format that includes at least one of a character string and a symbol corresponding to the direction corresponding to the angle A3. As an example, a text image based on the utterance sound from a speaker on the left includes, for example, the characters "left" or a symbol suggesting "left", and a text image based on the utterance sound from a speaker on the right includes, for example, the characters "right" or a symbol suggesting "right".
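
The second example can be sketched as prefixing the recognized text with a direction label. This is an illustrative sketch only: the bracketed format, the arrow symbols, and the "front" label for exactly zero degrees are assumptions not taken from the embodiment, and positive angles are assumed to mean "to the right of the axis".

```python
def direction_label(angle_deg, use_symbol=False):
    """Return a character string or a symbol for the arrival direction."""
    if angle_deg > 0:
        return "→" if use_symbol else "right"
    if angle_deg < 0:
        return "←" if use_symbol else "left"
    # Zero degrees: the "front" label is an assumption for this sketch.
    return "↑" if use_symbol else "front"


def labelled_text(text, angle_deg, use_symbol=False):
    """Build the text to render in the text image, including the label."""
    return f"[{direction_label(angle_deg, use_symbol)}] {text}"


print(labelled_text("Hello", 30.0))
print(labelled_text("Good morning", -10.0, use_symbol=True))
```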

By presenting the text images T1 to T3 on the displays 102-1 to 102-2 in this way, the text image T1, which is the conversation content spoken by the speaker P2, is presented to the wearer P1 of the glass-type display device 1 together with the speaker P2 who is visually recognized through the display 102-1, as shown in FIG. The text image T2, which is the conversation content spoken by the speaker P3, is presented to the wearer P1 together with the speaker P3 who is visually recognized through the display 102-2. The text image T3, which is the conversation content spoken by the speaker P4, is presented to the wearer P1 together with the speaker P4 who is visually recognized through the display 102-2.

(4) Summary
According to the present embodiment, a text image corresponding to the utterance sound is presented in a presentation mode according to the arrival direction of the utterance sound. As a result, the wearer of the display device 1 can easily recognize the direction of arrival of the utterance sound.

Further, according to the present embodiment, the presentation mode is such that the image is presented at a position corresponding to the arrival direction of the utterance sound. This makes it easier to recognize the direction of arrival of the utterance sound.

Further, according to the present embodiment, the audio signal corresponding to the estimated arrival direction is extracted from the acquired audio signal. This makes it possible to accurately recognize the direction of arrival of the utterance sound.

Further, according to the present embodiment, the display device is applied to at least one form of a glass type display device, a mobile terminal, and a conference system. This makes it possible to easily recognize the direction of arrival of the utterance sound in various uses.

(5) Modification Example
A modification of the present embodiment will be described.

(5.1) Modification 1
A modification 1 of the present embodiment will be described. Modification 1 shows an example in which the display device 1 is connected to a microphone module including a plurality of microphones 101. FIG. 9 is a schematic view showing the configuration of the display device of the first modification.

As shown in FIG. 9, in the display device 1 of the first modification, the communication interface 14 is connected to the microphone module 101a.
In this case, the microphone 101 is not arranged on the frame of the glass-type display device 1.

The microphone module 101a includes a plurality of microphones 101. The microphones 101 are arranged at a predetermined distance from each other. The microphone module 101a is attached to any part of the body shown below.
- Head
- Collar
- Chest
- Waist
- Other parts on the center line of the wearer's body
When the microphone module 101a is worn by the wearer, it communicates with the controller 10 via the communication interface 14.

The controller 10 executes steps S110 to S116 and presents the text images T1 to T3 on the displays 102-1 to 102-2 in the same manner as in FIG.

According to the first modification, even with a glass-type display device 1 on which the microphone 101 is not arranged, it becomes possible to present a text image corresponding to the sound collected by the microphone 101 in a mode corresponding to the arrival direction.

(5.2) Modification 2
Modification 2 of this embodiment will be described. Modification 2 shows an example in which the display device 1 includes a mobile terminal. FIG. 10 is a schematic diagram showing the display device of the modification 2 and the presentation example of the display device.

In the second modification, the mobile terminal of FIG. 10 is an example of the display device 1. The mobile terminal includes, for example, any of the following.
・ Smartphones
・ Tablet terminals
・ Mobile devices with displays
・ Personal computers (for example, laptop computers)

In the second modification, the controller 10 executes steps S110 to S116 in the same manner as in FIG.

As a result, as shown in FIG. 10, the text images T1 to T3 are presented at positions on the display 102 in the direction corresponding to the arrival direction of the utterance sound.

According to the second modification, if the microphone module 101a is connected to the mobile terminal, it is possible to present a text image corresponding to the utterance sound collected by the microphone 101 in a presentation mode according to the arrival direction.

(5.3) Modification 3
A modification 3 of the present embodiment will be described. Modification 3 shows an example in which the display device 1 includes a camera. FIG. 11 is a schematic view showing the configuration of the display device of the modification 3.

As shown in FIG. 11, the display device 1a includes a microphone 101, a display 102, a camera 103, and a controller 10a.
The camera 103 is arranged so that the speaker is included in the shooting area.
The camera 103 shoots in a predetermined direction and generates a shooting signal.

The controller 10a is an information processing device that controls the display device 1a. The controller 10a is connected to the microphone 101, the display 102, and the camera 103 by wire or wirelessly.

As shown in FIG. 11, the controller 10a includes a storage device 11, a processor 12a, an input / output interface 13a, and a communication interface 14.

The processor 12a is configured to realize the function of the controller 10a by activating the program stored in the storage device 11. The processor 12a is an example of a computer. For example, by activating a program stored in the storage device 11, the processor 12a realizes a function of superimposing a text image of the utterance sound collected by the microphone 101 on an image corresponding to the shooting signal generated by the camera 103 (hereinafter referred to as "captured image") and presenting the result at a predetermined position on the display 102.

The input / output interface 13a acquires at least one of the following.
- Voice signal collected by the microphone 101
- Shooting signal taken by the camera 103
- User's instruction input from an input device connected to the display device 1
The input device is, for example, a drive button, a keyboard, a pointing device, a touch panel, a remote controller, a switch, or a combination thereof.
Further, the input / output interface 13a is configured to output information to an output device connected to the display device 1. The output device is, for example, a display 102.

(5.3.1) Presentation processing
The presentation processing of Modification 3 will be described with reference to the flowchart shown in FIG.

The controller 10a executes steps S110 to S113 in the same manner as in FIG.

After step S113, the controller 10a executes image generation (S114).

Specifically, the controller 10a converts the shooting signal generated by the camera 103 into a shooting image.
The controller 10a generates a text image as in FIG.

After step S114, the controller 10a executes the determination of the presentation mode (S115).

Specifically, the processor 12a determines how the text image and the captured image are presented on the display 102.
For example, as in FIG., the processor 12a determines the position corresponding to the arrival direction of the audio signal related to the text image as the presentation position of the text image, and determines the type of the text image to be presented according to the arrival direction.
The processor 12a determines the presentation position of the captured image and the type of the captured image to be presented according to the arrival direction.

After step S115, the controller 10a executes the image presentation (S116).

Specifically, the processor 12a superimposes the text image generated in step S114 on the captured image and presents it on the display 102 in the determined presentation mode.

(5.3.2) First Example of Display Device of Modification 3
The first example of the display device of Modification 3 will be described. The first example of the display device of Modification 3 shows an example in which the display device 1a includes a glass-type display device. FIG. 12 is a schematic diagram showing a first example of the display device of Modification 3 and a presentation example of the display device. FIG. 13 is a schematic diagram showing a shooting range by the camera shown in FIG.

In the example shown in FIG. 12, the camera 103 is arranged on the bridge 23 so as to capture an area including the wearer's field of view. The camera 103 is set so that the shooting range includes the field of view of the wearer.
In FIG. 13, the solid line represents the shooting range by the camera 103, and the broken line represents the field of view of the wearer. According to the example shown in FIG. 13, the camera 103 is capable of capturing a view that is in the field of view of the wearer. As a result, when the wearer's field of view includes the speakers P2 to P4, the camera 103 takes a picture of the speakers P2 to P4.

The controller 10a executes steps S110 to S114 shown in FIG. 4, as described in the modified example 3.

After step S114, the controller 10a executes the determination of the presentation mode (S115).

Specifically, the processor 12a determines the presentation position of the text image T1, which is generated based on the voice signal extracted for the predetermined arrival direction, to be the position corresponding to the arrival direction and to the predetermined elevation angle direction. That is, the processor 12a sets the position of the display 102-1 in the direction corresponding to the arrival direction and the predetermined elevation angle direction as the presentation position of the text image T1. Further, the processor 12a determines to present the text image T1 so that the text image T1 is formed at a predetermined distance from the wearer.
The processor 12a determines the presentation position of the text images T2 and T3, which are generated based on the voice signal extracted for the predetermined arrival direction, to be the position corresponding to the arrival direction and to the predetermined elevation angle direction. That is, the processor 12a sets the position of the display 102-2 in the direction corresponding to the arrival direction and the predetermined elevation angle direction as the presentation position of the text images T2 and T3. Further, the processor 12a determines to present the text images T2 and T3 so that the text images T2 and T3 are formed at a predetermined distance from the wearer.
The processor 12a determines the presentation position of the captured image based on the imaging direction of the camera 103. Further, the processor 12a determines to present the photographed image so that the photographed image is formed at a predetermined distance from the wearer.
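The combination of an arrival direction, a predetermined elevation angle, and a predetermined distance can be read as a point in spherical coordinates around the wearer. The following minimal sketch converts such a triple into Cartesian coordinates. The axis convention (x to the right, y upward, z forward) and the function name `virtual_image_position` are assumptions for illustration; the disclosure does not define a coordinate system.

```python
import math
from typing import Tuple


def virtual_image_position(
    azimuth_deg: float, elevation_deg: float, distance_m: float
) -> Tuple[float, float, float]:
    """Convert (azimuth, elevation, distance) to wearer-centered (x, y, z).

    Assumed convention: azimuth 0 is straight ahead and positive to the
    right; elevation 0 is horizontal and positive upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.sin(az)  # right of the wearer
    y = distance_m * math.sin(el)                 # above eye level
    z = distance_m * math.cos(el) * math.cos(az)  # in front of the wearer
    return (x, y, z)
```

Because the distance is held fixed, every text image produced this way lies on the same sphere around the wearer, which matches the description of forming the text images at a predetermined distance.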

After step S115, the controller 10a executes image presentation (S116).
Specifically, the processor 12a superimposes the text image generated in step S114 on the captured image and presents it on the display 102 in the determined presentation mode.

According to the example shown in FIG. 12, the processor 12a presents the captured image on the display 102-1 and the display 102-2. Thereby, for example, the image I1 of the speaker P2 taken as shown in FIG. 12 is presented on the display 102-1, and the images I2 and I3 of the speakers P3 and P4 are presented on the display 102-2.
The processor 12a superimposes and presents the text image T1 on the captured image at a position on the display 102-1 in the direction corresponding to the arrival direction of the utterance sound and in the predetermined elevation angle direction.
The processor 12a superimposes the text images T2 to T3 on the captured image and presents the text images T2 to T3 at positions on the display 102-2 in the direction corresponding to the arrival direction of the utterance sound and in the predetermined elevation angle direction.

By presenting the images I1 to I3 and the text images T1 to T3 on the displays 102-1 to 102-2 in this way, the wearer P1 of the display device 1a is presented with the text image T1, which is the conversation content spoken by the speaker P2, together with the image I1 representing the speaker P2. The wearer P1 is presented with the text image T2, which is the conversation content spoken by the speaker P3, together with the image I2 representing the speaker P3. The wearer P1 is presented with the text image T3, which is the conversation content spoken by the speaker P4, together with the image I3 representing the speaker P4.

(5.3.3) Second Example of Display Device of Modification 3
A second example of the display device of Modification 3 will be described. The second example of the display device of Modification 3 shows an example in which the display device 1a is connected to a microphone module including a plurality of microphones 101. FIG. 14 is a schematic diagram showing a second example of the display device of Modification 3 and a presentation example of the display device.

As shown in FIG. 14, in the second example of the display device of the modification 3, the microphone 101 is not arranged in the frame of the glass type display device 1a.

The controller 10a executes steps S110 to S116 shown in FIG. 4, as described in the first example of the display device of the modification example 3.

As a result, as shown in FIG. 14, the image I1 is presented on the display 102-1, and the images I2 and I3 are presented on the display 102-2. Further, the text image T1 is superimposed and presented at a position corresponding to the arrival direction of the display 102-1. Further, the text images T2 and T3 are superimposed and presented at positions corresponding to the arrival direction of the display 102-2.

(5.3.4) Third Example of Display Device of Modification 3
A third example of the display device of Modification 3 will be described. The third example of the display device of Modification 3 shows an example in which the display device 1a includes a mobile terminal. FIG. 15 is a schematic diagram showing a third example of the display device of Modification 3 and a presentation example of the display device.

In the example shown in FIG. 15, a camera arranged on the surface opposite to the surface on which the display 102 is arranged is used as the camera 103, so as to capture an area including the field of view of the user P1.

The controller 10a executes steps S110 to S114 shown in FIG. 4, as described in the first example of the display device of the modification 3.

After step S114, the controller 10a executes the determination of the presentation mode (step S115).

Specifically, the processor 12a determines the presentation position of the text images T1 to T3 generated based on the audio signal extracted in the predetermined arrival direction as the position in the direction corresponding to the arrival direction. According to the example adopted for the mobile terminal, the processor 12a sets the position of the display 102 of the mobile terminal in the direction corresponding to the arrival direction as the presentation position of the text images T1 to T3. Further, the processor 12a determines to present the text images T1 to T3.
The processor 12a determines the presentation position of the captured image based on the imaging direction of the camera 103. Further, the processor 12a determines to present the captured image.

After step S115, the controller 10a executes the image presentation (S116).

According to the example shown in FIG. 15, the processor 12a presents the captured image on the display 102. Thereby, for example, the speaker images I1 to I3 taken as shown in FIG. 15 are presented on the display 102. The processor 12a presents the text images T1 to T3 at positions on the display 102 of the mobile terminal in the direction corresponding to the arrival direction of the utterance sound.

By presenting the images I1 to I3 and the text images T1 to T3 on the display 102 in this way, the text image T1, which is the conversation content spoken by the speaker P2, is presented to the user P1 of the display device 1a together with the image I1 representing the speaker P2. The text image T2, which is the conversation content spoken by the speaker P3, is presented to the user P1 together with the image I2 representing the speaker P3. The text image T3, which is the conversation content spoken by the speaker P4, is presented to the user P1 together with the image I3 representing the speaker P4.

(5.3.5) Fourth Example of Display Device of Modification 3
A fourth example of the display device of Modification 3 will be described. The fourth example of the display device of Modification 3 shows an example in which the display device 1a is adopted in a conference system. FIG. 16 is a schematic diagram showing a fourth example of the display device of Modification 3 and a presentation example of the display device.

In the fourth example of the display device of the third modification, the conference system is a system that presents the utterance sound collected during the conference to the display as a text image at a position corresponding to the arrival direction.

The display 102 is arranged at a position where the conference participants can see it.

The camera 103 is arranged at a position where the conference participants can be photographed. In the example shown in FIG. 16, the camera 103 is located above the display 102. The camera 103 photographs the conference participants P2 to P4 who are having a conference.

The microphone module 101a is placed in any of the positions shown below:
・ Conference tabletop
・ Mid-air position suspended from the ceiling
When the microphone module 101a is placed in a predetermined position, it performs regulation with the controller 10a.

The controller 10a executes steps S110 to S116 shown in FIG. 4, as described in the third example of the display device of the modification 3.

According to the example shown in FIG. 16, the processor 12a presents the captured image on the display 102. As a result, the images I1 to I3 obtained by capturing the conference participants P2 to P4 are presented on the display 102. The processor 12a presents the text images T1 to T3 at positions on the display 102 in the direction corresponding to the arrival direction of the utterance sound.

By presenting the images I1 to I3 and the text images T1 to T3 on the display 102 in this way, the text image T1, which is the conversation content spoken by the conference participant P2, is presented together with the image I1 representing the conference participant P2. The text image T2, which is the conversation content spoken by the conference participant P3, is presented together with the image I2 representing the conference participant P3. The text image T3, which is the conversation content spoken by the conference participant P4, is presented together with the image I3 representing the conference participant P4.

According to the third modification, the captured image is presented, and the text image corresponding to the utterance sound collected by the microphone 101 can be presented, in a presentation mode according to the arrival direction, in alignment with the speaker image included in the captured image. This makes it possible to improve the visibility of the relationship between the sound source (for example, the speaker) and the text image.

(5.4) Modification 4
A modification 4 of the present embodiment will be described. Modification 4 shows an example in which the function of the controller is realized by the server device. FIG. 17 is a schematic view showing the configuration of the display device of the modified example 4.

As shown in FIG. 17, the display device 1b includes a plurality of microphones 101, a display 102, and a server device 10b.

The server device 10b is an information processing device that controls the display device 1b. The server device 10b is connected to the network by wire or wirelessly.

As shown in FIG. 17, the server device 10b includes a storage device 11, a processor 12b, an input / output interface 13, and a communication interface 14b.

The processor 12b is configured to realize the function of the server device 10b by activating the program stored in the storage device 11. The processor 12b is an example of a computer. For example, the processor 12b realizes a function of activating a program stored in the storage device 11 to present a text image based on the utterance sound collected by the microphone 101 to a predetermined position on the display 102.

The communication interface 14b is configured to control communication via a network between the display device 1b, the microphone 101, and the display 102.

In the modification 4, the server device 10b executes steps S110 to S116 in the same manner as in FIG.

According to the fourth modification, even if the terminal side is not provided with a processor capable of complex computation, the text image corresponding to the utterance sound collected by the microphone 101 can be presented in a presentation mode according to the arrival direction.

(5.5) Modification 5
A modification 5 of the present embodiment will be described. Modification 5 shows an example in which the display device of modification 4 includes a camera. FIG. 18 is a schematic view showing the configuration of the display device of the modified example 5. FIG. 19 is a schematic diagram of a conference system, which is an example of the display device shown in FIG.

As shown in FIG. 18, the display device 1c includes a plurality of microphones 101, a display 102, a camera 103, and a server device 10c.

The server device 10c is a device that controls the display device 1c. The server device 10c is connected to the network by wire or wirelessly.

As shown in FIG. 18, the server device 10c includes a storage device 11, a processor 12c, an input / output interface 13, and a communication interface 14c.

The processor 12c is configured to realize the function of the server device 10c by activating the program stored in the storage device 11. The processor 12c is an example of a computer. For example, the processor 12c realizes a function of activating a program stored in the storage device 11 to present a text image based on the utterance sound collected by the microphone 101 to a predetermined position on the display 102.

The communication interface 14c is configured to control communication via a network between the display device 1c and the microphone 101, the display 102, and the camera 103.

In the conference system shown in FIG. 19, a conference held remotely is photographed and the utterance sound of the conference is collected. The conference system presents the captured image on the display and presents the text image based on the utterance sound at the position of the display according to the arrival direction of the utterance sound. Hereinafter, a conference held remotely is referred to as a remote conference.

The display 102 is arranged at a position visible to at least one of the following persons.
・ Person who participates in the remote conference
・ Person who monitors the remote conference

The camera 103 is arranged at a position where a remote conference can be photographed. According to the example shown in FIG. 19, the camera 103 captures the conference participants P2 to P4 participating in the remote conference. The camera 103 shoots and generates a shooting signal. The camera 103 transmits a shooting signal to the server device 10c via the network.

The microphone module 101a is placed in one of the positions shown below at which the utterance sound of the remote conference can be collected.
・ Conference tabletop
・ Mid-air position suspended from the ceiling
When the microphone module 101a is placed in a predetermined position, it performs regulation with the server device 10c.

In FIG. 19, the server device 10c executes steps S110 to S116 in the same manner as in FIG.

According to the example shown in FIG. 19, the processor 12c presents the captured image on the display 102. As a result, the images I1 to I3 obtained by capturing the conference participants P2 to P4 are presented on the display 102. The processor 12c presents the text images T1 to T3 at positions on the display 102 in the direction corresponding to the arrival direction of the utterance sound.

According to the fifth modification, the captured image is presented, and the text image corresponding to the utterance sound collected by the microphone 101 can be presented, in a presentation mode according to the arrival direction, in alignment with the speaker image included in the captured image.

(6) Other Modifications
In the present embodiment, the case where the user's instruction is input from the input device connected to the input / output interface 13 has been described, but the present embodiment is also applicable when the user's instruction is input from a drive button object presented by an application of a computer (for example, a smartphone) connected to the communication interface 14.

The display device 1 may be realized by any method as long as the image can be presented to the user. The display device 1 can be realized by, for example, the following implementation method.
・ HOE (Holographic Optical Element) or DOE (Diffractive Optical Element) using an optical element (for example, a light guide plate)
・ Liquid crystal display
・ Retinal projection display
・ LED (Light Emitting Diode) display
・ Organic EL (Electro Luminescence) display
・ Laser display
・ A display that guides the light emitted from a light emitter by using optical elements (for example, a lens, mirror, diffraction grating, liquid crystal, MEMS mirror, or HOE)
In particular, a retinal projection display makes it easy for even a person with low vision to observe an image. Therefore, it is possible to make a person suffering from both deafness and amblyopia more easily aware of the direction of arrival of the utterance sound.

In the present embodiment, the case where the display device 1a includes the camera 103 has been described as an example, but the present embodiment can also be applied to the case where the display device 1 includes a sensor configured to sense. The sensor is, for example, at least one of the following.
・ Human sensor
・ TOF (Time Of Flight) sensor
・ Millimeter wave radar
・ LiDAR (Light Detection And Ranging)
・ Image sensor
When the display device 1 includes the sensor, for example, the input / output interface 13 acquires a sensing signal generated by the sensor. The processor 12 determines the presentation mode of the text image in step S115 based on the acquired sensing signal. This makes it possible to improve the accuracy with which the text image is presented.
The sensing signal is, for example, a shooting signal obtained when a camera equipped with an image sensor shoots the region from which the plurality of microphones collect sound.

In the present embodiment, the case where the presentation position of the text image is determined based on the arrival direction of the utterance sound even when there is a captured image has been described, but the present embodiment is also applicable when the processors 12a and 12c determine the presentation position of the text image in association with an image of a speaker located within a predetermined range from the arrival direction of the utterance sound.
Specifically, for example, the processors 12a and 12c determine the presentation position of the captured image based on the imaging direction of the camera 103. The processors 12a and 12c associate the arrival direction of the utterance sound with the position of the speaker included in the captured image. The processors 12a and 12c determine the presentation position of the text images T1 to T3 generated based on the audio signal extracted in the predetermined arrival direction as the position in the vicinity of the speaker associated with the arrival direction.
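The association between an arrival direction and a speaker in the captured image can be sketched as a nearest-neighbor search limited to the predetermined range. In the following minimal example, the speaker positions are assumed to have already been converted into azimuth angles (for example, by image analysis), and the names `associate_speaker` and `max_offset_deg` are hypothetical.

```python
from typing import Optional, Sequence


def associate_speaker(
    arrival_deg: float,
    speaker_azimuths_deg: Sequence[float],
    max_offset_deg: float,
) -> Optional[int]:
    """Return the index of the speaker closest to the arrival direction.

    Only speakers within `max_offset_deg` of the arrival direction are
    considered. Returns None when no speaker lies within that range, in
    which case a caller could fall back to a position based on the
    arrival direction alone.
    """
    best_index = None
    best_offset = max_offset_deg
    for index, speaker_deg in enumerate(speaker_azimuths_deg):
        # Angular difference wrapped into the range [0, 180] degrees.
        offset = abs((arrival_deg - speaker_deg + 180.0) % 360.0 - 180.0)
        if offset <= best_offset:
            best_index, best_offset = index, offset
    return best_index
```

The text image for the extracted audio signal would then be placed in the vicinity of the speaker at the returned index.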

In the present embodiment, an example of extracting an amplified or attenuated audio signal by beamforming has been described as a method of extracting an audio signal, but the scope of the present embodiment is not limited to this. The extraction of the audio signal of the present embodiment can also be realized by the following method.
・ Frost beamformer
・ Adaptive filter beamforming (for example, generalized sidelobe canceller)
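For orientation, the basic idea shared by these beamforming methods, aligning the microphone channels for one arrival direction so that sound from that direction adds up coherently, can be shown with a plain delay-and-sum beamformer. The sketch below is neither the Frost beamformer nor a generalized sidelobe canceller; it assumes integer sample delays that are already known for the target direction, and the name `delay_and_sum` is chosen for this example.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Average the channels after compensating per-microphone delays.

    channels[m][n] is sample n of microphone m. delays[m] is the
    non-negative number of samples by which microphone m lags the
    reference for the target direction. Sound from that direction is
    reinforced; sound from other directions is partially cancelled.
    """
    if len(channels) != len(delays) or not channels:
        raise ValueError("need one delay per channel and at least one channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Number of output samples for which every channel has data.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    output = []
    for n in range(max(0, usable)):
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output
```

When the delays match the true arrival direction, the output reproduces the source signal; with mismatched delays, the averaged signal has lower energy, which is the property that direction-selective extraction relies on.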

In the present embodiment, an example in which the presentation mode of the text image includes the presentation position and the type of the text image has been described, but the present embodiment is also applicable when the presentation mode includes, for example, the following:
・ Font
・ Character color
・ Pictogram
When the presentation mode includes a font, character color, pictogram, or the like, the processor 12 may, instead of presenting the text image at a position corresponding to the arrival direction of the utterance sound, present the text image on the display 102 in a color, font, or the like according to the arrival direction.
In the present embodiment, a case where text is created from a voice signal by voice recognition has been described. In the present embodiment, the processor 12 may also estimate an attribute of the speaker (hereinafter referred to as "speaker attribute") by, for example, voice analysis of the utterance sound collected by the microphone 101 or image analysis of an image captured by the camera 103. Speaker attributes include, for example:
・ Mood
・ Gender
・ Age
Based on the estimated speaker attributes, the processor 12 determines the presentation mode of the text image, for example, the font, the character color, and the pictogram. As a result, the wearer of the display device 1 can easily recognize the speaker attribute.
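A presentation mode built from font, character color, and pictogram could be selected with simple lookup tables, as in the following minimal sketch. All concrete values (the sector boundaries, color names, pictogram strings, and mood labels) are hypothetical; the disclosure names only the categories, not their contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextStyle:
    font: str
    color: str
    pictogram: str


# Hypothetical lookup tables; the source does not specify concrete values.
COLOR_BY_SECTOR = {"left": "blue", "front": "white", "right": "orange"}
PICTOGRAM_BY_MOOD = {"happy": ":-)", "angry": ">:(", "neutral": ""}


def sector_of(azimuth_deg: float, front_half_width_deg: float = 15.0) -> str:
    """Classify an arrival azimuth into a coarse sector (assumed thresholds)."""
    if azimuth_deg < -front_half_width_deg:
        return "left"
    if azimuth_deg > front_half_width_deg:
        return "right"
    return "front"


def style_for(azimuth_deg: float, mood: str) -> TextStyle:
    """Pick color from the arrival direction and font/pictogram from the mood."""
    color = COLOR_BY_SECTOR[sector_of(azimuth_deg)]
    pictogram = PICTOGRAM_BY_MOOD.get(mood, "")
    font = "bold" if mood == "angry" else "regular"
    return TextStyle(font=font, color=color, pictogram=pictogram)
```

In this sketch the color encodes the arrival direction while the font and pictogram encode the estimated mood, so both pieces of information remain visible even when the text image is not placed at a direction-dependent position.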

In the present embodiment, the case where the captured image captured by the camera 103 is transmitted to the server device 10c via the network has been described, but the present embodiment is also applicable when the captured image captured by the camera 103 is not transmitted to the server device 10c. In this case, the captured image captured by the camera 103 is presented on the display 102.

In the present embodiment, the processor 12 may apply voice analysis processing to the input voice signal, the voice signal being processed, or the processed voice signal to extract the voice of the utterance sound from the acquired voice, estimate the arrival direction of the extracted voice, and present the text image corresponding to the extracted voice. As a result, for voice that includes sounds other than the utterance sound (for example, environmental sound), the processing of the environmental sound is omitted, so that the processing load of the information processing device can be suppressed.
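How skipping non-utterance sound reduces the processing load can be illustrated with a frame-level gate. The following minimal sketch uses a simple RMS energy threshold as a stand-in for the voice analysis process, which is an assumption for illustration; an energy threshold cannot by itself distinguish speech from loud environmental sound, and the disclosure does not state which analysis is used. The names `frame_rms` and `select_utterance_frames` are hypothetical.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one non-empty frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def select_utterance_frames(
    samples: Sequence[float], frame_len: int, threshold: float
) -> List[int]:
    """Return the indices of frames whose RMS level reaches the threshold.

    Only the selected frames would be passed on to direction estimation
    and text generation; the remaining frames are skipped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    selected = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= threshold:
            selected.append(start // frame_len)
    return selected
```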

In the present embodiment, the case of using the voice recognition model stored in the storage device 11 has been described, but the present embodiment is also applicable when a voice recognition model stored in a server connectable via the communication interface 14 is used. In this case, steps S111 to S115 in FIG. 5 are executed by the processor of the server.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to the above embodiments. Further, the above-described embodiment can be variously improved or modified without departing from the gist of the present invention. Further, the above embodiments and modifications can be combined.

(7) Addendum
The matters described in the embodiment are added below.

(Appendix 1)
An information processing device (for example, the controller 10) comprising:
a means for acquiring voice collected by a plurality of microphones 101 (for example, the processor 12 executing step S110);
a means for estimating the arrival direction of the acquired voice (for example, the processor 12 executing step S111);
a means for generating a text image corresponding to the acquired voice (for example, the processor 12 executing step S114);
a means for determining the presentation mode of the text image with reference to the estimated arrival direction (for example, the processor 12 executing step S115); and
a means for presenting the text image in the determined presentation mode (for example, the processor 12 executing step S116).

According to (Appendix 1), the direction of arrival of voice can be easily recognized.
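As one possible illustration of the means for estimating the arrival direction, the following minimal sketch estimates the time difference of arrival between two microphones by cross-correlation and converts it to an angle under a far-field assumption. This technique, the two-microphone setup, the speed of sound constant, and the names `best_lag` and `arrival_angle_deg` are assumptions for this example and are not stated in the disclosure as the estimation method.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at about 20 degrees C


def best_lag(ref: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means `other` receives the sound later than `ref`.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, value in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += value * other[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(
    lag_samples: int, sample_rate_hz: float, mic_spacing_m: float
) -> float:
    """Convert a lag into an angle from broadside (far-field assumption).

    0 degrees means the source is equidistant from both microphones; a
    positive angle means the source is on the side of the reference
    microphone.
    """
    tau = lag_samples / sample_rate_hz
    ratio = SPEED_OF_SOUND_M_S * tau / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against measurement noise
    return math.degrees(math.asin(ratio))
```

With more than two microphones, the same pairwise estimate could be computed for several microphone pairs and combined, which is one reason a plurality of microphones is useful for direction estimation.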

(Appendix 2)
The information processing apparatus according to (Appendix 1), wherein the means for determining the presentation mode determines the presentation mode in which the text image is presented at a position corresponding to the estimated arrival direction.

According to (Appendix 2), the direction of arrival of voice can be recognized more easily.

(Appendix 3)
The information processing device according to (Appendix 1) or (Appendix 2), further comprising a means for extracting the voice corresponding to the estimated arrival direction from the acquired voice (for example, the processor 12 executing step S112),
wherein the means for generating a text image generates a text image corresponding to the extracted voice.

According to (Appendix 3), it is possible to accurately recognize the direction of arrival of voice.

(Appendix 4)
The information processing device according to any one of (Appendix 1) to (Appendix 3), further comprising a means for estimating a speaker attribute by analyzing the acquired voice,
wherein the means for determining the presentation mode determines the presentation mode of the text image with reference to the estimated speaker attribute.

According to (Appendix 4), the speaker attribute can be easily recognized.

(Appendix 5)
The information processing device according to any one of (Appendix 1) to (Appendix 4), further comprising a means (for example, the input / output interface 13) for acquiring a sensing signal obtained by using a sensor to sense a region from which the plurality of microphones collect voice,
wherein the means for determining the presentation mode determines the presentation mode of the text image with reference to the acquired sensing signal.

According to (Appendix 5), the accuracy of presenting the text image can be improved.

(Appendix 6)
The information processing device according to (Appendix 5), wherein the sensing signal is a shooting signal obtained by shooting the region using an image sensor.

According to (Appendix 6), the accuracy of presenting the text image can be improved.

(Appendix 7)
The information processing device according to any one of (Appendix 1) to (Appendix 5), further comprising:
a means for acquiring a shooting signal obtained by shooting a region (for example, the input / output interface 13a); and
a means for converting the acquired shooting signal into a captured image (for example, the processor 12 executing step S114),
wherein the means for presenting superimposes the text image on the captured image and presents it.

According to (Appendix 7), it is possible to improve the visibility of the relationship between the voice source (for example, the speaker) and the text image.

(Appendix 8)
The information processing device according to (Appendix 6) or (Appendix 7), further comprising a means for estimating a speaker attribute by analyzing the shooting signal,
wherein the means for determining the presentation mode determines the presentation mode of the text image with reference to the estimated speaker attribute.

According to (Appendix 8), the speaker attribute can be easily recognized.

(Appendix 9)
The information processing device according to any one of (Appendix 1) to (Appendix 8), further comprising a means for extracting, from the acquired voice, the voice of the utterance sound emitted from a person,
wherein the means for estimating the arrival direction estimates the arrival direction of the extracted voice, and
the means for generating a text image generates a text image corresponding to the extracted voice.

According to (Appendix 9), for voice that includes sounds other than the utterance sound (for example, environmental sounds), the processing of the environmental sounds is omitted, so that the processing load of the information processing device can be suppressed.

(Appendix 10)
A display device 1 comprising:
a means for acquiring voice collected by a plurality of microphones 101 (for example, the processor 12 executing step S110);
a means for estimating the arrival direction of the acquired voice (for example, the processor 12 executing step S111);
a means for generating a text image corresponding to the acquired voice (for example, the processor 12 executing step S114);
a means for determining the presentation mode of the text image with reference to the estimated arrival direction (for example, the processor 12 executing step S115); and
a means for presenting the text image in the determined presentation mode (for example, the processor 12 executing step S116).

According to (Appendix 10), the direction of arrival of voice can be easily recognized.

(Appendix 11)
The display device according to (Appendix 10), wherein the display device is at least one of a glass type display device, a mobile terminal, and a conference system.

According to (Appendix 11), the direction of arrival of voice can be easily recognized in various uses.

(Appendix 12)
The display device according to (Appendix 10) or (Appendix 11), wherein the display device is a retinal projection type display device.

According to (Appendix 12), a person suffering from both deafness and amblyopia can easily recognize the direction of arrival of voice.

(Appendix 13)
A program for causing a computer (for example, a processor 12) to realize the means according to any one of (Appendix 1) to (Appendix 12).

According to (Appendix 13), the direction of arrival of voice can be easily recognized.

(Appendix 14)
A presentation method for presenting an image corresponding to voice, the method comprising:
a step of acquiring voice collected by a plurality of microphones (for example, step S110);
a step of estimating the arrival direction of the acquired voice (for example, step S111);
a step of generating a text image corresponding to the acquired voice (for example, step S114);
a step of determining the presentation mode of the text image with reference to the estimated arrival direction (for example, step S115); and
a step of presenting the text image in the determined presentation mode (for example, step S116).

According to (Appendix 14), the direction of arrival of voice can be easily recognized.
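The order of the steps in the presentation method can be summarized with the following minimal skeleton. Each processing stage is passed in as a callable so that the sketch stays independent of any particular algorithm. The function name `run_presentation`, the callable signatures, and the use of a plain string in place of a rendered text image are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Frames = Sequence[Sequence[float]]  # one sample sequence per microphone


def run_presentation(
    frames: Frames,
    estimate_directions: Callable[[Frames], List[float]],
    extract_voice: Callable[[Frames, float], Sequence[float]],
    recognize: Callable[[Sequence[float]], str],
    decide_mode: Callable[[float], Dict[str, float]],
    present: Callable[[str, Dict[str, float]], None],
) -> List[Tuple[str, Dict[str, float]]]:
    """Run one pass over already acquired frames (acquisition is S110)."""
    presented = []
    for direction in estimate_directions(frames):  # S111: arrival directions
        voice = extract_voice(frames, direction)    # S112: per-direction voice
        text = recognize(voice)                     # text via voice recognition
        if not text:
            continue                                # nothing to present
        mode = decide_mode(direction)               # S115: presentation mode
        present(text, mode)                         # S114/S116: render and show
        presented.append((text, mode))
    return presented
```

Because every stage is injected, the same skeleton could run on a controller in the display device or on a server device, depending on where the callables are implemented.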

1: Glass type display device
1: Display device
10: Controller
11: Storage device
12: Processor
13: Input / output interface
21: Right temple
22: Right twist
23: Bridge
24: Left twist
25: Left temple
26: Rim
101: Microphone
102: Display
103: Camera

Claims

An information processing device comprising:
a means for acquiring voice collected by a plurality of microphones;
a means for estimating an arrival direction of the acquired voice;
a means for generating a text image corresponding to the acquired voice;
a means for determining a presentation mode of the text image with reference to the estimated arrival direction; and
a means for presenting the text image in the determined presentation mode.
The information processing apparatus according to claim 1, wherein the means for determining the presentation mode determines the presentation mode for presenting the text image at a position corresponding to the estimated arrival direction.
The information processing device according to claim 1 or 2, further comprising a means for extracting the voice corresponding to the estimated arrival direction from the acquired voice,
wherein the means for generating the text image generates a text image corresponding to the extracted voice.
The information processing device according to any one of claims 1 to 3, further comprising a means for estimating a speaker attribute by analyzing the acquired voice,
wherein the means for determining the presentation mode determines the presentation mode of the text image with reference to the estimated speaker attribute.
The information processing device according to any one of claims 1 to 4, further comprising a means for acquiring a sensing signal obtained by using a sensor to sense a region from which the plurality of microphones collect voice,
wherein the means for determining the presentation mode determines the presentation mode of the text image with reference to the acquired sensing signal.
The information processing device according to claim 5, wherein the sensing signal is a photographing signal obtained by photographing the region with an image sensor.
The information processing device according to any one of claims 1 to 5, further comprising:
means for acquiring a photographing signal obtained by photographing the region; and
means for converting the acquired photographing signal into a photographed image,
wherein the means for presenting the text image presents the text image superimposed on the photographed image.
The information processing device according to claim 6 or 7, further comprising means for estimating a speaker attribute by analyzing the photographing signal,
wherein the means for determining the presentation mode determines the presentation mode of the text image with reference to the estimated speaker attribute.
The information processing device according to any one of claims 1 to 8, further comprising means for extracting, from the acquired voice, the voice of an utterance sound uttered by a person,
wherein the means for estimating the arrival direction estimates the arrival direction of the extracted voice, and
the means for generating the text image generates a text image corresponding to the extracted voice.
A display device comprising:
means for acquiring voice collected by a plurality of microphones;
means for estimating an arrival direction of the acquired voice;
means for generating a text image corresponding to the acquired voice;
means for determining a presentation mode of the text image with reference to the estimated arrival direction; and
means for presenting the text image in the determined presentation mode.
The display device according to claim 10, wherein the display device is at least one of a glass-type display device, a mobile terminal, and a conference system.
The display device according to claim 10 or 11, wherein the display device is a retinal projection type display device.
A program for causing a computer to realize each of the means according to any one of claims 1 to 12.
A presentation method for presenting an image corresponding to voice, the method comprising:
a step of acquiring voice collected by a plurality of microphones;
a step of estimating an arrival direction of the acquired voice;
a step of generating a text image corresponding to the acquired voice;
a step of determining a presentation mode of the text image with reference to the estimated arrival direction; and
a step of presenting the text image in the determined presentation mode.