WO2022270456A1

WO2022270456A1 - Display control device, display control method, and program

Info

Publication number: WO2022270456A1
Application number: PCT/JP2022/024487
Authority: WO
Inventors: 愛実田畑; 晴輝西村; 彰遠藤; 恭寛羽原; 蔵酒五味; 優大平良
Original assignee: ピクシーダストテクノロジーズ株式会社; 住友ファーマ株式会社
Priority date: 2021-06-21
Filing date: 2022-06-20
Publication date: 2022-12-29
Also published as: JPWO2022270456A1; US20240119684A1

Abstract

A display control device for controlling the display of a display device acquires sounds collected by a plurality of microphones, and estimates the arrival directions of the acquired sounds. The display control device displays text images corresponding to the acquired sounds in a predetermined text display region in a display part of the display device, and displays symbol images associated with the text images at display positions within the display part, the display positions corresponding to the estimated arrival directions.

Description

Display control device, display control method, and program

The present disclosure relates to a display control device, a display control method, and a program.

Hearing-impaired people may have a reduced ability to perceive the direction of arrival of sound due to a decline in auditory function. When such a hearing-impaired person tries to have a conversation with a plurality of people, it is difficult to accurately recognize who is speaking what, and communication is hindered.
Patent Literature 1 discloses a head-mounted display device for assisting hearing-impaired persons in recognizing ambient sounds. This device displays the results of speech recognition of ambient sounds, collected using multiple microphones, as text information in part of the wearer's field of vision, thereby allowing the wearer to visually recognize the surrounding sounds.

JP 2007-334149 A

For display devices that display text images corresponding to voice, a display method that is highly convenient for users is required. For example, when a plurality of people are having a conversation around a user, if the user can not only recognize the content of each utterance but also easily recognize who made it, communication involving the user will become smoother.

An object of the present disclosure is to provide a user-friendly display method for a display device that displays a text image corresponding to voice.

A display control device according to the present disclosure has, for example, the following configuration. That is, a display control device for controlling the display of a display device comprises: acquisition means for acquiring sounds collected by a plurality of microphones; estimation means for estimating the direction of arrival of the sounds acquired by the acquisition means; and display control means for displaying a text image corresponding to the sounds acquired by the acquisition means in a predetermined text display area in a display section of the display device, and for displaying a symbol image associated with the text image at a display position in the display section corresponding to the direction of arrival estimated by the estimation means.

FIG. 1 is a diagram showing a configuration example of a display device.
FIG. 2 is a diagram showing an overview of the display device.
FIG. 3 is a diagram showing the functions of the display device.
FIG. 4 is a flowchart showing an example of processing by a controller.
FIG. 5 is a diagram for explaining sound collection by a microphone.
FIG. 6 is a diagram for explaining the arrival direction of a sound.
FIG. 7 is a diagram showing an example of a display on the display device.
FIG. 8 is a diagram showing an example of a display on the display device.
FIG. 9 is a diagram showing an example of a display on the display device.
The remaining figures are, in order, two diagrams each showing an example of change in display on the display device, and a diagram showing an example of a table that associates sound sources with symbols.

Hereinafter, one embodiment of the present invention will be described in detail based on the drawings. In the drawings for describing the embodiments, in principle, the same constituent elements are denoted by the same reference numerals, and repeated description thereof will be omitted.

(1) Configuration of Information Processing Apparatus
The configuration of the display device 1 of this embodiment will be described. FIG. 1 is a diagram showing a configuration example of a display device according to this embodiment. FIG. 2 is a diagram showing an outline of a glass-type display device, which is an example of the display device shown in FIG.

The display device 1 shown in FIG. 1 is configured to acquire speech and display a text image corresponding to the acquired speech in a manner that allows the direction of arrival of the speech to be identified.
Forms of the display device 1 include, for example, at least one of the following.
・Glass-type display device
・Head-mounted display
・PC
・Tablet terminal

As shown in FIG. 1, the display device 1 comprises a plurality of microphones 101, a display 102, and a controller 10.
The microphones 101 are arranged so as to maintain a predetermined positional relationship with one another.

As shown in FIG. 2, when the display device 1 is a glass-type display device, the display device 1 includes a right temple 21, a right end piece 22, a bridge 23, a left end piece 24, a left temple 25, and a rim 26, and is wearable by the user.

A microphone 101-1 is arranged on the right temple 21.
A microphone 101-2 is arranged on the right end piece 22.
A microphone 101-3 is arranged on the bridge 23.
A microphone 101-4 is arranged on the left end piece 24.
A microphone 101-5 is arranged on the left temple 25.
However, the number and arrangement of the microphones 101 in the display device 1 are not limited to the example in FIG.
The microphone 101 picks up sounds around the display device 1, for example. Sounds collected by the microphone 101 include, for example, at least one of the following sounds.
・Sounds spoken by people
・Sounds of the environment where the display device 1 is used (hereinafter referred to as “environmental sounds”)

When the display device 1 is a glass-type display device, the display 102 is a transparent member (for example, at least one of glass, plastic, and half mirror). In this case, the display 102 is placed within the field of view of the user wearing the glass display device.

The displays 102-1 and 102-2 are supported by the rim 26. The display 102-1 is arranged so as to be positioned in front of the user's right eye when the user wears the display device 1. The display 102-2 is arranged so as to be positioned in front of the user's left eye when the user wears the display device 1.

The display 102 presents (for example, displays) an image under the control of the controller 10. For example, a projector (not shown) placed behind the right temple 21 projects an image onto the display 102-1, and a projector (not shown) placed behind the left temple 25 projects an image onto the display 102-2. As a result, the display 102-1 and the display 102-2 present images. The user can view the scenery transmitted through the display 102-1 and the display 102-2 at the same time as viewing the images.

It should be noted that the method by which the display device 1 presents images is not limited to the above example. For example, the display device 1 may project images directly from a projector to the user's eyes.

The controller 10 is an information processing device that controls the display device 1. The controller 10 is connected to the microphone 101 and the display 102 by wire or wirelessly.
When the display device 1 is a glass-type display device as shown in FIG. 2, the controller 10 is arranged inside the right temple 21, for example. However, the arrangement of the controller 10 is not limited to the example in FIG. 2, and the controller 10 may be configured separately from the display device 1, for example.

As shown in FIG. 1, the controller 10 includes a storage device 11, a processor 12, an input/output interface 13, and a communication interface 14.

The storage device 11 is configured to store programs and data. The storage device 11 is, for example, a combination of ROM (Read Only Memory), RAM (Random Access Memory), and storage (e.g., flash memory or hard disk).

Programs include, for example, the following programs.
・OS (Operating System) program
・Application program that executes information processing

The data includes, for example, the following data.
・Databases referenced in information processing
・Data obtained by executing information processing (that is, execution results of information processing)

The processor 12 is configured to implement the functions of the controller 10 by executing programs stored in the storage device 11. The processor 12 is an example of a computer. For example, by executing a program stored in the storage device 11, the processor 12 implements a function of presenting an image representing text (hereinafter referred to as a “text image”) corresponding to the speech sound collected by the microphone 101 at a predetermined position on the display 102. Note that the display device 1 may have dedicated hardware such as an ASIC or FPGA, and at least part of the processing of the processor 12 described in this embodiment may be executed by the dedicated hardware.

The input/output interface 13 acquires at least one of the following.
・Audio signal collected by the microphone 101
・User's instruction input from the input device connected to the controller 10
Also, the input/output interface 13 is configured to output information to an output device connected to the controller 10. An output device is, for example, the display 102.

The communication interface 14 is configured to control communication between the display device 1 and an external device (e.g., a server or mobile terminal) not shown.

(2) Overview of Functions
An overview of the functions of the display device 1 in this embodiment will be described. FIG. 3 is a diagram showing the functions of the display device.

In FIG. 3, a wearer P1 who wears the display device 1 is having a conversation with speakers P2 to P4.
A microphone 101 picks up the uttered sounds of the speakers P2 to P4.
The controller 10 estimates the direction of arrival of the collected speech sound.
The controller 10 generates a text image 301 corresponding to the collected speech sound by analyzing an audio signal corresponding to the collected speech sound.
The controller 10 displays the text image 301 on the displays 102-1 to 102-2 in such a manner that the incoming direction of the speech sound corresponding to the text image can be identified. The details of the display in which the direction of arrival can be identified will be described later with reference to FIGS. 7 to 9 and the like.

(3) Processing of Controller 10
FIG. 4 is a flowchart showing an example of processing of the controller 10. FIG. 5 is a diagram for explaining sound collection by a microphone. FIG. 6 is a diagram for explaining the arrival direction of sound.

A plurality of microphones 101 each collects the speech sound emitted by the speaker. For example, in the example shown in FIG. 2, microphones 101-1 to 101-5 are arranged on the right temple 21, right end piece 22, bridge 23, left end piece 24, and left temple 25 of the display device 1, respectively. Microphones 101-1 to 101-5 collect speech sounds arriving via the paths shown in FIG. Microphones 101-1 to 101-5 convert collected speech sounds into audio signals.

The processing shown in FIG. 4 is started when the power of the display device 1 is turned on and the initial setting is completed. However, the start timing of the processing shown in FIG. 4 is not limited to this.
The controller 10 acquires the audio signal converted by the microphone 101 (S110).

Specifically, the processor 12 acquires from the microphones 101-1 to 101-5 audio signals including speech sounds uttered by at least one of the speakers P2, P3, and P4. The audio signals obtained from the microphones 101-1 to 101-5 contain spatial information (for example, frequency characteristics, delays, etc.) based on paths along which the sound waves of the speech sound travel.

After step S110, the controller 10 performs direction-of-arrival estimation (S111).

A direction-of-arrival estimation model is stored in the storage device 11. The direction-of-arrival estimation model describes information for identifying the correlation between the spatial information included in the audio signal and the direction of arrival of the speech sound.

Any existing method may be used as a direction-of-arrival estimation method using the direction-of-arrival estimation model. For example, MUSIC (Multiple Signal Classification), which uses eigenvalue decomposition of the input correlation matrix, the minimum norm method, or ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques) may be used as the direction-of-arrival estimation method.

The processor 12 inputs the audio signals received from the microphones 101-1 to 101-5 to the direction-of-arrival estimation model stored in the storage device 11, thereby estimating the direction of arrival of the speech sound collected by the microphones 101-1 to 101-5. At this time, the processor 12 expresses the direction of arrival of the speech sound as, for example, an angle of deviation from an axis whose 0-degree direction is a reference direction defined with reference to the microphones 101-1 to 101-5 (in this embodiment, the front direction of the user wearing the display device 1). In the example shown in FIG. 6, the processor 12 estimates the direction of arrival of the speech sound emitted by the speaker P2 as an angle A1 to the right from the axis. The processor 12 estimates the direction of arrival of the speech sound emitted by the speaker P3 as an angle A2 to the left from the axis. The processor 12 estimates the direction of arrival of the speech sound emitted by the speaker P4 as an angle A3 to the left from the axis.
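As one illustration of direction-of-arrival estimation, the following minimal sketch uses a technique simpler than MUSIC or ESPRIT: it finds the time difference of arrival between two microphones by cross-correlation and converts it into an angle measured from the front direction (positive to the right). The function names, the two-microphone layout, the microphone spacing, and the speed-of-sound constant are assumptions made for illustration and are not taken from the embodiment.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def estimate_lag(left, right, max_lag):
    """Return the lag in samples by which `left` trails `right`.

    The lag is chosen to maximize the cross-correlation of the two
    equal-length signals.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        # Only indices where both left[i] and right[i - lag] exist.
        for i in range(max(0, lag), min(n, n + lag)):
            score += left[i] * right[i - lag]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def estimate_doa_deg(left, right, sample_rate, mic_distance):
    """Estimate the arrival angle in degrees (0 = front, positive = right)."""
    # Largest physically possible delay between the two microphones.
    max_lag = math.ceil(mic_distance / SPEED_OF_SOUND * sample_rate)
    lag = estimate_lag(left, right, max_lag)
    # A far-field source at angle theta reaches the right microphone
    # earlier by mic_distance * sin(theta) / SPEED_OF_SOUND seconds.
    sin_theta = SPEED_OF_SOUND * lag / (sample_rate * mic_distance)
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_theta))))
```

With five microphones as in FIG. 2, a subspace method such as MUSIC would use all channels at once; the two-channel version above is only meant to show how a delay between microphones maps to an angle from the reference axis.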

After step S111, the controller 10 executes audio signal extraction (S112).

A beamforming model is stored in the storage device 11. The beamforming model describes information for identifying a correlation between a predetermined direction and parameters for forming directivity having a beam in that direction. Here, forming the directivity is a process of amplifying or attenuating a sound coming from a specific direction of arrival.

The processor 12 inputs the estimated direction of arrival into the beamforming model stored in the storage device 11 to calculate parameters for forming directivity having a beam in the direction of arrival.

In the example shown in FIG. 6, the processor 12 inputs the calculated angle A1 into the beamforming model and calculates the parameters for forming the directivity with the beam in the direction of the angle A1 rightward from the axis. The processor 12 inputs the calculated angle A2 into the beamforming model and calculates the parameters for forming the directivity with the beam directed at the angle A2 to the left of the axis. The processor 12 inputs the calculated angle A3 into the beamforming model and calculates the parameters for forming the directivity with the beam directed at the angle A3 to the left of the axis.

The processor 12 amplifies or attenuates the audio signals acquired from the microphones 101-1 to 101-5 using the parameters calculated for the angle A1. The processor 12 extracts the audio signal for the speech sound coming from the direction represented by the angle A1 by synthesizing the amplified or attenuated audio signal.

The processor 12 amplifies or attenuates the audio signals acquired from the microphones 101-1 to 101-5 using the parameters calculated for the angle A2. The processor 12 extracts the audio signal for the speech sound coming from the direction represented by the angle A2 by synthesizing the amplified or attenuated audio signal.

The processor 12 amplifies or attenuates the audio signals acquired from the microphones 101-1 to 101-5 using the parameters calculated for the angle A3. The processor 12 extracts the audio signal for the speech sound coming from the direction represented by the angle A3 by synthesizing the amplified or attenuated audio signal.
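The extraction step can be illustrated with delay-and-sum beamforming, one classical way of forming directivity. The embodiment does not state which beamforming method is used, so the technique, the function names, and the assumption that the microphones lie on a single left-right line (positions given as x coordinates in meters, positive to the right) are all illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(mic_x, angle_deg, sample_rate):
    """Per-microphone delays (in samples) that align a plane wave
    arriving from angle_deg (0 = front, positive = right)."""
    s = math.sin(math.radians(angle_deg))
    # Relative arrival time at each microphone, in samples. A source on
    # the right reaches microphones with larger x earlier.
    arrival = [-x * s / SPEED_OF_SOUND * sample_rate for x in mic_x]
    latest = max(arrival)
    # Delay the early channels so that every channel matches the latest.
    return [round(latest - a) for a in arrival]


def delay_and_sum(channels, delays):
    """Shift each channel by its delay and average the results.

    Sound from the steered direction adds coherently (amplified relative
    to other directions); sound from elsewhere is attenuated.
    """
    n = len(channels[0])
    out = [0.0] * n
    for channel, delay in zip(channels, delays):
        for i in range(delay, n):
            out[i] += channel[i - delay]
    return [value / len(channels) for value in out]
```

In this sketch the "parameters for forming directivity" are simply the integer delays returned by steering_delays; running delay_and_sum once per estimated angle (A1, A2, A3) would yield one extracted signal per direction.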

After step S112, the controller 10 executes speech recognition (S113).

A speech recognition model is stored in the storage device 11. The speech recognition model describes information for identifying the correlation between a speech signal and the text corresponding to that speech signal. The speech recognition model is, for example, a trained model generated by machine learning.

The processor 12 inputs the extracted speech signal to the speech recognition model stored in the storage device 11 to determine the text corresponding to the input speech signal.

In the example shown in FIG. 6, the processor 12 inputs the speech signals extracted for the angles A1 to A3 to the speech recognition model respectively, thereby determining the text corresponding to the input speech signals.

After step S113, the controller 10 executes text image generation (S114).

Specifically, the processor 12 generates a text image representing the determined text.

After step S114, the controller 10 determines the display mode (S115).

Specifically, the processor 12 determines in what manner the display image including the text image is to be displayed on the display 102.

After step S115, the controller 10 executes image display (S116).

Specifically, the processor 12 displays on the display 102 a display image according to the determined display mode.

(4) Display example of display device
Below, an example of a display image according to the determination of the display mode in step S115 will be described in detail. The processor 12 causes the text image corresponding to the voice to be displayed in a predetermined text display area on the display 102, which is the display unit of the display device 1. At the same time, the processor 12 displays the symbol image associated with the text image at the display position corresponding to the direction of arrival of the speech sound corresponding to the text image.

FIG. 7 is a diagram showing an example of display on a display device. A screen 901 represents the field of view seen through the display 102 by the user wearing the display device 1. Here, the images of speaker P3 and speaker P4 are real images seen by the user through the display 102, and the window 902, symbol 905, symbol 906, and mark 907 are images displayed on the display 102. Note that the field of view seen through the display 102-1 and the field of view seen through the display 102-2 actually differ slightly in image position, but for simplicity of explanation, both fields of view are described here as being represented by the common screen 901.

A window 902 is displayed at a predetermined position within the screen 901. The window 902 displays a text image 903 generated in S114. The text image 903 is displayed in a manner in which the utterances of multiple speakers can be distinguished. For example, if speaker P3's utterance is followed by speaker P4's utterance, the text corresponding to each utterance is displayed on a separate line. As more lines of text are displayed in the window 902, the text image 903 is scrolled, hiding the text of older utterances and displaying the text of newer utterances.
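The scrolling behavior of the window 902 can be sketched as a fixed-length buffer of lines, one line per utterance. The class name, the line limit, and the plain-text rendering are assumptions for illustration only.

```python
from collections import deque


class TextWindow:
    """Keeps only the most recent utterances, one per line."""

    def __init__(self, max_lines=3):
        # A deque with maxlen silently drops the oldest line when full,
        # which models scrolling older text out of the window.
        self.lines = deque(maxlen=max_lines)

    def add_utterance(self, symbol, text):
        """Append one utterance, prefixed by the speaker's symbol."""
        self.lines.append(f"{symbol} {text}")

    def render(self):
        """Return the visible lines, oldest first."""
        return "\n".join(self.lines)
```

For example, after three utterances are added to a window created with max_lines=2, only the last two remain visible.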

Also, in the window 902, a symbol 904 is displayed so that the user can identify whose utterance each text included in the text image 903 represents. Sound sources and symbol types are associated, for example, by a table 1000 shown in FIG. The controller 10 refers to the table 1000 stored in the storage device 11 to determine the types of symbols to be displayed in the window 902. In the example of FIG. 7, a heart-shaped symbol is displayed next to the text corresponding to the utterance of speaker P3, and a face-shaped symbol is displayed next to the text corresponding to the utterance of speaker P4.

Then, on screen 901, a heart-shaped symbol 905 is displayed at a position corresponding to the direction of arrival of the voice uttered by speaker P3 (in the example of FIG. 7, a position overlapping the image of speaker P3 existing in the direction of arrival). Also, a face-shaped symbol 906 is displayed at a position corresponding to the direction of arrival of the voice uttered by speaker P4 (in the example of FIG. 7, a position overlapping the image of speaker P4 existing in the direction of arrival). The types of symbols 905 and 906 correspond to the types of symbol 904 displayed together with text image 903 in window 902. That is, the symbol 904 displayed together with the text representing the utterance of the speaker P3 in the window 902 is the same kind of symbol as the symbol 905 displayed at the position corresponding to the speaker P3 on the screen 901. With such a display, the user can easily identify whose utterance each text included in the text image 903 in the window 902 represents. Note that the controller 10 may determine the symbol type based on the voice recognition result in S113. For example, the controller 10 may estimate the emotion of the speaker by speech recognition in S113, and determine the expression and color of the symbol corresponding to the speaker based on the estimated emotion. This makes it possible to present information about the speaker's emotions to the user of the display device 1.

Furthermore, on the screen 901, a mark 907 is displayed around the symbol 906 to indicate that the speaker P4 corresponding to the symbol 906 is speaking. That is, the mark 907 is displayed at a position corresponding to the arrival direction of the sound, and indicates that the sound is emitted from the sound source located in the arrival direction.

Note that the processor 12 distinguishes the utterances of a plurality of speakers based on the result of estimating the direction of arrival of the voice. That is, when the difference between the direction of arrival of the voice corresponding to one utterance and the direction of arrival of the voice corresponding to another utterance is greater than or equal to a predetermined angle, the processor 12 determines that the utterances are those of different speakers (in other words, that the sounds were emitted from separate sound sources). Then, the processor 12 displays the text image 903 so that the texts corresponding to a plurality of utterances with different directions of arrival can be distinguished, and displays the symbols 905 and 906 associated with the respective texts at positions corresponding to the directions of arrival of the voices.
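The rule for separating speakers by direction of arrival, together with a lookup that plays the role of the table 1000, might be sketched as follows. The threshold value, the symbol names, and the class and method names are hypothetical; the embodiment only states that a predetermined angle and a table associating sound sources with symbols are used.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for table 1000: source index -> symbol type.
SYMBOLS = ["heart", "face", "star", "circle"]


@dataclass
class SourceTracker:
    min_separation_deg: float = 15.0  # assumed "predetermined angle"
    sources: list = field(default_factory=list)  # known directions (deg)

    def assign(self, angle_deg):
        """Return (source_index, symbol) for an utterance from angle_deg.

        An utterance whose direction differs from every known source by
        at least min_separation_deg is treated as a new sound source.
        """
        for index, known in enumerate(self.sources):
            if abs(angle_deg - known) < self.min_separation_deg:
                return index, SYMBOLS[index % len(SYMBOLS)]
        self.sources.append(angle_deg)
        index = len(self.sources) - 1
        return index, SYMBOLS[index % len(SYMBOLS)]
```

The returned symbol would be drawn both next to the text in the window 902 and at the screen position corresponding to the direction of arrival, which is what ties the two together for the user.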

In the example of FIG. 7, the text image 903 representing the utterance of the speaker P3 and the symbol 905 representing the arrival direction of the voice uttered by the speaker P3 are associated with each other by displaying a symbol 904 of the same type as the symbol 905 near the text image 903. However, the method of associating a text image representing an utterance of a specific speaker with a symbol image representing the direction of arrival of the voice uttered by the speaker is not limited to this example. For example, in the text image 903, texts corresponding to utterances with different arrival directions may be displayed in different colors. Then, the text image corresponding to the sound from a specific direction of arrival and the symbol image indicating that direction of arrival may be associated by being displayed in the same color. Specifically, the text corresponding to the utterance of speaker P3 may be displayed in a first color, and a symbol of the first color may be displayed at a position indicating the direction of speaker P3. Then, the text corresponding to the utterance of speaker P4 may be displayed in a second color, and a symbol of the second color may be displayed at a position indicating the direction of speaker P4. The symbols of the first color and the symbols of the second color may have different shapes or may have the same shape.

FIG. 8 is a diagram showing another example of display on the display device. A screen 901 includes images of speakers P3 and P4 as in the example of FIG. 7, and a window 902 and a text image 903 are displayed. On the other hand, instead of the symbols 904, 905, and 906 in FIG. 7, a direction mark 1004 and symbols 1005 and 1006 are displayed.

The symbols 1005 and 1006 indicate the direction of arrival of the voice, that is, the position of the speaker. The symbols 1005 and 1006 are associated with different speakers, but may be symbols of the same type. The direction mark 1004 indicates the direction of the sound source corresponding to each text included in the text image 903. In the example of FIG. 8, arrows indicate whether the sound source is positioned to the right or left with respect to the front direction of the user (that is, the normal direction of the screen 901). Specifically, a rightward arrow is displayed next to the text corresponding to the utterance of the speaker P3 located to the right of the user's front, and a leftward arrow is displayed next to the text corresponding to the utterance of the speaker P4 located to the left of the user's front. In this way, a text image and a symbol image are associated by displaying, near the text corresponding to the sound from a specific direction of arrival, a symbol or graphic that identifies which of the symbols 1005 and 1006 on the screen 901 corresponds to that direction of arrival. With such a display, the user can easily identify which direction's sound source each text included in the text image 903 in the window 902 represents.

It should be noted that the direction mark 1004 is not limited to two types indicating the right direction and the left direction, and may be a mark indicating a wider variety of directions. This makes it possible to identify which text represents which speaker's utterances even when there are three or more speakers. Also, the direction indicated by the direction mark 1004 is not limited to being determined by the position of the sound source relative to the front direction of the user, and may be determined based on the relative positions of a plurality of sound sources, for example. For example, if two speakers are positioned to the right of the user, a rightward arrow may be displayed next to the text corresponding to the utterance of the speaker positioned relatively to the right, and a leftward arrow may be displayed next to the text corresponding to the utterance of the speaker positioned relatively to the left.
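The two ways of choosing a direction mark described above (relative to the user's front, or relative to the other sound sources) can be compared in a short sketch. The function names, the use of the mean direction as the dividing line, and the handling of an angle of exactly zero are assumptions for illustration.

```python
def direction_mark(angle_deg):
    """Arrow direction relative to the user's front.

    Angles are in degrees, 0 = front, positive = right. An angle of
    exactly 0 is arbitrarily treated as "right" in this sketch.
    """
    return "right" if angle_deg >= 0 else "left"


def relative_direction_marks(angles_by_source):
    """Arrow direction of each source relative to the other sources.

    Each source is compared with the mean direction of all sources, so
    two speakers who are both on the user's right still get different
    marks.
    """
    center = sum(angles_by_source.values()) / len(angles_by_source)
    return {
        name: direction_mark(angle - center)
        for name, angle in angles_by_source.items()
    }
```

With speakers at 50 and 20 degrees, direction_mark gives a rightward arrow for both, while relative_direction_marks gives a rightward arrow for the first and a leftward arrow for the second.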

FIG. 9 is a diagram showing another example of display on the display device. FIG. 9(a) shows a screen 901 when the speaker P3 and the speaker P4 are positioned to the right, outside the field of view of the user wearing the display device 1. FIG. 9(b) shows the screen 901 when the speaker P3 is out of the user's field of view to the right and the speaker P4 is within the user's field of view. That is, when the user viewing the screen 901 of FIG. 9(a) turns slightly to the right, the screen 901 of FIG. 9(b) is seen.

In FIG. 9(a), in addition to the window 902 showing text corresponding to speech, the screen 901 displays a direction indication frame 1101 indicating the direction of a sound source with respect to the FOV (Field of View) of the display device 1, and a bird's-eye view map 1102 showing the relationship between the FOV and the direction of the sound source. The FOV is an angle range preset for the display device 1, and has a predetermined width in each of the elevation direction and the azimuth direction centered on the reference direction of the display device 1 (the front direction of the wearer). The FOV of the display device 1 is included in the field of view seen by the user through the display device 1.

An arrow indicating the direction of the sound source with respect to the FOV and a symbol identifying the sound source existing in the direction indicated by the arrow are displayed in the direction indication frame 1101. In the example of FIG. 9(a), since the sound source exists to the right of the FOV, the direction indication frame 1101 is displayed at the right end of the screen 901. If the sound source exists to the left of the FOV, the direction indication frame 1101 is displayed at the left end of the screen 901. That is, the direction indication frame 1101 is displayed at the end of the screen 901 corresponding to the direction of arrival of the sound. In this way, the symbol image associated with the text image 903 is displayed at a position corresponding to the direction of arrival of the voice. This allows the user to easily recognize in which direction, relative to the field of view seen through the display device 1, the sound source that emitted the sound corresponding to the text displayed in the window 902 is located.

As shown in FIG. 9(b), when the speaker P4 enters the FOV from outside the FOV, the symbol corresponding to the speaker P4 is no longer displayed in the direction indication frame 1101.

Note that the display position of the direction indication frame 1101 is not limited to the edge of the screen 901. Further, the contents displayed in the direction indication frame 1101 are not limited to symbols and arrows; at least one of these may be omitted from the direction indication frame 1101, and other figures or symbols may be included in the direction indication frame 1101. If the direction indication frame 1101 includes a symbol or figure indicating a direction, such as an arrow, the direction indication frame 1101 may be displayed at a position that does not depend on the direction of the sound source.

An area 1103 indicating the FOV of the display device 1 and a symbol indicating the direction of the sound source are displayed on the bird's-eye view map 1102. The area 1103 is displayed at a fixed position on the bird's-eye view map 1102, and the symbol associated with the text image 903 is displayed on the bird's-eye view map 1102 at a position indicating the direction of the sound source (that is, a position corresponding to the direction of arrival of the sound). By displaying such a bird's-eye view map 1102, the user can easily recognize from which direction, relative to the field of view seen through the display device 1, the sound corresponding to the text displayed in the window 902 is coming. Note that the area 1103 displayed on the bird's-eye view map 1102 does not have to strictly match the FOV of the display device 1. For example, the area 1103 may represent the range included in the field of view of a user wearing the display device 1. Further, for example, the bird's-eye view map 1102 may indicate the reference direction of the display device 1 (the front direction of the wearer) instead of the FOV.

As shown in FIG. 9(b), when the speaker P4 enters the FOV, the symbol corresponding to the speaker P4 is displayed at a position overlapping the area 1103 on the bird's-eye view map 1102.
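The decisions behind the direction indication frame 1101 and the bird's-eye view map 1102 can be sketched for the azimuth direction only. The FOV half-width, the function names, and the map coordinate convention (wearer at the origin, front along +y, right along +x) are assumptions for illustration; elevation is ignored.

```python
import math


def indicator_edge(angle_deg, fov_half_width_deg=20.0):
    """Decide where the direction indication frame should appear.

    Returns None when the source is inside the FOV (no frame needed),
    otherwise the screen edge on the side the sound comes from.
    Angles are in degrees, 0 = front, positive = right.
    """
    if abs(angle_deg) <= fov_half_width_deg:
        return None
    return "right" if angle_deg > 0 else "left"


def map_position(angle_deg, radius=1.0):
    """Position of a source symbol on a bird's-eye map.

    The wearer is at the origin, the front direction is +y and the
    right-hand side is +x.
    """
    rad = math.radians(angle_deg)
    return (radius * math.sin(rad), radius * math.cos(rad))
```

When the wearer turns toward a speaker, the speaker's angle shrinks toward zero, indicator_edge starts returning None, and the symbol from map_position moves into the fixed region of the map that represents the FOV, which matches the transition from FIG. 9(a) to FIG. 9(b).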

(5) Summary
According to the present embodiment, the controller 10 causes the text image 903 corresponding to the voice acquired via the microphone 101 to be displayed in a predetermined text display area on the display section of the display device 1. At the same time, the controller 10 displays the symbol image associated with the text image 903 at a display position within the display section corresponding to the estimated arrival direction of the sound. As a result, the user of the display device 1 can visually recognize the content of the conversation taking place around the user, and can easily recognize whose utterances are included in the conversation.

Also, according to this embodiment, the text images corresponding to the voice are collectively displayed in a predetermined text display area regardless of the position of the sound source, so the user can easily follow the text images. Furthermore, even if the sound source is out of the user's field of view, the user can recognize the content of the utterance uttered by the sound source without facing the direction of the sound source.

Further, according to the present embodiment, the controller 10 causes the display unit to display information indicating the relationship between the range included in the visual field of the user wearing the display device 1 and the direction of the sound source. Thereby, the user can easily recognize in which direction the speaker is when a conversation is taking place outside the field of view or when the user is called out from the outside of the field of view. As a result, it is possible to quickly participate in conversations and respond to calls.

Further, according to the present embodiment, the controller 10 displays, at a position within the display section of the display device 1 that corresponds to the estimated direction of arrival of the sound, a mark indicating that a sound source located in that direction is emitting sound. This allows the user to easily identify the speaking person even before text display by voice recognition is completed.

(6) Modification
A modification of the present embodiment will be described.

(6.1) Modification 1
Modification 1 of the present embodiment will be described. In Modification 1, the controller 10 limits the total number of text image sentences displayed simultaneously on the display 102, which is the display unit of the display device 1. Here, a sentence is a set of texts corresponding to speech from the same direction of arrival, collected in a single continuous sound collection period. Among the sounds acquired through the microphone 101, the controller 10 displays texts corresponding to sounds with different arrival directions as separate sentences. In addition, the controller 10 displays texts corresponding to voices that are separated by a silence period longer than a predetermined time as separate sentences.
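The sentence definition above can be sketched as follows. This is only an illustrative sketch: the `Sentence` class, the function name, the use of discrete labels for the directions of arrival, and the one-second silence threshold are all assumptions, and each new segment is compared only with the most recent sentence for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    direction: str   # hypothetical label for an estimated direction of arrival
    start: float     # start time of the sentence in seconds
    end: float       # end time of the most recent text in the sentence
    texts: list = field(default_factory=list)


def group_into_sentences(segments, max_silence=1.0):
    """Group recognized segments into sentences.

    segments: iterable of (direction, start, end, text), sorted by start time.
    A new sentence starts when the direction changes or when the silence
    since the previous segment exceeds max_silence seconds.
    """
    sentences = []
    for direction, start, end, text in segments:
        last = sentences[-1] if sentences else None
        same_direction = last is not None and last.direction == direction
        short_gap = last is not None and (start - last.end) <= max_silence
        if same_direction and short_gap:
            last.texts.append(text)
            last.end = end
        else:
            sentences.append(Sentence(direction, start, end, [text]))
    return sentences
```

A practical implementation would likely compare directions of arrival as angles within some tolerance rather than as labels.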

FIGS. 10(a) to 10(d) show examples of changes in the display of the display device. In this example, it is assumed that the controller 10 has set the upper limit of the total number of text image sentences displayed simultaneously on the display 102 to three.

In a situation where a speaker P5 and a speaker P6 are having a conversation within the field of view of a user wearing the display device 1, when the speaker P6 first says "Hello", a sentence 1201 corresponding to the utterance is displayed on the display 102, as shown in FIG. 10(a). The total number of sentences displayed at this point is one.

Next, when speaker P5 says "Hello," a sentence 1202 corresponding to that statement is displayed on the display 102, as shown in FIG. 10(b). The total number of sentences displayed at this point is two.

Next, when speaker P5 says "today", a sentence 1203 corresponding to that statement is displayed on display 102, as shown in FIG. 10(c). The total number of sentences displayed at this point is three.

Next, when speaker P5 says "nice weather", a sentence 1204 corresponding to that statement is displayed on display 102, as shown in FIG. 10(d). Here, since the upper limit of the total number of sentences displayed simultaneously is limited to 3, the sentence 1201 corresponding to the oldest utterance among the plurality of sentences displayed on the display 102 is hidden.

By limiting the total number of text image sentences displayed simultaneously on the display 102 in this way, it is possible to prevent the area in which the text images are displayed on the display 102 from becoming too large. As a result, the user wearing the display device 1 can see both the displayed text image and the real object seen through the display 102 (for example, the speaker's facial expression), and can communicate smoothly.
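The behavior described for FIGS. 10(a) to 10(d) can be sketched with a fixed-length queue. The class and method names are hypothetical, and the sketch only tracks which sentences are visible; it does not perform any drawing.

```python
from collections import deque


class SentenceDisplay:
    """Tracks visible sentences under an upper limit on their total number."""

    def __init__(self, max_total=3):
        # A deque with maxlen drops its oldest element when a new one is appended.
        self.visible = deque(maxlen=max_total)

    def show(self, sentence_id):
        """Show a sentence and return the id of the sentence hidden, if any."""
        hidden = None
        if len(self.visible) == self.visible.maxlen:
            hidden = self.visible[0]  # oldest sentence currently displayed
        self.visible.append(sentence_id)
        return hidden
```

With an upper limit of three, showing the sentences 1201, 1202, 1203, and 1204 in this order hides the sentence 1201 when the sentence 1204 is shown, which matches the transition from FIG. 10(c) to FIG. 10(d).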

Note that, in the example shown in FIG. 10, a text image of a sentence corresponding to one direction of arrival (speech of speaker P5) and a text image of a sentence corresponding to another direction of arrival (speech of speaker P6) are made identifiable by being displayed at positions different from each other. However, the display method is not limited to this. For example, as in the above-described embodiment, a plurality of sentences corresponding to a plurality of different arrival directions may be made identifiable by displaying a text image in a predetermined text display area together with a symbol image associated with the text image. Also, in FIGS. 10 and 11, sentences are represented by balloons, but they can also be represented by the method described with reference to FIGS. 7 to 9.

Also, in the example shown in FIG. 10, when the number of displayed sentences exceeds the upper limit, one of the sentences is hidden. However, the present invention is not limited to this, and when the number of displayed sentences exceeds the upper limit, the controller 10 may perform processing to make the display of any sentence less conspicuous. For example, the controller 10 may reduce at least one of brightness, saturation, and contrast of sentences exceeding the upper limit, or reduce the size of any sentence.

In addition, the sentences displayed on the display 102 may be hidden after a predetermined period of time has elapsed, not only when the total number of displayed sentences reaches the upper limit.
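Hiding a sentence after a predetermined period can be sketched as a simple filter on the time at which each sentence was first shown. The function name and the ten-second lifetime are assumptions for illustration only.

```python
def visible_sentences(sentences, now, lifetime=10.0):
    """Return ids of sentences whose display period has not yet elapsed.

    sentences: list of (sentence_id, shown_at) pairs, with times in seconds.
    lifetime: hypothetical predetermined display period in seconds.
    """
    return [sid for sid, shown_at in sentences if now - shown_at < lifetime]
```

Such a filter could be applied together with an upper limit on the number of sentences, so that a sentence is hidden by whichever condition is met first.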

(6.2) Modification 2
Modification 2 of this embodiment will be described. In Modified Example 2, the controller 10 limits the number of sentences of the text image simultaneously displayed on the display 102, which is the display unit of the display device 1, for each estimated direction of arrival.

FIGS. 11(a) to 11(d) show examples of changes in the display of the display device. In this example, it is assumed that the controller 10 sets the upper limit of the number of sentences displayed simultaneously on the display 102 to two for each direction of arrival.

In a situation where a speaker P5 and a speaker P6 are having a conversation within the field of view of a user wearing the display device 1, when the speaker P6 first says "Hello", a sentence 1201 corresponding to the utterance is displayed on the display 102, as shown in FIG. 11(a). At this point, the number of displayed sentences corresponding to the direction of speaker P5 is zero, and the number of displayed sentences corresponding to the direction of speaker P6 is one.

Next, when the speaker P5 says "Hello", a sentence 1202 corresponding to that utterance is displayed on the display 102 as shown in FIG. 11(b). At this point, the number of displayed sentences corresponding to the direction of speaker P5 is one, and the number of displayed sentences corresponding to the direction of speaker P6 is one.

Next, when the speaker P5 says "today", the sentence 1203 corresponding to that statement is displayed on the display 102 as shown in FIG. 11(c). At this point, the number of displayed sentences corresponding to the direction of speaker P5 is two, and the number of displayed sentences corresponding to the direction of speaker P6 is one.

Next, when speaker P5 says "nice weather", a sentence 1204 corresponding to that statement is displayed on display 102, as shown in FIG. 11(d). Here, since the upper limit of the number of sentences displayed simultaneously for each direction of arrival is two, the sentence 1202, which corresponds to the oldest utterance among the plurality of sentences displayed on the display 102 for the direction of speaker P5, is hidden.

In this way, the number of text image sentences displayed simultaneously on the display 102 is limited for each direction of arrival. This prevents the situation where only the text image corresponding to the voice of the speaker who speaks frequently is displayed and the text image corresponding to the voice of the speaker who speaks less is not displayed. As a result, a user wearing the display device 1 can easily recognize the flow of conversations of a plurality of speakers.
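The per-direction limit of FIGS. 11(a) to 11(d) can be sketched by keeping one queue per direction of arrival. The class and method names are hypothetical, and directions are represented by simple labels as an assumption.

```python
from collections import defaultdict, deque


class PerDirectionDisplay:
    """Tracks visible sentences under an upper limit per direction of arrival."""

    def __init__(self, max_per_direction=2):
        self.max_per_direction = max_per_direction
        self.by_direction = defaultdict(deque)

    def show(self, direction, sentence_id):
        """Show a sentence and return the id of the sentence hidden, if any."""
        queue = self.by_direction[direction]
        hidden = None
        if len(queue) >= self.max_per_direction:
            hidden = queue.popleft()  # oldest sentence for this direction only
        queue.append(sentence_id)
        return hidden

    def visible(self):
        """Return all visible sentence ids in ascending order."""
        return sorted(sid for q in self.by_direction.values() for sid in q)
```

With an upper limit of two per direction, the sentence 1201 of the speaker P6 stays visible even after the speaker P5 has spoken three times, while the sentence 1202 is hidden.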

(6.3) Other Modifications
In the above-described embodiment, the case where the plurality of microphones 101 are integrated with the display device 1 has been mainly described. However, not limited to this, an array microphone device having a plurality of microphones 101 may be configured separately from the display device 1 and connected to the display device 1 by wire or wirelessly. In this case, the array microphone device and the display device 1 may be directly connected, or may be connected via another device such as a PC or a cloud server.

Further, when the array microphone device and the display device 1 are configured separately, at least part of the functions of the display device 1 described above may be implemented in the array microphone device. For example, the array microphone device may perform the estimation of the direction of arrival in S111 and the extraction of the audio signal in S112 of the processing flow described above, and may transmit information indicating the estimated direction of arrival and the extracted audio signal to the display device 1. The display device 1 may then use the received information and audio signals to control the display of images, including text images.
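One possible shape for the information exchanged between the array microphone device and the display device 1 is sketched below. The embodiment does not specify a message format; the `DirectionalAudio` class, its field names, and the use of JSON are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DirectionalAudio:
    azimuth_deg: float    # estimated direction of arrival (hypothetical unit)
    sample_rate_hz: int   # sampling rate of the extracted audio signal
    samples: list         # extracted audio samples for that direction


def encode(message):
    """Serialize a message on the array microphone device side."""
    return json.dumps(asdict(message)).encode("utf-8")


def decode(data):
    """Restore a message on the display device side."""
    return DirectionalAudio(**json.loads(data.decode("utf-8")))
```

A real system would more likely use a compact binary encoding for the audio samples; JSON is used here only to keep the sketch self-contained.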

In the above-described embodiment, the case where the display device 1 is an optical see-through glass-type display device has been mainly described. However, the format of the display device 1 is not limited to this. For example, the display device 1 may be a video see-through glass-type display device. That is, the display device 1 may comprise a camera. In that case, the display device 1 may display on the display 102 a synthesized image obtained by combining the various display images described above, such as text images and symbol images generated based on voice recognition, with the captured image captured by the camera. The captured image is an image captured in front of the user and may include an image of the speaker. Further, for example, the controller 10 and the display 102 may be configured separately, such as the controller 10 existing in a cloud server.

Also, the display device 1 may be a PC or a tablet terminal, and in that case, the display device 1 may display the above-described text image 903 and bird's-eye view map 1102 on the display of the PC or tablet terminal. In this case, the bird's-eye view map 1102 need not display the area 1103, and the upward direction of the bird's-eye view map 1102 corresponds to the reference direction of the microphone array including the multiple microphones 101. With such a configuration, the user can confirm the content of the conversation picked up by the microphone 101 in the text image 903, and can also easily recognize from the bird's-eye view map 1102 in which direction the speaker of each text is located with respect to the reference direction of the microphone array.

In the embodiment described with reference to FIG. 7 and the like, the case where the predetermined text display area in which the text image 903 is displayed on the display 102 is the window 902 has been mainly described. However, the predetermined text display area is not limited to this example, and may be any area determined regardless of the orientation of the display 102 . The window 902 may not be displayed in the predetermined text display area. Also, the display format of the text image in the text display area is not limited to the example shown in FIG. 7 and the like. For example, utterances from different directions of arrival may be displayed in different portions of the text display area.

In the above-described embodiment, an example in which a user's instruction is input from an input device connected to the input/output interface 13 has been described, but the present invention is not limited to this. A user's instruction may be input from a drive button object presented by an application of a computer (for example, a smartphone) connected to the communication interface 14 .

The display 102 can be implemented by any method as long as it can present an image to the user. The display 102 can be implemented by, for example, the following implementation method.
・HOE (Holographic optical element) or DOE (Diffractive optical element) using an optical element (as an example, a light guide plate)
・Liquid crystal display
・Retinal projection display
・LED (Light Emitting Diode) display
・Organic EL (Electro Luminescence) display
・Laser display
・Display that guides light emitted from a light-emitting body using an optical element (for example, a lens, mirror, diffraction grating, liquid crystal, MEMS mirror, or HOE)
In particular, in a retinal projection display, even a person with weak eyesight can easily observe an image. Therefore, a person suffering from both hearing loss and amblyopia can more easily recognize the direction of arrival of the speech sound.

In the voice extraction process by the controller 10, any implementation method can be used as long as a voice signal corresponding to a specific speaker can be extracted. The controller 10 may, for example, extract the audio signal by the following method.
・Frost beamformer
・Adaptive filter beamforming (generalized sidelobe canceller as an example)
・Speech extraction methods other than beamforming (for example, frequency filter or machine learning)
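As an illustration of extracting sound from a specific direction, the following sketch shows a delay-and-sum beamformer. Note that delay-and-sum is not one of the methods listed above; it is a simpler fixed beamformer, shown here only to convey the idea of steering toward a direction, and the function name and the use of whole-sample delays are assumptions.

```python
def delay_and_sum(channels, delays):
    """Align and average microphone channels for one target direction.

    channels: list of equal-length sample lists, one per microphone.
    delays: arrival delay of the target sound at each microphone, in whole
            samples (assumed to be known from the array geometry).
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for samples, delay in zip(channels, delays):
            idx = t + delay  # read later-arriving channels further ahead
            if 0 <= idx < n:
                total += samples[idx]
        out.append(total / len(channels))
    return out
```

When the delays match the target direction, the channels add coherently and the target sound is preserved, whereas sound from other directions is averaged out of phase and attenuated. Adaptive methods such as those listed above additionally adjust filter weights to suppress interference.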

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to the above embodiments. Also, the above embodiments can be improved and modified in various ways without departing from the gist of the present invention. Also, the above embodiments and modifications can be combined.

1: display device 10: controller 101: microphone 102: display

Claims

A display control device for controlling display of a display device, comprising:
acquisition means for acquiring sounds collected by a plurality of microphones;
estimating means for estimating the direction of arrival of the sound acquired by the acquisition means; and
display control means for displaying a text image corresponding to the sound acquired by the acquisition means in a predetermined text display area in the display section of the display device, and for displaying a symbol image associated with the text image at a display position within the display section corresponding to the direction of arrival estimated by the estimating means.
The display control device according to claim 1, wherein the text image and the symbol image are associated by displaying an image of the same type as the symbol image near the text image.
The display control device according to claim 1, wherein the text image and the symbol image are associated by being displayed in the same kind of color.
The display control device according to claim 1, wherein the text image and the symbol image are associated by displaying, near the text image, a symbol or a figure capable of identifying the symbol image among a plurality of symbol images in the display section.
The display control device according to any one of claims 1 to 4, wherein the display position corresponding to the direction of arrival is a position overlapping an image of a sound source existing in the direction of arrival on the display unit.
The display control device according to any one of claims 1 to 4, wherein the display position corresponding to the arrival direction is an end portion corresponding to the arrival direction among the end portions of the display section.
The display control device according to any one of claims 1 to 4, wherein the display position corresponding to the direction of arrival is a position representing the direction of the sound source on a map showing the relationship between the range included in the visual field of the user wearing the display device and the direction of the sound source.
The display control device according to any one of claims 1 to 7, wherein the display control means further displays, at a position within the display section corresponding to the direction of arrival estimated by the estimating means, a mark indicating that a sound source located in the direction of arrival is emitting a sound.
The display control device according to claim 8, wherein the text image displayed in the predetermined text display area is an image representing text obtained by extracting speech from a specific direction from the sound acquired by the acquisition means and performing speech recognition on the extracted speech.
A display control device for controlling display of a display device, comprising:
acquisition means for acquiring sounds collected by a plurality of microphones;
estimating means for estimating the direction of arrival of the sound acquired by the acquisition means;
display control means for causing a display unit of the display device to display, in a distinguishable manner, a text image corresponding to a sound from a first direction of arrival and a text image corresponding to a sound from a second direction of arrival different from the first direction of arrival; and
limiting means for limiting the total number of text image sentences simultaneously displayed on the display unit by the display control means.
A display control device for controlling display of a display device, comprising:
acquisition means for acquiring sounds collected by a plurality of microphones;
estimating means for estimating the direction of arrival of the sound acquired by the acquisition means;
display control means for causing a display unit of the display device to display, in a distinguishable manner, a text image corresponding to a sound from a first direction of arrival and a text image corresponding to a sound from a second direction of arrival different from the first direction of arrival; and
limiting means for limiting, for each direction of arrival estimated by the estimating means, the number of text image sentences simultaneously displayed on the display unit by the display control means.
The display control device according to claim 10 or 11, wherein the sentence is a group of texts corresponding to voices from the same direction of arrival collected in a single continuous sound collection period.
The display control device according to any one of claims 1 to 12, wherein the display device is a user-worn glass-type display device.
A program for causing a computer to implement each means of the display control device according to any one of claims 1 to 13.
A display control method for controlling display of a display device, comprising:
acquiring sounds collected by a plurality of microphones;
estimating a direction of arrival of the acquired sound; and
displaying a text image corresponding to the acquired sound in a predetermined text display area in the display unit of the display device, and displaying a symbol image associated with the text image at a display position within the display unit corresponding to the estimated direction of arrival.
A display control method for controlling display of a display device, comprising:
acquiring sounds collected by a plurality of microphones;
estimating a direction of arrival of the acquired sound;
causing a display unit of the display device to display, in a distinguishable manner, a text image corresponding to a sound from a first direction of arrival and a text image corresponding to a sound from a second direction of arrival different from the first direction of arrival; and
limiting the total number of text image sentences simultaneously displayed on the display unit.
A display control method for controlling display of a display device, comprising:
acquiring sounds collected by a plurality of microphones;
estimating a direction of arrival of the acquired sound;
causing a display unit of the display device to display, in a distinguishable manner, a text image corresponding to a sound from a first direction of arrival and a text image corresponding to a sound from a second direction of arrival different from the first direction of arrival; and
limiting, for each of the estimated directions of arrival, the number of text image sentences simultaneously displayed on the display unit.