WO2023157963A1 - Information processing device, information processing method, and program - Google Patents
Information processing device, information processing method, and program
- Publication number
- WO2023157963A1 (PCT/JP2023/005887)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound source
- map image
- information
- microphone device
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S3/00—Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic or electromagnetic waves, or particle emission, not having a directional significance, are being received
- G01S3/80—Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
- G01S3/801—Details
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—Two-dimensional [2D] image generation
- G06T11/20—Drawing from basic elements
- G06T11/26—Drawing of charts or graphs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/326—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only for microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/40—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/005—Circuits for transducers for combining the signals of two or more microphones
Definitions
- the present disclosure relates to an information processing device, an information processing method, and a program.
- Hearing-impaired people may have a reduced ability to perceive the incoming direction of sound due to a decline in auditory function.
- When a hearing-impaired person tries to have a conversation with a plurality of people, it is difficult for that person to recognize accurately who is saying what, and communication is hindered.
- Patent Document 1 discloses a conversation support device in which display areas corresponding to each of a plurality of users are set in an image display area of a display unit, and text that is a speech recognition result for one user's speech is displayed in the display area set for another user.
- the purpose of the present disclosure is to enable the user to intuitively associate the speaker with the utterance content based on visual information.
- An information processing apparatus includes means for acquiring information indicating the direction of a sound source with respect to at least one multi-microphone device, and information on the content of sound emitted from the sound source and collected by the multi-microphone device.
- FIG. 1 is a block diagram showing the configuration of an information processing system according to an embodiment.
- FIG. 2 is a block diagram showing the configuration of a controller of the present embodiment.
- FIG. 3 is a diagram showing the external appearance of the multi-microphone device of the present embodiment.
- FIG. 4 is a diagram showing one aspect of the present embodiment.
- FIG. 5 is a diagram showing the data structure of the sound source database of the present embodiment.
- FIG. 6 is a flowchart of audio processing according to the present embodiment.
- FIG. 7 is a diagram for explaining sound collection by a microphone.
- FIG. 8 is a diagram for explaining the direction of a sound source in a reference coordinate system.
- FIG. 9 is a diagram showing an example of a map image.
- FIG. 10 is a diagram showing one mode of modification 1;
- FIG. 10 is a diagram showing a data structure of a statement database of modification 1;
- 10 is a flowchart of voice processing according to Modification 1;
- FIG. 11 is a diagram showing an example of a map image of modification 2;
- FIG. 11 is a diagram showing another example of a map image of modification 2;
- FIG. 11 is a diagram showing an example of a map image of modification 3;
- FIG. 10 is a diagram showing an example of image display in modification 1;
- a coordinate system (microphone coordinate system) based on the position and orientation of the multi-microphone device may be used.
- The origin of the microphone coordinate system is the position of the multi-microphone device (for example, the position of its center of gravity), and the x-axis and the y-axis are perpendicular to each other at the origin.
- The x+ direction is the front of the multi-microphone device.
- The x- direction is the rear of the multi-microphone device.
- The y+ direction is the left of the multi-microphone device.
- The y- direction is the right of the multi-microphone device.
- a direction in a specific coordinate system means a direction with respect to the origin of the coordinate system.
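- The microphone coordinate system described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the sign convention (angles measured from the x+ axis, positive toward y+, i.e. toward the left of the device) is an assumption chosen to match the axis definitions.

```python
import math


def direction_to_unit_vector(angle_deg: float) -> tuple[float, float]:
    """Unit vector for a direction in the microphone coordinate system.

    Assumed convention: 0 degrees is the front (x+), and positive angles
    turn toward the left of the device (y+).
    """
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))


def vector_to_direction(x: float, y: float) -> float:
    """Inverse mapping: direction in degrees, in the range (-180, 180]."""
    return math.degrees(math.atan2(y, x))
```

- Under this convention a source directly to the right of the device has a direction of -90 degrees, and a source directly behind it has a direction of 180 degrees.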
- FIG. 1 is a block diagram showing the configuration of the information processing system of this embodiment.
- the information processing system 1 includes a display device 10, a controller 30, and a multi-microphone device 50.
- The information processing system 1 is used by multiple users. At least one of the users may be hearing-impaired, or none of the users need be hearing-impaired (that is, all of the users may have sufficient hearing for conversation).
- The display device 10 and the controller 30 are connected via a communication cable or a wireless channel (e.g., a Wi-Fi channel or a Bluetooth® channel).
- The controller 30 and the multi-microphone device 50 are connected via a communication cable or a wireless channel (e.g., a Wi-Fi channel or a Bluetooth channel).
- the display device 10 includes one or more displays 11 (an example of a "display unit").
- the display device 10 receives an image signal from the controller 30 and displays an image corresponding to the image signal on the display.
- the display device 10 is, for example, a tablet terminal, a personal computer, a smart phone, or a conference display device.
- the display device 10 may include an input device or operation unit for obtaining user instructions.
- the controller 30 controls the display device 10 and the multi-microphone device 50.
- the controller 30 is an example of an information processing device.
- the controller 30 is, for example, a smart phone, tablet terminal, personal computer, or server computer.
- the multi-microphone device 50 can be installed independently from the display device 10. That is, the position and orientation of multi-microphone device 50 can be determined independently from the position and orientation of display device 10 .
- FIG. 2 is a block diagram showing the configuration of the controller of this embodiment.
- The controller 30 includes a storage device 31, a processor 32, an input/output interface 33, and a communication interface 34.
- The storage device 31 is configured to store programs and data.
- The storage device 31 is, for example, a combination of ROM (Read Only Memory), RAM (Random Access Memory), and storage (e.g., flash memory or hard disk).
- The programs include, for example, the following:
  - OS (Operating System) program
  - Application program that executes information processing
- The data includes, for example, the following:
  - Databases referenced in information processing
  - Data obtained by executing information processing (that is, execution results of information processing)
- The processor 32 is a computer that implements the functions of the controller 30 by running programs stored in the storage device 31.
- The processor 32 is, for example, at least one of the following:
  - CPU (Central Processing Unit)
  - GPU (Graphics Processing Unit)
  - ASIC (Application Specific Integrated Circuit)
  - FPGA (Field Programmable Gate Array)
- The input/output interface 33 is configured to acquire information (e.g., user instructions) from an input device connected to the controller 30 and to output information (e.g., an image signal) to an output device connected to the controller 30.
- Input devices are, for example, keyboards, pointing devices, touch panels, or combinations thereof.
- An output device is, for example, a display.
- The communication interface 34 is configured to control communication between the controller 30 and external devices (e.g., the display device 10 and the multi-microphone device 50).
- FIG. 3 is a diagram showing the appearance of the multi-microphone device of this embodiment.
- The multi-microphone device 50 includes multiple microphones. In the following description, the multi-microphone device 50 is assumed to have five microphones 51-1 to 51-5 (hereinafter "microphones 51" when they need not be distinguished). The multi-microphone device 50 uses the microphones 51-1 to 51-5 to collect sound. The multi-microphone device 50 also estimates the arrival direction of the sound (that is, the direction of the sound source) in the microphone coordinate system, and performs beamforming processing, which will be described later.
- The microphones 51 collect, for example, sounds around the multi-microphone device 50. Sounds collected by the microphones 51 include, for example, at least one of the following:
  - Speech by a person
  - Sound of the environment where the multi-microphone device 50 is used
- The multi-microphone device 50 has, for example, a mark 50a attached to the surface of its housing that indicates the reference direction of the multi-microphone device 50 (for example, the front (that is, the x+ direction), although it may be another predetermined direction). This allows the user to easily recognize the orientation of the multi-microphone device 50 from visual information. Note that the means for recognizing the orientation of the multi-microphone device 50 is not limited to this.
- the mark 50 a may be integrated with the housing of the multi-microphone device 50 .
- the multi-microphone device 50 further includes a processor, a storage device, and a communication or input/output interface for, for example, audio processing, which will be described later. Also, the multi-microphone device 50 can be equipped with an IMU (Inertial Measurement Unit) to detect the movement and state of the multi-microphone device 50 .
- FIG. 4 is a diagram showing one aspect of the present embodiment.
- the controller 30 generates a map image and displays it on the display 11 of the display device 10 while a conversation (for example, a meeting) is being held by a plurality of participants (that is, users of the information processing system 1).
- The map image corresponds to a bird's-eye view of the sound source (speaker) environment around the multi-microphone device 50, and text (an example of "information on the content of voice") is arranged in it.
- the controller 30 updates the map image according to the participant's speech.
- the map image serves as a UI (User Interface) for visually grasping the content of the most recent conversation (particularly, who is speaking what) in real time.
- the map image includes a microphone icon MI31, a circumference CI31, sound source icons SI31, SI32, SI33, SI34, and text images TI32, TI34.
- the microphone icon MI31 represents the multi-microphone device 50.
- the microphone icon MI31 has a mark MR31 indicating the orientation of the microphone icon MI31.
- a viewer of the map image can recognize where the microphone icon MI31 is directed in the map image by checking the mark MR31.
- The viewer of the map image can therefore easily associate the participants in the real world with the sound source icons in the map image.
- the circumference CI31 corresponds to the circumference centered on the microphone icon MI31.
- The controller 30 arranges sound source icons SI31, SI32, SI33, and SI34 corresponding to the participants of the conversation on the circumference CI31. Specifically, the controller 30 arranges each of the sound source icons SI31, SI32, SI33, and SI34 on the circumference CI31 at a position corresponding to the direction, with respect to the multi-microphone device 50, of the sound source that the icon represents.
- The controller 30 converts the microphone coordinate system into the coordinate system of the map image (hereinafter, “map coordinate system”).
- The controller 30 places the sound source icon representing a sound source at the intersection of the circumference CI31 and the straight line extending from the display position of the microphone icon MI31 (an example of the “origin of the map coordinate system”) in the (estimated) direction of the sound source expressed in the map coordinate system.
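- The placement of a sound source icon on the circumference can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the function and parameter names are hypothetical, the map coordinate system is assumed to be a pixel grid whose y values grow downward, and `front_screen_deg` (the on-screen angle at which the front of the microphone icon points, 90 meaning "up") is an assumed way of expressing the conversion from the microphone coordinate system.

```python
import math


def icon_position(
    direction_deg: float,
    center: tuple[float, float],
    radius: float,
    front_screen_deg: float = 90.0,
) -> tuple[float, float]:
    """Pixel position of a sound source icon on the circle around the microphone icon.

    direction_deg: estimated direction in the microphone coordinate system
        (0 = front, positive = toward the left of the device; assumed convention).
    center: display position of the microphone icon (origin of the map coordinates).
    radius: radius of the circumference in pixels.
    front_screen_deg: on-screen angle of the device front, counterclockwise
        from screen-right (hypothetical parameter).
    """
    theta = math.radians(front_screen_deg + direction_deg)
    cx, cy = center
    # Screen y grows downward, so the sine term is subtracted.
    return (cx + radius * math.cos(theta), cy - radius * math.sin(theta))
```

- With the default orientation, a source in front of the device is drawn above the microphone icon, and a source to the left of the device is drawn to the left of it.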
- The sound source icon SI31 represents a specific one of the participants (for example, a person who is hard of hearing and who has more opportunities to look at the map image than other participants; hereinafter also referred to as "you").
- The controller 30 may set, for the sound source icon SI31 representing “you”, a specific format (for example, color, texture, optical effect, shape, or size) that differs from that of the sound source icons representing other sound sources.
- the sound source icon SI32 represents Mr. D among the participants. In the example of FIG. 4, Mr. D is speaking.
- the controller 30 may set the sound source icon SI32 representing the speaker (sound source) who is speaking in a format different from that of the sound source icons representing speakers (sound sources) in other states. That is, the controller 30 can dynamically change the format of the sound source icon depending on the state of the sound source represented by the sound source icon.
- the text image TI32 represents Mr. D's most recent utterance content (speech recognition result for the voice uttered by Mr. D).
- the controller 30 arranges the text image TI32 on the map image in such a manner that the viewer of the map image can easily recognize the correspondence between the text image TI32 and the sound source icon SI32.
- the controller 30 arranges the text image TI32 at a predetermined position (for example, lower right) with respect to the sound source icon SI32.
- the controller 30 may format the text image TI32 at least partially in the same format as the sound source icon SI32. For example, the controller 30 may align the sound source icon SI32 and the background or characters of the text image TI32 with similar colors.
- the sound source icon SI33 represents Mr. T among the multiple participants. In the example of FIG. 4, Mr. T does not speak.
- the controller 30 may set the sound source icon SI33 representing a speaker (sound source) who is not speaking in a format different from that of sound source icons representing speakers (sound sources) in other states.
- the sound source icon SI34 represents Mr. H among the multiple participants. In the example of FIG. 4, Mr. H has just finished speaking.
- the controller 30 may set the sound source icon SI34 representing the speaker (sound source) immediately after finishing speaking in a format different from that of the sound source icons representing speakers (sound sources) in other states.
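- The three speaker states described above (speaking, just finished speaking, not speaking) could be tracked as in the following sketch. Everything here is hypothetical: the state names, the style table, and the two time thresholds are assumptions for illustration, since the text does not specify how the states are detected or which formats are used.

```python
from enum import Enum
from typing import Optional


class SourceState(Enum):
    SPEAKING = "speaking"
    JUST_FINISHED = "just_finished"
    SILENT = "silent"


# Hypothetical icon formats, one per state.
ICON_STYLE = {
    SourceState.SPEAKING: {"opacity": 1.0, "outline": "bold"},
    SourceState.JUST_FINISHED: {"opacity": 0.7, "outline": "thin"},
    SourceState.SILENT: {"opacity": 0.4, "outline": "none"},
}


def source_state(
    now: float,
    last_voice_time: Optional[float],
    speaking_gap: float = 0.5,
    recent_window: float = 5.0,
) -> SourceState:
    """Classify a sound source by the time elapsed since its last detected voice.

    speaking_gap and recent_window are assumed thresholds in seconds.
    """
    if last_voice_time is None:
        return SourceState.SILENT
    elapsed = now - last_voice_time
    if elapsed <= speaking_gap:
        return SourceState.SPEAKING
    if elapsed <= recent_window:
        return SourceState.JUST_FINISHED
    return SourceState.SILENT
```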
- the text image TI34 represents Mr. H's most recent remarks.
- the controller 30 arranges the text image TI34 on the map image in such a manner that the viewer of the map image can easily recognize the correspondence between the text image TI34 and the sound source icon SI34.
- the controller 30 arranges the text image TI34 at a predetermined position (for example, lower right) with respect to the sound source icon SI34.
- the controller 30 may format the text image TI34 at least partially the same as the sound source icon SI34. For example, the controller 30 may align the sound source icon SI34 and the background or characters of the text image TI34 with similar colors.
- The controller 30 generates a map image in which the text corresponding to the voice uttered by a speaker is arranged at a position according to the estimated direction of that speaker with respect to the multi-microphone device 50, and displays the map image on the display 11 of the display device 10. This allows the viewer of the map image to intuitively associate the speaker with the utterance content.
- FIG. 5 is a diagram showing the data structure of the sound source database of this embodiment.
- Sound source information is stored in the sound source database.
- the sound source information is information about sound sources (typically speakers) around the multi-microphone device 50 identified by the controller 30 .
- The sound source database includes an "ID" field, a "name" field, an "icon" field, a "direction" field, a "recognition language" field, and a "translation language" field.
- Each field is associated with each other.
- a sound source ID is stored in the "ID" field.
- a sound source ID is information for identifying a sound source.
- When the controller 30 detects a new sound source, it issues a new sound source ID and assigns it to that sound source.
- the "name" field stores sound source name information.
- the sound source name information is information regarding the name of the sound source.
- the controller 30 may automatically determine the sound source name information, or may set it according to a user instruction as described later.
- the controller 30 can assign some initial sound source name to the newly detected sound source according to a predetermined rule or randomly.
- the "icon” field stores icon information.
- the icon information is information about the icon of the sound source.
- The icon information may include an icon image (e.g., one of the preset icon images, or a user-provided photo or drawing) or information that can identify an icon format (e.g., color, texture, optical effect, shape).
- the controller 30 may automatically determine the icon information, or may set it according to a user instruction.
- the controller 30 can assign some initial icon to the newly detected sound source according to a predetermined rule or randomly. However, when the icon of the sound source is not displayed on the map image as in Modification 2, which will be described later, the icon information can be omitted from the sound source information.
- the "direction" field stores sound source direction information.
- the sound source direction information is information regarding the direction of the sound source with respect to the multi-microphone device 50 .
- The direction of the sound source is expressed, in the microphone coordinate system, as an angle of deviation from an axis on which the reference direction determined with reference to the microphones 51-1 to 51-5 (in the present embodiment, the front (x+ direction) of the multi-microphone device 50) is 0 degrees.
- Recognition language information is stored in the "recognition language" field.
- the recognized language information is information about the language used by the sound source (speaker). Based on the recognized language information of the sound source, a speech recognition engine to be applied to the speech generated from the sound source is selected.
- the setting of the recognition language information may be specified by a user operation, or may be automatically specified based on a language recognition result by a speech recognition model.
- Translation language information is stored in the "translation language" field.
- the translation language information is information about the target language when machine translation is applied to the speech recognition result (text) of the speech emitted from the sound source. Based on the translation language information of the sound source, a machine translation engine to be applied to the speech recognition result for the speech generated from the sound source is selected. Note that the translation language information may be set collectively for all sound sources instead of individual sound sources, or may be set for each display device 10 .
- the sound source information may include sound source distance information.
- the sound source distance information is information regarding the distance from the multi-microphone device 50 to the sound source.
- the sound source direction information and the sound source distance information can also be expressed as sound source position information.
- the sound source position information is information about the relative position of the sound source with respect to the multi-microphone device 50 (that is, the coordinates of the sound source in the coordinate system of the multi-microphone device 50).
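- One record of the sound source database could be modeled as below. This is a sketch only: the class name, attribute names, and units are assumptions, and the optional `distance_m` attribute stands in for the sound source distance information. The `position` method shows how direction and distance together give the relative position of the sound source, assuming angles are measured from x+ toward y+.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoundSource:
    source_id: int                               # "ID" field
    name: str                                    # "name" field
    direction_deg: float                         # "direction" field, 0 = front of the device
    icon: Optional[str] = None                   # "icon" field (may be omitted)
    recognition_language: Optional[str] = None   # "recognition language" field
    translation_language: Optional[str] = None   # "translation language" field
    distance_m: Optional[float] = None           # optional sound source distance

    def position(self) -> Optional[tuple[float, float]]:
        """Coordinates in the microphone coordinate system, if the distance is known."""
        if self.distance_m is None:
            return None
        rad = math.radians(self.direction_deg)
        return (self.distance_m * math.cos(rad), self.distance_m * math.sin(rad))
```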
- FIG. 6 is a flowchart of audio processing according to this embodiment.
- FIG. 7 is a diagram for explaining sound collection by a microphone.
- FIG. 8 is a diagram for explaining the direction of the sound source in the reference coordinate system.
- FIG. 9 is a diagram showing an example of a map image.
- the audio processing shown in FIG. 6 is started after the display device 10, the controller 30, and the multi-microphone device 50 are powered on and the initial settings are completed.
- the start timing of the processing shown in FIG. 6 is not limited to this.
- the processing shown in FIG. 6 may be repeatedly executed, for example, at a predetermined cycle, so that the user of the information processing system 1 can browse the map image updated in real time.
- The multi-microphone device 50 acquires audio signals via the microphones 51 (S150). Specifically, the microphones 51-1 to 51-5 collect speech sounds arriving via the plurality of illustrated paths and convert the collected speech sounds into audio signals.
- a processor provided in the multi-microphone device 50 acquires audio signals including speech sounds uttered by at least one of the speakers PR3, PR4, and PR5 from the microphones 51-1 to 51-5.
- the audio signals obtained from the microphones 51-1 to 51-5 contain spatial information (for example, delay and phase change) based on the path along which the speech sound has traveled.
- After step S150, the multi-microphone device 50 performs direction-of-arrival estimation (S151).
- a direction-of-arrival estimation model is stored in the storage device of the multi-microphone device 50 .
- the direction-of-arrival estimation model describes information for identifying the correlation between the spatial information included in the speech signal and the direction of arrival of the speech sound.
- The direction-of-arrival estimation method is, for example, MUSIC (Multiple Signal Classification), which uses eigenvalue decomposition of the input correlation matrix, the minimum norm method, or ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques).
- The multi-microphone device 50 inputs the audio signals received from the microphones 51-1 to 51-5 into the direction-of-arrival estimation model, and thereby estimates the direction of arrival of the speech sound (that is, the direction of the sound source of the speech sound with respect to the multi-microphone device 50).
- The multi-microphone device 50 expresses the direction of arrival of the speech sound, for example, in the microphone coordinate system as an angle of deviation from an axis on which the reference direction determined with reference to the microphones 51-1 to 51-5 (in this embodiment, the front (x+ direction) of the multi-microphone device 50) is 0 degrees. In the illustrated example, the estimates are as follows.
- the multi-microphone device 50 estimates that the direction of arrival of the speech sound emitted by the speaker PR3 is shifted leftward from the x-axis by an angle A2.
- the multi-microphone device 50 estimates that the direction of arrival of the voice uttered by speaker PR4 is the direction shifted leftward from the x-axis by an angle A3.
- the multi-microphone device 50 estimates that the direction of arrival of the speech sound emitted by the speaker PR5 is shifted rightward from the x-axis by an angle A1.
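- As an illustration of one of the methods named above, the following is a minimal narrowband MUSIC sketch in NumPy. It is not the direction-of-arrival estimation model of the disclosure. It assumes a far-field source, a known planar microphone layout `mic_xy` (in meters), a single analysis frequency, complex-valued snapshots (for example one STFT bin over time), and a known number of sources; all names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def steering_vector(theta_deg: float, mic_xy: np.ndarray, freq_hz: float) -> np.ndarray:
    """Relative phases at each microphone for a plane wave arriving from theta_deg."""
    theta = np.radians(theta_deg)
    toward_source = np.array([np.cos(theta), np.sin(theta)])
    wavenumber = 2.0 * np.pi * freq_hz / SPEED_OF_SOUND
    return np.exp(1j * wavenumber * (mic_xy @ toward_source))


def music_spectrum(
    snapshots: np.ndarray,   # shape (n_mics, n_snapshots), complex
    mic_xy: np.ndarray,      # shape (n_mics, 2)
    freq_hz: float,
    n_sources: int,
    grid_deg: np.ndarray,
) -> np.ndarray:
    """MUSIC pseudo-spectrum over candidate directions; peaks indicate arrivals."""
    n_mics, n_snapshots = snapshots.shape
    correlation = snapshots @ snapshots.conj().T / n_snapshots
    # eigh returns eigenvalues in ascending order, so the first columns
    # of the eigenvector matrix span the noise subspace.
    _, eigenvectors = np.linalg.eigh(correlation)
    noise_subspace = eigenvectors[:, : n_mics - n_sources]
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        a = steering_vector(theta, mic_xy, freq_hz)
        spectrum[i] = 1.0 / np.linalg.norm(noise_subspace.conj().T @ a) ** 2
    return spectrum
```

- For a single source, the estimated direction is the grid angle at which the pseudo-spectrum is largest, for example `grid_deg[np.argmax(spectrum)]`. With several speakers, the largest `n_sources` peaks would be taken instead.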
- the multi-microphone device 50 extracts an audio signal (S152).
- a storage device included in the multi-microphone device 50 stores a beamforming model.
- the beamforming model describes information for identifying a correlation between a predetermined direction and parameters for forming directivity having a beam in that direction.
- forming the directivity is a process of amplifying or attenuating a sound coming from a specific direction of arrival.
- the multi-microphone device 50 calculates parameters for forming directivity having a beam in the direction of arrival.
- The multi-microphone device 50 inputs the calculated angle A1 into the beamforming model, and calculates parameters for forming directivity having a beam in a direction shifted by the angle A1 rightward from the x-axis. The multi-microphone device 50 inputs the calculated angle A2 into the beamforming model, and calculates parameters for forming directivity having a beam in a direction shifted by the angle A2 leftward from the x-axis. The multi-microphone device 50 inputs the calculated angle A3 into the beamforming model, and calculates parameters for forming directivity having a beam in a direction shifted by the angle A3 leftward from the x-axis.
- the multi-microphone device 50 amplifies or attenuates the audio signals acquired from the microphones 51-1 to 51-5 using parameters calculated for the angle A1.
- the multi-microphone device 50 synthesizes the amplified or attenuated audio signals to extract, from the acquired audio signals, the audio signal of the speech sound coming from the sound source in the direction corresponding to the angle A1.
- the multi-microphone device 50 amplifies or attenuates the audio signals acquired from the microphones 51-1 to 51-5 using the parameters calculated for the angle A2.
- the multi-microphone device 50 synthesizes the amplified or attenuated audio signals, and extracts the audio signal of the speech sound coming from the sound source in the direction corresponding to the angle A2 from the acquired audio signals.
- the multi-microphone device 50 amplifies or attenuates the audio signals acquired from the microphones 51-1 to 51-5 using parameters calculated for the angle A3.
- the multi-microphone device 50 synthesizes the amplified or attenuated audio signals, and extracts the audio signal of the speech sound coming from the sound source in the direction corresponding to the angle A3 from the acquired audio signals.
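- The extraction step can be illustrated with a delay-and-sum beamformer, which is one common way to form directivity; the text does not say which beamforming method the beamforming model uses, so this is a swapped-in example. The sketch works on one narrowband frequency bin and assumes a far-field source and a known microphone layout `mic_xy`; all names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def beamforming_weights(theta_deg: float, mic_xy: np.ndarray, freq_hz: float) -> np.ndarray:
    """Per-microphone complex weights for a beam pointing toward theta_deg."""
    theta = np.radians(theta_deg)
    toward_source = np.array([np.cos(theta), np.sin(theta)])
    phase = 2.0 * np.pi * freq_hz / SPEED_OF_SOUND * (mic_xy @ toward_source)
    # Dividing by the number of microphones gives unit gain in the beam direction.
    return np.exp(1j * phase) / len(mic_xy)


def extract(snapshots: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Phase-align the microphone signals with the weights and sum them."""
    return weights.conj() @ snapshots
```

- Calling `beamforming_weights` once per estimated angle (A1, A2, A3) and applying `extract` to the same microphone signals yields one extracted signal per direction. Sound arriving from the beam direction passes with unit gain, while sound from other directions is attenuated to a degree that depends on the array size and frequency.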
- The multi-microphone device 50 transmits the extracted audio signal to the controller 30 together with information indicating the direction of the sound source corresponding to the audio signal estimated in step S151 (that is, the estimation result of the direction of the sound source with respect to the multi-microphone device 50).
- After step S152, the controller 30 performs sound source identification (S130). Specifically, the controller 30 identifies sound sources existing around the multi-microphone device 50 based on the estimation result of the direction of the sound source (hereinafter referred to as “target direction”) obtained in step S151.
- The controller 30 determines whether the sound source corresponding to the target direction is the same as an already identified sound source, and if it is not, assigns a new sound source ID (FIG. 5). Specifically, the controller 30 compares the target direction with the sound source direction information (FIG. 5) for the identified sound sources. When the controller 30 determines that the target direction matches the sound source direction information of any identified sound source, it treats the sound source corresponding to the target direction as the same as the (identified) sound source having the matching sound source direction information.
- When the controller 30 determines that the target direction does not match the sound source direction information of any identified sound source, it detects that a new sound source exists in the target direction and assigns a new sound source ID to it.
- Here, the target direction matching the sound source direction information includes at least the case where the target direction coincides with the direction indicated by the sound source direction information, and can further include the case where the difference or ratio of the target direction to the direction indicated by the sound source direction information is within an allowable range.
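- The matching rule for sound source identification can be sketched as follows. The function names, the dictionary used in place of the sound source database, the sequential ID scheme, and the 15-degree tolerance are all assumptions for illustration; the text only requires that the difference fall within some allowable range.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in degrees (0 to 180)."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def identify_source(
    target_deg: float,
    known_directions: dict[int, float],
    tolerance_deg: float = 15.0,
) -> int:
    """Return the ID of the identified source that matches the target direction.

    If no identified source lies within tolerance_deg, a new ID is issued
    and registered with the target direction.
    """
    best_id = None
    best_diff = tolerance_deg
    for source_id, direction in known_directions.items():
        diff = angular_difference(target_deg, direction)
        if diff <= best_diff:
            best_id, best_diff = source_id, diff
    if best_id is None:
        best_id = max(known_directions, default=0) + 1
        known_directions[best_id] = target_deg
    return best_id
```

- The wrap-around handling in `angular_difference` matters for sources behind the device, where estimates such as 175 degrees and -178 degrees refer to nearly the same direction.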
- the storage device 31 stores a speech recognition model.
- A speech recognition model describes information for identifying the correlation between a speech signal and the text corresponding to that speech signal.
- a speech recognition model is, for example, a trained model generated by machine learning. Note that the speech recognition model may be stored in an external device (for example, a cloud server) that can be accessed by the controller 30 via a network (for example, the Internet) instead of the storage device 31 .
- the controller 30 determines the text corresponding to the input speech signal by inputting the extracted speech signal into the speech recognition model.
- the controller 30 may select the speech recognition engine based on the recognition language information of the sound source corresponding to the speech signal.
- the controller 30 determines the text corresponding to the input speech signals by inputting the speech signals extracted for the angles A1 to A3 into the speech recognition model.
- The controller 30 executes machine translation (S132). Specifically, when the translation language information (FIG. 5) is set for the sound source of the voice corresponding to the text generated in step S131, the controller 30 performs machine translation of the text. Thereby, the controller 30 obtains the text in the language designated by the translation language information. The controller 30 may select a machine translation engine based on the translation language information of the sound source corresponding to the audio signal. On the other hand, when the translation language information (FIG. 5) is not set for the sound source of the speech corresponding to the text generated in step S131 (that is, when the speech is converted into text without being translated), the controller 30 can omit this step.
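- The flow of steps S131 and S132 can be sketched as follows. This is a hedged illustration only: the engines are passed in as plain callables, and the names `transcribe`, `recognizers`, `translators`, and the dictionary keys `recognition_language` and `translation_language` are hypothetical stand-ins for the recognition language information and translation language information of FIG. 5. No real speech recognition or translation library is assumed.

```python
from typing import Callable, Dict, Optional

Recognizer = Callable[[bytes], str]   # audio signal -> text
Translator = Callable[[str], str]     # text -> translated text


def transcribe(audio: bytes,
               source_info: Dict[str, Optional[str]],
               recognizers: Dict[str, Recognizer],
               translators: Dict[str, Translator]) -> str:
    """Recognize speech (S131), then translate (S132) only when a
    translation language is set for the sound source."""
    # Select the recognition engine from the source's recognition language.
    recognizer = recognizers[source_info["recognition_language"]]
    text = recognizer(audio)

    target = source_info.get("translation_language")
    if target is None:
        return text  # translation step omitted
    # Select the translation engine from the source's translation language.
    return translators[target](text)


# Stub engines used only for demonstration.
recognizers = {"ja": lambda audio: "こんにちは"}
translators = {"en": lambda text: {"こんにちは": "Hello"}.get(text, text)}
```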
- After step S132, the controller 30 generates a map image (S133). Specifically, the controller 30 generates a text image representing text based on the result of the speech recognition processing in step S131 or the result of the machine translation processing in step S132. The controller 30 places the sound source icon representing the identified sound source around the microphone icon (for example, on a circumference centered on the microphone icon) based on the direction of the sound source with respect to the multi-microphone device 50 (that is, the estimation result of step S151). The controller 30 places the text image at a predetermined position relative to the sound source icon representing the sound source of the corresponding sound.
- the controller 30 generates the map image shown in FIG.
- the microphone coordinate system is converted into the map coordinate system so that the front (x+ direction) of the microphone icon MI31 faces upward in the map image.
- the controller 30 can change the correspondence between the microphone coordinate system and the map coordinate system.
- The controller 30 can rotate the display position of each sound source icon around the display position of the microphone icon MI31, in accordance with a user instruction, so that a specific sound source icon is positioned in a predetermined direction (for example, downward) of the map coordinate system. For example, in the map image of FIG. 4, the display positions of the sound source icons SI31 to SI34 rotate counterclockwise around the display position of the microphone icon MI31 in the map image of FIG
- the sound source icon SI31 is positioned below the map image.
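- The conversion from the microphone coordinate system to the map coordinate system, and the rotation that brings a specific sound source icon to the bottom, can be sketched as follows. The sketch rests on assumptions that the source does not state: directions are angles in degrees measured counterclockwise from the front (x+ direction) of the microphone, and the map coordinate system has its origin at the microphone icon with y pointing up (a renderer with y pointing down would need to flip the sign of y). The function names are hypothetical.

```python
import math
from typing import Tuple


def icon_position(direction_deg: float,
                  radius: float = 1.0,
                  rotation_deg: float = 0.0) -> Tuple[float, float]:
    """Place a sound source icon on the circumference around the microphone icon.

    With rotation_deg == 0, the front of the microphone (0 degrees) maps to
    the top of the map image.
    """
    phi = math.radians(direction_deg + 90.0 + rotation_deg)
    return (radius * math.cos(phi), radius * math.sin(phi))


def rotation_to_place_at_bottom(direction_deg: float) -> float:
    """Rotation that moves the icon at direction_deg to the bottom of the map."""
    # Bottom corresponds to a map angle of 270 degrees: d + 90 + rot = 270.
    return (180.0 - direction_deg) % 360.0
```

- Applying the same `rotation_deg` to every icon preserves the relative arrangement of the sound sources while changing which one faces the viewer.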
- The controller 30 may also generate a map image so as to emphasize a sound source icon representing the sound source, or text relating to the sound, while the sound source is producing sound. The controller 30 may, for example, emphasize the sound source icon or text by at least one of the following:
  - Add animation
  - Enlarge the display
  - Change the color, texture, optical effect, or shape
- After step S133, the controller 30 performs information display (S134). Specifically, the controller 30 displays the map image generated in step S133 on the display 11 of the display device 10.
- FIG. 10 is a flowchart of sound source setting processing according to the present embodiment.
- FIG. 11 is a diagram showing an example of a screen displayed in the sound source setting process of this embodiment.
- the sound source setting process shown in FIG. 10 is started in response to an instruction from the user of the information processing system 1 after the sound process shown in FIG. 6 is started.
- the start timing of the sound source setting process shown in FIG. 10 is not limited to this.
- the processing in FIG. 10 may be executed as initial setting processing before starting the audio processing shown in FIG.
- the controller 30 performs sound source selection (S230). Specifically, the controller 30 displays a sound source setting UI for the user to set sound source information on the display 11 of the display device 10 . As an example, the controller 30 displays the screen of FIG. 11 on the display 11 of the display device 10.
- The screen of FIG. 11 includes a map image MP40 and a sound source setting UI (image) CU40.
- the sound source setting UI CU40 includes display objects A41 and A42 and an operation object B43.
- the display object A41 displays information of registered participants (for example, sound source icon and registered sound source name).
- The registered participants are the sound sources (speakers) that were identified in the sound source identification (S130) of FIG. 6 and whose sound source name information has been registered by the sound source setting process shown in FIG. 10.
- the display object A42 displays information about unregistered participants (eg, sound source icon and initial sound source name).
- The unregistered participants are the sound sources (speakers) that were identified in the sound source identification (S130) in FIG. 6 and whose sound source name information has not been registered (that is, sound sources that use the initial sound source name).
- the operation object B43 accepts an operation to add a participant. Specifically, the user of the information processing system 1 selects the operation object B43 and further designates one of the unregistered participants. Controller 30 may present input forms (eg, text fields, menus, radio buttons, checkboxes, or combinations thereof) on display device 10 to accept designation of unregistered participants.
- the controller 30 selects a sound source (unregistered participant) for which sound source information is to be set, according to user instructions.
- the controller 30 acquires sound source information (S231). Specifically, the controller 30 acquires the sound source information to be set for the sound source selected in step S230 according to the user's instruction. As an example, the controller 30 acquires sound source name information for the selected sound source. Further, controller 30 may acquire icon information, recognition language information, translation language information, or a combination thereof for the selected sound source. Controller 30 may display input forms (eg, text fields, menus, radio buttons, check boxes, or combinations thereof) on display 11 of display device 10 to obtain sound source information. The controller 30 may acquire the participant information of the conversation and generate the elements of the input form (menu, radio button, or check box) based on the participant information. Conversation participant information may be set manually before the start of the conversation, or may be obtained from an account name logged into the information processing system 1 or a cooperating conference system.
- the controller 30 updates the sound source information (S232). Specifically, the controller 30 updates the sound source information by associating the sound source information acquired in step S231 with the sound source ID that identifies the sound source selected in step S230 and registering it in the sound source database (FIG. 5).
- The controller 30 may end the sound source setting process shown in FIG. Alternatively, the controller 30 may repeatedly execute the sound source setting process until the user instructs it to end or until sound source information has been set for all unregistered participants.
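- Steps S230 to S232 can be sketched as follows. This is a minimal sketch, assuming an in-memory dictionary in place of the sound source database of FIG. 5. The class `SoundSource`, the helper names, and the form of the initial sound source name ("Speaker N") are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Fields that the sound source setting process is allowed to update (assumed).
EDITABLE_FIELDS = {"name", "icon", "recognition_language", "translation_language"}


@dataclass
class SoundSource:
    source_id: int
    direction_deg: float
    name: Optional[str] = None  # None means the participant is unregistered
    icon: Optional[str] = None
    recognition_language: Optional[str] = None
    translation_language: Optional[str] = None

    @property
    def display_name(self) -> str:
        """Registered name, or an initial sound source name for unregistered sources."""
        return self.name if self.name is not None else f"Speaker {self.source_id}"


def unregistered(db: Dict[int, SoundSource]) -> List[SoundSource]:
    """Candidates offered to the user in sound source selection (S230)."""
    return [s for s in db.values() if s.name is None]


def update_source(db: Dict[int, SoundSource], source_id: int, **info: str) -> SoundSource:
    """Associate the acquired sound source information (S231) with the ID (S232)."""
    source = db[source_id]
    for key, value in info.items():
        if key not in EDITABLE_FIELDS:
            raise KeyError(f"unknown sound source field: {key}")
        setattr(source, key, value)
    return source
```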
- The controller 30 of the present embodiment acquires the estimation result indicating the direction of the sound source with respect to the multi-microphone device 50, and acquires information about the content of the sound emitted from the sound source and collected by the multi-microphone device 50.
- the controller 30 generates a map image in which the text is arranged at a position corresponding to the direction of the sound source corresponding to the text with respect to the multi-microphone device 50 and displays the map image on the display 11 of the display device 10 . This allows the viewer of the map image to intuitively recognize the association between the sound source (eg, speaker) and the content of the voice (eg, utterance) emitted from the sound source.
- the controller 30 may identify individual sound sources existing around the multi-microphone device 50 based on the results of estimating the directions of the sound sources, and set sound source information regarding the identified sound sources, for example, according to user instructions. This makes it possible to appropriately set the sound source information for the sound source corresponding to the text displayed on the map image.
- the controller 30 may set at least one of sound source name information, recognition language information, and translation language information for the identified sound source. This makes it possible to clarify who said the text displayed in the map image, and to generate accurate or user-friendly text.
- The controller 30 may generate a map image that includes a microphone icon representing the multi-microphone device 50 and a sound source icon representing the sound source, in which the sound source icon is placed on a circumference centered on the microphone icon at a position indicating the direction of the corresponding sound source with respect to the multi-microphone device 50.
- This allows the viewer of the map image to intuitively recognize in which direction, relative to the multi-microphone device 50, the sound source that emitted the sound corresponding to each displayed text is located.
- the viewer of the map image can intuitively recognize which sound source in the real space the sound source icon displayed on the map image corresponds to.
- The controller 30 may display the map image so as to emphasize the sound source icon representing the sound source or the information regarding the content of the sound. This allows the viewer to easily identify the sound source and text of interest (e.g., who is speaking and what is being said) even when multiple sound source icons and multiple texts are displayed on the map image. Further, the controller 30 may rotate the display position of each sound source icon and each text around the display position of the microphone icon so that a specific sound source icon is positioned in a specific direction (for example, downward) on the map image. As a result, a speaker (for example, a hearing-impaired person) corresponding to the specific sound source icon can easily grasp the correspondence between the other speakers (sound sources) and the sound source icons in the map image.
- Modification 1 is an example of generating the minutes in addition to the map image.
- FIG. 12 is a diagram illustrating one mode of Modification 1.
- the controller 30 generates a map image and minutes while a conversation is being held by a plurality of participants, and displays them on the display 11 of the display device 10.
- the minutes correspond to an utterance history in which utterances by sound sources (speakers) around the multi-microphone device 50 are arranged in chronological order.
- the controller 30 updates the map image and the minutes according to the participant's remarks. As a result, the minutes play a role of UI for visually grasping the flow of the conversation so far (particularly, who said what) in real time.
- the controller 30 displays the map image MP50 and the minutes (image) MN50 side by side on the display 11 of the display device 10, for example.
- the minutes MN50 includes a display object A51.
- The controller 30 may display only one of the map image MP50 and the minutes MN50, as selected by the user, on the display 11 of the display device 10 instead of arranging the map image MP50 and the minutes MN50 on one screen.
- the display object A51 displays information on the speaker's utterance (for example, speaker (sound source) icon or name, utterance time, utterance content, or a combination thereof).
- When a user of the information processing system 1 finds an error (for example, an error in speech recognition or an error in machine translation) in the statement content arranged in the minutes MN50, the user can select the display object A51 that displays the statement content and edit it.
- the controller 30 acquires the edited statement content from the user via, for example, an input form, and updates the display object A51 based on the statement content. Further, if the map image MP50 includes text corresponding to the edited comment content, the controller 30 may update the text.
- the controller 30 may cause the display 11 to display a screen shown in FIG. 18 instead of the screen shown in FIG.
- the direction of the speaker with respect to the multi-microphone device 50 is indicated by displaying an arc mark on the icon of the speaker.
- the user can grasp in which direction the speaker of each utterance is present with respect to the multi-microphone device 50 by only confirming the minutes MN50 without confirming the map image MP50.
- the controller 30 generates minutes corresponding to the history of utterances by speakers present around the multi-microphone device 50 and displays them on the display 11 of the display device 10 . This allows the viewer of the minutes to easily look back on the flow of the conversation.
- FIG. 13 is a diagram showing the data structure of the utterance database of Modification 1.
- the statement database stores statement information.
- the utterance information is information about voice (utterance) collected by the multi-microphone device 50 .
- The statement database includes a "statement ID" field, a "sound source ID" field, a "statement date and time" field, and a "statement content" field. The fields are associated with each other.
- a statement ID is stored in the "statement ID" field.
- the statement ID is information that identifies the statement.
- When the controller 30 detects a new utterance from the speech recognition result or the machine translation result, it issues a new utterance ID and assigns the utterance ID to the utterance.
- the controller 30 divides the utterance according to the turn of the speaker.
- the controller 30 can delimit even a series of utterances by the same speaker according to speech boundaries (for example, silence intervals) or semantic boundaries of text.
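- Delimiting utterances by speaker turn and by silence intervals can be sketched as follows. This is a hedged illustration only: the `Segment` type, the function name, and the 1-second silence threshold are assumptions, and delimiting by semantic boundaries of the text is not covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A recognized piece of speech attributed to one sound source (hypothetical)."""
    sound_source_id: int
    start: float  # seconds
    end: float    # seconds
    text: str


def split_utterances(segments: List[Segment],
                     max_silence: float = 1.0) -> List[List[Segment]]:
    """Group segments into utterances.

    A new utterance starts when the speaker changes, or when the same speaker
    resumes after a silence longer than max_silence seconds.
    """
    utterances: List[List[Segment]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if utterances:
            last = utterances[-1][-1]
            same_speaker = last.sound_source_id == seg.sound_source_id
            if same_speaker and seg.start - last.end <= max_silence:
                utterances[-1].append(seg)
                continue
        utterances.append([seg])
    return utterances
```

- Each group returned by `split_utterances` would then receive its own utterance ID.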
- a sound source ID is stored in the "sound source ID" field.
- the sound source ID is information for identifying a speaker (sound source) who has made a statement.
- the sound source ID corresponds to a foreign key for referring to the sound source database of FIG. 5 as a parent table.
- the utterance date/time information is information related to the date/time when the utterance was made.
- the statement date and time information may be information indicating an absolute date and time, or may be information indicating an elapsed time from the start of the conversation.
- the "statement content” field stores statement content information.
- the statement content information is information about the content of the statement.
- the utterance content information is, for example, a speech recognition result for the utterance, a machine translation result for the speech recognition result, or a user's editing result for these.
- the utterance database can also be used to reproduce a map image at a specific point in time in this embodiment.
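- The data structure described above, including the sound source ID acting as a foreign key into the sound source database, can be sketched with Python's standard `sqlite3` module. The table and column names below are hypothetical, and the source does not say that a relational database is actually used.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical sound source (parent) and statement (child) tables."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript(
        """
        CREATE TABLE sound_source (
            sound_source_id INTEGER PRIMARY KEY,
            name            TEXT,
            direction_deg   REAL
        );
        CREATE TABLE statement (
            statement_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            sound_source_id INTEGER NOT NULL
                            REFERENCES sound_source(sound_source_id),
            stated_at       REAL NOT NULL,  -- absolute time or seconds since the conversation started
            content         TEXT NOT NULL   -- recognition, translation, or user-edited text
        );
        """
    )


def add_statement(conn: sqlite3.Connection, sound_source_id: int,
                  stated_at: float, content: str) -> int:
    """Register one statement and return the newly issued statement ID."""
    cur = conn.execute(
        "INSERT INTO statement (sound_source_id, stated_at, content) VALUES (?, ?, ?)",
        (sound_source_id, stated_at, content),
    )
    return cur.lastrowid
```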
- FIG. 14 is a flowchart of the audio processing of Modification 1.
- the audio processing shown in FIG. 14 is started after the display device 10, the controller 30, and the multi-microphone device 50 are powered on and the initial settings are completed.
- the start timing of the processing shown in FIG. 14 is not limited to this.
- the process shown in FIG. 14 may be repeatedly executed, for example, at a predetermined cycle, so that the user of the information processing system 1 can browse the map image and minutes updated in real time.
- the multi-microphone device 50 acquires the audio signal (S150), estimates the direction of arrival (S151), and extracts the audio signal (S152), as in FIG.
- After step S152, the controller 30 executes sound source identification (S130), speech recognition processing (S131), machine translation (S132), and map image generation (S133), as in FIG. 6. Note that the controller 30 registers the utterance information in the utterance database (FIG. 13) during steps S130 to S132.
- The controller 30 executes minutes generation (S334). Specifically, the controller 30 refers to the statement database (FIG. 13) and generates minutes. As an example, the controller 30 may generate the minutes by updating the minutes generated when step S334 was last executed (hereinafter referred to as the "previous minutes").
- controller 30 executes information display (S335). Specifically, controller 30 displays the map image generated in step S133 and the minutes generated in step S334 on display 11 of display device 10 .
- The controller 30 of Modification 1 generates the minutes based on text (that is, speech recognition results or machine translation results), and displays the minutes on the display 11 of the display device 10 side by side with the map image.
- the controller 30 may generate the minutes by arranging the texts related to the statements in chronological order of the date and time of the statements. This allows the viewer of the minutes to intuitively recognize the flow of the conversation up to that point.
- the controller 30 may also edit the text placed in the minutes according to user instructions.
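- Generating the minutes in chronological order and applying a user edit can be sketched as follows. This is a minimal sketch under assumptions: the `Statement` type, the line format of the minutes, and the fallback speaker name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Statement:
    statement_id: int
    sound_source_id: int
    stated_at: float  # seconds elapsed since the conversation started
    content: str


def build_minutes(statements: List[Statement], names: Dict[int, str]) -> List[str]:
    """Arrange statements in chronological order of statement date and time."""
    ordered = sorted(statements, key=lambda s: (s.stated_at, s.statement_id))
    lines = []
    for s in ordered:
        speaker = names.get(s.sound_source_id, f"Speaker {s.sound_source_id}")
        lines.append(f"[{s.stated_at:.1f}s] {speaker}: {s.content}")
    return lines


def edit_statement(statements: List[Statement], statement_id: int, new_content: str) -> None:
    """Replace the content of one statement according to a user instruction."""
    for s in statements:
        if s.statement_id == statement_id:
            s.content = new_content
            return
    raise KeyError(statement_id)
```

- Because the minutes are rebuilt from the stored statements, an edit made through `edit_statement` appears the next time `build_minutes` runs.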
- Modification 2 is an example of generating a map image different from that of the present embodiment.
- FIG. 15 is a diagram illustrating an example of a map image according to modification 2.
- FIG. 16 is a diagram showing another example of the map image of Modification 2. In FIG.
- the controller 30 generates a map image and displays it on the display 11 of the display device 10 while the conversation is being held by a plurality of participants.
- The map image corresponds to a bird's-eye view of the sound source (speaker) environment around the multi-microphone device 50, and text based on the voice uttered by a speaker is placed at a position based on the direction of that speaker with respect to the multi-microphone device 50.
- the controller 30 updates the map image according to the participant's speech. As a result, the map image serves as a UI for visually grasping the content of the most recent conversation (particularly, who is speaking what) in real time.
- the map image shown in FIG. 15 includes a microphone icon MI61, a circumference CI61, display objects A61 and A62, and text images TI61a, TI61b and TI62.
- Microphone icon MI61 represents multi-microphone device 50, similar to microphone icon MI31 (FIG. 4).
- the microphone icon MI61 has a mark MR61 indicating the direction of the microphone icon MI61.
- a circumference CI61 corresponds to a circumference centered on the microphone icon MI61, like the circumference CI31 (FIG. 4).
- a text image TI61a is an utterance by the first speaker, and corresponds to the utterance content with the second latest utterance date and time among the text images TI61a, TI61b, and TI62 displayed in FIG.
- The text image TI61a is arranged at a position corresponding to the direction of the first speaker with respect to the multi-microphone device 50.
- the text image TI61a is arranged along a straight line extending from the display position of the microphone icon MI61 (an example of the “origin of the map coordinate system”) toward the (estimated) direction of the first speaker.
- a text image TI61b is an utterance by the first speaker, and corresponds to the utterance content with the latest utterance date and time among the text images TI61a, TI61b, and TI62 displayed in FIG.
- the text image TI61b is placed at a position corresponding to the direction of the first speaker with respect to the multi-microphone device 50.
- Specifically, the text image TI61b is arranged along a straight line extending from the display position of the microphone icon MI61 in the (estimated) direction of the first speaker. However, the text image TI61b is arranged at a position closer to the display position of the microphone icon MI61 than the text image TI61a corresponding to the older utterance date and time.
- the display object A61 displays the (estimated) direction of the first speaker (sound source) with respect to the multi-microphone device 50.
- the display object A61 corresponds to a sector having a predetermined angular width centered on a straight line extending from the display position of the microphone icon MI61 toward the first speaker.
- the controller 30 may set a specific format for the display object A61 that is different from other objects displaying the direction of the speaker.
- the controller 30 may format the display object A61 at least partially identical to the text images TI61a, TI61b. For example, the controller 30 may align the display object A61 with a color similar to the background or characters of the text images TI61a and TI61b.
- a text image TI62 is an utterance by the second speaker, and corresponds to the utterance content with the oldest utterance date and time among the text images TI61a, TI61b, and TI62 displayed in FIG.
- the text image TI62 is arranged at a position corresponding to the direction of the second speaker with respect to the multi-microphone device 50.
- Specifically, the text image TI62 is arranged along a straight line extending from the display position of the microphone icon MI61 in the (estimated) direction of the second speaker.
- the display object A62 displays the (estimated) direction of the second speaker (sound source) with respect to the multi-microphone device 50.
- the display object A62 corresponds to a sector having a predetermined angular width centered on a straight line extending from the display position of the microphone icon MI61 toward the second speaker.
- the controller 30 may set the display object A62 to a specific format that is different from other objects displaying the direction of the speaker.
- Controller 30 may format display object A62 at least partially identical to text image TI62. For example, the controller 30 may align the display object A62 with a color similar to the background or characters of the text image TI62.
- the controller 30 updates the map image shown in FIG. 15 to the map image shown in FIG. 16 in response to new statements by the participants.
- the map image shown in FIG. 16 includes a microphone icon MI61, a circumference CI61, a display object A61, and text images TI61a, TI61b, and TI61c.
- The text image TI61a is an utterance by the first speaker, and corresponds to the utterance content with the oldest utterance date and time among the text images TI61a, TI61b, and TI61c displayed in FIG. As in FIG. 15, the text image TI61a is arranged along a straight line extending from the display position of the microphone icon MI61 toward the (estimated) direction of the first speaker. However, the controller 30 moves the display position of the text image TI61a away from the display position of the microphone icon MI61 compared to the map image shown in FIG
- the text image TI61b is an utterance by the first speaker, and corresponds to the utterance content with the second latest utterance date and time among the text images TI61a, TI61b, and TI61c displayed in FIG.
- the text image TI61b is arranged along a straight line extending from the display position of the microphone icon MI61 toward the (estimated) direction of the first speaker.
- The controller 30 moves the display position of the text image TI61b away from the display position of the microphone icon MI61 compared to the map image shown in FIG
- The text image TI61b is arranged at a position closer to the display position of the microphone icon MI61 than the text image TI61a corresponding to the older utterance date and time, and farther from the display position of the microphone icon MI61 than the text image TI61c corresponding to the newer utterance date and time.
- a text image TI61c is an utterance by the first speaker, and corresponds to the utterance content with the latest utterance date and time among the text images TI61a, TI61b, and TI61c displayed in FIG.
- the text image TI61c is arranged along a straight line extending from the display position of the microphone icon MI61 toward the (estimated) direction of the first speaker. However, the text image TI61c is arranged at a position closer to the display position of the microphone icon MI61 than the text images TI61a and TI61b corresponding to the older utterance date and time.
- The controller 30 does not place on the map image the text image TI62, which corresponds to an utterance older than that of the text image TI61a, nor the display object A62. This makes it easier for the viewer of the map image to pay attention to the content of the most recent utterances and their speaker.
- The controller 30 generates a map image by arranging multiple texts corresponding to the voices uttered by the same speaker along the (estimated) direction of the speaker with respect to the multi-microphone device 50, so that the older the corresponding utterance date and time, the farther the text is from the origin of the map coordinate system (for example, the display position of the microphone icon MI61).
- As a result, the viewer of the map image can intuitively recognize the association between the speaker and the content of each statement, and can determine the temporal order of the statements based on the distance between the display position of the text corresponding to each statement and the origin of the map coordinate system.
- Each text image may be displayed rotated in the direction corresponding to the direction of the sound source.
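- The arrangement rule of Modification 2, in which newer texts sit closer to the microphone icon and older texts move outward along the speaker's direction, can be sketched as follows. The function name, the radius and spacing values, and the angle convention (degrees counterclockwise from the front of the microphone, with the front mapped to the top of a y-up map) are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, Tuple


def layout_texts(direction_deg: float,
                 stated_times: Iterable[float],
                 inner_radius: float = 1.0,
                 spacing: float = 0.5) -> Dict[float, Tuple[float, float]]:
    """Map each utterance time to a position along the speaker's direction.

    The newest utterance is placed at inner_radius from the origin (the
    microphone icon); each older utterance is placed one spacing farther out.
    """
    phi = math.radians(direction_deg + 90.0)  # microphone front faces up on the map
    placements: Dict[float, Tuple[float, float]] = {}
    for rank, stated_at in enumerate(sorted(stated_times, reverse=True)):
        r = inner_radius + rank * spacing
        placements[stated_at] = (r * math.cos(phi), r * math.sin(phi))
    return placements
```

- Recomputing the layout whenever a new utterance arrives reproduces the outward movement of the older text images seen between FIG. 15 and FIG. 16.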
- Modification 3 is an example of generating a map image for each of a plurality of multi-microphone devices installed at different locations.
- 17A and 17B are diagrams illustrating an example of a map image according to Modification 3.
- During a conversation with multiple participants in different locations (e.g., different conference rooms, different offices, or different companies), the controller 30 generates a map image for each location and displays it on the display 11 of the display device 10. Each map image corresponds to a bird's-eye view of the sound source (speaker) environment around the multi-microphone device 50 installed at each location, and text based on the sound emitted from each sound source is placed on it.
- the controller 30 updates the map image according to the participant's speech.
- the map image serves as a UI for visually grasping in real time the content of the most recent conversations at each location (particularly, who is speaking what at which location).
- the controller 30 displays a map image MP71 of the first location and a map image MP72 of the second location side by side on the display 11 of the display device 10, for example.
- the controller 30 may display only one of the map images MP71 and MP72 selected by the user on the display 11 of the display device 10 instead of arranging the map images MP71 and MP72 on one screen.
- the controller 30 generates map images for each of the multiple multi-microphone devices 50 installed at different locations.
- The viewer of the map image can intuitively recognize the association between the location, the speaker, and the utterance content.
- the speaker at the second location can be easily identified. In other words, it is possible to compensate for the deterioration of the presence caused by the remote conference.
- the storage device 31 may be connected to the controller 30 via a network.
- Each step of the above information processing can be executed by any of the display device 10, the controller 30 and the multi-microphone device 50.
- the controller 30 may acquire multi-channel audio signals generated by the multi-microphone device 50, estimate the direction of arrival (S151), and extract the audio signal (S152).
- the display device 10 and the controller 30 are independent devices.
- display device 10 and controller 30 may be integrated.
- the display device 10 and controller 30 can be implemented as one tablet terminal or personal computer.
- the multi-microphone device 50 and the display device 10 or the controller 30 may be integrated.
- the controller 30 may reside in a cloud server.
- the display device 10 is an electronic device such as a tablet terminal, a personal computer, a smart phone, a conference display device, etc., which can easily share display contents with multiple users.
- display device 10 may also be configured to be wearable on a human head.
- display device 10 may be a glasses-type display device, a head-mounted display, a wearable device, or smart glasses.
- the display device 10 may be an optical see-through glass type display device, but the format of the display device 10 is not limited to this.
- display device 10 may be a video see-through glass-type display device. That is, display device 10 may comprise a camera.
- the display device 10 may display on the display 11 a synthesized image obtained by synthesizing the text image generated based on the voice recognition and the captured image captured by the camera.
- the captured image is an image captured in front of the user and may include an image of the speaker.
- The display device 10 may perform AR (Augmented Reality) display by synthesizing a text image generated based on voice recognition and a captured image captured by a camera, for example, in a smartphone, personal computer, or tablet terminal.
- a plurality of display devices 10 may be connected to one controller 30 .
- The layout of the map image (for example, the correspondence between the microphone coordinate system and the map coordinate system) or the translation language information may be configured to be changeable for each display device 10.
- the display 11 may be implemented by any method as long as it can present an image to the user.
- The display 11 can be implemented, for example, by the following methods:
  - HOE (holographic optical element)
  - DOE (diffractive optical element)
  - Optical element (as an example, a light guide plate)
  - Liquid crystal display
  - Retinal projection display
  - LED (Light Emitting Diode)
  - Organic EL (Electro Luminescence)
  - Laser display
  - Optical element (for example, lens, mirror, diffraction grating, liquid crystal, MEMS mirror, HOE)
- the display 11 may display only a portion of the map image (for example, the upper half). Thereby, even when the display area of the display 11 is small, the visibility of the text image and the like can be maintained. A part of the map image displayed on the display 11 may be switched according to a user instruction or automatically.
- a user's instruction may be input from an operation unit provided in the display device 10 .
- any implementation method can be used as long as voice signals corresponding to a specific speaker can be extracted.
- the multi-microphone device 50 may extract audio signals by, for example, the following method.
- Frost beamformer
- Adaptive filter beamforming (as an example, generalized sidelobe canceller)
- Speech extraction methods other than beamforming (for example, frequency filter or machine learning)
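As a point of reference for the beamforming options listed above, the sketch below shows a plain delay-and-sum beamformer. It is deliberately simpler than the Frost beamformer or the sidelobe-canceller approach named in the list and is not taken from the application. The function name `delay_and_sum`, the use of integer sample delays, and the toy signals are assumptions for illustration.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Align each microphone channel by its steering delay and average.

    channels: one list of samples per microphone (equal lengths assumed).
    delays:   hypothetical arrival delays in samples for the target
              direction; channel m is advanced by delays[m] samples.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # sample that carries the target signal at time t
            if 0 <= src < n:
                out[t] += samples[src]
    return [v / len(channels) for v in out]


# Toy example: the same pulse reaches microphone 1 two samples later.
mic0 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
aligned = delay_and_sum([mic0, mic1], delays=[0, 2])     # steered at the source
misaligned = delay_and_sum([mic0, mic1], delays=[0, 0])  # steered elsewhere
```

When the delays match the source direction the pulse adds coherently to full amplitude; with mismatched delays it is spread over two samples at half amplitude, which is the basic mechanism by which a beamformer favors one speaker.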
- the controller 30 may obtain text posted by chat participants in the chat associated with the conversation and place the text (image) on the map image.
- the controller 30 may arrange contributor icons representing chat participants on the map image in the same manner as the sound source icons. This makes it easier for the conversation participants to recognize the content posted by the chat participants.
- the text posted by the chat participant (hereinafter referred to as "posted text")
- the display position of the poster icon can be determined by various techniques.
- the controller 30 may display the poster icon or the posted text outside the circumference CI31 or CI61, for example, to distinguish it from the sound source icon or the text about the statement.
- when the controller 30 detects that the chat participant is the same person as one of the speakers, the controller 30 may display the text posted by that person according to the same rule as the text of that speaker's statements. In this way, the statements and the posts of the same person may be aggregated.
- the controller 30 may determine the orientation of chat participants with respect to the multi-microphone device 50 in accordance with user instructions, and arrange poster icons or posted texts (for example, on the circumference CI31) based on the determined orientation.
- the controller 30 may move the display position of the poster icon or the posted text on the map image in accordance with the user's instruction.
- the display position of the poster icon or posted text is optimized (for example, the speaker sound source icon and text image).
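The placement of poster icons relative to a circumference, as described in the preceding items, can be sketched as a polar-to-Cartesian conversion. The angle convention, the radius value, and the scale factor that pushes poster icons outside the circumference are assumptions for illustration, as is the function name `icon_position`.

```python
import math
from typing import Tuple


def icon_position(azimuth_deg: float, radius: float,
                  center: Tuple[float, float] = (0.0, 0.0)) -> Tuple[float, float]:
    """Convert an orientation relative to the microphone device to map coordinates.

    Assumed convention: 0 degrees points along +x and angles grow
    counterclockwise; the microphone device sits at `center`.
    """
    theta = math.radians(azimuth_deg)
    return (center[0] + radius * math.cos(theta),
            center[1] + radius * math.sin(theta))


CIRCLE_RADIUS = 100.0  # hypothetical radius of the circumference, in pixels
POSTER_SCALE = 1.2     # hypothetical factor placing poster icons outside it

speaker_xy = icon_position(90.0, CIRCLE_RADIUS)                # on the circumference
poster_xy = icon_position(90.0, CIRCLE_RADIUS * POSTER_SCALE)  # outside it
```

Using a larger radius for chat participants keeps their icons on the same bearing as a chosen orientation while visually separating them from the sound source icons.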
- In Modified Example 1, an example has been described in which minutes are generated and the contents of remarks placed in the minutes can be edited by the user.
- the user may add a supplementary explanation about the statement, without being limited to correcting the content of the statement itself. As a result, it is possible to prevent the gist of the statement from being misunderstood or misinterpreted by readers of the minutes.
- Controller 30 may obtain text posted by chat participants in a chat associated with the conversation and generate the minutes further based on that text. In this case, the controller 30 generates the minutes by arranging the posted texts and the texts indicating the contents of the statements in chronological order of posting date/time or speaking date/time. For example, the posted text and the text indicating the content of the statement may be arranged in the same window in chronological order. This makes it easier for the participants in the conversation to recognize the content posted by the chat participants, and helps prevent that content from being overlooked when the flow of the discussion is reviewed.
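A minimal sketch of the chronological arrangement described above follows. The `Entry` record, its field names, and the output line format are assumptions; the application only states that posted texts and statement texts are ordered by posting or speaking date/time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Entry:
    timestamp: datetime  # speaking date/time or posting date/time
    author: str
    text: str
    kind: str            # "statement" (speech) or "post" (chat)


def build_minutes(statements: List[Entry], posts: List[Entry]) -> List[str]:
    """Interleave statements and chat posts in chronological order."""
    merged = sorted(statements + posts, key=lambda e: e.timestamp)
    return [f"{e.timestamp:%H:%M:%S} [{e.kind}] {e.author}: {e.text}"
            for e in merged]


statements = [
    Entry(datetime(2023, 2, 20, 10, 0, 5), "Speaker A", "Let us begin.", "statement"),
    Entry(datetime(2023, 2, 20, 10, 1, 30), "Speaker B", "I agree.", "statement"),
]
posts = [
    Entry(datetime(2023, 2, 20, 10, 0, 40), "Chat user C", "Link to the agenda?", "post"),
]
minutes = build_minutes(statements, posts)
```

Because both kinds of entry go through one sort, the chat post lands between the two spoken statements, so a reader following the discussion sees it in context.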
- In Modified Example 2, an example was shown in which text images corresponding to three utterance contents are arranged on the map image in order of date and time of occurrence.
- the number of text images arranged on the map image may be two or less, or may be four or more.
- the number of text images arranged on the map image may be fixed, or may be variable according to various conditions (for example, the size of the map image, the number of characters included in the content of the statement, etc.).
- the text image to be placed on the map image may be determined depending on whether or not the elapsed time from the date and time of the statement corresponding to the text image is within a threshold.
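The selection rules described in the preceding items (a count limit and an elapsed-time threshold) can be combined as in the sketch below. The function name `select_texts`, the tuple layout, and the concrete threshold and count values are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Statement = Tuple[datetime, str]  # (date and time of the statement, text)


def select_texts(statements: List[Statement], now: datetime,
                 max_age: timedelta, max_count: int) -> List[str]:
    """Return the texts to place on the map image, oldest first.

    A statement qualifies if its elapsed time is within max_age; of the
    qualifying statements, only the newest max_count are kept.
    """
    if max_count <= 0:
        return []
    recent = sorted((s for s in statements if now - s[0] <= max_age),
                    key=lambda s: s[0])
    return [text for _, text in recent[-max_count:]]


now = datetime(2023, 2, 20, 10, 5, 0)
history = [
    (datetime(2023, 2, 20, 10, 0, 0), "old remark"),
    (datetime(2023, 2, 20, 10, 4, 10), "first"),
    (datetime(2023, 2, 20, 10, 4, 30), "second"),
    (datetime(2023, 2, 20, 10, 4, 50), "third"),
]
shown = select_texts(history, now, max_age=timedelta(seconds=60), max_count=2)
```

Here the five-minute-old remark is dropped by the threshold, and the count limit then keeps only the two newest of the remaining three statements.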
- the map image described in this embodiment and the map image described in modification 2 can be combined.
- the sound source icon described in this embodiment may be displayed.
- controller 30 may generate map images for more than two locations.
- controller 30 may generate the minutes by arranging the content of statements made by the participants at a plurality of locations in chronological order. In this case, the controller 30 may collect the statements of each participant into the same minutes regardless of where the participants are.
- information processing system; 10: display device; 11: display; 30: controller; 31: storage device; 32: processor; 33: input/output interface; 34: communication interface; 50: multi-microphone device
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Otolaryngology (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Theoretical Computer Science (AREA)
- Circuit For Audible Band Transducer (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
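| JP2023523217A JP7399413B1 (ja) | 2022-02-21 | 2023-02-20 | Information processing device, information processing method, and program |
| JP2023199974A JP2024027122A (ja) | 2022-02-21 | 2023-11-27 | Information processing device, information processing method, and program |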
| US18/808,209 US20240410969A1 (en) | 2022-02-21 | 2024-08-19 | Information processing apparatus and information processing method |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2022024504 | 2022-02-21 | | |
| JP2022-024504 | 2022-02-21 | | |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/808,209 Continuation US20240410969A1 (en) | 2022-02-21 | 2024-08-19 | Information processing apparatus and information processing method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023157963A1 (ja) | 2023-08-24 |
Family
ID=87578686
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/005887 Ceased WO2023157963A1 (ja) | 2022-02-21 | 2023-02-20 | Information processing device, information processing method, and program |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20240410969A1 |
| JP (2) | JP7399413B1 |
| WO (1) | WO2023157963A1 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2025183739A (ja) | 2024-06-05 | 2025-12-17 | FUJIFILM Business Innovation Corp. | Information processing system, operation device, and program |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2011165056A (ja) * | 2010-02-12 | 2011-08-25 | Nec Casio Mobile Communications Ltd | Information processing device and program |
| JP2012059121A (ja) * | 2010-09-10 | 2012-03-22 | Softbank Mobile Corp | Eyeglass-type display device |
| WO2014097748A1 (ja) * | 2012-12-18 | 2014-06-26 | International Business Machines Corporation | Method for processing the voice of a specific speaker, and electronic device system and electronic device program therefor |
| JP2016029466A (ja) * | 2014-07-16 | 2016-03-03 | Panasonic Intellectual Property Corporation of America | Control method for a speech recognition text conversion system and control method for a mobile terminal |
| JP2021067830A (ja) * | 2019-10-24 | 2021-04-30 | Japan Cash Machine Co., Ltd. | Minutes creation system |
| JP2021136606A (ja) * | 2020-02-27 | 2021-09-13 | Oki Electric Industry Co., Ltd. | Information processing device, information processing system, information processing method, and information processing program |
| WO2021230180A1 (ja) * | 2020-05-11 | 2021-11-18 | Pixie Dust Technologies, Inc. | Information processing device, display device, presentation method, and program |
2023
- 2023-02-20 JP JP2023523217A: JP7399413B1 (ja), Active
- 2023-02-20 WO PCT/JP2023/005887: WO2023157963A1 (ja), Ceased
- 2023-11-27 JP JP2023199974A: JP2024027122A (ja), Pending
2024
- 2024-08-19 US US18/808,209: US20240410969A1 (en), Pending
Also Published As
| Publication number | Publication date |
|---|---|
| JP2024027122A (ja) | 2024-02-29 |
| JP7399413B1 (ja) | 2023-12-18 |
| US20240410969A1 (en) | 2024-12-12 |
| JPWO2023157963A1 | 2023-08-24 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 2023523217; Country of ref document: JP |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23756487; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 23756487; Country of ref document: EP; Kind code of ref document: A1 |