WO2022224586A1

WO2022224586A1 - Information processing device, information processing method, program, and information recording medium

Info

Publication number: WO2022224586A1
Application number: PCT/JP2022/008277
Authority: WO
Inventors: 雅俊浜中
Original assignee: 国立研究開発法人理化学研究所
Priority date: 2021-04-20
Filing date: 2022-02-28
Publication date: 2022-10-27
Also published as: JPWO2022224586A1

Abstract

Provided is an information processing device (101) that estimates the direction of the face of a user in the real world and outputs information corresponding to the estimation. The information processing device (101) comprises a camera (151). A detection unit (111) detects a first direction of the information processing device (101) in a first coordinate system fixed in the real world. When a photograph image photographed by the camera (151) includes a face image of the user, an estimation unit (112) estimates, from the photograph image and the face image, a second direction of the face of the user in a second coordinate system fixed with respect to the information processing device (101). A calculation unit (113) calculates a third direction of the face of the user in the first coordinate system from the detected first direction and the estimated second direction. An output unit (114) outputs information corresponding to the calculated third direction.

Description

Information processing device, information processing method, program, and information recording medium

The present invention relates to an information processing device, an information processing method, a program, and an information recording medium for estimating the orientation of a user's face in the real world and outputting information according to this.

Conventionally, techniques have been proposed for outputting information according to the movement of the user's head. For example, the sound source selection device disclosed in Patent Document 1 comprises:
headphones;
a virtual sound source providing means for providing a plurality of virtual sound sources localized via the headphones to the listener wearing the headphones; and
virtual sound source selection means for selecting one virtual sound source from the plurality of virtual sound sources.
The virtual sound source providing means comprises:
localized sound source arrangement pattern storage means for storing a plurality of localized sound source arrangement patterns of the plurality of virtual sound sources to be provided to the listener;
arrangement pattern selection means for selecting a desired pattern from the plurality of localized sound source arrangement patterns according to the listener's selection action;
mixing means for providing the plurality of virtual sound sources according to the localized sound source arrangement pattern;
a head movement detection sensor mounted on the headphones and detecting movement of the listener's head; and
head motion determination means for determining the motion of the head based on the output of the head movement detection sensor.
The arrangement pattern selection means is configured to select another localized sound source arrangement pattern from the localized sound source arrangement pattern storage means and output it to the mixing means when the head motion determination means detects a predetermined arrangement pattern changing motion from the movement of the head.

On the other hand, many recent smartphones and tablets have a front camera (sometimes called an in-camera) as well as a rear camera (sometimes called a back camera) whose shooting direction is opposite to that of the front camera, which allows the user to shoot while checking on the screen the state of the world spreading out in front of the user.

In recent years, smartphones and tablets have become able to detect their own position and orientation with respect to a world coordinate system fixed in the real world, using geolocation detection functions based on GPS (Global Positioning System), Wi-Fi access points, or Bluetooth (registered trademark) beacons, as well as acceleration sensors, geomagnetic sensors, and the like.

Moreover, technologies that provide augmented reality functions that display an augmented version of the real world on the screens of smartphones and tablets are also spreading.

Patent Document 1: Patent No. 4837512

Here, in the technology disclosed in Patent Document 1, a head movement detection sensor included in headphones is used to detect movement of the user's head.

However, although audio equipment such as headphones and earphones with noise canceling and external audio capture functions is becoming popular for use with smartphones and tablets, most of it does not have a head movement detection sensor.

Therefore, there is a strong demand for technology that can estimate the orientation of a user's face using the functions of smartphones and tablets that are already in widespread use.

The present invention is intended to solve the above problems, and relates to an information processing apparatus, an information processing method, a program, and an information recording medium for estimating the orientation of a user's face in the real world and outputting information according to the orientation.

An information processing apparatus according to the present invention has a camera, and
detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
if the photographed image taken by the camera contains the face image of the user, estimates, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
outputs information corresponding to the calculated third orientation.

According to the present invention, it is possible to provide an information processing device, an information processing method, a program, and an information recording medium for estimating the direction of a user's face in the real world and outputting information according to this.

FIG. 1 is an explanatory diagram showing a schematic configuration of an information processing device according to an embodiment of the present invention.
FIG. 2 is a flow chart showing control of an information processing method executed by the information processing apparatus according to the embodiment of the present invention.
The subsequent drawings are drawing-substitute photographs, each provided in a grayscale version and a monochrome binary version, showing the following displays by the information processing apparatus according to the embodiment of the present invention:
eight display examples;
a display example of a stage in a virtual concert venue;
a display example of a virtual room in which a plurality of displays are arranged;
a display example of a virtual room in which a plurality of pieces of moving image content are arranged; and
three display examples in which attention is paid to one piece of moving image content in the virtual room.
The final drawing is an explanatory diagram showing a schematic configuration of an information processing device that processes an object of interest according to an embodiment of the present invention.

Embodiments of the present invention are described below. In addition, this embodiment is for description and does not limit the scope of the present invention. Therefore, those skilled in the art can adopt embodiments in which each element or all of the elements of this embodiment are replaced with equivalents. Also, the elements described in each embodiment can be omitted as appropriate depending on the application. As such, any embodiment constructed in accordance with the principles of the present invention is within the scope of the present invention.

(Configuration)
FIG. 1 is an explanatory diagram showing a schematic configuration of an information processing device according to an embodiment of the present invention. An outline will be described below with reference to this figure.

As shown in this figure, the information processing apparatus 101 according to this embodiment has a camera 151, a detection unit 111, an estimation unit 112, a calculation unit 113, and an output unit 114. Audio equipment 152, a display screen 153, and the like can be employed as output destinations of the information.

The information processing apparatus 101 according to this embodiment is typically realized by executing a program on a portable computer such as a smart phone or a tablet. The computer is connected to various output devices and input devices, and exchanges information with these devices.

Programs run on a computer can be distributed and sold by a server to which the computer is communicatively connected. Alternatively, a program can be recorded on a non-transitory information recording medium such as a CD-ROM (Compact Disc Read-Only Memory), flash memory, or EEPROM (Electrically Erasable Programmable ROM), and the information recording medium can then be distributed and sold.

The program is installed on a computer's hard disk, solid state drive, flash memory, EEPROM, or other non-transitory information recording medium. The computer then realizes the information processing apparatus according to the present embodiment. In general, a computer's CPU (Central Processing Unit) reads a program from an information recording medium into RAM (Random Access Memory) under the control of the computer's OS (Operating System), and then interprets and executes the code contained in the program. However, in architectures that allow an information recording medium to be mapped into the memory space accessible by the CPU, explicit loading of the program into RAM may not be necessary. Various information required in the process of program execution can be temporarily recorded in the RAM.

Furthermore, it is desirable that the computer has a GPU (Graphics Processing Unit) for performing various image processing calculations at high speed. By using a GPU together with a library such as TensorFlow, it becomes possible to use learning functions and classification functions in various artificial intelligence processing under the control of the CPU.

It is also possible to configure the information processing apparatus 101 of the present embodiment using a dedicated electronic circuit instead of implementing the information processing apparatus of the present embodiment using a computer on which software is installed. For example, a portable camera, a portable electronic game device, or the like can be used as the information processing device 101 .

In this aspect, the program can also be used as material for generating wiring diagrams, timing charts, etc. of electronic circuits. In such an aspect, an electronic circuit that satisfies the specifications defined in the program is configured by an FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and the information processing apparatus of this embodiment is realized by the electronic circuit functioning as a dedicated device that fulfills the functions defined in the program.

For ease of understanding, the information processing apparatus 101 will be described below assuming that it is implemented by a computer executing a program.

It should be noted that the information processing apparatus 101 can be connected wirelessly or by wire to audio equipment 152 such as headphones, earphones, neck speakers, bone conduction speakers, hearing aids, etc., as information output destinations. These audio devices 152 desirably have an external audio capture function.

Further, as described above, the screen 153 such as a liquid crystal display, an organic EL (Organic Electro-Luminescence) display, or a paper display using electronic ink, which the information processing apparatus 101 has, can be adopted as an information output destination. By using these displays as touch screens, they can also function as input devices for the information processing apparatus 101 .

Now, in the information processing device 101 of this embodiment, the detection unit 111 detects the first orientation of the information processing device 101 in the first coordinate system fixed to the real world.

The orientation (first orientation) of the information processing device 101 in the first coordinate system fixed in the real world can be detected via a geomagnetic sensor, an inertial sensor for detecting gravity, an acceleration sensor, a gyro sensor, etc., which the information processing device 101 has.

The position (first position) of the information processing device 101 in the first coordinate system can also be detected by a geolocation detection function using GPS, Wifi access points, Bluetooth beacons, or the like.

On the other hand, if the captured image captured by the camera 151 includes the face image of the user, the estimation unit 112 estimates, from the captured image and the face image, a second orientation of the user's face in the second coordinate system fixed to the information processing device 101.

That is, the information processing apparatus 101 extracts the face image drawn in the captured image by image recognition, recognizes characteristic parts such as the eyes, nose, and mouth, and, based on the face image, estimates the orientation of the user's face relative to the information processing apparatus 101 (second orientation). A general face tracking technique can be applied to this process.

The position (second position) of the user's face relative to the information processing device 101 may be further estimated based on the position and size of the face image in the captured image.

Furthermore, the calculation unit 113 calculates a third orientation of the user's face in the real world (first coordinate system) from the detected first orientation and the estimated second orientation.

The directional transformation between the first coordinate system and the second coordinate system can be uniquely defined based on the first orientation. Further, when the first position is detected, the coordinate transformation of coordinate values between the first coordinate system and the second coordinate system can be uniquely determined based on the first orientation and the first position.

Therefore, by converting the components of the second orientation, which is estimated from the captured image and expressed in the second coordinate system, into components in the first coordinate system, a third orientation can be calculated that represents which direction the user's face is facing with respect to the world (the earth).
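The conversion described above can be illustrated with a minimal sketch. It assumes the first orientation is available as a 3x3 rotation matrix from device coordinates to world coordinates and the second orientation as a direction vector in device coordinates; the function names and the use of a rotation about the vertical axis as the example are hypothetical and chosen only for illustration.

```python
import math

def rotation_about_vertical(yaw_rad):
    """3x3 rotation matrix for a turn of yaw_rad about the vertical (z) axis."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def rotate(matrix, vector):
    """Apply a 3x3 rotation matrix to a 3-component vector."""
    return [sum(matrix[i][j] * vector[j] for j in range(3)) for i in range(3)]

def third_orientation(device_to_world, face_dir_in_device):
    """Express the face direction (second orientation) in world coordinates."""
    return rotate(device_to_world, face_dir_in_device)

# Assumed example: the device is turned 90 degrees about the vertical axis,
# and the face direction lies along the device's x axis.
device_to_world = rotation_about_vertical(math.radians(90.0))
face_world = third_orientation(device_to_world, [1.0, 0.0, 0.0])
```

On an actual device the rotation matrix would come from the geomagnetic, gravity, and gyro sensors rather than from a single yaw angle.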

Then, the output unit 114 outputs information corresponding to the calculated third orientation. As the information output destination, the audio equipment 152 worn by the user or the screen 153 of the display can be adopted.

As the information to be output, audio information can be employed that is obtained by setting one or more virtual sound sources in the real world and mixing them while changing the intensity, tone, phase, etc. of the waveform associated with each virtual sound source according to the third orientation.

Assuming that the virtual sound source is virtually placed far enough away in the real world, only the virtual direction (virtual azimuth) from the listening point may be predetermined and associated. Also, the virtual sound source may be virtually arranged at a position in the real world.

In the former case, the output unit 114 mixes the virtual sound source with an intensity (amplification factor) according to the angle difference between
a virtual orientation associated with the virtual sound source; and
the calculated third orientation.
If the angle difference is small, the virtual sound source is in front of the user, so by increasing the intensity of the waveform during mixing, it becomes possible to provide the user with audio augmented reality that changes according to the direction of the face.
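As a rough illustration of an amplification factor that depends on the angle difference, the following sketch assumes orientations are reduced to azimuths in degrees and uses a raised-cosine curve between a minimum and a maximum gain. The curve shape and the values of `min_gain` and `max_gain` are assumptions made here; the text does not specify them.

```python
import math

def angle_difference_deg(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in the range 0..180."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def gain_from_angle(diff_deg, min_gain=0.2, max_gain=1.0):
    """Assumed raised-cosine mapping: the gain is largest when the source is in front."""
    weight = 0.5 * (1.0 + math.cos(math.radians(diff_deg)))
    return min_gain + (max_gain - min_gain) * weight
```

For instance, `gain_from_angle(angle_difference_deg(350.0, 10.0))` evaluates the gain for a source 20 degrees away from the direction the face is pointing.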

When there are multiple virtual sound sources, the amplification factors are adjusted during mixing so that the ratio of the amplification factors based on the angle difference for each virtual sound source is maintained while the sum of the power of the virtual sound sources remains almost constant, so that the average sound pressure does not change significantly even if the face is rotated through a full turn. This makes it possible to emphasize a specific virtual sound source while maintaining the power of the virtual sound sources as a whole. This is called power correction.
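One possible reading of power correction is sketched below. It assumes that every virtual sound source has the same unscaled power, so that the total power is proportional to the sum of the squared amplification factors; the function name and `target_power` are hypothetical.

```python
import math

def power_correct(gains, target_power=1.0):
    """Rescale gains so that the sum of squares equals target_power, keeping their ratios."""
    total = sum(g * g for g in gains)
    if total == 0.0:
        return list(gains)  # nothing to scale when every gain is zero
    scale = math.sqrt(target_power / total)
    return [g * scale for g in gains]

corrected = power_correct([3.0, 4.0])
```

Because every gain is multiplied by the same factor, the relative emphasis derived from the angle differences is preserved.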

It is also possible to let the user know the direction of the virtual sound source by changing the amplification factor and time difference between the left and right stereo outputs according to the angle difference. For example, if there is a virtual sound source on the right side of the user, simple binaural playback can be realized by setting the amplification factor of the right side larger than that of the left side, or by setting the time difference so that the right side precedes the left side, and the user can feel the direction of the virtual sound source.
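The left and right amplification factors and time difference could be derived as in the following sketch. It assumes a signed angle that is positive when the source is to the user's right, a constant-power pan law, and a maximum time difference of 0.6 ms; all three are illustrative assumptions, and sources behind the user are simply clamped to the side.

```python
import math

def stereo_params(signed_angle_deg, max_itd_s=0.0006):
    """Return (left_gain, right_gain, left_delay_s, right_delay_s) for a source direction."""
    angle = max(-90.0, min(90.0, signed_angle_deg))   # clamp to the frontal half-plane
    pan = math.radians((angle + 90.0) / 2.0)          # 0 (full left) .. pi/2 (full right)
    left_gain, right_gain = math.cos(pan), math.sin(pan)
    itd = max_itd_s * math.sin(math.radians(angle))   # positive: right channel leads
    left_delay = max(itd, 0.0)                        # delay the far (left) ear
    right_delay = max(-itd, 0.0)                      # delay the far (right) ear
    return left_gain, right_gain, left_delay, right_delay
```

With a source directly to the right, the right channel is louder and the left channel is delayed, so the right side precedes the left.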

In the latter case, the virtual direction associated with the virtual sound source is calculated from
a virtual position where the virtual sound source is located in the first coordinate system; and
a first orientation and a first position detected by the detection unit 111,
after which the same processing as in the former case may be performed.
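Deriving a virtual direction from positions can be sketched as follows, assuming a flat two-dimensional world frame with x pointing east and y pointing north, and an azimuth measured clockwise from north. These conventions and the function name are assumptions made for the example.

```python
import math

def virtual_azimuth_deg(listener_xy, source_xy):
    """Azimuth of the source as seen from the listening point, in degrees (0..360)."""
    dx = source_xy[0] - listener_xy[0]   # east offset
    dy = source_xy[1] - listener_xy[1]   # north offset
    return math.degrees(math.atan2(dx, dy)) % 360.0
```

The resulting azimuth can then be compared with the azimuth of the third orientation in the same way as a predetermined virtual direction.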

Note that if the measurement accuracy of the first orientation and the first position and the estimation accuracy of the second position are sufficiently high, the position of the user's face in the real world (third position) may be used in place of the first position. The third position can be obtained by coordinate-transforming the face position relative to the information processing device 101 obtained by face tracking (second position) into the first coordinate system fixed in the real world.

In addition, the orientation of the user's face may be displayed on the display screen 153 like a compass. If a display mode is adopted in which the direction of the "needle" of the "compass" changes as the user changes the direction of the face, the user can confirm that the present embodiment is operating properly as long as the display screen 153 is within the user's field of vision.

In addition, the waveform of the virtual sound source may be corrected in respects other than intensity according to the angle difference between the virtual direction associated with the virtual sound source and the calculated third orientation, so that the virtual sound source is emphasized for the user to hear.

For example, direct sound and reverb sound may be generated based on the waveform of the virtual sound source, and the mixing ratio of the two may be changed according to the angular difference. If the angle difference is small, the user can be made to feel that the virtual sound source is being heard loudly from the front side by increasing the ratio of the direct sound. This is called echo correction.
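A minimal sketch of echo correction follows. It assumes the direct and reverb signals are already available as equal-length lists of samples and that the direct-sound ratio falls linearly with the angle difference; the endpoint ratios `front_ratio` and `back_ratio` are hypothetical values.

```python
def direct_ratio(diff_deg, front_ratio=0.9, back_ratio=0.4):
    """Assumed linear mapping from angle difference (0..180 degrees) to direct-sound ratio."""
    t = min(max(diff_deg, 0.0), 180.0) / 180.0
    return front_ratio + (back_ratio - front_ratio) * t

def mix_direct_reverb(direct, reverb, diff_deg):
    """Blend direct and reverb samples; a small angle difference favours the direct sound."""
    ratio = direct_ratio(diff_deg)
    return [ratio * d + (1.0 - ratio) * w for d, w in zip(direct, reverb)]
```

Generating the reverb signal itself (for example by convolution with a room response) is outside the scope of this sketch.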

In addition, the central frequency range of the virtual sound source in front may be obtained, and for the virtual sound sources in other directions, that range may be attenuated with an equalizer. This reduces frequency overlap (masking) and makes the virtual sound source on the front side stand out for the user. This is called center range correction.

In addition, for the virtual sound source on the front side, it is possible to add saturation that strengthens the overtone components to make the sound brilliant, and make the virtual sound source on the front side stand out for the user to listen to. This is called saturation correction.
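One common way to strengthen overtone components is soft clipping with a hyperbolic tangent, sketched below. This particular waveshaper is an assumption, since the text does not say how the saturation is produced; `drive` (which must be positive) and `amount` are hypothetical parameters, and samples are assumed to lie in the range -1 to 1.

```python
import math

def saturate(sample, drive=2.0, amount=1.0):
    """Blend a sample with its tanh-shaped version; amount=0 leaves it unchanged."""
    shaped = math.tanh(drive * sample) / math.tanh(drive)  # normalised so 1.0 maps to 1.0
    return (1.0 - amount) * sample + amount * shaped
```

The `amount` parameter could be tied to the angle difference so that only the virtual sound source on the front side is noticeably saturated.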

Now, when the camera 151 of the information processing apparatus 101 is a so-called front camera, its photographing direction matches the display direction of the screen 153 of the display and faces the direction in which the user is assumed to be positioned.

Therefore, if the user is looking at the screen 153 of the information processing device 101 from the front, the user's face should be captured by the camera 151 .

Therefore, if the user's face is not captured by the camera 151, it may be assumed that the user is not concentrating on listening to a specific virtual sound source, and the dramatic correction may be set to an average default value.

It should be noted that the dramatic correction can also be adjusted by the user's gestures.

For example, when the user brings his or her face closer to the screen 153 of the display, it is assumed that the user is trying to concentrate on the front, and dramatic correction may be made to emphasize the virtual sound source in front. In this aspect, it is sufficient to change the strength of the dramatic correction based on the size of the face image drawn in the captured image.
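Changing the strength based on the size of the face image might look like the following sketch. It assumes the face tracker reports the width of the face image in pixels, and the ratios `far_ratio` and `near_ratio` at which the strength reaches 0 and 1 are illustrative guesses.

```python
def strength_from_face_size(face_width_px, image_width_px, far_ratio=0.2, near_ratio=0.6):
    """Map the face-width to image-width ratio onto a correction strength in 0..1."""
    ratio = face_width_px / image_width_px
    t = (ratio - far_ratio) / (near_ratio - far_ratio)
    return min(1.0, max(0.0, t))
```

A face filling more of the captured image is treated as closer to the screen and yields a stronger correction.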

In addition, it is also possible to change the strength of the dramatic correction by the user's gesture. For example, assuming that the user is trying to concentrate on the front by a gesture of listening, dramatic correction may be made to emphasize the virtual sound source in front.

For example, the user's face image and hand image are recognized in the captured image, the position of the user's face (for example, the center position of the face) and the position of the user's hand (for example, the position of the tip of the little finger) are estimated, and the intensity of the dramatic correction is changed depending on the distance (closeness) between the two. This makes it easy to respond to the gesture of listening.

More simply, the intensity of the dramatic correction may be changed according to the distance (closeness) between the position of the user's hand image in the captured image (for example, the position of the tip of the little finger) and a representative point of the captured image (for example, the center position of the captured image or the center position of the face image).

Here, when the center position of the captured image is used as the representative point, the strength of the correction can be adjusted as long as the hand image is recognized, even if face tracking fails and the face image cannot be recognized.
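The hand-based adjustment, including the fallback to the image center, can be sketched as follows. It assumes pixel coordinates from a hypothetical hand and face recognizer, with `None` standing for "not recognized", and a linear fall-off whose range `max_dist_ratio` is an illustrative assumption.

```python
import math

def strength_from_hand(hand_xy, face_center_xy, image_size, max_dist_ratio=0.5):
    """Correction strength in 0..1 from the closeness of the hand to a representative point."""
    if hand_xy is None:
        return 0.0                       # no hand image recognized: no gesture
    if face_center_xy is not None:
        ref = face_center_xy             # representative point: center of the face image
    else:
        ref = (image_size[0] / 2.0, image_size[1] / 2.0)  # fallback: image center
    dist = math.hypot(hand_xy[0] - ref[0], hand_xy[1] - ref[1])
    diagonal = math.hypot(image_size[0], image_size[1])
    return min(1.0, max(0.0, 1.0 - dist / (max_dist_ratio * diagonal)))
```

Because the fallback does not depend on the face image, the strength can still be computed when only the hand is recognized.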

In addition, when adjusting the intensity and time difference or applying the dramatic correction, instead of immediately reflecting the calculated angle difference and distance as they are, the change in the values may be smoothed by using an average or a decaying average over the most recent fixed period (for example, about 100 ms).
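A decaying average of this kind can be sketched as an exponential moving average with a time constant of about 100 ms. The class name and the choice of an exponential form are assumptions; a plain windowed average would serve equally well.

```python
import math

class DecayingAverage:
    """Exponential moving average with a time constant given in seconds."""

    def __init__(self, time_constant_s=0.1):
        self.time_constant_s = time_constant_s
        self.value = None

    def update(self, sample, dt_s):
        """Feed a new sample taken dt_s seconds after the previous one."""
        if self.value is None:
            self.value = sample          # first sample: nothing to smooth yet
        else:
            alpha = 1.0 - math.exp(-dt_s / self.time_constant_s)
            self.value += alpha * (sample - self.value)
        return self.value
```

When the smoothed quantity is an angle, the wrap-around at 360 degrees needs separate handling, for example by smoothing the angle difference rather than the raw azimuth.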

In the above description, it is assumed that the user is allowed to experience augmented reality through audio. The output of the virtual sound source changes according to the orientation of the user's face in the real world.

Therefore, even if the environmental sound in the real world has directivity, the output of the virtual sound source has directivity linked to it. Even when external sound enters the user's ears as it is, as with a speaker, the environmental sound and the virtual sound are mixed without contradiction according to the direction of the user's face, so that audio augmented reality can be provided to the user.

Note that when the information processing device 101 has a rear camera, both visual and auditory augmented reality can be provided to the user by generating an augmented reality image that combines a captured image of the real world taken by the rear camera with the appearance of a virtual object placed at the same position as the virtual sound source, and displaying it on the display screen 153 of the information processing device 101.

(Control flow)
FIG. 2 is a flow chart showing control of an information processing method executed by the information processing apparatus according to the embodiment of the present invention. Description will be made below with reference to this figure. It should be noted that each step of the following processing can be omitted as appropriate depending on the mode of application.

When this process is started, the information processing device 101 first initializes the parameters for reproduction of each virtual sound source to default values (step S200), and starts mixing reproduction (step S201). Various factors can be adopted as these parameters, such as gain (an amplification factor for each channel such as left and right, or an overall gain), the mixing ratio of direct sound and reverb sound, and the intensity of saturation, and default values are set for them at the beginning of the process.

After that, playback of the virtual sound source is executed in parallel as background processing, but the parameters for mixing are changed according to the orientation of the user's face, etc., by the following processing.

Next, the information processing device 101 detects the first orientation (or first position) of the information processing device 101 in the real world (first coordinate system) via a geomagnetic sensor, a gyro sensor, an acceleration sensor, etc. (step S202).

Further, the information processing device 101 tries to extract the user's face image from the captured image captured by the camera 151 by image recognition (step S203).

If the extraction of the user's face image succeeds (step S204; Yes), the information processing apparatus 101 estimates, based on the face image, the second orientation (or second position) of the user's face relative to the information processing apparatus 101 (in the second coordinate system) (step S205).

Then, the information processing apparatus 101 calculates the third orientation (or third position) of the user's face in the first coordinate system by applying a coordinate transformation based on the detected first orientation (or first position) to the estimated second orientation (or second position) (step S206).

Information processing device 101 then repeats the following process for each of the virtual sound sources (step S207).

That is, the information processing device 101 acquires the virtual direction of the virtual sound source in the first coordinate system (step S208). This virtual direction may be determined in advance, or may be calculated based on the virtual position of the virtual sound source in the first coordinate system and the first position of the information processing device 101 (or the third position of the user's face).

Next, the information processing device 101 calculates new parameters for reproduction of the virtual sound source based on the angle difference between the virtual direction and the third orientation (step S209). In the simplest way, a new amplification factor is calculated based on the angle difference, but echo correction and saturation correction may be added.

In addition, the amplification factor may be further corrected according to the distance (closeness) between the virtual position of the virtual sound source and the first position (or third position). That is, the smaller the distance, the larger the amplification factor, and the like.
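As a non-limiting illustration of step S209 and of the distance correction, the amplification factor can be sketched as the product of an angular term and a distance term. The particular gain law below (a Gaussian-like falloff that reaches 0.5 at `half_gain_angle`, and an inverse-distance factor relative to `reference_distance`) is an assumption chosen for this sketch; the embodiment does not specify a formula.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.acos(cos_angle)

def gain_for_source(angle_diff, distance,
                    half_gain_angle=math.radians(30.0),
                    reference_distance=20.0):
    """Hypothetical gain law: falls off with angle, grows as distance shrinks."""
    # Gain is 1.0 straight ahead and 0.5 at half_gain_angle.
    angular = 0.5 ** ((angle_diff / half_gain_angle) ** 2)
    # Smaller distance -> larger gain; clamp to avoid division by zero.
    distance_factor = reference_distance / max(distance, 1e-6)
    return angular * distance_factor
```

The angle difference passed to `gain_for_source` would be obtained with `angle_between` from the virtual direction and the third orientation.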

After repeating the process for all virtual sound sources (step S210), the information processing device 101 further corrects the new parameters for reproduction of all virtual sound sources based on their mutual relationships (step S211). This correction includes, for example, center range correction for emphasizing the virtual sound source on the front side compared to the other virtual sound sources, and power correction for keeping the overall power of all virtual sound sources unchanged.

Then, the information processing device 101 repeats, for each virtual sound source, an updating process that brings the current playback parameters smoothly closer to the new parameters (or adopts the new parameters as they are) (steps S212-S214), and the process returns to step S202.

On the other hand, if extraction of the user's face image fails (step S204; No), the information processing device 101 repeats, for each virtual sound source, an updating process that brings the parameters closer to the default values (or sets them to the default values as they are) (steps S215-S217), and the process then returns to step S202.

Although omitted from the above control flow, it is also possible to recognize the user's hand image in the captured image and, based on the user's gesture, change the amplification factor of the virtual sound source on the front side or the intensity of the saturation correction and the center range correction.
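As a non-limiting illustration of the updating process in steps S212-S214 and S215-S217, the smooth approach toward a target can be sketched as a first-order interpolation. The function name, the dictionary representation of parameters, and the `rate` value are assumptions made for this sketch; the target is either the new parameters or the default values.

```python
def approach(current, target, rate=0.2):
    """Move each playback parameter a fraction `rate` of the way to its target.

    With rate=1.0 the target values are adopted as they are.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    return {name: value + rate * (target[name] - value)
            for name, value in current.items()}
```

Calling `approach` once per pass through the loop makes each parameter converge gradually, which avoids abrupt jumps in the mixed sound when the face is lost or found again.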

(output to the screen of the display)
FIGS. 3 to 18 are drawing-substitute photographs showing, in grayscale or monochrome binary, display examples produced by the information processing device according to the embodiment of the present invention. The description below refers to these figures.

In FIGS. 3 and 4, various information is displayed on the screen 153 of the display of the information processing device 101, which is a smartphone. When the triangular play button in the lower center of the figure is tapped, playback of the music composed of the virtual sound sources is started.

At the top of the screen 153, the result of detection of the orientation of the user's face and the result of detection of the tip of the little finger are displayed in a window. Until the user gets accustomed to the operation, he or she can check and practice gestures while holding a position where the camera 151 captures the user's face by looking at the detection results.

By tapping or sliding the on/off button to the left of the play button, the window can be closed as shown in Figures 5 and 6. The window can be displayed again by tapping or sliding the same on/off button again.

In the center of the screen 153, musical instrument icons are arranged in a circle. This represents the orientation of the virtual sound source part placed in the virtual space.

In this figure, the musical instruments are arranged at equal intervals, but they do not necessarily have to be evenly spaced and circular, and can be arranged arbitrarily.

When you swipe on this circle, the instrument rotates around the center of the circle, and you can place your favorite instrument in any direction as shown in Figures 7 and 8.

At the center of the circle is the avatar of the operating user, and the direction of the white arrow indicates the direction of the user's face. The musical instrument icon at the tip of the white arrow corresponds to the virtual sound source positioned in front of the user.

At the start of this process, the white arrow is pointing in a default direction (for example, upward), and if the user changes the direction of the face or moves the position of the smartphone, the direction of the white arrow changes accordingly.

Tapping on the avatar resets the direction of the white arrow and the placement (distance) of the instrument.

In Figures 3 and 4, a sector is displayed in the direction of the white arrow. This represents the range where the gain is 0.5 times or more. When the user makes a listening gesture, the angle of the sector changes, showing the user which virtual sound source is emphasized.

Two sliders are lined up at the bottom of the screen 153. The upper slider represents the distance to the musical instruments arranged in a circle in the virtual space, and the distance can be changed by moving the slider. In the arrangement shown in Figures 3 and 4 the distance is 20 meters, in Figures 9 and 10 it is 10 meters, and in Figures 11 and 12 it is 30 meters. The distance from the avatar to the musical instruments shown on the screen 153 also changes according to this distance.

The lower slider is linked to the degree of focusing, that is, the angle of the sector. As mentioned above, the degree of focusing can be changed by gestures, but it can also be adjusted by moving the slider directly.

When you press the gear-shaped settings button to the right of the play button, you will be taken to the settings form, as shown in Figures 13 and 14.

In the setting form, you can set the master volume (the default value of the mixer gain) for each instrument. The information processing device 101 first multiplies the master volume by a multiplier corresponding to the angle difference to obtain a provisional amplification factor for mixing, and then corrects it so that the overall power becomes substantially constant.

In figures 15 and 16, the boost mode is set. In the boost mode, when adjusting the amplification factor to keep the overall power constant, it is possible to emphasize the instrument in front by doubling the strength of the virtual sound source in front.
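As a non-limiting illustration, the combination of master volume, angle-dependent multiplier, boost mode, and power correction can be sketched as follows. The sketch assumes that "overall power" means the sum of squared gains and that the power to be preserved is that of the master volumes alone; both are assumptions, as are the function and parameter names.

```python
import math

def mix_gains(master_volumes, multipliers, front_index=None, boost=False):
    """Return per-source gains whose total power matches that of the master volumes."""
    gains = [m * k for m, k in zip(master_volumes, multipliers)]
    if boost and front_index is not None:
        gains[front_index] *= 2.0  # boost mode: double the front source
    target_power = sum(m * m for m in master_volumes)
    power = sum(g * g for g in gains)
    if power == 0.0:
        return gains  # nothing to scale
    scale = math.sqrt(target_power / power)
    return [g * scale for g in gains]
```

Because the doubling is applied before the rescaling, the front source gains relative weight while the total power stays at the same level.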

Figures 17 and 18 are examples of output when the same functions as the above smartphone are implemented on a tablet.

In these figures, an augmented reality image is displayed overlaid with a video of a virtual person playing a virtual musical instrument in an uninhabited park captured by a rear camera.

On the other hand, the present embodiment can also be provided for virtual reality instead of augmented reality. FIGS. 19 and 20 show an example that provides the user with a virtual reality in which players of virtual musical instruments are arranged in a circle on the stage of a virtual concert venue and the user is placed at the center.

In this display example, ten performers of virtual musical instruments are placed on the stage, and each virtual object is formed by compositing video of a musical instrument being played onto a performer's avatar (three of the ten performers' avatars are visible in this figure). A performance sound of a musical instrument is associated with each virtual object as a virtual sound source, and the virtual sound sources are mixed and output in the same manner as in the above embodiment.

In this aspect, the user can have the experience of being the conductor of a virtual concert.

As in the above embodiment, by making a listening gesture, the user can identify, as the object of interest, the avatar that the user is facing, i.e., the avatar positioned in front of the user, among the avatars of the plurality of performers.

When the user holds the information processing device 101 in front of him/herself and faces the center of the screen 153, the virtual object displayed in the center of the screen 153 becomes the object of interest.

On the other hand, even if the user holds the information processing device 101 in front of him/herself, if the user's face is directed not to the center of the screen 153 but to the right side, the left side, or another direction, the object of interest is the virtual object displayed in the direction in which the face is directed, not the virtual object displayed at the center of the screen. That is, the sounding object whose angle difference between the virtual direction associated with its virtual sound source and the third direction is the smallest and is equal to or less than a threshold angle is specified as the object of interest.

Instead of using gestures, it is also possible to keep facing a desired virtual object, and if the time for which the face is kept facing exceeds a predetermined threshold time, the virtual object may be specified as the object of interest.
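As a non-limiting illustration, the two ways of specifying the object of interest described above (smallest angle difference within a threshold angle, and facing an object for longer than a threshold time) can be sketched as follows. All names are hypothetical, and times are assumed to be given in seconds by the caller.

```python
def pick_object_of_interest(angle_diffs, threshold_angle):
    """Index of the smallest angle difference, or None if it exceeds the threshold."""
    if not angle_diffs:
        return None
    best = min(range(len(angle_diffs)), key=lambda i: angle_diffs[i])
    return best if angle_diffs[best] <= threshold_angle else None

class DwellSelector:
    """Select an object once the face has stayed on it longer than threshold_time."""

    def __init__(self, threshold_time):
        self.threshold_time = threshold_time
        self.candidate = None
        self.since = None

    def update(self, candidate, now):
        # A change of candidate restarts the dwell timer.
        if candidate != self.candidate:
            self.candidate, self.since = candidate, now
            return None
        if candidate is not None and now - self.since > self.threshold_time:
            return candidate
        return None
```

In use, `pick_object_of_interest` would be called every frame and its result fed to `DwellSelector.update` together with the current time.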

Once the player's avatar is selected as the object of interest, the user may be able to change the position and orientation of the object of interest.

For example, when the screen 153 is configured as a touch screen and a tracing operation is performed on it, the object of interest may be moved along a locus of the same shape obtained by translating the locus of the tracing operation. In this aspect, since the object of interest has already been specified, it is not necessary to touch the object of interest itself displayed on the screen 153; the tracing operation can be performed on a part of the screen 153 other than where the object of interest is displayed, so the position of the object of interest can be changed without hiding it with the finger.

Also, when the screen 153 is touched with two or three fingers and a rotating operation is performed, the object of interest may be rotated by the angle of that operation around an axis corresponding to the number of fingers touching the screen. In this mode as well, it is not necessary to touch the object of interest itself displayed on the screen 153.

It should be noted that these various operations performed on the screen 153 configured by the touch screen can be replaced by gestures.

In the above example, an avatar of a performer playing a musical instrument is placed as a virtual object in the virtual space. FIGS. 21, 22, 23, and 24 are display examples showing how a plurality of virtual displays are arranged in a virtual room and virtual moving images are reproduced on the virtual displays.

In these display examples, the virtual video played on each virtual display together with the sound functioning as the virtual sound source corresponds to the virtual object.

In these figures, 10 virtual moving images are arranged around the user in the virtual space. You can compare videos side by side. The virtual sound source of the virtual moving image to which the user's head is facing is output with priority over the virtual sound sources of other virtual moving images.

In this aspect, it is also possible for the user to view and compare more than 10 virtual moving images in order. That is, the virtual moving images can be exchanged on a virtual display arranged at a position invisible to the user in the virtual space.

In this aspect, the user can view the virtual moving images in order by rotating his or her body in the real space while holding the information processing device 101 .

In addition, the virtual displays may be rotated around the user in the virtual space by a gesture such as changing the orientation of a finger horizontally from right to left or from left to right in a short period of time, or by an operation such as sliding the touch screen that constitutes the screen 153 to the left or right.

This allows multiple virtual animations to be presented to the user in a manner similar to a carousel display or cover flow display.

As in the above embodiment, when the user turns his or her face to one of the virtual moving images displayed on the screen 153 and makes a gesture such as listening, or keeps facing it for the threshold time or longer, that virtual moving image can be specified as the object of interest.

When the virtual moving image is identified as the object of interest, the information processing device 101 displays the virtual moving image (moving image of interest) identified as the object of interest at a predetermined position such as the center of the screen 153 at a predetermined magnification and reproduces it. At the same time, the virtual sound source associated with the object of interest, that is, the sound to be reproduced together with the virtual moving image (the sound of interest) is played with priority over other virtual sound sources. At this time, the audio to be output may be mixed with a predetermined amplification factor for the audio of interest and muted for other virtual sound sources, that is, only the audio of interest may be output and other virtual audio may not be output.

FIGS. 25 and 26 show how the virtual moving image drawn in the center of the screen in FIGS. 23 and 24 is identified as the object of interest and enlarged in the center of the screen, with the video and audio of the object of interest being played.

FIGS. 27, 28, 29, and 30 show that, while the user dances to the sound being reproduced and the orientations of the information processing device 101 and the user in the real space keep changing, the moving image reproduced in the center of the screen remains the object of interest.

In this aspect, the user can cancel the identification as the object of interest by making a gesture of spreading out his/her hand and bringing it closer to the camera 151 of the information processing device 101, by tapping the touch screen for a short period of time, or the like.

When the identification is canceled, the virtual moving images surrounding the user in the virtual space may be rotated around the user so that the virtual moving image whose identification has been canceled is positioned where the user's head is facing. That is, the virtual objects placed in the virtual space (more precisely, their virtual orientations) are rotated around the virtual starting point in the virtual space so that the virtual orientation of the virtual object whose identification has been canceled matches the calculated third orientation.

According to this aspect, immediately after the identification is canceled, the virtual moving image is arranged in the direction in which the user's face faces, and the columns of virtual moving images are arranged in the same order and at approximately the same positions as before. Therefore, the user can intuitively compare the virtual moving images in order.
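As a non-limiting illustration, the rotation performed when the identification is canceled can be sketched in the horizontal plane. The sketch assumes that each virtual orientation and the third orientation are represented by a single azimuth angle in radians; this simplification and the function names are assumptions made for the sketch.

```python
import math

def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def realign_after_release(virtual_azimuths, released_index, face_azimuth):
    """Rotate all azimuths by one common offset so the released object faces the user."""
    offset = wrap_angle(face_azimuth - virtual_azimuths[released_index])
    return [wrap_angle(a + offset) for a in virtual_azimuths]
```

Because every object receives the same offset, the order and spacing of the virtual moving images are preserved, which matches the behavior described above.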

The information processing apparatus 101 according to these embodiments will be organized and described below. FIG. 31 is an explanatory diagram showing a schematic configuration of an information processing device that processes an object of interest according to the embodiment of the present invention.

The information processing device 101 according to this embodiment has an identifying unit 301 and a cancellation unit 302 in addition to the configuration disclosed in FIG. The identifying unit 301 and the cancellation unit 302 acquire various kinds of information from the detection unit 111, the estimation unit 112, and the calculation unit 113, and control the output unit 114 accordingly.

As described above, in the information processing device 101 according to this embodiment, a plurality of sounding objects are arranged in the virtual space. Each sounding object can be, for example, a virtual object in the above embodiment, which corresponds to an avatar of a performer playing a virtual musical instrument or a virtual display playing back a virtual moving image.

Each sounding object is associated with a virtual sound source. In the above embodiments, the virtual sound source corresponds to the performance sound output by the virtual musical instrument or the sound reproduced together with the virtual moving image.

Then, the information processing device 101 displays the state of the virtual space on the screen 153. Specifically, the state of the virtual space observed from the viewpoint position and line-of-sight direction corresponding to the first position and first orientation is displayed on the screen 153, whose display direction is the same as the photographing direction of the camera 151.

When the position or orientation of (the screen 153 of) the information processing device 101 is changed, or the position or orientation of the user's head is changed, the appearance of the virtual world displayed on the screen 153 changes accordingly. As a result, the screen 153 of the information processing device 101 functions as a "window" for looking into the virtual space.

Here, the identifying unit 301 of the information processing device 101 determines whether or not a specific condition is satisfied, and performs processing according to the determination.

Further, the cancellation unit 302 of the information processing device 101 determines whether or not the cancellation condition is satisfied, and performs processing accordingly.

The specific condition is a condition for the user to specify one of the plurality of sounding objects as the object of interest, and the cancellation condition is a condition for canceling the identification as the object of interest.

In the above-described embodiment, a listening gesture, or continuing to face a particular sounding object for a predetermined time or longer, is adopted as the specific condition, and a gesture of spreading out the hand and bringing it closer to the camera 151, a tap on the touch screen that constitutes the screen 153, and the like are adopted as the cancellation condition; however, other conditions can also be adopted.

In the information processing device 101, when the identifying unit 301 determines that the specific condition is satisfied, it identifies the sounding object having the smallest angular difference between the virtual direction associated with its virtual sound source and the calculated third direction as the object of interest of the user.

The object of interest is the sounding object, among those arranged in the virtual space, that the user wants to pay attention to or is presumed to be paying attention to. In the above embodiment, the sounding object displayed on the screen 153 that the user is facing can be the object of interest. That is, if the user faces the center of the screen 153, the sounding object displayed in the center of the screen 153 can be the object of interest, and if the user faces the left end of the screen 153, the sounding object displayed at the left end of the screen 153 can be the object of interest.

When no object of interest is specified, the output unit 114 mixes each virtual sound source with an intensity corresponding to the angle difference between the virtual direction associated with that virtual sound source and the calculated third direction. When an object of interest is specified, instead of outputting information corresponding to the calculated third direction, the output unit 114 outputs the virtual sound source associated with the specified object of interest with priority over the other virtual sound sources.

In the above embodiment, the performance sound of the performer's virtual musical instrument corresponding to the object of interest and the sound accompanying the virtual moving image are output with priority over other sounds. Here, "priority" includes, for example, setting the amplification factor of the virtual sound source of the target object to a predetermined constant and setting the amplification factor of the other virtual sound sources to zero (mute) or a small value.
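As a non-limiting illustration, the switch between angle-based mixing and priority output can be sketched as follows. The function name and the default values of `focus_gain` and `other_gain` are assumptions made for this sketch; `other_gain` may be zero (mute) or a small value, as described above.

```python
def priority_gains(normal_gains, object_of_interest=None,
                   focus_gain=1.0, other_gain=0.0):
    """Return mixing gains; an identified object of interest overrides angle-based gains."""
    if object_of_interest is None:
        # No object of interest: keep the angle-based gains unchanged.
        return list(normal_gains)
    return [focus_gain if i == object_of_interest else other_gain
            for i in range(len(normal_gains))]
```

When the identification is canceled, passing `object_of_interest=None` again restores the angle-based mixing.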

Further, the information processing device 101 may display the identified object of interest on the screen 153 with more emphasis than the other sounding objects.

In the above embodiment, it is also possible to employ modes such as brightening the color of the performer corresponding to the object of interest, marking the performer, and the like. Also, the virtual moving image corresponding to the object of interest is displayed in a predetermined size in the center of the screen for highlighting.

In the information processing apparatus 101, the cancellation unit 302 cancels the identification as the object of interest when the cancellation condition is satisfied. As a result, priority output of the virtual sound source and highlighting on the screen 153 are ended, and the output method described first is adopted.

The position of the virtual objects placed in the virtual space can also be rotated around the viewpoint in the virtual space. That is, the information processing device 101 rotates the virtual orientation of the sounding objects placed in the virtual space around the viewpoint position based on a gesture recognized from the user's hand image included in the captured image or a touch operation on the screen.

Then, when the user performs a gesture of changing the orientation of a finger horizontally from right to left or from left to right in a short period of time, or slides the touch screen that constitutes the screen 153 to the left or right, the performers or the virtual displays playing moving images that are lined up around the user's viewpoint position move, and the user can view them in order and listen to and compare their sounds.

Further, in the above embodiment, the position and orientation of the avatar can be edited by pinching the player's avatar or touching the screen 153 with a plurality of fingers and rotating the avatar. That is, while the object of interest is being specified, the information processing apparatus 101 determines the position or orientation of the object of interest in the virtual space based on a gesture based on the user's hand image included in the captured image or a touch operation on the screen. can be changed.

(summary)
As described above, the information processing device according to this embodiment has a camera and comprises:
a detection unit that detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
an estimation unit that, when a photographed image taken by the camera contains a face image of a user, estimates, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
a calculation unit that calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
an output unit that outputs information corresponding to the calculated third orientation.

Further, in the information processing device according to the present embodiment,
The information processing device is wirelessly or wiredly connected to the audio equipment worn by the user,
The output unit can be configured to output the information to the audio equipment.

Further, in the information processing device according to the present embodiment,
The audio device can be configured to be headphones, earphones, neck speakers, bone conduction speakers, or hearing aids capable of capturing ambient sounds.

Further, in the information processing device according to the present embodiment, the output unit can be configured to output, as the information, a sound obtained by mixing a virtual sound source with an intensity corresponding to the angle difference between
a virtual orientation associated with the virtual sound source and
the calculated third orientation.

Further, in the information processing device according to the present embodiment,
The virtual orientation can be configured to be predetermined.

Further, in the information processing device according to the present embodiment,
The detection unit further detects a first position of the information processing device in the first coordinate system, and
the virtual orientation associated with the virtual sound source can be configured to be calculated from
a virtual position at which the virtual sound source is arranged in the first coordinate system, and
the detected first orientation and first position.

Further, in the information processing device according to the present embodiment,
The information processing device displays video information corresponding to the detected first position and first orientation on a screen whose display direction is the same as the shooting direction of the camera,
The waveform of the virtual sound source may be corrected according to the size of the face image.

Further, in the information processing device according to the present embodiment,
If the captured image includes the face image of the user and a hand image of the user, the waveform of the virtual sound source can be corrected according to the distance between the user's face and the user's hand in the second coordinate system.

Further, in the information processing device according to the present embodiment,
If the captured image includes a hand image of the user, the waveform of the virtual sound source can be corrected according to the distance between a representative point of the captured image and the hand image.

Further, in the information processing device according to the present embodiment,
The intensity may be set to a default value if the face image is not included in the captured image.

Further, in the information processing device according to the present embodiment,
the virtual sound source is associated with a sounding object placed in the virtual space;
The information processing device can be configured to:
display, on a screen whose display direction is the same as the shooting direction of the camera, the state of the virtual space in which the sounding object is arranged, as observed from a viewpoint position and line-of-sight direction corresponding to the detected first position and first orientation;
identify, when a specific condition is satisfied, the sounding object having the smallest angular difference between the virtual orientation associated with its virtual sound source and the calculated third orientation as the user's object of interest;
cause the output unit to output the virtual sound source associated with the identified object of interest in preference to the other virtual sound sources, instead of outputting information according to the calculated third orientation;
display the identified object of interest on the screen with more emphasis than the other sounding objects; and
cancel the identification as the object of interest when a cancellation condition is satisfied.

Further, in the information processing device according to the present embodiment,
the sounding object is a video that is played back with audio,
while the object of interest is identified, the information processing device
displays the object of interest at a predetermined position on the screen at a predetermined magnification, and
the output unit outputs a sound mixed with a predetermined amplification factor for the virtual sound source associated with the object of interest and with the other virtual sound sources muted, and
when the identification as the object of interest is canceled, the information processing device can be configured to rotate, around the viewpoint position, the virtual orientations of the sounding objects placed in the virtual space so that the virtual orientation of the sounding object whose identification as the object of interest has been canceled matches the calculated third orientation.

Further, in the information processing device according to the present embodiment,
The information processing device can be configured to rotate, around the viewpoint position, the virtual orientation of the sounding objects arranged in the virtual space, based on a gesture recognized from the hand image of the user included in the captured image or a touch operation on the screen.

Further, in the information processing device according to the present embodiment,
the sounding object is an avatar that emits a sound, and
while the object of interest is identified, the information processing device can be configured to change the position or orientation of the object of interest in the virtual space based on a gesture recognized from the hand image of the user included in the captured image or a touch operation on the screen.

In the information processing method according to the present embodiment, an information processing device having a camera:
detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
estimates, when a photographed image taken by the camera contains a face image of a user, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
outputs information according to the calculated third orientation.

A program according to the present embodiment causes a computer having a camera to function as:
a detection unit that detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
an estimation unit that, when a photographed image taken by the camera contains a face image of a user, estimates, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
a calculation unit that calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
an output unit that outputs information corresponding to the calculated third orientation.

The program may be recorded on a non-transitory computer-readable information recording medium, distributed, and sold. It can also be distributed and sold through a transitory transmission medium such as a computer communication network.

A non-transitory computer-readable information recording medium according to this embodiment is configured to record the above program.

The present invention is capable of various embodiments and modifications without departing from the broader spirit and scope of the invention. Moreover, the embodiment described above is for explaining the present invention, and does not limit the scope of the present invention. That is, the scope of the present invention is indicated by the claims rather than the embodiments. Various modifications made within the scope of the claims and within the meaning of equivalent inventions are considered to be within the scope of the present invention.
This application claims priority based on Japanese Patent Application No. 2021-070745, filed in Japan on Tuesday, April 20, 2021, to the extent permitted by the laws and regulations of the designated countries.

According to the present invention, it is possible to provide an information processing device, an information processing method, a program, and an information recording medium for estimating the orientation of a user's face in the real world and outputting information according to this.

101 information processing device
111 detection unit
112 estimation unit
113 calculation unit
114 output unit
151 camera
152 audio device
153 screen
301 identification unit
302 release unit

Claims

1. An information processing device having a camera, comprising:
a detection unit that detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
an estimation unit that, if a photographed image taken by the camera contains a face image of a user, estimates, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
a calculation unit that calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
an output unit that outputs information according to the calculated third orientation.
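
The calculation recited above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the first orientation is available as a 3x3 rotation matrix mapping device coordinates (the second coordinate system) to real-world coordinates (the first coordinate system), and that the second orientation is a direction vector of the face expressed in device coordinates. The function names `rotation_about_vertical` and `face_direction_in_world` are hypothetical and are not taken from the specification.

```python
import math


def rotation_about_vertical(yaw_rad):
    """Rotation matrix for a turn of yaw_rad about the vertical (z) axis."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def face_direction_in_world(device_rotation, face_dir_device):
    """Map the face direction from device coordinates to world coordinates.

    device_rotation: first orientation, as a device-to-world rotation matrix
                     (an assumed representation).
    face_dir_device: second orientation, as a vector in device coordinates.
    Returns the third orientation, as a vector in world coordinates.
    """
    return [sum(device_rotation[i][j] * face_dir_device[j] for j in range(3))
            for i in range(3)]


# Assumed example: the device is turned 90 degrees about the vertical axis,
# and the face points along the device's own x axis.
first = rotation_about_vertical(math.radians(90))
second = [1.0, 0.0, 0.0]
third = face_direction_in_world(first, second)
print(third)
```

In this example the face ends up pointing along the y axis of the real-world coordinate system, up to floating-point rounding. A full implementation would obtain the first orientation from the device's sensors and the second orientation from a face pose estimator, neither of which is modeled here.
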
2. The information processing device according to claim 1, wherein
the information processing device is connected wirelessly or by wire to an audio device worn by the user, and
the output unit outputs the information to the audio device.
3. The information processing device according to claim 2, wherein the audio device is a headphone, an earphone, a neck speaker, a bone conduction speaker, or a hearing aid capable of taking in external sound.
4. The information processing device according to claim 2, wherein
the output unit outputs, as the information, a sound obtained by mixing a virtual sound source with an intensity corresponding to the angular difference between
a virtual orientation associated with the virtual sound source and
the calculated third orientation.
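
The mixing recited above can be sketched as follows. The mapping from angular difference to intensity is not fixed by the claim; the cosine falloff used in `intensity_for` is an assumption chosen only for illustration, as are the function names and the representation of each virtual sound source as a dictionary with a `direction` vector and a list of `samples`.

```python
import math


def angular_difference(a, b):
    """Angle in radians between two non-zero 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def intensity_for(angle_rad):
    """Assumed falloff: full intensity when facing the source, zero at 90 degrees or more."""
    return max(0.0, math.cos(angle_rad))


def mix_virtual_sources(sources, face_dir_world):
    """Sum the sources' samples, each weighted by its angle-dependent intensity."""
    if not sources:
        return []
    length = min(len(src["samples"]) for src in sources)
    mixed = [0.0] * length
    for src in sources:
        gain = intensity_for(angular_difference(src["direction"], face_dir_world))
        for i in range(length):
            mixed[i] += gain * src["samples"][i]
    return mixed


# Hypothetical data: one source straight ahead, one 90 degrees to the side.
sources = [
    {"direction": [1.0, 0.0, 0.0], "samples": [1.0, 1.0]},
    {"direction": [0.0, 1.0, 0.0], "samples": [0.5, 0.5]},
]
mixed = mix_virtual_sources(sources, [1.0, 0.0, 0.0])
print(mixed)
```

With the face pointing straight at the first source, the mixed output is dominated by that source and the source to the side contributes essentially nothing. Any other monotonically decreasing function of the angular difference could be substituted for the cosine.
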
5. The information processing device according to claim 4, wherein the virtual orientation is determined in advance.
6. The information processing device according to claim 4, wherein
the detection unit further detects a first position of the information processing device in the first coordinate system, and
the virtual orientation associated with the virtual sound source is calculated from
a virtual position at which the virtual sound source is arranged in the first coordinate system and
the detected first orientation and first position.
7. The information processing device according to claim 6, wherein
the information processing device displays video information corresponding to the detected first position and first orientation on a screen whose display direction is the same as the shooting direction of the camera, and
the waveform of the virtual sound source is corrected according to the size of the face image.
8. The information processing device according to claim 4, wherein, if the photographed image includes the face image of the user and a hand image of the user, the waveform of the virtual sound source is corrected according to the distance between the user's face and the user's hand in the second coordinate system.
9. The information processing device according to claim 4, wherein, if the photographed image includes a hand image of the user, the waveform of the virtual sound source is corrected according to the distance between a representative point of the photographed image and the hand image.
10. The information processing device according to claim 4, wherein, if the photographed image does not include the face image, the intensity is set to a default value.
11. The information processing device according to claim 4, wherein
the virtual sound source is associated with a sounding object placed in a virtual space,
the information processing device
displays, on a screen whose display direction is the same as the shooting direction of the camera, the state of the virtual space in which the sounding object is placed, as observed from a viewpoint position and a line-of-sight direction corresponding to the detected first position and first orientation, and
identifies, when a specific condition is satisfied, the sounding object having the minimum angular difference between the virtual orientation associated with the virtual sound source and the calculated third orientation as the user's object of interest,
the output unit outputs the virtual sound source associated with the identified object of interest in preference to other virtual sound sources, instead of outputting information according to the calculated third orientation,
the information processing device displays the identified object of interest on the screen with more emphasis than other sounding objects, and
the identification as the object of interest is canceled when a cancellation condition is satisfied.
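
The identification step recited above, which selects the sounding object with the minimum angular difference from the face orientation, can be sketched as follows. This is a simplified illustration under stated assumptions: sounding objects are held in a dictionary mapping a name to a virtual orientation vector, the specific condition and the cancellation condition are not modeled, and "in preference to other virtual sound sources" is rendered by the hypothetical `preferential_gains` as full gain for the object of interest and an assumed reduced gain for the rest.

```python
import math


def angular_difference(a, b):
    """Angle in radians between two non-zero 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def identify_object_of_interest(virtual_orientations, face_dir_world):
    """Return the name of the object closest in angle to the face direction.

    Returns None when there are no sounding objects.
    """
    if not virtual_orientations:
        return None
    return min(
        virtual_orientations,
        key=lambda name: angular_difference(virtual_orientations[name], face_dir_world),
    )


def preferential_gains(virtual_orientations, object_of_interest, other_gain=0.1):
    """Assumed policy: gain 1.0 for the object of interest, other_gain for the rest."""
    return {
        name: 1.0 if name == object_of_interest else other_gain
        for name in virtual_orientations
    }


# Hypothetical scene with two sounding objects.
scene = {"piano": [1.0, 0.0, 0.0], "violin": [0.0, 1.0, 0.0]}
face_dir = [0.9, 0.1, 0.0]  # the user is looking almost straight at the piano
target = identify_object_of_interest(scene, face_dir)
gains = preferential_gains(scene, target)
print(target, gains)
```

Setting `other_gain` to 0.0 corresponds to muting all sources other than the object of interest.
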
12. The information processing device according to claim 11, wherein the information processing device rotates, around the viewpoint position, the virtual orientation of the sounding object placed in the virtual space, based on a gesture derived from a hand image of the user included in the photographed image or on a touch operation on the screen.
13. The information processing device according to claim 11, wherein
the sounding object is a video that is played back with audio,
while the object of interest is identified,
the information processing device displays the object of interest at a predetermined position on the screen at a predetermined magnification, and
the output unit outputs a mixed sound in which the virtual sound source associated with the object of interest is amplified by a predetermined amplification factor and the other virtual sound sources are muted, and
when the identification as the object of interest is canceled, the information processing device rotates, around the viewpoint position, the virtual orientation of the sounding objects placed in the virtual space so that the virtual orientation of the sounding object whose identification as the object of interest has been canceled matches the calculated third orientation.
14. The information processing device according to claim 11, wherein
the sounding object is an avatar that emits a sound, and
while the object of interest is identified, the information processing device changes the position or orientation of the object of interest in the virtual space, based on a gesture derived from a hand image of the user included in the photographed image or on a touch operation on the screen.
15. An information processing method in which an information processing device having a camera:
detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
estimates, if a photographed image taken by the camera contains a face image of a user, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
outputs information according to the calculated third orientation.
16. A program causing a computer having a camera to function as:
a detection unit that detects a first orientation of the information processing device in a first coordinate system fixed in the real world;
an estimation unit that, if a photographed image taken by the camera contains a face image of a user, estimates, from the photographed image and the face image, a second orientation of the user's face in a second coordinate system fixed to the information processing device;
a calculation unit that calculates a third orientation of the user's face in the first coordinate system from the detected first orientation and the estimated second orientation; and
an output unit that outputs information according to the calculated third orientation.
17. A computer-readable non-transitory information recording medium on which the program according to claim 16 is recorded.