WO2024053094A1

WO2024053094A1 - Media information emphasis playback device, media information emphasis playback method, and media information emphasis playback program

Info

Publication number: WO2024053094A1
Application number: PCT/JP2022/033902
Authority: WO
Inventors: 麻衣子井元; 真二深津; 淳一中嶋; 馨亮長谷川
Original assignee: 日本電信電話株式会社
Priority date: 2022-09-09
Filing date: 2022-09-09
Publication date: 2024-03-14

Abstract

This media information emphasis playback device comprises a media information reception unit, a user state acquisition unit, an emotion inference unit, and a media information emphasis playback unit. The media information reception unit receives media information which includes video and audio. The user state acquisition unit acquires state information which indicates a state of a user during viewing. The emotion inference unit infers an emotion of the user during viewing, on the basis of the state information which has been input from the user state acquisition unit. On the basis of an inference result which has been input from the emotion inference unit, the media information emphasis playback unit performs emphasis playback of the media information which has been input from the media information reception unit.

Description

Media information emphasizing playback device, media information emphasizing playback method, and media information emphasizing playback program

The present invention relates to a media information emphasizing reproduction device, a media information emphasizing reproduction method, and a media information emphasizing reproduction program.

In recent years, there has been an increase in the number of live viewing events where people can watch live streaming of entertainment and sports taking place at event venues from their homes.

At an actual event venue, you can get a sense of unity and excitement through the co-occurrence of feelings and emotions, such as when you get excited, the audience around you gets excited as well, but you can't just watch the distributed video from a remote location such as your home. , such interactions do not occur, and it is difficult to achieve a sense of unity.

So far, in two-way communication between users in remote locations (a small number of people), a system has been proposed in which users in remote locations view live distribution while sharing the excitement with each other [Non-Patent Document 1]. Although it may be possible to increase the feeling of participation and excitement among users in remote locations, it does not provide a sense of unity with the event venue.

Through two-way communication between the event venue and the remote location, users at the remote location must be able to It is necessary for the user to be able to feel feedback (recognize the interaction) for their actions (share feelings and emotions), and for the interaction to be immediate.

However, communication delays occur in two-way communication between a remote location and the event venue, so it takes time to recognize the interaction, resulting in a loss of immediacy.

Additionally, when there are many users in remote locations, it is difficult for the users to recognize feedback on their actions.

The present invention has been made in view of the above-mentioned circumstances, and an object of the present invention is to provide a media information emphasizing playback device, a media information emphasizing playback method, and a media information emphasizing playback method that allow users in remote locations to feel a high sense of participation in an event venue. The goal is to provide a playback program.

One aspect of the present invention is a media information emphasizing playback device. The media information emphasis playback device includes a media information reception section, a user state acquisition section, an emotion estimation section, and a media information emphasis playback section. The media information receiving unit receives media information including video and audio. The user status acquisition unit acquires status information indicating the viewing status of the user. The emotion estimator estimates the user's emotion during viewing based on the status information input from the user status acquisition unit. The media information emphasis reproduction section emphatically reproduces the media information input from the media information reception section based on the estimation result input from the emotion estimation section.

One aspect of the present invention is a media information emphasis reproduction method. The media information emphasis playback method includes the steps of receiving media information including video and audio, acquiring status information indicating the user's viewing status, and estimating the user's emotion while viewing based on the status information. and a step of emphasizing and reproducing media information based on the estimation result of the user's emotion during viewing.

One aspect of the present invention is a media information emphasis playback program. The media information emphasis playback program causes a computer having a processor and a storage device to execute the functions of the media information reception section, user state acquisition section, emotion estimation section, and media information emphasis playback section of the above-mentioned media information emphasis playback device.

According to the present invention, there are provided a media information emphasizing playback device, a media information emphasizing playback method, and a media information emphasizing playback program that allow users in remote locations to feel a high sense of participation in an event venue.

FIG. 1 is a block diagram of a media information transmitting and receiving system including a media information emphasizing playback device according to an embodiment. FIG. 2 is a block diagram showing the hardware configuration of the media information emphasizing playback device according to the embodiment. FIG. 3 is a flowchart showing the flow of processing executed by the media information emphasizing playback device according to the embodiment.

<Configuration example>
[Functional configuration]
First, with reference to FIG. 1, a media information transmitting and receiving system including a media information emphasizing playback device according to an embodiment will be described. FIG. 1 is a block diagram of a media information transmitting and receiving system including a media information emphasizing playback device according to an embodiment.

A media information transmission/reception system is constructed between one base O and N bases Rn (n=1, 2,...,N). In FIG. 1, only one of the N bases Rn is illustrated. The configuration of each base Rn is similar.

Base O is an event venue where an event will be held. Media information including video and audio of the event is distributed from base O (event venue) via the IP network 70. The base Rn is a remote location that receives and views media information distributed from the base O (event venue) via the IP network 70. For example, the remote location is the home of the user viewing the media information.

(Base O (event venue))
The base O (event venue) is provided with a server 10, a video shooting device 21, an event audio recording device 22, and an audience audio recording device 23. The server 10 includes a media information generation section 11 and a media information transmission section 12.

The event held at the event venue may be, for example, a music concert, a play, a sports competition, etc.

The video photographing device 21 includes a camera and its related equipment, and photographs the event. The video shooting device 21 outputs the shot video of the event to the media information generation unit 11 of the server 10.

The event audio recording device 22 includes a microphone and its related equipment, and records audio etc. generated by implementing the event. Hereinafter, for convenience, sounds generated by implementing an event will be simply referred to as event sounds. For example, when the event is a music concert, a play, a show, etc., the event sounds include voices uttered by performers, sounds made by the performers, sound effects, and the like. Furthermore, when the event is a sports competition or the like, the event sounds include voices uttered by competitors, sounds produced by the competitors, sounds made for the progress of the competition, and the like. The event audio recording device 22 outputs the recorded event audio to the media information generation unit 11 of the server 10.

The audience audio recording device 23 includes a microphone and its related equipment, and records audio etc. generated by the audience at the event. Hereinafter, for convenience, the sounds generated by the audience at the event will be simply referred to as audience sounds. For example, when the event is a sports competition or the like, the audience sounds include cheers emitted by the audience, sounds made by the audience making noises, and the like. The audience audio recording device 23 outputs the recorded audience audio to the media information generation unit 11 of the server 10.

The media information generation unit 11 generates a video based on the event video input from the video shooting device 21, the event audio input from the event audio recording device 22, and the audience audio input from the audience audio recording device 23. Generate media information including audio (event audio and audience audio). Media information is information distributed to base Rn via IP network 70. The media information generation section 11 outputs the generated media information to the media information transmission section 12.

When generating media information, the media information generation unit 11 may separate event audio, audience audio, or both using a known audio analysis technique. For example, the media information generation unit 11 may separate event audio into voices and background sounds. An example of such speech analysis technology is "Masashi Nishiyama, Makoto Hirohata, Toshiyuki Ono. Sound source separation for volume balance adjustment between voice and background sound. Information Processing Society of Japan Research Report. Vol. 2013-CVIM-187 No. 46 .'' is disclosed. Furthermore, the media information generation unit 11 may separate the event audio for each sound source. An example of such speech analysis technology is disclosed in "Mizuki Kobayashi, Hiroshi Tezuka, Mari Inaba. Proposal of instrument sound separation method using musical scores. Entertainment Computing Symposium (EC2015). September 2015." There is.

Here, an example has been described in which the media information generation unit 11 separates event audio and audience audio, but instead of this, the event audio recording device 22 separates the event audio, and the audience audio recording device 23 separates the audience audio. You may. Further, although an example has been described in which the audio is separated at the site O, the audio may be separated at the site Rn instead.

The media information transmitter 12 transmits the media information input from the media information generator 11 to the IP network 70.

Here, an example has been described in which base O (event venue) is provided with the event audio recording device 22 for recording event audio and the audience audio recording device 23 for recording audience audio. Instead of the audio recording device 23, one audio recording device may be provided to record a mixture of event audio and audience audio.

(Base Rn (remote location))
At base Rn (remote location), there is a user who receives media information distributed from base O (event venue) via the IP network 70 and remotely views an event being held at base O (event venue). There is. Hereinafter, the user at base Rn (remote location) will be simply referred to as a user.

The base Rn (remote location) is provided with a media information emphasis playback device 30, a camera 41, a microphone 42, a biological information measurement device 43, and a playback information output device 44.

The reproduction information output device 44 has a display and a speaker, and outputs video and audio based on the reproduction information input from the media information emphasis reproduction device 30. By viewing the video and audio output from the playback information output device 44, the user views the event being held at the base O (event venue). In the following, it is assumed that the user is viewing an event through the reproduction information output device 44.

The media information emphasis playback device 30 is a user terminal, receives media information distributed from the base O (event venue), and outputs playback information to the playback information output device 44.

The media information emphasis playback device 30 includes a user state acquisition section 31 , an emotion estimation section 32 , a media information emphasis playback section 33 , and a media information reception section 34 .

The media information receiving unit 34 receives media information transmitted from the server 10 at the base O (event venue) via the IP network 70 and outputs it to the media information emphasis reproduction unit 33.

The camera 41 is installed by the user himself to photograph the user. The camera 41 photographs the user and outputs the video information to the user status acquisition section 31.

The microphone 42 is installed by the user himself so as to pick up the user's voice. The microphone 42 picks up the user's voice and background sound, and outputs the voice information to the user status acquisition unit 31.

The biological information measuring device 43 measures the user's biological information. Biological information includes brain function and heart rate. Therefore, the electrodes and sensors included in the biological information measuring device 43 are attached to the user by the user himself/herself. The biological information measuring device 43 outputs the biological information measured by the user to the user state acquisition unit 31.

The user status acquisition unit 31 acquires video information input from the camera 41, audio information input from the microphone 42, and biological information input from the biological information measuring device 43 as status information indicating the user's status. . The user state acquisition unit 31 outputs the acquired state information to the emotion estimation unit 32.

Here, an example has been described in which the camera 41, the microphone 42, and the biological information measuring device 43 are installed at the base Rn as devices capable of acquiring the user's status. need not be provided. At least one device that can acquire the user's status may be provided.

The emotion estimating unit 32 estimates the user's emotion during viewing based on the status information input from the user status acquisition unit 31. For example, the emotion estimation unit 32 uses a known emotion estimation technique to estimate whether the user's emotion is one of three emotions: "positive," "neutral," and "negative." An example of such emotion estimation technology is "Atsushi Okada, Joeji Uemura, Kazuya Mera, Yoshiaki Kurosawa, Toshiyuki Takezawa. Real-time emotion estimation system from facial expressions, acoustic information, and text information. The 31st Annual Conference of the Japanese Society. for Artificial Intelligence, 2017. The emotion estimation section 32 outputs the estimation result to the media information emphasis reproduction section 33.

The media information emphasizing reproduction section 33 emphatically reproduces the media information input from the media information receiving section 34 based on the estimation result input from the emotion estimation section 32.

Here, emphasizing and reproducing media information based on the estimation result means reproducing the media information by changing the media information according to the estimation result. Therefore, depending on the estimation result, emphasizing and reproducing media information based on the estimation result includes reproducing the media information as it is without changing the media information. Hereinafter, emphasizing reproduction of media information based on the estimation result will also be simply referred to as emphasizing reproduction.

Furthermore, changing the media information includes changing the volume of the audio of the media information, processing the video of the media information, or both. Furthermore, changing the media information also includes changing the media information once and then changing it again to return to the original media information, resulting in no change to the media information.

Each time an estimation result is input from the emotion estimation unit 32, the media information emphasis reproduction unit 33 temporarily stores the estimation result and compares it with the estimation result input last time to determine whether there is a change in the estimation result. do. After the determination, the media information emphasis reproduction unit 33 updates the temporarily stored estimation result. As a result of the determination, if there is a change in the estimation result, the media information emphasis playback unit 33 changes the emphasis playback of the media information. Here, changing the emphasis reproduction of media information means changing the change given to the media information.

For example, as described above, it is assumed that the estimation result of the emotion estimation unit 32 is one of the three emotions "positive", "neutral", and "negative".

When the estimation result of the emotion estimation unit 32 changes to "positive", it is considered that the user is feeling excited. In this case, the media information emphasizing playback section 33 plays back the media information by making the audio volume louder than before. Furthermore, the media information emphasis playback section 33 may add an AR (augmented reality) effect to the video to create a sense of excitement and play back the media information. Such AR effects include confetti and lighting.

Further, when the estimation result of the emotion estimation unit 32 changes to "negative", it is considered that the user is feeling relaxed. In this case, the media information emphasizing playback unit 33 plays back the media information at a lower audio volume than before. Furthermore, the media information emphasizing playback unit 33 may add an AR effect to the video to raise the user's mood and play back the media information.

The audio to be changed may be either event audio or audience audio, or both. Furthermore, when the voices are separated, the voice to be changed may be any of the separated voices or all of the separated voices.

Furthermore, if the result of the determination is that there is no change in the estimation result, the estimation result after the update is the same as the estimation result before the update, so the media information emphasis reproduction unit 33 continues to play the media information as before. Reproduce.

The media information emphasis playback unit 33 outputs playback information that emphasizes playback of media information based on the estimation result to the playback information output device 44. As described above, the reproduction information output device 44 outputs video and audio based on the reproduction information input from the media information emphasis reproduction device 30.

(Hardware configuration)
Next, the hardware configuration of the media information emphasis playback device 30 will be explained. For example, the media information emphasizing playback device 30 is configured with a personal computer, a server computer, or the like.

FIG. 2 is a block diagram showing the hardware configuration of the media information emphasis playback device 30 according to the embodiment. As shown in FIG. 2, the media information emphasis playback device 30 includes a processor 51, a ROM (Read Only Memory) 52, a RAM (Random Access Memory) 53, an auxiliary storage device 54, an input/output interface 55, It has a communication interface 56.

The processor 51, ROM 52, RAM 53, auxiliary storage device 54, input/output interface 55, and communication interface 56 are electrically connected to each other via a bus 57, and exchange data via the bus 57.

The processor 51 is configured with a general-purpose hardware processor including, for example, a CPU (Central Processing Unit) and a GPU (Graphical Processing Unit). The processor 51 controls the ROM 52, RAM 53, auxiliary storage device 54, input/output interface 55, and communication interface 56 as a whole.

The ROM 52 is a nonvolatile memory that forms part of the main storage device. The ROM 52 non-temporarily stores a startup program necessary for starting the processor 51. The processor 51 is activated by executing a program in the ROM 52. The ROM 52 is composed of, for example, an EPROM (Erasable Programmable Read Only Memory), and stores various startup settings in addition to the startup program.

The RAM 53 is a volatile memory that forms part of the main storage device. The RAM 53 temporarily stores programs necessary for processing by the processor 51 and data necessary for executing the programs. In the RAM 53, the processor 51 calculates data in the RAM 53 by executing a program in the RAM 53, and stores the calculation results in the RAM 53.

The auxiliary storage device 54 is composed of nonvolatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The auxiliary storage device 54 non-temporarily stores programs executed by the processor 51 and data necessary for executing the programs. The processor 51 reads programs and data in the auxiliary storage device 54 into the RAM 53, and executes various functions by executing the programs.

The input/output interface 55 is connected to an external input device 61 , output device 62 , etc., and enables input of information from the input device 61 and output of information to the output device 62 . For example, the input/output interface 55 may be a wired interface or a wireless interface. The wired interface includes a port to which a device is connected. Wireless interfaces include Bluetooth (registered trademark), WiFi (registered trademark), and the like.

The input device 61 includes a camera 41, a microphone 42, and a biological information measuring device 43. Input device 61 may further include a keyboard, mouse, touch panel, receiving device, disk drive, and the like. The input device 61 is not limited to this, and may include any other input equipment. Output device 62 includes playback information output device 44 . Output devices 62 may further include displays, transmitters, disk drives, and the like. The output device 62 is not limited to this, and may include any other output equipment. The input device 61 and the output device 62 may be configured with an input/output device 63 having both functions.

The program non-temporarily stored in the auxiliary storage device 54 is provided to the computer via, for example, a computer-readable recording medium 64 on which the program is non-temporarily recorded. Such storage media 64 are referred to as non-transitory computer-readable storage media. Non-transitory computer-readable recording media include disks such as flexible disks, optical disks (CD-ROM, CD-R, DVD-ROM, DVD-R, etc.), magneto-optical disks (MO, etc.), semiconductor memories, etc. .

The programs non-temporarily stored in the auxiliary storage device 54 include a media information emphasis playback program. The media information emphasis playback program is a program that causes the computer constituting the media information emphasis playback device 30 to implement the functions of the user state acquisition section 31, emotion estimation section 32, media information emphasis playback section 33, and media information reception section 34. .

The program non-temporarily stored in the auxiliary storage device 54 is stored via the disk drive, which is the input device 61, and the input/output interface 55, when the recording medium 64 is a disk, or via the input/output interface 55, when the recording medium 64 is a semiconductor memory. Then, the data is read into the auxiliary storage device 54 via the port that is the input/output interface 55 and stored non-temporarily. Further, the program may be stored on a server on the network, downloaded from the server, and stored non-temporarily in the auxiliary storage device 54.

The communication interface 56 enables communication of information to and from the IP network 70. That is, the communication interface 56 makes it possible to receive media information distributed from the base O (event venue).

At startup, the processor 51 executes the program in the ROM 52, loads the OS into the RAM 53, and starts it. The processor 51 monitors input of instructions, connection of external devices, etc. under the control of the OS. Further, the processor 51 sets a program area and a data area in the RAM 53 under the control of the OS. In response to an input instruction to start up the media information emphasizing playback device 30, the processor 51 reads the media information emphasizing playback program from the auxiliary storage device 54 into the program area of the RAM 53, and also loads the data necessary for executing the media information emphasizing playback program. The data is read from the auxiliary storage device 54 into the data area of the RAM 53. The processor 51 calculates data in the data area according to the media information emphasis playback program, and writes the calculation results into the data area. Through such operations, the processor 51, the RAM 53, the auxiliary storage device 54, the input/output interface 55, and the communication interface 56 work together to obtain the user status acquisition unit 31, emotion estimation unit 32, and media information of the media information emphasis playback device 30. It implements the functions of the emphasis playback section 33 and the media information reception section 34.

[Operation example]
Next, with reference to FIG. 3, the emphasized playback process executed by the media information emphasized playback device 30 will be described. FIG. 3 is a flowchart showing the flow of emphasized playback processing executed by the media information emphasized playback device according to the embodiment. Here, it is assumed that the media information emphasis reproduction section 33 always outputs reproduction information to the reproduction information output device 44.

In step S1, the user state acquisition unit 31 converts the video information input from the camera 41, the audio information input from the microphone 42, and the biological information input from the biological information measuring device 43 into a state indicating the user's state. Obtain as information.

In step S2, the emotion estimation unit 32 estimates the user's emotion during viewing based on the state information acquired in step S1.

In step S3, the media information emphasis reproduction unit 33 compares the previous estimation result obtained in step S2 with the current estimation result, and determines whether there is a change in the estimation result. If there is no change in the estimation result, the process returns to step S1. If there is a change in the estimation result, the process advances to step S4.

In step S4, the media information emphasis reproduction section 33 changes the emphasis reproduction of media information.

Hereinafter, if the estimation result of the emotion estimation section 32 is one of the three emotions "positive", "neutral", and "negative", the media information emphasis reproduction section 33 changes the volume of the audio of the media information according to the estimation result. An example will be explained below.

Here, it is assumed that A, B, C, and D are set in advance as coefficients for changing the sound volume. Here, A is a numerical value satisfying 0.8≦A<1, B is a numerical value satisfying 0.5<B<0.8, C is a numerical value satisfying 1≦C<1.2, and D is a numerical value satisfying 0.5<B<0.8. The numerical value satisfies 1.2≦D<1.5.

If the estimation result changes from "positive" to "neutral", the media information emphasis playback unit 33 changes the audio volume to A times the previous volume.

If the estimation result changes from "positive" to "negative", the media information emphasis playback unit 33 changes the audio volume to B times the previous volume.

If the estimation result changes from "neutral" to "positive", the media information emphasis playback unit 33 changes the audio volume to C times the previous volume.

If the estimation result changes from "neutral" to "negative", the media information emphasis playback unit 33 changes the audio volume to A times the previous volume.

If the estimation result changes from "negative" to "neutral", the media information emphasis playback unit 33 changes the audio volume to C times the previous volume.

If the estimation result changes from "negative" to "positive," the media information emphasis playback unit 33 changes the audio volume to D times the previous volume.

The volume level may be changed immediately at the timing when the estimation result changes, or it may be changed linearly so that the volume reaches a predetermined level after a certain period of time (for example, one second).

After changing the emphasis playback of media information, in step S5, the process returns to step S1 while the user continues viewing, and when the user finishes viewing, the operation of the media information emphasis playback device 30 is ended by the user. .

<effect>
According to the embodiment, a media information reproduction technique is provided that allows a user in a remote location to feel a high sense of participation in an event venue.

In other words, the streamed video that the user is viewing changes according to the emotional ups and downs of the user viewing the live event from a remote location, making it possible for the event venue and other viewers to see how the user is viewing the event ( It is felt as if cheering, emotions, emotions) are acting (propagating), which can increase the sense of participation (satisfaction) and unity in the event.

By executing a series of processes entirely on the media information emphasizing playback device 30, which is a user terminal, it is possible to immediately adjust the volume and add effects without communication delay, and the user's viewing (watching) situation can be seen in real time by the performer at the event venue. You can feel as if you are affecting the audience, or the audience in a remote location.

Furthermore, by not transmitting data from the media information emphasizing playback device 30, which is a user terminal, to the event venue side, there is no need to prepare a bidirectional distribution infrastructure, and construction and operation costs can be reduced.

Note that the present invention is not limited to the above-described embodiments, and can be variously modified at the implementation stage without departing from the gist thereof. Moreover, each embodiment may be implemented in combination as appropriate, and in that case, the combined effect can be obtained. Furthermore, the embodiments described above include various inventions, and various inventions can be extracted by combinations selected from the plurality of constituent features disclosed. For example, if a problem can be solved and an effect can be obtained even if some constituent features are deleted from all the constituent features shown in the embodiment, the configuration from which these constituent features are deleted can be extracted as an invention.

O... Base (event venue)
10... Server 11... Media information generation section 12... Media information transmission section 21... Video shooting device 22... Event audio recording device 23... Audience audio recording device Rn... Base (remote location)
30... Media information emphasizing playback device 31... User status acquisition section 32... Emotion estimation section 33... Media information emphasis playback section 34... Media information receiving section 41... Camera 42... Microphone 43... Biological information measuring device 44... Playback information output device 51 ...Processor 52...ROM
53...RAM
54... Auxiliary storage device 55... Input/output interface 56... Communication interface 57... Bus 61... Input device 62... Output device 63... Input/output device 64... Recording medium 70... IP network

Claims

a media information receiving unit that receives media information including video and audio;
a user status acquisition unit that acquires status information indicating the viewing status of the user;
an emotion estimation unit that estimates the user's emotion during viewing based on the status information input from the user status acquisition unit;
a media information emphasizing reproduction unit that emphatically reproduces the media information input from the media information receiving unit based on the estimation result input from the emotion estimation unit;
A media information emphasizing playback device having:
When the estimation result of the emotion estimation unit changes, the media information emphasis playback unit changes the emphasis playback of the media information.
The media information emphasizing playback device according to claim 1.
The emotion estimating unit estimates whether the user's emotion is one of three emotions: “positive,” “neutral,” and “negative.”
The media information emphasizing playback device according to claim 2.
When the estimation result of the emotion estimation unit changes to “positive”, the media information emphasis reproduction unit reproduces the media information by increasing the volume of the audio.
The media information emphasizing playback device according to claim 3.
The media information emphasis reproduction unit adds an AR effect to the video and reproduces the media information.
The media information emphasizing playback device according to claim 4.
When the estimation result of the emotion estimation section changes to "negative", the media information emphasis reproduction section reduces the volume of the audio and reproduces the media information.
The media information emphasizing playback device according to claim 3.
receiving media information including video and audio;
obtaining state information indicating the user's viewing state;
estimating the user's emotion during viewing based on the state information;
emphasizing and reproducing the media information based on the estimation result of the user's emotion during viewing;
A media information emphasizing reproduction method comprising:
A computer having a processor and a storage device,
Executing the functions of the media information reception section, the user state acquisition section, the emotion estimation section, and the media information emphasis reproduction section of the media information emphasis playback device according to claim 1;
Media information emphasis playback program.