WO2023058330A1 - Information processing device, information processing method, and storage medium - Google Patents
Information processing device, information processing method, and storage medium
- Publication number
- WO2023058330A1 (PCT/JP2022/031034)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/72—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for transmitting results of analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
Definitions
- the present disclosure relates to an information processing device, an information processing method, and a storage medium.
- Events such as sports and live music can be experienced at the venue where the event is held, as well as through TV broadcasting and Internet distribution.
- event distribution is performed in real time, and many viewers can participate from public viewing venues, homes, and the like.
- one of the major differences between the event experience via Internet distribution and the event experience at the venue is that there is no means of conveying audience reactions, such as cheers and applause, to the performers and other viewers. Audience reactions such as cheers and applause are an important element of an event, as they increase the performers' motivation and further excite the audience.
- in the technique of Patent Document 1, constantly uploading audio data collected during live distribution to the server imposes a high processing load on the server and requires greater communication capacity, and delays can also occur.
- the present disclosure proposes an information processing device, an information processing method, and a storage medium that can further reduce the load when generating audio data for a viewer.
- an information processing apparatus is proposed that includes a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about a viewer's utterance, and that performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- an information processing method is also proposed in which a processor acquires, in real time from one or more information processing terminals, audio metadata indicating information about a viewer's utterance, and controls generation of viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- further, a storage medium is proposed that stores a program causing a computer to function as a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about a viewer's utterance, and that controls generation of viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- FIG. 1 is a diagram describing an overview of an audio data generation system according to an embodiment of the present disclosure.
- FIG. 2 is a block diagram showing an example configuration of the server and the viewer terminal according to the embodiment.
- FIG. 3 is a sequence diagram showing an example of the flow of audio data generation processing according to the embodiment.
- FIG. 4 is a diagram explaining data transmission in the audio data generation processing according to the embodiment.
- FIG. 5 is a diagram illustrating control for changing viewer audio data according to event scenes according to the embodiment.
- FIG. 6 is a sequence diagram showing an example of the flow of audio data generation processing using labeling information according to the embodiment.
- 2. Configuration examples
  - 2-1. Configuration example of server 20
  - 2-2. Configuration example of viewer terminal 10
- 3. Operation processing
- 4. Concrete examples
  - 4-1. Generation of viewer audio data according to the number of viewers
  - 4-2. Generation of viewer audio data according to gender and emotion
  - 4-3. Generation of viewer audio data according to voice properties
  - 4-4. Generation of viewer audio data according to viewing environment
  - 4-5. Generation of viewer audio data according to the number of viewers present at the same location
  - 4-6. Generation of viewer audio data according to virtual seating area
  - 4-7. Generation of viewer audio data according to enabling/disabling of the sound pickup unit
  - 4-8. Generation of viewer audio data according to event scene
  - 4-9. Generation of viewer audio data according to labeling
  - 4-10. Combined use of audio data and audio metadata
  - 4-11. Use of audio metadata in archive distribution
- 5. Supplement
- FIG. 1 is a diagram illustrating an overview of an audio data generation system according to an embodiment of the present disclosure. As shown in FIG. 1, the audio data generation system according to this embodiment includes an event venue device 30, a server 20, and viewer terminals 10.
- the event venue device 30 acquires the video and audio of the venue where the event is being held, and transmits them to the server 20.
- the event venue device 30 may be composed of a plurality of devices.
- the event venue may be a facility with a stage and audience seats (arena, concert hall, etc.), or may be a recording room (recording studio).
- the server 20 is an information processing device that controls distribution of video and audio of the event venue received from the event venue device 30 to the viewer terminal 10 in real time.
- the viewer terminals 10 (10a to 10c, ...) are information processing terminals used by viewers to view the event.
- the viewer terminal 10 can be realized by, for example, a smart phone, a tablet terminal, a PC (personal computer), an HMD (Head Mounted Display), a projector, a television device, a game machine, or the like.
- the HMD may have a non-transmissive display that covers the entire field of view, or may have a transmissive display.
- the viewer terminal 10 is connected for communication with the server 20 and outputs the video and audio of the event venue received from the server 20.
- while outputting the video and audio of the event venue, the viewer terminal 10 generates audio metadata indicating information about the viewer's utterances and transmits it to the server 20.
- the server 20 acquires audio metadata in real time from one or more viewer terminals 10, and based on the acquired audio metadata, generates viewer audio data for output using audio data prepared in advance.
- the viewer audio data can be said to be the audio data of the entire audience.
- for example, the server 20 counts the number of cheering viewers based on the audio metadata acquired from each viewer terminal 10, selects the audio data corresponding to that number from per-headcount viewer audio data prepared in advance, and uses it as the viewer audio data for output.
- the server 20 then transmits the generated viewer audio data to the event venue device 30 and the one or more viewer terminals 10.
- the event venue device 30 can output the viewer audio data from speakers or the like installed in the event venue, feeding the audience's reaction back to the performers in real time.
- the viewer terminal 10 can convey the reactions of other viewers to the viewer by outputting the viewer audio data.
- the use of audio metadata reduces the communication load, and the use of audio data prepared in advance also reduces the processing load on the server 20.
- FIG. 2 is a block diagram showing an example of the configuration of the server 20 and the viewer terminal 10 included in the audio data generation system according to this embodiment.
- the server 20 and the viewer terminal 10 are connected for communication via a network and can transmit and receive data.
- the configuration of each device will be described below.
- the server 20 has a communication unit 210, a control unit 220, and a storage unit 230.
- the communication unit 210 transmits and receives data to and from an external device by wire or wirelessly.
- the communication unit 210 connects to the viewer terminal 10 and the event venue device 30 for communication via, for example, a wired/wireless LAN (Local Area Network), Wi-Fi (registered trademark), Bluetooth (registered trademark), or a mobile communication network (LTE (Long Term Evolution), 4G (fourth-generation mobile communication system), or 5G (fifth-generation mobile communication system)).
- the control unit 220 functions as an arithmetic processing device and a control device, and controls overall operations within the server 20 according to various programs.
- the control unit 220 is implemented by an electronic circuit such as a CPU (Central Processing Unit), a microprocessor, or the like.
- the control unit 220 may also include a ROM (Read Only Memory) that stores programs to be used, calculation parameters, and the like, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
- the control unit 220 controls transmission of the video and audio of the event venue, received from the event venue device 30, to the viewer terminals 10.
- the control unit 220 may, for example, stream the video and audio of an event venue where an event is being held in real time to one or more viewer terminals 10.
- the control unit 220 also functions as an audio metadata analysis unit 221 and a viewer audio data generation unit 222.
- the audio metadata analysis unit 221 analyzes the audio metadata continuously transmitted from each viewer terminal 10. A specific example of the information included in the audio metadata will be described later.
- the audio metadata analysis unit 221 analyzes the audio metadata acquired from each viewer terminal 10 and performs appropriate processing such as counting the number of cheering viewers.
- the audio metadata analysis unit 221 outputs the analysis result to the viewer audio data generation unit 222.
- the viewer audio data generation unit 222 generates viewer audio data for output based on the analysis result from the audio metadata analysis unit 221.
- the viewer audio data generation unit 222 generates the viewer audio data using audio data prepared in advance (for example, stored in the storage unit 230).
- the audio data prepared in advance are, for example, cheers ("Wow", "Kyaa", etc.). Such cheers may be prepared for each number of people: for example, cheers of 20 people, cheers of 50 people, cheers of 100 people, and so on are recorded in advance, and the recorded audio data are stored in the storage unit 230.
- the viewer audio data generation unit 222 generates the viewer audio data by selecting, from the per-headcount viewer audio data prepared in advance, the audio data corresponding to the number of people indicated by the analysis result (the number of cheering viewers).
- selecting the viewer audio data from per-headcount viewer audio data prepared in advance can greatly reduce the processing load on the server 20 compared to processing and synthesizing the viewers' actual collected audio data. Note that the generation of viewer audio data described here is an example; variations in the generation method will be described later.
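As an illustrative sketch (not taken from the publication itself), the per-headcount selection described above could look like the following, where the recordings dictionary and function name are hypothetical:

```python
# Hypothetical per-headcount cheer recordings prepared in advance,
# keyed by the number of people heard in each recording.
CHEER_SOURCES = {
    20: "cheer_20_people.wav",
    50: "cheer_50_people.wav",
    100: "cheer_100_people.wav",
}

def select_cheer(cheering_count, sources=CHEER_SOURCES):
    # Pick the recording whose headcount is closest to the counted
    # number of cheering viewers.
    nearest = min(sources, key=lambda n: abs(n - cheering_count))
    return sources[nearest]
```

Selecting a prepared recording this way replaces per-viewer audio synthesis, which is the load reduction the description claims.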
- the control unit 220 controls transmission of the generated viewer audio data from the communication unit 210 to the viewer terminals 10 and the event venue device 30. Note that the control unit 220 may transmit to the viewer terminal 10 audio data obtained by synthesizing the generated viewer audio data with the audio data of the event venue.
- the generation and transmission of the viewer audio data described above can be performed continuously by the control unit 220.
- for example, the control unit 220 may generate and transmit the viewer audio data every 0.5 seconds.
- the storage unit 230 is implemented by a ROM (Read Only Memory) that stores programs and calculation parameters used in the processing of the control unit 220, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
- the storage unit 230 stores audio data used for generating viewer audio data.
- although the configuration of the server 20 has been specifically described above, the configuration of the server 20 according to the present disclosure is not limited to the example shown in FIG. 2.
- the server 20 may be realized by multiple devices.
- the viewer terminal 10 has a communication unit 110, a control unit 120, a display unit 130, a sound pickup unit 140, an audio output unit 150, and a storage unit 160.
- the communication unit 110 transmits and receives data to and from an external device by wire or wirelessly.
- the communication unit 110 connects to the server 20 for communication via, for example, a wired/wireless LAN (Local Area Network), Wi-Fi (registered trademark), Bluetooth (registered trademark), or a mobile communication network (LTE (Long Term Evolution), 4G (fourth-generation mobile communication system), or 5G (fifth-generation mobile communication system)).
- the control unit 120 functions as an arithmetic processing device and a control device, and controls overall operations within the viewer terminal 10 according to various programs.
- the control unit 120 is realized by an electronic circuit such as a CPU (Central Processing Unit), a microprocessor, or the like.
- the control unit 120 may also include a ROM (Read Only Memory) that stores programs to be used, calculation parameters, and the like, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
- the control unit 120 controls the display unit 130 to display the video of the event venue received from the server 20, and controls the audio output unit 150 to reproduce the audio of the event venue and the viewer audio data received from the server 20. From the server 20, for example, the video and audio of an event venue where an event is being held are streamed in real time.
- the control unit 120 also functions as an audio metadata generation unit 121.
- the audio metadata generation unit 121 generates audio metadata indicating information about the voice uttered by the viewer. For example, it generates the metadata based on collected sound data obtained by the sound pickup unit 140 picking up the viewer's utterances. It is assumed that viewers cheer while watching the distribution of the event venue, and the sound pickup unit 140 picks up such cheers (vocal sounds). The audio metadata generation unit 121 may also generate audio metadata based on preset or pre-measured information.
- the information about the viewer's utterances includes, for example, the presence or absence of an utterance, the gender of the viewer who uttered it, and the emotion (the specific type of cheer) at the time of utterance. Specific contents of the audio metadata will be described later.
- the audio metadata generation unit 121 continuously generates audio metadata and transmits it to the server 20 while live distribution of the event venue by the server 20 (for example, streaming distribution of the video and audio of the event venue) is in progress.
- for example, the audio metadata generation unit 121 may generate audio metadata every 0.5 seconds and transmit it to the server 20.
- the display unit 130 has a function of displaying the video of the event venue in accordance with instructions from the control unit 120.
- the display unit 130 may be a display panel such as a liquid crystal display (LCD) or an organic EL (Electro Luminescence) display.
- the sound pickup unit 140 has a function of picking up a sound uttered by a viewer (user).
- the sound pickup unit 140 outputs the collected sound data to the control unit 120.
- the audio output unit 150 has a function of outputting (reproducing) audio data in accordance with instructions from the control unit 120.
- the audio output unit 150 may be, for example, a loudspeaker provided in the viewer terminal 10, or wired/wireless headphones, earphones, or a bone conduction speaker connected to the viewer terminal 10.
- the storage unit 160 is realized by a ROM (Read Only Memory) that stores programs and calculation parameters used in the processing of the control unit 120, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
- although the configuration of the viewer terminal 10 has been specifically described above, the configuration of the viewer terminal 10 according to the present disclosure is not limited to the example shown in FIG. 2.
- for example, at least one of the display unit 130, the sound pickup unit 140, and the audio output unit 150 may be a separate device.
- FIG. 3 is a sequence diagram showing an example of the flow of audio data generation processing according to this embodiment.
- the viewer terminal 10 acquires sound pickup data (input information) from the sound pickup unit 140 (step S103).
- the viewer terminal 10 generates audio metadata based on the input information (collected sound data) (step S106), and transmits the generated audio metadata to the server 20 (step S109).
- the server 20 acquires audio metadata from one or more viewer terminals 10 (step S112), and analyzes the audio metadata (step S115).
- the server 20 generates viewer voice data based on the analysis results (step S118).
- the viewer audio data can be said to be the audio data of the entire audience.
- the server 20 transmits the viewer audio data to each viewer terminal 10 together with the audio data of the event venue (received from the event venue device 30) (step S121).
- the server 20 transmits viewer audio data to all viewer terminals 10 (all viewers) connected for communication.
- each viewer terminal 10 reproduces the audio data of the event venue and the audio data of the entire audience (step S127).
- the server 20 also transmits the viewer voice data to the event venue device 30 (step S124).
- the event venue device 30 reproduces the audio data of the entire audience through speakers or the like installed at the event venue (step S130).
- an example of the flow of audio data generation processing according to the present embodiment has been described above with reference to FIG. 3. Note that the operation processing shown in FIG. 3 is an example; some processing may be performed in a different order or in parallel, and some processing may be omitted. For example, the process of transmitting the viewer audio data to the event venue device 30 does not necessarily have to be performed.
- FIG. 4 is a diagram for explaining data transmission in the audio data generation process according to this embodiment.
- at regular intervals (for example, every 0.5 seconds), the server 20 generates viewer audio data (audio data of the entire audience) based on the audio metadata received from the viewer terminals 10 up to that point, and transmits it to each viewer terminal 10 and the event venue device 30.
- the viewer terminals 10 may also include an information processing terminal at a public viewing venue. When the event is distributed both to individuals and to spectators at a public viewing venue, the cheers of the individuals and the cheers of the public viewing audience can be shared mutually.
- for example, the audio metadata includes the presence or absence of vocalization (cheering), and the server 20 selects and transmits per-headcount viewer audio data according to the number of viewers who cheered.
- for example, at one timing 50 viewers report "vocalization: yes", so cheers of 50 people are selected and transmitted; at the next timing 100 viewers report "vocalization: yes", so cheers of 100 people are selected and transmitted.
- this makes it possible for the viewers and performers to share, in real time, the fact that the audience is gradually getting excited (the cheers are growing).
- in this example, the audio metadata includes the presence or absence of vocalization, and the viewer audio data generation unit 222 generates viewer audio data according to the number of people.
- the audio metadata generation unit 121 of the viewer terminal 10 analyzes the sound data collected by the sound pickup unit 140 (speech recognition), determines whether or not the viewer has uttered a voice, and generates audio metadata including information indicating the presence or absence of utterance. As an example of the generated data, a "speaking_flag" may be set to "1" when a voice is present and to "2" when it is not.
- the viewer terminal 10 determines the presence or absence of vocalization every second, and generates and transmits audio metadata.
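A minimal sketch of how a viewer terminal might build such metadata, assuming a simple energy threshold as the utterance detector (the publication does not specify the detection method; the function names and threshold are illustrative):

```python
import json
import math

def detect_utterance(samples, threshold=0.02):
    # Crude voice-activity check: RMS energy of one capture window
    # of float PCM samples, compared against a fixed threshold.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold

def make_audio_metadata(samples):
    # "1" = voice present, "2" = no voice, following the
    # speaking_flag example given in the description.
    flag = "1" if detect_utterance(samples) else "2"
    return json.dumps({"speaking_flag": flag})
```

The terminal would call `make_audio_metadata` on each capture window (for example, once per second as above, or every 0.5 seconds) and send the result to the server.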
- the server 20 prepares audio data for each number of people in advance.
- the audio data are, for example, recorded sound sources of cheering.
- the audio metadata analysis unit 221 of the server 20 counts the number of viewers with vocalization based on the information indicating the presence or absence of vocalization included in the audio metadata transmitted from the one or more viewer terminals 10.
- the viewer audio data generation unit 222 selects, from the audio data prepared in advance for each number of people, the audio data closest to the counted number, and uses it as the viewer audio data.
- the server 20 transmits the viewer audio data thus generated to each viewer terminal 10 and the event venue device 30.
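On the server side, counting the viewers whose metadata reports an utterance could be sketched as follows (the payload field follows the speaking_flag example in the description; the function name is hypothetical):

```python
import json

def count_cheering(metadata_payloads):
    # Tally the viewers whose audio metadata reports "voice present"
    # ("1" = voice, "2" = no voice in the speaking_flag example).
    return sum(
        1
        for payload in metadata_payloads
        if json.loads(payload).get("speaking_flag") == "1"
    )
```

The resulting count is then used to select the closest per-headcount recording, as the description explains.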
- in the above example, audio metadata including "utterance: no" information is also transmitted, but the present embodiment is not limited to this.
- for example, the viewer terminal 10 may transmit audio metadata including information indicating that an utterance is present only when there actually is an utterance.
- in this example, the audio metadata includes at least one of the gender of the viewer who uttered the vocalization and the emotion determined from the vocalization, and the server 20 generates viewer audio data accordingly.
- the audio metadata generation unit 121 of the viewer terminal 10 analyzes the sound data collected by the sound pickup unit 140 (speech recognition), determines whether the voice is female or male, and generates audio metadata including information indicating gender. If the viewer's gender has been set in advance, that information may be used instead. The audio metadata generation unit 121 also analyzes the collected sound data (speech recognition), determines the emotion conveyed by the utterance, and generates audio metadata including information indicating the emotion. For example, there are various types of cheers that convey emotions, such as dejected voices, joyful voices, excited voices, flustered voices, surprised voices, and screams. The audio metadata generation unit 121 may include information indicating the type of cheer as the emotion information. Furthermore, when analysis of the collected sound data shows that the viewer is not speaking at all, the audio metadata generation unit 121 may include information indicating that the viewer is not speaking.
- the server 20 prepares in advance audio data by gender (a sound source of female-only cheers, a sound source of male-only cheers, etc.) and audio data by emotion (a sound source of disappointed voices, a sound source of joyful voices, a sound source of screams, etc.).
- each piece of audio data may be audio data of one person, audio data of a certain number of people (for example, 1000 people), or audio data prepared for each number of people.
- the viewer audio data generation unit 222 of the server 20 selects the corresponding audio data from the gender-specific and emotion-specific audio data prepared in advance, based on the information indicating gender and emotion contained in the audio metadata transmitted from the one or more viewer terminals 10, and generates viewer audio data. More specifically, the viewer audio data generation unit 222 synthesizes audio for the audio metadata of each viewer and combines the results into one piece of audio data.
- alternatively, the audio metadata analysis unit 221 may count the number of people for each emotion.
- for example, if a certain number of people feel disappointment, the viewer audio data generation unit 222 uses the disappointment audio data to generate audio data for that number of people (or selects audio data for a similar number of people); if 100 people feel joy, it uses the joy audio data to generate audio data of 100 people (or audio data for a similar number of people); and it combines these to generate the final audience audio data.
- the final audience audio data may also be generated by adjusting the volume and the like of emotion-specific audio data for a certain number of people prepared in advance. The same can be done for gender.
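As a sketch of the per-emotion flow described above (the names, the (emotion, headcount) keying of prepared clips, and the equal-weight mix are assumptions, not details from the publication):

```python
from collections import Counter

def build_audience_audio(metadata_list, sources):
    """metadata_list: per-viewer dicts like {"emotion": "joy"}.
    sources: prepared clips keyed by (emotion, headcount), as
    equal-length lists of float PCM samples."""
    counts = Counter(
        m["emotion"] for m in metadata_list if m.get("emotion")
    )
    selected = []
    for emotion, n in counts.items():
        # Pick the prepared clip whose headcount is closest to the
        # number of viewers reporting this emotion.
        sizes = [size for (emo, size) in sources if emo == emotion]
        nearest = min(sizes, key=lambda s: abs(s - n))
        selected.append(sources[(emotion, nearest)])
    if not selected:
        return []
    # Combine the per-emotion clips by sample-wise averaging.
    length = len(selected[0])
    return [
        sum(clip[i] for clip in selected) / len(selected)
        for i in range(length)
    ]
```

Averaging is just one possible combination step; the description also allows adjusting the volume of prepared clips instead.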
- the viewer terminal 10 analyzes the properties of the viewer's voice in advance and generates property information.
- in this example, the audio metadata includes information indicating the gender of the viewer and the properties of the voice. These pieces of information are also called voice generation parameters.
- the viewer audio data generation unit 222 of the server 20 generates per-viewer audio data by appropriately adjusting default audio data prepared in advance, based on the information indicating the voice properties in each piece of audio metadata transmitted from the one or more viewer terminals 10, and combines these data into one piece of audio data. As a result, it is possible to generate more realistic cheers that reflect the characteristics of the viewers' voices, instead of only the originally prepared cheers.
- the audio metadata described above may include, in addition to the gender and the properties of the voice, emotion information (the type of cheer) determined from the vocalization. This makes it possible to generate audio data corresponding to emotions when generating audio data for each viewer.
- the audio metadata may further include volume information of the voice uttered by the viewer.
- Modification 2 The above-described properties of the viewer's voice may be arbitrarily set by the viewer. This makes it possible to cheer (shout) with a tone different from the actual tone of voice. For example, a male may use a female voice. Alternatively, it may be possible to select from voice generation parameters prepared in advance by the distribution provider (for example, parameters for voice generation of celebrities). Furthermore, the voice generation parameters prepared by the distribution provider may be sold separately or attached only to tickets for specific events. As a result, it can also be used as an income item for an event on the delivery provider's side.
- The audio metadata generation unit 121 of the viewer terminal 10 selects the characteristics of the viewer's voice from voice generation parameters prepared in advance and includes the selected parameter in the audio metadata. This reduces the load on the server 20 of generating audio data for each viewer.
- The audio metadata analysis unit 221 of the server 20 counts, for example, the number of vocalizing viewers for each voice generation parameter, and the viewer audio data generation unit 222 generates the viewer audio data using the prepared voice generation parameters and the per-headcount audio data.
- Both selecting from voice generation parameters prepared in advance and using the characteristics of the viewer's actual voice may be performed.
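The per-parameter counting step can be sketched as follows; the metadata field names (`uttered`, `voice_param`) are assumptions for illustration:

```python
from collections import Counter

# Sketch: among viewers who vocalized, count how many selected each
# prepared voice-generation parameter.
def count_by_parameter(metadata_list):
    return Counter(m["voice_param"] for m in metadata_list if m["uttered"])

meta = [
    {"uttered": True, "voice_param": "celebrity_a"},
    {"uttered": True, "voice_param": "default_female"},
    {"uttered": False, "voice_param": "celebrity_a"},  # silent viewer not counted
    {"uttered": True, "voice_param": "celebrity_a"},
]
print(count_by_parameter(meta))  # celebrity_a: 2, default_female: 1
```

The server would then pick per-headcount audio data for each parameter bucket, as the text describes.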
- the function of reflecting the characteristics of the viewer's voice may be sold only to specific viewers.
- the audio metadata may include the presence or absence of the uttered sound described above and the volume information of the sound uttered by the viewer.
- the server 20 can generate the viewer voice data considering the actual volume information of the voice uttered by the viewer.
- the maximum volume value of the viewer (maximum volume of voice that the viewer can make) may be measured in advance and included in the audio metadata.
- The audio metadata generation unit 121 of the viewer terminal 10 includes, for example, information indicating the presence or absence of vocalization, information indicating the actual loudness of the vocalization, and information indicating the pre-measured maximum volume value in the audio metadata, and transmits it to the server 20.
- The audio metadata generation unit 121 may use, as the maximum volume value, a value measured in advance at a specific timing assumed to be the loudest at the event (for example, at a live music event, the moment the artist appears).
- When generating audio data for each viewer based on each piece of audio metadata, the viewer audio data generation unit 222 of the server 20 may take the maximum volume value into account and set the volume larger than the voice the viewer actually uttered.
- The viewer audio data generation unit 222 may also hold a maximum volume setting value A for the audio data it can generate, and appropriately adjust each viewer's maximum volume value so that it matches the maximum volume setting value A.
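One plausible reading of this adjustment is a per-viewer normalization: the actual utterance volume is rescaled so that the viewer's personal maximum maps to the server-side setting A. The sketch below assumes that interpretation (the names and the clamping choice are assumptions):

```python
# Sketch: map each viewer's actual loudness onto a common scale where their
# personal maximum corresponds to the server's maximum setting A.
MAX_SETTING_A = 1.0

def normalized_volume(actual: float, viewer_max: float) -> float:
    """Scale `actual` so that viewer_max corresponds to MAX_SETTING_A."""
    if viewer_max <= 0:
        return 0.0
    return min(actual / viewer_max, 1.0) * MAX_SETTING_A

# A viewer who must keep quiet (max 0.2) speaking at 0.1 counts as half volume,
# the same as a loud viewer (max 1.0) speaking at 0.5.
print(normalized_volume(0.1, 0.2))
```

This way, viewers in quiet environments are not under-represented in the generated cheers, which is the stated motivation for carrying the maximum volume value.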
- When gender information is included in the audio metadata, it may be information indicating that both men and women are present, or information indicating the male-to-female ratio.
- In that case, the audio metadata analysis unit 221 of the server 20 counts the number of viewers indicated in the metadata rather than one viewer per terminal.
- Audio metadata may include information indicating a virtual seating area (viewing position) at the event venue.
- The viewing position is one of the elements of enjoying an event, and the performer may also ask for reactions linked to the viewing position (for example, calling on the audience in the second-floor seats to cheer).
- the virtual seating area may be set in advance for each viewer, or the viewer may select an arbitrary area.
- the viewer terminal 10 includes, for example, information indicating the presence or absence of vocalization and information indicating a virtual seating area in audio metadata.
- The audio metadata analysis unit 221 of the server 20 counts the number of vocalizing viewers for each virtual seating area, and the viewer audio data generation unit 222 selects audio data according to the number of viewers in each virtual seating area and generates the viewer audio data. The control unit 220 of the server 20 then transmits the generated viewer audio data to the event venue device 30 in association with the virtual seating area information.
- the event venue device 30 controls a plurality of speakers installed in spectator seats in the event venue to reproduce the viewer audio data of the virtual seat area corresponding to the position of each speaker. This allows the performer to grasp the cheers of the audience at each position.
- the viewer audio data generation unit 222 may generate audio data for each viewer based on each audio metadata, and collect the audio data for each virtual viewing area to generate viewer audio data.
- the server 20 may transmit viewer audio data associated with virtual seating area information to each viewer terminal 10 .
- When reproducing the viewer audio data, each viewer terminal 10 may perform processing to localize the sound source at a position corresponding to the virtual seating area, based on the seating-area information attached to each piece of viewer audio data. This allows viewers to experience the same atmosphere as watching from their seats at the actual venue.
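The server-side part of this flow — counting speakers per area and pairing each area's audio with the venue speakers covering it — can be sketched as below (area names, speaker IDs, and field names are assumptions):

```python
from collections import Counter

# Sketch: count vocalizing viewers per virtual seating area, then route each
# area's generated audio to the venue speaker covering that area.
def count_by_area(metadata_list):
    return Counter(m["area"] for m in metadata_list if m["uttered"])

SPEAKER_FOR_AREA = {"floor_1": "spk_front", "floor_2": "spk_balcony"}

meta = [
    {"uttered": True, "area": "floor_1"},
    {"uttered": True, "area": "floor_2"},
    {"uttered": True, "area": "floor_2"},
]
counts = count_by_area(meta)
routing = {SPEAKER_FOR_AREA[a]: n for a, n in counts.items()}
print(routing)  # {'spk_front': 1, 'spk_balcony': 2}
```

In the described system the routed payload would be the generated audio for each area rather than a bare count; the count stands in for that here.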
- In each of the above examples, the audio metadata is generated based on the input information (collected sound data) from the sound pickup unit 140.
- However, depending on the viewer terminal 10, the sound pickup unit 140 may not be provided or connected, and some viewers may have no choice but to watch quietly because of their environment. In this case, the viewer audio data generated by the server 20 may sound like the cheers of fewer people than the actual number of viewers.
- The audio metadata therefore includes information on whether the sound pickup unit 140 is enabled (ON/OFF).
- The viewer audio data generation unit 222 of the server 20 applies, for example, the ratio of vocalizing viewers among those whose sound pickup unit 140 is ON (enabled, usable) to the viewers whose sound pickup unit 140 is OFF (disabled, unusable), treats the result as an imputed number of speakers, and generates the viewer audio data accordingly.
- The information imputed for viewers whose sound pickup unit 140 is OFF is not limited to the number of speakers; other audio metadata may be imputed in the same way.
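The ratio-based imputation amounts to simple arithmetic; the sketch below assumes rounding to the nearest integer (a choice the patent does not specify):

```python
# Sketch: the utterance ratio observed among mic-ON viewers is applied to
# mic-OFF viewers to estimate the total number of speakers.
def estimated_speakers(on_total: int, on_speaking: int, off_total: int) -> int:
    if on_total == 0:
        return on_speaking  # no observed ratio to apply
    ratio = on_speaking / on_total
    return on_speaking + round(ratio * off_total)

# 60 of 100 mic-ON viewers spoke; 50 viewers have no mic:
# 60 + 0.6 * 50 = 90 estimated speakers.
print(estimated_speakers(100, 60, 50))  # → 90
```

The estimate then feeds the nearest-headcount selection of prepared audio data described earlier.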
- the viewer terminal 10 may analyze the movement of the viewer captured by the camera and include the analysis result in the audio metadata.
- A feeling of excitement may also be expressed by small gestures such as mimed clapping (bringing the hands together without actually striking them) or waving.
- The viewer terminal 10 may recognize such viewer movements by image analysis, determine, for example, that a vocalization is present and that the cheer type is "joy", and generate the audio metadata accordingly.
- The viewer audio data generation unit 222 of the server 20 may change the volume and the type (cheers, clapping, etc.) of the viewer audio data it generates according to the scene of the event. Which kind of audio data is generated in which scene may be controlled in real time on the server side, changed at preset times, or changed in response to the performer's voice.
- FIG. 5 is a diagram explaining control for changing the viewer audio data according to the event scene. As shown in FIG. 5, for example, the output may be switched to "type: clapping" and "volume: low" during a performance, and to "type: cheers" and "volume: loud" during a talk segment. In this way, the event can be staged using the viewer audio data.
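The scene-dependent control of Fig. 5 reduces to a lookup from event scene to output preset; the sketch below mirrors the two example scenes in the text (the scene names and the default preset are assumptions):

```python
# Sketch of Fig. 5's control: scene → type/volume of viewer audio to generate.
SCENE_PRESETS = {
    "performance": {"type": "clapping", "volume": "low"},
    "talk": {"type": "cheers", "volume": "loud"},
}

def preset_for(scene: str) -> dict:
    # Fall back to a neutral preset for scenes with no explicit entry.
    return SCENE_PRESETS.get(scene, {"type": "cheers", "volume": "medium"})

print(preset_for("performance"))  # {'type': 'clapping', 'volume': 'low'}
```

The table itself could equally be updated in real time by an operator, switched on a schedule, or driven by the performer's voice, as the text notes.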
- For example, the audio metadata generation unit 121 of the viewer terminal 10 adds information about the supported team in a soccer match to the audio metadata as labeling information.
- the viewer voice data generation unit 222 of the server 20 generates viewer voice data for each piece of labeling information.
- the control unit 220 of the server 20 transmits the viewer voice data having the same labeling information as that of the viewer to the viewer terminal 10 of the viewer.
- In this way, viewers mainly hear the cheering of people supporting the same soccer team, giving them the experience of watching the game among fellow supporters of their team.
- The viewer audio data generation unit 222 may also generate, for each piece of labeling information, overall viewer audio data in which the audio corresponding to that labeling information is emphasized (its volume increased).
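A minimal sketch of this label-emphasized mix, treating each label's stream as a single volume value and using an assumed boost factor (not specified in the patent):

```python
# Sketch: when generating the overall mix for a listener, the stream matching
# the listener's own label (e.g. supported team) gets a larger gain.
def mix_with_emphasis(streams: dict, own_label: str, boost: float = 2.0) -> dict:
    """streams: {label: volume}; returns per-label output volumes."""
    return {
        label: vol * (boost if label == own_label else 1.0)
        for label, vol in streams.items()
    }

print(mix_with_emphasis({"team_a": 0.5, "team_b": 0.5}, "team_a"))
# {'team_a': 1.0, 'team_b': 0.5}
```

The alternative in the text — sending only the same-label stream — corresponds to dropping the non-matching labels instead of attenuating them.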
- FIG. 6 is a sequence diagram showing an example of the flow of audio data generation processing using labeling information according to this embodiment.
- The viewer terminal 10 first sets labeling information for viewing (step S203). The labeling information may be set based on the viewer's selection.
- the viewer terminal 10 acquires sound pickup data (input information) from the sound pickup unit 140 (step S206).
- The viewer terminal 10 generates audio metadata based on the input information (collected sound data), further includes the labeling information (step S209), and transmits the audio metadata to the server 20 (step S212).
- In steps S215 to S221, the same processing as that shown in steps S112 to S118 of FIG. 3 is performed. That is, the control unit 220 of the server 20 generates viewer audio data based on the audio metadata acquired from the one or more viewer terminals 10.
- The control unit 220 also generates viewer audio data using only audio metadata carrying the same labeling information (step S224).
- The server 20 transmits, to each viewer terminal 10, the audio data of the event venue (received from the event venue device 30) and viewer audio data based on the same labeling information as that viewer's (step S227).
- The server 20 may transmit only the viewer audio data based on the same labeling information as the viewer's, or may generate and transmit overall viewer audio data in which the same-label portion is emphasized.
- the viewer terminal 10 reproduces the audio data of the event site and the viewer's audio data (step S233).
- the server 20 also transmits the viewer audio data to the event venue device 30 (step S230).
- viewer audio data is the audio data generated in step S221.
- the event venue device 30 reproduces the audio data of the entire audience through speakers or the like installed in the event venue (step S236).
- the audio data generation process using labeling information has been described above. Note that the operation processing shown in FIG. 6 is an example, and the present embodiment is not limited to this.
- the event distribution of this embodiment is not limited to distribution for individuals, and may be distributed to venues with thousands to tens of thousands of people, such as public viewing.
- For the audio data of a public viewing venue, the sound collected at the venue may be sent to the server 20 as-is and synthesized with the audio data of the other individual viewers (audio data generated based on audio metadata). Since only a few public viewing venues are assumed and the voices of thousands to tens of thousands of people can be consolidated into a single audio stream, the communication and processing load is small compared to individually transmitting and processing audio data for thousands to tens of thousands of viewers.
- For a specific individual viewer (for example, a viewer who purchased a premium ticket), various additional services can be provided to the extent that a large delay does not occur.
- The control unit 220 of the server 20 may also store the audio metadata acquired from each viewer terminal 10 during live distribution and, during archive distribution, generate and distribute viewer audio data using audio metadata that was not used at the time of live distribution.
- The audio metadata can include information such as the presence or absence of vocalization, gender, emotion, the nature of the voice, the loudness of the voice, the maximum volume value, the number of people, the virtual seating area, whether the sound pickup unit 140 is enabled, and labeling information.
- During live distribution, only a part of this information (for example, only the presence or absence of vocalization) may be used, and during archive distribution the other various information may be used as appropriate to generate and distribute viewer audio data.
- The event venue audio data and the viewer audio data transmitted from the server 20 to the viewer terminal 10 may be generated as separate sound sources, and during live or archive distribution the viewer can arbitrarily mute either one during playback.
- The audio metadata may include at least one of the above-mentioned presence or absence of vocalization, gender, emotion, nature of the voice, loudness of the voice, maximum volume value, number of people, virtual seating area, whether the sound pickup unit 140 is enabled, and labeling information.
- other information included in the audio metadata may include the temporal length of the vocalization. For example, information such as whether it was a momentary utterance or whether the utterance sound had a certain length can be included.
- One or more computer programs can be created for causing hardware such as the CPU, ROM, and RAM built into the server 20 and the viewer terminal 10 described above to exhibit the functions of the server 20 and the viewer terminal 10. A computer-readable storage medium storing the one or more computer programs is also provided.
- The present technology can also take the following configurations.
- (1) An information processing apparatus comprising a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- (2) The information processing apparatus according to (1), wherein the audio metadata is generated based on an analysis result of sound data collected by a sound pickup unit that picks up the viewer's uttered sound while data of an event being held in real time is being distributed.
- (3) The information processing apparatus according to (2), wherein the audio metadata includes information indicating the presence or absence of the uttered sound, and the control unit counts the number of speakers based on that information, selects, from per-headcount audio data prepared in advance, audio data close to the counted number, and generates the viewer audio data.
- (4) The information processing apparatus according to (2), wherein the audio metadata includes information indicating the gender of the viewer who uttered the sound, and the control unit selects, from gender-specific audio data prepared in advance, the audio data corresponding to that gender and generates the viewer audio data.
- (5) The information processing apparatus according to any one of (2) to (4), wherein the audio metadata includes information indicating an emotion determined from an analysis result of the uttered sound, and the control unit selects, from emotion-specific audio data prepared in advance, the audio data corresponding to that emotion and generates the viewer audio data.
- (6) The information processing apparatus according to any one of (2) to (5), wherein the audio metadata includes information indicating properties of the uttered sound generated from an analysis result of the uttered sound, and the control unit reflects those properties in audio data prepared in advance to generate the viewer audio data.
- (7) The information processing apparatus according to any one of (2) to (5), wherein the audio metadata includes, as the properties of the uttered sound, information indicating properties arbitrarily set by the viewer, and the control unit reflects those properties in audio data prepared in advance to generate the viewer audio data.
- (8) The information processing apparatus according to any one of (2) to (5), wherein the audio metadata includes, as the properties of the uttered sound, information indicating a property selected from property variations prepared in advance, and the control unit selects, from audio data prepared in advance, audio data reflecting that property and generates the viewer audio data.
- (9) The information processing apparatus according to any one of (2) to (8), wherein the audio metadata further includes information indicating the loudness of the uttered sound determined from an analysis result of the uttered sound, and the control unit further reflects the loudness of each viewer's uttered sound to generate the viewer audio data.
- (10) The information processing apparatus according to (9), wherein the audio metadata further includes information on the maximum volume value of the viewer who uttered the sound, and the control unit further reflects each viewer's maximum volume value to generate the viewer audio data.
- (11) The information processing apparatus according to (9), wherein the audio metadata further includes information on the maximum volume value of the viewer who uttered the sound, and the control unit adjusts the maximum volume value to the same magnitude as a preset maximum volume setting value, and generates and outputs the viewer audio data.
- (12) The information processing apparatus according to (2), wherein the audio metadata includes information on the number of viewers in the same location who uttered the sound, and the control unit selects, from per-headcount audio data prepared in advance, audio data close to that number and generates the viewer audio data.
- (13) The information processing apparatus according to any one of (2) to (11), wherein the audio metadata further includes information indicating the virtual seating area of the viewer who uttered the sound, and the control unit further generates the viewer audio data for each viewer's virtual seating area.
- (14) The information processing apparatus according to any one of (2) to (13), wherein the audio metadata further includes information indicating whether the sound pickup unit that picks up the uttered sound is enabled, and the control unit applies the ratio of speakers among viewers whose sound pickup unit is enabled as an imputed ratio of speakers for viewers whose sound pickup unit is disabled, counts the number of speakers, selects, from per-headcount audio data prepared in advance, audio data close to the counted number of speakers, and generates the viewer audio data.
- (15) The information processing apparatus according to any one of (2) to (14), wherein the audio metadata further includes labeling information of the category to which the viewer belongs, and the control unit generates the viewer audio data for each category and outputs the viewer audio data corresponding to the viewer's category to that viewer's information processing terminal.
- (16) The information processing apparatus according to any one of (1) to (15), wherein the control unit changes at least one of the type and the volume of the generated viewer audio data according to the scene of the event distributed to each viewer.
- (17) The information processing apparatus according to any one of (1) to (16), wherein the control unit outputs the generated viewer audio data to the information processing terminal and an event venue device.
- (18) The information processing apparatus according to any one of (1) to (17), wherein the control unit synthesizes audio data acquired from a public viewing venue with the viewer audio data generated based on the audio metadata, and outputs the result to the information processing terminal and the event venue device.
- (19) An information processing method in which a processor acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- (20) A storage medium storing a program that causes a computer to function as a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- 10 viewer terminal
- 110 communication unit
- 120 control unit
- 121 audio metadata generation unit
- 130 display unit
- 140 sound pickup unit
- 150 audio output unit
- 160 storage unit
- 20 management server
- 210 communication unit
- 220 control unit
- 221 audio metadata analysis unit
- 222 viewer audio data generation unit
- 230 storage unit
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
Description
1. Overview of an audio data generation system according to an embodiment of the present disclosure
2. Configuration examples
2-1. Configuration example of the server 20
2-2. Configuration example of the viewer terminal 10
3. Operation processing
4. Specific examples
4-1. Generating viewer audio data according to the number of people
4-2. Generating viewer audio data according to gender and emotion
4-3. Generating viewer audio data according to voice properties
4-4. Generating viewer audio data according to the viewing environment
4-5. Generating viewer audio data according to the number of viewers in the same location
4-6. Generating viewer audio data according to virtual seating areas
4-7. Generating viewer audio data according to whether the sound pickup unit is enabled
4-8. Generating viewer audio data according to event scenes
4-9. Generating viewer audio data according to labeling
4-10. Combined use of audio data and audio metadata
4-11. Use of audio metadata in archive distribution
5. Supplement
FIG. 1 is a diagram explaining an overview of an audio data generation system according to an embodiment of the present disclosure. As shown in FIG. 1, the audio data generation system according to this embodiment includes an event venue device 30, a server 20, and a viewer terminal 10.
Here, as described above, one major difference between an event experience via Internet distribution and the experience at the venue is that there is no means of conveying viewer reactions such as cheers and applause to the performers and other viewers. Such reactions increase the performers' motivation and further excite the audience, and can be said to be an important element of an event. One conceivable approach is to pick up each viewer's (remote user's) voice, transmit it to a server, apply audio processing there, add the multiple audio data streams together, and distribute the result to each remote user. However, this places a high processing load on the server; moreover, constantly uploading the collected audio data to the server during live distribution requires more communication capacity, and delays can occur.
FIG. 2 is a block diagram showing an example of the configurations of the server 20 and the viewer terminal 10 included in the audio data generation system according to this embodiment. The server 20 and the viewer terminal 10 can connect for communication via a network and exchange data. The configuration of each device is described below.
As shown in FIG. 2, the server 20 has a communication unit 210, a control unit 220, and a storage unit 230.
The communication unit 210 transmits and receives data to and from external devices by wire or wirelessly. The communication unit 210 connects for communication with the viewer terminal 10 and the event venue device 30 using, for example, a wired/wireless LAN (Local Area Network), Wi-Fi (registered trademark), Bluetooth (registered trademark), or a mobile communication network (LTE (Long Term Evolution), 4G (fourth-generation mobile communication system), or 5G (fifth-generation mobile communication system)).
The control unit 220 functions as an arithmetic processing device and a control device, and controls overall operations within the server 20 according to various programs. The control unit 220 is realized by an electronic circuit such as a CPU (Central Processing Unit) or a microprocessor. The control unit 220 may also include a ROM (Read Only Memory) that stores programs and calculation parameters to be used, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
The storage unit 230 is realized by a ROM (Read Only Memory) that stores programs and calculation parameters used in the processing of the control unit 220, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate. For example, in this embodiment the storage unit 230 stores the audio data used to generate the viewer audio data.
As shown in FIG. 2, the viewer terminal 10 has a communication unit 110, a control unit 120, a display unit 130, a sound pickup unit 140, an audio output unit 150, and a storage unit 160.
The communication unit 110 transmits and receives data to and from external devices by wire or wirelessly. The communication unit 110 connects for communication with the server 20 using, for example, a wired/wireless LAN (Local Area Network), Wi-Fi (registered trademark), Bluetooth (registered trademark), or a mobile communication network (LTE (Long Term Evolution), 4G (fourth-generation mobile communication system), or 5G (fifth-generation mobile communication system)).
The control unit 120 functions as an arithmetic processing device and a control device, and controls overall operations within the viewer terminal 10 according to various programs. The control unit 120 is realized by an electronic circuit such as a CPU (Central Processing Unit) or a microprocessor. The control unit 120 may also include a ROM (Read Only Memory) that stores programs and calculation parameters to be used, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
The display unit 130 has a function of displaying video of the event venue in accordance with instructions from the control unit 120. The display unit 130 may be, for example, a display panel such as a liquid crystal display (LCD) or an organic EL (Electro Luminescence) display.
The sound pickup unit 140 has a function of picking up the viewer's (user's) uttered sound. The sound pickup unit 140 outputs the collected audio data to the control unit 120.
The storage unit 160 is realized by a ROM (Read Only Memory) that stores programs and calculation parameters used in the processing of the control unit 120, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.
Next, the flow of the audio data generation processing according to this embodiment will be described concretely with reference to the drawings. FIG. 3 is a sequence diagram showing an example of the flow of the audio data generation processing according to this embodiment.
Next, the generation of viewer audio data will be described using specific examples.
For example, the audio metadata includes the presence or absence of an uttered sound, and the viewer audio data generation unit 222 generates viewer audio data according to the number of people.
For example, the audio metadata includes at least one of the gender of the viewer who uttered the sound and the emotion determined from the uttered sound, and the viewer audio data generation unit 222 generates viewer audio data according to gender and emotion.
To bring the generated viewer audio data even closer to actual reactions, the properties of the viewer's voice (gender, pitch (high or low), thickness (thin or thick), etc.) may be used, for example.
In addition to gender and voice properties, the audio metadata described above may further include emotion information (the type of cheer) determined from the uttered sound. This makes it possible to generate emotion-appropriate audio data when generating audio data for each viewer.
The properties of the viewer's voice described above may be set arbitrarily by the viewer. This makes it possible to cheer (shout) with a tone different from one's actual voice; for example, a man may use a female voice. The viewer may also be allowed to select from voice generation parameters prepared in advance by the distribution provider (for example, parameters for generating a celebrity's voice). Furthermore, the provider-prepared voice generation parameters may be sold separately or bundled only with tickets for specific events, so they can also serve as a revenue item for the provider's event.
The variations of voice generation parameters handled may also be limited in advance. The audio metadata generation unit 121 of the viewer terminal 10 selects the characteristics of the viewer's voice from voice generation parameters prepared in advance and includes the selection in the audio metadata. This reduces the load on the server 20 of generating audio data for each viewer. The audio metadata analysis unit 221 of the server 20 counts, for example, the number of vocalizing viewers for each voice generation parameter, and the viewer audio data generation unit 222 generates the viewer audio data using the prepared voice generation parameters and the per-headcount audio data.
The audio metadata may include the presence or absence of vocalization described above and volume information of the voice uttered by the viewer. In this case, the server 20 can generate the viewer audio data in consideration of the actual loudness of the viewer's voice. Depending on the viewing environment, however, a viewer may be unable to speak loudly and may keep their voice down. If many viewers are in environments where they cannot speak loudly, the generated viewer audio data will also be subdued. Therefore, the viewer terminal 10 may measure the viewer's maximum volume value (the loudest voice the viewer can produce) in advance and include it in the audio metadata. The audio metadata generation unit 121 of the viewer terminal 10 includes, for example, in addition to information indicating the presence or absence of vocalization, information indicating the actual loudness of the vocalization and information indicating the pre-measured maximum volume value in the audio metadata, and transmits it to the server 20.
Each of the specific examples above assumed that the audio metadata generated by a viewer terminal 10 covers a single viewer. However, several people may watch together with family or friends. In this case, it is possible, for example, for the viewer terminal 10 to recognize the people watching by voice recognition or a camera and generate audio metadata for each of them. Alternatively, a field indicating the number of people may be added to the audio metadata, with the remaining information consolidated as for a single person.
The audio metadata may include information indicating a virtual seating area (viewing position) at the event venue. At an actual live music event, the viewing position is one of the elements of enjoying the event, and the performer may also ask for reactions linked to the viewing position (for example, calling on the audience in the second-floor seats to cheer). The virtual seating area may be set in advance for each viewer, or the viewer may select an arbitrary area. The viewer terminal 10 includes, for example, information indicating the presence or absence of vocalization and information indicating the virtual seating area in the audio metadata.
In each of the specific examples above, the audio metadata is generated based on the input information (collected sound data) from the sound pickup unit 140. However, depending on the viewer terminal 10, a sound pickup unit 140 may not be provided or connected, and some viewers may have no choice but to watch quietly because of their environment. In this case, the viewer audio data generated by the server 20 may end up sounding like the cheers of fewer people than the actual number of viewers.
For example, at a live music event, quieter cheering may be preferable during a song so that the audience can concentrate on listening, whereas the intervals between songs are when performers and the audience gauge the excitement. Likewise, in sports events, some competitions consider it good manners to stay quiet while a player is playing, while in others the players call for rhythmic clapping. In this way, the desirable loudness differs for each scene of an event, and clapping may even be preferable to cheering.
In this embodiment, by setting labeling information for the category to which a viewer belongs, it becomes possible to provide viewer audio data customized for each viewer. That is, for each viewer, the viewer audio data corresponding to the same labeling information as that viewer's can be emphasized and provided.
The event distribution of this embodiment is not limited to distribution for individuals, and may include distribution to venues with thousands to tens of thousands of people, such as public viewing. For the audio data of a public viewing venue, the sound collected at the venue may be sent to the server 20 as-is and synthesized with the audio data of the other individual viewers (audio data generated based on audio metadata). Since only a few public viewing venues are assumed and the voices of thousands to tens of thousands of people can be consolidated into a single audio stream, the communication and processing load is small compared to individually transmitting and processing audio data for thousands to tens of thousands of viewers.
All the specific examples above assume so-called live distribution, in which the event is distributed in real time, but this embodiment is not limited to this, and such an event distribution may also be archived and distributed at a later date.
Preferred embodiments of the present disclosure have been described above in detail with reference to the accompanying drawings, but the present technology is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various changes or modifications within the scope of the technical ideas described in the claims, and it is naturally understood that these also belong to the technical scope of the present disclosure.
(1)
An information processing apparatus comprising a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
(2)
The information processing apparatus according to (1), wherein the audio metadata is generated based on an analysis result of sound data collected by a sound pickup unit that picks up the viewer's uttered sound while data of an event being held in real time is being distributed.
(3)
The information processing apparatus according to (2), wherein the audio metadata includes information indicating the presence or absence of the uttered sound, and the control unit counts the number of speakers based on that information, selects, from per-headcount audio data prepared in advance, audio data close to the counted number, and generates the viewer audio data.
(4)
The information processing apparatus according to (2), wherein the audio metadata includes information indicating the gender of the viewer who uttered the sound, and the control unit selects, from gender-specific audio data prepared in advance, the audio data corresponding to that gender and generates the viewer audio data.
(5)
The information processing apparatus according to any one of (2) to (4), wherein the audio metadata includes information indicating an emotion determined from an analysis result of the uttered sound, and the control unit selects, from emotion-specific audio data prepared in advance, the audio data corresponding to that emotion and generates the viewer audio data.
(6)
The information processing apparatus according to any one of (2) to (5), wherein the audio metadata includes information indicating properties of the uttered sound generated from an analysis result of the uttered sound, and the control unit reflects those properties in audio data prepared in advance to generate the viewer audio data.
(7)
The information processing apparatus according to any one of (2) to (5), wherein the audio metadata includes, as the properties of the uttered sound, information indicating properties arbitrarily set by the viewer, and the control unit reflects those properties in audio data prepared in advance to generate the viewer audio data.
(8)
The information processing apparatus according to any one of (2) to (5), wherein the audio metadata includes, as the properties of the uttered sound, information indicating a property selected from property variations prepared in advance, and the control unit selects, from audio data prepared in advance, audio data reflecting that property and generates the viewer audio data.
(9)
The information processing apparatus according to any one of (2) to (8), wherein the audio metadata further includes information indicating the loudness of the uttered sound determined from an analysis result of the uttered sound, and the control unit further reflects the loudness of each viewer's uttered sound to generate the viewer audio data.
(10)
The information processing apparatus according to (9), wherein the audio metadata further includes information on the maximum volume value of the viewer who uttered the sound, and the control unit further reflects each viewer's maximum volume value to generate the viewer audio data.
(11)
The information processing apparatus according to (9), wherein the audio metadata further includes information on the maximum volume value of the viewer who uttered the sound, and the control unit adjusts the maximum volume value to the same magnitude as a preset maximum volume setting value, and generates and outputs the viewer audio data.
(12)
The information processing apparatus according to (2), wherein the audio metadata includes information on the number of viewers in the same location who uttered the sound, and the control unit selects, from per-headcount audio data prepared in advance, audio data close to that number and generates the viewer audio data.
(13)
The information processing apparatus according to any one of (2) to (11), wherein the audio metadata further includes information indicating the virtual seating area of the viewer who uttered the sound, and the control unit further generates the viewer audio data for each viewer's virtual seating area.
(14)
The information processing apparatus according to any one of (2) to (13), wherein the audio metadata further includes information indicating whether the sound pickup unit that picks up the uttered sound is enabled, and the control unit applies the ratio of speakers among viewers whose sound pickup unit is enabled as an imputed ratio of speakers for viewers whose sound pickup unit is disabled, counts the number of speakers, selects, from per-headcount audio data prepared in advance, audio data close to the counted number of speakers, and generates the viewer audio data.
(15)
The information processing apparatus according to any one of (2) to (14), wherein the audio metadata further includes labeling information of the category to which the viewer belongs, and the control unit generates the viewer audio data for each category and outputs the viewer audio data corresponding to the viewer's category to that viewer's information processing terminal.
(16)
The information processing apparatus according to any one of (1) to (15), wherein the control unit changes at least one of the type and the volume of the generated viewer audio data according to the scene of the event distributed to each viewer.
(17)
The information processing apparatus according to any one of (1) to (16), wherein the control unit outputs the generated viewer audio data to the information processing terminal and an event venue device.
(18)
The information processing apparatus according to any one of (1) to (17), wherein the control unit synthesizes audio data acquired from a public viewing venue with the viewer audio data generated based on the audio metadata, and outputs the result to the information processing terminal and the event venue device.
(19)
An information processing method in which a processor acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
(20)
A storage medium storing a program that causes a computer to function as a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
110 communication unit
120 control unit
121 audio metadata generation unit
130 display unit
140 sound pickup unit
150 audio output unit
160 storage unit
20 management server
210 communication unit
220 control unit
221 audio metadata analysis unit
222 viewer audio data generation unit
230 storage unit
Claims (20)
- 1. An information processing apparatus comprising a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- 2. The information processing apparatus according to claim 1, wherein the audio metadata is generated based on an analysis result of sound data collected by a sound pickup unit that picks up the viewer's uttered sound while data of an event being held in real time is being distributed.
- 3. The information processing apparatus according to claim 2, wherein the audio metadata includes information indicating the presence or absence of the uttered sound, and the control unit counts the number of speakers based on that information, selects, from per-headcount audio data prepared in advance, audio data close to the counted number, and generates the viewer audio data.
- 4. The information processing apparatus according to claim 2, wherein the audio metadata includes information indicating the gender of the viewer who uttered the sound, and the control unit selects, from gender-specific audio data prepared in advance, the audio data corresponding to that gender and generates the viewer audio data.
- 5. The information processing apparatus according to claim 2, wherein the audio metadata includes information indicating an emotion determined from an analysis result of the uttered sound, and the control unit selects, from emotion-specific audio data prepared in advance, the audio data corresponding to that emotion and generates the viewer audio data.
- 6. The information processing apparatus according to claim 2, wherein the audio metadata includes information indicating properties of the uttered sound generated from an analysis result of the uttered sound, and the control unit reflects those properties in audio data prepared in advance to generate the viewer audio data.
- 7. The information processing apparatus according to claim 2, wherein the audio metadata includes, as the properties of the uttered sound, information indicating properties arbitrarily set by the viewer, and the control unit reflects those properties in audio data prepared in advance to generate the viewer audio data.
- 8. The information processing apparatus according to claim 2, wherein the audio metadata includes, as the properties of the uttered sound, information indicating a property selected from property variations prepared in advance, and the control unit selects, from audio data prepared in advance, audio data reflecting that property and generates the viewer audio data.
- 9. The information processing apparatus according to claim 2, wherein the audio metadata further includes information indicating the loudness of the uttered sound determined from an analysis result of the uttered sound, and the control unit further reflects the loudness of each viewer's uttered sound to generate the viewer audio data.
- 10. The information processing apparatus according to claim 9, wherein the audio metadata further includes information on the maximum volume value of the viewer who uttered the sound, and the control unit further reflects each viewer's maximum volume value to generate the viewer audio data.
- 11. The information processing apparatus according to claim 9, wherein the audio metadata further includes information on the maximum volume value of the viewer who uttered the sound, and the control unit adjusts the maximum volume value to the same magnitude as a preset maximum volume setting value, and generates and outputs the viewer audio data.
- 12. The information processing apparatus according to claim 2, wherein the audio metadata includes information on the number of viewers in the same location who uttered the sound, and the control unit selects, from per-headcount audio data prepared in advance, audio data close to that number and generates the viewer audio data.
- 13. The information processing apparatus according to claim 2, wherein the audio metadata further includes information indicating the virtual seating area of the viewer who uttered the sound, and the control unit further generates the viewer audio data for each viewer's virtual seating area.
- 14. The information processing apparatus according to claim 2, wherein the audio metadata further includes information indicating whether the sound pickup unit that picks up the uttered sound is enabled, and the control unit applies the ratio of speakers among viewers whose sound pickup unit is enabled as an imputed ratio of speakers for viewers whose sound pickup unit is disabled, counts the number of speakers, selects, from per-headcount audio data prepared in advance, audio data close to the counted number of speakers, and generates the viewer audio data.
- 15. The information processing apparatus according to claim 2, wherein the audio metadata further includes labeling information of the category to which the viewer belongs, and the control unit generates the viewer audio data for each category and outputs the viewer audio data corresponding to the viewer's category to that viewer's information processing terminal.
- 16. The information processing apparatus according to claim 1, wherein the control unit changes at least one of the type and the volume of the generated viewer audio data according to the scene of the event distributed to each viewer.
- 17. The information processing apparatus according to claim 1, wherein the control unit outputs the generated viewer audio data to the information processing terminal and an event venue device.
- 18. The information processing apparatus according to claim 1, wherein the control unit synthesizes audio data acquired from a public viewing venue with the viewer audio data generated based on the audio metadata, and outputs the result to the information processing terminal and the event venue device.
- 19. An information processing method in which a processor acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
- 20. A storage medium storing a program that causes a computer to function as a control unit that acquires, in real time from one or more information processing terminals, audio metadata indicating information about viewers' uttered sounds, and performs control to generate viewer audio data for output using audio data prepared in advance, based on the acquired audio metadata.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280066089.4A CN118020309A (zh) | 2021-10-06 | 2022-08-17 | 信息处理装置、信息处理方法和存储介质 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021164748 | 2021-10-06 | ||
JP2021-164748 | 2021-10-06 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023058330A1 true WO2023058330A1 (ja) | 2023-04-13 |
Family
ID=85803348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/031034 WO2023058330A1 (ja) | 2021-10-06 | 2022-08-17 | 情報処理装置、情報処理方法、および記憶媒体 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118020309A (ja) |
WO (1) | WO2023058330A1 (ja) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007036685A (ja) * | 2005-07-27 | 2007-02-08 | Nippon Hoso Kyokai <Nhk> | Video and audio synthesis device and remote experience sharing video viewing system |
JP2010232860A (ja) * | 2009-03-26 | 2010-10-14 | Sony Corp | Information processing device, content processing method, and program |
JP2012129800A (ja) | 2010-12-15 | 2012-07-05 | Sony Corp | Information processing device and method, program, and information processing system |
JP2014011509A (ja) * | 2012-06-27 | 2014-01-20 | Sharp Corp | Audio output control device, audio output control method, program, and recording medium |
CN113301359A (zh) * | 2020-07-30 | 2021-08-24 | Alibaba Group Holding Ltd. | Audio/video processing method and apparatus, and electronic device |
JP2021145296A (ja) * | 2020-03-13 | 2021-09-24 | Yamaha Corp | Operation method of terminal device, terminal device, and program |
2022
- 2022-08-17 WO PCT/JP2022/031034 patent/WO2023058330A1/ja active Application Filing
- 2022-08-17 CN CN202280066089.4A patent/CN118020309A/zh active Pending
Also Published As
Publication number | Publication date |
---|---|
CN118020309A (zh) | 2024-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7725203B2 (en) | Enhancing perceptions of the sensory content of audio and audio-visual media | |
US10687145B1 (en) | Theater noise canceling headphones | |
CN106464953B (zh) | 双声道音频系统和方法 | |
WO2022004103A1 (ja) | Production effect control method, operation method of terminal device, production effect control system, and terminal device | |
WO2023058330A1 (ja) | Information processing device, information processing method, and storage medium | |
WO2022163137A1 (ja) | Information processing device, information processing method, and program | |
WO2022024898A1 (ja) | Information processing device, information processing method, and computer program | |
WO2022018786A1 (ja) | Audio processing system, audio processing device, audio processing method, and audio processing program | |
WO2021246104A1 (ja) | Control method and control system | |
US11533537B2 (en) | Information processing device and information processing system | |
JP5790021B2 (ja) | Audio output system | |
JP7137278B2 (ja) | Playback control method, control system, terminal device, and program | |
CN115668956A (zh) | Method and system for distributing audience-free live performances | |
KR102013054B1 (ko) | Method and system for performing output of a performance and generation of performance content | |
WO2024053094A1 (ja) | Media information enhanced playback device, media information enhanced playback method, and media information enhanced playback program | |
WO2021157638A1 (ja) | Server device, terminal device, simultaneous interpretation audio transmission method, multiplexed audio reception method, and recording medium | |
WO2023120244A1 (ja) | Transmission device, transmission method, and program | |
JP7503257B2 (ja) | Content collection and distribution system | |
US20210320959A1 (en) | System and method for real-time massive multiplayer online interaction on remote events | |
WO2023243375A1 (ja) | Information terminal, information processing method, program, and information processing device | |
JP2020008752A (ja) | Live-band karaoke live distribution system | |
US20220264193A1 (en) | Program production apparatus, program production method, and recording medium | |
JP7468111B2 (ja) | Playback control method, control system, and program | |
US20230097803A1 (en) | Hybrid Audio/Visual Imagery Entertainment System With Live Audio Stream Playout And Separate Live Or Prerecorded Visual Imagery Stream Playout | |
JP2024079861A (ja) | Data distribution program and data distribution method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22878201; Country of ref document: EP; Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 2023552720; Country of ref document: JP |
WWE | Wipo information: entry into national phase | Ref document number: 2022878201; Country of ref document: EP |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 2022878201; Country of ref document: EP; Effective date: 20240506 |