WO2018150774A1 - Voice signal processing device and voice signal processing system - Google Patents

Voice signal processing device and voice signal processing system

Info

Publication number
WO2018150774A1
WO2018150774A1 (PCT/JP2018/000736, JP2018000736W)
Authority
WO
WIPO (PCT)
Prior art keywords
rendering
audio
audio signal
unit
signal processing
Prior art date
Application number
PCT/JP2018/000736
Other languages
French (fr)
Japanese (ja)
Inventor
健明 末永
永雄 服部
Original Assignee
シャープ株式会社 (Sharp Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by シャープ株式会社 (Sharp Corporation)
Publication of WO2018150774A1


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • The present disclosure relates to an audio signal processing device and an audio signal processing system.
  • As described in Non-Patent Document 1, techniques for reproducing multi-channel sound image localization using a small number of speakers have been studied.
  • Patent Document 1: Japanese Patent Publication JP 2013-055439 A (published March 21, 2013). Patent Document 2: Japanese Patent Publication JP H11-113098 A (published April 23, 1999).
  • Vector Base Amplitude Panning (VBAP) and sound pressure panning, described in Non-Patent Document 1, control the sound pressure based on the positional relationship between a set of speakers, for example the group of three speakers 1302, 1303, and 1304 shown in (a) of FIG. 13 or the pair of speakers 1306 and 1307 shown in (b) of FIG. 13, and the sound image 1301 or 1305 to be reproduced, thereby reproducing a sound image at an arbitrary position within the range surrounded by that set of speakers. Since the technique can reproduce sound images within the range surrounded by a set of speakers even when a plurality of sound images exist, it can reproduce a multi-channel audio signal (for example, 22.2 ch or 5.1 ch) with a smaller number of speakers.
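  • To make the sound-pressure control concrete, the following is an illustrative sketch of pair-wise amplitude panning in the horizontal plane. It is not part of the disclosure; the function name and the speaker-pair geometry are assumptions for illustration only.

```python
import math

def pairwise_pan_gains(image_az, left_az, right_az):
    """Illustrative 2-D vector base amplitude panning for one speaker pair.

    Solves g_l * l + g_r * r = p, where l, r, p are unit direction
    vectors of the two speakers and of the desired sound image, then
    normalizes the gains so that g_l^2 + g_r^2 = 1. Azimuths are in
    degrees, 0 deg = front, positive toward the right.
    """
    def unit(az_deg):
        a = math.radians(az_deg)
        return (math.sin(a), math.cos(a))  # (x = right, y = front)

    lx, ly = unit(left_az)
    rx, ry = unit(right_az)
    px, py = unit(image_az)

    det = lx * ry - rx * ly          # invert the 2x2 speaker basis
    g_l = (px * ry - rx * py) / det
    g_r = (lx * py - px * ly) / det
    norm = math.hypot(g_l, g_r)
    return g_l / norm, g_r / norm

# A sound image at +10 deg between speakers at -30 deg and +30 deg:
# both gains come out positive, i.e. the image lies inside the pair.
print(pairwise_pan_gains(10.0, -30.0, 30.0))
```

If the requested image azimuth lies outside the arc between the two speakers, one of the gains becomes negative, which is the numerical symptom of the limitation discussed next.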
  • However, VBAP and sound pressure panning can reproduce a sound image only within the range surrounded by a set of speakers. Therefore, if a speaker cannot be installed in a given area of the user's viewing environment, for example at a position close to the ceiling, a sound image in the height direction cannot be reproduced.
  • On the other hand, if the transaural technique shown in Non-Patent Document 2 or Patent Document 2 is used, three-dimensional sound image control can be performed with as few as two speakers. This has the advantage that, for example, sound image localization at an arbitrary position around the user can be reproduced using only two speakers installed in front of the user.
  • However, since this technique assumes, in principle, a specific listening area within which the acoustic effect is obtained, if the listener moves out of that listening area, the sound image may be localized at an unexpected position, or localization may not be perceived at all.
  • An object of one embodiment of the present disclosure is to realize an audio signal processing device capable of presenting to the user audio rendered by a rendering method suitable for the user's viewing situation, and an audio signal processing system including such a device.
  • To solve the above problem, an audio signal processing device according to one aspect of the present disclosure is an audio signal processing device that renders the audio signals of one or more audio tracks and outputs them to a plurality of audio output devices, and includes: a reproduction position specifying unit that specifies the reproduction position of the audio signal of an audio track based on that audio track or on information associated with it; a position information acquisition unit that acquires position information of each audio output device; and a processing unit that selects one rendering method from a plurality of rendering methods based on the reproduction position and the position information, and renders, using the selected rendering method, the audio signal of the audio track corresponding to that reproduction position. In general, an audio track may contain a plurality of audio channels; in the present disclosure, however, it is assumed for ease of explanation that each audio track contains one audio channel.
  • Similarly, an audio signal processing system according to one aspect of the present disclosure includes the audio signal processing device having the above-described configuration and the plurality of audio output devices.
  • FIG. 1 is a block diagram illustrating the main configuration of an audio signal processing system according to Embodiment 1 of the present disclosure. FIG. 2 is a diagram showing an example of track information used in the audio signal processing system according to Embodiment 1. FIG. 3 is a diagram showing the coordinate system used in the description of the present disclosure. FIG. 4 is a block diagram illustrating the main configuration of a rendering switching signal generation unit according to Embodiment 1. FIG. 5 is a diagram illustrating the processing flow of the rendering switching signal generation unit according to Embodiment 1. FIG. 6 is a diagram showing the relationship between speaker arrangement positions and sound image positions. FIG. 7 is a diagram illustrating the processing flow of another form of the rendering switching signal generation unit according to Embodiment 1. FIG. 8 is a diagram illustrating the processing flow of a rendering unit according to Embodiment 1. FIG. 9 is a block diagram illustrating the main configuration of an audio signal processing system according to Embodiment 2 of the present disclosure.
  • [Embodiment 1] Hereinafter, an embodiment of the present disclosure will be described with reference to FIGS. 1 to 8.
  • FIG. 1 is a block diagram showing the main configuration of the audio signal processing system 1 according to the first embodiment.
  • The audio signal processing system 1 according to the first embodiment includes an audio signal processing unit 10 (audio signal processing device) and an audio output unit 20 (a plurality of audio output devices). The audio signal processing unit 10 is an audio signal processing device that renders the audio signals of one or a plurality of audio tracks using two different rendering methods. The rendered audio signal is output from the audio signal processing unit 10 to the audio output unit 20.
  • The audio signal processing unit 10 includes: a content analysis unit 101 (reproduction position specifying unit) that specifies the sound image position (reproduction position) of the audio signal of an audio track based on the input audio signal or on information accompanying it; a rendering switching signal generation unit 102 (position information acquisition unit, processing unit) that acquires position information of the audio output unit 20; and a rendering unit 103 (processing unit) that renders the audio signal of the audio track corresponding to a sound image position, using one rendering method selected from a plurality of rendering methods based on that sound image position (reproduction position) and the position information.
  • The audio signal processing unit 10 also includes a storage unit 104, as shown in FIG. 1. The storage unit 104 stores various parameters required by the rendering switching signal generation unit 102 and the rendering unit 103, as well as various parameters generated by them.
  • The content analysis unit 101 analyzes the audio tracks included in video or audio content recorded on a disc medium such as a DVD or BD, on an HDD (Hard Disc Drive), or the like, together with any metadata (information) associated with them, and obtains sounding object position information. The sounding object position information is sent from the content analysis unit 101 to the rendering switching signal generation unit 102 and the rendering unit 103.
  • In the first embodiment, it is assumed that the audio content received by the content analysis unit 101 includes two or more audio tracks. Each audio track may be a "channel-based" audio track as employed in stereo (2ch), 5.1ch, and the like, or an "object-based" audio track in which each individual sounding object is given its own track together with accompanying information (metadata) describing its positional and volume changes.
  • An object-based audio track records each sounding object on its own track, that is, without mixing, and the player (reproduction device) renders these sounding objects as appropriate. Although details differ among standards and formats, in general each sounding object is associated with metadata specifying when, where, and at what volume it should be sounded, and the player renders each sounding object based on this metadata. A "channel-based" audio track, on the other hand, is the kind employed in conventional surround formats (for example, 5.1ch surround): a track recorded with the individual sounding objects already mixed, on the premise that it will be reproduced from predetermined playback positions (speaker placement positions).
  • FIG. 2 conceptually shows the configuration of the track information 201 obtained through analysis by the content analysis unit 101. The content analysis unit 101 analyzes all the audio tracks included in the content and reconstructs them as the track information 201 shown in FIG. 2. The track information 201 records the ID of each audio track and the type of that audio track.
  • When the audio track is an object-based track, one or more items of sounding object position information are attached to it as metadata. Each item of sounding object position information consists of a pair of a reproduction time and the sound image position (reproduction position) at that reproduction time. When the audio track is a channel-based track, a pair of a playback time and the sound image position (playback position) at that time is likewise recorded; the playback time spans the content from start to end, and the sound image position is the playback position defined in advance for that channel.
  • The sound image position (playback position) recorded as part of the sounding object position information is expressed in the coordinate system shown in FIG. 3. Further, it is assumed that the track information 201 is described in a markup language such as XML (Extensible Markup Language).
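  • As a concrete illustration, the track information 201 could be held in a structure like the following. This is a sketch only: the field names are assumptions not taken from the disclosure, which itself assumes a serialization in a markup language such as XML.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SoundingObjectPosition:
    time: float    # reproduction time
    r: float       # radius from the origin O (see FIG. 3)
    theta: float   # azimuth in degrees (see FIG. 3)
    phi: float     # elevation in degrees (see FIG. 3)

@dataclass
class TrackInfo:
    track_id: int                            # ID of the audio track
    track_type: str                          # "object" or "channel"
    positions: List[SoundingObjectPosition]  # time/position pairs
```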
  • The rendering switching signal generation unit 102 generates a rendering method switching instruction signal based on information related to the viewing environment and on the track information 201 (FIG. 2) obtained by the content analysis unit 101. Details of the rendering switching signal generation unit 102 will be described with reference to FIG. 4.
  • FIG. 4 is a block diagram illustrating a configuration of the rendering switching signal generation unit 102.
  • The rendering switching signal generation unit 102 includes an environment information acquisition unit 10201 (position information acquisition unit) and a rendering switching instruction signal calculation unit 10202 (processing unit). The environment information acquisition unit 10201 acquires information on the environment in which the user views the content (hereinafter referred to as environment information).
  • The environment information is assumed to consist of the number of speakers connected to the audio signal processing unit 10 as the audio output unit 20, the position of each speaker, and the type of each speaker. The speaker type is information indicating for which of the plurality of rendering methods used in this system the speaker can be used. In the first embodiment, where the audio signal processing unit 10 uses two rendering methods, the speaker type is information indicating whether each speaker, at the position where it is arranged, can be used for either or both of the methods.
  • The environment information is recorded in the storage unit 104 in advance, and the environment information acquisition unit 10201 reads it from the storage unit 104 as necessary.
  • The environment information recorded in the storage unit 104 may be recorded as metadata described in an arbitrary format, for example XML, in which case the environment information acquisition unit 10201 decodes it as appropriate to extract the information.
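  • Correspondingly, the per-speaker environment information read from the storage unit 104 could be modeled as below. This is a hypothetical sketch in which the two boolean flags encode the "speaker type", that is, usability for each of the two rendering methods.

```python
from dataclasses import dataclass

@dataclass
class SpeakerInfo:
    r: float             # speaker position, coordinate system of FIG. 3
    theta: float         # azimuth in degrees
    phi: float           # elevation in degrees
    usable_for_a: bool   # usable with rendering method A (e.g. VBAP)
    usable_for_b: bool   # usable with rendering method B (e.g. transaural)
```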
  • The sound image position and the speaker positions are expressed in the coordinate system shown in FIG. 3. The coordinate system used here is centered on the origin O: as shown in the top view in (a) of FIG. 3, the distance from the origin O is the radius r, and the azimuth angle θ is 0° at the front of the origin O, with the right and left positions being 90° and −90°, respectively; as shown in the side view in (b) of FIG. 3, the elevation angle φ is 0° at the front of the origin O and 90° directly above the origin O. The sound image position and the speaker position are thus expressed as (r, θ, φ). Hereinafter, the coordinate system of FIG. 3 is used for sound image positions and speaker positions.
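  • The following helper, a sketch under the conventions just stated, converts the (r, θ, φ) notation of FIG. 3 into Cartesian coordinates, which is convenient for geometric tests such as the range checks described later.

```python
import math

def to_cartesian(r, theta_deg, phi_deg):
    """Convert (r, theta, phi) of FIG. 3 into (x, y, z).

    theta: azimuth, 0 deg = front of origin O, +90 deg = right,
           -90 deg = left. phi: elevation, 0 deg = front,
           +90 deg = directly above origin O. x = right, y = front, z = up.
    """
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    x = r * math.cos(phi) * math.sin(theta)
    y = r * math.cos(phi) * math.cos(theta)
    z = r * math.sin(phi)
    return (x, y, z)
```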
  • In the first embodiment, the environment information is acquired in advance and recorded in the storage unit 104, but the present disclosure is not limited to this. The information may be input in real time through an information input terminal (not shown in the first embodiment) such as a tablet terminal. It may also be obtained by image processing of images taken by a camera installed at an arbitrary position in the viewing environment (for example, a marker attached to the audio output unit 20 may be recognized by a camera installed on the ceiling of the room). Alternatively, a device that transmits position information may be attached to the audio output unit 20 itself and used to acquire the various information.
  • The rendering switching instruction signal calculation unit 10202 determines, for each audio track, by which of the plurality of rendering methods its audio signal is to be rendered, based on the environment information obtained from the environment information acquisition unit 10201 and the sounding object position information of the track information 201 (FIG. 2) obtained by the content analysis unit 101, and outputs this information to the rendering unit 103.
  • In the following, in order to make the description easier to understand, it is assumed that the rendering unit 103 simultaneously drives two rendering methods (rendering algorithms), namely rendering method A and rendering method B.
  • FIG. 5 is a flowchart for explaining the operation of the rendering switching instruction signal calculation unit 10202.
  • When the rendering switching instruction signal calculation unit 10202 receives the above-described environment information and track information 201 (FIG. 2), it starts the rendering method selection process (step S101).
  • In step S102, it is confirmed whether rendering method selection processing has been performed for all audio tracks. If the selection processing of step S103 onward has been completed for all audio tracks (YES in step S102), the rendering method selection process is terminated (step S106). If there is an audio track that has not yet been processed (NO in step S102), the process proceeds to step S103.
  • In step S103, the sounding object position information corresponding to the unprocessed audio track is referenced in the acquired track information 201 (FIG. 2), and it is determined whether the sound image position recorded as part of that information is included in the rendering-processable range of rendering method A. Here, the rendering-processable range is the range in which a sound source can be placed under a given rendering method; it is determined, as necessary, with reference to the information (position information) indicating the speaker positions obtained as part of the environment information.
  • Note that determining the rendering-processable range does not necessarily require reference to the environment information (that is, information acquired by some means about the current environment). For example, when the speaker positions are determined by the system in advance and the user places the speakers at these positions according to the system's instructions, the information need not be acquired. It is also possible to define a rendering-processable range that does not depend on speaker positions (as described later, if the rendering process is a downmix to a monaural signal, the entire area can be defined as the processable range).
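  • For the speaker layout of FIG. 6 described next (all method-A speakers at head height), the step-S103 test could be sketched as follows. The tuple layout and function names are assumptions for illustration, not the disclosure's own definitions.

```python
def enclosed_by_adjacent_pair(image_theta, speaker_thetas):
    """True if the image azimuth falls between some pair of azimuthally
    adjacent speakers (the ranges 602-603, 603-605, 604-605, 602-604
    in the FIG. 6 example)."""
    ts = sorted(t % 360.0 for t in speaker_thetas)
    t = image_theta % 360.0
    for a, b in zip(ts, ts[1:] + [ts[0] + 360.0]):
        if a <= t <= b or a <= t + 360.0 <= b:
            return True
    return False

def vbap_processable(image_theta, image_phi, speakers):
    """Sketch of the step-S103 test for a layout where every
    method-A-usable speaker sits at the listener's head height
    (phi == 0), so an image above the speakers (e.g. 609) fails.
    speakers: iterable of (theta, phi, usable_for_a) tuples."""
    usable = [th for th, ph, ok in speakers if ok]
    if not usable or abs(image_phi) > 1e-6:
        return False
    return enclosed_by_adjacent_pair(image_theta, usable)
```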
  • A more specific example will be described with reference to FIG. 6. Assume that a user (listener) 601 is at the position of the origin O, and speakers (audio output devices) 602, 603, 604, and 605 are arranged around the user. The speakers 602, 603, 604, and 605 are placed at the same height as the viewer's head. (a) of FIG. 6 shows this layout viewed from above, and (b) of FIG. 6 shows it viewed from the side. Reference numerals 606, 607, 608, and 609 denote the positions (sound image positions) at which the sound images based on the audio signals of the respective audio tracks should be localized. The sound image positions 606, 607, and 608 are at the same height as the viewer's head, and the sound image position 609 is higher than the viewer's head.
  • Assume that rendering method A is VBAP (the first rendering method) and rendering method B is transaural (the second rendering method), that the speakers usable for VBAP are 602, 603, 604, and 605, and that the speakers usable for transaural are 602 and 603.
  • In this case, the rendering-processable range of rendering method A (VBAP) is the set of ranges sandwiched between adjacent speakers: specifically, the range between speakers 602 and 603, the range between 603 and 605, the range between 604 and 605, and the range between 602 and 604. Audio signals to be localized at the sound image positions 606, 607, and 608, which fall within this range, can therefore be processed by rendering method A (VBAP). On the other hand, the sound image position 609 shown in FIG. 6 is higher than the speakers and is not included in the rendering-processable range of rendering method A (VBAP) (NO in step S103 of FIG. 5). The audio signal for sound image position 609 is therefore rendered by rendering method B (transaural), which can localize a sound image regardless of speaker positions.
  • If the sound image position of the unprocessed audio track is included in the rendering-processable range of rendering method A (YES in step S103), the process proceeds to step S104; otherwise (NO in step S103), it proceeds to step S105. In step S104, an instruction signal (rendering switching signal) directing that the audio signal of the unprocessed audio track be rendered with rendering method A is output to the rendering unit 103. In step S105, an instruction signal (rendering switching signal) directing that it be rendered with rendering method B is output to the rendering unit 103.
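  • Steps S101 to S106 can be summarized by the loop below, a non-normative sketch: `processable_by_a` stands for the step-S103 test, and the track objects are assumed to carry a `track_id` as in the earlier sketch.

```python
def select_rendering_methods(tracks, processable_by_a):
    """Sketch of the FIG. 5 flow: tracks whose sound image position lies
    in method A's processable range get method A, all others method B."""
    instructions = {}
    for track in tracks:                        # step S102 loop
        if processable_by_a(track):             # step S103
            instructions[track.track_id] = "A"  # step S104
        else:
            instructions[track.track_id] = "B"  # step S105
    return instructions                         # passed to rendering unit 103
```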
  • In the above description, the sound image positions of all the audio tracks are assumed to fall within the rendering-processable range of either rendering method A or rendering method B. If this is not guaranteed, that is, if a sound image position may fall within the processable range of neither method A nor method B, the rendering method selection process may instead be performed according to the flow shown in FIG. 7, which is a modification of the flow shown in FIG. 5.
  • When the rendering switching instruction signal calculation unit 10202 receives the environment information and the track information 201 (FIG. 2), the rendering method selection process starts (step S111). In step S112, it is confirmed whether rendering method selection processing has been performed for all audio tracks. If the selection processing of step S113 onward has been completed for all audio tracks (YES in step S112), the rendering method selection process is terminated (step S118). On the other hand, if there is an unprocessed audio track (NO in step S112), the sounding object position information corresponding to that track is referenced in the acquired track information 201 (FIG. 2), and, as in step S103 described above, it is determined whether the sound image position recorded as part of that information is included in the rendering-processable range of rendering method A (step S113).
  • If the sound image position is within the rendering-processable range of rendering method A (YES in step S113), the process proceeds to step S114, where an instruction signal directing that the audio signal of the unprocessed audio track be rendered with rendering method A is output to the rendering unit 103. If the sound image position is not included in the rendering-processable range of rendering method A (NO in step S113), the process proceeds to step S115, where it is determined whether the sound image position is included in the rendering-processable range of rendering method B. If it is (YES in step S115), the process proceeds to step S116, where an instruction signal directing that the audio signal be rendered with rendering method B is output to the rendering unit 103. If the sound image position is included in the rendering-processable range of neither rendering method A nor rendering method B (NO in step S115), the process proceeds to step S117, where an instruction signal directing that the audio signal of the unprocessed audio track not be rendered is output to the rendering unit 103.
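  • The FIG. 7 variant adds the second range test and the no-rendering outcome, sketched below under the same illustrative assumptions as the FIG. 5 sketch.

```python
def select_rendering_methods_fig7(tracks, processable_by_a, processable_by_b):
    """Sketch of the FIG. 7 flow: tracks fitting neither method's
    processable range are marked not to be rendered at all."""
    instructions = {}
    for track in tracks:                         # step S112 loop
        if processable_by_a(track):              # step S113
            instructions[track.track_id] = "A"   # step S114
        elif processable_by_b(track):            # step S115
            instructions[track.track_id] = "B"   # step S116
        else:
            instructions[track.track_id] = None  # step S117: no rendering
    return instructions
```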
  • In the above description, two selectable rendering methods are assumed, but it goes without saying that the selection may be made from among three or more rendering methods.
  • In the present disclosure, the rendering switching instruction signal calculation unit 10202 is described as instructing switching of the rendering method. The expression "instructing switching" here covers both instructing a change of rendering method from A to B or from B to A, and instructing that rendering method A continue to be used for the track following a track that used rendering method A (and likewise for method B).
  • The rendering unit 103 constructs the audio signal to be output from the audio output unit 20, based on the input audio signal and on the instruction signal output from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102. The rendering unit 103 simultaneously drives two rendering algorithms, switches the algorithm to be used based on the instruction signal, and renders the audio signal. Here, rendering means the processing that converts an audio signal (input audio signal) included in the content into a signal to be output from the audio output unit 20.
  • FIG. 8 is a flowchart showing the operation of the rendering unit 103.
  • When the rendering unit 103 receives the input audio signal and the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102, it starts the rendering process (step S201). In step S202, it is confirmed whether rendering processing has been performed for all audio tracks. If the rendering processing of step S203 onward has been completed for all audio tracks (YES in step S202), the rendering process is terminated (step S208). If there is an unprocessed audio track (NO in step S202), it is rendered using the rendering method indicated by the instruction signal from the rendering switching instruction signal calculation unit 10202. When the instruction signal indicates rendering method A (rendering method A in step S203), the parameters necessary for rendering the audio signal with rendering method A are read from the storage unit 104 (step S204), and rendering based on them is performed (step S205). When the instruction signal indicates rendering method B (rendering method B in step S203), the parameters necessary for rendering with rendering method B are read from the storage unit 104 (step S206), and rendering based on them is performed (step S207). If the instruction signal indicates no rendering, based on the flow of FIG. 7 (no rendering in step S203), the corresponding track is not rendered and is not included in the output audio.
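  • The per-track dispatch of FIG. 8 could be sketched as follows. Here `render_a`, `render_b`, and `load_params` are placeholders for the two rendering algorithms and for the parameter reads from the storage unit 104, none of which the disclosure specifies at this level of detail.

```python
def render_all(tracks, instructions, render_a, render_b, load_params):
    """Sketch of the FIG. 8 flow: dispatch each track to the algorithm
    named by its instruction signal; tracks marked None are skipped."""
    outputs = []
    for track in tracks:                             # step S202 loop
        method = instructions[track.track_id]        # step S203 branch
        if method == "A":
            params = load_params("A")                # step S204
            outputs.append(render_a(track, params))  # step S205
        elif method == "B":
            params = load_params("B")                # step S206
            outputs.append(render_b(track, params))  # step S207
        # method is None: track not rendered, excluded from output audio
    return outputs
```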
  • The storage unit 104 is a secondary storage device for recording various data used by the rendering switching signal generation unit 102 and the rendering unit 103. The storage unit 104 is implemented by, for example, a magnetic disk, an optical disc, or a flash memory; more specific examples include an HDD, an SSD (Solid State Drive), an SD memory card, a BD, and a DVD. The rendering switching signal generation unit 102 and the rendering unit 103 read data from the storage unit 104 as necessary. Various parameter data, including coefficients calculated by the rendering switching signal generation unit 102, can also be recorded in the storage unit 104.
  • The audio output unit 20 outputs the audio obtained by the rendering unit 103. The audio output unit 20 consists of a plurality of independent speakers, each of which includes a speaker unit and an amplifier that drives it.
  • As described above, the environment information acquisition unit 10201 acquires the position information of each speaker constituting the audio output unit 20, and the rendering switching instruction signal calculation unit 10202 selects a rendering method based on the plurality of pieces of position information acquired by the environment information acquisition unit 10201.
  • Thus, a suitable rendering method that takes sound image localization into account is automatically determined according to the arrangement of the speakers placed by the user and the information obtained from the content, and audio reproduction is performed accordingly, making it possible to deliver well-localized sound to the user.
  • In the first embodiment, content including a plurality of audio tracks is targeted for reproduction. However, the present disclosure is not limited to this; content including a single audio track may also be targeted for reproduction, in which case a rendering method suitable for that one audio track is selected from the plurality of rendering methods.
  • Regarding the rendering methods: in the first embodiment, VBAP, transaural rendering, and downmixing to a monaural signal have been described. However, the present disclosure is not limited to these rendering methods.
  • For example, a rendering method similar to VBAP, in which the audio signal is output from each audio output device at a sound pressure ratio corresponding to the sound image position (reproduction position), may be employed, or a rendering method similar to transaural, in which an audio signal processed according to the sound image position (reproduction position) is output from each audio output device, may be employed. When the sound image position is included in the range defined by the arrangement positions of the plurality of audio output devices, adopting a rendering method that outputs from each audio output device at a sound pressure ratio according to the sound image position realizes an audio environment in which emphasis is placed on sound quality. On the other hand, with a rendering method in which the signal is processed according to the sound image position (reproduction position), such as transaural, the sound image can be localized without being restricted by the arrangement of the audio output devices.
  • In addition, downmixing to a stereo signal can also be adopted as one of the rendering methods.
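  • As a minimal sketch of the monaural-downmix fallback mentioned above (assuming NumPy and a tracks-by-samples array, neither of which the disclosure prescribes):

```python
import numpy as np

def downmix_to_mono(track_signals):
    """Sum all track signals (shape: tracks x samples) into one monaural
    signal, scaled back only if the sum would clip. No localization is
    attempted, so every sound image position is 'processable'."""
    mix = np.sum(track_signals, axis=0)
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix
```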
  • [Embodiment 2] FIG. 9 is a block diagram illustrating the main configuration of the audio signal processing system 1a according to Embodiment 2 of the present disclosure.
  • The audio signal processing system 1a according to the second embodiment differs from the audio signal processing system 1 of the first embodiment only in the behavior of the rendering switching signal generation unit; the other processing units are the same, so the description given in the first embodiment applies to them except where noted below. The audio signal processing unit 10a of the audio signal processing system 1a includes a rendering switching signal generation unit 102a (position information acquisition unit, processing unit) in place of the rendering switching signal generation unit 102 of the audio signal processing unit 10 described in the first embodiment.
  • The rendering switching signal generation unit 102a further acquires viewing position information indicating the viewing position of the user, in addition to the track information and environment information (speaker position information) acquired by the rendering switching signal generation unit 102 of the first embodiment. The rendering switching signal generation unit 102a selects one rendering method from among the plurality of rendering methods based on the track information, the position information, and the viewing position information. Details are described below. In the second embodiment as well, for convenience of explanation, the selection is made from two rendering methods.
  • The rendering switching signal generation unit 102a generates a rendering method switching instruction signal based on information related to the viewing environment, on the track information 201 (FIG. 2) obtained by the content analysis unit 101, and on the viewing position information. Details of the rendering switching signal generation unit 102a will be described with reference to FIG. 10.
  • FIG. 10 is a block diagram showing a configuration of the rendering switching signal generation unit 102a.
  • The rendering switching signal generation unit 102a includes an environment information acquisition unit 10201a and a rendering switching instruction signal calculation unit 10202a. The environment information acquisition unit 10201a acquires information on the environment in which the user views content (hereinafter referred to as environment information).
  • In the second embodiment, the environment information (viewing environment information) consists of the number, positions, and types of the speakers connected to the system as the audio output unit 20, as in the first embodiment, with information indicating the viewing position of the user added to it.
  • The viewing environment information is acquired and updated in real time: a camera (not shown) installed at an arbitrary position in the viewing environment and connected to the environment information acquisition unit 10201a photographs the user and the speakers (audio output unit 20), to which markers have been attached in advance, acquires their three-dimensional positions, and updates the viewing environment information. The user's position may instead be acquired by applying face recognition to the images obtained from the same camera.
  • The rendering switching instruction signal calculation unit 10202a determines, for each audio track, by which of the plurality of rendering methods its audio signal is to be rendered, based on the environment information obtained from the environment information acquisition unit 10201a and the sounding object position information of the track information 201 (FIG. 2) obtained by the content analysis unit 101, and outputs this information to the rendering unit 103.
  • Upon receiving the above-described environment information and track information 201 (FIG. 2), the rendering switching instruction signal calculation unit 10202a starts the rendering method selection process shown in FIG. 11 (step S301).
  • In step S302, it is confirmed whether rendering method selection processing has been performed for all audio tracks. If the selection processing of step S303 onward has been completed for all audio tracks (YES in step S302), the rendering method selection process is terminated (step S310). On the other hand, if there is an audio track that has not yet been processed (NO in step S302), the process proceeds to step S303.
  • In step S303, the sounding object position information corresponding to the unprocessed audio track is referenced in the acquired track information 201 (FIG. 2). If the sound image position recorded as part of that information is included in the rendering-processable range of rendering method A (YES in step S303), and the viewing position information indicates that the user's current position is within the viewing effective range of rendering method A (YES in step S304), an instruction signal directing that the audio signal of the audio track be rendered with rendering method A is output (step S305).
  • When the sound image position recorded as part of the sounding object position information is not included in the rendering-processable range of rendering method A (NO in step S303), or the viewing position information indicates that the user is outside the viewing effective range of rendering method A (NO in step S304), the process proceeds to step S306, where it is confirmed whether rendering with rendering method B is possible.
  • If the sound image position recorded as part of the sounding object position information is included in the rendering-processable range of rendering method B (YES in step S306), and the viewing position information indicates that the user's current position is within the viewing effective range of rendering method B (YES in step S307), an instruction signal directing that the audio signal of the audio track be rendered with rendering method B is output (step S308). If the sound image position is not included in the rendering-processable range of rendering method B (NO in step S306), or the user's current position is outside the viewing effective range of rendering method B (NO in step S307), an instruction not to render the audio signal of the audio track is issued (step S310).
  • As described in the first embodiment, the rendering-processable range is the range in which sound sources can be placed under a given rendering method. The viewing effective range is the recommended viewing area in which the effect of each rendering method can be enjoyed (for example, in FIG. 12, the viewing effective range of rendering method A is shown as 1202 and that of rendering method B as 1203); the range recorded in advance in the storage unit 104 for each rendering method is read as appropriate.
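  • The selection logic of steps S303 to S310 thus adds a viewing-position test to each range test. A hedged sketch follows, with all predicate functions assumed rather than taken from the disclosure:

```python
def select_with_viewing_position(track, user_pos, proc_a, proc_b,
                                 in_effective_a, in_effective_b):
    """Sketch of the FIG. 11 flow: a method is chosen only when the
    sound image position is processable by it AND the user's current
    position lies inside that method's viewing effective range."""
    if proc_a(track) and in_effective_a(user_pos):   # steps S303, S304
        return "A"                                   # step S305
    if proc_b(track) and in_effective_b(user_pos):   # steps S306, S307
        return "B"                                   # step S308
    return None                                      # no rendering (step S310)
```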
  • In this way, a suitable rendering method that takes sound image localization into account is determined from the positions of the speakers placed by the user, the information obtained from the content, and the user's viewing position information, and audio reproduction is performed accordingly, making it possible to deliver well-localized sound to the user.
  • In the above description, rendering method A is VBAP and rendering method B is transaural, but conversely rendering method A may be transaural and rendering method B VBAP. Transaural can localize a sound image without being limited to the range of the speaker arrangement positions, whereas with VBAP the sound image position depends on the speaker arrangement positions. Therefore, in the aspect of the first embodiment, in which it is first determined whether an audio track can be processed by VBAP and another method is used only when it cannot, the rendering method may change within the content, which can feel unnatural to the user. If, in accordance with the operation flow described above, it is instead first determined whether processing is possible with transaural (rendering method A), which does not depend on the speaker arrangement positions, rendering based on a method that can cover a wide range of sound image positions occupies most of the content, and the above-mentioned unnaturalness is less likely to arise. On the other hand, compared with transaural, VBAP offers better sound quality when localizing the sound image within the range of the speaker placement positions, so the aspect of the first embodiment, which first determines whether processing with VBAP is possible, can be said to emphasize sound quality.
  • An audio signal processing device (audio signal processing unit 10, 10a) according to aspect 1 of the present disclosure renders the audio signals of one or a plurality of audio tracks and outputs them to a plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)). It includes: a reproduction position specifying unit (content analysis unit 101) that specifies the reproduction position of the audio signal of an audio track based on that audio track or on information associated with it; a position information acquisition unit (rendering switching signal generation unit 102, 102a) that acquires position information of each of the audio output devices; and a processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) that selects one rendering method from a plurality of rendering methods based on the reproduction position and the position information, and renders the audio signal of the audio track corresponding to that reproduction position using the selected method.
  • According to the above configuration, a suitable rendering method is selected from the plurality of rendering methods based on the position of each audio output device and the reproduction position (sound image position) of the audio signal of the audio track. When the input audio signal includes a plurality of audio tracks, rendering is performed for each audio track; when it includes one audio track, rendering is performed using a rendering method suitable for that one audio track. In either case, audio rendered by a rendering method suitable for the user's viewing situation can be presented to the user.
  • In an audio signal processing device (audio signal processing unit 10a) according to aspect 2 of the present disclosure, in aspect 1 above, the position information acquisition unit may further acquire viewing position information indicating the viewing position of the user, and the processing unit (rendering switching signal generation unit 102a, rendering unit 103) may select the one rendering method from the plurality of rendering methods based on the reproduction position, the position information, and the viewing position information, and render the audio signal of the audio track corresponding to the reproduction position using the selected method. According to this configuration, the rendering method can be selected in consideration of the user's viewing position, and sound image localization can be reproduced more suitably.
  • In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 3 of the present disclosure, in aspect 1 or 2 above, the reproduction position specifying unit (content analysis unit 101) may be configured to analyze the audio track or the information associated with it and generate track information indicating the reproduction position. According to this configuration, the track information can be generated by the reproduction position specifying unit analyzing the audio track or its associated information.
  • An audio signal processing device (audio signal processing unit 10, 10a) according to aspect 4 of the present disclosure, in any of aspects 1 to 3 above, may further include a storage unit (104) for storing the parameters required by the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a).
  • In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 5 of the present disclosure, in any of aspects 1 to 4 above, the plurality of rendering methods may include a first rendering method in which the audio signal is output from each audio output device (audio output unit 20 (speakers 602, 603, 604, 605)) at a sound pressure ratio according to the reproduction position, and a second rendering method in which an audio signal processed according to the reproduction position is output from each audio output device. In an audio signal processing device according to aspect 6 of the present disclosure, in aspect 5 above, the first rendering method may be VBAP and the second rendering method may be transaural.
  • In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 7 of the present disclosure, in any of aspects 1 to 6 above, the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) may determine whether the reproduction position is included in the range defined by the arrangement positions of the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)) and select the one rendering method according to the determination result.
  • In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 8 of the present disclosure, in aspect 2 above, the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) may specify the viewing effective range of each rendering method, determine whether the reproduction position is included in the range defined by the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)), determine whether the viewing position of the user indicated by the viewing position information is included in the viewing effective range, and select the one rendering method according to the determination results.
  • An audio signal processing system (audio signal processing system 1, 1a) according to aspect 9 of the present disclosure includes the audio signal processing device (audio signal processing unit 10, 10a) according to any of aspects 1 to 8 and the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)).
  • 1, 1a Audio signal processing system; 10, 10a Audio signal processing unit (audio signal processing device); 20 Audio output unit (plurality of audio output devices); 101 Content analysis unit (reproduction position specifying unit); 102, 102a Rendering switching signal generation unit (position information acquisition unit, processing unit); 103 Rendering unit (processing unit); 104 Storage unit; 201 Track information; 602, 603, 604, 605 Speaker (audio output device); 606, 607, 608, 609 Sound image position (reproduction position); 10201, 10201a Environment information acquisition unit (position information acquisition unit); 10202, 10202a Rendering switching instruction signal calculation unit (processing unit)

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present invention addresses the problem of presenting a user with audio rendered by a rendering scheme that is preferable under the viewing situation of the user. An audio signal processing system (1) according to an embodiment of the present disclosure is provided with an audio signal processing unit (10) which selects one rendering scheme from a plurality of rendering schemes on the basis of position information of the audio output devices and track information indicating a playback position for an input audio signal, and renders the input audio signal using the selected rendering scheme.

Description

Audio signal processing apparatus and audio signal processing system
The present disclosure relates to an audio signal processing device and an audio signal processing system.
Currently, users can easily obtain content including multi-channel audio (surround audio) via broadcast waves, disc media such as DVD (Digital Versatile Disc) and BD (Blu-ray (registered trademark) Disc), and the Internet. In movie theaters and similar venues, many three-dimensional sound systems based on object-based audio, represented by Dolby Atmos, have been deployed, and in Japan 22.2ch audio has been adopted in the next-generation broadcasting standard, so users have far more opportunities to encounter multi-channel content. Various multi-channelization techniques have also been studied for conventional stereo audio signals; Patent Document 1 discloses a technique for multi-channelization based on the correlation between the channels of a stereo signal.
As for systems for reproducing multi-channel audio, systems that can easily be enjoyed at home, rather than only in facilities equipped with large-scale audio equipment such as the above-mentioned movie theaters and halls, are becoming common. Specifically, the user (listener) can build an environment for listening to multi-channel audio such as 5.1ch or 7.1ch at home by arranging a plurality of speakers based on the arrangement standards recommended by the International Telecommunication Union (ITU). Techniques for reproducing multi-channel sound image localization with a small number of speakers have also been studied (Non-Patent Document 1).
Patent Document 1: Japanese Patent Publication JP 2013-055439 A (published March 21, 2013). Patent Document 2: Japanese Patent Publication JP H11-113098 A (published April 23, 1999).
Vector Base Amplitude Panning (VBAP) and sound pressure panning, described in Non-Patent Document 1, control the sound pressure based on the positional relationship between a set of speakers, for example the group of three speakers 1302, 1303, and 1304 shown in (a) of FIG. 13 or the pair of speakers 1306 and 1307 shown in (b) of FIG. 13, and the sound image 1301 or 1305 to be reproduced, thereby reproducing a sound image at an arbitrary position within the range surrounded by that set of speakers. Since the technique can reproduce sound images within the range surrounded by a set of speakers even when a plurality of sound images exist, it can reproduce a multi-channel audio signal (for example, 22.2 ch or 5.1 ch) with a smaller number of speakers.
However, as described above, VBAP and sound pressure panning can reproduce a sound image only within the range surrounded by a set of speakers. Therefore, if a speaker cannot be installed in a given area of the user's viewing environment, for example at a position close to the ceiling, a sound image in the height direction cannot be reproduced.
On the other hand, if the transaural technique shown in Non-Patent Document 2 or Patent Document 2 is used, three-dimensional sound image control can be performed with as few as two speakers. This has the advantage that, for example, sound image localization at an arbitrary position around the user can be reproduced using only two speakers installed in front of the user. However, since this technique assumes, in principle, a specific listening area within which the acoustic effect is obtained, if the listener moves out of that listening area, the sound image may be localized at an unexpected position, or localization may not be perceived at all.
An object of one aspect of the present disclosure is to realize an audio signal processing device capable of presenting to the user audio rendered by a rendering method suitable for the user's viewing situation, and an audio signal processing system including such a device.
To solve the above problem, an audio signal processing device according to one aspect of the present disclosure is an audio signal processing device that renders the audio signals of one or more audio tracks and outputs them to a plurality of audio output devices, and includes: a reproduction position specifying unit that specifies the reproduction position of the audio signal of an audio track based on that audio track or on information associated with it; a position information acquisition unit that acquires position information of each audio output device; and a processing unit that selects one rendering method from a plurality of rendering methods based on the reproduction position and the position information, and renders, using the selected rendering method, the audio signal of the audio track corresponding to that reproduction position. In general, an audio track may contain a plurality of audio channels; in the present disclosure, however, it is assumed for ease of explanation that each audio track contains one audio channel.
In order to solve the above problem, an audio signal processing system according to one aspect of the present disclosure includes the audio signal processing device having the above-described configuration and the plurality of audio output devices.
According to one aspect of the present disclosure, audio rendered by a rendering method suitable for the user's viewing situation can be presented to the user.
FIG. 1 is a block diagram illustrating the main configuration of an audio signal processing system according to Embodiment 1 of the present disclosure. FIG. 2 is a diagram showing an example of track information used in the audio signal processing system according to Embodiment 1 of the present disclosure. FIG. 3 is a diagram showing the coordinate system used in the description of the present disclosure. FIG. 4 is a block diagram illustrating the main configuration of a rendering switching signal generation unit according to Embodiment 1 of the present disclosure. FIG. 5 is a diagram illustrating the processing flow of the rendering switching signal generation unit according to Embodiment 1 of the present disclosure. FIG. 6 is a diagram showing the relationship between speaker arrangement positions and sound image positions. FIG. 7 is a diagram illustrating the processing flow of another form of the rendering switching signal generation unit according to Embodiment 1 of the present disclosure. FIG. 8 is a diagram illustrating the processing flow of a rendering unit according to Embodiment 1 of the present disclosure. FIG. 9 is a block diagram illustrating the main configuration of an audio signal processing system according to Embodiment 2 of the present disclosure. FIG. 10 is a block diagram illustrating the main configuration of a rendering switching signal generation unit according to Embodiment 2 of the present disclosure. FIG. 11 is a diagram illustrating the processing flow of the rendering switching signal generation unit according to Embodiment 2 of the present disclosure. FIG. 12 is a schematic diagram showing the viewing effective range of each rendering method. FIG. 13 is a schematic diagram explaining the VBAP method and the sound pressure panning method.
[Embodiment 1]
Hereinafter, an embodiment of the present disclosure will be described with reference to FIGS. 1 to 8.
FIG. 1 is a block diagram showing the main configuration of the audio signal processing system 1 according to the first embodiment. The audio signal processing system 1 according to the first embodiment includes an audio signal processing unit 10 (audio signal processing device) and an audio output unit 20 (a plurality of audio output devices).
<Audio signal processing unit 10>
The audio signal processing unit 10 is an audio signal processing device that renders the audio signals of one or a plurality of audio tracks using two different rendering methods. The rendered audio signal is output from the audio signal processing unit 10 to the audio output unit 20.
The audio signal processing unit 10 includes: a content analysis unit 101 (reproduction position specifying unit) that specifies the sound image position (reproduction position) of the audio signal of an audio track based on the input audio signal or on information accompanying it; a rendering switching signal generation unit 102 (position information acquisition unit, processing unit) that acquires position information of the audio output unit 20; and a rendering unit 103 (processing unit) that renders the audio signal of the audio track corresponding to a sound image position, using one rendering method selected from a plurality of rendering methods based on that sound image position (reproduction position) and the position information.
The audio signal processing unit 10 also includes a storage unit 104, as shown in FIG. 1. The storage unit 104 stores various parameters required or generated by the rendering switching signal generation unit 102 and the rendering unit 103.
Hereinafter, each component will be described in detail.
[Content Analysis Unit 101]

The content analysis unit 101 analyzes the audio tracks contained in video or audio content recorded on a disc medium such as a DVD or BD, an HDD (Hard Disc Drive), or the like, together with any metadata (information) accompanying them, and obtains sounding object position information. The sounding object position information is sent from the content analysis unit 101 to the rendering switching signal generation unit 102 and the rendering unit 103.
In the first embodiment, it is assumed that the audio content received by the content analysis unit 101 contains two or more audio tracks. Each audio track may be a "channel-based" audio track as employed in stereo (2ch), 5.1ch, and the like, or an "object-based" audio track in which each individual sounding object occupies one track and is accompanied by metadata describing its positional and volume changes over time.
The concept of an "object-based" audio track is as follows. In object-based audio, each sounding object is recorded on its own track, that is, recorded without being mixed down, and the player (reproduction device) renders these sounding objects as appropriate. Although the details differ among standards and formats, each sounding object is generally associated with metadata specifying when, where, and at what volume it should be sounded, and the player renders each sounding object accordingly.
On the other hand, a "channel-based" audio track is the type employed in conventional surround formats (for example, 5.1ch surround): the individual sounding objects are recorded in a mixed state, on the premise that they will be reproduced from predetermined reproduction positions (speaker placement positions).
(Sounding object position information)

Here, the sounding object position information will be described with reference to FIG. 2.
FIG. 2 conceptually shows the structure of the track information 201 obtained through analysis by the content analysis unit 101.
The content analysis unit 101 analyzes all the audio tracks contained in the content and reconstructs them as the track information 201 shown in FIG. 2. The track information 201 records the ID of each audio track and the type of that track.
When an audio track is an object-based track, one or more pieces of sounding object position information accompany it as metadata. Each piece of sounding object position information consists of a pair of a reproduction time and the sound image position (reproduction position) at that reproduction time.
Similarly, when an audio track is a channel-based track, pairs of a reproduction time and the sound image position (reproduction position) at that time are recorded; in this case, however, the reproduction time spans from the start to the end of the content, and the sound image position at that time is based on the reproduction position predefined for the channel base.
The sound image position (reproduction position) recorded as part of the sounding object position information is expressed in the coordinate system shown in FIG. 3. The track information 201 is assumed to be described in a markup language such as XML (Extensible Markup Language).
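As a concrete illustration, the following is a minimal sketch of a data structure corresponding to the track information 201; the class and field names are hypothetical, since the disclosure only specifies that the information records a track ID, a track type, and time/position pairs in a markup language such as XML.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackInfo:
    """Illustrative stand-in for one entry of the track information 201.
    Positions use the (r, theta, phi) coordinate system of FIG. 3."""
    track_id: int
    track_type: str  # "object" (object-based) or "channel" (channel-based)
    # Sounding object position information: (reproduction time [s], (r, theta, phi))
    positions: List[Tuple[float, Tuple[float, float, float]]]

# An object-based track: the sound image may move over time.
object_track = TrackInfo(
    track_id=1,
    track_type="object",
    positions=[(0.0, (1.0, 30.0, 0.0)), (5.0, (1.0, 30.0, 45.0))],
)

# A channel-based track: one fixed, predefined reproduction position
# that applies from the start to the end of the content.
channel_track = TrackInfo(
    track_id=2,
    track_type="channel",
    positions=[(0.0, (1.0, -30.0, 0.0))],
)
```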
[Rendering switching signal generation unit 102]

As will be described in detail later, the rendering switching signal generation unit 102 generates a rendering method switching instruction signal based on information about the viewing environment and on the track information 201 (FIG. 2) obtained by the content analysis unit 101. The details of the rendering switching signal generation unit 102 will be described with reference to FIG. 4.
FIG. 4 is a block diagram showing the configuration of the rendering switching signal generation unit 102. As shown in FIG. 4, the rendering switching signal generation unit 102 includes an environment information acquisition unit 10201 (position information acquisition unit) and a rendering switching instruction signal calculation unit 10202 (processing unit).
[Environment information acquisition unit 10201]

The environment information acquisition unit 10201 is configured to acquire information on the environment in which the user views the content (hereinafter referred to as environment information).
Here, in the first embodiment, the environment information consists of the number of speakers connected to the audio signal processing unit 10 as the audio output unit 20, the positions of those speakers, and the speaker types. The speaker type is information indicating with which of the plurality of rendering methods used in this system a speaker can be used. When the audio signal processing unit 10 uses two rendering methods, as described in the first embodiment, the speaker type indicates, for each speaker at its placement position, whether it can be used with one or both of these methods.
The environment information is recorded in the storage unit 104 in advance. The environment information acquisition unit 10201 therefore reads it from the storage unit 104 as necessary.
The environment information recorded in the storage unit 104 may be recorded as metadata described in an arbitrary format, for example XML; in this case, the environment information acquisition unit 10201 decodes it as appropriate to extract the information.
The sound image positions and speaker positions are expressed in the coordinate system shown in FIG. 3. The coordinate system used here is centered on the origin O. As shown in the top view in (a) of FIG. 3, the distance from the origin O is the radius r, and the azimuth angle θ is 0° directly in front of the origin O, with the right and left positions at 90° and −90°, respectively. As shown in the side view in (b) of FIG. 3, the elevation angle φ is 0° in front of the origin O and 90° directly above it. A sound image position or speaker position is thus written as (r, θ, φ). In the following description, unless otherwise noted, sound image positions and speaker positions use the coordinate system of FIG. 3.
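For reference, the following is a minimal sketch of the conversion from the (r, θ, φ) notation of FIG. 3 into Cartesian coordinates; the axis convention (x forward, y to the right, z upward) is an assumption made for illustration, not one specified by the disclosure.

```python
import math

def spherical_to_cartesian(r: float, theta_deg: float, phi_deg: float):
    """Convert a position (r, theta, phi) in the FIG. 3 coordinate system
    to Cartesian coordinates. Assumed axes: x points forward (theta = 0),
    y points to the right (theta = 90), z points straight up (phi = 90)."""
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    x = r * math.cos(phi) * math.cos(theta)  # forward component
    y = r * math.cos(phi) * math.sin(theta)  # rightward component
    z = r * math.sin(phi)                    # upward component
    return x, y, z

# A sound image directly above the listener at distance 1:
# (1, 0, 90) -> approximately (0, 0, 1).
print(spherical_to_cartesian(1.0, 0.0, 90.0))
```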
In the first embodiment, as described above, the environment information is acquired in advance and recorded in the storage unit 104. However, the present disclosure is not limited to this; for example, the information may be input in real time through an information input terminal such as a tablet terminal (not shown in the first embodiment). Alternatively, the information may be obtained by image processing of images captured by a camera installed at an arbitrary position in the viewing environment (for example, by attaching markers to the audio output unit 20 and having a camera installed on the ceiling of the room recognize them), or the audio output unit 20 itself may be equipped with a device that transmits its position information (for example, using beacons) so that the various pieces of information can be acquired.
[Rendering switching instruction signal calculation unit 10202]

Based on the environment information obtained from the environment information acquisition unit 10201 and the sounding object position information in the track information 201 (FIG. 2) obtained by the content analysis unit 101, the rendering switching instruction signal calculation unit 10202 determines, for each audio track, with which of the plurality of rendering methods its audio signal is to be rendered, and outputs that information to the rendering unit 103.
Here, in the first embodiment, to keep the description easy to follow, it is assumed that the rendering unit 103 drives two rendering methods (rendering algorithms), rendering method A and rendering method B, simultaneously.
The operation of the rendering switching instruction signal calculation unit 10202 will now be described with reference to FIG. 5. FIG. 5 is a flowchart explaining the operation of the rendering switching instruction signal calculation unit 10202.
Upon receiving the environment information and the track information 201 (FIG. 2) described above, the rendering switching instruction signal calculation unit 10202 starts the rendering method selection process (step S101).
It then checks whether the rendering method selection process has been performed for all the audio tracks (step S102). If the selection process from step S103 onward has been completed for all the audio tracks (YES in step S102), the rendering method selection process ends (step S106). If there is an audio track that has not yet been processed (NO in step S102), the process proceeds to step S103.
In step S103, the sounding object position information corresponding to the unprocessed audio track is looked up in the acquired track information 201 (FIG. 2), and it is determined whether the sound image position recorded as part of that sounding object position information falls within the renderable range of rendering method A.
Here, the renderable range indicates the range in which a sound source can be placed under a particular rendering method, and is determined, as necessary, with reference to the speaker position information obtained as part of the environment information. Note that determining the renderable range does not necessarily require reference to the environment information (that is, information acquired about the current environment by some means). For example, when the speaker positions are determined in advance by the system and the user places the speakers at those positions in accordance with the system's instructions, there is no need to acquire that information. It is also possible to define the renderable range independently of the speaker positions (as described later, when the rendering process is a downmix to a monaural signal, the entire region can be defined as the renderable range).
A more specific example will be described with reference to FIG. 6. Suppose that a user (listener) 601 is at the position of the origin O and that speakers (audio output devices) 602, 603, 604, and 605 are arranged around the user. The speakers 602, 603, 604, and 605 are placed at the same height as the viewer's head. In FIG. 6, (a) shows the arrangement viewed from above, and (b) shows it viewed from the side. Reference numerals 606, 607, 608, and 609 denote the positions at which the sound images based on the audio signals of the respective audio tracks should be localized (sound image positions). The sound image positions 606, 607, and 608 are at the same height as the viewer's head, and the sound image position 609 is higher than the viewer's head. In this case, let rendering method A be VBAP (first rendering method) and rendering method B be transaural (second rendering method), with speakers 602, 603, 604, and 605 usable for VBAP and speakers 602 and 603 usable for transaural. The renderable range of rendering method A (VBAP) is then the set of ranges between adjacent speakers, specifically the range between speakers 602 and 603, the range between 603 and 605, the range between 604 and 605, and the range between 602 and 604. Audio signals to be localized at the sound image positions 606, 607, and 608, which fall within this range (YES in step S103 of FIG. 5), can therefore be processed by rendering method A (VBAP). On the other hand, the sound image position 609 shown in FIG. 6 is higher than the speaker positions and is not included in the renderable range of rendering method A (VBAP) (NO in step S103 of FIG. 5). In this case, the audio signal for the sound image position 609 is rendered by rendering method B, a rendering method (transaural) that can localize a sound image at an arbitrary position regardless of the speaker positions.
That is, if the result of the determination in step S103 is that the sound image position of the unprocessed audio track falls within the renderable range of rendering method A (YES in step S103), the process proceeds to step S104. If it does not fall within the renderable range of rendering method A (NO in step S103), the process proceeds to step S105.
In step S104, an instruction signal (rendering switching signal) for rendering the audio signal of the unprocessed audio track using rendering method A is output to the rendering unit 103.
In step S105, on the other hand, an instruction signal (rendering switching signal) for rendering the audio signal of the unprocessed audio track using rendering method B is output to the rendering unit 103.
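The selection flow of FIG. 5 can be summarized in code. The following is a minimal sketch under the FIG. 6 configuration, reusing the illustrative TrackInfo above; the renderable range of method A (VBAP) is crudely approximated by two conditions (the sound image lies at speaker height, φ = 0, and its azimuth falls between some adjacent pair of VBAP-capable speakers), and the helper names and the range test are assumptions for illustration, not part of the disclosure.

```python
def azimuth_between(theta: float, a: float, b: float) -> bool:
    """True if azimuth theta lies in the arc between speaker azimuths a and b.
    Illustrative only; a real VBAP implementation tests the speaker-pair basis."""
    lo, hi = min(a, b), max(a, b)
    return lo <= theta <= hi

def in_method_a_range(position, vbap_speaker_azimuths) -> bool:
    """Approximate renderable range of method A (VBAP) in the FIG. 6 setup:
    the sound image must be at speaker height (phi = 0 here) and between two
    adjacent speakers. Azimuths are assumed sorted; the wrap-around pair
    behind the listener is omitted for brevity."""
    r, theta, phi = position
    if phi != 0.0:  # e.g. position 609 above the speakers -> NO in step S103
        return False
    pairs = zip(vbap_speaker_azimuths, vbap_speaker_azimuths[1:])
    return any(azimuth_between(theta, a, b) for a, b in pairs)

def select_rendering_method(tracks, vbap_speaker_azimuths):
    """FIG. 5 flow: for each track, method A if its sound image position is
    within A's renderable range (step S103 YES), otherwise method B (S105)."""
    instructions = {}
    for track in tracks:                  # loop of steps S102-S105
        _, position = track.positions[0]  # sound image position of the track
        in_a = in_method_a_range(position, vbap_speaker_azimuths)
        instructions[track.track_id] = "A" if in_a else "B"
    return instructions
```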
The above description assumes that the sound image positions of all the audio tracks fall within the renderable range of either rendering method A or rendering method B. When this does not hold, that is, when a sound image position may fall within neither renderable range, the rendering method selection process may follow the flow shown in FIG. 7.
FIG. 7 shows a modification of the flow shown in FIG. 5.
The first half of the processing flow shown in FIG. 7 is the same as that of FIG. 5.
That is, the rendering switching instruction signal calculation unit 10202 first receives the environment information and the track information 201 (FIG. 2), and the rendering method selection process starts (step S111).
Next, it is checked whether the rendering method selection process has been performed for all the audio tracks (step S112). If the selection process from step S113 onward has been completed for all the audio tracks (YES in step S112), the rendering method selection process ends (step S118). If there is an unprocessed audio track (NO in step S112), the sounding object position information corresponding to that track is looked up in the acquired track information 201 (FIG. 2), and, as in step S103 described above, it is determined whether the sound image position recorded as part of that sounding object position information falls within the renderable range of rendering method A (step S113).
If the result of the determination in step S113 is that the sound image position falls within the renderable range of rendering method A (YES in step S113), the process proceeds to step S114. In step S114, an instruction signal for rendering the audio signal of the unprocessed audio track using rendering method A is output to the rendering unit 103.
If, on the other hand, the sound image position does not fall within the renderable range of rendering method A (NO in step S113), the process proceeds to step S115.
In step S115, it is determined whether the sound image position falls within the renderable range of rendering method B.
If the result of the determination in step S115 is that the sound image position falls within the renderable range of rendering method B (YES in step S115), the process proceeds to step S116. If it does not (NO in step S115), the process proceeds to step S117. That is, the process proceeds to step S117 when the sound image position is included in neither the renderable range of rendering method A nor that of rendering method B.
In step S116, an instruction signal for rendering the audio signal of the unprocessed audio track using rendering method B is output to the rendering unit 103.
In step S117, on the other hand, an instruction not to render the audio signal of the unprocessed audio track is issued. This instruction signal is output to the rendering unit 103.
As described above, the first embodiment has been explained with two selectable rendering methods, but it goes without saying that the selection may be made from three or more rendering methods.
In the above description, the signal generated by the rendering switching instruction signal calculation unit 10202 is described as instructing a switch of rendering method. The expression "instructing a switch" here covers not only instructing a change of rendering method from A to B or from B to A, but also instructing that rendering method A continue to be used for the track following a track that used rendering method A (and likewise for method B).
In the flow shown in FIG. 7, no sound at all is output for a track that falls within neither the renderable range of rendering method A nor that of rendering method B. By making rendering method B a method with a wide renderable range, for example a downmix to a monaural signal, the occurrence of tracks from which no sound is output can be avoided for all practical purposes.
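A minimal sketch of the FIG. 7 variant follows, assuming (as the preceding paragraph suggests) that method B is a monaural downmix whose renderable range covers every position; the function names reuse the illustrative helpers above and are not part of the disclosure.

```python
def in_method_b_range(position) -> bool:
    """Method B as a monaural downmix: every position is renderable,
    so in practice no track is left without sound (see the note above)."""
    return True

def select_rendering_method_fig7(tracks, vbap_speaker_azimuths):
    """FIG. 7 flow: try method A (S113), then method B (S115);
    if neither range contains the position, do not render (S117)."""
    instructions = {}
    for track in tracks:
        _, position = track.positions[0]
        if in_method_a_range(position, vbap_speaker_azimuths):  # S113 YES
            instructions[track.track_id] = "A"                  # S114
        elif in_method_b_range(position):                       # S115 YES
            instructions[track.track_id] = "B"                  # S116
        else:
            instructions[track.track_id] = None                 # S117: no rendering
    return instructions
```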
[Rendering unit 103]

The rendering unit 103 constructs the audio signals to be output from the audio output unit 20, based on the input audio signal and on the instruction signal output from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102.
Specifically, the rendering unit 103 receives the audio signal contained in the content, renders it by the rendering method indicated by the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102, mixes the results, and then outputs them to the audio output unit 20.
In other words, the rendering unit 103 drives two rendering algorithms simultaneously and, based on the instruction signal output from the rendering switching instruction signal calculation unit 10202, switches the rendering algorithm used to render each audio signal.
Here, rendering refers to the process of converting an audio signal contained in the content (input audio signal) into the signals to be output from the audio output unit 20.
The operation of the rendering unit 103 will now be described using the flow shown in FIG. 8.

FIG. 8 is a flowchart showing the operation of the rendering unit 103.
Upon receiving the input audio signal and the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102, the rendering unit 103 starts the rendering process (step S201).
First, it is checked whether the rendering process has been performed for all the audio tracks (step S202). If the rendering process from step S203 onward has been completed for all the audio tracks (YES in step S202), the rendering process ends (step S208). If there is an unprocessed audio track (NO in step S202), it is rendered using the rendering method indicated by the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102. When the instruction signal indicates rendering method A (rendering method A in step S203), the parameters necessary for rendering the audio signal with rendering method A are read from the storage unit 104 (step S204), and rendering is performed based on them (step S205). Similarly, when the instruction signal indicates rendering method B (rendering method B in step S203), the parameters necessary for rendering the audio signal with rendering method B are read from the storage unit 104 (step S206), and rendering is performed based on them (step S207). When, following the flow of FIG. 7, the instruction signal indicates no rendering (no rendering in step S203), the track is not rendered and is not included in the output audio.
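The FIG. 8 dispatch can be sketched as follows; the renderer interfaces and the parameter-store lookup are illustrative assumptions standing in for concrete VBAP and transaural implementations, and per-speaker signals are reduced to single values for brevity.

```python
def render_all(tracks, instructions, storage, renderers, num_speakers):
    """FIG. 8 flow: per track, pick the renderer named by the instruction
    signal, read its parameters from storage, render, and mix the results."""
    mix = [0.0] * num_speakers                     # one output bus per speaker
    for track in tracks:                           # loop of step S202
        method = instructions.get(track.track_id)  # "A", "B", or None
        if method is None:                         # FIG. 7 case: no rendering
            continue
        params = storage[method]                   # steps S204 / S206
        speaker_signals = renderers[method](track, params)  # steps S205 / S207
        for i, s in enumerate(speaker_signals):
            mix[i] += s                            # mixing before output
    return mix
```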
[Storage unit 104]

The storage unit 104 is a secondary storage device for recording various data used by the rendering switching signal generation unit 102 and the rendering unit 103. The storage unit 104 is implemented by, for example, a magnetic disk, an optical disc, or a flash memory; more specific examples include an HDD, an SSD (Solid State Drive), an SD memory card, a BD, and a DVD. The rendering switching signal generation unit 102 and the rendering unit 103 read data from the storage unit 104 as necessary. Various parameter data, including coefficients calculated by the rendering switching signal generation unit 102, can also be recorded in the storage unit 104.
<Audio output unit 20>

The audio output unit 20 outputs the audio obtained by the rendering unit 103. The audio output unit 20 consists of a plurality of independent speakers, and each speaker consists of a speaker unit and an amplifier that drives it.
That is, the environment information acquisition unit 10201 acquires the position information of each speaker constituting the audio output unit 20, and the rendering switching instruction signal calculation unit 10202 selects a rendering method based on the plurality of pieces of position information acquired by the environment information acquisition unit 10201.
As described above, according to the first embodiment, a suitable rendering method that takes sound image localization into account is automatically determined in accordance with the arrangement of the speakers placed by the user and the information obtained from the content, and audio reproduction is performed accordingly, making it possible to deliver sound with a good sense of localization to the user.
In the first embodiment, content containing a plurality of audio tracks is the reproduction target, but the present disclosure is not limited to this; content containing a single audio track may also be the reproduction target. In that case, a rendering method suitable for that one audio track is selected from the plurality of rendering methods.
(Rendering methods)

In the first embodiment, VBAP, transaural, and downmixing to a monaural signal were given as rendering methods, but the present disclosure is not limited to these rendering methods.
For example, a rendering method similar to VBAP may be adopted, in which the audio signal is output from the audio output units at sound pressure ratios corresponding to the sound image position (reproduction position). A rendering method similar to transaural may also be adopted, in which an audio signal processed according to the sound image position (reproduction position) is output from each audio output unit. When the sound image position falls within the range defined by the placement positions of the plurality of audio output units, adopting a rendering method that outputs from each audio output unit at sound pressure ratios corresponding to the sound image position realizes an audio environment that emphasizes sound quality. On the other hand, adopting a rendering method in which the signal is processed according to the sound image position (reproduction position), as in transaural, makes it possible to localize the sound image without being constrained by the placement of the audio output units.
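As an illustration of the first kind of method, the following is a minimal sketch of constant-power sound pressure panning between two speakers; the specific gain law is a common textbook choice assumed here for illustration, not one prescribed by the disclosure.

```python
import math

def panning_gains(theta: float, theta_left: float, theta_right: float):
    """Constant-power panning: distribute one source between two speakers
    at azimuths theta_left and theta_right according to the sound image
    azimuth theta, so that g_left**2 + g_right**2 == 1."""
    p = (theta - theta_left) / (theta_right - theta_left)  # 0 at left, 1 at right
    p = min(max(p, 0.0), 1.0)                              # clamp to the speaker pair
    return math.cos(p * math.pi / 2), math.sin(p * math.pi / 2)

# A sound image halfway between speakers at -30 and +30 degrees:
# both speakers receive equal gain (about 0.707).
print(panning_gains(0.0, -30.0, 30.0))
```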
The present disclosure can also adopt, for example, a downmix to a stereo signal as one of the rendering methods.
[Embodiment 2]

Another embodiment of the present disclosure will be described below with reference to FIGS. 9 to 12. For convenience of description, members having the same functions as those described in the first embodiment are given the same reference numerals, and their description is omitted.
FIG. 9 is a block diagram showing the main configuration of an audio signal processing system 1a according to the second embodiment of the present disclosure. The audio signal processing system 1a according to the second embodiment differs from the audio signal processing system 1 of the first embodiment only in the behavior of the rendering switching signal generation unit; the other processing units are identical, so unless otherwise described below, the remaining configuration is as described in the first embodiment.
The audio signal processing unit 10a of the audio signal processing system 1a of the second embodiment includes a rendering switching signal generation unit 102a (acquisition unit) in place of the rendering switching signal generation unit 102 of the audio signal processing unit 10 described in the first embodiment.
In addition to the track information and environment information (speaker position information) acquired by the rendering switching signal generation unit 102 of the first embodiment, the rendering switching signal generation unit 102a further acquires viewing position information indicating the user's viewing position. The rendering switching signal generation unit 102a then selects one rendering method from the plurality of rendering methods based on the track information, the position information, and the viewing position information. This is described in detail below. In the second embodiment as well, for convenience of description, the selection is made as appropriate from two rendering methods.
[Rendering switching signal generation unit 102a]

As will be described in detail later, the rendering switching signal generation unit 102a generates a rendering method switching instruction signal based on information about the viewing environment, the track information 201 (FIG. 2) obtained by the content analysis unit 101, and the viewing position information. The details of the rendering switching signal generation unit 102a will be described with reference to FIG. 10.
FIG. 10 is a block diagram showing the configuration of the rendering switching signal generation unit 102a. As shown in FIG. 10, the rendering switching signal generation unit 102a includes an environment information acquisition unit 10201a and a rendering switching instruction signal calculation unit 10202a.
[Environment information acquisition unit 10201a]

The environment information acquisition unit 10201a is configured to acquire information on the environment in which the user views the content (hereinafter referred to as environment information). The environment information in the second embodiment consists of the number, positions, and types of the speakers connected to the system as the audio output unit 20 described in the first embodiment, with the addition of information indicating the user's viewing position (viewing environment information).
In the second embodiment, the viewing environment information is acquired and updated in real time: a camera (not shown) installed at an arbitrary position in the viewing environment and connected to the environment information acquisition unit 10201a captures the user and the speakers (audio output unit 20), to which markers have been attached in advance, acquires their three-dimensional positions, and updates the viewing environment information accordingly.
As another means of acquiring the user position, face recognition may be applied to the information obtained from the installed camera.
Alternatively, the user and each speaker may carry a position information transmitting device whose position information is acquired, or the information may be input in real time through an information input terminal (not shown) such as a tablet terminal.
[Rendering switching instruction signal calculation unit 10202a]

Based on the environment information obtained from the environment information acquisition unit 10201a and the sounding object position information in the track information 201 (FIG. 2) obtained by the content analysis unit 101, the rendering switching instruction signal calculation unit 10202a determines, for each audio track, with which of the plurality of rendering methods its audio signal is to be rendered, and outputs that information to the rendering unit 103.
The operation flow of the rendering switching instruction signal calculation unit 10202a will now be described with reference to FIG. 11.
As shown in FIG. 11, upon receiving the environment information and the track information 201 (FIG. 2) described above, the rendering switching instruction signal calculation unit 10202a starts the rendering method selection process (step S301).
It then checks whether the rendering method selection process has been performed for all the audio tracks (step S302). If the selection process from step S303 onward has been completed for all the audio tracks (YES in step S302), the rendering method selection process ends (step S310). If there is an audio track that has not yet been processed (NO in step S302), the process proceeds to step S303.
In step S303, the sounding object position information corresponding to the unprocessed audio track is looked up in the acquired track information 201 (FIG. 2). If the sound image position recorded as part of that sounding object position information falls within the renderable range of rendering method A (YES in step S303) and, based on the viewing position information, the user's current position is within the effective viewing range of rendering method A (YES in step S304), an instruction signal for rendering the audio signal of that audio track with rendering method A is output (step S305).
On the other hand, when the sound image position recorded as part of the sounding object position information does not fall within the renderable range of rendering method A (NO in step S303), or when, based on the viewing position information, the user's current position is outside the effective viewing range of rendering method A (NO in step S304), the process proceeds to step S306 to check whether rendering by rendering method B is possible.
If the sound image position recorded as part of the sounding object position information falls within the renderable range of rendering method B (YES in step S306) and, based on the viewing position information, the user's current position is within the effective viewing range of rendering method B (YES in step S307), an instruction signal for rendering the audio signal of that audio track with rendering method B is output (step S308). On the other hand, when the sound image position does not fall within the renderable range of rendering method B (NO in step S306), or when the user's current position is outside the effective viewing range of rendering method B (NO in step S307), an instruction not to render the audio signal of that audio track is issued (step S310).
Here, as described in the first embodiment, the renderable range indicates the range in which a sound source can be placed under a particular rendering method. The effective viewing range is the recommended viewing area within which the effect of each rendering method can be enjoyed (for example, as shown in FIG. 12, the effective viewing range of rendering method A is represented as 1202 and that of rendering method B as 1203); the range recorded in advance in the storage unit 104 for each rendering method is read out as appropriate.
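A minimal sketch of the FIG. 11 selection follows, extending the earlier illustrative helpers with an effective-viewing-range test; modeling each effective viewing range as a circle around a recommended listening point is an assumption made for illustration (FIG. 12 only shows the ranges schematically).

```python
def in_effective_viewing_range(listener_xy, center_xy, radius: float) -> bool:
    """Assumed model: the effective viewing range of a method is a circle
    around a recommended listening point (cf. 1202/1203 in FIG. 12)."""
    dx = listener_xy[0] - center_xy[0]
    dy = listener_xy[1] - center_xy[1]
    return dx * dx + dy * dy <= radius * radius

def select_method_fig11(track, listener_xy, viewing_areas, vbap_azimuths):
    """FIG. 11 flow for one track: method A needs S303 and S304 to hold,
    method B needs S306 and S307; otherwise the track is not rendered.
    viewing_areas maps a method name to its (center_xy, radius) pair."""
    _, position = track.positions[0]
    if in_method_a_range(position, vbap_azimuths) and \
       in_effective_viewing_range(listener_xy, *viewing_areas["A"]):
        return "A"                                   # steps S303-S305
    if in_method_b_range(position) and \
       in_effective_viewing_range(listener_xy, *viewing_areas["B"]):
        return "B"                                   # steps S306-S308
    return None                                      # no rendering (step S310)
```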
As described above, according to the second embodiment, a suitable rendering method that takes sound image localization into account is determined in accordance with the placement positions of the speakers arranged by the user, the information obtained from the content, and the user's viewing position information, and audio reproduction is performed accordingly, making it possible to deliver sound with a good sense of localization to the user.
[Embodiment 3]

As another embodiment of the present disclosure, another aspect of the operation of the rendering switching instruction signal calculation unit 10202 shown in FIG. 5 of the first embodiment described above will be explained below. For convenience of description, members having the same functions as those described in the first embodiment are given the same reference numerals, and their description is omitted.
In the first embodiment described above, rendering method A is VBAP and rendering method B is transaural; in the third embodiment, rendering method A is transaural and rendering method B is VBAP.
In the third embodiment as well, following the operation flow shown in FIG. 5, it is first determined whether the sound image position is within the renderable range of rendering method A, which is transaural. As described above, transaural can localize a sound image without being limited to the range of the speaker placement positions, whereas with VBAP the sound image position depends on the speaker placement positions. Therefore, in the aspect of the first embodiment, in which it is first determined whether an audio track can be processed by VBAP and the track is processed transaurally when it cannot, the rendering method changes within the content, which may give the user a sense of incongruity.
In the third embodiment, therefore, it is first determined whether processing is possible with transaural (rendering method A), which does not depend on the speaker placement positions. As a result, rendering by the method that can cover a wide range of sound image positions accounts for most of the content, making the sense of incongruity described above less likely.
On the other hand, compared with transaural, VBAP localizes the sound image within the range of the speaker placement positions and therefore offers better sound quality. The aspect of the first embodiment, which first determines whether processing with VBAP is possible, can thus be said to emphasize sound quality.
[Summary]

An audio signal processing device (audio signal processing unit 10, 10a) according to aspect 1 of the present disclosure is an audio signal processing device that renders the audio signals of one or more audio tracks and outputs them to a plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)), and includes: a reproduction position specifying unit (content analysis unit 101) that specifies the reproduction position of the audio signal of an audio track based on that audio track or on information accompanying it; a position information acquisition unit (rendering switching signal generation unit 102, 102a) that acquires position information of each of the audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)); and a processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) that selects one rendering method from a plurality of rendering methods based on the reproduction position and the position information and renders, using that one rendering method, the audio signal of the audio track corresponding to that reproduction position.
According to the above configuration, audio rendered by a rendering method suitable for the user's viewing situation can be presented to the user.
Specifically, according to the above configuration, a suitable rendering method is selected from the plurality of rendering methods based on the position of each audio output device and the reproduction position (sound image position) of the audio signal of the audio track. As a result, an input audio signal containing a plurality of audio tracks is rendered track by track with a suitable method, and an input audio signal containing a single audio track is rendered with the rendering method suited to that track.
Therefore, sound image localization can be suitably reproduced, providing an environment in which multi-channel audio can be heard satisfactorily.
In an audio signal processing device (audio signal processing unit 10a) according to aspect 2 of the present disclosure, in aspect 1, the position information acquisition unit (rendering switching signal generation unit 102a) may further acquire viewing position information indicating the user's viewing position, and the processing unit (rendering switching signal generation unit 102a, rendering unit 103) may select the one rendering method from the plurality of rendering methods based on the reproduction position, the position information, and the viewing position information, and render, using that one rendering method, the audio signal of the audio track corresponding to that reproduction position.
According to the above configuration, the rendering method can be selected in consideration of the user's viewing position information, and sound image localization can be reproduced more suitably.
In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 3 of the present disclosure, in aspect 1 or 2, the reproduction position specifying unit (content analysis unit 101) may analyze the audio track or the information accompanying it and generate track information indicating the reproduction position.
According to the above configuration, even when the input audio signal or audio track does not contain information corresponding to the track information, the reproduction position specifying unit can analyze the audio track or its accompanying information and generate the track information.
An audio signal processing device (audio signal processing unit 10, 10a) according to aspect 4 of the present disclosure may, in any of aspects 1 to 3, further include a storage unit (104) that stores parameters required by the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a).
In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 5 of the present disclosure, in any of aspects 1 to 4, the plurality of rendering methods may include a first rendering method that outputs the audio signal from each of the audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)) at sound pressure ratios corresponding to the reproduction position, and a second rendering method that outputs, from each of the audio output devices, the audio signal processed according to the reproduction position.
In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 6 of the present disclosure, in aspect 5, the first rendering method may be VBAP and the second rendering method may be transaural.
In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 7 of the present disclosure, in any of aspects 1 to 6, the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) may determine whether the reproduction position falls within a range defined by the placement positions of the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)) and select the one rendering method according to the result of that determination.
According to the above configuration, for example, when the reproduction position falls within the range defined by the placement positions of the plurality of audio output devices, rendering using a method that emphasizes sound quality, such as VBAP, is possible.
In an audio signal processing device (audio signal processing unit 10, 10a) according to aspect 8 of the present disclosure, in aspect 2, the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) may specify the effective viewing range of each rendering method, determine whether the reproduction position falls within the range defined by the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)) and whether the user's viewing position indicated by the viewing position information falls within the effective viewing range, and select the one rendering method according to the results of those determinations.
With the above configuration, taking the user's viewing position into account makes it possible to deliver sound with a good sense of localization to the user.
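Extending the previous sketch, the listener check might look as follows. Modeling the effective viewing range as a circle around a sweet spot is purely an assumption made for this example (the shape of the range is not specified here), and inside_convex_polygon is reused from the sketch above.

    import math

    def select_renderer_with_listener(reproduction_pos, speaker_positions,
                                      listener_pos, sweet_spot, sweet_radius):
        in_span = inside_convex_polygon(reproduction_pos, speaker_positions)
        in_sweet_spot = math.dist(listener_pos, sweet_spot) <= sweet_radius
        if in_span:
            return "VBAP"          # sound-quality-oriented panning suffices
        if in_sweet_spot:
            return "transaural"    # listener is where the transaural effect holds
        # Outside both: fall back to panning rather than risk the sound image
        # localizing at an unintended position for an off-axis listener.
        return "VBAP"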
The audio signal processing system (audio signal processing systems 1 and 1a) according to aspect 9 of the present disclosure includes the audio signal processing device (audio signal processing units 10 and 10a) according to any of aspects 1 to 8, and the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, and 605)).
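As a rough end-to-end sketch of such a system, the wiring below routes each track through position analysis and method selection. The class and key names are hypothetical placeholders, and select_renderer comes from the earlier sketch; this illustrates the data flow only, not the actual units of the disclosure.

    class AudioSignalProcessingSketch:
        def __init__(self, speaker_positions, stored_params=None):
            self.speaker_positions = speaker_positions  # speaker position information
            self.params = stored_params or {}           # stand-in for storage unit (104)

        def process(self, tracks):
            decisions = []
            for track in tracks:
                pos = track["position"]  # stand-in for per-track information (201)
                method = select_renderer(pos, self.speaker_positions)
                decisions.append((track["name"], method))
            return decisions             # a rendering stage would consume these

    system = AudioSignalProcessingSketch([(-1, -1), (1, -1), (1, 1), (-1, 1)])
    print(system.process([{"name": "dialog",   "position": (0.0, 0.0)},
                          {"name": "ambience", "position": (3.0, 0.0)}]))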
The present disclosure is not limited to the embodiments described above; various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present disclosure. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.
[Cross-reference to related applications]
This application claims the benefit of priority from Japanese Patent Application No. 2017-028396 filed in Japan on February 17, 2017, the entire contents of which are incorporated herein by reference.
1, 1a Audio signal processing system
10, 10a Audio signal processing unit (audio signal processing device)
20 Audio output unit (plurality of audio output devices)
101 Content analysis unit (reproduction position specifying unit)
102, 102a Rendering switching signal generation unit (position information acquisition unit, processing unit)
103 Rendering unit (processing unit)
104 Storage unit
201 Track information
602, 603, 604, 605 Speaker (audio output device)
606, 607, 608, 609 Sound image position (reproduction position)
10201, 10201a Environment information acquisition unit (position information acquisition unit)
10202, 10202a Rendering switching instruction signal calculation unit (processing unit)

Claims (7)

  1.  An audio signal processing device that renders audio signals of one or more audio tracks and outputs them to a plurality of audio output devices, the device comprising:
      a reproduction position specifying unit that specifies a reproduction position of the audio signal of an audio track on the basis of that audio track or of information accompanying that audio track;
      a position information acquisition unit that acquires position information of each of the audio output devices; and
      a processing unit that selects one rendering method from among a plurality of rendering methods on the basis of the reproduction position and the position information, and renders, using the one rendering method, the audio signal of the audio track corresponding to the reproduction position.
  2.  The audio signal processing device according to claim 1, wherein
      the position information acquisition unit further acquires viewing position information indicating a viewing position of a user, and
      the processing unit selects the one rendering method from among the plurality of rendering methods on the basis of the reproduction position, the position information, and the viewing position information, and renders, using the one rendering method, the audio signal of the audio track corresponding to the reproduction position.
  3.  The audio signal processing device according to claim 1 or 2, wherein the reproduction position specifying unit analyzes the audio track or the information accompanying the audio track and generates track information indicating the reproduction position.
  4.  The audio signal processing device according to any one of claims 1 to 3, further comprising a storage unit that stores parameters required by the processing unit.
  5.  The audio signal processing device according to any one of claims 1 to 4, wherein the plurality of rendering methods include a first rendering method that causes each of the audio output devices to output the audio signal at a sound pressure ratio corresponding to the reproduction position, and a second rendering method that causes each of the audio output devices to output the audio signal processed in accordance with the reproduction position.
  6.  The audio signal processing device according to claim 5, wherein
      the first rendering method is VBAP, and
      the second rendering method is transaural reproduction.
  7.  An audio signal processing system comprising:
      the audio signal processing device according to any one of claims 1 to 6; and
      the plurality of audio output devices.
PCT/JP2018/000736 2017-02-17 2018-01-15 Voice signal processing device and voice signal processing system WO2018150774A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-028396 2017-02-17
JP2017028396 2017-02-17

Publications (1)

Publication Number Publication Date
WO2018150774A1 true WO2018150774A1 (en) 2018-08-23

Family

ID=63170536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/000736 WO2018150774A1 (en) 2017-02-17 2018-01-15 Voice signal processing device and voice signal processing system

Country Status (1)

Country Link
WO (1) WO2018150774A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016165117A (en) * 2011-07-01 2016-09-08 Dolby Laboratories Licensing Corporation Audio signal processing system and method
JP2016525813A (en) * 2014-01-02 2016-08-25 Koninklijke Philips N.V. Audio apparatus and method therefor

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7470695B2 (en) 2019-01-08 2024-04-18 Telefonaktiebolaget LM Ericsson (publ) Efficient spatially heterogeneous audio elements for virtual reality
US11968520B2 (en) 2019-01-08 2024-04-23 Telefonaktiebolaget Lm Ericsson (Publ) Efficient spatially-heterogeneous audio elements for virtual reality
WO2020227140A1 (en) * 2019-05-03 2020-11-12 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers
CN113767650A (en) * 2019-05-03 2021-12-07 杜比实验室特许公司 Rendering audio objects using multiple types of renderers
JP2022530505A (en) 2019-05-03 2022-06-29 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers
JP7157885B2 (en) 2019-05-03 2022-10-20 Dolby Laboratories Licensing Corporation Rendering audio objects using multiple types of renderers
CN113767650B (en) * 2019-05-03 2023-07-28 杜比实验室特许公司 Rendering audio objects using multiple types of renderers
EP4236378A3 (en) * 2019-05-03 2023-09-13 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers
JP7443453B2 (en) 2019-05-03 2024-03-05 Dolby Laboratories Licensing Corporation Rendering audio objects using multiple types of renderers
US11943600B2 (en) 2019-05-03 2024-03-26 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers

Similar Documents

Publication Publication Date Title
Rumsey Spatial audio
US9299353B2 (en) Method and apparatus for three-dimensional acoustic field encoding and optimal reconstruction
RU2617553C2 (en) System and method for generating, coding and presenting adaptive sound signal data
KR100739723B1 (en) Method and apparatus for audio reproduction supporting audio thumbnail function
KR101381396B1 (en) Multiple viewer video and 3d stereophonic sound player system including stereophonic sound controller and method thereof
JP2016518067A (en) How to manage the reverberation field of immersive audio
AU2008295723A1 (en) A method and an apparatus of decoding an audio signal
JP6868093B2 (en) Audio signal processing device and audio signal processing system
US20200280815A1 (en) Audio signal processing device and audio signal processing system
JP6663490B2 (en) Speaker system, audio signal rendering device and program
JPWO2017110882A1 (en) Speaker placement position presentation device
JP5338053B2 (en) Wavefront synthesis signal conversion apparatus and wavefront synthesis signal conversion method
WO2018150774A1 (en) Voice signal processing device and voice signal processing system
CN114915874B (en) Audio processing method, device, equipment and medium
Floros et al. Spatial enhancement for immersive stereo audio applications
CN109391896B (en) Sound effect generation method and device
KR20070081735A (en) Apparatus for encoding and decoding audio signal and method thereof
JP5743003B2 (en) Wavefront synthesis signal conversion apparatus and wavefront synthesis signal conversion method
Ando Preface to the Special Issue on High-reality Audio: From High-fidelity Audio to High-reality Audio
JP5590169B2 (en) Wavefront synthesis signal conversion apparatus and wavefront synthesis signal conversion method
RU2779295C2 (en) Processing of monophonic signal in 3d-audio decoder, providing binaural information material
Brandenburg et al. Audio Codecs: Listening pleasure from the digital world
JP2007180662A (en) Video audio reproducing apparatus, method, and program
JP2008147840A (en) Voice signal generating device, sound field reproducing device, voice signal generating method, and computer program
JP2006279555A (en) Signal regeneration apparatus and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18754453

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18754453

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP