WO2018150774A1

WO2018150774A1 - Voice signal processing device and voice signal processing system

Info

Publication number: WO2018150774A1
Application number: PCT/JP2018/000736
Authority: WO
Inventors: 健明末永; 永雄服部
Original assignee: シャープ株式会社
Priority date: 2017-02-17
Filing date: 2018-01-15
Publication date: 2018-08-23

Abstract

The present invention addresses the problem of presenting a user with a voice rendered by a rendering scheme that is preferable under a viewing situation of the user. A voice signal processing system (1) according to an embodiment of the present disclosure is provided with a voice signal processing unit (10) which selects one rendering scheme from a plurality of rendering schemes on the basis of position information of a voice output device and track information indicating a play back position for an input voice signal, and renders the input voice signal using the one rendering scheme.

Description

Audio signal processing apparatus and audio signal processing system

The present disclosure relates to an audio signal processing device and an audio signal processing system.

Currently, users can easily access content including multi-channel audio (surround audio) via broadcast waves, disc media such as DVD (Digital Versatile Disc) and BD (Blu-ray (registered trademark) Disc), and the Internet. Became available. In movie theaters and the like, many 3D sound systems using object-based audio represented by Dolby Atmos are deployed, and in Japan, 22.2ch audio is adopted as the next-generation broadcasting standard, and users touch multi-channel content. Opportunities have increased significantly. Various techniques for making multi-channels have been studied for conventional stereo audio signals, and a technique for making multi-channels based on the correlation between each channel of stereo signals is disclosed in Patent Document 1.

As for a system for reproducing multi-channel audio, a system that can be easily enjoyed at home is becoming common other than the facilities such as the above-mentioned movie theaters and halls where large sound equipment is arranged. Specifically, the user (listener) arranges a plurality of speakers on the basis of an arrangement standard recommended by the International Telecommunication Union (ITU), so that a multichannel such as 5.1ch or 7.1ch can be used. An environment for listening to channel sound can be established in the home. Also, a technique for reproducing multi-channel sound image localization using a small number of speakers has been studied (Non-patent Document 1).

Japanese Patent Publication “JP 2013-055439 A” (published March 21, 2013) Japanese Patent Publication “Japanese Patent Laid-Open No. 11-113098” (April 23, 1999)

Vector Base Amplitude Panning (VBAP) and sound pressure panning shown in Non-Patent Document 1 are, for example, a group of three

speakers

1302, 1303, and 1304 as shown in (a) of FIG. The sound pressure is controlled based on the positional relationship between the pair of

speakers

1306 and 1307 as shown in b) and the

sound image

1301 or 1305 to be reproduced, and at any position within the range surrounded by the pair of speakers. This technology reproduces sound images. Since the technique can reproduce a sound image within a range surrounded by a set of speakers even when there are a plurality of sound images, a multi-channel audio (for example, 22.2 ch or 5.1 ch) signal is reduced in a speaker. Can be reproduced with numbers.

However, as described above, VBAP and sound pressure panning can reproduce a sound image only within a range surrounded by a set of speakers. Therefore, if the speaker cannot be installed in an area where the speaker cannot be installed, for example, a position close to the ceiling surface, in the user's viewing environment, a sound image in the height direction cannot be reproduced.

On the other hand, if the transoral technique shown in Non-Patent Document 2 or Patent Document 2 is used, three-dimensional sound image control can be performed using at least two speakers. Therefore, for example, there is an advantage that sound image localization at an arbitrary position around the user can be reproduced only by using two speakers installed in front of the user. However, since this technique is a technique that assumes a specific listening area in principle and obtains a sound effect in that area, if the listener is removed from the listening area, the sound image is located at an unexpected position. It may happen that the camera is localized, or the localization is not felt in the first place.

One embodiment of the present disclosure is to realize an audio signal processing device capable of presenting audio rendered by a suitable rendering method to a user in a viewing state, and an audio signal processing system including the device. Objective.

In order to solve the above problem, an audio signal processing device according to an aspect of the present disclosure is an audio signal processing device that renders audio signals of one or more audio tracks and outputs the audio signals to a plurality of audio output devices. Based on a certain audio track or information associated with the audio track, a reproduction position specifying unit for specifying a reproduction position of the audio signal of the audio track, and position information of each audio output device A position information acquisition unit to be acquired, and one rendering method is selected from a plurality of rendering methods based on the reproduction position and the position information, and the sound corresponding to the reproduction position is selected using the one rendering method. And a processing unit for rendering the audio signal of the track. In general, an audio track may include a plurality of audio channels. However, in the present disclosure, it is assumed that one audio channel is included for each audio track for easy understanding. .

In order to solve the above problem, an audio signal processing system according to an aspect of the present disclosure includes the audio signal processing device having the above-described configuration and the plurality of audio output devices. Yes.

According to one aspect of the present disclosure, it is possible to present a sound rendered by a suitable rendering method to the user under the viewing situation.

1 is a block diagram illustrating a main configuration of an audio signal processing system according to Embodiment 1 of the present disclosure. It is a figure showing an example of track information used with an audio signal processing system concerning Embodiment 1 of this indication. It is a figure which shows the coordinate system used for description of this indication. FIG. 3 is a block diagram illustrating a main configuration of a rendering switching signal generation unit according to Embodiment 1 of the present disclosure. FIG. 6 is a diagram illustrating a processing flow of a rendering switching signal generation unit according to the first embodiment of the present disclosure. It is the figure which showed the relationship between the arrangement position of a speaker, and a sound image position. It is a figure showing a processing flow in another form of a rendering change signal generation part concerning Embodiment 1 of this indication. FIG. 6 is a diagram illustrating a processing flow of a rendering unit according to the first embodiment of the present disclosure. It is a block diagram which shows the principal part structure of the audio | voice signal processing system which concerns on Embodiment 2 of this indication. It is a block diagram which shows the principal part structure of the rendering switching signal generation part which concerns on Embodiment 2 of this indication. It is a figure showing a processing flow of a rendering change signal generation part concerning Embodiment 2 of this indication. It is the schematic diagram which showed the viewing-and-listening effective range for every rendering system. It is a schematic diagram explaining a VBAP system and a sound pressure panning system.

Embodiment 1
Hereinafter, an embodiment of the present disclosure will be described with reference to FIGS. 1 to 8.

FIG. 1 is a block diagram showing the main configuration of the audio signal processing system 1 according to the first embodiment. The audio signal processing system 1 according to the first embodiment includes an audio signal processing unit 10 (audio signal processing device) and an audio output unit 20 (a plurality of audio output devices).

<Audio signal processing unit 10>
The audio signal processing unit 10 is an audio signal processing apparatus that renders audio signals of one or a plurality of audio tracks using two different rendering methods. The audio signal after the rendering process is output from the audio signal processing unit 10 to the audio output unit 20.

The audio signal processing unit 10 includes a content analysis unit 101 (reproduction position specifying unit) that specifies the sound image position (reproduction position) of the audio signal of the audio track based on the input audio signal or on information accompanying the input audio signal, A rendering switching signal generation unit 102 (position information acquisition unit, processing unit) that acquires position information of the audio output unit 20, and one selected from a plurality of rendering methods based on a sound image position (playback position) and position information A rendering unit 103 (processing unit) that renders an audio signal of an audio track corresponding to the sound image position using a rendering method;

Further, the audio signal processing unit 10 includes a storage unit 104 as shown in FIG. The storage unit 104 stores various parameters required by the rendering switching signal generation unit 102 and the rendering unit 103 or generated various parameters.

Hereinafter, each configuration will be described in detail.

[Content Analysis Unit 101]
The content analysis unit 101 analyzes an audio track included in video content or audio content recorded on a disc medium such as a DVD or a BD, an HDD (Hard Disc Drive), or any metadata (information) associated therewith. Then, the pronunciation object position information is obtained. The pronunciation object position information is sent from the content analysis unit 101 to the rendering switching signal generation unit 102 and the rendering unit 103.

In the first embodiment, it is assumed that the audio content received by the content analysis unit 101 is an audio content including two or more audio tracks. Further, this audio track may be a “channel-based” audio track employed in stereo (2ch), 5.1ch, etc., and each sound generation object unit is set as one track, and this position / volume It may be an “object-based” audio track to which accompanying information (metadata) describing a change in the environment is added.

Explain the concept of “object-based” audio tracks. The audio track based on the object base is recorded on each track for each sounding object, that is, recorded without mixing, and these sounding objects are appropriately rendered on the player (playing device) side. Although there is a difference in each standard and format, in general, each of these pronunciation objects is associated with metadata such as when, where, and at what volume the player should pronounce. Render individual pronunciation objects based on

On the other hand, the “channel-based” audio track is employed in conventional surround sound (for example, 5.1ch surround), and is presupposed to be sounded from a predetermined playback position (speaker placement position). This is a track recorded in a state where individual sound generation objects are mixed.

(Pronunciation object position information)
Here, the pronunciation object position information will be described with reference to FIG.

FIG. 2 conceptually shows the configuration of the track information 201 obtained by analysis by the content analysis unit 101.

The content analysis unit 101 analyzes all the audio tracks included in the content and reconstructs the track information 201 shown in FIG. In the track information 201, the ID of each audio track and the type of the audio track are recorded.

Here, when the audio track is an object-based track, one or more pronunciation object position information is attached as metadata. The pronunciation object position information is composed of a pair of a reproduction time and a sound image position (reproduction position) at the reproduction time.

On the other hand, when the audio track is a channel-based track, a pair of a playback time and a sound image position (playback position) at the playback time is recorded. Is from the start to the end of the content, and the sound image position at the playback time is based on the playback position defined in advance on the channel base.

Here, it is assumed that the sound image position (playback position) recorded as a part of the pronunciation object position information is expressed in the coordinate system shown in FIG. Further, it is assumed that the track information 201 is described in a markup language such as XML (Extensible Markup Language).

[Rendering switching signal generator 102]
Although the details will be described later, the rendering switching signal generation unit 102 generates a rendering method switching instruction signal based on information related to the viewing environment and the track information 201 (FIG. 2) obtained by the content analysis unit 101. Details of the rendering switching signal generation unit 102 will be described with reference to FIG.

FIG. 4 is a block diagram illustrating a configuration of the rendering switching signal generation unit 102. As illustrated in FIG. 4, the rendering switching signal generation unit 102 includes an environment information acquisition unit 10201 (position information acquisition unit) and a rendering switching instruction signal calculation unit 10202 (processing unit).

[Environmental information acquisition unit 10201]
The environment information acquisition unit 10201 is configured to acquire information on the environment in which the user views the content (hereinafter referred to as environment information).

Here, in the first embodiment, the environment information is assumed to be the number of speakers connected to the audio signal processing unit 10 as the audio output unit 20, the position of the speaker, and the type of the speaker. The speaker type is information indicating which of a plurality of rendering methods used in this system can be used. As described in the first embodiment, when the audio signal processing unit 10 uses two types of rendering methods, information on whether or not each speaker can be used for either or both of the methods at the position where each speaker is arranged. Is the speaker type.

Environmental information is recorded in the storage unit 104 in advance. Therefore, the environment information acquisition unit 10201 reads information from the storage unit 104 as necessary.

The environment information recorded in the storage unit 104 may be recorded as metadata information described according to an arbitrary format, for example, a format such as XML. In this case, the environment information acquisition unit 10201 may be used. Decodes as appropriate to extract information.

Further, it is assumed that the sound image position and the speaker position are shown in a coordinate system as shown in FIG. The coordinate system used here is centered on the origin O as shown in the top view of FIG. 3A, the distance from the origin O is the radius r, the front of the origin O is 0 °, the right position, The azimuth angle θ with the left position being 90 ° and −90 °, respectively, and the front of the origin O is 0 ° and the position just above the origin O is 90 ° as shown in the side view of FIG. The elevation angle φ is assumed, and the sound image position and the speaker position are expressed as (r, θ, φ). In the following description, unless otherwise specified, the coordinate system of FIG. 3 is used for the sound image position and the speaker position.

In the first embodiment, as described above, the environment information is acquired in advance and recorded in the storage unit 104. However, the present disclosure is not limited to this, and information may be input in real time through an information input terminal (not shown in the first embodiment) such as a tablet terminal. Also, image processing is performed from an image taken by a camera installed at an arbitrary position in the viewing environment (for example, a marker is attached to the audio output unit 20 and this is recognized by a camera installed on the ceiling of the room). It is good also as a structure. Alternatively, a device that transmits position information to the audio output unit 20 itself (for example, using a beacon or the like) may be used to acquire various information.

[Rendering switching instruction signal calculation unit 10202]
The rendering switching instruction signal calculation unit 10202 is provided for each audio track based on the environment information obtained from the environment information acquisition unit 10201 and the sounding object position information of the track information 201 (FIG. 2) obtained by the content analysis unit 101. The audio signal is determined to be rendered by any of a plurality of rendering methods, and the information is output to the rendering unit 103.

Here, in the first embodiment, it is assumed that the rendering unit 103 simultaneously drives two types of rendering methods (rendering algorithms), that is, the rendering method A and the rendering method B, in order to make the description easier to understand.

Hereinafter, the operation of the rendering switching instruction signal calculation unit 10202 will be described with reference to FIG. FIG. 5 is a flowchart for explaining the operation of the rendering switching instruction signal calculation unit 10202.

When the rendering switching instruction signal calculation unit 10202 receives the above-described environment information and track information 201 (FIG. 2), it starts a rendering method selection process (step S101).

Then, it is confirmed whether or not rendering method selection processing has been performed for all audio tracks (step S102). If the rendering method selection process after step S103 is completed for all audio tracks (YES in step S102), the rendering method selection process is terminated (step S106). On the other hand, if there is an audio track that has not been subjected to rendering method selection processing (NO in step S102), the process proceeds to step S103.

In step S103, the sound generation object position information corresponding to the unprocessed audio track is referred to from the acquired track information 201 (FIG. 2), and the sound image position recorded as a part of the sound generation object position information is rendered. It is determined whether or not the image is included in the rendering processable range in the method A.

Here, the rendering processable range indicates a range in which a sound source can be arranged in a specific rendering method, and information (position information) indicating the position of the speaker obtained as part of the environment information as necessary. To be determined. Note that the determination of the rendering processable range does not necessarily require reference to the environment information (that is, information acquired by using some means regarding the current environment). For example, when the speaker position is determined on the system in advance and the user places the speaker at this position in accordance with an instruction from the system, it is not necessary to acquire the information. It is also possible to define a rendering processable range regardless of the position of the speaker (as will be described later, if the rendering process is a downmix to a monaural signal, the entire area can be defined as the processable range. it can).

A more specific example will be described with reference to FIG. Assume that a user (listener) 601 exists at the position of the origin O, and speakers (sound output devices) 602, 603, 604, and 605 are arranged around the user (listener) 601. The

speakers

602, 603, 604, 605 are arranged at the same height as the position of the viewer's head. (A) in FIG. 6 is a diagram when the layout is viewed from above, and (b) in FIG. 6 is a diagram when viewed from the side.

Reference numerals

606, 607, 608, and 609 denote positions (sound image positions) where sound images based on the sound signals of the respective sound tracks should be localized. The sound image positions 606, 607, and 608 are at the same height as the position of the viewer's head, and the sound image position 609 is higher than the position of the viewer's head. In this case, the rendering method A is VBAP (first rendering method), the rendering method B is transoral (second rendering method), and the speakers usable for VBAP are 602, 603, 604, 605, and transoral.

Speakers

602 and 603 that can be used in In that case, the rendering processable range in the rendering method A (VBAP) is a range sandwiched between adjacent speakers, specifically a range sandwiched between

speakers

602 and 603, a range sandwiched between 603 and 605, A range between 604 and 605 and a range between 602 and 604. Therefore, audio signals to be localized at the sound image positions 606, 607, and 608 included in this range (YES in step S103 in FIG. 5) can be processed by the rendering method A (VBAP). On the other hand, the sound image position 609 shown in FIG. 6 is higher than the position of the speaker and is not included in the rendering processable range in the rendering method A (VBAP) (NO in step S103 of FIG. 5). In this case, the sound image position 609 is a sound signal rendering process by a rendering method B which is a rendering method (trans-oral) that can localize a sound image regardless of the position of the speaker.

That is, as a result of the determination in step S103, if the sound image position of the unprocessed audio track is included in the rendering processable range in the rendering method A (YES in step S103), the process proceeds to step S104. On the other hand, as a result of the determination in step S103, if the sound image position of the unprocessed audio track is not included in the rendering processable range in the rendering method A (NO in step S103), the process proceeds to step S105.

In step S104, an instruction signal (rendering switching signal) for rendering the audio signal of the unprocessed audio track using the rendering method A is output to the rendering unit 103.

On the other hand, in step S105, an instruction signal (rendering switching signal) for rendering the audio signal of the unprocessed audio track using the rendering method B is output to the rendering unit 103.

In the above description, the sound image positions of all the audio tracks are described as being within the rendering processable range of either the rendering method A or the rendering method B. However, if this is not the case, that is, if there is a possibility that it does not fall within the rendering processable range of either rendering system A or rendering system B, the rendering system selection process is performed according to the flow shown in FIG. Also good.

FIG. 7 is a modification of the flow shown in FIG.

In the processing flow shown in FIG. 7, the first half is the same as the processing flow in FIG.

That is, first, the rendering switching instruction signal calculation unit 10202 receives the environment information and the track information 201 (FIG. 2), and the rendering method selection process starts (step S111).

Subsequently, it is confirmed whether or not rendering method selection processing has been performed for all audio tracks (step S112). If rendering method selection processing in step S113 and subsequent steps has been completed for all audio tracks (step S112). In step S118, the rendering method selection process is terminated. On the other hand, if there is an unprocessed audio track for which the rendering method selection process is not performed (NO in step S112), the sound generation object position information corresponding to the unprocessed audio track is referred to from the acquired track information 201 (FIG. 2). Similarly to step S103 described above, whether or not the sound image position recorded as part of the sound generation object position information corresponding to the unprocessed audio track is included in the rendering processable range in the rendering method A is determined. It discriminate | determines (step S113).

If the result of determination in step S113 is that the sound image position is within the rendering processable range in rendering method A (YES in step S113), the process proceeds to step S114. In step S <b> 114, an instruction signal for rendering the audio signal of the unprocessed audio track using the rendering method A is output to the rendering unit 103.

On the other hand, as a result of the determination in step S113, if the sound image position is not included in the rendering processable range in the rendering method A (NO in step S113), the process proceeds to step S115.

In step S115, it is determined whether or not the sound image position is included in the rendering processable range in the rendering method B.

If the result of determination in step S115 is that the sound image position is within the rendering processable range in rendering method B (YES in step S115), the process proceeds to step S116. On the other hand, as a result of the determination, if the sound image position is not included in the rendering processable range in the rendering method B (NO in step S115), the process proceeds to step S117. That is, if the sound image position is not included in the rendering processable range of the rendering method A and the rendering method B, the process proceeds to step S117.

In step S116, an instruction signal for rendering the audio signal of the unprocessed audio track using the rendering method B is output to the rendering unit 103.

On the other hand, in step S117, an instruction is issued not to render the audio signal of the unprocessed audio track. The instruction signal is output to the rendering unit 103.

As described above, in the first embodiment, the selectable rendering methods are described as two types, but it goes without saying that three or more types of rendering methods may be selected.

In the above description, the expression that the rendering switching instruction signal calculation unit 10202 is for instructing switching of the rendering method is used. However, the expression “instructing switching” here is used. In addition to the mode of instructing to switch the rendering mode from A to B or from B to A, the mode of instructing to use the rendering mode A in the next track of the track using the rendering mode A (also for the mode B) The same).

In the flow shown in FIG. 7, no sound is output for any track that is not included in any of the processing ranges of the rendering method A and the rendering method B, but the rendering method B has a wide processing range. By adopting a method, for example, downmixing to a monaural signal, it is possible to avoid the occurrence of a track in which no sound is output in a practical range.

[Rendering unit 103]
The rendering unit 103 constructs an audio signal to be output from the audio output unit 20 based on the input audio signal and the instruction signal output from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102.

Specifically, after receiving the audio signal included in the content, rendering the audio signal by a rendering method based on the instruction signal from the rendering switching instruction signal calculating unit 10202 of the rendering switching signal generating unit 102, and further mixing the audio signal, Output to the output unit 20.

In other words, the rendering unit 103 simultaneously drives two types of rendering algorithms, switches the rendering algorithm to be used based on the instruction signal output from the rendering switching instruction signal calculation unit 10202, and renders the audio signal.

Here, rendering means performing processing for converting an audio signal (input audio signal) included in the content into a signal to be output from the audio output unit 20.

Hereinafter, the operation of the rendering unit 103 will be described using the flow shown in FIG.

FIG. 8 is a flowchart showing the operation of the rendering unit 103.

When the rendering unit 103 receives the input audio signal and the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102, the rendering unit 103 starts the rendering process (step S201).

First, it is confirmed whether rendering processing has been performed for all audio tracks (step S202). In step S202, if the rendering process after step S203 has been completed for all audio tracks (YES in step S202), the rendering process is terminated (S208). On the other hand, if there is an unprocessed audio track (NO in step S202), rendering is performed using a rendering method based on the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102. When the instruction signal indicates the rendering method A (rendering method A in step S203), parameters necessary for rendering the audio signal using the rendering method A are read from the storage unit 104 (step S204). Rendering based on this is performed (step S205). Similarly, when the instruction signal indicates the rendering method B (rendering method B in step S203), parameters necessary for rendering the audio signal in the rendering method B are read from the storage unit 104 (step S206). Rendering based on is performed (step S207). If the instruction signal indicates no rendering based on the flow of FIG. 7 (no rendering in step S203), the corresponding track is not rendered and is not included in the output audio.

[Storage unit 104]
The storage unit 104 is configured by a secondary storage device for recording various data used in the rendering switching signal generation unit 102 and the rendering unit 103. The storage unit 104 is configured by, for example, a magnetic disk, an optical disk, a flash memory, and the like, and more specific examples include an HDD, an SSD (Solid State Drive), an SD memory card, a BD, a DVD, and the like. The rendering switching signal generation unit 102 and the rendering unit 103 read data from the storage unit 104 as necessary. In addition, various parameter data including the coefficient calculated by the rendering switching signal generation unit 102 can be recorded in the storage unit 104.

<Audio output unit 20>
The audio output unit 20 outputs the audio obtained by the rendering unit 103. Here, the audio output unit 20 includes a plurality of independent speakers, and each speaker includes a speaker unit and an amplifier (amplifier) that drives the speaker unit.

That is, the environment information acquisition unit 10201 acquires the position information of each speaker configured in the audio output unit 20. Then, the rendering switching instruction signal calculation unit 10202 selects a rendering method based on a plurality of pieces of position information acquired by the environment information acquisition unit 10201.

As described above, according to the first embodiment, a suitable rendering method considering sound image localization is automatically calculated according to the arrangement of speakers arranged by the user and information obtained from the content, and audio reproduction is performed. Thus, it is possible to deliver a sound with a good localization feeling to the user.

In the first embodiment, content including a plurality of audio tracks is targeted for reproduction. However, the present disclosure is not limited to this, and content including one audio track may be targeted for reproduction. In this case, a suitable rendering method for the one audio track is selected from a plurality of rendering methods.

(Rendering method)
In the first embodiment, a rendering method of VBAP, trans-oral, and downmixing to a monaural signal has been described. However, the present disclosure is not limited to these rendering methods.

For example, a rendering method similar to VBAP may be employed, in which an audio signal is output from each audio output unit at a sound pressure ratio corresponding to the sound image position (playback position). Also, a rendering method similar to transaural, in which an audio signal processed according to the sound image position (reproduction position) is output from each audio output unit, may be employed. When the sound image position is included in the range defined by the arrangement positions of the plurality of sound output units, the sound quality is improved by adopting a rendering method that outputs from each sound output unit at a sound pressure ratio according to the sound image position. An audio environment where emphasis is placed on can be realized. On the other hand, by adopting a rendering method that is processed according to the sound image position (reproduction position) such as transaural, it is possible to localize the sound image without being restricted by the arrangement of the sound output unit.

Also, in the present disclosure, for example, downmixing to a stereo signal can be adopted as one of rendering methods.

[Embodiment 2]
Another embodiment of the present disclosure will be described below with reference to FIGS. 9 to 12. For convenience of explanation, members having the same functions as those described in the first embodiment are denoted by the same reference numerals and description thereof is omitted.

FIG. 9 is a block diagram illustrating a main configuration of the audio signal processing system 1a according to the second embodiment of the present disclosure. Here, the audio signal processing system 1a according to the second embodiment is different only in the behavior of the rendering switching signal generation unit 102 in the audio signal processing system 1 shown in the first embodiment, and other processing units are used. Since these are the same, the description of the other configuration is the same as that described in the first embodiment unless described below.

The audio signal processing unit 10a of the audio signal processing system 1a according to the second embodiment is replaced with the rendering switching signal generation unit 102a (acquisition) instead of the rendering switching signal generation unit 102 of the audio signal processing unit 10 described in the first embodiment. Part).

The rendering switching signal generation unit 102a further acquires viewing position information indicating the viewing position of the user in addition to the track information and environment information (speaker position information) acquired by the rendering switching signal generation unit 102 of the first embodiment. . The rendering switching signal generation unit 102a selects one rendering method from among a plurality of rendering methods based on the track information, the position information, and the viewing position information. Details will be described below. In the second embodiment as well, for convenience of explanation, an appropriate selection is made from two types of rendering methods.

[Rendering switching signal generator 102a]
Although the details will be described later, the rendering switching signal generation unit 102a, based on the information related to the viewing environment, the track information 201 (FIG. 2) obtained by the content analysis unit 101, and the viewing position information, is used as a rendering method switching instruction signal Is generated. Details of the rendering switching signal generation unit 102a will be described with reference to FIG.

FIG. 10 is a block diagram showing a configuration of the rendering switching signal generation unit 102a. As shown in FIG. 10, the rendering switching signal generation unit 102a includes an environment information acquisition unit 10201a and a rendering switching instruction signal calculation unit 10202a.

[Environmental information acquisition unit 10201a]
The environment information acquisition unit 10201a is configured to acquire information on an environment in which the user views content (hereinafter referred to as environment information). For the environment information in the second embodiment, information (viewing environment information) indicating the viewing position of the user is added to the number, position, and type of speakers connected to the system as the audio output unit 20 shown in the first embodiment. It shall be a thing.

In the second embodiment, viewing environment information is acquired / updated in real time, and a marker is attached in advance by a camera (not shown) installed at an arbitrary position in the viewing environment and connected to the environment information acquisition unit 10201a. The user and the speaker (sound output unit 20) are photographed, the three-dimensional position is acquired, and the viewing environment information is updated.

As another means for acquiring the user position, the user position may be acquired by using face recognition from information obtained from a camera that is also installed.

Moreover, it is good also as a structure which attaches a positional-information transmission apparatus to each user and a speaker, and acquires the positional information, and it can input information in real time through information input terminals (not shown), such as a tablet terminal. Anyway.

[Rendering switching instruction signal calculation unit 10202a]
The rendering switching instruction signal calculation unit 10202a is provided for each audio track based on the environment information obtained from the environment information acquisition unit 10201a and the sound generation object position information of the track information 201 (FIG. 2) obtained by the content analysis unit 101. The audio signal is determined to be rendered by any of a plurality of rendering methods, and the information is output to the rendering unit 103.

Hereinafter, an operation flow of the rendering switching instruction signal calculation unit 10202a will be described with reference to FIG.

As shown in FIG. 11, upon receiving the above-described environment information and track information 201 (FIG. 2), the rendering switching instruction signal calculation unit 10202a starts a rendering method selection process (step S301).

Then, it is confirmed whether or not rendering method selection processing has been performed for all audio tracks (S302), and if rendering method selection processing in step S303 and subsequent steps has been completed for all audio tracks (YES in step S302). ), The rendering method selection process is terminated (step S310). On the other hand, if there is an audio track that has not been subjected to the rendering method selection process (NO in step S302), the process proceeds to step S303.

In step S303, the sound object position recorded as part of the sounding object position information is rendered by referring to the sounding object position information corresponding to the unprocessed audio track from the acquired track information 201 (FIG. 2). If it is included in the rendering processable range in method A (YES in step S303), and the current position of the user is within the viewing effective range of rendering method A based on the viewing position information (YES in step S304), An instruction signal for rendering the audio signal of the audio track by the rendering method A is output (step S305).

On the other hand, when the sound image position recorded as a part of the pronunciation object position information is not included in the rendering processable range in the rendering method A (NO in step S303), the user is based on the viewing position information. Is outside the effective viewing range of the rendering method A (NO in step S304), the process proceeds to step S306, and whether or not rendering by the rendering method B is possible is confirmed.

The sound image position recorded as part of the pronunciation object position information is included in the rendering processable range by the rendering method B (YES in step S306), and based on the viewing position information, the user's current position When the position is within the viewing effective range of the rendering method B (YES in step S307), an instruction signal for rendering the audio signal of the audio track by the rendering method B is output (step S308). On the other hand, when the sound image position recorded as a part of the pronunciation object position information is not included in the rendering processable range in the rendering method B (NO in step S306), or the current position of the user is the rendering method. If it is out of the viewing effective range of B (NO in step S307), an instruction is issued not to render the audio signal of the audio track (step S310).

Here, the rendering processable range indicates a range in which sound sources can be arranged in a specific rendering method as described in the first embodiment. Also, the viewing effective range is a recommended viewing area where the effect can be enjoyed in each rendering method (for example, as shown in FIG. 12, the viewing effective range of the rendering method A is represented as 1202, and the viewing effective range of the rendering method B is displayed. The range is represented as 1203), and what is recorded in the storage unit 104 in advance for each rendering method is appropriately read.

As described above, according to the second embodiment, a suitable rendering method that takes into account sound image localization according to the position of the speaker arranged by the user, the information obtained from the content, and the viewing position information of the user is provided. By calculating and performing sound reproduction, it is possible to deliver sound with good localization to the user.

[Embodiment 3]
As another embodiment of the present disclosure, another aspect related to the operation of the rendering switching instruction signal calculation unit 10202 illustrated in FIG. For convenience of explanation, members having the same functions as those described in the first embodiment are denoted by the same reference numerals and description thereof is omitted.

In the first embodiment, the rendering method A is VBAP and the rendering method B is trans-oral. However, in the third embodiment, the rendering method A is trans-oral and the rendering method B is VBAP.

Also in the third embodiment, in accordance with the operation flow shown in FIG. As described above, the transoral can localize the sound image without being limited to the range of the speaker arrangement position, whereas in the case of VBAP, the sound image position depends on the speaker arrangement position. Therefore, in the case of the aspect of the first embodiment in which it is first determined whether or not the audio track can be processed by VBAP, and if it cannot be processed, the rendering method changes in the content. As a result, there is a possibility that the user feels uncomfortable.

Therefore, in the third embodiment, it is first determined whether or not the processing can be performed by transoral (rendering method A) that does not depend on the speaker arrangement position. As a result, rendering based on a rendering method that can cover a wide range of sound image positions occupies most of the content, and it is difficult to give the above-mentioned uncomfortable feeling.

On the other hand, compared to transoral, VBAP has better sound quality because it localizes the sound image within the range of the speaker placement position. Therefore, it can be said that the aspect of the first embodiment that first determines whether processing is possible with VBAP emphasizes sound quality.

[Summary]
The audio signal processing device (audio signal processing unit 10, 10a) according to aspect 1 of the present disclosure renders audio signals of one or a plurality of audio tracks, and outputs a plurality of audio output devices (audio output unit 20 (speaker 602). 603, 604, 605)), which specifies the playback position of the audio signal of the audio track based on the audio track or information associated with the audio track. And a position information acquisition unit (rendering switching signal generation unit) that acquires position information of each of the audio output devices (audio output unit 20 (

speakers

602, 603, 604, 605)). 102, 102a) and a rendering method selected from a plurality of rendering methods based on the playback position and the position information. To, with the one rendering scheme, a processing unit for rendering the audio signal of the audio track corresponding to the reproduction position (the rendering unit 103, the rendering switching

signal generator

102, 102a), a.

According to the above configuration, it is possible to present the user with the sound rendered by a suitable rendering method under the viewing situation.

Specifically, according to the above configuration, a suitable rendering method is selected from a plurality of rendering methods based on the position of each sound output device and the reproduction position (sound image position) of the sound signal of the sound track. To do. In this way, if the input audio signal includes a plurality of audio tracks, rendering is performed for each audio track, or if the input audio signal includes one audio track, rendering is performed using a rendering method suitable for the one audio track. Become.

Therefore, it is possible to provide an environment where sound image localization can be suitably reproduced and multi-channel sound can be heard satisfactorily.

In the audio signal processing device (audio signal processing unit 10a) according to aspect 2 of the present disclosure, in the aspect 1, the position information acquisition unit (rendering switching signal generation unit 102a) stores the viewing position information indicating the viewing position of the user. Further, the processing unit (the rendering switching signal generation unit 102a and the rendering unit 103) acquires the one rendering method from the plurality of rendering methods based on the reproduction position, the position information, and the viewing position information. The audio signal of the audio track corresponding to the reproduction position may be rendered using the one rendering method.

According to the above configuration, the rendering method can be selected in consideration of the viewing position information of the user, and the sound image localization can be reproduced more suitably.

The audio signal processing device (audio signal processing unit 10, 10a) according to aspect 3 of the present disclosure is the above-described

aspect

1 or 2, wherein the reproduction position specifying unit (content analysis unit 101) is connected to the audio track or the audio track. It may be configured to analyze the accompanying information and generate track information indicating the reproduction position.

According to the above configuration, even when the information corresponding to the track information is not included in the input audio signal or the audio track, the track information is analyzed by analyzing the audio track or the information associated therewith by the reproduction position specifying unit. Can be generated.

The audio signal processing device (audio signal processing unit 10, 10a) according to aspect 4 of the present disclosure requires the processing unit (rendering unit 103, rendering switching

signal generation unit

102, 102a) in the above-described aspects 1 to 3. The configuration may further include a storage unit (104) for storing parameters.

In the audio signal processing device (audio signal processing units 10 and 10a) according to aspect 5 of the present disclosure, in the above aspects 1 to 4, the plurality of rendering methods may be configured such that the audio signal has a sound pressure ratio according to a reproduction position. From each audio output device, the first rendering method output from each audio output device (audio output unit 20 (

speakers

602, 603, 604, 605)) and the audio signal processed according to the reproduction position are output from each audio output device. And a second rendering method to be output.

In the audio signal processing device (audio signal processing units 10 and 10a) according to aspect 6 of the present disclosure, in the aspect 5, the first rendering method is VBAP, and the second rendering method is transoral. It may be.

The audio signal processing device (audio signal processing unit 10, 10a) according to aspect 7 of the present disclosure is the above-described aspect 1 to 6, and the processing unit (rendering unit 103, rendering switching

signal generation unit

102, 102a) It is determined whether or not the reproduction position is included in a range defined by the arrangement position of the plurality of audio output devices (audio output unit 20 (

speakers

602, 603, 604, 605)), and according to the determination result The above-described one rendering method may be selected.

According to the above configuration, for example, when the reproduction position is included in a range defined by the arrangement positions of a plurality of audio output devices, rendering using a rendering method that emphasizes sound quality such as VBAP is possible. It is.

Also, the audio signal processing device (audio signal processing unit 10, 10a) according to aspect 8 of the present disclosure is the above-described aspect 2, and the processing unit (rendering unit 103, rendering switching

signal generation unit

102, 102a) The effective viewing range of the system is specified, and it is determined whether or not the reproduction position is included in a range defined by the plurality of audio output devices (audio output unit 20 (

speakers

602, 603, 604, 605)). And determining whether or not the viewing position of the user indicated by the viewing position information is included in the effective viewing range and selecting the one rendering method according to the determination result. Good.

According to the above configuration, it is possible to deliver a sound with a good sense of localization to the user by considering the viewing position of the user.

Further, the audio signal processing system (audio signal processing system 1, 1a) according to aspect 9 of the present disclosure includes the audio signal processing apparatus (audio signal processing units 10, 10a) according to aspects 1 to 8 and the plurality of audio output apparatuses. (Sound output unit 20 (

speakers

602, 603, 604, 605)).

The present disclosure is not limited to the above-described embodiments, and various modifications can be made within the scope of the claims, and the embodiments can be obtained by appropriately combining technical means disclosed in different embodiments. Are also included in the technical scope of the present disclosure. Furthermore, a new technical feature can be formed by combining the technical means disclosed in each embodiment.

[Cross-reference of related applications]
This application claims the benefit of priority over the Japanese patent application filed on Feb. 17, 2017: Japanese Patent Application No. 2017-028396. Included in this document.

1, 1a Audio signal processing system 10, 10a Audio signal processing unit (audio signal processing device)
20 Audio output unit (multiple audio output devices)
101 Content analysis unit (playback position specifying unit)
102, 102a Rendering switching signal generation unit (position information acquisition unit, processing unit)
103 Rendering unit (processing unit)
104 Storage Unit 201

Track Information

602, 603, 604, 605 Speaker (Audio Output Device)
606, 607, 608, 609 Sound image position (playback position)
10201, 10201a Environmental information acquisition unit (position information acquisition unit)
10202, 10202a Rendering switching instruction signal calculation unit (processing unit)

Claims

An audio signal processing device that renders audio signals of one or more audio tracks and outputs the audio signals to a plurality of audio output devices,
A playback position specifying unit that specifies the playback position of the audio signal of the audio track based on the audio track or information associated with the audio track;
A position information acquisition unit that acquires position information of each of the audio output devices;
A process of selecting one rendering method from a plurality of rendering methods based on the reproduction position and the position information, and rendering the audio signal of the audio track corresponding to the reproduction position using the one rendering method And
An audio signal processing device comprising:
The position information acquisition unit further acquires viewing position information indicating the viewing position of the user,
The processing unit selects the one rendering method from the plurality of rendering methods based on the reproduction position, the position information, and the viewing position information, and uses the one rendering method to perform the reproduction. The audio signal processing apparatus according to claim 1, wherein the audio signal of the audio track corresponding to the position is rendered.
3. The audio signal processing apparatus according to claim 1, wherein the reproduction position specifying unit analyzes the audio track or information accompanying the audio track to generate track information indicating the reproduction position. .
The audio signal processing apparatus according to any one of claims 1 to 3, further comprising a storage unit that stores parameters required by the processing unit.
The plurality of rendering methods include: a first rendering method for outputting the audio signal from each audio output device at a sound pressure ratio corresponding to a reproduction position; and the audio signal processed according to the reproduction position. 5. The audio signal processing apparatus according to claim 1, further comprising a second rendering method for outputting from the audio output apparatus. 6.
The first rendering method is VBAP,
The audio signal processing apparatus according to claim 5, wherein the second rendering method is trans-oral.
The audio signal processing device according to any one of claims 1 to 6,
The plurality of audio output devices;
An audio signal processing system comprising: