WO2019207912A1

WO2019207912A1 - Information processing device and information processing method

Info

Publication number: WO2019207912A1
Application number: PCT/JP2019/005286
Authority: WO
Inventors: 康治浅野
Original assignee: ソニー株式会社
Priority date: 2018-04-23
Filing date: 2019-02-14
Publication date: 2019-10-31

Abstract

The present invention provides an information processing device and an information processing method capable of distinguishing a speech spoken by a user from sounds other than the speech (disruptive sounds). The information processing device comprises a processing unit that processes the history information for a sound source position that is estimated from a sound signal collected by a plurality of microphones for each time frame, and a determination unit for determining a sound source that should be suppressed on the basis of the history information for each sound source position. The determination unit determines a sound source with a sound that lasts for a long period of time and a smaller fluctuation in the estimated position as the sound source that should be suppressed on the basis of the distribution of time frames in which the sound source is present and the spatial distribution of the sound source which are estimated on the basis of the history information.

Description

Information processing apparatus and information processing method

The technology disclosed in this specification relates to an information processing apparatus and an information processing method that handle voice input from a user.

There are many devices that handle voice input from users, including voice call devices. Recently, various devices such as a voice agent that returns an appropriate response when a user speaks have been developed, announced and released.

In this kind of equipment, it is necessary to distinguish the device operation command by the user's voice from other sounds (interfering sound) and react appropriately only to the former. If the performance of identifying a user's specific voice is low, problems such as the inability to hear the original user utterance due to the disturbing sound and the malfunction of the device due to the disturbing sound occur.

For example, an utterance interval detection technique has been proposed in which a device extracts, from the sounds collected by its microphone, only the time intervals having characteristics similar to a human voice, and accepts only those intervals (for example, see Patent Document 1). An activation word technique has also been proposed in which the user speaks a specific, predetermined phrase (an “activation word”) before speaking a device operation command, and the device accepts the subsequent utterance after entering a user utterance standby mode (for example, see Patent Document 2). It is desirable to use a phrase that hardly appears in daily life as the activation word. Both the utterance interval detection technique and the activation word technique use, for device operation, only those time intervals among the sounds collected by the microphone that appear appropriate as a user's device operation command. Because they exclude all other time intervals, they can be regarded as techniques mainly for preventing malfunction of the device.

On the other hand, there is also a technique that limits the sound to be processed by the spatial distribution of sound sources rather than by time interval. For example, a proposal has been made to emphasize or suppress sound arriving from a particular direction by separating sound sources using the phase differences of sound waves that reach a plurality of microphones (see, for example, Patent Document 3).

JP2018-40982A
Japanese Unexamined Patent Publication No. 2016-218852
Japanese Translation of PCT Application Publication No. 2008-542798
Japanese Translation of PCT Application Publication No. 2012-512413

An object of the technology disclosed in this specification is to provide an information processing apparatus and an information processing method that handle voice input from a user.

The first aspect of the technology disclosed in this specification is an information processing apparatus comprising:
a processing unit that processes history information of sound source positions estimated, for each time frame, from audio signals collected by a plurality of microphones; and
a determination unit that determines a sound source to be suppressed based on the history information of each sound source position.

Based on the distribution of time intervals in which a sound source exists and on its spatial distribution, both estimated from the history information, the determination unit determines, as a sound source to be suppressed, a sound source that has a long sound duration, or a sound source that has a long sound duration and a small variation in estimated position.

The information processing apparatus further includes: a beamform unit that adjusts beamform parameters of the plurality of microphones so as to suppress the sound signal from the sound source to be suppressed; a speech section detection unit that cuts out a section that appears to be human speech from the sound data generated by the beamform unit; a speech recognition unit that converts the speech of the section extracted by the speech section detection unit into text; a semantic analysis unit that analyzes the user's utterance converted into text by the speech recognition unit and extracts the operation requested by the user and the parameters for realizing the operation request; and a response generation unit that generates a response satisfying the user's operation request based on the operation request and the parameters.

In addition, the second aspect of the technology disclosed in this specification is an information processing method comprising:
a processing step of processing history information of sound source positions estimated, for each time frame, from audio signals collected by a plurality of microphones; and
a determination step of determining a sound source to be suppressed based on the history information of each sound source position.

According to the technology disclosed in the present specification, it is possible to provide an information processing apparatus and an information processing method that can distinguish a voice uttered by a user from other sounds (interfering sounds).

The effects described in this specification are merely examples, and the effects of the present invention are not limited to them. The present invention may also have additional effects beyond those described above.

Other objects, features, and advantages of the technology disclosed in the present specification will become apparent from a more detailed description based on embodiments to be described later and the accompanying drawings.

FIG. 1 is a diagram illustrating a configuration example of the information processing apparatus 100.
FIG. 2 is a flowchart showing a processing procedure executed on the information processing apparatus 100.
FIG. 3 is a diagram showing an example of a temporal change of each sound source position stored in the sound source position history storage unit 107 in the form of a graph.
FIG. 4 is a layout view of the sound sources and the information processing apparatus 100 viewed from directly above when the recording of the sound source positions shown in FIG. 3 is obtained.
FIG. 5 is a diagram showing an example of history information of each sound source position stored in the sound source position history storage unit 107 in a table format.
FIG. 6 is a diagram showing another example of the temporal change of each sound source position stored in the sound source position history storage unit 107 in the form of a graph.
FIG. 7 is a flowchart showing a processing procedure for creating beamform parameters based on the position information of the sound source at the current time and the past sound source position.
FIG. 8 is a diagram illustrating a configuration example of an information device 800 that executes a part of functional modules for suppressing interference sound and responding to a user's utterance on the cloud.
FIG. 9 is a diagram illustrating a configuration example of another information processing apparatus 900 that executes a part of a functional module for suppressing interference sound and responding to a user's utterance on the cloud.
FIG. 10 is a diagram schematically illustrating data exchange between the information processing apparatus 900 and the cloud 901.

Hereinafter, embodiments of the technology disclosed in this specification will be described in detail with reference to the drawings.

As described above, techniques for distinguishing a device operation command given by a user's voice from other sounds (interfering sounds) fall into two types: techniques that limit the time interval and techniques that use the spatial distribution.

Among the techniques that limit the time interval, a technique that uses an activation word requires the user to utter the activation word immediately before every operation command, which is troublesome. Further, once the device has entered the user utterance standby mode, the technique is ineffective against interfering sound that intrudes during that time. A technique that detects time intervals having characteristics close to a human voice, in turn, has difficulty excluding human voices other than the user's that are output from a television or a speaker.

On the other hand, a technique using the spatial distribution needs to know the direction of the speaker in advance, and how to obtain that information is a problem. For example, a method that specifies the direction of the speaker using an image requires a camera and image processing. When the direction of the speaker is to be determined only from the sound collected by the microphones, and the speaking user can move freely, the user's position is unknown before the user starts speaking. The parameters for the sound collection direction (speaker direction) can therefore be adjusted only after the user starts speaking. As a result, the timing of the response is delayed, and the voice characteristics change between the start and the end of the user's utterance, which adversely affects the voice recognition accuracy.

Therefore, this specification proposes, for a stationary device that is used in a room or the like and receives a user's utterance as an operation command, a technique for distinguishing the user's utterance from other sounds (interfering sounds) using both the distribution of time intervals in which a sound source exists and its spatial distribution. Interfering sounds include human voices other than the user's that are output from televisions, speakers, and the like, and these are difficult to remove with speech segment detection technology. According to the technology disclosed in this specification, however, sounds output from televisions, speakers, and the like can be distinguished from the user's utterances.

Specifically, the technology disclosed in this specification distinguishes the voice uttered by the user by exploiting the temporal and spatial bias of sound output from a television or a speaker, as described in (1) and (2) below.

(1) User command utterances occur intermittently on the time axis and rarely continue for a long time. On the other hand, the output from a television or a speaker tends to have a continuous sound for a relatively long time during the reproduction of the content.

(2) The user moves, in other words, the utterance position moves. On the other hand, since the television and the speaker are used stationary, the sound source position does not move. Therefore, the voice input from the latter has little variation in the estimated position of the sound source when the sound source position is estimated using a plurality of microphones.

According to the technology disclosed in the present specification, it is possible to suppress malfunctions caused by erroneously recognizing a human voice output from a stationary television or speaker as a user utterance for device operation. Moreover, the technique disclosed in this specification can be combined with a technique that distinguishes a user's utterance by limiting the time interval, such as an utterance interval detection technique or an activation word technique, to further enhance the effect.

FIG. 1 shows a configuration example of an information processing apparatus 100 that handles voice input from a user and to which the technology disclosed in this specification is applied. The illustrated information processing apparatus 100 includes a plurality of microphones 101, an AD conversion unit 102, a sound source position estimation unit 103, a recorded sound source selection unit 104, a speaker identification unit 105, a sound source statistical information processing unit 106, a sound source position history storage unit 107, a movement detection unit 108, a beamform unit 109, a suppression effect adjustment unit 110, a speech section detection unit 111, a speech recognition unit 112, a semantic analysis unit 113, a response generation unit 114, a service providing unit 115, and a speaker 116.

The plurality of microphones 101 receive not only the user's utterances but also disturbing sounds emitted from a television set or a speaker installed in the room. The individual microphones constituting the plurality of microphones 101 are installed after their positions have been adjusted.

The AD conversion unit 102 samples and quantizes the audio signals collected by the microphones constituting the plurality of microphones 101 while synchronizing them, and converts them into digital signals.

The sound source position estimation unit 103 analyzes sound data obtained from sound collection by the plurality of microphones 101, performs sound source direction estimation for each of a plurality of microphone pairs, and estimates the position of the sound source by combining the sound source directions. At the same time, the sound source position estimating unit 103 separates the sound waveform from each sound source and provides it to the sound source statistical information processing unit 106 and the recorded sound source selecting unit 104.
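As a rough illustration of direction estimation for a single microphone pair, the sketch below converts an arrival-time difference into a direction angle using the far-field relation θ = arcsin(c·τ/d). The specification does not state which estimation method is used, so the far-field model, the function name `direction_from_delay`, the speed-of-sound constant, and the sign convention are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def direction_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Return the far-field direction angle in degrees for one microphone pair.

    0 degrees is straight ahead of the pair (broadside); the sign of the
    angle follows the sign of the delay. Illustrative model only.
    """
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Clamp to [-1, 1] so measurement noise cannot push asin out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

Directions obtained this way from several pairs could then be combined into a position estimate, which is the step the paragraph above attributes to the sound source position estimation unit 103.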

The recorded sound source selection unit 104 calculates the power of each sound source from the speech waveform obtained from the sound source position estimation unit 103, and sends the position information of the sound source equal to or greater than the threshold to the speaker identification unit 105 together with time information.

When a sound source includes a human voice, the speaker identification unit 105 compares it with the voices of users registered in advance, identifies whether it is the voice of a registered user or of an unregistered person, and records the result in the sound source position history storage unit 107. For example, the speaker identification unit 105 assigns a speaker ID to each user whose voice it has identified, and outputs the position information of the sound source together with the speaker ID when the voice can be matched with a user's voice registered in advance. When the speaker identification unit 105 cannot identify the speaker, it outputs the position information of the sound source with “Unknown”, indicating that the speaker could not be identified, instead of a speaker ID.

The sound source statistical information processing unit 106 determines a sound source to be suppressed from the position information of the sound source at the current time supplied from the sound source position estimation unit 103 and the information on past sound source positions stored in the sound source position history storage unit 107, and sends the position information of that sound source to the beamform unit 109.

The sound source position history storage unit 107 holds the sound source position at each time estimated by the sound source position estimation unit 103. Details of the information stored in the sound source position history storage unit 107 will be described later.

The movement detection unit 108 includes, for example, an acceleration sensor or an inertial measurement unit (IMU) built into the main body of the information processing apparatus 100, a contact sensor installed on the bottom surface, and the like, and detects that the main body of the information processing apparatus 100 has moved. When such movement is detected, the sound source position history storage unit 107 is notified.

The beamform unit 109 adjusts the beamform parameters of the plurality of microphones 101. Specifically, using the information on the position of the sound source to be suppressed calculated by the sound source statistical information processing unit 106, the beamform unit 109 adjusts the parameters of the microphone array constituted by the plurality of microphones 101 so as to suppress the sound signal from the sound source at that position. The beamform unit 109 then sends the sound data synthesized from the microphone array using the adjusted parameters to the speech section detection unit 111.

The suppression effect adjustment unit 110 is provided so that the user can manually adjust the suppression effect when, as a side effect of the suppression being too strong, the user's device operation utterances are no longer accepted. It consists of a mechanical control such as a knob.
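To make the idea of suppressing sound from a given direction concrete, the following sketch computes narrowband null-steering weights for a uniform linear array. It is only one possible technique and is not taken from the specification: the array geometry, the single-frequency model, the projection-based weight formula, and every name and default value in the code are assumptions.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed


def steering_vector(angle_deg, num_mics, spacing_m, freq_hz):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    phase_step = (2.0 * math.pi * freq_hz * spacing_m
                  * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S)
    return [cmath.exp(-1j * phase_step * m) for m in range(num_mics)]


def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def null_steering_weights(look_deg, suppress_deg, num_mics=4,
                          spacing_m=0.05, freq_hz=1000.0):
    """Weights that keep gain toward look_deg and place a null at suppress_deg."""
    a_look = steering_vector(look_deg, num_mics, spacing_m, freq_hz)
    a_null = steering_vector(suppress_deg, num_mics, spacing_m, freq_hz)
    # Remove from a_look its component along a_null (orthogonal projection).
    scale = inner(a_null, a_look) / inner(a_null, a_null)
    return [l - scale * n for l, n in zip(a_look, a_null)]


def array_response(weights, angle_deg, spacing_m=0.05, freq_hz=1000.0):
    """Magnitude of the array output for a unit plane wave from angle_deg."""
    a = steering_vector(angle_deg, len(weights), spacing_m, freq_hz)
    return abs(inner(weights, a))
```

For instance, `null_steering_weights(-45.0, 20.0)` yields weights whose response toward +20 degrees is numerically zero while the response toward −45 degrees remains well above zero. A practical beamformer would have to handle wideband speech and calibration errors, which this sketch ignores.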

The speech section detection unit 111 detects a section that seems to be human speech from the sound data created by the beamform unit 109, and cuts out sound data to be subjected to speech recognition.

The voice recognition unit 112 inputs the voice data cut out by the voice section detection unit 111 and converts the speech into text.

The semantic analysis unit 113 analyzes the user's utterance converted into text by the voice recognition unit 112, determines what kind of operation is requested, and extracts parameters necessary for the realization. For example, for the utterance “Tell me about the weather in Yokohama”, the operation request is “Confirm weather forecast” and the necessary parameter is “Yokohama”.

The response generation unit 114 generates a response that satisfies the user's operation request from the operation request and parameters obtained by the semantic analysis unit 113, in cooperation with the service providing unit 115 as necessary, synthesizes audio using speech synthesis, and outputs it from the speaker 116. However, the response generation unit 114 may output a response satisfying the user's operation request on the screen instead of the voice (or along with the voice).

The service providing unit 115 provides information necessary for generating a response. For example, to perform a service called “check weather forecast”, the service providing unit 115 calls the API (Application Programming Interface) of a weather forecast service with date information and location information attached, and obtains the weather forecast for the relevant date and time in the relevant area.

The information processing apparatus 100 illustrated in FIG. 1 is, for example, an information terminal such as a personal computer or a smartphone that has a function for dialog with a user (or executes a dialog application), a voice agent, a pet robot equipped with a dialog function, or the like.

In FIG. 1, some or all of the functional modules surrounded by a dotted line can be realized outside the information processing apparatus 100 instead of inside it. For example, an input signal may be transmitted from the plurality of microphones 101 to the cloud, the processing of the functional modules surrounded by the dotted line may be executed on the cloud, and the processing result may be received from the cloud and output from the speaker 116 as sound.

FIG. 2 shows a processing procedure executed on the information processing apparatus 100 shown in FIG. 1 in the form of a flowchart.

After the apparatus is powered on, this process is executed repeatedly at a certain interval (called a time frame) (step S210) for as long as the power remains on (No in step S201), and ends when the power is turned off (Yes in step S201).

First, it is determined whether or not the information processing apparatus 100 has moved based on the information detected by the movement detection unit 108 (step S202). If the information processing apparatus 100 has moved (Yes in step S202), the record of sound source positions held by the sound source position history storage unit 107 is cleared (step S203). This is because, if the information processing apparatus 100 moves, its positional relationship with the stationary television or speaker that is the source of the interfering sound changes, so the sound source position record accumulated so far can no longer be used.

Next, the sound source position estimation unit 103 performs sound source separation using the sound data input to the plurality of microphones 101 (or the digital signal AD-converted by the AD conversion unit 102) and estimates the position of each sound source (step S204). Established methods such as beamforming are available for sound source separation using the plurality of microphones 101.

Next, the recorded sound source selection unit 104 checks whether, among the sound data of the sound sources separated by the sound source position estimation unit 103, there is a sound source having a power equal to or higher than a threshold value (step S205).

The threshold value referred to here may be determined statically or dynamically. The static threshold value is a constant value determined in advance as device design information. On the other hand, the dynamic threshold is a threshold that is dynamically changed in consideration of, for example, the level of the surrounding background sound, and the threshold increases when the surrounding background noise level is high.
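The static and dynamic threshold variants described above can be sketched as follows. This is a minimal illustration, not the specification's implementation: power is taken here as mean squared amplitude, and the function names, the `margin` factor, the `floor`, and the default static threshold are assumed values.

```python
def frame_power(samples):
    """Mean squared amplitude of one separated source in one time frame."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def dynamic_threshold(background_power, margin=4.0, floor=1e-4):
    """Threshold that rises with the background level (margin, floor assumed)."""
    return max(floor, margin * background_power)


def select_sources(separated, background_power=None, static_threshold=1e-3):
    """Return the ids of sources whose power is at or above the threshold.

    separated maps a source id to its samples for the current time frame.
    If background_power is given, the dynamic threshold is used instead of
    the static one.
    """
    if background_power is None:
        threshold = static_threshold
    else:
        threshold = dynamic_threshold(background_power)
    return [sid for sid, samples in separated.items()
            if frame_power(samples) >= threshold]
```

With a quiet background, a source of moderate power passes the static threshold; when the measured background power rises, the same source can fall below the dynamic threshold and is no longer recorded.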

Here, when there is a sound source having a power equal to or higher than the threshold (Yes in step S205), the speaker identification unit 105 further compares each sound source that includes a human voice with the voices of users registered in advance, and identifies whether it is the voice of a registered user or of an unregistered person (step S206). Then, the time information, the position information, and, when a human voice is included, the speaker identification information are recorded in the sound source position history storage unit 107 (step S207).

Next, the sound source statistical information processing unit 106 uses the sound source positions at the previous time recorded in the sound source position history storage unit 107 and the current sound source position information given from the sound source position estimation unit 103 to associate the sound sources with each other, and identifies a sound source to be suppressed (step S208). Details of the method for identifying the sound source to be suppressed will be described later.

Then, the beamform unit 109 calculates and updates the beamform parameters so as to suppress this direction from the position information of the sound source to be suppressed (step S209).

In the processing procedure shown in FIG. 2, the beamform parameters are updated by the beamform unit 109 at regular intervals. If the beamform parameters are updated during the user's utterance, however, unnecessary distortion may occur in the waveform used for speech recognition. Therefore, the beamform parameters may be kept from being updated while the user is speaking, that is, while the speech section detection unit 111 is detecting a speech section.
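One simple way to hold back parameter updates during an utterance is to keep newly computed parameters pending and apply them only in a frame where no speech section is being detected. The class below is a hypothetical sketch of that idea; `BeamformUpdater` and its method are illustrative names, and the specification does not prescribe this mechanism.

```python
class BeamformUpdater:
    """Holds the active beamform parameters and defers updates during speech."""

    def __init__(self, initial_params):
        self.active = initial_params
        self.pending = None

    def on_frame(self, new_params, speech_detected):
        """Call once per time frame; returns the parameters to use this frame."""
        if new_params is not None:
            # Remember only the most recent proposal.
            self.pending = new_params
        # Apply the pending parameters only outside a speech section.
        if not speech_detected and self.pending is not None:
            self.active = self.pending
            self.pending = None
        return self.active
```

Parameters proposed while speech is detected are remembered but not applied, and the most recent proposal takes effect in the first frame after the speech section ends.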

FIG. 3 shows an example of the temporal change of each sound source position stored in the sound source position history storage unit 107 in the form of a graph. In the graph shown in the figure, the sound source position is represented by the direction θ of the sound source from the information processing apparatus 100 (or the plurality of microphones 101), the front of the information processing apparatus 100 (or the plurality of microphones 101) is 0 degrees, and left and right ± 90 degrees is set on the vertical axis. In addition, time is taken on the horizontal axis.

The sound sources indicated by reference numbers 301 to 304 are sound sources identified by the speaker identification unit 105 as being the voice of the same speaker (hereinafter referred to as “user 1”). Because human utterances occur intermittently on the time axis and rarely continue for a long time (as described above), it can be estimated from the sound source position history information that the sound sources 301 to 304 are not a sound device such as a television or a speaker but a natural person (user 1).

Further, the sound sources indicated by reference numbers 311 and 312 are sound sources identified by the speaker identification unit 105 as the voice of the same speaker (hereinafter referred to as “user 2”). Because the sounds of the sound sources 311 and 312 occur intermittently on the time axis and rarely continue for a long time, it can be estimated from the sound source position history information that they are utterances of a natural person (user 2) rather than of a sound device such as a television or a speaker. In addition, since the positions of the sound sources 311 and 312 are moving, it can be estimated that they are not sound from a stationary acoustic device such as a television or a speaker but sound from a natural person (user 2) whose utterance position moves.

On the other hand, the sound sources indicated by reference numbers 321 and 322 have different speech waveforms and do not match any user's voice registered in advance, so the speaker identification unit 105 determines that they are neither the same speaker nor the voice of an already registered speaker. In addition, because the playback sound of content output from audio equipment such as televisions and speakers tends to continue for a relatively long time, or because such equipment is stationary and its sound source position does not move, it can be estimated from the sound source position history information that the sound sources 321 and 322 are the same acoustic device.

It should be fully understood that each sound source can be estimated using the sound source direction angle that can be estimated by a plurality of microphones 101 including two or more microphones.

FIG. 4 is a layout view, seen from directly above, of the information processing apparatus 100 at the time the recording of the sound source positions shown in FIG. 3 was obtained, together with the user 1 who was the sound source of the utterances indicated by reference numbers 301 to 304, the user 2 who was the sound source of the utterances indicated by reference numbers 311 and 312, and the audio equipment that was the sound source of the sounds indicated by reference numbers 321 and 322.

The stationary television 401 is placed in the direction of +20 degrees from the front of the information processing apparatus 100. Further, the user 1 is in the direction of −45 degrees from the front of the information processing apparatus 100, sitting on the sofa 402 and watching the television 401. In addition, another user 2 is standing in the direction of +80 degrees from the front of the information processing apparatus 100 and leaves the room after a while.

Consider the sound source position recording of the sound sources of the user 1, the user 2, and the television 401 shown in FIG.

The sound from the direction of the television 401 continues to sound from the direction of +20 degrees from the front of the information processing apparatus 100 while the content is being reproduced.

On the other hand, the utterances of the user 1 and the user 2 are shorter in duration than the sounds of the reproduced content, and since the user 2 moves, the variance (or fluctuation) of the utterance positions is larger than that of the stationary television 401.

FIG. 5 shows an example of history information of each sound source position stored in the sound source position history storage unit 107 in FIG. Each time the sound source position history storage unit 107 receives a sound source and its position information, estimated by the sound source position estimation unit 103, via the recorded sound source selection unit 104 and the speaker identification unit 105, it stores the sound source position information (the sound source direction θ from the information processing apparatus 100), the sound output start time of the sound source, and the sound duration in association with the speaker ID.

When sound source position information is input via the recorded sound source selection unit 104 and the speaker identification unit 105, the sound source position history storage unit 107 creates a new entry in the table shown in the figure and records the sound source position together with the sound output start time. Thereafter, when the input of the sound source position information ends, the duration of the sound (sound source duration) is recorded in the entry, and when a speaker ID is output from the speaker identification unit 105, it is also recorded.
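The table entries described above could be represented by a small data structure such as the one sketched below. The field names, the use of seconds as the time unit, and the class and method names are assumptions chosen for illustration; only the stored items (direction θ, start time, duration, speaker ID) come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceEntry:
    direction_deg: float              # sound source direction θ from the apparatus
    start_time: float                 # sound output start time (seconds, assumed unit)
    duration: Optional[float] = None  # filled in when the sound ends
    speaker_id: str = "Unknown"       # speaker ID, or "Unknown" if not identified


class SourcePositionHistory:
    """Hypothetical in-memory stand-in for the history storage unit."""

    def __init__(self):
        self.entries: List[SourceEntry] = []

    def open_entry(self, direction_deg, start_time):
        """Create a new entry when a sound source first appears."""
        entry = SourceEntry(direction_deg, start_time)
        self.entries.append(entry)
        return entry

    def close_entry(self, entry, end_time, speaker_id=None):
        """Record the duration, and the speaker ID if one was identified."""
        entry.duration = end_time - entry.start_time
        if speaker_id is not None:
            entry.speaker_id = speaker_id

    def clear(self):
        """Discard all records, for example after the apparatus has been moved."""
        self.entries.clear()
```

The `clear` method corresponds to the clearing of the sound source position record in step S203.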

FIG. 6 shows another example of the temporal change of each sound source position stored in the sound source position history storage unit 107 in the form of a graph. Similar to the graph shown in FIG. 3, the sound source position is represented by the direction θ of the sound source from the information processing apparatus 100 (or the plurality of microphones 101), and the front of the information processing apparatus 100 (or the plurality of microphones 101) is set to 0 degrees. The vertical axis represents ± 90 degrees to the left and right, and the horizontal axis represents time.

The sound sources indicated by reference numerals 301 to 304, 311 and 312, and 321 and 322 are the same as in the graph shown in FIG.

Since the sound of the sound source indicated by reference number 601 continues for a long time and the speaker identification unit 105 cannot identify the speaker, it can be estimated that the sound source is not an utterance of a natural person but an acoustic device. Further, since the position of the sound source 601 moves over time from the front of the information processing apparatus 100 toward a direction near −80 degrees, it can be estimated that the device is not of a stationary type. Therefore, based on the sound source position history information, the sound source 601 can be estimated to be a mobile or portable acoustic device such as a music playback device carried by the user.

Although not shown in FIG. 6, when there is a sound source of the same speaker moving with the sound source 601, it can be estimated that the sound source 601 is a music playback device carried by the speaker.

In the information processing apparatus 100 according to the present embodiment, the sound source statistical information processing unit 106 distinguishes the voice uttered by the user by using the temporal and spatial bias between the voice of a user uttering an operation command and the disturbing sound output from a television or a speaker. An example in which sound sources are distinguished based on temporal and spatial bias is shown in Table 1 below. In the example shown in Table 1, a sound source with a long duration is subject to suppression as a disturbing sound regardless of the variance (or fluctuation) of the estimated position.

According to the information processing apparatus 100 according to the present embodiment, the interference sound output from the “vacuum cleaner”, the “music playback device carried by the user”, the “stationary acoustic device”, or the like is targeted for suppression. As a result, it becomes easy to listen to the voice of the user who speaks the operation command to the voice agent or the like, and it is possible to prevent an unintended malfunction of the device due to the disturbing sound.
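The decision described above can be sketched as a simple rule on two statistics of a tracked sound source: its duration and the variance of its estimated direction. The threshold values and the function name are assumptions and do not come from the specification. The optional `require_stationary` flag covers the stricter variant in which only long-lasting sources with small position variation are suppressed.

```python
from statistics import pvariance


def should_suppress(duration_s, directions_deg, long_duration_s=30.0,
                    require_stationary=False, max_variance_deg2=25.0):
    """Decide whether one tracked source is treated as a disturbing sound.

    duration_s: how long the source has been sounding.
    directions_deg: estimated directions of the source over that period.
    """
    if duration_s < long_duration_s:
        return False  # short, intermittent sound: likely a user utterance
    if not require_stationary:
        return True   # long sounds are suppressed regardless of position spread
    return pvariance(directions_deg) <= max_variance_deg2
```

Under the default rule, both a stationary television and a portable music player that moves across the room are suppressed, while a short utterance is not. With `require_stationary=True`, the moving player is no longer selected because its direction variance is large.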

FIG. 7 is a flowchart showing the processing procedure by which the sound source statistical information processing unit 106 creates beamform parameters based on the position information of the sound sources at the current time and the past sound source positions. The illustrated processing procedure corresponds to the details of the processing executed in step S208 in the flowchart shown in FIG. In FIG. 7, S_i(t) is the information on the i-th sound source in time frame t, P(S_i(t)) is the sound source position of S_i(t), |P(S_i(t)) − P(S_j(t−1))| is the distance between two sound source positions (that of the i-th sound source in time frame t and that of the j-th sound source in time frame t−1), and T(S_i(t)) is the duration for which the sound of the i-th sound source S_i(t) has continued up to time frame t.
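The notation above maps naturally onto a small data structure. The sketch below is an assumption for illustration only: the class name `SourceInfo`, the use of two-dimensional coordinates for the position, and the Euclidean distance are not specified in the text.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SourceInfo:
    """Information S_i(t) on one sound source in one time frame."""
    position: Tuple[float, float]  # P(S_i(t)); 2-D coordinates are an assumption
    duration: int = 1              # T(S_i(t)), counted in time frames


def position_gap(current: SourceInfo, previous: SourceInfo) -> float:
    """Distance |P(S_i(t)) - P(S_j(t-1))| between two estimated positions."""
    return math.dist(current.position, previous.position)
```

For example, `position_gap(SourceInfo((0.0, 0.0)), SourceInfo((3.0, 4.0)))` evaluates to 5.0.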

First, the sound source statistical information processing unit 106 acquires one item of sound source information S_i(t) among the plurality of sound sources in the current time frame t (step S701).

Next, the sound source statistical information processing unit 106 acquires the sound source information of the previous time frame from the sound source position history storage unit 107, and checks whether or not there is a sound source S_j(t−1) whose estimated position deviates from that of the sound source S_i(t) by no more than a predetermined threshold ε₁ (step S703).

When there is a sound source S_j(t−1) whose estimated position deviation from the sound source S_i(t) is equal to or less than the predetermined threshold ε₁ (Yes in step S703), the sound source statistical information processing unit 106 further checks whether or not the acoustic feature quantities of the two sound sources S_i(t) and S_j(t−1) are similar (step S704).

If the acoustic feature quantities of the two sound sources S_i(t) and S_j(t−1) within the predetermined distance ε₁ are similar (Yes in step S704), the sound source statistical information processing unit 106 determines that the two sound sources S_i(t) and S_j(t−1) are the same sound source (step S705), increments the duration of the sound source by one time frame (i.e., T(S_i(t)) = T(S_j(t−1)) + 1), and records it in the sound source position history storage unit 107 (step S706).

On the other hand, when there is no sound source whose estimated position deviation from the sound source S_i(t) is equal to or less than the predetermined threshold ε₁ (No in step S703), or when the acoustic feature quantities of the two sound sources S_i(t) and S_j(t−1) within the predetermined distance ε₁ are not similar (No in step S704), the sound source statistical information processing unit 106 records the duration T(S_i(t)) = 1 for the sound source S_i(t) acquired in step S701 in the sound source position history storage unit 107 (step S707).

If unprocessed sound source information remains in the current time frame t (No in step S708), the process returns to step S701, the sound source statistical information processing unit 106 acquires one item of unprocessed sound source information, and the same processing as described above is repeated.
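The loop of steps S701 to S708 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the class `Source`, the function names, the use of cosine similarity as the acoustic similarity test, and the similarity threshold of 0.9 are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Source:
    position: Tuple[float, float]  # estimated position P(S_i(t))
    features: Sequence[float]      # acoustic feature vector (assumed representation)
    duration: int = 1              # T(S_i(t)) in time frames


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_durations(current: List[Source], previous: List[Source],
                     eps1: float, min_similarity: float = 0.9) -> None:
    """Set T(S_i(t)) for every source of the current frame (steps S701 to S708)."""
    for src in current:                          # S701, repeated until S708 is Yes
        src.duration = 1                         # S707: treated as a new source by default
        for prev in previous:
            close = math.dist(src.position, prev.position) <= eps1           # S703
            similar = cosine_similarity(src.features, prev.features) >= min_similarity  # S704
            if close and similar:                # S705: judged to be the same source
                src.duration = prev.duration + 1  # S706
                break
```

A source that is close to a previous one but acoustically different keeps a duration of 1, which corresponds to the "No in step S704" branch.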

When the processing of all sound source information in the current time frame t is completed (Yes in step S708), the sound source statistical information processing unit 106 checks whether, among the sound source position information stored in the sound source position history storage unit 107, there is a sound source that has continued for a predetermined time or more and whose estimated position variance (or fluctuation) is less than or equal to a threshold ε₂ (step S709).

A sound source that has continued for a predetermined time or more and whose estimated position variance (or fluctuation) is less than or equal to the threshold ε₂ can be estimated to be not a natural person, that is, not a user speaking an operation command, but a stationary acoustic device such as a television or a speaker emitting a disturbing sound. Therefore, when the sound source statistical information processing unit 106 detects such a sound source (Yes in step S709), it transmits the position information to the beamform unit 109 as a sound source to be suppressed. The beamform unit 109 then calculates beamform parameters from the position information of the sound source to be suppressed so as to suppress sound from that direction (step S710).
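The check in step S709 can be sketched as a filter over the stored history. The sketch assumes, purely for illustration, that the history is a dictionary mapping a source identifier to the list of directions (in degrees) estimated in consecutive time frames; the function name and this data layout are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def select_suppression_targets(history: Dict[str, List[float]],
                               min_frames: int,
                               eps2: float) -> Dict[str, float]:
    """Return {source id: mean direction} for sources to be suppressed (step S709)."""
    targets: Dict[str, float] = {}
    for source_id, angles in history.items():
        if not angles:
            continue  # no observation, nothing to decide
        long_lasting = len(angles) >= min_frames
        if long_lasting and pvariance(angles) <= eps2:
            # The mean direction is what would be handed to the beamform unit (step S710).
            targets[source_id] = sum(angles) / len(angles)
    return targets
```

With `history = {"tv": [30.0, 30.5, 29.5, 30.0], "user": [10.0, -20.0]}`, `min_frames=3`, and `eps2=1.0`, only `"tv"` is selected, with a mean direction of 30.0 degrees.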

In this way, the beamform unit 109 can adjust the parameters of the microphone array formed by the plurality of microphones 101 to suppress the sound signal from the sound source estimated to be a disturbing sound.
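The specification does not state how the beamform parameters are calculated. As one hedged illustration of suppressing a given direction, the sketch below uses textbook null steering for a two-microphone array at a single frequency, under a far-field plane-wave assumption. The function names, the microphone spacing, and the frequency used in the example are assumptions, not values from the embodiment.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_phase(angle_deg: float, freq_hz: float, spacing_m: float) -> complex:
    """Phase of the second microphone relative to the first for a plane wave."""
    delay = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return cmath.exp(-2j * math.pi * freq_hz * delay)


def null_steering_weights(null_angle_deg: float, freq_hz: float, spacing_m: float):
    """Weights of a two-microphone array that cancel a plane wave from null_angle_deg."""
    return (1.0 + 0j, -steering_phase(null_angle_deg, freq_hz, spacing_m))


def array_gain(weights, angle_deg: float, freq_hz: float, spacing_m: float) -> float:
    """Magnitude of the array response w^H a(angle) for a plane wave from angle_deg."""
    a = (1.0 + 0j, steering_phase(angle_deg, freq_hz, spacing_m))
    return abs(sum(w.conjugate() * x for w, x in zip(weights, a)))
```

For weights computed with a null at 30 degrees, `array_gain` is numerically zero at 30 degrees and nonzero toward the front (0 degrees), so sound from the suppressed direction is cancelled while sound from the front is retained.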

As the threshold ε₁ for the estimated position deviation in step S703 or the threshold ε₂ in step S709, for example, the distance that a person moves during one time frame is set. With this setting, the utterance of a user who talks while walking can still be listened to, while a sound source located farther away is suppressed without being equated with that user.
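As a worked example of choosing such a threshold, the distance moved in one time frame is the walking speed multiplied by the frame length. The values used below (1.4 m/s, 0.5 s) and the conversion to an angular threshold for a listener at a given distance are assumptions for illustration; the specification gives no numerical values.

```python
import math


def step_threshold_m(walking_speed_m_s: float, frame_s: float) -> float:
    """Distance a person moves during one time frame."""
    return walking_speed_m_s * frame_s


def step_threshold_deg(step_m: float, distance_m: float) -> float:
    """Angle subtended by a sideways step of step_m seen from distance_m (assumed geometry)."""
    return math.degrees(math.atan2(step_m, distance_m))


# Assumed values: walking speed of 1.4 m/s and a time frame of 0.5 s.
eps1_m = step_threshold_m(1.4, 0.5)
# The same step seen from 2 m away, expressed as a direction change in degrees.
eps1_deg = step_threshold_deg(eps1_m, 2.0)
```

When positions are handled as directions from the apparatus, as in FIG. 6, an angular threshold of this kind may be more convenient than a distance.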

In step S704, it is determined whether or not the sound source is the same from the acoustic point of view by using an acoustic feature used for speaker identification. That is, it is possible to more accurately determine whether the sound sources are the same by checking in steps S703 and S704.

Incidentally, for the case where the sound source is an utterance by a user (human), a method has been proposed that tracks a moving sound source with a particle filter using, in addition to the position information of the sound source, the sound source feature quantities used for speaker identification (see, for example, Patent Document 4). By using this method, the accuracy can be improved compared with sound source determination based only on position information.

Also, if the speaker recognition function of the speaker identification unit 105, which works from the voice data, can be used when the processing shown in FIG. 7 is performed, the correspondence between sound sources can be estimated with higher accuracy.

It should be noted that the sound from a television or speaker may sometimes fall silent depending on the content being played. Therefore, when estimating the correspondence of sound sources between time frames, the sound source statistical information processing unit 106 refers not only to the immediately preceding time frame but also to past sound source information near the same position. This allows it to operate more robustly even when the content happens to contain a silent section and the sound source cannot be estimated well in the immediately preceding frame.

As described above, the information processing apparatus 100 according to the present embodiment can distinguish the voice uttered by the user from other sounds (interfering sounds) by using both the distribution of the time intervals in which a sound source exists and its spatial distribution. Therefore, the information processing apparatus 100 can suppress malfunctions caused by erroneously recognizing a human voice output from a stationary television or speaker as a user utterance for device operation. The effect can be further enhanced by additionally applying to the information processing apparatus 100 techniques that distinguish the user's utterance by limiting the time interval, such as utterance interval detection and activation words.
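Looking back over several past frames instead of only the previous one can be sketched as follows. The function name, the list-of-frames layout, and the default look-back of five frames are assumptions for illustration.

```python
import math
from typing import List, Optional, Tuple

Position = Tuple[float, float]


def find_recent_match(position: Position,
                      past_frames: List[List[Position]],
                      eps1: float,
                      max_lookback: int = 5) -> Optional[int]:
    """Return how many frames back a source near `position` was last seen, or None.

    past_frames[-1] is the immediately preceding frame, past_frames[-2] the one
    before it, and so on. An empty inner list represents a silent frame.
    """
    for frames_back in range(1, min(max_lookback, len(past_frames)) + 1):
        for past_position in past_frames[-frames_back]:
            if math.dist(position, past_position) <= eps1:
                return frames_back
    return None
```

If the two most recent frames are silent but a source was seen near the same position three frames ago, the function returns 3, so the source can still be treated as a continuation rather than as a new source.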

In the above description, the embodiment has been described in which the processing for distinguishing the voice uttered by the user from other sounds (interfering sounds) is performed on the information processing apparatus 100 configured as a physically single apparatus.

However, some or all of the functional modules for suppressing the interfering sound and responding to the user's utterance provided in the information processing apparatus 100 shown in FIG. 1 may be executed on the cloud, and the response may be output after receiving the processing result from the cloud.

FIG. 8 illustrates an information processing apparatus 800 configured so that the functions of the speech recognition unit 112, the semantic analysis unit 113, the response generation unit 114, and the service providing unit 115, which are surrounded by a dotted line, are executed on the cloud 801.

The information processing apparatus 800 further includes a communication unit (not shown) for communicating with the cloud. The audio waveform data (A) of the speech utterance section of the sound source corresponding to the user's utterance, detected by the speech section detection unit 111, is transmitted to the cloud 801.

On the cloud 801 side, the voice recognition unit 112 converts the voice waveform data of the voice utterance section of the sound source corresponding to the user's utterance into text. The semantic analysis unit 113 analyzes the user's utterance converted into text by the speech recognition unit 112, determines what kind of operation is requested, and extracts the parameters necessary for realizing it. Then, the response generation unit 114, in cooperation with the service providing unit 115 as necessary, generates a response satisfying the user's operation request from the operation request and parameters obtained by the semantic analysis unit 113. It synthesizes voice using speech synthesis or, instead of voice, generates response text and screen information, and the generated response content (B) is transmitted to the information processing apparatus 800.

When the communication unit (described above) receives response voice data, text, or screen information, the information processing apparatus 800 outputs the voice from the speaker 116 or displays a response text message or screen on a screen (not shown).

FIG. 9 shows a configuration example of another information processing apparatus 900 in which part of the functional modules for suppressing the interference sound and responding to the user's utterance is executed on the cloud. In the figure, the sound source position estimation unit 103, the recorded sound source selection unit 104, the speaker identification unit 105, the sound source statistical information processing unit 106, the beamform unit 109, the suppression effect adjustment unit 110, the speech section detection unit 111, the voice recognition unit 112, the semantic analysis unit 113, the response generation unit 114, and the service providing unit 115, which are surrounded by a dotted line, are configured to execute on the cloud 901. Compared with the information processing apparatus 800 illustrated in FIG. 8, the processing in the information processing apparatus 900 is reduced, and more processing is performed on the cloud 901 side.

The information processing apparatus 900 further includes a communication unit (not shown) for communicating with the cloud. On the information processing apparatus 900 side, the voice waveform data (C) for each of the plurality of microphones 101, processed by the AD conversion unit 102, is transmitted to the cloud 901 side.

On the cloud 901 side, the sound source position estimation unit 103 estimates the sound source directions and separates the sound waveform of each sound source from the sound waveform data (C) of the microphones. The recorded sound source selection unit 104 calculates the power of each sound source from the speech waveform obtained from the sound source position estimation unit 103. When a sound source includes a human voice, the speaker identification unit 105 acquires the speaker identification parameter (D) from the sound source position history storage unit 107 of the information processing apparatus 900 and identifies whether the voice belongs to a user registered in advance or to an unregistered user. Then, the cloud 901 transmits the sound source position information (E) associated with the speaker ID in the current time frame to the information processing apparatus 900.

On the information processing apparatus 900 side, the sound source position information (E) received from the cloud 901 is recorded in the sound source position history storage unit 107. Further, the information processing apparatus 900 transmits information (F) related to the sound source position in the past time frame, which is recorded in the sound source position history storage unit 107, to the cloud 901.

On the cloud 901 side, the sound source statistical information processing unit 106 determines the sound source to be suppressed from the position information of the sound sources at the current time supplied from the sound source position estimation unit 103 and the position information (F) of the sound sources in past time frames received from the information processing apparatus 900. Then, the beamform unit 109 calculates, from the information on the position of the sound source to be suppressed, the parameters of the microphone array for suppressing the sound signal from the sound source at that position, and synthesizes the sound waveform data (C) received from the information processing apparatus 900 using those parameters.

The speech section detection unit 111 detects a section that seems to be human speech from the sound data created by the beamform unit 109, and cuts out the sound data to be subjected to speech recognition. The voice recognition unit 112 receives the voice data cut out by the speech section detection unit 111 and converts the speech into text. The semantic analysis unit 113 analyzes the user's utterance converted into text by the speech recognition unit 112 and extracts the parameters for realizing the operation requested by the user. Then, the response generation unit 114, in cooperation with the service providing unit 115 as necessary, generates a response to the user's operation request, including voice, text, a screen, and the like, from the operation request and parameters obtained by the semantic analysis unit 113, and the response content (G) is transmitted to the information processing apparatus 900.

When the response content (G) including response voice data, text, or screen information is received by the communication unit (described above), the information processing apparatus 900 outputs the voice from the speaker 116 or displays a response text message or screen on a screen (not shown).

FIG. 10 schematically shows data exchange between the information processing apparatus 900 and the cloud 901.

The information processing apparatus 900 transmits the voice waveform data (C) for each of the plurality of microphones 101 to the cloud 901 side. Further, when a sound source includes a human voice, the information processing apparatus 900 transmits the speaker identification parameter (D) read from the sound source position history storage unit 107 to the cloud 901 side. In response, the cloud 901 calculates the power of each sound source from the received speech waveform data (C), identifies the speaker based on the speaker identification parameter (D), and returns the position information (E) of the sound source associated with the speaker ID in the current time frame to the information processing apparatus 900.

The information processing apparatus 900 uses the sound source position information (E) sent back from the cloud 901 to update the recorded content of the sound source position history storage unit 107, and transmits the information (F) on the sound source positions in past time frames to the cloud 901.

On the cloud 901 side, a sound source to be suppressed is determined from the position information of the sound source at the current time and the position information of the sound source in the past time frame. Then, the speech waveform data (C) is synthesized using the microphone array parameters for suppressing the sound signal from the sound source at the corresponding position, and the speech data of the section that seems to be human speech is subjected to speech recognition and semantic analysis. Then, a response content (G) for the user operation request is generated and transmitted to the information processing apparatus 900. On the information processing apparatus 900 side, the response content received from the cloud 901 is output by a method such as voice, text, or screen display.

If the function of the sound source position history storage unit 107 is also arranged on the cloud 901 side, the load related to the communication processing shown in FIGS. is reduced, namely the transmission of the sound source position information (E) in the current time frame from the cloud 901 to the information processing apparatus 900, and the transmission of the speaker identification parameters (D) and the sound source position information (F) in past time frames from the information processing apparatus 900 to the cloud 901. However, if all the sound source position history information of every client (information processing apparatus) is managed on the cloud server side, the amount of information that must be managed grows as the number of clients connected to the cloud server increases, and it becomes difficult to process access from clients without delay. Therefore, when the number of clients connected to the cloud server is large, it is preferable to arrange the function of the sound source position history storage unit 107 in each client as shown in FIG. Similarly, in the configuration example shown in FIG. 8, it is preferable to arrange the function of the sound source position history storage unit 107 in each client.

Further, if the information processing apparatus 900 transmits the sound waveform data (C), the speaker identification parameter (D), and the sound source position information (F) in past time frames to the cloud 901 together, and receives the sound source position information (E) in the current time frame and the response content (G) together from the cloud 901, the communication overhead can be reduced.
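The exchange between the information processing apparatus 900 and the cloud 901, and the effect of sending the items together, can be sketched as a simple grouping by direction. The labels (C) to (G) follow the text; the tuple layout, the direction strings, and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (label, direction, description); descriptions paraphrase the text above.
MESSAGES: List[Tuple[str, str, str]] = [
    ("C", "device->cloud", "sound waveform data for each microphone"),
    ("D", "device->cloud", "speaker identification parameters"),
    ("E", "cloud->device", "sound source position information, current time frame"),
    ("F", "device->cloud", "sound source position information, past time frames"),
    ("G", "cloud->device", "response content"),
]


def bundle_by_direction(messages: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group message labels so that each direction needs a single transmission."""
    bundles: Dict[str, List[str]] = defaultdict(list)
    for label, direction, _description in messages:
        bundles[direction].append(label)
    return dict(bundles)
```

Sent individually, the five items require five transmissions; grouped as above, (C), (D), and (F) travel in one upload and (E) and (G) in one download, which is the reduction in overhead described in the text.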

As described above, the technology disclosed in this specification has been described in detail with reference to specific embodiments. However, it is obvious that those skilled in the art can make modifications and substitutions of the embodiments without departing from the scope of the technology disclosed in this specification.

The technology disclosed in this specification can be applied to various stationary devices used indoors that receive a user's utterance as an operation command, such as a voice agent, to distinguish the voice uttered by the user from other sounds (interfering sounds) and to prevent unintended malfunction of the device due to interfering sounds. Needless to say, the technology disclosed in this specification can be applied in the same manner to a mobile device such as a dialogue robot, and the voice uttered by the user can be suitably distinguished at the device's current position in the room or while the device is stationary in the room. In addition, the technology disclosed in this specification can be applied to a telephone device such as a videophone to suppress disturbing sounds other than the user's speech.

In short, the technology disclosed in the present specification has been described in the form of examples, and the description content of the present specification should not be interpreted in a limited manner. In order to determine the gist of the technology disclosed in this specification, the claims should be taken into consideration.

Note that the technology disclosed in the present specification can also be configured as follows.
(1) a processing unit that processes history information of a sound source position estimated from sound signals collected from a plurality of microphones for each time frame;
A determination unit that determines a sound source to be suppressed based on history information of each sound source position;
An information processing apparatus comprising:
(2) The determination unit determines a sound source to be suppressed based on the distribution of time sections in which the sound source exists and the spatial distribution of the sound source, which are estimated based on the history information.
The information processing apparatus according to (1) above.
(3) The determination unit determines a sound source having a long sound duration as a sound source to be suppressed.
The information processing apparatus according to any one of (1) or (2) above.
(4) The determination unit determines a sound source having a long sound duration and a small estimated position variation as a sound source to be suppressed.
The information processing apparatus according to any one of (1) or (2) above.
(5) a sound source position estimation unit that estimates the direction of a sound source from sound signals collected from a plurality of microphones and separates a sound waveform from each sound source;
A sound source position history storage unit that holds the sound source position of the time frame estimated by the sound source position estimation unit;
Further comprising
The determination unit performs statistical processing on information of a sound source position for each time frame held by the sound source position history storage unit, and determines a sound source to be suppressed.
The information processing apparatus according to any one of (1) to (4).
(6) The sound source position history storage unit stores estimated position information of each sound source, sound output start time, and information related to the sound duration time.
The information processing apparatus according to (5) above.
(7) a speaker identification unit for identifying a speaker of a human voice included in each sound source;
The sound source position history storage unit stores the estimated position of each sound source in association with speaker identification information.
The information processing apparatus according to any of (5) or (6) above.
(8) a recording sound source selection unit that calculates the power of each sound source from the sound waveform obtained from the sound source position estimation unit and selectively outputs the position information of the sound source equal to or greater than a threshold value together with time information;
The information processing apparatus according to any one of (5) to (7) above.
(9) a movement detection unit that detects that the information processing apparatus main body has moved;
In response to detecting the movement of the information processing apparatus main body by the movement detection unit, the information held by the sound source position history storage unit is cleared.
The information processing apparatus according to any one of (5) to (8).
(10) It further comprises a beamform unit for adjusting beamform parameters in the plurality of microphones so as to suppress sound signals from the sound source to be suppressed.
The information processing apparatus according to any one of (1) to (9) above.
(11) A suppression effect adjustment unit that adjusts an effect of suppressing the sound source to be suppressed by the beamform unit is further provided.
The information processing apparatus according to (10) above.
(12) a voice section detection unit that cuts out a section that seems to be human speech from the sound data created by the beamform unit;
A speech recognition unit that converts the utterance of the section extracted by the speech section detection unit into text,
A semantic analysis unit that analyzes the user's utterance converted into text by the voice recognition unit and extracts the operation request made by the user and the parameters required for realizing it;
A response generation unit that generates a response that satisfies the user's operation request based on the operation request and the parameters;
Further comprising
The information processing apparatus according to any one of (10) or (11) above.
(13) It further includes an output unit that outputs the response generated by the response generation unit.
The information processing apparatus according to (12) above.
(14) A service providing unit that provides the response generation unit with information necessary to generate a response.
The information processing apparatus according to any one of (12) and (13).
(15) The determination unit determines that a plurality of sound sources whose estimated position deviation over a predetermined time frame is within a first threshold and whose acoustic feature amounts are similar are the same sound source, determines that a plurality of sound sources whose acoustic feature amounts are not similar are different sound sources even if the estimated position deviation over the predetermined time frame is within the first threshold, and stores information on each sound source in the sound source position history storage unit.
The information processing apparatus according to (6) above.
(16) The determination unit determines, as the sound source to be suppressed, a sound source that has continued for a predetermined time or more and whose estimated sound source position variance is equal to or less than a second threshold, among the sound source position information stored in the sound source position history storage unit.
The information processing apparatus according to (15) above.
(17) As the first threshold value or the second threshold value, a distance that a person moves during one time frame is set.
The information processing apparatus according to any one of (15) or (16) above.
(18) The apparatus further includes the plurality of microphones.
The information processing apparatus according to any one of (1) to (17).
(19) The determination unit determines a sound source to be suppressed according to an utterance section in which a voice having characteristics similar to a human voice is detected or a predetermined activation word is detected.
The information processing apparatus according to any one of (1) to (18).
(20) a processing step of processing history information of a sound source position estimated from audio signals collected from a plurality of microphones for each time frame;
A determination step for determining a sound source to be suppressed based on history information of each sound source position;
An information processing method comprising:

DESCRIPTION OF SYMBOLS
100 ... Information processing apparatus
101 ... Plurality of microphones
102 ... AD conversion unit
103 ... Sound source position estimation unit
104 ... Recorded sound source selection unit
105 ... Speaker identification unit
106 ... Sound source statistical information processing unit
107 ... Sound source position history storage unit
108 ... Movement detection unit
109 ... Beamform unit
110 ... Suppression effect adjustment unit
111 ... Speech section detection unit
112 ... Speech recognition unit
113 ... Semantic analysis unit
114 ... Response generation unit
115 ... Service providing unit
116 ... Speaker

Claims

A processing unit for processing history information of sound source positions estimated from audio signals collected from a plurality of microphones for each time frame;
A determination unit that determines a sound source to be suppressed based on history information of each sound source position;
An information processing apparatus comprising:
The determination unit determines a sound source to be suppressed based on the distribution of time intervals in which the sound source exists and the spatial distribution of the sound source, which are estimated based on the history information.
The information processing apparatus according to claim 1.
The determination unit determines a sound source having a long sound duration as a sound source to be suppressed,
The information processing apparatus according to claim 1.
The determination unit determines a sound source having a long sound duration and a small estimated position variation as a sound source to be suppressed.
The information processing apparatus according to claim 1.
A sound source position estimation unit that estimates a direction of a sound source from sound signals collected from a plurality of microphones and separates a sound waveform from each sound source;
A sound source position history storage unit that holds the sound source position of the time frame estimated by the sound source position estimation unit;
Further comprising
The determination unit performs statistical processing on information of a sound source position for each time frame held by the sound source position history storage unit, and determines a sound source to be suppressed.
The information processing apparatus according to claim 1.
The sound source position history storage unit stores information on the estimated position information of each sound source, sound output start time, and sound duration time length,
The information processing apparatus according to claim 5.
A speaker identification unit for identifying a speaker of a human voice included in each sound source;
The sound source position history storage unit stores the estimated position of each sound source in association with speaker identification information.
The information processing apparatus according to claim 5.
A sound source selection unit that calculates the power of each sound source from the sound waveform obtained from the sound source position estimation unit and selectively outputs the position information of the sound source equal to or greater than a threshold value together with time information,
The information processing apparatus according to claim 5.
A movement detection unit for detecting that the information processing apparatus main body has moved;
In response to detecting the movement of the information processing apparatus main body by the movement detection unit, the information held by the sound source position history storage unit is cleared.
The information processing apparatus according to claim 5.
A beamform unit for adjusting a beamform parameter in the plurality of microphones so as to suppress a sound signal from the sound source to be suppressed;
The information processing apparatus according to claim 1.
A suppression effect adjustment unit that adjusts an effect of suppressing the sound source to be suppressed by the beamform unit;
The information processing apparatus according to claim 10.
A voice section detector that cuts out a section that seems to be human speech from the sound data created in the beamform section;
A speech recognition unit that converts the utterance of the section extracted by the speech section detection unit into text,
A semantic analysis unit that analyzes the user's utterance converted into text by the voice recognition unit and extracts the operation request made by the user and the parameters required for realizing it;
A response generation unit that generates a response that satisfies the user's operation request based on the operation request and the parameters;
Further comprising
The information processing apparatus according to claim 10.
An output unit that outputs the response generated by the response generation unit;
The information processing apparatus according to claim 12.
A service provider that provides the response generator with information necessary to generate a response;
The information processing apparatus according to claim 12.
The determination unit determines that a plurality of sound sources whose estimated position deviation over a predetermined time frame is within a first threshold and whose acoustic feature amounts are similar are the same sound source, determines that a plurality of sound sources whose acoustic feature amounts are not similar are different sound sources even if the estimated position deviation over the predetermined time frame is within the first threshold, and stores information on each sound source in the sound source position history storage unit.
The information processing apparatus according to claim 6.
The determination unit determines, as the sound source to be suppressed, a sound source that has continued for a predetermined time or more and whose estimated sound source position variance is equal to or less than a second threshold, among the sound source position information stored in the sound source position history storage unit.
The information processing apparatus according to claim 15.
As the first threshold value or the second threshold value, a distance that a person moves during one time frame is set.
The information processing apparatus according to claim 16.
Further comprising the plurality of microphones;
The information processing apparatus according to claim 1.
The determination unit determines a sound source to be suppressed according to an utterance period in which a voice having characteristics close to human voice is detected or a predetermined activation word is detected.
The information processing apparatus according to claim 1.
Processing steps for processing history information of sound source positions estimated from audio signals collected from a plurality of microphones for each time frame;
A determination step for determining a sound source to be suppressed based on history information of each sound source position;
An information processing method comprising: