CN111754986A - Voice input device, voice input method, and recording medium - Google Patents


Info

Publication number
CN111754986A
CN111754986A
Authority
CN
China
Prior art keywords
speaker
time
unit
input
trigger
Prior art date
Legal status
Pending
Application number
CN202010206519.0A
Other languages
Chinese (zh)
Inventor
西川刚树
Current Assignee
Panasonic Intellectual Property Corp of America
Original Assignee
Panasonic Intellectual Property Corp of America
Priority date
Filing date
Publication date
Application filed by Panasonic Intellectual Property Corp of America
Publication of CN111754986A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 2015/088: Word spotting
    • G10L 15/26: Speech to text systems
    • G10L 17/00: Speaker identification or verification
    • G10L 17/22: Interactive procedures; Man-machine interfaces

Abstract

A voice input device, a voice input method, and a recording medium. A speaker recognition device (1) includes: an acquisition unit (21) that acquires the voice of each utterance made by 1 or more speakers; a storage unit (22) that stores the voice of each utterance of the 1 or more speakers acquired by the acquisition unit (21); a trigger input unit (23) to which a trigger is input; an utterance start detection unit (24) that, each time a trigger is input via the trigger input unit (23), detects the position at which an utterance starts in each voice stored in the storage unit (22); and a speaker recognition unit (26) that identifies one of the 1 or more speakers based on at least a 1st time, at which the trigger is input via the trigger input unit (23), and a 2nd time, at which the utterance whose start position was detected by the utterance start detection unit (24) begins.

Description

Voice input device, voice input method, and recording medium
Technical Field
The present disclosure relates to a voice input device, a voice input method, and a recording medium.
Background
For example, patent document 1 discloses a speech recognition device including: voice input start operation means that enables a voice input operation through an operation by a user; voice input means that acquires the user's voice; utterance start time learning data holding means that holds an utterance start learning time obtained by learning the time from when the user operates the voice input start operation means until the user actually starts speaking; and speech recognition means that compares a measured time with the utterance start learning time held by the utterance start time learning data holding means, determines whether the voice whose time was measured is input speech of the user, and performs speech recognition when it is.
According to this speech recognition device, it is possible to recognize whether or not a speech is made by a user by learning for each user and using the learned utterance start time.
Prior art documents
Patent document
Patent document 1: japanese patent laid-open No. 2006-313261
Disclosure of Invention
Problems to be solved by the invention
However, in the technique disclosed in patent document 1, it is necessary to learn in advance a period from a time when the user operates the voice input device until the user actually starts speaking. Therefore, in the conventional speech recognition apparatus, the amount of calculation due to learning may increase.
Accordingly, an object of the present disclosure is to provide a voice input device, a voice input method, and a recording medium that can recognize a speaker by simple processing and suppress an increase in the amount of computation.
Means for solving the problems
A voice input device according to an aspect of the present disclosure includes: an acquisition unit that acquires each speech of 1 or more speakers when speaking; a storage unit that stores the speech of each of the utterances of the 1 or more speakers acquired by the acquisition unit; a trigger input unit to which a trigger is input; an utterance start detection unit that detects a start position of an utterance from each of the voices stored in the storage unit each time the trigger is input by the trigger input unit; and a speaker recognition unit that recognizes one of the 1 or more speakers based on at least a 1 st time point at which the trigger is input by the trigger input unit and a 2 nd time point at which a speech start position detected by the speech start detection unit from each of the voices is started.
Note that these general and specific aspects may be implemented by a system, a method, an integrated circuit, a computer program, a computer-readable recording medium such as a CD-ROM, or any combination thereof.
Effects of the invention
According to the voice input device and the like of the present disclosure, it is possible to recognize a speaker by simple processing and suppress an increase in the amount of calculation.
Drawings
Fig. 1 is a diagram showing an example of an appearance of a speaker recognition apparatus and a use scenario of the speaker recognition apparatus based on speech of a speaker in the embodiment.
Fig. 2A is a block diagram showing an example of a speaker recognition apparatus according to the embodiment.
Fig. 2B is a block diagram showing an example of another speaker recognition apparatus according to the embodiment.
Fig. 3 is a flowchart showing the operation of the speaker recognition apparatus when the 1st speaker speaks.
Fig. 4 is a diagram illustrating the timing of the 1st time and the 2nd time for each voice uttered when the 1st speaker speaks and when the 2nd speaker speaks.
Fig. 5 is a flowchart showing the operation of the speaker recognition apparatus when the 2nd speaker speaks.
Fig. 6 is a flowchart showing an operation in the speaker recognition unit of the speaker recognition device according to the embodiment.
Description of reference numerals:
1 speaker recognition device (voice input device)
21 acquisition unit
22 storage unit
23 trigger input unit
24 utterance start detection unit
25 utterance timing registration unit
26 speaker recognition unit
Detailed Description
A voice input device according to an aspect of the present disclosure includes: an acquisition unit that acquires each speech of 1 or more speakers when speaking; a storage unit that stores the speech of each of the utterances of the 1 or more speakers acquired by the acquisition unit; a trigger input unit to which a trigger is input; an utterance start detection unit that detects a start position of an utterance from each of the voices stored in the storage unit each time the trigger is input by the trigger input unit; and a speaker recognition unit that recognizes one of the 1 or more speakers based on at least a 1 st time point at which the trigger is input by the trigger input unit and a 2 nd time point at which a speech start position detected by the speech start detection unit from each of the voices is started.
This makes it possible to identify a speaker from among the 1 or more speakers based on, for example, the temporal order of the 1st time, at which a trigger from one of the 1 or more speakers is detected, and the 2nd time, at which that speaker starts speaking. That is, it is possible to recognize which of the 1 or more speakers uttered the voice acquired by the acquisition unit without learning the period from the 1st time to the 2nd time.
Therefore, according to the voice input device, the speaker can be recognized by a simple process, and an increase in the amount of calculation can be suppressed.
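The learning-free decision rule described above, which uses only the order of the 1st time (trigger input) and the 2nd time (utterance start), can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function name, enum, and the tie-breaking choice for simultaneous times are assumptions made for the example.

```python
from enum import Enum


class Speaker(Enum):
    FIRST = 1   # the speaker who presses the trigger, then speaks
    SECOND = 2  # the speaker who speaks first; the trigger comes afterwards


def recognize_speaker(trigger_time: float, utterance_start: float) -> Speaker:
    """Compare the 1st time (trigger) with the 2nd time (utterance start).

    No learned interval is needed: only the temporal order of the two
    timestamps decides the speaker, mirroring the rule in the text.
    """
    if utterance_start >= trigger_time:
        return Speaker.FIRST   # utterance began after the trigger
    return Speaker.SECOND      # utterance began before the trigger
```

For example, a trigger at t = 10.0 s followed by speech starting at t = 10.8 s would be attributed to the 1st speaker under this rule.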
In particular, the voice input device can recognize the speaker of a voice from the timing of the utterance relative to the 1st time. Therefore, the speaker can be recognized through a simple operation. Further, because the operation is simple, there is no need to clutter the device with a plurality of buttons and the like. For example, when the trigger input unit is a button, a single button suffices to identify which of the 1 or more speakers is speaking, so the configuration of the voice input device can be simplified.
A voice input method according to another aspect of the present disclosure includes: acquiring the voice of each utterance made by 1 or more speakers; storing the acquired voice of each utterance of the 1 or more speakers in a storage unit; receiving an input of a trigger; detecting, each time the trigger is input, the position at which an utterance starts in each voice stored in the storage unit; and identifying one of the 1 or more speakers based on at least a 1st time at which the trigger is input and a 2nd time at which an utterance detected in the voices begins.
This voice input method also has the same operational effects as the voice input device described above.
A recording medium according to another aspect of the present disclosure is a computer-readable nonvolatile recording medium on which a program for causing a computer to execute a voice input method is recorded.
The recording medium also has the same operational effects as those of the voice input device described above.
A voice input device according to another aspect of the present disclosure further includes an utterance timing registration unit that registers at least which of the 1st time and the 2nd time comes first, and the speaker recognition unit identifies one of the 1 or more speakers based on the 1st time, the 2nd time, and a plurality of pieces of registration information that are registered by the utterance timing registration unit and that indicate the timing of the 2nd time relative to the 1st time.
Thus, the temporal order of the 1st time and the 2nd time can be registered in advance as a condition chosen by each of the 1 or more speakers. The speaker recognition unit can then identify a speaker from among the 1 or more speakers simply by determining which piece of registration information matches the observed temporal order of the 1st time and the 2nd time. As a result, the voice input device can recognize the speaker more reliably with simple processing.
In the voice input device according to another aspect of the present disclosure, when registering the timing of each utterance of the 1 or more speakers, the utterance timing registration unit registers 1st registration information in which 1st time information is associated with one of the 1 or more speakers, the 1st time information indicating that the 2nd time, at which the utterance starts, is later than the 1st time, at which the trigger is input to the trigger input unit; and registers 2nd registration information in which 2nd time information is associated with another of the 1 or more speakers, the 2nd time information indicating that the 2nd time, at which the utterance starts, is earlier than the 1st time.
Thus, the speaker can register a condition that a trigger is input before the start of utterance or a condition that a trigger is input after the start of utterance. As described above, if the speaker registers the condition in advance, the voice input device can easily and reliably recognize the speaker without learning.
In the voice input device according to another aspect of the present disclosure, the speaker recognition unit calculates the timing of the 2nd time relative to the 1st time, compares a result indicating the calculated timing with the plurality of pieces of registration information, determines that the speaker who is speaking is the 1st speaker when the 2nd time is later than the 1st time, and determines that the speaker who is speaking is the 2nd speaker, different from the 1st speaker, when the 2nd time is earlier than the 1st time.
Thus, the speaker recognition unit can calculate the timing of the 2nd time relative to the 1st time from the 1st time, at which the trigger was input, and the 2nd time, detected by the utterance start detection unit. That is, it can produce a result indicating whether the 1st time is earlier or later than the 2nd time. By comparing this result with the plurality of pieces of registration information, the speaker recognition unit can more reliably recognize which of the 1 or more speakers is speaking.
In addition, when there are more speakers, registering, for example, the length of the period from the 1st time to the 2nd time for each speaker makes it possible to identify which speaker is speaking even among a plurality of speakers.
In the voice input device according to another aspect of the present disclosure, the trigger input unit is a voice input interface that accepts input of a preset voice, and the preset voice is input to the trigger input unit as the trigger.
Thus, the voice input device can perform both wake-word (magic word) recognition and speaker recognition merely by the speaker uttering a preset voice such as a wake-up word. The voice input device therefore offers excellent operability.
In the voice input device according to another aspect of the present disclosure, the trigger input unit is an operation button provided in the voice input device, and the accepted operation input is input to the trigger input unit as the trigger.
Thus, the speaker can operate the trigger input unit to reliably input the trigger to the trigger input unit.
Note that these general and specific aspects may be implemented by a system, a method, an integrated circuit, a computer program, a computer-readable recording medium such as a CD-ROM, or any combination thereof.
The embodiments described below all show a specific example of the present disclosure. The numerical values, shapes, materials, constituent elements, arrangement positions and connection modes of the constituent elements, steps, order of the steps, and the like shown in the following embodiments are examples, and are not intended to limit the present disclosure. Further, among the components in the following embodiments, components not recited in the independent claims are described as arbitrary components. In all the embodiments, the respective contents can be combined.
Hereinafter, a voice input device, a voice input method, and a recording medium according to one embodiment of the present disclosure will be described in detail with reference to the drawings.
(Embodiment)
<Configuration: Speaker recognition apparatus 1>
Fig. 1 is a diagram showing an example of an appearance of the speaker recognition apparatus 1 and a use situation of the speaker recognition apparatus 1 based on speech of a speaker in the embodiment. The following situation is illustrated in fig. 1: the speaker recognition apparatus 1 is shared by a plurality of speakers, and the speaker recognition apparatus 1 is used for speaking.
As shown in fig. 1, the speaker recognition apparatus 1 is an apparatus that acquires voices uttered by 1 or more speakers and recognizes, from each acquired voice, which of the 1 or more speakers uttered it. That is, the speaker recognition apparatus 1 acquires each voice uttered by each of the 1 or more speakers and recognizes the speaker of each acquired voice. The speaker recognition apparatus 1 is an example of a voice input apparatus.
The speaker recognition apparatus 1 may acquire a conversation between the speaker and the conversation target, and recognize which speaker is the speaker or the conversation target based on the acquired conversation.
In the present embodiment, the speaker recognition apparatus 1 acquires each speech uttered by each of 1 or more speakers, and recognizes the speaker based on each timing (timing) of the acquired speech and the inputted trigger.
Fig. 1 of the present embodiment illustrates the following situation: the speaker recognition apparatus 1 is used by each of a plurality of speakers, a 1st speaker and a 2nd speaker, and each speaker makes an utterance. For example, the speaker recognition apparatus 1 shown by the two-dot chain line may be used by the 2nd speaker after speech recognition for the 1st speaker has finished. That is, the speaker recognition apparatus 1 may be used by each speaker in turn, or may be used by both the 1st speaker and the 2nd speaker during a conversation. The 1st speaker and the 2nd speaker are examples of speakers. Furthermore, the 2nd speaker may be a conversation partner of the 1st speaker.
Here, the 1 st speaker and the 2 nd speaker may speak in the same language or may speak across 2 different languages. In this case, the speaker recognition apparatus 1 recognizes whether the 1 st speaker or the 2 nd speaker is present for each speech uttered by the 1 st speaker and the 2 nd speaker between 2 languages, which are the same or different, of the 1 st language uttered by the 1 st speaker and the 2 nd language uttered by the 2 nd speaker. For example, the 1 st language and the 2 nd language are japanese, english, french, german, chinese, and the like.
In the present embodiment, the 1 st speaker is the owner of the speaker recognition apparatus 1, and the input of the trigger to the speaker recognition apparatus 1 and the registration of the timing of the utterance of the speaker with respect to the input trigger are mainly performed by the 1 st speaker. That is, the 1 st speaker is a user of the speaker recognition apparatus 1 who understands the operation method of the speaker recognition apparatus 1.
In the present embodiment, when a speaker inputs the trigger to the speaker recognition apparatus 1 and then speaks, the speaker recognition apparatus 1 recognizes that, for example, the 1st speaker has spoken. When the trigger is input after another speaker has spoken, the speaker recognition apparatus 1 recognizes that, for example, the 2nd speaker has spoken.
The speaker recognition apparatus 1 is a portable terminal such as a smartphone or a tablet terminal that can be carried by the 1 st speaker.
Fig. 2A is a block diagram showing the speaker recognition apparatus 1 according to the embodiment.
As shown in fig. 2A, the speaker recognition apparatus 1 includes an utterance timing registration unit 25, an acquisition unit 21, a storage unit 22, a trigger input unit 23, an utterance start detection unit 24, a speaker recognition unit 26, an output unit 31, and a power supply unit 35.
[Utterance timing registration unit 25]
The utterance timing registration unit 25 registers at least which of the 1st time and the 2nd time comes first. Specifically, the utterance timing registration unit 25 is a registration device that registers the timing of each utterance of the 1 or more speakers relative to the input of the trigger.
The utterance timing registration unit 25 allows the 1 or more speakers to set a desired condition by operating it, and registers the set condition. Specifically, when registering the timing of each utterance of the 1 or more speakers, the utterance timing registration unit 25 registers 1st registration information in which 1st time information is associated with one of the 1 or more speakers, the 1st time information indicating that the 2nd time, at which the utterance starts, is later than the 1st time, at which the trigger is input to the trigger input unit 23. As a specific example, a condition that the 1st speaker starts speaking after the trigger is input to the trigger input unit 23 is set, and the utterance timing registration unit 25 registers the 1st registration information, in which the 1st time information indicating the set condition is associated with tag A. The utterance timing registration unit 25 includes a memory and stores the set 1st registration information. The 1st registration information may instead be stored in the storage unit 22.
When registering the timing of each utterance, the utterance timing registration unit 25 also registers 2nd registration information in which 2nd time information, indicating that the 2nd time, at which the utterance starts, is earlier than the 1st time, at which the trigger is input to the trigger input unit 23, is associated with another of the 1 or more speakers. As a specific example, a condition that the 2nd speaker starts speaking before the trigger is input to the trigger input unit 23 is set, and the utterance timing registration unit 25 registers the 2nd registration information, in which the 2nd time information indicating the set condition is associated with tag B. The utterance timing registration unit 25 includes a memory and stores the set 2nd registration information. The 2nd registration information may instead be stored in the storage unit 22.
For example, the 1st speaker speaks under the condition of the 1st registration information set for tag A, and the 2nd speaker, prompted by the 1st speaker, speaks under the condition of the 2nd registration information set for tag B (which condition each speaker uses is agreed in advance between the 1st speaker and the 2nd speaker). In this way, different speakers speak under different conditions. Therefore, once the condition for each utterance has been registered in the utterance timing registration unit 25, it serves as a basis on which the speaker recognition unit 26 performs speaker recognition.
The utterance timing registration unit 25 outputs the plurality of pieces of registration information, such as the 1st registration information and the 2nd registration information, to the speaker recognition unit 26.
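The registration information handed from the registration unit to the recognition unit might be modeled as follows. The tags "A" and "B" and the trigger-before-speech / speech-before-trigger conditions come from the description; the data structure, field names, and lookup function are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistrationInfo:
    tag: str                     # e.g. tag "A" or tag "B" in the description
    speaker: str                 # speaker associated with this condition
    speech_after_trigger: bool   # True: 2nd time is later than the 1st time


# Hypothetical registry mirroring the 1st and 2nd registration information
REGISTRY = [
    RegistrationInfo("A", "1st speaker", speech_after_trigger=True),
    RegistrationInfo("B", "2nd speaker", speech_after_trigger=False),
]


def match(trigger_time: float, utterance_start: float) -> str:
    """Return the speaker whose registered condition matches the
    observed order of the 1st time and the 2nd time."""
    observed = utterance_start >= trigger_time
    for info in REGISTRY:
        if info.speech_after_trigger == observed:
            return info.speaker
    raise LookupError("no registered condition matches")
```

With this registry, recognition reduces to a single table lookup on the observed temporal order, which is the "determination material" role the text assigns to the registered conditions.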
The utterance timing registration unit 25 can also set the period from the 1st time, at which the trigger is input to the trigger input unit 23, to the 2nd time, at which the speaker starts speaking. That is, the utterance timing registration unit 25 may register, as registration information, a condition that the speaker starts speaking ○ seconds after, or within ○ seconds of, the 1st time at which the trigger is input to the trigger input unit 23. Likewise, the utterance timing registration unit 25 may register, as registration information, a condition that the trigger is input to the trigger input unit 23 ○ seconds after, or within ○ seconds of, the time at which the speaker starts speaking. In other words, the utterance timing registration unit 25 may set the 2nd time to be ○ seconds after, or within ○ seconds of, the 1st time, or set the 1st time to be ○ seconds after, or within ○ seconds of, the 2nd time, and register the set information as registration information. Here, "○" is an arbitrary number and does not necessarily indicate the same value each time.
Note that the utterance timing registration unit 25 may also register, as registration information, the length of time for which the trigger is continuously input to the trigger input unit 23. For example, when the trigger input unit 23 is an operation button and the utterance timing registration unit 25 registers how long the button is held down (that is, how long the trigger is continuously input to the trigger input unit 23) together with the timing of the speaker's utterance, the speaker recognition unit 26 can also use the registered hold time as a basis for recognizing the speaker.
For example, the utterance timing registration unit 25 may register, as registration information, a condition that, ○ seconds after or within ○ seconds of the 1st time at which the trigger is input to the trigger input unit 23, the trigger is continuously input to the trigger input unit 23 for ○ seconds. The utterance timing registration unit 25 may likewise register, as registration information, a condition that, ○ seconds after or within ○ seconds of the start of the speaker's utterance, the trigger is continuously input to the trigger input unit 23 for ○ seconds.
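A richer registered condition of the kind just described, combining a time offset between the 1st and 2nd times with a required hold (long-press) duration of the trigger button, might look like this. The class name, field names, and all numeric values are invented for the example; the description only states that such offsets and hold times can be registered.

```python
from dataclasses import dataclass


@dataclass
class TimingCondition:
    """Hypothetical registered condition: the utterance must start within
    a window relative to the trigger, and the trigger button must be
    held for at least a minimum duration."""
    speaker: str
    min_offset_s: float      # minimum allowed (2nd time - 1st time)
    max_offset_s: float      # maximum allowed (2nd time - 1st time)
    min_press_s: float = 0.0 # minimum continuous trigger-input duration

    def matches(self, t1: float, t2: float, press_s: float) -> bool:
        offset = t2 - t1
        return (self.min_offset_s <= offset <= self.max_offset_s
                and press_s >= self.min_press_s)


# Example condition: speech starts 2 to 4 s after the trigger,
# and the button is held for at least 1 s.
cond = TimingCondition("3rd speaker", min_offset_s=2.0, max_offset_s=4.0,
                       min_press_s=1.0)
```

A negative `min_offset_s` would express the speech-before-trigger case in the same structure, which is one way such conditions could scale to more than two speakers.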
[Acquisition unit 21]
The acquisition unit 21 acquires speech when 1 or more speakers speak. That is, the acquisition unit 21 acquires speech uttered by each of 1 or more speakers, converts the speech uttered by the acquired speaker into a speech signal, and outputs the converted speech signal to the storage unit 22.
The acquisition unit 21 is a microphone unit that acquires a voice signal by converting voice into a voice signal. The acquisition unit 21 may be an input interface electrically connected to a microphone. That is, the acquisition unit 21 may acquire the voice signal from a microphone. The acquisition unit 21 may be a microphone array unit including a plurality of microphones. The acquisition unit 21 is not particularly limited as long as it can collect the voice of the speaker present around the speaker recognition device 1, and the arrangement of the acquisition unit 21 in the speaker recognition device 1 is not particularly limited.
[Storage unit 22]
The storage unit 22 stores the speech information of each speech of 1 or more speakers acquired by the acquisition unit 21. Specifically, the storage unit 22 stores the speech information of the speech represented by the speech signal acquired by the acquisition unit 21. That is, the storage unit 22 automatically stores speech information of speech uttered by each of 1 or more speakers.
The storage unit 22 starts recording when the speaker recognition device 1 is activated. Alternatively, the storage unit 22 may start recording from the moment the speaker first inputs the trigger to the trigger input unit 23 after the speaker recognition device 1 is activated; that is, recording of voice may begin with the speaker's first trigger input. The recording of voice may also be paused or stopped by inputting a trigger to the trigger input unit 23.
Further, since the capacity of the storage unit 22 is limited, once the stored voice information reaches a predetermined capacity it may be deleted automatically, starting from the oldest voice data. To this end, each piece of voice information may carry, in addition to the speaker's voice, information indicating its date and time (a time stamp), and the storage unit 22 deletes older voice information based on this time stamp.
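The timestamped, oldest-first-eviction buffering described here can be sketched with a bounded deque. The chunk format, the capacity limit in chunks rather than bytes, and the class and method names are all assumptions for illustration.

```python
import time
from collections import deque
from typing import Optional


class SpeechStore:
    """Sketch of the storage unit: keeps timestamped audio chunks and
    automatically drops the oldest ones once a capacity limit is
    reached, as the description suggests."""

    def __init__(self, max_chunks: int = 1000):
        # deque with maxlen evicts the oldest entry on overflow
        self.chunks = deque(maxlen=max_chunks)

    def append(self, samples: bytes, timestamp: Optional[float] = None):
        """Store one audio chunk together with its time stamp."""
        t = timestamp if timestamp is not None else time.time()
        self.chunks.append((t, samples))

    def since(self, t: float):
        """Return chunks recorded at or after time t (e.g. the 1st time),
        which is what the utterance start detector would examine."""
        return [c for c in self.chunks if c[0] >= t]
```

A real device would likely bound the buffer by bytes or seconds of audio rather than chunk count, but the eviction behavior is the same.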
The storage unit 22 is formed of an HDD (Hard Disk Drive) or a semiconductor memory.
[Trigger input unit 23]
The trigger is input to the trigger input unit 23 by the speaker. In a specific example, the trigger input unit 23 receives an input of a preset trigger from the speaker before the 1 st speaker speaks, for example. The trigger input unit 23 receives an input of a preset trigger from the speaker after the 2 nd speaker speaks, for example. That is, the trigger input unit 23 receives an input of a trigger before the 1 st speaker speaks in the case of the 1 st speaker, and receives an input of a trigger after the 2 nd speaker speaks in the case of the 2 nd speaker. The trigger input unit 23 receives an input of a trigger from a speaker every time 1 or more speakers speak.
The trigger input unit 23 may start recording of the voice to the storage unit 22 by an operation input from the speaker, or may stop or stop recording of the voice to the storage unit 22.
The trigger input unit 23 generates an input signal if detecting an input trigger, and outputs the generated input signal to the utterance start detection unit 24 and the speaker recognition unit 26. The input signal includes information (time stamp) indicating the 1 st time.
In the present embodiment, the trigger input unit 23 is a single operation button provided on the speaker recognition device 1. In this case, the operation input generated when the speaker presses the operation button is received as the trigger input to the trigger input unit 23. That is, in the present embodiment, the trigger is an input signal that the speaker supplies to the trigger input unit 23. Two or more trigger input units 23 may also be provided on the speaker recognition device 1.
The trigger input unit 23 may be a touch sensor provided integrally with the display unit 33 of the speaker recognition device 1. In this case, the trigger input unit 23, which is an operation button for receiving an operation input of the speaker, may be displayed on the display unit 33 of the speaker recognition apparatus 1.
Fig. 2B is a block diagram showing an example of another speaker recognition apparatus 1 according to the embodiment.
As shown in fig. 2B, the trigger input unit 23a may be a voice input interface that receives input of a preset voice. In this case, the preset voice is input as the trigger to the trigger input unit 23a via the acquisition unit 21a. That is, the voice uttered by the speaker and input to the trigger input unit 23a serves as the input signal that acts as the trigger. Here, the preset voice is a wake-up word or the like. If the speaker recognition apparatus 1 is set in advance so that, for example, the wake-up word "OK, ○○!" corresponds to the 1st speaker and the wake-up word "OK, ××!" corresponds to the 2nd speaker, then a speaker who says "OK, ○○!" is recognized as the 1st speaker and a speaker who says "OK, ××!" is recognized as the 2nd speaker. Further, when the trigger input unit 23a is a voice input interface, assigning a speaker to each voice content makes it possible to reliably distinguish the 1st speaker from the 2nd speaker.
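The wake-word-to-speaker mapping described for the voice trigger can be sketched as a simple prefix lookup on a recognized transcript. The actual phrases are placeholders in the description, so the strings below ("ok alpha", "ok beta") are invented; real systems would match on audio, not text.

```python
from typing import Optional

# Hypothetical wake-word table: each preset phrase is bound to a speaker.
WAKE_WORDS = {
    "ok alpha": "1st speaker",
    "ok beta": "2nd speaker",
}


def speaker_from_wake_word(utterance: str) -> Optional[str]:
    """Return the speaker registered for the wake word that the
    utterance begins with, or None if no registered wake word matches."""
    text = utterance.lower().strip()
    for phrase, speaker in WAKE_WORDS.items():
        if text.startswith(phrase):
            return speaker
    return None
```

This illustrates why a voice trigger lets the device identify the speaker directly from the trigger content, independently of the timing rule used for the button trigger.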
[ utterance start detection unit 24]
As shown in fig. 1 and 2A, the utterance start detection unit 24 is a detection device that detects, every time a trigger is input to the trigger input unit 23, the start position at which an utterance starts from each of the voices stored in the storage unit 22.
Specifically, the utterance start detection unit 24 detects, from among the utterances of the pieces of speech information stored in the storage unit 22, the start position of the speech indicated by the speech information uttered and stored by the 1st speaker, whose utterance starts between the 1st time, at which the speaker inputs the trigger to the trigger input unit 23, and the time at which a predetermined period has elapsed. That is, the utterance start detection unit 24 detects the start position, i.e., the 2nd time at which the speech uttered by the 1st speaker starts, between the 1st time at which the trigger input unit 23 detects the input of the trigger and the time at which the predetermined period has elapsed.
Likewise, the utterance start detection unit 24 detects, from among the utterances of the pieces of speech information stored in the storage unit 22, the start position of the speech indicated by the speech information uttered and stored by the 2nd speaker, whose utterance starts between a time earlier than the 1st time by the predetermined period and the 1st time, at which the speaker inputs the trigger to the trigger input unit 23. That is, the utterance start detection unit 24 detects the start position, i.e., the 2nd time at which the speech uttered by the 2nd speaker starts, between the time earlier than the 1st time by the predetermined period and the 1st time.
The utterance start detection unit 24 generates, for each speech, start position information indicating the start position of the speech, and outputs the generated start position information to the speaker recognition unit 26. The start position information is information (a time stamp) indicating the start position, i.e., the time at which the speech uttered by the speaker starts.
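A minimal sketch of the start-position detection described above, assuming the stored speech has already been reduced to per-frame RMS energies with time stamps. The frame representation, the energy threshold, and the function name are illustrative assumptions; the search window around the trigger time follows the description.

```python
def detect_utterance_start(frame_times, frame_energies, trigger_time,
                           period, threshold=0.01):
    """Return the time stamp (2nd time) of the first frame whose energy
    exceeds `threshold` inside the window
    [trigger_time - period, trigger_time + period], or None if no
    utterance start is found in that window."""
    for t, energy in zip(frame_times, frame_energies):
        if trigger_time - period <= t <= trigger_time + period \
                and energy > threshold:
            return t
    return None
```

The returned time stamp plays the role of the start position information that is handed to the speaker recognition unit 26.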
[ speaker recognition unit 26]
The speaker recognition unit 26 is a device that recognizes one speaker from among the 1 or more speakers based on the 1st time at which the trigger is input to the trigger input unit 23, the 2nd time at which the utterance start detection unit 24 detects the start position of the utterance from each voice, and a plurality of pieces of registration information, registered by the utterance timing registration unit 25, indicating the timing of the 2nd time with respect to the 1st time.
Specifically, when the speaker recognition unit 26 acquires the input signal indicating the 1st time from the trigger input unit 23 and the start position information from the utterance start detection unit 24, it calculates the timing of the 2nd time with respect to the 1st time. That is, the speaker recognition unit 26 compares the 2nd time indicated by the start position information with the 1st time indicated by the input signal to determine their temporal order. The result calculated by the speaker recognition unit 26 indicates the timing of the 2nd time with respect to the 1st time.
In addition, when the speaker recognition unit 26 acquires the registration information from the utterance timing registration unit 25, it compares the result indicating the timing of the 2nd time with respect to the 1st time against the plurality of pieces of registration information. When the 2nd time is later than the 1st time, the speaker recognition unit 26 determines that the speaker who uttered is the 1st speaker, thereby specifying the speaker. Conversely, when the 2nd time is earlier than the 1st time, the speaker recognition unit 26 determines that the speaker who uttered is the 2nd speaker, thereby specifying the speaker.
More specifically, the speaker recognition unit 26 determines which speaker uttered based on each speech uttered by the 1 or more speakers within the predetermined period before and after the 1st time at which the input of the trigger is received by the trigger input unit 23. With the 1st time as a base point, the speaker recognition unit 26 selects the most recent speech uttered by a speaker from among the speeches stored in the storage unit 22 between a time earlier than the 1st time by the predetermined period and the 1st time, or between the 1st time and the time at which the predetermined period has elapsed. The speaker recognition unit 26 then recognizes the speaker by using the selected speech.
Here, the predetermined period may be a few seconds, for example 1 or 2 seconds, or may be, for example, 10 seconds. The speaker recognition unit 26 thus recognizes the speaker based on the 1st time and the 2nd time of the speech most recently uttered by the 1 or more speakers. This avoids the following problem: if the speaker recognition unit 26 were to recognize the speaker based on an earlier speech, it could not accurately identify the speaker who uttered most recently.
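The selection-and-comparison logic above can be sketched as follows. The rule that the most recent utterance start within the predetermined period is chosen, and the before/after decision, follow the description; the function and variable names and the speaker labels are assumptions.

```python
def recognize_speaker(trigger_time, utterance_starts, period):
    """trigger_time: the 1st time; utterance_starts: detected 2nd times.
    Select the most recent utterance start within `period` seconds of the
    trigger, then decide the speaker from the temporal order: an utterance
    starting after the trigger belongs to the 1st speaker, one starting
    before it belongs to the 2nd speaker."""
    candidates = [t2 for t2 in utterance_starts
                  if abs(t2 - trigger_time) <= period]
    if not candidates:
        return None  # no recent speech to judge
    t2 = max(candidates)  # most recent start position
    return "1st speaker" if t2 > trigger_time else "2nd speaker"
```

For example, with a trigger at t = 10 s and a 2-second period, an utterance starting at 10.8 s is attributed to the 1st speaker, while one starting at 9.2 s is attributed to the 2nd speaker.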
The speaker recognition unit 26 outputs result information including a result of recognizing the speaker to the output unit 31. The result information includes information indicating one speaker identified from among 1 or more speakers. For example, the result information includes: the speech information stored by the utterance of the speaker is information indicating the recognized 1 st speaker, or the speech information stored by the utterance of the speaker is information indicating the recognized 2 nd speaker.
[ display part 33]
The display unit 33 is a monitor such as a liquid crystal panel or an organic EL panel. The display unit 33 displays, as text, the speaker indicated by the result information acquired from the speaker recognition unit 26. For example, when the 1st speaker speaks, the display unit 33 displays an indication that the speaking speaker is the 1st speaker; when the 2nd speaker speaks, the display unit 33 displays an indication that the speaking speaker is the 2nd speaker. The display unit 33 is an example of the output unit 31.
The speaker recognition apparatus 1 may also include a voice output unit. In this case, the voice output unit may be a loudspeaker that outputs, as voice, the speaker indicated by the result information acquired from the speaker recognition unit 26. That is, when the 1st speaker utters, the voice output unit outputs a voice indicating that the speaker indicated by the result information is the 1st speaker; when the 2nd speaker utters, it outputs a voice indicating that the speaker indicated by the result information is the 2nd speaker. The voice output unit is an example of the output unit 31.
[ Power supply section 35]
The power supply unit 35 is, for example, a primary battery or a secondary battery, and is electrically connected to the utterance timing registration unit 25, the acquisition unit 21, the storage unit 22, the trigger input unit 23, the utterance start detection unit 24, the speaker recognition unit 26, the output unit 31, and the like via wiring. The power supply unit 35 supplies electric power to the utterance timing registration unit 25, the acquisition unit 21, the storage unit 22, the trigger input unit 23, the utterance start detection unit 24, the speaker recognition unit 26, the output unit 31, and the like.
< action >
The operation performed by the speaker recognition apparatus 1 configured as described above will be described.
Fig. 3 is a flowchart showing the operation of the speaker recognition apparatus 1 when the 1 st speaker utters. Fig. 4 is a diagram illustrating the timing of the 1 st time and the 2 nd time for each speech of speech uttered when the 1 st speaker uttered and the 2 nd speaker uttered.
In fig. 3 and 4, the utterance timing registration unit 25 registers, in its memory, 1st registration information in which 1st time information, indicating the condition that the 1st speaker starts speaking after the speaker inputs the trigger to the trigger input unit 23, is associated with the tag A. Likewise, the utterance timing registration unit 25 registers, in its memory, 2nd registration information in which 2nd time information, indicating the condition that the 2nd speaker starts speaking before the speaker inputs the trigger to the trigger input unit 23, is associated with the tag B.
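A sketch of the registration table just described, with the tags A and B as in the embodiment. The dictionary layout and condition names are assumptions about how such a table might be held in the registration unit's memory.

```python
# Registration information: tag A = utterance starts after the trigger
# (1st speaker); tag B = utterance starts before the trigger (2nd speaker).
REGISTRATION = [
    {"tag": "A", "condition": "starts_after_trigger", "speaker": "1st speaker"},
    {"tag": "B", "condition": "starts_before_trigger", "speaker": "2nd speaker"},
]

def lookup_speaker(t1, t2):
    """Classify the observed timing of the 2nd time (t2) relative to the
    1st time (t1), then look the condition up in the registration table."""
    condition = "starts_after_trigger" if t2 > t1 else "starts_before_trigger"
    for entry in REGISTRATION:
        if entry["condition"] == condition:
            return entry["speaker"]
    return None
```

Because the table is registered in advance, no learning is needed at recognition time: the lookup alone resolves the speaker.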
As shown in fig. 2A, 3, and 4, first, a trigger for causing the acquisition unit 21 to start acquiring speech is input to the trigger input unit 23. That is, the trigger input unit 23 receives the input of the preset trigger from one speaker before that speaker speaks. The trigger input unit 23 thereby detects the trigger input from the speaker (S11). When the trigger input unit 23 detects the input of the trigger, it generates an input signal and outputs the generated input signal to the utterance start detection unit 24 and the speaker recognition unit 26.
Next, the acquisition unit 21 acquires the speech uttered by one of the speakers (S12). The acquisition unit 21 converts the acquired speech uttered by one of the speakers into a speech signal, and outputs the converted speech signal to the storage unit 22.
Next, the storage unit 22 stores the speech information of the speech represented by the speech signal acquired by the acquisition unit 21 (S13). That is, the storage unit 22 automatically stores speech information of the latest speech uttered by one speaker.
Next, when the utterance start detection unit 24 receives the input signal from the trigger input unit 23, it detects the start position (the 2nd time) at which the utterance starts in the speech information stored in the storage unit 22 (S14). Specifically, the utterance start detection unit 24 detects the start position of the speech indicated by the speech information that was uttered by the one speaker immediately after that speaker input the trigger to the trigger input unit 23, and that was stored based on that utterance. The utterance start detection unit 24 generates start position information indicating the start position of the speech and outputs it to the speaker recognition unit 26.
Next, the speaker recognition unit 26 recognizes either the 1st speaker or the 2nd speaker based on the 1st time at which the trigger is input to the trigger input unit 23, the 2nd time at which the utterance start detection unit 24 detects the start position of the utterance from each voice, and the plurality of pieces of registration information, registered by the utterance timing registration unit 25, indicating the timing of the 2nd time with respect to the 1st time (S15). In fig. 3, the 1st time is earlier than the 2nd time, so the speaker recognition unit 26 recognizes the speaker of the speech indicated by the start position information (the uttered speech) as the 1st speaker. That is, the speaker recognition unit 26 recognizes the one speaker as the 1st speaker.
Next, the speaker recognition unit 26 outputs result information including the result of recognizing the 1 st speaker to the output unit 31 (S16).
Then, the speaker recognition apparatus 1 ends the processing.
Fig. 5 is a flowchart showing the operation of the speaker recognition apparatus 1 when the 2nd speaker utters. Description of processing identical to that in fig. 3 is omitted as appropriate.
As shown in fig. 2A, 4, and 5, the acquisition unit 21 first acquires the speech uttered by the speaker of the other party (S21). The acquisition unit 21 converts the acquired speech uttered by the other speaker into a speech signal, and outputs the converted speech signal to the storage unit 22.
Next, a trigger for causing the acquisition unit 21 to start acquiring speech is input to the trigger input unit 23. That is, the trigger input unit 23 receives the input of the preset trigger from the speaker after the other speaker has spoken. The trigger input unit 23 thereby detects the trigger input from the speaker (S22). When the trigger input unit 23 detects the input of the trigger, it generates an input signal and outputs the generated input signal to the utterance start detection unit 24 and the speaker recognition unit 26.
Next, the storage unit 22 stores the speech information of the speech represented by the speech signal acquired by the acquisition unit 21 (S13). That is, the storage unit 22 automatically stores speech information of the latest speech uttered by the speaker of the other party.
Next, when the utterance start detection unit 24 receives the input signal from the trigger input unit 23, it detects the start position (the 2nd time) at which the utterance starts in the speech information stored in the storage unit 22 (S14). Specifically, the utterance start detection unit 24 detects the start position of the speech indicated by the speech information that was uttered by the other speaker immediately before the speaker input the trigger to the trigger input unit 23, and that was stored based on that utterance. The utterance start detection unit 24 generates start position information indicating the start position of the speech and outputs it to the speaker recognition unit 26.
Next, the speaker recognition unit 26 recognizes either the 1st speaker or the 2nd speaker based on the 1st time at which the trigger is input to the trigger input unit 23, the 2nd time at which the utterance start detection unit 24 detects the start position of the utterance from each voice, and the plurality of pieces of registration information, registered by the utterance timing registration unit 25, indicating the timing of the 2nd time with respect to the 1st time (S15). In fig. 5, the 2nd time is earlier than the 1st time, so the speaker recognition unit 26 recognizes the speaker of the speech indicated by the start position information as the 2nd speaker. That is, the speaker recognition unit 26 recognizes the other speaker as the 2nd speaker.
Next, the speaker recognition unit 26 outputs result information including the result of recognizing the 2 nd speaker to the output unit 31 (S16).
Then, the speaker recognition apparatus 1 ends the processing.
Fig. 6 is a flowchart showing the operation of the speaker recognition unit 26 of the speaker recognition device 1 according to the embodiment.
As shown in figs. 3, 5, and 6, first, when the speaker recognition unit 26 acquires the input signal indicating the 1st time from the trigger input unit 23 and the start position information indicating the 2nd time from the utterance start detection unit 24, it calculates the timing of the 2nd time with respect to the 1st time (S31). That is, the speaker recognition unit 26 determines the temporal order of the 2nd time with respect to the 1st time.
The speaker recognition unit 26 compares the calculated result indicating the timing of the 2 nd time with respect to the 1 st time with the registration information, and determines whether or not the 1 st time is a time earlier than the 2 nd time (S32).
When the 1st time is earlier than the 2nd time, the speaker recognition unit 26 determines that the calculated timing matches the content shown in the 1st registration information among the registration information (S32: yes), and determines that the speaker who uttered is the 1st speaker (S33).
The speaker recognition unit 26 outputs result information including a result of recognizing the 1 st speaker from among the 1 st speaker and the 2 nd speaker to the display unit. Then, the speaker recognition unit 26 ends the processing.
When the 1st time is later than the 2nd time, the speaker recognition unit 26 determines that the calculated timing matches the content shown in the 2nd registration information among the registration information (S32: no), and determines that the speaker who uttered is the 2nd speaker (S34).
The speaker recognition unit 26 outputs result information including a result of recognizing the 2 nd speaker from among the 1 st speaker and the 2 nd speaker to the display unit. Then, the speaker recognition unit 26 ends the processing.
< action Effect >
Next, the operation and effects of the speaker recognition apparatus 1 in the present embodiment will be described.
As described above, the speaker recognition device 1 according to the present embodiment includes: the acquisition unit 21, which acquires each speech of 1 or more speakers when speaking; the storage unit 22, which stores the speech of each utterance of the 1 or more speakers acquired by the acquisition unit 21; the trigger input unit 23, to which a trigger is input; the utterance start detection unit 24, which detects the start position at which an utterance starts from each of the voices stored in the storage unit 22 every time a trigger is input to the trigger input unit 23; and the speaker recognition unit 26, which recognizes one of the 1 or more speakers based on at least the 1st time at which the trigger is input to the trigger input unit 23 and the 2nd time, detected by the utterance start detection unit 24 from each voice, at which the utterance starts.
This makes it possible to recognize one speaker from among the 1 or more speakers based on, for example, the temporal order of the 1st time, at which the trigger input by one of the 1 or more speakers is detected, and the 2nd time, at which that speaker's utterance starts. That is, it is possible to recognize which of the 1 or more speakers uttered the speech acquired by the acquisition unit 21 without learning the period from the 1st time to the 2nd time.
Therefore, according to the speaker recognition device 1, it is possible to recognize the speaker by a simple process and suppress an increase in the amount of calculation.
In particular, the speaker recognition apparatus 1 can recognize the speaker of a speech based on the timing of the utterance with respect to the 1st time. Therefore, the speaker recognition apparatus 1 can recognize the speaker of the voice through a simple operation. Further, since operating the speaker recognition device 1 is simple, the device need not be made complicated by arranging a plurality of buttons and the like. For example, when the trigger input unit 23 is a button, it is possible to identify which of the 1 or more speakers uttered even with a single button, so the configuration of the device can be simplified.
In addition, the voice input method in the present embodiment includes: acquiring each speech of 1 or more speakers when speaking; storing the acquired speech of each of the 1 or more speakers in the storage unit 22; receiving an input of a trigger; detecting, every time the trigger is input, the start position at which an utterance starts from each voice stored in the storage unit 22; and recognizing one speaker from among the 1 or more speakers based on at least the 1st time at which the trigger is input and the 2nd time at which the utterance detected from each voice starts.
This speech input method also has the same operational effects as those of the speaker recognition apparatus 1 described above.
The recording medium in the present embodiment is a computer-readable nonvolatile recording medium on which a program for causing a computer to execute the voice input method is recorded.
This recording medium also has the same operational effects as those of the speaker recognition apparatus 1 described above.
In addition, the speaker recognition device 1 in the present embodiment includes the utterance timing registration unit 25, which registers which of the 1st time and the 2nd time is the earlier time. The speaker recognition unit 26 recognizes one speaker from among the 1 or more speakers based on the 1st time, the 2nd time, and the plurality of pieces of registration information, registered by the utterance timing registration unit 25, indicating the timing of the 2nd time with respect to the 1st time.
Thus, the temporal order of the 1st time and the 2nd time can be registered in advance as a condition desired by the 1 or more speakers. Therefore, the speaker recognition unit 26 can recognize one speaker from among the 1 or more speakers merely by determining whether the temporal order of the 1st time and the 2nd time matches that shown in the registration information. As a result, the speaker recognition device 1 can more reliably recognize the speaker through simple processing.
In the speaker recognition device 1 according to the present embodiment, when registering the utterance timing of each of the 1 or more speakers, the utterance timing registration unit 25 registers 1st registration information in which 1st time information, indicating that the 2nd time at which the utterance starts is later than the 1st time at which the trigger is input to the trigger input unit 23, is associated with one of the 1 or more speakers. Likewise, when registering the utterance timing, the utterance timing registration unit 25 registers 2nd registration information in which 2nd time information, indicating that the 2nd time at which the utterance starts is earlier than the 1st time at which the trigger is input to the trigger input unit 23, is associated with another of the 1 or more speakers.
Thus, the speaker can register a condition that a trigger is input before the start of utterance or a condition that a trigger is input after the start of utterance. As described above, if the speaker registers the condition in advance, the speaker recognition apparatus 1 can easily and reliably recognize the speaker without learning.
In the speaker recognition device 1 according to the present embodiment, the speaker recognition unit 26 calculates the timing of the 2nd time with respect to the 1st time, compares the calculated timing with the plurality of pieces of registration information, determines that the speaker who uttered is the 1st speaker when the 2nd time is later than the 1st time, and determines that the speaker who uttered is the 2nd speaker, different from the 1st speaker, when the 2nd time is earlier than the 1st time.
Thus, the speaker recognition unit 26 can calculate the timing of the 2nd time with respect to the 1st time from the 1st time at which the trigger was input to the trigger input unit 23 and the 2nd time detected by the utterance start detection unit 24. The speaker recognition unit 26 can thereby obtain a result indicating whether the 1st time is earlier or later than the 2nd time. As a result, by comparing this result with the plurality of pieces of registration information, the speaker recognition unit 26 can more reliably recognize which of the 1 or more speakers uttered.
In addition, when there are three or more speakers, it is still possible to identify which speaker uttered by registering, for example, the period from the 1st time to the 2nd time for each speaker.
In the speaker recognition device 1 according to the present embodiment, the trigger input unit 23 is a voice input interface that receives input of a preset voice. In addition, a preset voice is input to the trigger input unit 23 as a trigger.
Thus, the speaker recognition apparatus 1 can recognize the wake-up word and recognize the speaker merely by the speaker uttering a preset voice such as a wake-up word. The speaker recognition apparatus 1 is therefore excellent in operability.
In the speaker recognition device 1 according to the present embodiment, the trigger input unit 23 is an operation button provided in the speaker recognition device 1. The received operation input is input as a trigger to the trigger input unit 23.
Thus, the speaker can operate the trigger input unit 23 to reliably input the trigger to the trigger input unit 23.
(other modifications, etc.)
The present disclosure has been described above based on the embodiments, but the present disclosure is not limited to these embodiments and the like.
For example, in the voice input device, the voice input method, and the recording medium according to the above embodiment, the direction of the speaker with respect to the voice input device may be estimated based on the voice acquired by the acquisition unit. In this case, the sound source direction of each speaker with respect to the voice input device may be estimated by using a microphone array as the acquisition unit. Specifically, the voice input device may calculate the time difference (phase difference) with which the voice arrives at each microphone of the acquisition unit, and estimate the sound source direction by, for example, a delay time estimation method.
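A far-field sketch of the delay-based direction estimation mentioned above: for a pair of microphones a distance d apart, an arrival-time difference Δt corresponds to an angle arcsin(c·Δt/d) from the broadside direction, where c is the speed of sound. This is a generic textbook relation offered as an illustration, not the patent's specific method; the parameter values are assumptions.

```python
import math

def direction_from_delay(delta_t, mic_distance, speed_of_sound=343.0):
    """Estimate the direction of arrival, in degrees from broadside, of a
    far-field sound source from the arrival-time difference `delta_t` (s)
    between two microphones separated by `mic_distance` (m)."""
    s = speed_of_sound * delta_t / mic_distance
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.degrees(math.asin(s))
```

A delay of zero means the source lies broadside to the pair; a delay equal to mic_distance / speed_of_sound puts it at 90 degrees, along the microphone axis.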
In the voice input device, the voice input method, and the recording medium according to the above embodiment, the voice input device may automatically pause or stop recording when, by detecting the sections of the speaker's voice acquired by the acquisition unit, it finds that a period in which the acquisition unit cannot acquire the speaker's voice has continued for a predetermined period or longer.
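The automatic-stop behavior can be sketched as a check on the trailing run of silent frames; the frame energies, the energy threshold, and the timeout value below are illustrative assumptions about how the section detection might be realized.

```python
def should_stop_recording(frame_energies, frame_duration,
                          silence_threshold, timeout):
    """Return True when the most recent frames have stayed below
    `silence_threshold` for at least `timeout` seconds, i.e. the
    acquisition unit has not picked up the speaker's voice recently."""
    silent_frames = 0
    for energy in reversed(frame_energies):
        if energy < silence_threshold:
            silent_frames += 1
        else:
            break  # last voiced frame reached
    return silent_frames * frame_duration >= timeout
```

Running this check after each new frame lets the device stop (or pause) recording as soon as the silence has lasted the predetermined period.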
The voice input method according to each of the above embodiments may be realized by a program using a computer, and such a program may be stored in a storage device.
The processing units included in the voice input device, the voice input method, and the program according to the above embodiments are typically realized as an LSI, which is an integrated circuit. These units may each be formed as an individual chip, or a single chip may be formed so as to include some or all of them.
The integrated circuit is not limited to an LSI, and may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after LSI manufacturing or a reconfigurable processor that can reconfigure connection and setting of circuit cells within an LSI may be used.
In the above embodiments, each component may be configured by dedicated hardware or may be realized by executing a software program suitable for each component. Each component may be realized by reading out and executing a software program recorded in a recording medium such as a hard disk or a semiconductor memory by a program execution unit such as a CPU or a processor.
All the numbers used in the above description are exemplified for specifically explaining the present disclosure, and the embodiments of the present disclosure are not limited to the exemplified numbers.
Note that division of functional blocks in the block diagrams is an example, and a plurality of functional blocks may be implemented as one functional block, one functional block may be divided into a plurality of functional blocks, or a part of functions may be transferred to another functional block. Further, the functions of a plurality of functional blocks having similar functions may be processed in parallel or in a time-division manner by a single piece of hardware or software.
The order in which the steps in the flowcharts are executed is exemplified for the purpose of specifically describing the present disclosure, and may be an order other than the above. Further, a part of the above steps may be executed simultaneously (in parallel) with other steps.
Other embodiments obtained by implementing various modifications to the embodiments and embodiments that can be realized by arbitrarily combining the components and functions in the embodiments within a scope not departing from the gist of the present disclosure are also included in the present disclosure.
Industrial applicability
The present disclosure is applicable to a voice input device, a voice input method, and a recording medium used to determine which of a plurality of speakers uttered each speech.

Claims (8)

1. A voice input device is provided with:
an acquisition unit that acquires each speech of 1 or more speakers when speaking;
a storage unit that stores the speech of each of the utterances of the 1 or more speakers acquired by the acquisition unit;
a trigger input unit to which a trigger is input;
an utterance start detection unit that detects a start position of an utterance from each of the voices stored in the storage unit each time the trigger is input by the trigger input unit; and
a speaker recognition unit configured to recognize any one speaker among the 1 or more speakers based on at least a 1st time at which the trigger is input to the trigger input unit and a 2nd time at which the utterance start detection unit detects, from each of the voices, the start position at which the utterance starts.
2. The voice input device according to claim 1, comprising:
an utterance timing registration unit that registers which of the 1st time and the 2nd time is the earlier time,
the speaker recognition unit recognizes one of the 1 or more speakers based on the 1 st time, the 2 nd time, and a plurality of pieces of registration information indicating timings of the 2 nd time with respect to the 1 st time, the utterance timing registration unit.
3. The voice input device according to claim 2, wherein
the utterance timing registration unit, when registering the utterance timing of each of the 1 or more speakers,
registers 1st registration information in which 1st time information is associated with one of the 1 or more speakers, the 1st time information indicating that the 2nd time at which the utterance starts is later than the 1st time at which the trigger is input to the trigger input unit, and
registers 2nd registration information in which 2nd time information is associated with another speaker among the 1 or more speakers, the 2nd time information indicating that the 2nd time at which the utterance starts is earlier than the 1st time at which the trigger is input to the trigger input unit.
4. The voice input apparatus according to claim 2 or 3,
the speaker recognition unit
calculates the timing of the 2nd time with respect to the 1st time, and
compares the calculated result indicating the timing with the plurality of pieces of registration information, determining that the speaker who uttered is a 1st speaker when the 2nd time is later than the 1st time, and determining that the speaker who uttered is a 2nd speaker different from the 1st speaker when the 2nd time is earlier than the 1st time.
5. The voice input device according to any one of claims 1 to 3,
the trigger input unit is a voice input interface for receiving a preset voice input,
a preset voice is input to the trigger input unit as the trigger.
6. The voice input device according to any one of claims 1 to 3,
the trigger input part is an operation button arranged on the voice input device,
the received operation input is input to the trigger input unit as the trigger.
7. A voice input method, comprising:
acquiring each speech of 1 or more speakers when speaking;
storing the acquired speech of each of the 1 or more speakers in a storage unit;
receiving an input of a trigger;
detecting, every time the trigger is input, a start position at which an utterance starts from each of the voices stored in the storage unit; and
identifying one of the 1 or more speakers based on at least a 1st time at which the trigger is input and a 2nd time at which an utterance detected from each of the voices starts.
8. A computer-readable nonvolatile recording medium recording a program for causing a computer to execute the voice input method according to claim 7.
CN202010206519.0A 2019-03-27 2020-03-23 Voice input device, voice input method, and recording medium Pending CN111754986A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962824619P 2019-03-27 2019-03-27
US62/824619 2019-03-27
JP2019-197231 2019-10-30
JP2019197231A JP7449070B2 (en) 2019-03-27 2019-10-30 Voice input device, voice input method and its program

Publications (1)

Publication Number Publication Date
CN111754986A true CN111754986A (en) 2020-10-09

Family

ID=72643246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010206519.0A Pending CN111754986A (en) 2019-03-27 2020-03-23 Voice input device, voice input method, and recording medium

Country Status (2)

Country Link
JP (1) JP7449070B2 (en)
CN (1) CN111754986A (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001222295A (en) 2000-02-10 2001-08-17 Fujitsu Ltd Individual collating device and recording medium with program for realizing it recorded thereon
JP2004094077A (en) 2002-09-03 2004-03-25 Nec Corp Speech recognition device and control method, and program
JP2006313261A (en) 2005-05-09 2006-11-16 Mitsubishi Electric Corp Voice recognition device and voice recognition program and computer readable recording medium with the voice recognition program stored
KR20140060040A (en) 2012-11-09 2014-05-19 삼성전자주식회사 Display apparatus, voice acquiring apparatus and voice recognition method thereof

Also Published As

Publication number Publication date
JP2020160430A (en) 2020-10-01
JP7449070B2 (en) 2024-03-13

Similar Documents

Publication Publication Date Title
JP7000268B2 (en) Information processing equipment, information processing methods, and programs
KR102545764B1 (en) Device and method for voice translation
CN107949880A (en) Vehicle-mounted speech recognition equipment and mobile unit
CN109074804B (en) Accent-based speech recognition processing method, electronic device, and storage medium
JP7330066B2 (en) Speech recognition device, speech recognition method and its program
JP2011248140A (en) Voice recognition device
US20120078622A1 (en) Spoken dialogue apparatus, spoken dialogue method and computer program product for spoken dialogue
JP6827536B2 (en) Voice recognition device and voice recognition method
US20200312305A1 (en) Performing speaker change detection and speaker recognition on a trigger phrase
JP6459330B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
US11507759B2 (en) Speech translation device, speech translation method, and recording medium
JP5988077B2 (en) Utterance section detection apparatus and computer program for detecting an utterance section
JP7287006B2 (en) Speaker Determining Device, Speaker Determining Method, and Control Program for Speaker Determining Device
CN111754986A (en) Voice input device, voice input method, and recording medium
JP2015161718A (en) speech detection device, speech detection method and speech detection program
US11308966B2 (en) Speech input device, speech input method, and recording medium
JP7172120B2 (en) Speech recognition device and speech recognition method
JP7429107B2 (en) Speech translation device, speech translation method and its program
JP6748565B2 (en) Voice dialogue system and voice dialogue method
JP2020091435A (en) Voice recognition system, notification method of voice recognition system, program, and mobile body mounted apparatus
JP2015036826A (en) Communication processor, communication processing method and communication processing program
JP7242873B2 (en) Speech recognition assistance device and speech recognition assistance method
US11922927B2 (en) Learning data generation device, learning data generation method and non-transitory computer readable recording medium
JP2017201348A (en) Voice interactive device, method for controlling voice interactive device, and control program
JP2022117375A (en) Voice recognition program and voice recognition apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination