CN113228170A - Information processing apparatus and nonvolatile storage medium - Google Patents

Information processing apparatus and nonvolatile storage medium

Info

Publication number
CN113228170A
CN113228170A CN202080005757.3A
Authority
CN
China
Prior art keywords
score
unit
information processing
voice
processing apparatus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202080005757.3A
Other languages
Chinese (zh)
Other versions
CN113228170B (en)
Inventor
千葉俊一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Original Assignee
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd and Toshiba Visual Solutions Corp
Publication of CN113228170A
Application granted
Publication of CN113228170B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention relates to an information processing apparatus and a nonvolatile storage medium that assist the judgment of a user who is attempting to have a trigger word detected. An information processing device according to an embodiment includes: an acquisition unit that acquires, as a voice signal, the voice of a user input to a voice input unit; a score calculation unit that calculates a score of the voice signal with respect to voice data that serves as a reference for detecting, from the voice signal, a trigger word for starting a voice recognition service; and a display control unit that displays the score on a display unit.

Description

Information processing apparatus and nonvolatile storage medium
This application claims priority based on Japanese patent application No. 2019-.
Technical Field
The present application relates to an information processing apparatus and a nonvolatile storage medium.
Background
In a device such as a television apparatus having a voice recognition function, for example, a user can operate the device by voice. The device starts the voice recognition service when a Trigger word (Trigger word) uttered by the user is detected.
Prior art documents
Patent document
Patent document 1: Japanese laid-open patent publication No. 2012-008554
Disclosure of Invention
However, the detection accuracy of the trigger word may be degraded depending on the user's way of speaking, the surrounding environment, and the like. Because a variety of factors can lower the detection accuracy, the user sometimes cannot determine why the trigger word was not detected.
An object of the present application is to provide an information processing apparatus and a nonvolatile storage medium that can assist the judgment of a user who is attempting to have a trigger word detected.
An information processing device according to an embodiment of the present application includes: an acquisition unit that acquires, as a voice signal, the voice of a user input to a voice input unit; a score calculation unit that calculates a score of the voice signal with respect to voice data that serves as a reference for detecting, from the voice signal, a trigger word for starting a voice recognition service; and a display control unit that displays the score on a display unit.
Drawings
Fig. 1 is a diagram showing an example of a configuration of a voice recognition system according to an embodiment;
fig. 2 is a diagram showing an example of a hardware configuration of a television device according to the embodiment;
fig. 3 is a diagram showing an example of a functional configuration of a television device according to the embodiment;
fig. 4 is a diagram showing an example of a score display screen displayed by the television device according to the embodiment;
fig. 5 is a diagram showing some examples of a score calculation method by a television device according to an embodiment;
fig. 6 is a flowchart showing an example of a procedure of trigger detection processing in the television device according to the embodiment;
fig. 7 is a diagram showing an example of a score display screen displayed on the television device according to modification 1 of the embodiment;
fig. 8 is a diagram showing an example of a functional configuration of a television device according to modification 2 of the embodiment;
fig. 9 is a diagram showing an example of a score display screen displayed on the television device according to modification 2 of the embodiment;
fig. 10 is a diagram showing another example of a score display screen displayed on the television device according to modification 2 of the embodiment;
fig. 11 is a diagram showing an example of a score display screen displayed on the television device according to modification 3 of the embodiment.
Description of the reference numerals
1 … voice recognition system, 10, 30 … television device, 11 … input receiving part, 12 … test function setting part, 13 … trigger word detecting part, 14 … score calculating part, 15, 35 … display control part, 16 … application executing part, 17 … device control part, 18 … communication part, 19 … storing part, 19a … voice dictionary, 20 … voice recognition server, 31 … volume judging part, 40 … network.
Detailed Description
(configuration of Voice recognition System)
Fig. 1 is a diagram showing an example of the configuration of a voice recognition system 1 according to the embodiment. As shown in fig. 1, the voice recognition system 1 includes a television device 10 and a voice recognition server 20, and provides, for example, a voice recognition service to a user of the television device 10. With the voice recognition service, the user can perform the operation of the television apparatus 10 by voice, for example.
The television apparatus 10 and the voice recognition server 20 are connected to each other wirelessly or by wire via a network 40 such as the internet. The network 40 may also be, for example, a home network based on DLNA (Digital Living Network Alliance) (registered trademark), a local area network (LAN) in the home, or the like.
The television apparatus 10 as an information processing apparatus can receive various programs by receiving a broadcast signal from a broadcasting station, for example. Further, the television apparatus 10 has a voice recognition function, and starts performing a voice recognition service when a trigger word uttered by the user is detected. The Trigger word is a predetermined voice command that becomes a Trigger (Trigger) for starting the voice recognition service. The voice recognition function of the television apparatus 10 is exclusively used for detecting the trigger word. After the voice recognition service is started, the television apparatus 10 provides the voice recognition service to the user by using, for example, the voice recognition function of the voice recognition server 20. In this manner, the television apparatus 10 functions as a communication apparatus that communicates with the voice recognition server 20.
The voice recognition server 20 is configured as a cloud server or the like placed on the cloud, or as one or more computers having physical components such as a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory). The functions of the voice recognition server 20, such as the voice recognition function, are realized by the CPU of the cloud server or computer executing a program stored in, for example, the ROM.
The voice recognition server 20 includes a voice recognition unit 21, a processing unit 22, a communication unit 23, and a storage unit 24 as functional units for realizing a voice recognition function and the like.
The voice recognition unit 21 analyzes and recognizes a voice signal or the like based on the user utterance transmitted from the television apparatus 10 via the communication unit 23. At this time, the voice recognition unit 21 refers to the voice dictionary 24a of the storage unit 24.
The processing unit 22 performs various processes based on the recognition result of the voice signal. For example, when the voice signal is an instruction to operate the television apparatus 10, the processing unit 22 transmits the instruction content to the television apparatus 10 through the communication unit 23. When the voice signal is a request to acquire information from the internet, the processing unit 22 searches for the information on the internet and transmits the search result to the television apparatus 10 via the communication unit 23. When the voice signal is a request for dialogue, the processing unit 22 may transmit the content of a reply to the television apparatus 10 via the communication unit 23.
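The branching performed by the processing unit 22 can be sketched as follows. The category labels and helper functions here are hypothetical illustrations, not names taken from the embodiment:

```python
# Hypothetical sketch of the processing unit 22's dispatch on the
# recognition result; the categories and helper names are illustrative.

def search_internet(query):
    """Placeholder for an internet search (assumed helper)."""
    return f"results for {query}"

def make_reply(utterance):
    """Placeholder for dialogue-reply generation (assumed helper)."""
    return f"responding to {utterance}"

def dispatch(recognition):
    """Choose the response sent back to the television apparatus 10
    based on the kind of request recognized in the voice signal."""
    kind, payload = recognition
    if kind == "device_command":        # e.g. an instruction to operate the TV
        return ("command", payload)
    if kind == "information_request":   # e.g. a query answered from the internet
        return ("search_result", search_internet(payload))
    if kind == "dialogue":              # e.g. a conversational request
        return ("reply", make_reply(payload))
    return ("reply", "Sorry, I did not understand.")

print(dispatch(("device_command", "volume up")))  # ('command', 'volume up')
```

Each branch corresponds to one of the three cases described above: an operation instruction, an information request, and a dialogue request.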
The communication unit 23 performs communication with the television apparatus 10. For example, the communication unit 23 receives an audio signal of the user from the television apparatus 10. For another example, the communication unit 23 transmits the processing result by the processing unit 22 to the television apparatus 10.
The storage unit 24 stores various parameters, information, and the like necessary for realizing the above-described functions of the voice recognition server 20. For example, the storage unit 24 includes a voice dictionary 24a, and data for analyzing a voice signal from the user is stored in the voice dictionary 24 a. As will be described later, the television apparatus 10 also has a voice dictionary for voice recognition. However, the storage unit 24 of the voice recognition server 20 is configured as a large-capacity storage device, and the voice dictionary 24a of the storage unit 24 stores more detailed and diversified data.
By having the voice recognition server 20, with its high processing capability, take on the main part of the functions related to the voice recognition service in this way, the recognition accuracy and recognition speed for the user's voice signal can be improved and a richer voice recognition service can be provided.
(hardware construction of television apparatus)
Fig. 2 is a diagram showing an example of the hardware configuration of the television apparatus 10 according to the embodiment.
As shown in fig. 2, the television device 10 includes an antenna 101, input terminals 102a to 102c, a tuner 103, a demodulator 104, a demultiplexer 105, an a/D (analog/digital) converter 106, a selector 107, a signal processing unit 108, a speaker 109, a display panel 110, an operation unit 111, a light receiving unit 112, an IP communication unit 113, a CPU 114, a memory 115, a storage 116, a microphone 117, and an Audio (Audio) I/F (interface) 118.
The antenna 101 receives a broadcast signal of digital broadcasting, and supplies the received broadcast signal to the tuner 103 through the input terminal 102 a.
The tuner 103 selects a broadcast signal of a desired channel from broadcast signals supplied from the antenna 101, and supplies the selected broadcast signal to the demodulator 104.
The demodulator 104 demodulates the broadcast signal supplied from the tuner 103, and supplies the demodulated broadcast signal to the demultiplexer 105.
The demultiplexer 105 separates the broadcast signal supplied from the demodulator 104 and generates an image signal and a sound signal, and supplies the generated image signal and sound signal to the selector 107.
The selector 107 selects one signal from the plurality of signals supplied from the demultiplexer 105, the a/D converter 106, and the input terminal 102c, and supplies the selected one signal to the signal processing unit 108.
The signal processing unit 108 performs predetermined signal processing on the image signal supplied from the selector 107, and supplies the processed image signal to the display panel 110. The signal processing unit 108 performs predetermined signal processing on the audio signal supplied from the selector 107, and supplies the processed audio signal to the speaker 109.
The speaker 109 outputs a voice or various sounds based on the sound signal supplied from the signal processing unit 108. The speaker 109 changes the volume of the output voice or various sounds based on the control of the CPU 114.
The display panel 110 as a display unit displays images such as still images and moving images, other images, character information, and the like based on an image signal supplied from the signal processing unit 108 or control of the CPU 114.
The input terminal 102b receives analog signals such as video signals and audio signals input from the outside. The input terminal 102c receives digital signals such as video signals and audio signals input from the outside. For example, the input terminal 102c may be configured to input a digital signal from a Recorder (Recorder) or the like equipped with a drive device that drives a recording medium for recording and playing such as BD (Blu-ray Disc) (registered trademark) to record and play.
The a/D converter 106 supplies a selector 107 with a digital signal generated by a/D converting an analog signal supplied from the input terminal 102 b.
The operation unit 111 receives an operation input from a user.
The light receiving unit 112 receives infrared rays from the remote control 119.
The IP communication unit 113 is a communication interface for performing IP (internet protocol) communication via the network 40.
The CPU 114 as a control section controls the entire television apparatus 10.
The memory 115 includes a ROM that stores various computer programs executed by the CPU 114, a RAM that provides a work area for the CPU 114, and the like. For example, the ROM stores a voice recognition program with which the television apparatus 10 detects the trigger word, an application program for providing the voice recognition service, and the like.
The storage 116 is an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like. The storage 116 stores, for example, the signal selected by the selector 107 as video data.
The microphone 117 as a voice input unit acquires voice uttered by the user and transmits the acquired voice to the audio I/F118.
The audio I/F 118 converts the analog sound acquired by the microphone 117 into digital form and sends it to the CPU 114 as a sound signal. Note that the digital "sound signal" produced by the audio I/F 118 is sometimes simply referred to as "sound" hereinafter.
(functional Structure of television apparatus)
Next, an example of a functional configuration of the television apparatus 10 according to the embodiment will be described with reference to fig. 3. Fig. 3 is a diagram showing an example of a functional configuration of the television apparatus 10 according to the embodiment.
In the television apparatus 10, the CPU 114 described above executes a program stored in, for example, a ROM or the like, thereby realizing a voice recognition function or the like of the television apparatus 10. The program executed by the television apparatus 10 has a module structure including the functional units described below.
As shown in fig. 3, the television apparatus 10 includes, as functional units for realizing the functions of the television apparatus 10, an input receiving unit 11, a test function setting unit 12, a trigger detecting unit 13, a score (score) calculating unit 14, a display control unit 15, an application executing unit 16, an equipment control unit 17, a communication unit 18, and a storage unit 19.
The input receiving section 11 as an acquisition section receives various inputs from a user. For example, the input receiving unit 11 acquires a user's voice input to the microphone 117 via the audio I/F118. Further, for example, the input receiving unit 11 acquires various instructions based on operation input from the operation unit 111 or the remote control 119.
When the start of the test function is instructed by an operation input from the operation unit 111 or the remote control 119, the test function setting unit 12 sets the test function to be valid. In a state where the test function is enabled, as will be described later, a score for the audio signal from the user is calculated and displayed on the display panel 110 of the television apparatus 10.
The trigger word detection unit 13 performs acoustic processing such as noise cancellation on the acquired voice signal of the user. The trigger word detection unit 13 then refers to the voice dictionary 19a of the storage unit 19 to detect the trigger word from the acoustically processed voice signal. In doing so, the trigger word detection unit 13 calculates the degree of coincidence between the user's voice signal and the voice data that is stored in the voice dictionary 19a and serves as the reference for trigger word detection. When the degree of coincidence between the voice data and the voice signal is equal to or greater than a predetermined value, the trigger word detection unit 13 recognizes that the voice signal contains the trigger word and determines that the trigger word has been detected. When the degree of coincidence is less than the predetermined value, the trigger word detection unit 13 recognizes that the acquired voice signal is not the trigger word and determines that the trigger word has not been detected.
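As a minimal sketch of this decision rule, assuming an illustrative predetermined value of 0.90 (the embodiment does not specify a concrete value, and the function name is not from the source):

```python
def is_trigger_word(degree_of_coincidence, predetermined_value=0.90):
    """Recognize the voice signal as containing the trigger word when its
    degree of coincidence with the reference voice data is equal to or
    greater than the predetermined value (0.90 is an assumed placeholder)."""
    return degree_of_coincidence >= predetermined_value

print(is_trigger_word(0.95))  # True: trigger word detected
print(is_trigger_word(0.54))  # False: trigger word not detected
```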
When the test function is enabled, the score calculation unit 14 calculates the score of the user's voice signal with respect to the voice data serving as the reference for trigger word detection. More specifically, the score calculation unit 14 calculates the score by normalizing the calculated degree of coincidence between the voice data and the voice signal. A high score therefore indicates a high degree of coincidence between the voice data and the voice signal, and a score equal to or greater than a predetermined value indicates that the trigger word detection unit 13 recognizes the voice signal as containing the trigger word.
The display control unit 15 controls various displays on the display panel 110. For example, when the input receiving unit 11 acquires a user operation input from the remote control 119 or the like, the display control unit 15 displays an operation screen corresponding to the operation on the display panel 110. For another example, when the test function is enabled, the display control unit 15 displays the calculated score on the display panel 110. Further, when the voice recognition service is started by detection of the trigger word, the display control unit 15 displays a message, an icon, or the like responding to the voice on the display panel 110. The message, icon, or the like responding to the voice may be, for example, content urging the user to speak, or character data representing the result of recognizing the user's voice.
The application execution unit 16 starts the voice recognition service when the trigger word is detected from the voice signal. More specifically, the application execution section 16 starts the voice recognition service providing application when the trigger word is detected from the voice signal. The voice recognition service providing application is a user interface for performing information exchange between the voice recognition server 20 and the user. That is, the voice recognition service providing application realizes communication between the television apparatus 10 and the voice recognition server 20 via the communication unit 18. Also, the voice recognition service providing application transmits a voice signal of the user to the voice recognition server 20, and receives a response to the content indicated by the voice signal from the voice recognition server 20.
The device control unit 17 controls each unit of the television apparatus 10. For example, the device control unit 17 controls the speaker 109 to lower the volume after the trigger word is detected. This reduces the chance that audio from the content interferes with the input of the user's speech following the trigger word. For another example, the device control unit 17 controls each unit of the television apparatus 10 based on commands included in the user's voice while the voice recognition service is being provided.
The communication unit 18 controls communication with an external device or the like via the network 40. For example, the communication unit 18 controls communication between the voice recognition server 20 and the television apparatus 10 in accordance with a voice recognition service providing application.
The storage unit 19 stores various parameters, information, and the like necessary for realizing the functions of the television apparatus 10 described above. For example, the storage unit 19 includes a voice dictionary 19a, and the voice dictionary 19a stores voice data serving as a reference for detecting a trigger word in a voice signal from a user. The voice data includes information on various elements such as phonemes and features included in the trigger word, and the trigger word detection unit 13 compares the voice data with a voice signal from the user to serve as an index for identifying whether or not the voice signal includes the trigger word. However, the voice data stored in the voice dictionary 19a may be plural. For example, the plurality of audio data may include various audio data based on gender and age, such as for male, female, and children.
(detailed function of television device)
Next, details of the functions of the television apparatus 10 according to the embodiment will be described with reference to fig. 4 and 5. Fig. 4 is a diagram showing an example of a score display screen 110a displayed by the television device 10 according to the embodiment. When the user validates the test function, a score display screen 110a is displayed on the display panel 110.
The user can input an instruction to start the test function by operating the remote control 119 or the like, for example. When the input receiving unit 11 receives an instruction to start the test function, the test function setting unit 12 sets the test function to be valid. When the test function is set to be valid, the display control unit 15 displays the score display screen 110a on the display panel 110.
As shown in fig. 4, a message urging the user to say the trigger word is first displayed on the score display screen 110a. For example, when the trigger word is "nie ie, tie rie bi" (a Japanese reading corresponding to "hey, TV"), a message such as "Please say 'nie ie, tie rie bi'." is displayed.
Further, a message indicating the threshold score at which the user's utterance is detected as the trigger word may be displayed on the score display screen 110a. If the threshold is, for example, 50, a message such as "If the score is 50 or more, the voice recognition service will start." is displayed.
Note that the score display screen 110a may also display the current volume setting of the television apparatus 10. Since sound emitted from the television apparatus 10 itself may hinder detection of the trigger word, displaying the volume setting can alert the user.
When the user says "nie ie, tie rie bi" or the like in response to the message on the score display screen 110a, the voice is picked up by the microphone 117, converted into a voice signal by the audio I/F 118, and received by the input receiving unit 11. When the trigger word detection unit 13 then calculates the degree of coincidence between the voice data stored in the voice dictionary 19a of the storage unit 19 and the acoustically processed voice signal, the score calculation unit 14 normalizes the degree of coincidence to a numerical value of, for example, 0 to 100 to calculate the score. The display control unit 15 displays the calculated score on the score display screen 110a in the form of, for example, a bar of 0 to 100.
When the degree of coincidence between the voice data and the voice signal is insufficient and the score falls below the threshold, it may be effective, for example, to enunciate more clearly, to speak a little more slowly, or to speak a little louder in order to obtain a higher score. Referring to the score displayed on the score display screen 110a, the user can try various ways of speaking to obtain a higher score, and may also operate the remote control 119 or the like to lower the volume of the television apparatus 10. In this case, the display control unit 15 may display on the score display screen 110a, for example, the maximum score obtained so far in addition to the score of the user's current utterance.
In calculating the degree of coincidence between the voice data and the voice signal, the trigger word detection unit 13 decomposes the voice data and the voice signal into a plurality of elements included in the trigger word and obtains a degree of coincidence for each element. The score calculation unit 14 calculates the score displayed on the score display screen 110a based on these multiple degrees of coincidence. Various methods are conceivable for this calculation.
Fig. 5 is a diagram showing some examples of score calculation methods of the television apparatus 10 according to the embodiment. In the example of fig. 5, for simplicity of explanation, the voice data and the voice signal are decomposed into phonemes 1 to 5, from which the degrees of coincidence and the score are calculated. In practice, however, the voice data and the voice signal may include not only phonemes 1 to 5 but also information on other elements such as features and intonation, and degrees of coincidence and a score may be calculated for these elements as well.
As shown in the left diagrams of (a) and (b) of fig. 5, the trigger word detection unit 13 obtains, for example, the probability X that each of the phonemes 1 to 5 appears in the voice signal. These appearance probabilities X are values obtained by comparing the voice signal with the voice data, and correspond to the above-described degree of coincidence between the voice signal and the voice data. In the left diagrams of (a) and (b) of fig. 5, the appearance probability X is expressed as a numerical value from 0 to 1.00.
As shown in the right diagrams of (a) and (b) of fig. 5, the score calculation unit 14 obtains a calculation result Y by normalizing the appearance probability X. The score calculation unit 14 normalizes the appearance probability X using, for example, the following expressions (1) and (2).
The following expression (1) is applied when the degree of coincidence Xn (such as the appearance probability X) is smaller than the threshold Tn.
[ mathematical formula 1 ]
Yn = 50 × Xn / Tn … (1)
The following expression (2) is applied when the degree of coincidence Xn (such as the appearance probability X) exceeds the threshold Tn.
[ mathematical formula 2 ]
Yn = 50 + 50 × (Xn − Tn) / (An − Tn) … (2)
From the above expressions (1) and (2), a numerical value in the range of 0 to 100 is obtained as the calculation result Yn normalizing the degree of coincidence Xn. Note that when the degree of coincidence Xn equals the threshold Tn, the calculation result Yn is the same regardless of which of expressions (1) and (2) is used.
Here, suppose the voice signal and the voice data contain L elements, and that a maximum value An that the degree of coincidence Xn can take and a threshold Tn that Xn should satisfy are set for each of the L degrees of coincidence Xn. That is, when the degree of coincidence Xn of a certain element is equal to or greater than the threshold Tn, the voice signal is determined to match the voice data with respect to that element. Substituting the degree of coincidence Xn and threshold Tn of each of the elements 1 to L into expression (1) or (2) as appropriate yields L calculation results Yn.
In the examples of the right diagrams of (a) and (b) of fig. 5, the threshold T for every appearance probability X is set to 0.90 and the maximum value A that every appearance probability X can take is set to 1.00, and the calculation results Y are obtained accordingly. The score calculation unit 14 then obtains the score displayed on the score display screen 110a based on the calculation results Y. As mentioned above, there are several methods for this.
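The normalization can be sketched as follows, under the assumption that expressions (1) and (2) are piecewise linear, mapping Xn = 0 to 0, Xn = Tn to 50, and Xn = An to 100. This form is an inference, but it reproduces the numerical examples of fig. 5 (X = 0.54 gives Y = 30, X = 0.95 gives Y = 75):

```python
def normalize(x, t=0.90, a=1.00):
    """Map a degree of coincidence x in [0, a] to a calculation result in
    [0, 100], assuming a piecewise-linear form for expressions (1)/(2):
    below the threshold t the result rises linearly to 50; at or above t
    it rises linearly from 50 to 100 at the maximum value a."""
    if x < t:
        return 50.0 * x / t                     # expression (1)
    return 50.0 + 50.0 * (x - t) / (a - t)      # expression (2)

# The fig. 5 examples, with T = 0.90 and A = 1.00:
print(round(normalize(0.54), 2))  # 30.0 (phoneme 5 in fig. 5(a))
print(round(normalize(0.95), 2))  # 75.0 (phoneme 1 in fig. 5(b))
print(round(normalize(0.90), 2))  # 50.0 (x equal to the threshold)
```

A result of 50 or more thus corresponds to the element reaching its detection threshold, matching the screen message that the service starts at a score of 50 or more.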
In the example of fig. 5(a), the score calculation unit 14 adopts the calculation result 30 of phoneme 5, the minimum among the calculation results Y obtained for phonemes 1 to 5, as the score displayed on the score display screen 110a.
In the example of fig. 5(b), the score calculation unit 14 caps the calculation results that exceed 50, namely the result 75 of phoneme 1 and the result 60 of phoneme 3, at 50, discarding the portions exceeding 50, as shown at the lower right of fig. 5(b). The average 44 of the capped calculation results Y for phonemes 1 to 5 is then adopted as the score displayed on the score display screen 110a.
The method by which the score calculating unit 14 calculates the score is not limited to the examples of fig. 5(a) and fig. 5(b). Any calculation method may be used, as long as the resulting score lets the user directly grasp the gap between the score required for trigger word detection and the user's own score, and serves as an index for obtaining a higher score.
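The two score-selection methods of fig. 5 can be sketched as follows; the helper names and the sample values are illustrative, not taken from the figure.

```python
def score_min(results):
    # Fig. 5(a) method: adopt the minimum calculation result, so the
    # weakest element determines the displayed score.
    return min(results)

def score_capped_average(results, cap=50):
    # Fig. 5(b) method: any result exceeding the cap has the excess
    # discarded (clipped to 50), then the clipped results are averaged,
    # so elements that already clear the threshold cannot mask weak ones.
    return sum(min(r, cap) for r in results) / len(results)

# Hypothetical per-phoneme calculation results Y (only 75, 60, and 30
# appear in the text; the other two values are invented for illustration).
results = [75, 45, 60, 45, 30]
```

With these sample values, `score_min` returns 30 as in fig. 5(a), while `score_capped_average` clips 75 and 60 down to 50 before averaging, as in fig. 5(b).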
(trigger detection processing of television device)
Next, an example of trigger detection processing of the television apparatus 10 according to the embodiment will be described with reference to fig. 6. Fig. 6 is a flowchart showing an example of the procedure of the trigger detection processing of the television apparatus 10 according to the embodiment.
As shown in fig. 6, the input receiving unit 11 receives the user's instruction to use the test function (step S101). That is, when the user operates the operation unit 111 or the remote control 119 to instruct the start of the test function, the input receiving unit 11 receives the instruction (yes in step S101), the test function setting unit 12 validates the setting of the test function, and the display control unit 15 displays the score display screen 110a on the display panel 110 (step S102). If the user gives no instruction to start the test function (no in step S101), the process proceeds to step S103 without performing step S102.
The input receiving unit 11 receives a voice signal based on the user's utterance (step S103). The input receiving unit 11 waits until the user speaks (no in step S103). When the user speaks toward the microphone 117 of the television apparatus 10, the sound picked up by the microphone 117 is converted into a voice signal by the audio I/F 118. When the input receiving unit 11 acquires the voice signal (yes in step S103), the trigger detecting unit 13 refers to the voice dictionary 19a and calculates the matching degree between the voice data stored in the voice dictionary 19a and the voice signal based on the user's utterance (step S104).
The score calculating unit 14 checks whether the setting of the test function is valid (step S105). When the setting of the test function is valid (yes in step S105), the score calculating unit 14 calculates a score based on the calculated matching degree (step S106). The display control unit 15 then displays the calculated score on the score display screen 110a of the display panel 110 (step S107). If the setting of the test function is not valid (no in step S105), the process proceeds to step S108 without performing steps S106 to S107.
The trigger detecting unit 13 determines whether the matching degree between the audio data and the audio signal is equal to or greater than the threshold for all elements (step S108). If there is any element for which the matching degree between the audio data and the audio signal is below the threshold (no in step S108), the trigger detecting unit 13 determines that the audio signal does not contain the trigger word and does not perform trigger word detection, and the processing is repeated from step S103.
When the matching degree between the audio data and the audio signal is equal to or greater than the threshold for all elements (yes in step S108), the trigger detecting unit 13 determines that the audio signal contains a trigger word and detects the trigger word (step S109). The application execution unit 17 then starts providing the voice recognition service (step S110).
Through the above steps, the trigger detection processing of the television apparatus 10 according to the embodiment is ended.
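The flow of fig. 6 can be condensed into a sketch like the following; all callables are placeholders standing in for the input receiving unit 11, trigger detecting unit 13, score calculating unit 14, display control unit 15, and application execution unit 17, not actual identifiers from the implementation.

```python
def trigger_detection_loop(next_signal, match_degrees, thresholds,
                           test_mode_enabled, show_score, start_service):
    """Condensed sketch of steps S103-S110 in fig. 6 (names assumed)."""
    while True:
        signal = next_signal()                   # S103: wait for an utterance
        degrees = match_degrees(signal)          # S104: matching degree per element
        if test_mode_enabled():                  # S105: test function valid?
            show_score(degrees)                  # S106-S107: calculate and display score
        if all(d >= t for d, t in zip(degrees, thresholds)):  # S108
            start_service()                      # S109-S110: trigger word detected
            return
        # S108 "no": some element below threshold -> repeat from S103
```

For example, an utterance whose per-element matching degrees all clear 0.90 ends the loop and starts the service; a weaker utterance only updates the score display and the loop waits for the next attempt.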
In recent years, television apparatuses and the like having a voice recognition function have become known. When a trigger word is detected, such a television apparatus starts providing a voice recognition service. However, the detection accuracy of the trigger word may drop depending on the user's way of speaking, the surrounding environment, and the like.
In such cases, the user repeatedly tries various measures, such as speaking louder or more slowly, to get the television apparatus to detect the trigger word. However, the user can only tell which of these attempts was effective by whether the provision of the voice recognition service actually starts.
According to the television apparatus 10 of the embodiment, the score of the audio signal with respect to the audio data is calculated and displayed on the display panel 110. By repeating trial and error while referring to changes in the score, the user can easily find the direction in which the user's own voice is more readily detected as a trigger word. In this way, the television apparatus 10 of the embodiment can assist the user's trial and error toward trigger word detection.
According to the television apparatus 10 of the embodiment, the score is calculated by normalizing the matching degree between the audio data and the audio signal. To detect the trigger word, the trigger detecting unit 13 calculates, for example, the matching degree between the audio data and the audio signal. However, such matching degrees are calculated from various elements with various meanings. Even if the calculated matching degree were presented to the user as-is, the user would find it difficult to understand its meaning and to judge whether an attempt is close to trigger word detection. Because the television apparatus 10 normalizes the matching degree before presenting it, the user can understand it intuitively and use it as an index for obtaining a higher score.
(modification 1)
Next, a television device according to modification 1 of the embodiment will be described with reference to fig. 7. The television device according to modification 1 is different from the above-described embodiment in that the calculated score is displayed for each phoneme.
Fig. 7 is a diagram showing an example of a score display screen 110b displayed on the television device according to modification 1 of the embodiment. As shown in fig. 7, the display control unit included in the television device according to modification 1 displays the score of the audio signal calculated by the score calculating unit for each phoneme included in the audio data on a score display screen 110 b.
This allows the user to identify the weak points of the user's own speech. For example, in the example shown in fig. 7, the phonemes "ie" and "bi" have low scores in the user's voice. By, for example, articulating the end of each word more carefully to raise these scores, the user can have the user's own voice detected as a trigger word more easily.
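A minimal sketch of the per-phoneme presentation of modification 1; the phoneme list, the score values, and the weakness cutoff of 50 are illustrative assumptions, not values from the document.

```python
def per_phoneme_scores(phonemes, scores, weak_below=50):
    """Pair each phoneme of the trigger word with its score and flag the
    weak ones, as on the fig. 7 screen (cutoff of 50 is assumed)."""
    return [(p, s, s < weak_below) for p, s in zip(phonemes, scores)]
```

Filtering the flagged entries immediately tells the user which phonemes of the utterance need work.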
(modification 2)
Next, a television device 30 according to modification 2 of the embodiment will be described with reference to fig. 8 to 10. The modification 2 is different from the above-described embodiment in that the television apparatus 30 displays the calculated score and the advice to the user at the same time.
Fig. 8 is a diagram showing an example of the functional configuration of the television apparatus 30 according to modification 2 of the embodiment. As shown in fig. 8, the television apparatus 30 according to modification 2 differs from the television apparatus 10 of the above-described embodiment in that it includes a display control unit 35 in place of the display control unit 15, and additionally includes a volume determination unit 31.
For example, when the test function is set to be valid, the sound volume determination unit 31 determines whether or not the sound volume setting of the speaker of the television apparatus 30 exceeds a predetermined value. When the volume setting exceeds a predetermined value, the display control unit 35 displays the calculated score and displays a message prompting the user to lower the volume setting.
Fig. 9 is a diagram showing an example of a score display screen 110c displayed on the television apparatus 30 according to modification 2 of the embodiment. As shown in fig. 9, the score display screen 110c displays a message such as "The sound of the television seems too loud. Please try to set the volume below 10."
One of the largest factors that makes a trigger word difficult to detect, lowering detection accuracy, is the sound emitted by the television apparatus's own speaker. By displaying a message prompting the user to lower the volume setting, the user may notice that the volume of the television apparatus 30 can lower the detection accuracy, and the trigger word becomes easier to detect.
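The volume check of modification 2 reduces to something like the following; the cutoff of 10 mirrors the message in fig. 9, and the predetermined value and names are otherwise assumptions.

```python
VOLUME_LIMIT = 10  # assumed predetermined value, mirroring the fig. 9 message

def volume_advice(volume_setting, limit=VOLUME_LIMIT):
    """Sketch of the volume determination unit 31: return an advisory
    message when the set's own audio likely masks the user's voice,
    or None when no advice is needed."""
    if volume_setting > limit:
        return ("The sound of the television seems too loud. "
                "Please try to set the volume below {}.".format(limit))
    return None
```

The display control unit 35 would then show the returned message alongside the score whenever it is not None.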
The display control unit 35 included in the television apparatus 30 according to modification 2 may also display, at random or in a predetermined order, suggestions for increasing the score and making trigger word detection easier.
Fig. 10 is a diagram showing another example of a score display screen 110d displayed on the television apparatus 30 according to modification 2 of the embodiment. As shown in fig. 10, the score display screen 110d displays, for example as scrolling text, messages such as "Please try to speak clearly.", "Please try to speak slowly.", and "Please try to speak loudly.", which address common factors that prevent trigger words from being detected.
This can prompt attempts that would not have occurred to the user, helping the user's own voice be detected as a trigger word.
(modification 3)
Next, a television device according to modification 3 of the embodiment will be described with reference to fig. 11. The modification 3 is different from the above-described embodiment in that the television apparatus displays scores for a plurality of trigger words.
Fig. 11 is a diagram showing an example of a score display screen 110e displayed on the television device according to modification 3 of the embodiment. As shown in fig. 11, the television device of modification 3 is provided with a plurality of trigger words, such as "nie ie, tie rie bi" (a Japanese reading corresponding to "hey, tv" in Chinese), "mo si mo si, tie rie bi" (a Japanese reading corresponding to "hello, tv" in Chinese), and "ha ro, tie rie bi" (a Japanese reading corresponding to "hello, tv" in Chinese). The score calculating unit of the television device according to modification 3 calculates a score for each of these trigger words, and the display control unit displays the scores for the plurality of trigger words on the score display screen 110e.
Following a message on the score display screen 110e prompting a given trigger word, such as "Please say 'nie ie, tie rie bi'.", the user utters each trigger word in turn and refers to the score corresponding to that trigger word. In the example shown in fig. 11, among the plurality of trigger words, the user obtains the highest score for the trigger word "mo si mo si, tie rie bi". Therefore, by choosing to use "mo si mo si, tie rie bi" from among the plurality of trigger words, the user can have the user's own voice detected as a trigger word more easily.
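The selection logic of modification 3 can be sketched as follows; the score values are invented for illustration, and only the ordering (with "mo si mo si, tie rie bi" scoring highest) follows fig. 11.

```python
def best_trigger_word(scores_by_trigger):
    """Fig. 11 idea: given the per-trigger-word scores measured for the
    user's voice, recommend the trigger word that voice matches best."""
    return max(scores_by_trigger, key=scores_by_trigger.get)

# Hypothetical scores for the three trigger words of modification 3.
scores = {"nie ie, tie rie bi": 62,
          "mo si mo si, tie rie bi": 88,
          "ha ro, tie rie bi": 70}
```

With these sample values, `best_trigger_word(scores)` would steer the user toward "mo si mo si, tie rie bi".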
In the above-described embodiment and modifications 1 to 3, the main voice recognition service is provided by the voice recognition server 20, a device external to the television apparatus 10 and the like, but the configuration of the embodiment is not limited to this. The television apparatus 10 and the like may include all functions related to the voice recognition service and provide the voice recognition service on its own.
In the above-described embodiment and modifications 1 to 3, the information processing apparatus having the voice recognition function is the television apparatus 10 or the like, but the configuration of the embodiment is not limited to this. For example, the information processing apparatus or communication apparatus having the voice recognition function may be another device such as a smart speaker. When the information processing apparatus is a smart speaker, the display unit that displays the score of the audio signal with respect to the audio data may be, for example, a separate monitor attached to the smart speaker.
In addition, the program that realizes the various functions described above in the television apparatus 10 or the like is provided as a computer program product in an installable or executable format. That is, the program is provided as part of a computer program product having a non-volatile computer-readable recording medium such as a CD-ROM, a floppy disk (FD), a CD-R, or a DVD.
Further, the program may be provided or distributed via a network in a state of being stored in a computer connected to a network such as the internet. The above program may also be provided in a state of being installed in advance in a ROM or the like.
By installing such a program in the television apparatus 10 or the like, the CPU of the television apparatus 10 or the like reads the program from the ROM, deploys the above-described functional configurations on the RAM, and executes them.
However, the above-described program may be provided as a network application stored in a cloud server or the like, in which case the program may be executed without being installed in the television apparatus 10 or the like.
Although embodiments of the present application have been described, these embodiments are presented as examples and are not intended to limit the scope of the application. The novel embodiments may be implemented in various other forms, and various omissions, substitutions, and changes may be made without departing from the gist of the invention. These embodiments and their modifications are included within the scope and gist of the invention, and within the inventions recited in the claims and their equivalents.

Claims (15)

  1. An information processing apparatus includes:
    an acquisition unit that acquires, as a sound signal, a sound of a user input to the sound input unit;
    a score calculation unit that calculates a score of the voice signal with respect to voice data, the score serving as a reference for detecting a trigger word for starting a voice recognition service from the voice signal; and
    and a display control unit that displays the score on a display unit.
  2. The information processing apparatus according to claim 1,
    the score calculation unit calculates the score by normalizing the degree of coincidence between the sound data and the sound signal.
  3. The information processing apparatus according to claim 2,
    the information processing apparatus includes a trigger word detection unit that detects the trigger word from the audio signal,
    the trigger word detection unit decomposes the audio data and the audio signal into a plurality of elements, calculates the degree of coincidence for each of the plurality of elements, and detects the trigger word from the audio signal based on the degrees of coincidence.
  4. The information processing apparatus according to claim 3,
    the score calculating unit calculates the score for each of the plurality of elements based on the degree of matching.
  5. The information processing apparatus according to claim 4,
    the display control unit displays the smallest score among the scores on the display unit.
  6. The information processing apparatus according to claim 4,
    the display control unit displays the scores calculated for the respective matching degrees on the display unit.
  7. The information processing apparatus according to claim 4,
    the display control unit displays an average value of the scores calculated for the respective matching degrees on the display unit.
  8. The information processing apparatus according to any one of claims 3 to 7,
    the plurality of elements are phonemes included in the trigger word.
  9. The information processing apparatus according to any one of claims 1 to 8,
    the score calculating section calculates the score for a plurality of the trigger words.
  10. The information processing apparatus according to claim 9,
    the display control unit displays the scores calculated for the plurality of trigger words on the display unit.
  11. The information processing apparatus according to any one of claims 1 to 10,
    the display control unit displays a suggestion for increasing the score on the display unit.
  12. The information processing apparatus according to any one of claims 1 to 11,
    the acquisition unit receives an input of an instruction to display the score on the display unit.
  13. The information processing apparatus according to any one of claims 1 to 12,
    the information processing apparatus includes an application execution unit that starts the voice recognition service when the trigger word is detected from the voice signal.
  14. The information processing apparatus according to any one of claims 1 to 13,
    the voice recognition service is provided by a voice recognition server connected through a network.
  15. A non-volatile storage medium readable by a computer, the storage medium storing a program for causing a computer to execute:
    acquiring a voice of a user input to a voice input unit as a voice signal;
    calculating a score of the voice signal with respect to voice data, wherein the score becomes a reference for detecting a trigger word from the voice signal, the trigger word being used to start a voice recognition service; and
    and displaying the score on a display part.
CN202080005757.3A 2019-12-05 2020-10-26 Information processing apparatus and nonvolatile storage medium Active CN113228170B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2019-220035 2019-12-05
JP2019220035A JP7248564B2 (en) 2019-12-05 2019-12-05 Information processing device and program
PCT/CN2020/123669 WO2021109751A1 (en) 2019-12-05 2020-10-26 Information processing apparatus and non-volatile storage medium

Publications (2)

Publication Number Publication Date
CN113228170A true CN113228170A (en) 2021-08-06
CN113228170B CN113228170B (en) 2023-06-27

Family

ID=76220032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080005757.3A Active CN113228170B (en) 2019-12-05 2020-10-26 Information processing apparatus and nonvolatile storage medium

Country Status (3)

Country Link
JP (1) JP7248564B2 (en)
CN (1) CN113228170B (en)
WO (1) WO2021109751A1 (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002196765A (en) * 2000-12-25 2002-07-12 Yamaha Corp Informing apparatus, musical instrument, and sounding apparatus for vehicle
CN101266593A (en) * 2008-02-25 2008-09-17 北京理工大学 Voice and audio frequency quality subjective evaluation method based on network opinion collection
JP2009124324A (en) * 2007-11-13 2009-06-04 Sharp Corp Sound apparatus and control method of sound apparatus
CN101547387A (en) * 2008-03-26 2009-09-30 鸿富锦精密工业(深圳)有限公司 Earphone and audio display system using same
US20120316876A1 (en) * 2011-06-10 2012-12-13 Seokbok Jang Display Device, Method for Thereof and Voice Recognition System
WO2015008502A1 (en) * 2013-07-19 2015-01-22 株式会社ベネッセコーポレーション Information processing device, information processing method, and program
CN104575504A (en) * 2014-12-24 2015-04-29 上海师范大学 Method for personalized television voice wake-up by voiceprint and voice identification
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
CN105702253A (en) * 2016-01-07 2016-06-22 北京云知声信息技术有限公司 Voice awakening method and device
JP2017098798A (en) * 2015-11-25 2017-06-01 オリンパス株式会社 Sound recorder, advice output method and program
CN107358954A (en) * 2017-08-29 2017-11-17 成都启英泰伦科技有限公司 It is a kind of to change the device and method for waking up word in real time
CN108538293A (en) * 2018-04-27 2018-09-14 青岛海信电器股份有限公司 Voice awakening method, device and smart machine
US20180275951A1 (en) * 2017-03-21 2018-09-27 Kabushiki Kaisha Toshiba Speech recognition device, speech recognition method and storage medium
CN109036393A (en) * 2018-06-19 2018-12-18 广东美的厨房电器制造有限公司 Wake-up word training method, device and the household appliance of household appliance
CN109601017A (en) * 2017-08-02 2019-04-09 松下知识产权经营株式会社 Information processing unit, sound recognition system and information processing method
US20190180744A1 (en) * 2017-12-11 2019-06-13 Hyundai Motor Company Apparatus and method for determining reliability of recommendation based on environment of vehicle

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05158493A (en) * 1991-12-10 1993-06-25 Fujitsu Ltd Speech recognizing device
JP2001005480A (en) * 1999-06-23 2001-01-12 Denso Corp User uttering discriminating device and recording medium
JP2006011641A (en) * 2004-06-23 2006-01-12 Fujitsu Ltd Information input method and device
CN101630448B (en) * 2008-07-15 2011-07-27 上海启态网络科技有限公司 Language learning client and system
JP2013072974A (en) * 2011-09-27 2013-04-22 Toshiba Corp Voice recognition device, method and program
US9536528B2 (en) * 2012-07-03 2017-01-03 Google Inc. Determining hotword suitability
US9767795B2 (en) * 2013-12-26 2017-09-19 Panasonic Intellectual Property Management Co., Ltd. Speech recognition processing device, speech recognition processing method and display device
US10789041B2 (en) * 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
KR102420450B1 (en) * 2015-09-23 2022-07-14 삼성전자주식회사 Voice Recognition Apparatus, Voice Recognition Method of User Device and Computer Readable Recording Medium
US20170330564A1 (en) * 2016-05-13 2017-11-16 Bose Corporation Processing Simultaneous Speech from Distributed Microphones
US10957322B2 (en) * 2016-09-09 2021-03-23 Sony Corporation Speech processing apparatus, information processing apparatus, speech processing method, and information processing method
CN109739354B (en) * 2018-12-28 2022-08-05 广州励丰文化科技股份有限公司 Voice-based multimedia interaction method and device

Also Published As

Publication number Publication date
WO2021109751A1 (en) 2021-06-10
JP2021089376A (en) 2021-06-10
CN113228170B (en) 2023-06-27
JP7248564B2 (en) 2023-03-29

Similar Documents

Publication Publication Date Title
EP3619707B1 (en) Customizable wake-up voice commands
JP4729927B2 (en) Voice detection device, automatic imaging device, and voice detection method
US11037574B2 (en) Speaker recognition and speaker change detection
US9477304B2 (en) Information processing apparatus, information processing method, and program
US9767795B2 (en) Speech recognition processing device, speech recognition processing method and display device
KR20170032096A (en) Electronic Device, Driving Methdo of Electronic Device, Voice Recognition Apparatus, Driving Method of Voice Recognition Apparatus, and Computer Readable Recording Medium
JP2022033258A (en) Speech control apparatus, operation method and computer program
JP7259307B2 (en) Minutes output device and control program for the minutes output device
US10916249B2 (en) Method of processing a speech signal for speaker recognition and electronic apparatus implementing same
EP2504745B1 (en) Communication interface apparatus and method for multi-user
US20180158462A1 (en) Speaker identification
KR20230118089A (en) User Speech Profile Management
JP6459330B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
EP3826009A1 (en) Electronic device and method for controlling the same, and storage medium
JP2001067091A (en) Voice recognition device
JP6616182B2 (en) Speaker recognition device, discriminant value generation method, and program
US7177806B2 (en) Sound signal recognition system and sound signal recognition method, and dialog control system and dialog control method using sound signal recognition system
CN113228170B (en) Information processing apparatus and nonvolatile storage medium
US10950227B2 (en) Sound processing apparatus, speech recognition apparatus, sound processing method, speech recognition method, storage medium
JP4938719B2 (en) In-vehicle information system
US20210134272A1 (en) Information processing device, information processing system, information processing method, and program
CN110875034A (en) Template training method for voice recognition, voice recognition method and system thereof
CN116504246B (en) Voice remote control method, device, storage medium and device based on Bluetooth device
EP3477634B1 (en) Information processing device and information processing method
KR20220064695A (en) Method and appratus for estimating driver intention using driver's voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant