WO2019142418A1 - Information processing device and information processing method - Google Patents

Information processing device and information processing method

Info

Publication number
WO2019142418A1
Authority
WO
WIPO (PCT)
Prior art keywords
information processing
user
control unit
input
noise
Prior art date
Application number
PCT/JP2018/038716
Other languages
English (en)
Japanese (ja)
Inventor
Ayumi Nakagawa
Kenji Sugihara
Original Assignee
Sony Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation
Publication of WO2019142418A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/16 Sound input; Sound output
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • the present disclosure relates to an information processing apparatus and an information processing method.
  • Patent Document 1 discloses a technique for selecting a word to be recognized from a dictionary according to the ambient noise environment.
  • the present disclosure proposes a novel and improved information processing apparatus and information processing method that can provide more convenient voice input means according to the situation.
  • According to the present disclosure, an information processing apparatus is provided that includes a control unit that controls a plurality of input modes related to voice input, wherein the control unit determines whether or not the currently executed input mode can be continued based on the ambient noise level and the reflection accuracy of the user intention.
  • The plurality of input modes include at least a free speech input mode and a command input mode.
  • According to the present disclosure, an information processing method is also provided in which a processor controls a plurality of input modes related to voice input, the controlling further comprising determining whether or not the currently executed input mode is continued based on the ambient noise level and the reflection accuracy of the user intention, wherein the plurality of input modes include at least a free speech input mode and a command input mode.
  • the agent device can output an answer using voice or visual information, for example, in response to a user's inquiry by speech. Also, the agent device can register a schedule or a task based on the user's speech, and can also present information to the user based on the registered information.
  • FIG. 1 is a diagram for describing a free speech input mode and a command input mode according to the present embodiment.
  • FIG. 1 shows a situation where the user U instructs the information processing terminal 10 to register a schedule using an utterance. Further, FIG. 1 shows a display area DA displayed by the information processing terminal 10 by the projection function. In the example shown in FIG. 1, the information processing terminal 10 displays two forms corresponding to the title of the schedule and the date and time.
  • In the free speech input mode, the user U can instruct the registration of the schedule using a free expression that is not constrained by the format of the forms, for example as in the utterance UO1a.
  • At this time, the information processing terminal 10 can extract an element corresponding to each form from the recognized utterance UO1a and input a character string corresponding to the element into the form.
  • In the free speech input mode, the user can thus speak freely, which improves the convenience of speech input.
  • In the free speech input mode, however, there is no recognition cue such as the commands described later, so recognition is easily influenced by ambient noise, especially the speech of others, and it is assumed that the accuracy of the speech recognition result declines in environments where noise is large.
  • In the command input mode, the user U specifies a command (here, a form name) corresponding to each form, for example as in the utterance UO1b, and then utters the content to be input to the form after the command.
  • At this time, the information processing terminal 10 can extract the command and the element uttered following the command from the recognized utterance UO1b, and can input a character string corresponding to the element into the form.
  • In the command input mode, by using a command included in the user's utterance as a cue, the influence of noise, such as another person's utterance that does not include the command, is reduced, and the accuracy of voice recognition can be improved.
  • On the other hand, since the command input mode forces the user to utter a command, the degree of freedom of voice input is reduced, and the user may be bothered and stressed.
  • For this reason, the free speech input mode may be applied in a quiet environment, and the command input mode may be applied in an environment where noise is large, so as to take advantage of the strengths of both input modes.
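  • As a rough illustration of the command input mode (none of the following code appears in the patent; the form names, the "X is ..." command phrasing, and the function name are assumptions), a command-cue parser might look like this minimal Python sketch:

```python
# Hypothetical sketch: routing an utterance to a form in the command input mode.
# FORMS, the "<form> is ..." phrasing, and all names are illustrative assumptions.

FORMS = ["title", "date and time"]  # forms shown in the display area DA

def parse_command_utterance(utterance: str):
    """Use a leading command (a form name) as the recognition cue; the rest of
    the utterance is the content to input into that form. Returns None when no
    command cue is found (e.g. another person's utterance, i.e. likely noise)."""
    lowered = utterance.lower()
    for form in FORMS:
        prefix = f"{form} is "
        if lowered.startswith(prefix):
            return form, utterance[len(prefix):].strip()
    return None

result = parse_command_utterance("Title is lunch meeting with the design team")
if result is not None:
    form, content = result
    print(f"input {content!r} into the {form!r} form")
```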
  • However, when the magnitude of noise is determined simply from the ambient noise level, an element that is not actually noise may end up being treated as noise.
  • the above elements include, for example, the user's own utterance.
  • The noise according to the present embodiment can therefore be defined as a sound that lowers the speech recognition accuracy for the speech of the user who intentionally executes speech input.
  • The noise according to the present embodiment may include various sounds; in particular, the speech of others who do not intend to perform voice input, sound emitted from various devices, and sounds similar to human speech can be said to have a large influence.
  • Conversely, if sounds with little influence on recognition accuracy are also treated as noise, the input mode is frequently switched to the command input mode even in environments where speech recognition accuracy does not actually decrease significantly, and the user's convenience is greatly reduced.
  • Examples of such low-influence sounds include the operating sound of a home appliance that emits no voice, the working sounds of a person who is not speaking, an intercom, an alarm, and the ringtone of a telephone.
  • In view of the above, the information processing apparatus according to the present embodiment may include a control unit that controls a plurality of input modes related to voice input. One feature of this control unit is that it determines whether or not the currently executed input mode can be continued based on the ambient noise level and the reflection accuracy of the user intention.
  • the above input mode may include at least a free speech input mode and a command input mode.
  • FIG. 2A and FIG. 2B are diagrams for explaining switching of the input mode according to the present embodiment.
  • FIG. 2A shows a situation where the user U instructs the information processing terminal 10 to register a schedule using an utterance. At this time, the user is registering the schedule in the free speech input mode.
  • persons P1 and P2 who talk are present around the user U and the information processing terminal 10.
  • In the example shown in FIG. 2A, the information processing server 20 that controls the information processing terminal 10 erroneously recognizes the utterance PO2 of the person P1 as an utterance of the user U, the original user, and outputs the character string corresponding to the utterance PO2 to the title form arranged in the display area DA of the information processing terminal 10.
  • Due to noise such as the utterance PO2, a phenomenon may occur in which content not intended by the user is input, or content already input by the user is overwritten.
  • To handle such a situation, the information processing server 20 according to the present embodiment may switch the input mode to the command input mode when it is determined that the reflection accuracy of the user intention has been lowered due to the influence of noise.
  • the reflection accuracy of the user intention according to the present embodiment may be defined as the accuracy with which the intention of the user who intends to intentionally execute the speech input is accurately reflected in the output.
  • In this case, the information processing server 20 can estimate, from the instruction for correcting the output voice recognition result given by the user U through the utterance UO2a or the like, that the reflection accuracy of the user intention has decreased.
  • the information processing server 20 according to the present embodiment may switch the input mode from the free speech input mode to the command input mode as shown in FIG. 2B. According to the above-described function of the information processing server 20 according to the present embodiment, it is possible to reduce the influence of noise and to improve the reflection accuracy of the user intention.
  • When switching to the command input mode, the information processing server 20 may make the user perceive the switching of the input mode by explicitly displaying the current input mode in the display area DA, as illustrated.
  • The information processing server 20 can also guide the user toward utterances suitable for the command input mode by displaying wording such as "Title is" and "Date and time is" corresponding to each form.
  • Further, the information processing server 20 may cause the information processing terminal 10 to output the voice SO2 or the like and actively inquire about the content to be input in each form, thereby guiding the user toward utterances suitable for the command input mode.
  • The outline of the present embodiment has been described above. As described, the information processing server 20 according to the present embodiment makes it possible to realize effective switching of the input mode according to the situation by performing the determination in consideration of the reflection accuracy of the user intention in addition to the noise level.
  • FIG. 3 is a block diagram showing an exemplary configuration of the information processing system according to the present embodiment.
  • the information processing system according to the present embodiment includes an information processing terminal 10 and an information processing server 20. Further, the information processing terminal 10 and the information processing server 20 are connected via the network 30 so as to be able to communicate with each other.
  • the information processing terminal 10 is an information processing apparatus that collects a user's speech and the like based on control by the information processing server 20 and provides the user with a speech recognition result.
  • the information processing terminal 10 according to the present embodiment is realized by, for example, a smartphone, a tablet, a head mounted display, a general-purpose computer, or a dedicated device of a stationary type or an autonomous moving type.
  • the information processing server 20 is an information processing apparatus that controls a plurality of input modes related to voice input. As described above, one of the features of the information processing server 20 according to the present embodiment is to determine whether the execution mode can be continued based on the ambient noise level and the reflection accuracy of the user intention.
  • the network 30 has a function of connecting the information processing terminal 10 and the information processing server 20.
  • the network 30 may include the Internet, a public network such as a telephone network, a satellite communication network, various LANs (Local Area Networks) including Ethernet (registered trademark), a WAN (Wide Area Network), and the like.
  • the network 30 may include a leased line network such as an Internet Protocol-Virtual Private Network (IP-VPN).
  • IP-VPN Internet Protocol-Virtual Private Network
  • the network 30 may also include a wireless communication network such as Wi-Fi (registered trademark) or Bluetooth (registered trademark).
  • the configuration example of the information processing system according to the present embodiment has been described above.
  • the configuration described above with reference to FIG. 3 is merely an example, and the configuration of the information processing system according to the present embodiment is not limited to such an example.
  • the functions of the information processing terminal 10 and the information processing server 20 according to the present embodiment may be realized by a single device.
  • The configuration of the information processing system according to the present embodiment can be flexibly modified according to specifications and operation.
  • FIG. 4 is a block diagram showing an example of a functional configuration of the information processing terminal 10 according to the present embodiment.
  • The information processing terminal 10 according to the present embodiment includes a display unit 110, an audio output unit 120, an audio input unit 130, an imaging unit 140, a sensor input unit 150, a control unit 160, and a server communication unit 170.
  • the display unit 110 has a function of outputting visual information such as an image or text.
  • the display unit 110 according to the present embodiment displays a voice recognition result based on control by the information processing server 20, for example.
  • the display unit 110 includes a display device or the like that presents visual information.
  • Examples of the display device include a liquid crystal display (LCD) device, an organic light emitting diode (OLED) device, and a touch panel.
  • the display unit 110 according to the present embodiment may output visual information by a projection function.
  • the voice output unit 120 has a function of outputting various sounds including voice.
  • the audio output unit 120 according to the present embodiment includes an audio output device such as a speaker or an amplifier.
  • the voice input unit 130 has a function of collecting sounds such as user's speech and ambient sound generated around the information processing terminal 10. Further, the audio input unit 130 may perform signal processing for converting the collected sound into a digital signal, various filtering processing, and the like.
  • the voice input unit 130 according to the present embodiment includes a plurality of microphones for collecting sound.
  • the imaging unit 140 has a function of capturing an image of the user or the surrounding environment.
  • the image information captured by the imaging unit 140 may be used for face recognition and gaze detection of the user by the information processing server 20.
  • the imaging unit 140 according to the present embodiment includes an imaging device capable of capturing an image. Note that the above image includes moving images as well as still images.
  • the sensor input unit 150 has a function of collecting various sensor information related to the surrounding environment and the user.
  • the sensor information collected by the sensor input unit 150 may be used for, for example, person detection by the information processing server 20.
  • the sensor input unit 150 includes, for example, a human sensor including an infrared sensor.
  • Control unit 160: The control unit 160 according to the present embodiment has a function of controlling each component of the information processing terminal 10.
  • the control unit 160 controls, for example, start and stop of each component. Further, the control unit 160 inputs a control signal generated by the information processing server 20 to the display unit 110 or the audio output unit 120.
  • The control unit 160 according to the present embodiment may have a function equivalent to that of the input/output control unit 240 of the information processing server 20 described later.
  • the server communication unit 170 has a function of performing information communication with the information processing server 20 via the network 30. Specifically, the server communication unit 170 transmits, to the information processing server 20, the sound information collected by the voice input unit 130, the image information captured by the imaging unit 140, and the sensor information collected by the sensor input unit 150. The server communication unit 170 also receives, from the information processing server 20, a control signal and the like relating to the output of the speech recognition result.
  • the example of the functional configuration of the information processing terminal 10 according to the present embodiment has been described above.
  • the above configuration described using FIG. 4 is merely an example, and the functional configuration of the information processing terminal 10 according to the present embodiment is not limited to such an example.
  • the information processing terminal 10 according to the present embodiment may not necessarily include all of the configurations shown in FIG. 4.
  • the information processing terminal 10 can be configured not to include the imaging unit 140, the sensor input unit 150, and the like.
  • the control unit 160 according to the present embodiment may have the same function as the input / output control unit 240 of the information processing server 20.
  • The functional configuration of the information processing terminal 10 according to the present embodiment can be flexibly modified according to specifications and operation.
  • FIG. 5 is a block diagram showing an example of a functional configuration of the information processing server 20 according to the present embodiment.
  • the information processing server 20 according to the present embodiment includes a noise level calculation unit 210, a voice recognition unit 220, a user recognition unit 230, an input / output control unit 240, and a terminal communication unit 250.
  • the noise level calculator 210 calculates the level of noise generated around the information processing terminal 10.
  • the noise level calculation unit 210 may estimate the type of noise from the sound collected by the microphone provided in the information processing terminal 10 and the like, assign different weights to each type, and may comprehensively determine the noise level.
  • For example, the noise level calculation unit 210 may apply weights in the order human speech > voice output from a television apparatus or radio > pet calls > operating sounds of home appliances > outdoor sounds such as vehicle traffic and construction noise, so that the calculated noise level reflects the degree of influence of each type of noise. Further, the noise level calculation unit 210 may comprehensively determine the noise level in consideration of the distance between the sound source emitting the noise and the information processing terminal 10, the sound pressure of the noise, and the like.
  • the noise level calculation unit 210 can also estimate the noise level from the image captured by the information processing terminal 10. By analyzing the image and recognizing the sound source, the noise level calculation unit 210 can estimate the possibility of noise being generated in the future even if noise is not generated at the present time.
  • the noise level calculation unit 210 may estimate the noise level based on prior knowledge.
  • The noise level calculation unit 210 can calculate the noise level based on, for example, the position of home appliances in the room where the information processing terminal 10 is installed, or information related to periodically generated noise (for example, a brass band practicing outdoors at a nearby school every Friday).
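  • A minimal sketch of such a weighted noise-level calculation follows; the weights, type names, and the simple inverse-distance attenuation are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch of a weighted noise-level calculation for the noise level
# calculation unit 210. All weights and names are assumptions for illustration.

NOISE_WEIGHTS = {            # heavier weight = larger influence on recognition
    "human_speech": 1.0,
    "tv_radio_voice": 0.8,
    "pet_call": 0.5,
    "appliance": 0.3,
    "outdoor": 0.1,          # e.g. vehicle traffic, construction sound
}

def noise_level(sources: list) -> float:
    """Each source: {'type': str, 'sound_pressure': float, 'distance_m': float}."""
    total = 0.0
    for s in sources:
        weight = NOISE_WEIGHTS.get(s["type"], 0.1)
        attenuation = 1.0 / max(s["distance_m"], 1.0)  # crude distance falloff
        total += weight * s["sound_pressure"] * attenuation
    return total

print(noise_level([
    {"type": "human_speech", "sound_pressure": 0.7, "distance_m": 2.0},
    {"type": "appliance", "sound_pressure": 0.9, "distance_m": 1.0},
]))  # ≈ 0.62
```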
  • the speech recognition unit 220 executes speech recognition processing based on the sound collected by the information processing terminal 10. Further, the speech recognition unit 220 controls the setting of a session to be subjected to the speech recognition process and a listening area corresponding to the session.
  • the above-mentioned session refers to continuous interaction between the user and the information processing server 20.
  • the session being opened for the user indicates that it is possible to continuously execute speech recognition processing for a plurality of utterances by the user.
  • In general, an activation word for starting voice input must be input for each question or request, but it is troublesome for the user to speak the activation word every time. Therefore, once the speech recognition unit 220 according to the present embodiment detects an activation word, it can open a session with the user who uttered the activation word and thereafter process that user's utterances continuously.
  • the voice recognition unit 220 sets a listening area corresponding to the session with respect to the direction of the user who has uttered the activation word.
  • The above-mentioned listening area may be a direction in which beamforming related to sound collection is performed. According to the above-described function of the speech recognition unit 220, it is possible to efficiently and continuously recognize the speech of the user who has uttered the activation word.
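  • A minimal sketch of a session with an angular listening area follows; the class layout and the 30-degree beam width are illustrative assumptions (the patent only says the area is formed at a predetermined angle):

```python
# Hypothetical sketch: a session opened by an activation word, with a listening
# area (angular beamforming range) set toward the speaker's direction.
from dataclasses import dataclass

@dataclass
class Session:
    user_id: str
    center_deg: float             # direction of the user who uttered the activation word
    half_width_deg: float = 15.0  # assumed: listening area formed at a fixed angle

    def in_listening_area(self, direction_deg: float) -> bool:
        """Sounds detected outside this range are, in principle, treated as noise."""
        diff = (direction_deg - self.center_deg + 180.0) % 360.0 - 180.0
        return abs(diff) <= self.half_width_deg

session = Session(user_id="U", center_deg=90.0)
print(session.in_listening_area(100.0))  # True: within the beamformed range
print(session.in_listening_area(200.0))  # False: candidate noise
```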
  • the user recognition unit 230 performs detection of a person, identification of a user, and the like based on the sound, the image, and the sensor information collected by the information processing terminal 10. Also, the user recognition unit 230 can perform recognition of the face area of a user or a person, identification of an expression, gaze detection, and the like.
  • the input / output control unit 240 controls the output related to the speech recognition result recognized by the speech recognition unit 220.
  • the input / output control unit 240 causes, for example, the information processing terminal 10 to display a voice input interface for performing schedule registration, task registration, and the like.
  • The input/output control unit 240 according to the present embodiment also has a function of controlling a plurality of input modes related to voice input. One feature of the input/output control unit 240 is that it determines whether the currently executed input mode can be continued based on the ambient noise level and the reflection accuracy of the user intention.
  • the input mode according to the present embodiment includes a free speech input mode, a command input mode, and the like.
  • For example, when the noise level detected during execution of the free speech input mode is equal to or higher than a threshold and the reflection accuracy of the user intention is equal to or lower than a threshold, the input/output control unit 240 may change the input mode to the command input mode. According to this function of the input/output control unit 240 according to the present embodiment, the influence of noise can be reduced and the reflection accuracy of the user intention can be enhanced.
  • At this time, the input/output control unit 240 may first execute a first determination relating to the noise level, and execute a second determination relating to the reflection accuracy of the user intention only when the noise level is equal to or higher than the threshold. Such processing realizes a more efficient and effective determination, since the input mode is not changed immediately whenever the noise level is high but only after considering the user's feedback and the like.
  • Note that the input/output control unit 240 does not necessarily have to perform the multistage determination described above, and may switch the input mode based on a single determination that combines the noise level and the reflection accuracy of the user intention. Details of the functions of the input/output control unit 240 according to the present embodiment will be described separately later.
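  • The two-stage determination could be sketched as follows; the threshold values and function names are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch of the two-stage determination by the input/output control
# unit 240: a first check on the noise level and, only if it is at or above the
# threshold, a second check on the reflection accuracy of the user intention.

NOISE_THRESHOLD = 0.6        # assumed value
REFLECTION_THRESHOLD = 0.5   # assumed value

def decide_input_mode(current_mode: str, noise_level: float,
                      reflection_accuracy: float) -> str:
    if current_mode != "free_speech":
        return current_mode
    # First determination: is the ambient noise level at or above the threshold?
    if noise_level < NOISE_THRESHOLD:
        return current_mode  # stay in the free speech input mode
    # Second determination: is the user's intention still reflected accurately?
    if reflection_accuracy <= REFLECTION_THRESHOLD:
        return "command"     # switch to the command input mode
    return current_mode      # noisy, but recognition still works: do not switch

print(decide_input_mode("free_speech", noise_level=0.8, reflection_accuracy=0.3))
# -> "command"
```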
  • Terminal communication unit 250: The terminal communication unit 250 according to the present embodiment performs information communication with the information processing terminal 10 via the network 30. For example, the terminal communication unit 250 receives sound information, image information, sensor information, and the like from the information processing terminal 10. In addition, the terminal communication unit 250 transmits to the information processing terminal 10 the control signal related to output control generated by the input/output control unit 240.
  • the functional configuration example of the information processing server 20 according to an embodiment of the present disclosure has been described.
  • the above configuration described using FIG. 5 is merely an example, and the functional configuration of the information processing server 20 according to the present embodiment is not limited to such an example.
  • the configuration shown above may be realized by being distributed by a plurality of devices.
  • the functions of the information processing terminal 10 and the information processing server 20 may be realized by a single device.
  • The functional configuration of the information processing server 20 according to the present embodiment can be flexibly modified according to specifications and operation.
  • the input / output control unit 240 can switch the input mode based on the noise level and the reflection accuracy of the user intention.
  • the input / output control unit 240 may perform, for example, estimation of the reflection accuracy of the user intention or noise determination based on the listening area described above.
  • FIGS. 6 and 7 are diagrams for describing estimation of the reflection accuracy of the user intention according to the present embodiment.
  • FIGS. 6 and 7 show a situation where the user U instructs the information processing terminal 10 to register a schedule using an utterance.
  • In FIGS. 6 and 7, the listening area R corresponding to the session opened by the user U is shown.
  • the listening area R according to the present embodiment may be formed at a predetermined angle as illustrated.
  • In the example shown in FIG. 6, the character string corresponding to the utterance PO6 of the person P1 is output to the title form arranged in the display area DA of the information processing terminal 10.
  • the input / output control unit 240 may estimate the reflection accuracy of the user's intention based on the feedback of the user U with respect to the speech recognition result “it seems to rain” displayed in the display area DA.
  • the above feedback includes, for example, a correction instruction using the utterance SO6 by the user U or the like.
  • the input / output control unit 240 can determine that the reflection accuracy of the user intention is low and can switch the input mode to the command input mode.
  • The above-mentioned correction instruction broadly includes instructions such as partial correction of the input content, overwriting, and correction of the input form. Also, when it is recognized that the user repeats the same utterance within a short time, the input/output control unit 240 may estimate that the user is trying to overwrite the input content with the correct content and determine that the reflection accuracy of the user intention is low. Thus, the correction instruction according to the present embodiment need not be explicit.
  • the above feedback includes the non-speech response of the user who perceived the speech recognition result.
  • When such a negative non-verbal reaction is detected, the input/output control unit 240 may determine that the speech recognition result is not what the user intended, that is, that the reflection accuracy of the user intention is low.
  • The non-verbal reaction according to the present embodiment broadly includes, in addition to facial expressions as described above, actions such as shaking the head, clicking the tongue, and sighing.
  • the user recognition unit 230 can detect the non-speech reaction as described above based on various types of information collected by the information processing terminal 10.
  • The estimation of the reflection accuracy of the user intention is subsequently described with reference to FIG. 7. Unlike the example shown in FIG. 6, in the example shown in FIG. 7 the person P1 is located in the listening area R set for the user U. In this case, it may be difficult for the input/output control unit 240 to determine whether feedback such as that described with reference to FIG. 6 was performed intentionally by the user U or incidentally by the person P1.
  • the input / output control unit 240 can more accurately estimate the reflection accuracy of the user intention by performing the determination as described below.
  • the reflection accuracy of the user intention may be estimated based on the completeness, the logic, and the like of the sentence related to the speech recognition result.
  • In FIG. 7, the speech recognition result "Memoruso", recognized based on the utterance PO7 of the person P1, is displayed.
  • It is assumed that, because the person P1 does not speak with the face directed toward the information processing terminal 10, some information is lost during sound collection or recognition.
  • Based on the incompleteness of such a sentence, the input/output control unit 240 may estimate the reflection accuracy of the user intention.
  • the input / output control unit 240 can determine the completeness on the basis of the correctness of the grammar, etc., in addition to the missing characters as described above.
  • the input / output control unit 240 can also estimate the reflection accuracy of the user intention based on the logic of the sentence, that is, whether the sentence makes sense.
  • In addition, the input/output control unit 240 may estimate the reflection accuracy of the user intention based on the coherence of the text related to the speech recognition result.
  • Here, the input/output control unit 240 may consider the context of the dialog with the user. For example, since it is usually hard to imagine that the user U would make an utterance such as "Memoruso" when performing schedule registration, the input/output control unit 240 may determine that the reflection accuracy of the user intention is low.
  • The input/output control unit 240 may also estimate the reflection accuracy of the user intention based on the difference between the speech volume of the activation word detected at the start of the session and the volume of utterances detected after the activation word in the same session. In the example shown in FIG. 7, the large difference between the volume of the utterance UO7 of the user U relating to the activation word and the volume of the utterance PO7 of the person P1 issued thereafter is represented by the difference in font size.
  • When such a volume difference is detected, the input/output control unit 240 may estimate that the later utterance is not from the original user and determine that the reflection accuracy of the user intention is low.
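  • Combining the signals above, a reflection-accuracy estimator might be sketched as follows; the scoring weights and the 50% volume-gap criterion are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch: estimating the reflection accuracy of the user intention
# from user feedback, sentence completeness, and the activation-word volume gap.

def estimate_reflection_accuracy(correction_instructed: bool,
                                 negative_reaction: bool,
                                 repeated_utterance: bool,
                                 sentence_completeness: float,  # 0.0 - 1.0
                                 activation_volume: float,
                                 utterance_volume: float) -> float:
    score = sentence_completeness
    # Explicit or implicit feedback lowers the estimate.
    if correction_instructed:
        score -= 0.4
    if negative_reaction:      # e.g. shaking the head, clicking the tongue, sighing
        score -= 0.3
    if repeated_utterance:     # user seems to be overwriting with correct content
        score -= 0.2
    # A large gap between activation-word volume and later utterance volume
    # suggests the later utterance is not from the original user.
    if activation_volume > 0 and abs(activation_volume - utterance_volume) / activation_volume > 0.5:
        score -= 0.3
    return max(0.0, min(1.0, score))

print(estimate_reflection_accuracy(True, False, False, 0.6, 1.0, 0.4))
# -> 0.0 (correction instruction plus volume gap floor the estimate)
```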
  • the listening area according to the present embodiment is an area set corresponding to the session opened for the user who has uttered the activation word.
  • the input / output control unit 240 may basically determine the sound including the speech detected outside the listening area as noise.
  • The speech recognition unit 220 according to the present embodiment can also form a plurality of sessions for a plurality of users. For this reason, when the activation word is included in speech detected from outside the listening area R, the input/output control unit 240 can determine that a new user has requested voice input and that the speech is not noise. At this time, the input/output control unit 240 newly opens a session based on the activation word and sets a listening area R corresponding to the session in the direction in which the activation word was detected.
  • On the other hand, it is also assumed that the person P1, who does not intend to start a session, incidentally makes the utterance PO8 including the activation word or a word similar to the activation word, or that the television apparatus NS or the like outputs the voice SO8a including the activation word.
  • In such a case, when no user who is presumed to have intentionally issued the activation word is detected, the input/output control unit 240 may determine the activation word as noise.
  • For example, the input/output control unit 240 determines whether a person is detected in the direction from which the activation word was detected. When no person is detected in that direction, the input/output control unit 240 may process the activation word included in the voice SO8a as noise. Likewise, when it has learned as prior knowledge that the television apparatus NS is arranged in that direction, the input/output control unit 240 may process an activation word detected from that direction as noise.
  • the input / output control unit 240 may determine the activation word as noise if an utterance following the activation word is not detected. For example, when the conversation of the persons P1 and P2 is ended by the utterance PO8 of the person P1, the utterance following the activation word is not detected. In this case, the input / output control unit 240 processes the activation word included in the utterance PO8 as noise. can do.
  • the input / output control unit 240 can also perform noise determination based on the direction of the line of sight or face of the person P1 who has uttered the utterance PO8. For example, as illustrated, when the person P1 does not face the direction of the information processing terminal 10, the input / output control unit 240 estimates that the activation word included in the utterance PO8 is not intended to start the session. The activation word can be treated as noise.
  • The input/output control unit 240 may also request the person presumed to have issued the activation word to perform a predetermined action, and determine the activation word as noise when no user performing the action is detected. For example, in the example illustrated in FIG. 8, the input/output control unit 240 causes the information processing terminal 10 to output a voice SO8b requesting that the person turn their face toward the information processing terminal 10.
  • When no user responding to such a request is detected, the input/output control unit 240 can process the detected activation word as noise.
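  • The checks above could be combined into a noise filter for activation words detected outside the listening area; the Observation fields and the ordering of the checks are illustrative assumptions:

```python
# Hypothetical sketch: deciding whether an activation word detected outside the
# listening area is noise, following the checks described in the text above.
from dataclasses import dataclass

@dataclass
class Observation:
    person_detected: bool             # anyone in the detection direction?
    known_device_in_direction: bool   # e.g. the television apparatus NS (prior knowledge)
    followup_speech: bool             # was an utterance detected after the activation word?
    facing_terminal: bool             # gaze/face toward the information processing terminal 10?
    performed_requested_action: bool  # e.g. turned their face when asked (voice SO8b)

def activation_word_is_noise(obs: Observation) -> bool:
    if not obs.person_detected:
        return True   # no one there: e.g. voice SO8a output by a device
    if obs.known_device_in_direction:
        return True   # prior knowledge says a TV sits in that direction
    if not obs.followup_speech:
        return True   # conversation ended right after the word: e.g. utterance PO8
    if not obs.facing_terminal and not obs.performed_requested_action:
        return True   # no sign the speaker intended to open a session
    return False

print(activation_word_is_noise(Observation(True, False, True, False, False)))  # True
```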
  • The input/output control unit 240 according to the present embodiment can also present the occurrence of noise to the user by causing the information processing terminal 10 to display information related to the detected noise.
  • FIG. 9 is a diagram for explaining the presentation of the noise occurrence state according to the present embodiment.
  • FIG. 9 shows an example of the case where the input / output control unit 240 visualizes the occurrence of noise detected in the space where the user is present and causes the information processing terminal 10 to output it.
  • the occurrence of noise around the information processing terminal 10 is represented by an object O1 arranged on the display area DA.
  • the object O1 may be an indicator indicating the magnitude of noise around the information processing terminal 10 for each position.
  • In FIG. 9, higher dot density indicates larger noise.
  • the input / output control unit 240 can control the display of the object O1 based on, for example, the relative positions of the people P1 and P2 that are noise generation sources and the information processing terminal 10. Also, the input / output control unit 240 may present the user with the risk of noise that may occur in the future by the object O1 or the like.
  • For example, when the persons P1 and P2 are detected, the input/output control unit 240 can estimate that there is a possibility that the persons P1 and P2 will talk in the future, and can display that possibility on the information processing terminal 10 as visual information. At this time, the input/output control unit 240 may cause the display area DA to display detected noise sources such as persons or devices.
  • The input/output control unit 240 can also request the user to move to a position with less noise, for example by displaying the object O2 or the like, when the noise is large or when the future occurrence of noise is estimated.
  • By presenting information related to noise to the user in this way, the input/output control unit 240 can prompt the user to move spontaneously or explicitly request the user to move, thereby improving speech recognition accuracy.
  • Note that the visual expression of noise according to the present embodiment is not limited to the example shown in FIG. 9 and can be designed as appropriate.
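  • As a rough illustration of the object O1 idea (higher dot density = larger noise), a text-grid rendering might look like the following sketch; the grid size, the distance-falloff model, and the character ramp are all assumptions:

```python
# Hypothetical sketch: rendering per-position noise magnitude as a density map,
# a text stand-in for the indicator object O1 in the display area DA.

def render_noise_map(sources: list, width: int = 20, height: int = 6) -> str:
    """sources: (x, y, magnitude in 0..1); denser characters mean larger noise."""
    chars = " .:*#"
    grid = [[0.0] * width for _ in range(height)]
    for x, y, mag in sources:
        for gy in range(height):
            for gx in range(width):
                dist = ((gx - x) ** 2 + (gy - y) ** 2) ** 0.5
                grid[gy][gx] = max(grid[gy][gx], mag / (1.0 + dist))
    return "\n".join(
        "".join(chars[min(int(v * len(chars)), len(chars) - 1)] for v in row)
        for row in grid
    )

# Two talking persons (P1, P2) as noise sources near the right edge:
print(render_noise_map([(15, 2, 0.9), (17, 3, 0.8)]))
```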
  • FIG. 10 is a flowchart showing a flow of control by the information processing server 20 according to the present embodiment.
  • Although the case where the information processing server 20 performs a first determination relating to the noise level and a second determination relating to the reflection accuracy of the user intention is described below as an example, the information processing server 20 does not necessarily have to perform such a multistage determination.
  • the terminal communication unit 250 receives the collected information collected by the information processing terminal 10 (S1101).
  • the above collected information includes sound information, image information, and sensor information.
  • the noise level calculation unit 210 calculates the noise level based on the collected information received in step S1101 (S1102).
  • the input / output control unit 240 determines whether the noise level calculated in step S1102 is equal to or higher than a threshold (S1103).
  • When the noise level is below the threshold (S1103: No), the information processing server 20 returns to the standby state.
  • When the noise level is equal to or higher than the threshold (S1103: Yes), recognition of the user state by the user recognition unit 230 is executed (S1104).
  • the user recognition unit 230 is not limited to the order shown in FIG. 10, and may execute recognition processing regarding the user and surrounding persons continuously.
  • the voice recognition unit 220 continuously attempts to detect a correction instruction by the user (S1105).
  • the input / output control unit 240 determines whether the reflection accuracy of the user intention estimated based on the feedback of the user detected in step S1104 or step S1105 is equal to or less than a threshold (S1106).
  • When the reflection accuracy exceeds the threshold (S1106: No), the information processing server 20 returns to the standby state.
  • When the reflection accuracy is equal to or lower than the threshold (S1106: Yes), the input/output control unit 240 subsequently performs control to switch the input mode to the command input mode (S1107).
  • the input / output control unit 240 may switch the input mode at the timing when the user's speech is interrupted, or may switch the input mode immediately. Further, the input / output control unit 240 may request the user to temporarily stop speech so that the speech of the user is not lost by switching.
  • the input / output control unit 240 can also determine that an utterance that does not comply with the above request is noise.
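  • The flow S1101-S1107 could be sketched as a single control step; the ServerStub class and its method names are illustrative stand-ins for the units of the information processing server 20 described above:

```python
# Hypothetical sketch of the control flow of FIG. 10 (steps S1101-S1107).
import random

class ServerStub:
    noise_threshold = 0.6        # assumed values
    reflection_threshold = 0.5

    def receive_collected_info(self):                     # S1101 (terminal communication unit 250)
        return {"sound": random.random()}

    def calculate_noise_level(self, collected):           # S1102 (noise level calculation unit 210)
        return collected["sound"]

    def recognize_user_state(self, collected):            # S1104 (user recognition unit 230)
        return {"negative_reaction": random.random() < 0.3}

    def detect_correction_instruction(self, collected):   # S1105 (voice recognition unit 220)
        return random.random() < 0.3

    def estimate_reflection_accuracy(self, state, correction):
        return 0.2 if (state["negative_reaction"] or correction) else 0.9

def control_step(server) -> str:
    collected = server.receive_collected_info()                   # S1101
    if server.calculate_noise_level(collected) < server.noise_threshold:
        return "standby"                                          # S1103: No
    state = server.recognize_user_state(collected)                # S1104
    correction = server.detect_correction_instruction(collected)  # S1105
    if server.estimate_reflection_accuracy(state, correction) > server.reflection_threshold:
        return "standby"                                          # S1106: No
    return "switch_to_command_input_mode"                         # S1107

print(control_step(ServerStub()))
```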
  • In addition to switching the input mode, the information processing server 20 may further perform control such as enabling input only to the form that the user is looking at, or lowering the volume of a television device that generates noise.
  • the information processing server 20 can also provide the user with an interface for designating a session by himself. At this time, the user may be able to actively specify a session, for example, by voice.
  • Although the case where the information processing server 20 starts a session based on the activation word has been described, the information processing server 20 can also start a session based on, for example, detection of a user's face.
  • the information processing server 20 may perform control such as lowering the priority of the session for a face detected in the direction in which the television device is arranged.
  • FIG. 11 is a block diagram illustrating an example of a hardware configuration of the information processing server 20 according to an embodiment of the present disclosure.
  • The information processing terminal 10 and the information processing server 20 include, for example, a processor 871, a ROM 872, a RAM 873, a host bus 874, a bridge 875, an external bus 876, an interface 877, an input device 878, an output device 879, a storage 880, a drive 881, a connection port 882, and a communication device 883.
  • the hardware configuration shown here is an example, and some of the components may be omitted. In addition, components other than the components shown here may be further included.
  • The processor 871 functions as, for example, an arithmetic processing unit or a control unit, and controls all or part of the operation of each component based on various programs recorded in the ROM 872, the RAM 873, the storage 880, or a removable recording medium 901.
  • the ROM 872 is a means for storing a program read by the processor 871, data used for an operation, and the like.
  • the RAM 873 temporarily or permanently stores, for example, a program read by the processor 871 and various parameters and the like that appropriately change when the program is executed.
  • the processor 871, the ROM 872, and the RAM 873 are connected to one another via, for example, a host bus 874 capable of high-speed data transmission.
  • host bus 874 is connected to external bus 876, which has a relatively low data transmission speed, via bridge 875, for example.
  • the external bus 876 is also connected to various components via an interface 877.
  • Input device 878: For the input device 878, for example, a mouse, a keyboard, a touch panel, buttons, switches, levers, and the like are used. Furthermore, a remote controller capable of transmitting control signals using infrared rays or other radio waves may be used as the input device 878.
  • the input device 878 also includes a voice input device such as a microphone.
  • The output device 879 is a device that can visually or aurally notify the user of acquired information, such as a display device (a CRT (Cathode Ray Tube), LCD, or organic EL display), an audio output device (a speaker or headphones), a printer, a mobile phone, or a facsimile. The output device 879 according to the present disclosure also includes various vibration devices capable of outputting haptic stimulation.
  • the storage 880 is a device for storing various data.
  • As the storage 880, for example, a magnetic storage device such as a hard disk drive (HDD), a semiconductor storage device, an optical storage device, or a magneto-optical storage device is used.
  • the drive 881 is a device that reads information recorded on a removable recording medium 901 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, or writes information on the removable recording medium 901, for example.
  • the removable recording medium 901 is, for example, DVD media, Blu-ray (registered trademark) media, HD DVD media, various semiconductor storage media, and the like.
  • the removable recording medium 901 may be, for example, an IC card equipped with a non-contact IC chip, an electronic device, or the like.
  • The connection port 882 is, for example, a port for connecting an external device 902, such as a USB (Universal Serial Bus) port, an IEEE 1394 port, a SCSI (Small Computer System Interface) port, an RS-232C port, or an optical audio terminal.
  • the external connection device 902 is, for example, a printer, a portable music player, a digital camera, a digital video camera, an IC recorder, or the like.
  • the communication device 883 is a communication device for connecting to a network.
  • As the communication device 883, for example, a communication card for wired or wireless LAN, Bluetooth (registered trademark), or WUSB (Wireless USB), a router for optical communication, an ADSL (Asymmetric Digital Subscriber Line) router, or a modem for various types of communication is used.
  • As described above, the information processing server 20 according to the present embodiment includes the input/output control unit 240 that controls a plurality of input modes related to voice input. One feature of the input/output control unit 240 is that it determines whether or not the currently executed input mode can be continued based on the ambient noise level and the reflection accuracy of the user intention. Further, the plurality of input modes include at least a free speech input mode and a command input mode. According to such a configuration, it is possible to provide more convenient voice input means according to the situation.
  • Each step in the processing of the information processing server 20 in this specification does not necessarily have to be processed chronologically in the order described in the flowchart.
  • the steps related to the processing of the information processing server 20 may be processed in an order different from the order described in the flowchart or may be processed in parallel.
  • (1) An information processing apparatus comprising a control unit that controls a plurality of input modes related to voice input, wherein the control unit determines whether or not the currently executed input mode can be continued based on the ambient noise level and the reflection accuracy of the user intention, and the plurality of input modes include at least a free speech input mode and a command input mode.
  • (2) The control unit changes the input mode to the command input mode when the noise level detected during execution of the free speech input mode is equal to or higher than a threshold and the reflection accuracy of the user intention is equal to or lower than a threshold.
  • (3) The control unit executes a first determination related to the noise level, and executes a second determination related to the reflection accuracy of the user intention when the noise level is equal to or higher than a threshold.
  • (4) The control unit estimates the reflection accuracy of the user intention based on user feedback on the speech recognition result. The information processing apparatus according to any one of (1) to (3).
  • (5) The feedback includes a correction instruction for the speech recognition result, and the control unit determines that the reflection accuracy of the user intention is low when the correction instruction by the user is recognized.
  • (6) The feedback includes a non-verbal reaction of the user who perceived the speech recognition result, and the control unit determines that the reflection accuracy of the user intention is low when a negative non-verbal reaction by the user is detected. The information processing apparatus according to (4) or (5).
  • (7) The control unit estimates the reflection accuracy of the user intention based on the completeness or logical coherence of a sentence related to the speech recognition result. The information processing apparatus according to any one of (4) to (6).
  • (8) The control unit estimates the reflection accuracy of the user intention based on the difference between the detected speech volume of the activation word related to the start of the voice input and the volume of the speech detected following the activation word. The information processing apparatus according to any one of (1) to (7).
  • (9) The control unit sets a listening area for a user who has uttered an activation word relating to the start of the voice input, and determines a sound detected from outside the listening area to be noise. The information processing apparatus according to any one of (1) to (8).
  • (10) The control unit determines that a sound other than the activation word detected from outside the listening area is noise.
  • (11) The control unit determines the activation word to be noise when the activation word is detected outside the listening area and a user who is presumed to have intentionally uttered the activation word is not detected. The information processing apparatus according to (10).
  • (12) The control unit determines the activation word to be noise when the activation word is detected outside the listening area and speech following the activation word is not detected. The information processing apparatus according to (10) or (11).
  • (13) The control unit requests a person who is presumed to have uttered the activation word to perform a predetermined action, and determines the activation word to be noise when no person performing the action is detected. The information processing apparatus according to any one of (10) to (12).
  • (14) The control unit guides the user to make an utterance suitable for the command input mode. The information processing apparatus according to any one of (1) to (13).
  • (15) The control unit displays, as visual information, the occurrence of noise in the space where the user is present. The information processing apparatus according to any one of (1) to (14).
  • (16) The control unit displays, as visual information, the possibility of noise occurring in the space where the user is present.
  • (17) The control unit requests the user to move to a position with less noise in the space where the user is present.
  • An information processing method comprising controlling, by a processor, a plurality of input modes related to voice input, the controlling further comprising determining whether or not the currently executed input mode is continued based on the ambient noise level and the reflection accuracy of the user intention, wherein the plurality of input modes include at least a free speech input mode and a command input mode.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The purpose of the present invention is to provide more convenient voice input means. More specifically, provided is an information processing device comprising a control unit for controlling a plurality of input modes related to voice input, wherein the control unit determines whether or not to continue the currently executed mode on the basis of an ambient noise level and the reflection accuracy of the user intention, and the plurality of input modes include at least a free speech input mode and a command input mode. Also provided is an information processing method comprising controlling, by a processor, a plurality of input modes related to voice input, the controlling further comprising determining whether or not to continue the currently executed mode on the basis of an ambient noise level and the reflection accuracy of the user intention, the plurality of input modes including at least a free speech input mode and a command input mode.
PCT/JP2018/038716 2018-01-22 2018-10-17 Information processing device and information processing method WO2019142418A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018008157 2018-01-22
JP2018-008157 2018-01-22

Publications (1)

Publication Number Publication Date
WO2019142418A1 true WO2019142418A1 (fr) 2019-07-25

Family

ID=67301368

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/038716 WO2019142418A1 (fr) 2018-01-22 2018-10-17 Information processing device and information processing method

Country Status (1)

Country Link
WO (1) WO2019142418A1 (fr)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004109361A * 2002-09-17 2004-04-08 Toshiba Corp Directivity setting device, directivity setting method, and directivity setting program
JP2004184716A * 2002-12-04 2004-07-02 Nissan Motor Co Ltd Speech recognition device
JP2004198831A * 2002-12-19 2004-07-15 Sony Corp Speech recognition device and method, program, and recording medium
JP2004354722A * 2003-05-29 2004-12-16 Nissan Motor Co Ltd Speech recognition device
JP2006195302A * 2005-01-17 2006-07-27 Honda Motor Co Ltd Speech recognition system and vehicle equipped with the speech recognition system
JP2006201286A * 2005-01-18 2006-08-03 Matsushita Electric Ind Co Ltd In-vehicle sound collection device and method of displaying in-vehicle collected-sound information
JP2009069707A * 2007-09-17 2009-04-02 Nippon Seiki Co Ltd Speech recognition device for vehicles
JP2010128015A * 2008-11-25 2010-06-10 Toyota Central R&D Labs Inc Misrecognition determination device and misrecognition determination program for speech recognition
JP2015503119A * 2011-11-23 2015-01-29 Kim Yongjin Method for providing a speech recognition supplementary service and device applied thereto

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021021852A (ja) * 2019-07-29 2021-02-18 Sharp Corporation Speech recognition device, electronic apparatus, control method, and control program
JP2022539673A (ja) * 2019-10-15 2022-09-13 Google LLC Voice-controlled entry of content into graphical user interfaces
JP7250180B2 (ja) 2019-10-15 2023-03-31 Google LLC Voice-controlled entry of content into graphical user interfaces
US11853649B2 (en) 2019-10-15 2023-12-26 Google Llc Voice-controlled entry of content into graphical user interfaces
US12093609B2 (en) 2019-10-15 2024-09-17 Google Llc Voice-controlled entry of content into graphical user interfaces
JP2021135648A (ja) * 2020-02-26 2021-09-13 Fujifilm Business Innovation Corp. Information processing device and information processing program
JP2021148971A (ja) * 2020-03-19 2021-09-27 Nissan Motor Co., Ltd. Speech recognition method and speech recognition device
JP7556202B2 (ja) 2020-03-19 2024-09-26 Nissan Motor Co., Ltd. Speech recognition method and speech recognition device

Similar Documents

Publication Publication Date Title
US11462213B2 (en) Information processing apparatus, information processing method, and program
JP6635049B2 (ja) Information processing apparatus, information processing method, and program
US9293133B2 (en) Improving voice communication over a network
JP5772069B2 (ja) Information processing apparatus, information processing method, and program
WO2019142418A1 (fr) Information processing device and information processing method
JP6585733B2 (ja) Information processing apparatus
WO2019107145A1 (fr) Information processing device and information processing method
JP6904357B2 (ja) Information processing apparatus, information processing method, and program
WO2018105373A1 (fr) Information processing device, information processing method, and information processing system
JP2009166184A (ja) Guide robot
US12062360B2 (en) Information processing device and information processing method
JP2020021025A (ja) Information processing apparatus, information processing method, and program
US20230037085A1 (en) Preventing non-transient storage of assistant interaction data and/or wiping of stored assistant interaction data
JP2017167247A (ja) Misrecognition correction method, misrecognition correction device, and misrecognition correction program
WO2016103809A1 (fr) Information processing device, information processing method, and program
US20200090663A1 (en) Information processing apparatus and electronic device
JPWO2017175442A1 (ja) Information processing apparatus and information processing method
WO2021153101A1 (fr) Information processing device, information processing method, and information processing program
JP6950708B2 (ja) Information processing apparatus, information processing method, and information processing system
US11170754B2 (en) Information processor, information processing method, and program
WO2019187543A1 (fr) Information processing device and information processing method
WO2019142420A1 (fr) Information processing device and information processing method
KR102594683B1 (ko) Electronic device and voice recognition method thereof
WO2021153102A1 (fr) Information processing device, information processing system, information processing method, and information processing program
WO2020017165A1 (fr) Information processing device, information processing system, information processing method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18901226

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18901226

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP