WO2022224310A1

WO2022224310A1 - Information processing device, information processing method, and program

Info

Publication number: WO2022224310A1
Application number: PCT/JP2021/015884
Authority: WO
Inventors: 明日香小野; 充裕後藤
Original assignee: 日本電信電話株式会社
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2022-10-27

Abstract

Provided is a technology that makes it possible to present a target action to a speaker in a manner that causes little interference with language processing. According to the embodiments, an information processing device comprises a first acquisition unit that acquires a target value for a non-linguistic indicator for speech, a generation unit that generates an instruction signal that includes non-language information that represents the target value, and an output unit that outputs the instruction signal as a stimulus that can be perceived by the speaker. According to the first embodiment, the information processing device acquires a measured value for the non-linguistic indicator from voice information spoken by the speaker and outputs the instruction signal on the basis of comparison results for the measured value and an allowable range. According to the second embodiment, the information processing device associates the target value for the non-linguistic indicator with a word or phrase included in text information that represents the expected content of the speech of the speaker and outputs the instruction signal at the speaking timing of the word or phrase.

Description

Information processing device, information processing method and program

Embodiments of the present invention relate to an information processing device, an information processing method, and a program.

Public speaking skills such as speeches and presentations are important in both academic and business settings. In order to improve public speaking skills, it is conceivable to present a target behavior to the speaker. For example, a technology has been proposed that displays a target behavior in verbal form on a wearable device.

Patent Document 1 proposes a technique for generating instruction information for actions to be executed by a presentation device such as a robot. Non-Patent Document 1 proposes a technology that enables self-review of a presentation by using the technology of Patent Document 1 to reproduce the speaker's past actions with a robot.

Patent Document 2 proposes a technology that compares a target behavior with a subject's behavior and presents a concrete improvement method to the subject as an advice sentence in order to improve communication skills.

Japanese Patent Application Publication No. 2019-144732
Japanese Patent Application Publication No. 2016-157388

In the context of public speaking, speakers use many cognitive resources for language processing (for example, reading manuscripts, recalling, etc.). In such a situation where language processing is being performed, if the target behavior is presented in a manner that requires further language processing, public speaking activities may be hindered. In public speaking, it has been reported that the speaker's cognitive function declines due to tension. There is a need for a technology that presents a target action to the speaker without interfering with the speaker's language processing.

The present invention has been made with a focus on the above circumstances, and its purpose is to provide a technology that enables presenting a target action to a speaker in a manner that causes little interference with language processing.

In order to solve the above problems, a first aspect of the present invention is an information processing apparatus comprising: a first acquisition unit that acquires a target value of a nonverbal index in speech; a generation unit that generates an instruction signal including nonverbal information representing the target value; and an output unit that outputs the instruction signal as a stimulus perceivable by the speaker.

According to the first aspect of the present invention, an instruction signal including nonverbal information representing a target value of a nonverbal index in utterance is generated and output as a stimulus perceivable by the speaker. Therefore, the information processing apparatus of the first aspect can provide a technology that enables transmission of a target action in a non-verbal manner with little interference with the language processing of the speaker. A speaker who receives a stimulus can intuitively perceive the target behavior.

In other words, according to the present invention, it is possible to provide a technology that enables presenting a target action to a speaker in a manner that does not interfere with language processing.

FIG. 1 is a diagram showing the configuration of a system including an information processing device according to the first embodiment.
FIG. 2 is a block diagram showing the hardware configuration of the information processing device according to the first embodiment.
FIG. 3 is a flowchart showing a processing procedure and processing contents by the information processing device according to the first embodiment.
FIG. 4 is a diagram illustrating an example of a vibration signal generated by the information processing device according to the first embodiment.
FIG. 5 is a diagram showing the configuration of a system including an information processing device according to the second embodiment.
FIG. 6 is a flowchart showing a processing procedure and processing contents regarding generation of an instruction signal by the information processing device according to the second embodiment.
FIG. 7 is a flowchart showing a processing procedure and processing contents regarding output of an instruction signal by the information processing device according to the second embodiment.

Hereinafter, embodiments according to the present invention will be described with reference to the drawings. Elements that are the same as or similar to elements that have already been explained are denoted by the same or similar reference numerals, and overlapping explanations are basically omitted. For example, when there are a plurality of identical or similar elements, a common reference numeral may be used to describe the elements without distinction, and branch numbers may be used in addition to the common reference numeral to describe the elements individually.

[First embodiment]
(1) Configuration
(1-1) System Configuration
FIG. 1 is a diagram showing an example configuration of a system including an information processing device according to the first embodiment. The system presents a target behavior to the speaker via instruction signals containing non-verbal information representing target values for non-verbal indicators in the utterance. A target behavior may be rephrased as an ideal speaking method. As an example, the following description assumes that the speaker is giving a presentation to an audience, but the present invention is not limited to this. The system is applicable to one-to-many public speaking situations as well as one-to-one communication situations. In the following, an example of giving tactile stimulation by vibration is explained as a method of presenting the target behavior, but the method is not limited to this, and visual stimulation by image display, auditory stimulation by voice output, or the like may be used.

The system of FIG. 1 includes a presentation support device 1 as an information processing device according to the first embodiment, an input device 41, an output device 42, a microphone 43, and a vibration device 44.

The input device 41 is a device for receiving input from the user of the presentation support device 1, such as a keyboard, mouse, and touch screen. As used herein, a "user" may be a speaker, assistant, administrator or operator, or the like.

The output device 42 is a device for outputting information, such as a liquid crystal display device, an organic EL (Electro-Luminescence) display, or a speaker.

The microphone 43 is placed, for example, near the speaker, collects the speaker's voice, and converts it into an electrical signal.

The vibration device 44 is a device, such as a smart watch, other wearable device, mobile terminal, or smart phone, which incorporates a vibration element and can output vibration stimulation according to a drive signal.

Note that one or more of the input device 41, the output device 42, the microphone 43, and the vibration device 44 may be configured integrally or may be built into the presentation support device 1.

(1-2) Functional Configuration of Presentation Support Device
Next, functions of the presentation support device 1 according to the first embodiment will be described. The presentation support device 1 is configured by, for example, a personal computer. As shown in FIG. 1, the presentation support device 1 includes a control unit 10, a storage unit 20, and an input/output interface 30.

The input/output interface 30 inputs and outputs data between the presentation support device 1 and external devices. For example, the input/output interface 30 captures data input by the user from the input device 41, outputs output data generated by the control unit 10 to the output device 42, captures an audio signal output from the microphone 43, or outputs a drive signal to the vibration device 44. The input/output interface 30 includes a USB (Universal Serial Bus) port, a cable connection terminal, a card slot, or the like, and exchanges data with the input device 41, the output device 42, the microphone 43, and the vibration device 44 according to a method that enables communication with each of them. The input/output interface 30 may include a wired or wireless communication interface. The wired communication interface is, for example, a wired LAN interface, and the wireless communication interface is, for example, a wireless LAN or Bluetooth (registered trademark) interface. Transmission and reception of data between the presentation support device 1 and the input device 41, the output device 42, the microphone 43, or the vibration device 44 may be performed via a communication interface.

The storage unit 20 includes a non-verbal plan database 21 and a vibration pattern database 22.

The non-verbal plan database 21 stores data on non-verbal (which may be read as "paralinguistic") indicators of speech that is generally regarded as ideal in public speaking. Here, regarding the information contained in spoken voice, "language" refers to information that can be expressed as text, and "non-language" refers to information other than language (for example, speech speed, voice volume, and voice pitch). Non-verbal information is also referred to herein as "non-verbal indicators". The data stored in the non-verbal plan database 21 includes information on ideal reference ranges (upper and lower limits) for the non-verbal indicators. The reference range is set arbitrarily, or updated automatically, to numerical values that promote the audience's understanding and attention, for example, based on the latest research results. The reference range is used when determining whether or not to present the target to the speaker, and can also be called an allowable range. The reference range includes the target value. The target value is used in generating an instruction signal for presenting the target. The target value may be, for example, the median value of the reference range, the lower limit value, the upper limit value, or another value. The non-verbal plan database 21 stores, for example, target values and reference ranges for an intelligible and fluent speech rate, periodic inflection of voice volume within a phrase, inflection of voice pitch within a phrase, emphasis of a word or phrase, and the insertion of a pause of an appropriate length for each phrase. "Word" and "phrase" are merely expressions used for convenience of explanation, and may be read as a unit of language (word, phrase, clause, or sentence) including one or more words. The reference range or target value of a non-verbal indicator may be based on measurement results obtained by recording ideal speech in advance and analyzing the recording with commonly used speech analysis software.
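As a minimal sketch of how one entry of such a database could be represented, the following Python fragment models a reference range with an upper and lower limit and derives a target value from it. The class name `ReferenceRange`, the `mode` argument, and the numerical values are assumptions made for illustration; the source does not specify a data format or concrete limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    """Allowable range for one non-verbal indicator (hypothetical structure)."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        # A measured value is acceptable when it lies within the limits.
        return self.lower <= value <= self.upper

    def target(self, mode: str = "median") -> float:
        # The target may be the median, the lower limit, or the upper limit.
        if mode == "median":
            return (self.lower + self.upper) / 2
        if mode == "lower":
            return self.lower
        if mode == "upper":
            return self.upper
        raise ValueError(f"unknown mode: {mode}")


# Illustrative numbers only (speech rate in mora/s); not taken from the source.
speech_rate = ReferenceRange(lower=5.0, upper=7.0)
```

With these assumed limits, `speech_rate.target()` gives 6.0 and `speech_rate.contains(7.5)` is false.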

The vibration pattern database 22 stores vibration patterns as an example of a stimulus mode that can guide the speaking method during the speaker's speech without interfering with language processing. The vibration patterns stored in the vibration pattern database 22 include, for example, rhythm patterns that guide speech speed, amplitude modulation that guides periodic intonation, accent vibrations at the beginning of words that guide emphasis of words or phrases, and marker vibrations that guide the start of a pause in speech.

The non-verbal plan database 21 and the vibration pattern database 22 do not have to be built into the presentation support device 1, and may be connected to the presentation support device 1 via a network.

The control unit 10 includes a voice information acquisition unit 11, a non-verbal characteristic measurement unit 12, a determination unit 13, a vibration signal generation unit 14, and a vibration device drive unit 15 as functional units according to the first embodiment.

The voice information acquisition unit 11 acquires the voice signal output from the microphone 43 as voice information uttered by the speaker. The voice information acquisition unit 11 is an example of a second acquisition unit that acquires voice information uttered by a speaker.

The non-verbal characteristic measurement unit 12 measures non-verbal indicators such as speech speed, voice volume, or pitch from the speaker's voice information for each window length of X [ms]. The non-verbal characteristic measuring unit 12 is an example of a measuring unit that measures a non-verbal index from voice information and obtains a measured value.

The determination unit 13 reads the reference range and target value of the non-verbal indicator stored in the non-verbal plan database 21, compares the reference range with the measured value in each time window, and records the comparison results. The determination unit 13 is an example of a determination unit that compares the measured value with the allowable range and determines whether or not to present the target based on the comparison result. The determination unit 13 is also an example of a first acquisition unit that acquires target values of non-verbal indicators in speech. When the measured value obtained from the voice signal of the speaker is out of the reference range in Y consecutive windows, the determination unit 13 determines that target presentation is necessary and outputs a vibration signal call command. The vibration signal call command includes information on the target values of the non-verbal indicators that require target presentation.

The vibration signal generation unit 14 calls up vibration pattern data from the vibration pattern database 22 in response to the vibration signal call command and generates a vibration signal. The generated vibration signal is an example of an instruction signal containing non-verbal information representing a target value of the non-verbal indicator. The vibration signal generation unit 14 is an example of a generation unit that generates an instruction signal including non-verbal information representing a target value of a non-verbal indicator. The vibration signal generation unit 14 can also integrate vibration signals for multiple non-verbal indicators. For example, the vibration signal generation unit 14 generates an integrated vibration signal by using a sinusoidal vibration that guides speech speed as a carrier wave and applying to the carrier wave an amplitude modulation filter that guides intonation. In this example, the integrated vibration signal can represent the target speech speed by the vibration rhythm and the target intonation by the change in vibration amplitude.

The vibration device drive section 15 generates and outputs a drive signal for driving the vibration device 44 based on the generated vibration signal. The drive signal can also be said to be an example of an instruction signal that includes nonverbal information representing the target value of the nonverbal index. The vibration device drive unit 15 is an example of an output unit that outputs an instruction signal as a stimulus perceivable by the speaker.

(1-3) Hardware Configuration
FIG. 2 is a block diagram showing an example of the hardware configuration of the presentation support device 1.
The presentation support device 1 includes, as hardware, a CPU (Central Processing Unit) 51, a RAM (Random Access Memory) 52, a ROM (Read Only Memory) 53, an auxiliary storage device 54, and the input/output interface 30 described above. The CPU 51, RAM 52, ROM 53, auxiliary storage device 54, and input/output interface 30 are electrically connected via a bus 55.

The CPU 51 is an example of a general-purpose hardware processor, and controls the overall operation of the presentation support device 1. The RAM 52 is used by the CPU 51 as working memory. The RAM 52 includes volatile memory such as SDRAM (Synchronous Dynamic Random Access Memory). The ROM 53 non-temporarily stores programs for causing the presentation support device 1 to perform various functions and setting data necessary for executing the programs. The programs stored in the ROM 53 include computer-executable instructions. The CPU 51 loads the program (computer-executable instructions) stored in the ROM 53 into the RAM 52, and interprets and executes the program, thereby realizing the functions of the control unit 10.

The auxiliary storage device 54 includes, for example, a hard disk drive (HDD), solid state drive (SSD), semiconductor memory, or the like, and non-temporarily stores data necessary for the functions of the control unit 10. The auxiliary storage device 54 functions as the storage unit 20 described above. A part of the above program may be stored in the auxiliary storage device 54. The program may be provided to the presentation support device 1 while being stored in a computer-readable recording medium. In this case, for example, the presentation support device 1 has a drive that reads data from a recording medium, and acquires the program from the recording medium. Examples of recording media include magnetic disks, optical disks (CD-ROM, CD-R, DVD-ROM, DVD-R, etc.), magneto-optical disks (MO, etc.), and semiconductor memories. Alternatively, the program may be stored in a server on a network, and the presentation support device 1 may download the program from the server.

With respect to the specific functional configuration or hardware configuration of the presentation support device 1, it is possible to omit, replace, or add components as appropriate according to the embodiment. For example, the CPU 51 may be replaced with an MPU (Micro Processing Unit), GPU (Graphics Processing Unit), ASIC (Application Specific Integrated Circuit), FPGA (Field-Programmable Gate Array), or the like. The CPU 51 may be a single CPU or the like, or may include a plurality of CPUs or the like.

(2) Operation
Next, an information processing operation by the presentation support device 1 configured as described above will be described.
FIG. 3 is a flowchart showing an example of the processing procedure and processing contents. It is assumed that the non-verbal plan database 21 and the vibration pattern database 22 store the data necessary for processing in advance.

First, in step S101, the presentation support device 1 uses the voice information acquisition unit 11 under the control of the control unit 10 to acquire a voice signal related to the speaker's utterance from the microphone 43. The voice information acquisition unit 11 also converts the voice signal into a digital signal at a predetermined sampling rate.

Next, in step S102, the presentation support device 1 uses the non-verbal characteristic measurement unit 12 to measure non-verbal indicators from the digitized voice signal and acquire the measured values. Non-verbal indicators to be measured include, for example, speech speed (number of moras per unit time (mora/s), number of words per unit time (word/s), number of syllables per unit time (syllable/s), or similar), voice volume (e.g., dB), and voice pitch (e.g., Hz). Commonly used speech analysis software may be used to measure the non-verbal indicators. The non-verbal characteristic measurement unit 12 measures a non-verbal indicator for each time window of a predetermined length (X [ms]). The window length X may be adjusted to the length required for each non-verbal indicator. The type and number of non-verbal indicators to be measured may be arbitrarily selected by the user.
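The window-by-window measurement can be sketched for one indicator, voice volume, as follows. This is an illustrative example only: the function name `window_levels_db`, the use of non-overlapping windows, and the RMS level relative to a full-scale amplitude of 1.0 are assumptions, and an actual implementation may rely on speech analysis software instead.

```python
import math


def window_levels_db(samples, sample_rate, window_ms):
    """Return the RMS level in dB (re. full scale 1.0) of each window of X ms.

    Windows are non-overlapping; a trailing partial window is dropped.
    """
    size = int(sample_rate * window_ms / 1000)
    if size <= 0:
        raise ValueError("window is shorter than one sample")
    levels = []
    for start in range(0, len(samples) - size + 1, size):
        frame = samples[start:start + size]
        rms = math.sqrt(sum(s * s for s in frame) / size)
        # Clamp to avoid log10(0) for completely silent windows.
        levels.append(20 * math.log10(max(rms, 1e-12)))
    return levels


# Example input (assumed): one second of a full-scale 100 Hz tone at 8 kHz.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
levels = window_levels_db(tone, sample_rate=8000, window_ms=100)
```

For this test tone, each of the ten 100 ms windows has a level of about -3 dB, the RMS level of a full-scale sine wave.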

In step S103, the presentation support device 1 causes the determination unit 13 to read the reference range of the non-verbal indicator from the non-verbal plan database 21, compare the measured value with the reference range for each time window, and record the comparison result.

When the non-verbal index is speech speed, the measured value is measured as, for example, the number of moras, the number of words, the number of syllables per unit time, or equivalent thereto. In step S103, the determination unit 13 compares the read reference range (upper limit and lower limit) with the measured value.

If the non-verbal indicator is the inflection of voice volume, the measured value is measured as, for example, the number of occurrences, within the unit time, in which the sound pressure level difference between a maximum value and a minimum value of the sound pressure level of the voice is greater than or equal to the reference value A (lower limit), or the standard deviation of the time intervals between the maximum and minimum values for which the sound pressure level difference is greater than or equal to the reference value A (lower limit), or an equivalent measure. The determination unit 13 reads a reference range B (upper limit and lower limit) set based on the reference value A, and compares the reference range B with the measured value.

If the non-verbal indicator is the inflection of voice pitch, the measured value is measured as, for example, the number of occurrences, within the unit time, in which the frequency difference between a maximum value and a minimum value of the fundamental frequency of the voice is greater than or equal to the reference value C (lower limit), or the standard deviation of the time intervals between the maximum and minimum values for which the frequency difference is greater than or equal to the reference value C (lower limit), or an equivalent measure. The determination unit 13 reads a reference range D (upper limit and lower limit) set based on the reference value C, and compares the reference range D with the measured value.

If the non-verbal indicator is the emphasis of a word or phrase, the measured value is measured as, for example, the ratio of the sound pressure level of the word or phrase segment to be emphasized to the average sound pressure level, the ratio of the fundamental frequency of the word or phrase segment to be emphasized to the average fundamental frequency, or an equivalent measure. The determination unit 13 reads the reference range (upper limit and lower limit) and compares the reference range with the measured value.

If the non-verbal indicator is a pause, the measured value is measured as, for example, the sound pressure level or an equivalent measure. The determination unit 13 reads the reference range (upper limit) and compares the reference range with the measured value.

In step S104, the presentation support device 1 uses the determination unit 13 to determine whether the measured values are outside the reference range in Y consecutive windows. The presentation support device 1 according to the first embodiment determines that target presentation is necessary when the measured values obtained from the speaker's voice are outside the reference range in Y consecutive windows. If it is determined in step S104 that the measured values are outside the reference range in Y consecutive windows (YES), the process proceeds to step S105. If it is determined that the measured values are not outside the reference range in Y consecutive windows (NO), the determination is continued using new measured values. The number Y of windows used for the determination may be arbitrarily adjusted according to the proficiency of the speaker. For example, the value of Y may be adjusted such that, if the speaker is a beginner, Y is decreased because a beginner cannot immediately respond to the presentation of the target behavior, and if the speaker is an expert, Y is increased for more accurate determination.
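The determination in steps S103 and S104 can be sketched as a run-length check over the per-window measured values. The function name `first_trigger_window` and the sample values below are hypothetical and serve only to illustrate the "Y consecutive windows" condition.

```python
def first_trigger_window(measured, lower, upper, y):
    """Return the index of the window that completes a run of Y consecutive
    out-of-range measured values, or None if no such run occurs."""
    run = 0
    for index, value in enumerate(measured):
        if lower <= value <= upper:
            run = 0            # back inside the reference range: reset the run
        else:
            run += 1
            if run >= y:
                return index   # target presentation is judged necessary here
    return None


# Illustrative speech-rate values (mora/s) against an assumed range of 5 to 7.
rates = [6.0, 7.5, 6.5, 7.6, 7.8, 8.0]
trigger = first_trigger_window(rates, lower=5.0, upper=7.0, y=3)
```

Here the single excursion at index 1 is ignored, and the call command would be issued at index 5, where the third consecutive out-of-range window occurs. Changing Y changes how long an excursion must persist before the speaker is prompted.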

In step S105, the presentation support device 1 causes the determination unit 13 to output a vibration signal call command to the vibration signal generation unit 14. The vibration signal call command includes information indicating the non-verbal indicator that is outside the reference range (that is, the indicator for which the target should be presented) and its target value.

In step S106, the presentation support device 1 causes the vibration signal generation unit 14, in response to the vibration signal call command, to call from the vibration pattern database 22 the vibration pattern data that matches the target value of the non-verbal indicator for which target presentation is needed. For example, a rhythm pattern that guides speech speed is composed of sine wave sections and pause sections, and the rhythm is characterized by the number of consecutive waves and the length of the pause section. Also, for example, the amplitude modulation that guides intonation is represented by an amplitude modulation filter of one unit that runs from a start point with a value of zero, through a maximum point, to an end point with a value of zero. The vibration signal generation unit 14 generates and outputs a vibration signal based on the called vibration pattern data. The vibration signal generation unit 14 can integrate vibration signals relating to a plurality of non-verbal indicators, as described above and below.

In step S107, the presentation support device 1 uses the vibration device drive unit 15 to generate and output a drive signal for driving the vibration device 44 based on the vibration signal generated by the vibration signal generation unit 14. An example of the vibration device 44 is a smartwatch worn on the wrist of the speaker. When the vibration device 44 receives a drive signal from the presentation support device 1 via, for example, short-range wireless communication, the vibration device 44 drives a built-in vibration element according to the drive signal and outputs a vibration stimulus to the wrist of the speaker. The part of the speaker's body to which the vibration stimulus is applied is not limited to the wrist, and may be another part such as a finger, upper arm, leg, or torso.

FIG. 4 is a diagram showing an example of the vibration signal generated by the vibration signal generation unit 14. In FIG. 4, the rhythm pattern SP that guides speech speed and the amplitude modulation pattern FP that guides intonation are integrated to generate a vibration signal VS that guides speech speed and intonation.

The rhythm pattern SP that guides the speech speed includes sine wave intervals and pause intervals, and the rhythm is characterized by the number of consecutive waves and the length of the pause interval. For example, in the pattern SP1, one beat includes a sine wave section with a wave number of 4 and a pause section, and the length of the pause section is set so that seven beats are included per second. In the pattern SP2, one beat includes a sine wave section with a wave number of 4 and a pause section, and the length of the pause section is set so that five beats are included per second. In the illustrated example, the pattern SP2 guides a slower (lower) speech speed than the pattern SP1. The pattern SP1 and the pattern SP2 can also be rephrased as those obtained by subjecting the fundamental carrier to ON/OFF modulation or amplitude modulation. The number of waves in each sine wave section may be adjusted accordingly. The number of waves in the sinusoidal section can be adjusted by frequency modulation on the fundamental carrier. The rhythm pattern that guides the speech speed is not limited to the illustrated example, and may express rhythm in other manners.
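A rhythm pattern of this kind can be sketched as follows, with each beat consisting of a burst of four sine waves followed by a pause whose length is fixed by the number of beats per second. The carrier frequency of 200 Hz, the sampling rate of 8 kHz, and the function name `rhythm_pattern` are assumptions for illustration; the source does not specify them.

```python
import math


def rhythm_pattern(beats_per_second, waves_per_beat=4, carrier_hz=200.0,
                   sample_rate=8000, duration_s=1.0):
    """Generate sine bursts separated by pauses (ON/OFF-modulated carrier)."""
    beat_len = sample_rate / beats_per_second              # samples per beat
    burst_len = waves_per_beat * sample_rate / carrier_hz  # samples per burst
    if burst_len > beat_len:
        raise ValueError("burst does not fit into one beat")
    signal = []
    for i in range(int(duration_s * sample_rate)):
        pos = i % beat_len                 # position inside the current beat
        if pos < burst_len:                # sine wave section
            signal.append(math.sin(2 * math.pi * carrier_hz * pos / sample_rate))
        else:                              # pause section
            signal.append(0.0)
    return signal


SP1 = rhythm_pattern(7)  # seven beats per second (faster guided speech speed)
SP2 = rhythm_pattern(5)  # five beats per second (slower guided speech speed)
```

Under these assumptions each burst lasts 20 ms, so the pause section is about 123 ms for SP1 and 180 ms for SP2; only the pause length differs between the two patterns.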

The amplitude modulation pattern FP that guides the intonation is represented by an amplitude modulation filter with one unit from the start point to the end point. For example, the pattern FP1 has zero values at the start and end points, linearly increases from the start point to the maximum point, and linearly decreases from the maximum point to the end point. The pattern FP2 has three local maximum points and two local minimum points. Its value increases or decreases linearly from the start point with a value of zero, through the first local maximum point, the first local minimum point, the second local maximum point, the second local minimum point, and the third local maximum point, to the end point with a value of zero. The values of pattern FP1 and pattern FP2 represent amplitude values. In the example shown, pattern FP2 guides more varied intonation than pattern FP1. The value and function of each point may be appropriately adjusted according to the characteristics of the speaker and language. One unit of time (Z seconds) may be set arbitrarily. The length of the time axis of the amplitude modulation pattern FP is adjusted according to the length of the word or phrase to be guided. The amplitude modulation pattern that guides intonation is also not limited to the illustrated example, and may express intonation in other manners.
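The two filter shapes can be sketched as piecewise-linear envelopes defined by breakpoints. In the fragment below, the breakpoint positions, the local minimum height of 0.4, and the function name `envelope` are assumptions chosen for illustration; the source gives only the qualitative shapes.

```python
def envelope(points, n_samples):
    """Sample a piecewise-linear envelope through (time_fraction, amplitude)
    breakpoints; time fractions run from 0.0 (start) to 1.0 (end)."""
    if n_samples < 2:
        raise ValueError("at least two samples are required")
    values = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        # Find the segment containing t and interpolate linearly within it.
        for (t0, a0), (t1, a1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                values.append(a0 + (a1 - a0) * (t - t0) / (t1 - t0))
                break
    return values


# FP1: zero at both ends with a single maximum in the middle.
FP1 = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
# FP2: three local maxima and two local minima (heights are assumed).
FP2 = [(0.0, 0.0), (1 / 6, 1.0), (2 / 6, 0.4), (3 / 6, 1.0),
       (4 / 6, 0.4), (5 / 6, 1.0), (1.0, 0.0)]

fp1_samples = envelope(FP1, 5)
fp2_samples = envelope(FP2, 7)
```

Choosing `n_samples` according to the duration of the word or phrase corresponds to adjusting the length of the time axis of the pattern.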

The vibration signal VS is an example of a waveform image integrated using the rhythm pattern SP1 as a carrier wave and the amplitude modulation pattern FP1 as a modulation filter. The vibration signal VS can simultaneously guide both speech rate and intonation via non-verbal stimuli.
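The integration step can be sketched as multiplying each carrier sample by an envelope value, with the envelope stretched to the length of the carrier. The function name `apply_envelope` and the toy inputs are hypothetical; this shows one plausible way to realize amplitude modulation of the carrier, not necessarily the method of the embodiment.

```python
def apply_envelope(carrier, envelope):
    """Stretch the envelope to the carrier length by linear interpolation and
    multiply sample by sample (amplitude modulation of the carrier)."""
    n, m = len(carrier), len(envelope)
    if n < 2 or m < 2:
        raise ValueError("carrier and envelope need at least two samples")
    out = []
    for i, sample in enumerate(carrier):
        x = i * (m - 1) / (n - 1)      # fractional index into the envelope
        k = min(int(x), m - 2)         # left neighbour, clamped at the end
        frac = x - k
        gain = envelope[k] * (1 - frac) + envelope[k + 1] * frac
        out.append(sample * gain)
    return out


# Toy example: a constant carrier makes the applied gain directly visible.
vs = apply_envelope([1.0, 1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0])
```

In this toy example the output rises linearly to the maximum and falls back to zero; with a rhythm pattern as the carrier, the bursts would keep their timing while their amplitude follows the envelope.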

FIG. 4 is only an example, and an instruction signal including only the rhythm pattern SP or only the amplitude modulation pattern FP may be generated, or the vibration signal VS may be further integrated with accent vibration or marker vibration. A design is also possible in which only the speech rate is monitored and guided for one minute from the start of the speaker's speech, and both the speech rate and amplitude are monitored and guided after one minute has elapsed from the start of speech.

(3) Effect
As described in detail above, the information processing device according to the first embodiment of the present invention outputs an instruction signal containing non-verbal information representing a target value when the measured value of a non-verbal indicator, measured from the speech signal of a speaker who is giving a presentation or the like, falls outside the reference range for that non-verbal indicator in ideal speech. The instruction signal is a non-verbal stimulus perceivable by the speaker and guides the measured value obtained from the speaker's speech signal toward an ideal value. The speaker can receive non-verbal feedback in real time while speaking, intuitively understand the target behavior, and improve the skill.

Therefore, in the first embodiment of the present invention, it is possible to convey the target behavior (ideal speaking method) to a speaker who spends a lot of cognitive resources on language processing in a manner that does not interfere with language processing.

[Second embodiment]
Next, the configuration and operation of the second embodiment will be described, focusing mainly on the differences from the first embodiment.

(1) Configuration
(1-1) System Configuration
FIG. 5 is a diagram showing an example of the configuration of a system including an information processing device according to the second embodiment. This system also presents a target behavior to the speaker via instruction signals containing non-verbal information representing target values for non-verbal indicators in the utterance. A target behavior may be rephrased as an ideal speaking method. Unlike the first embodiment, in the second embodiment, the system associates non-verbal indicators in advance with text data representing the contents of the speaker's planned utterance, and outputs an instruction signal in accordance with the utterance timing. As in the first embodiment, the following description assumes a situation in which a speaker gives a presentation to an audience, but the present invention is not limited to this. The system is applicable to one-to-many public speaking situations as well as one-to-one communication situations. In the following, an example of giving tactile stimulation by vibration is explained as a method of presenting the target behavior, but the method is not limited to this, and visual stimulation by image display, auditory stimulation by voice output, or the like may be used.

The system of FIG. 5 includes a presentation support device 2 as an information processing device according to the second embodiment, an input device 41, an output device 42, a microphone 43, a vibration device 44, and a presentation device 45. Since the input device 41, the output device 42, the microphone 43, and the vibration device 44 are the same as those described in the first embodiment, detailed description thereof will be omitted.

The presentation device 45 is, for example, a personal computer, and is used by the speaker or an assistant during the presentation. The presentation device 45 is installed with, for example, software for reproducing slides, and outputs images or sounds included in the slide materials. The presentation device 45 can acquire and output information regarding the progress of the presentation, such as slide numbers, slide switching information, and animation display information.

One or more of the input device 41, the output device 42, the microphone 43, the vibration device 44, and the presentation device 45 may be configured integrally or may be built into the presentation support device 2.

(1-2) Functional Configuration of Presentation Support Device Next, functions of the presentation support device 2 will be described. The presentation support device 2 is configured by, for example, a personal computer. As shown in FIG. 5, the presentation support device 2 includes a control unit 10, a storage unit 20, and an input/output interface 30.

The input/output interface 30 inputs and outputs data between the presentation support device 2 and external devices in the same manner as in the presentation support device 1 of the first embodiment. The input/output interface 30 is connected to the presentation device 45 via a USB port, a cable connection terminal, a card slot, a communication interface, or the like, and can capture information relating to the progress of the presentation, such as slide numbers, output from the presentation device 45.

The storage unit 20 includes a non-language plan database 21, a vibration pattern database 22, and a vibration signal storage unit 23.
The non-language plan database 21 and the vibration pattern database 22 are the same as those described with respect to the presentation support device 1 in the first embodiment, so detailed description thereof will be omitted.

The vibration signal storage unit 23 stores the vibration signal generated by the vibration signal generation unit 105 in association with the words or phrases included in the text data representing the planned utterance content of the speaker. The vibration signal storage unit 23 also need not be built in the presentation support device 2, and may be connected to the presentation support device 2 via a network.

The control unit 10 includes, as functional units according to the second embodiment, a language plan acquisition unit 101, a non-language plan synthesizing unit 102, a non-language plan presentation unit 103, an operation reception unit 104, a vibration signal generation unit 105, a progress tracking unit 106, a timing calculation unit 107, a signal calling unit 108, and a vibration device drive unit 109.

The language plan acquisition unit 101 acquires the speaker's language plan from the input device 41 or the presentation device 45 via the input/output interface 30. The language plan here refers to text information (text data) representing the planned utterance content. The language plan acquisition unit 101 is an example of a third acquisition unit that acquires text information representing the planned utterance content of the speaker.

The non-language plan synthesizing unit 102 reads, from the non-language plan database 21, target values of non-verbal indicators of speech that are generally considered ideal in public speaking, and associates them, as a non-language plan, with words or phrases included in the text information. Here, the process of associating a language plan with a non-language plan is called synthesis. A non-language plan here refers to the non-verbal indicators associated with the language plan or their target values. The non-language plan synthesizing unit 102 is an example of a synthesizing unit that associates target values with words or phrases included in text information. The non-language plan synthesizing unit 102 is also an example of a first acquisition unit that acquires target values of non-verbal indicators in speech. The text information may contain multiple words or phrases associated with target values of non-verbal indicators, and these may overlap. Also, one word may be associated with target values of multiple types of non-verbal indicators. For example, a word A may be associated with an emphasis target value, and a phrase B including the word A may be associated with a speaking speed target value. As in the first embodiment, "word or phrase" is merely an expression used for convenience of explanation, and may be read as any unit of language (word, phrase, clause, or sentence) including one or more words.
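The overlapping associations described above (a word A with an emphasis target inside a phrase B with a speaking speed target) can be pictured with a small data structure. The following is only a minimal sketch, not the format used by the presentation support device 2: the class name `NonVerbalTarget`, the label strings, the numeric values, and the word-index span representation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonVerbalTarget:
    """One association between a span of the language plan and a target value."""
    start: int       # index of the first word in the span (inclusive)
    end: int         # index one past the last word in the span (exclusive)
    index_type: str  # hypothetical label, e.g. "emphasis" or "speech_rate"
    target: float    # target value; the unit is an assumption of this sketch


def targets_for_word(plan, word_pos):
    """Return every target whose span covers the word at word_pos."""
    return [t for t in plan if t.start <= word_pos < t.end]


# Language plan split into words (illustrative text only).
words = "today we present our main result".split()

# Phrase B ("our main result") carries a speaking speed target, and
# word A ("main"), which lies inside phrase B, carries an emphasis target.
plan = [
    NonVerbalTarget(3, 6, "speech_rate", 5.0),
    NonVerbalTarget(4, 5, "emphasis", 1.5),
]
```

With this representation, looking up the word "main" (index 4) returns both targets, while "our" (index 3) returns only the speaking speed target, which mirrors the overlap permitted in the text.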

The non-language plan presentation unit 103 presents the non-language plan associated with the language plan to the user as an initial setting. The non-language plan presentation unit 103 is an example of a presenting unit that presents non-verbal indicators or their target values associated with words or phrases included in text information.

The operation reception unit 104 receives the user's operations on the initial settings of the non-language plan. The user's operations include a request to change a target value, so the operation reception unit 104 is an example of a reception unit that receives a change request for the target value. The operation reception unit 104 generates setting data that reflects the user's operations and includes the association between the language plan and the non-language plan.

The vibration signal generation unit 105 generates a vibration signal including nonverbal information representing the target value of the nonverbal index based on the setting data. The vibration signal generation unit 105 is an example of a generation unit that generates an instruction signal including nonverbal information representing the target value of the nonverbal index. The vibration signal generation unit 105 saves the generated vibration signal in the vibration signal storage unit 23 .

The progress tracking unit 106 converts the voice information of the speaker's utterance acquired from the microphone 43 into text data by speech recognition, compares it with the text data acquired as the language plan, and determines (tracks) the progress of the presentation. The progress tracking unit 106 may additionally or alternatively use information such as slide numbers obtained from the presentation device 45.

The timing calculation unit 107 compares the text data obtained by speech recognition with the language plan, and calculates the instruction timing for outputting the instruction signal. The instruction timing may be rephrased as the utterance timing at which the word or phrase associated with the non-language plan is spoken. The timing calculation unit 107 is an example of a calculation unit that calculates the utterance timing of a word or phrase associated with a target value in the utterance of the speaker. The timing calculation unit 107 outputs a vibration signal call command including information about the instruction timing.

The signal calling unit 108 calls, integrates, and outputs the necessary vibration signals from the vibration signal storage unit 23 in response to the vibration signal calling command.

The vibration device drive unit 109, like the vibration device drive unit 15, generates and outputs a drive signal based on the vibration signal. The vibration device drive unit 109 outputs the drive signal in relation to the instruction timing (for example, at the instruction timing or a predetermined time before the instruction timing). The drive signal also includes non-verbal information representing the target value of the non-verbal indicator, and thus can be said to be an example of the instruction signal. The vibration device drive unit 109 is an example of an output unit that outputs an instruction signal as a stimulus perceivable by the speaker.

(1-3) Hardware Configuration The presentation support device 2 according to the second embodiment can have the same hardware configuration as explained for the presentation support device 1 in the first embodiment.

(2) Operation Next, an information processing operation by the presentation support device 2 configured as described above will be described. The operation by the presentation support device 2 includes an operation of generating an instruction signal in advance before the presentation and an operation of outputting the saved instruction signal during the presentation.

(2-1) Generation of Instruction Signal FIG. 6 is a flow chart showing an example of processing related to generation of an instruction signal by the presentation support device 2. It is assumed that the non-language plan database 21 and the vibration pattern database 22 store the data necessary for the processing in advance.

First, in step S201, the presentation support device 2 acquires a language plan using the language plan acquisition unit 101 under the control of the control unit 10. The language plan acquisition unit 101 acquires the language plan (text information) by, for example, receiving, via the input/output interface 30, text data that a user such as the speaker has input to the input device 41 or the presentation device 45.

In step S202, the presentation support device 2 synthesizes the language plan with the non-language plan using the non-language plan synthesizing unit 102. For example, the non-language plan synthesizing unit 102 analyzes the structure of the sentences included in the text data, identifies, based on the information accumulated in the non-language plan database 21, the words or phrases with which a non-language plan should be associated, and associates with them the target values of the non-verbal indicators read from the non-language plan database 21. A generally known technique may be used to analyze the sentence structure. Words or phrases with which a non-language plan should be associated include, for example, structurally important words or phrases, specific proper nouns, and phrases suggesting a change of topic or a conclusion. In the synthesis process of step S202, the non-language plan synthesizing unit 102 creates initial setting data in which target values of non-verbal indicators are associated with the text data, for example, that a given word be uttered with emphasis (for example, the word section P has a sound pressure level equal to or greater than the ratio Q to the average sound pressure level) or that a given phrase be spoken slowly (for example, the number of moras per unit time is R or less).
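The two example target values in step S202 can be read as simple threshold conditions. The sketch below states them as predicates under explicit assumptions: the levels are treated as linear quantities (for example RMS amplitude) so that "ratio Q" becomes a multiplication, the unit time is one second, and the function names are hypothetical. How the levels and mora counts are measured is outside this sketch.

```python
def is_emphasized(section_level, average_level, q):
    """True when the level of word section P is at least q times the average.

    Assumes both levels are linear (not decibel) values measured the same way.
    """
    return section_level >= q * average_level


def is_slow_enough(mora_count, duration_s, r):
    """True when the speaking rate in moras per second is r or less."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return mora_count / duration_s <= r
```

For instance, with Q = 1.5 a section level of 0.4 against an average of 0.2 satisfies the emphasis condition, and 12 moras spoken in 2.0 seconds satisfies a slow-speech condition of R = 6.0 moras per second.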

In step S203, the presentation support device 2 uses the non-language plan presentation unit 103 to generate display data based on the initial setting data, and outputs the display data to the output device 42. For example, the non-language plan presentation unit 103 generates display data that visually shows the relationship between the text data representing the speaker's planned utterance content and the associated non-verbal indicators or their target values, and presents it to the user through a user interface such as a display. The presentation method to the user is not limited to visual display. Additionally or alternatively, the non-language plan presentation unit 103 may generate synthesized speech of the planned utterance content that reflects the target values of the non-verbal indicators based on the initial setting data, and output the synthesized speech from a loudspeaker or the like.

In step S204, the presentation support device 2 receives the user's operation through the operation reception unit 104. User operations include, for example, deciding whether to adopt each association between a word or phrase and a non-verbal index included in the initial setting data, discarding an association, adding an association, changing an association, and changing a specific target value. The process of step S204 can be rephrased as a process of receiving a change request from the user for the initial setting of the association between the language plan and the non-language plan created by the presentation support device 2. The process of step S204 may also accept a user's instruction to adopt the initial settings without change. The operation reception unit 104 receives the user's operation via the input device 41 such as a keyboard or mouse. The operation reception unit 104 may also accept user operations as voice commands input via the microphone 43. The operation reception unit 104 reflects the operation received from the user in the initial setting data that associates the language plan with the non-language plan, and passes the resulting setting data to the vibration signal generation unit 105.
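How the user's operations in step S204 might be reflected in the initial setting data can be sketched as follows. This is an illustrative assumption rather than the described implementation: the setting data is modelled as a dictionary keyed by a (phrase, index type) pair, and the operation names "discard", "add", and "change" are hypothetical labels for the operations listed above.

```python
def apply_operations(initial, operations):
    """Return new setting data with the user's operations applied in order.

    initial maps (phrase, index_type) -> target value.
    Each operation is ("discard", key), ("add", key, value),
    or ("change", key, value). An empty list adopts the initial
    settings without change.
    """
    settings = dict(initial)  # copy so the initial setting data stays intact
    for op in operations:
        kind, key = op[0], op[1]
        if kind == "discard":
            settings.pop(key, None)
        elif kind == "add":
            settings[key] = op[2]
        elif kind == "change":
            if key not in settings:
                raise KeyError(f"no association to change: {key}")
            settings[key] = op[2]
        else:
            raise ValueError(f"unknown operation: {kind}")
    return settings
```

Returning a new dictionary instead of editing in place keeps the device-generated initial settings available, for example if the user later wants to revert to them.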

In step S205, the presentation support device 2 uses the vibration signal generation unit 105 to call up the necessary vibration pattern data from the vibration pattern database 22 based on the target values of the non-verbal indicators included in the received setting data, and to generate a vibration signal. For example, the vibration signal generation unit 105 determines the vibration pattern data that matches the target value of the non-verbal index included in the setting data, and calls it from the vibration pattern database 22. The vibration signal generation unit 105 also adjusts the called vibration pattern data according to the number of moras of each word or phrase to generate the vibration signal. The generated vibration signal is an example of an instruction signal containing non-verbal information representing a target value of the non-verbal index. The vibration signal generation unit 105 stores the generated vibration signal in the vibration signal storage unit 23 in association with the corresponding word or phrase in the language plan.
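The text does not specify how the vibration pattern data is adjusted to the number of moras. One possible reading, shown here purely as an assumption, is that a stored amplitude envelope is stretched by linear interpolation so that its length is proportional to the mora count. The function name `stretch_pattern` and the parameter `samples_per_mora` are hypothetical.

```python
def stretch_pattern(pattern, mora_count, samples_per_mora):
    """Resample an amplitude envelope to mora_count * samples_per_mora samples.

    Uses linear interpolation between neighbouring envelope points; the first
    and last points of the stored pattern are preserved.
    """
    if len(pattern) < 2:
        raise ValueError("pattern needs at least two points")
    n = mora_count * samples_per_mora
    if n < 2:
        raise ValueError("output needs at least two samples")
    out = []
    for i in range(n):
        # Position of output sample i on the input pattern's index axis.
        pos = i * (len(pattern) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        frac = pos - lo
        out.append(pattern[lo] * (1 - frac) + pattern[hi] * frac)
    return out
```

A rising envelope `[0.0, 1.0]` stretched for a two-mora word at two samples per mora yields four samples that still start at 0.0 and end at 1.0.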

(2-2) Output of Instruction Signal FIG. 7 is a flow chart showing an example of processing related to the output of an instruction signal by the presentation support device 2. It is assumed that the vibration signal storage unit 23 stores the vibration signals associated with the language plan, and that the speaker is giving a presentation based on the language plan.

First, in step S301, the presentation support device 2 tracks the progress of the presentation using the progress tracking unit 106 under the control of the control unit 10. The progress tracking unit 106 converts the speech signal of the speaker's utterance obtained from the microphone 43 into text data using, for example, existing speech recognition software, and determines the progress of the presentation by comparing it with the text data representing the planned utterance content obtained in advance. Additionally or alternatively, the progress tracking unit 106 may determine the progress of the presentation based on information relating to the progress of the presentation output from the presentation device 45, such as slide numbers, slide switching information, and animation display information.
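The comparison between recognized text and the language plan in step S301 can be approximated with a sequence alignment. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on word lists; this choice of algorithm is an assumption for illustration, since the text only says that the two texts are compared. The function name `estimate_progress` is hypothetical.

```python
from difflib import SequenceMatcher


def estimate_progress(plan_words, recognized_words):
    """Return the index in plan_words just past the last matched word.

    A result of 0 means no word of the plan has been matched yet.
    """
    matcher = SequenceMatcher(a=plan_words, b=recognized_words, autojunk=False)
    # The final block returned by get_matching_blocks() is a zero-size dummy.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0
    last = blocks[-1]
    return last.a + last.size
```

Because the alignment tolerates insertions, a filler word such as "uh" in the recognized text does not prevent the later words from being matched. A real system would also need to handle misrecognitions and repeated phrases, which this sketch does not attempt.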

In step S302, the presentation support device 2 uses the timing calculation unit 107 to compare the language plan (text data) with the speech-recognized text data, and calculates the instruction timing for outputting the instruction signal. The timing calculation unit 107 may measure the speech speed of the speaker from the speech voice of the speaker acquired via the microphone 43, and additionally calculate the instruction timing based on the speech speed.
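As a rough illustration of how speech speed could enter the timing calculation in step S302, the sketch below estimates the time remaining until a target word from the moras still to be spoken and the measured speaking rate. The per-word mora counts, the function name, and the `lead_time_s` parameter (modelling output a predetermined time before the instruction timing) are assumptions of this sketch.

```python
def seconds_until_instruction(mora_counts, current_index, target_index,
                              moras_per_second, lead_time_s=0.0):
    """Estimate the time in seconds until the instruction signal should start.

    mora_counts[i] is the number of moras in word i of the language plan.
    current_index is the word the speaker is about to say and target_index
    is the first word of the associated word or phrase.
    """
    if moras_per_second <= 0:
        raise ValueError("moras_per_second must be positive")
    # Moras that must still be spoken before the target word begins.
    remaining_moras = sum(mora_counts[current_index:target_index])
    return max(0.0, remaining_moras / moras_per_second - lead_time_s)
```

If the target word has already been reached or passed, the slice is empty and the function returns 0.0, meaning the signal should be output immediately.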

In step S303, the presentation support device 2 monitors the progress of the presentation using the timing calculation unit 107, and upon detecting the instruction timing of a word or phrase associated with the target value of a non-verbal index (YES), proceeds to step S304. The timing calculation unit 107 continues to monitor the progress until the instruction timing is detected (NO).

In step S304, the presentation support device 2 uses the timing calculation unit 107 to output a vibration signal call command to the signal calling unit 108. The vibration signal call command contains information specifying the word or phrase in the language plan to which the detected instruction timing applies.

In step S305, the presentation support device 2 causes the signal calling unit 108 to call the vibration signal associated with the word or phrase specified in the vibration signal call command from the vibration signal storage unit 23 and output it. The signal calling unit 108 can also integrate and output vibration signals related to a plurality of non-verbal indicators. The signal calling unit 108 integrates and outputs the vibration signals by, for example, using a sinusoidal vibration that guides the speed of speech as a carrier wave and applying to this carrier wave an amplitude modulation filter that guides intonation, in the same manner as the vibration signal generation unit 14 in the first embodiment. The integration process may instead be performed by the vibration signal generation unit 105 in step S205 of FIG. 6. In this case, the integrated vibration signal is stored in the vibration signal storage unit 23.
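The integration by amplitude modulation described in step S305 can be written as s[n] = e[n] · sin(2π f n / fs), where the carrier frequency f stands for the speech-speed guidance and the envelope e[n] stands for the intonation guidance. The following is a minimal sketch of that formula only; the function name, the sample-per-sample envelope representation, and the parameter values in the note are assumptions, and the actual mapping from target values to f and e[n] is not specified here.

```python
import math


def integrate_signals(carrier_hz, envelope, sample_rate):
    """Amplitude-modulate a sine carrier with one envelope value per sample.

    carrier_hz: carrier frequency (assumed to encode the speech-speed guidance).
    envelope:   per-sample gain (assumed to encode the intonation guidance).
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return [
        env * math.sin(2 * math.pi * carrier_hz * n / sample_rate)
        for n, env in enumerate(envelope)
    ]
```

With a 2 Hz carrier sampled at 8 Hz, the carrier peaks at sample 1, so an envelope value of 0.5 at that sample gives an output of about 0.5.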

In step S306, the presentation supporting apparatus 2 uses the vibration device drive unit 109 to generate and output a drive signal based on the vibration signal, as in the first embodiment.

One or more types of non-verbal indicators may be associated with the language plan (text information). Also, different non-verbal indicators may be associated with each word or phrase. The association can be designed freely: for example, a given accent may be guided for a particular word throughout the language plan, the speech rate may be guided at the beginning of the language plan, intonation may be guided at the conclusion of the language plan, or these may be combined.

(3) Effect As described in detail above, in the second embodiment of the invention, the target values of the non-verbal indicators are associated with the text data representing the planned utterance content, change requests from the user are reflected, and an instruction signal containing non-verbal information representing the target values is generated in advance. The target behavior is then presented to a speaker who is giving a presentation or the like by outputting the instruction signal at the appropriate timing. The instruction signal guides, in a non-verbal manner, the speaker's way of speaking toward the ideal. During the actual speech, the speaker can receive the instruction signal generated in advance from the planned utterance content, intuitively understand the target behavior, and strive to realize the ideal speech.

Therefore, the second embodiment of the present invention can also convey the target behavior (ideal speaking method) to a speaker who is devoting substantial cognitive resources to language processing, in a manner that does not interfere with that language processing.

[Other embodiments]
The present invention is not limited to the embodiments described above.
For example, as an instruction signal for presenting a target action to a speaker, an example of giving a vibration stimulus to the speaker via a smartwatch worn on the wrist has been described. As described above, the part of the body to which the vibration stimulus is applied is not limited to the wrist, but may be other parts such as the fingers, upper arm, leg, and torso. Also, the vibration device 44 is not limited to a smart watch, and may be another portable device.

As described above, the instruction signal is not limited to vibration stimulation, and may be output in any manner perceivable by the speaker. For example, the vibration patterns or vibration signal waveforms illustrated in FIG. 4 are also applicable to visual or auditory stimuli. As an example, an object visible to the speaker while speaking may be displayed on a display such as a small monitor or smart glasses, and the color, brightness, size, shape, or the like of the displayed object may be changed according to the waveform of the vibration signal VS. As another example, audible sound may be output from earphones, headphones, or a small directional loudspeaker while the speaker is speaking, and the pitch or volume of the sound may be changed according to the waveform of the vibration signal VS. Alternatively, the waveform of the vibration signal VS itself may be displayed as an image. Such visual, auditory, and tactile stimuli may also be used in combination.
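Reusing the waveform of the vibration signal VS for visual or auditory stimuli amounts to mapping each sample onto another perceptual parameter. The sketch below shows two such mappings under stated assumptions: samples are normalized to the range [-1, 1], brightness is an 8-bit value, and the pitch range of 220 Hz to 880 Hz is an arbitrary illustrative choice. The function names are hypothetical.

```python
def _clip(sample):
    """Limit a sample to the assumed normalized range [-1, 1]."""
    return max(-1.0, min(1.0, sample))


def to_brightness(sample):
    """Map a waveform sample to an 8-bit brightness value (0 to 255)."""
    return round((_clip(sample) + 1.0) / 2.0 * 255)


def to_pitch_hz(sample, low_hz=220.0, high_hz=880.0):
    """Map a waveform sample linearly onto a pitch range in hertz."""
    return low_hz + (_clip(sample) + 1.0) / 2.0 * (high_hz - low_hz)
```

Applying either function sample by sample to the same signal produces a brightness or pitch contour that follows the vibration waveform, so the three kinds of stimuli can be driven from one stored signal.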

The functional units included in the presentation support device 1 or 2 may be distributed to a plurality of devices, and these devices may cooperate with each other to perform processing. Also, each functional unit may be realized by using a circuit. A circuit may be a dedicated circuit that implements a specific function, or it may be a general-purpose circuit such as a processor.

Furthermore, the flow of each process described above is not limited to the described procedures; the order of some steps may be changed, and some steps may be performed in parallel. Also, the series of processes described above need not be executed consecutively in time, and each step may be executed at any timing.

The method described above can be stored, as a program (software means) executable by a computer, on a recording medium (storage medium) such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disc (CD-ROM, DVD, MO, etc.), or a semiconductor memory (ROM, RAM, flash memory, etc.), or can be transmitted and distributed via a communication medium. The programs stored on the medium also include a setting program that configures, within the computer, the software means (including not only execution programs but also tables and data structures) to be executed by the computer. A computer that realizes the above apparatus reads the program recorded on the recording medium, in some cases builds the software means using the setting program, and executes the above-described processes with its operation controlled by the software means. The term "recording medium" as used herein is not limited to media for distribution, and includes storage media such as magnetic disks and semiconductor memories provided in computers or in devices connected via a network.

It should be noted that the present invention is not limited to the above-described embodiments, and various modifications can be made in the implementation stage without departing from the gist of the invention. Further, each embodiment may be implemented in combination as appropriate, in which case the combined effect can be obtained. Furthermore, various inventions are included in the above embodiments, and various inventions can be extracted by combinations selected from a plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiments, if the problem can be solved and effects can be obtained, the configuration with the constituent elements deleted can be extracted as an invention.

1 ... presentation support device,
10 ... control unit,
20 ... storage unit,
30 ... input/output interface,
11 ... voice information acquisition unit,
12 ... non-verbal characteristic measurement unit,
13 ... determination unit,
14 ... vibration signal generation unit,
15 ... vibration device drive unit,
21 ... non-language plan database,
22 ... vibration pattern database,
41 ... input device,
42 ... output device,
43 ... microphone,
44 ... vibration device,
2 ... presentation support device,
23 ... vibration signal storage unit,
45 ... presentation device,
101 ... language plan acquisition unit,
102 ... non-language plan synthesizing unit,
103 ... non-language plan presentation unit,
104 ... operation reception unit,
105 ... vibration signal generation unit,
106 ... progress tracking unit,
107 ... timing calculation unit,
108 ... signal calling unit,
109 ... vibration device drive unit.

Claims

1. An information processing device comprising:
a first acquisition unit that acquires a target value of a non-verbal index in an utterance;
a generation unit that generates an instruction signal including non-verbal information representing the target value; and
an output unit that outputs the instruction signal as a stimulus perceivable by a speaker.

2. The information processing device according to claim 1, further comprising:
a second acquisition unit that acquires voice information uttered by the speaker;
a measurement unit that measures the non-verbal index from the voice information to obtain a measured value; and
a determination unit that compares the measured value with a predetermined allowable range including the target value, and determines, based on the comparison result, whether presentation of the target is necessary,
wherein the output unit outputs the instruction signal when it is determined that presentation of the target is necessary.

3. The information processing device according to claim 2, wherein
the measurement unit obtains the measured value of the non-verbal index from the voice information for each time window of a predetermined length, and
the determination unit compares the measured value with the allowable range in each time window, and determines that presentation of the target is necessary when the measured value is determined to deviate from the target value over a predetermined number of consecutive time windows.

4. The information processing device according to claim 1, further comprising:
a third acquisition unit that acquires text information representing the planned utterance content of the speaker;
a synthesizing unit that associates the target value with a word or phrase included in the text information; and
a calculation unit that calculates the utterance timing, in the utterance of the speaker, of the word or phrase associated with the target value,
wherein the output unit outputs the instruction signal in relation to the utterance timing.

5. The information processing device according to claim 4, further comprising:
a presenting unit that presents the target value associated with the word or phrase by the synthesizing unit; and
a reception unit that receives a change request for the target value,
wherein the generation unit generates the instruction signal by reflecting the received change request.

6. The information processing device according to any one of claims 1 to 5, wherein
the generation unit generates the instruction signal by modulating a basic carrier wave by at least one of on/off modulation, amplitude modulation, or frequency modulation.

7. An information processing method comprising:
acquiring a target value of a non-verbal index in an utterance;
generating an instruction signal including non-verbal information representing the target value; and
outputting the instruction signal as a stimulus perceivable by a speaker.

8. A program that causes a computer to execute the processing performed by each unit of the information processing device according to any one of claims 1 to 6.