WO2020079918A1 - Information processing device and information processing method - Google Patents

Information processing device and information processing method Download PDF

Info

Publication number
WO2020079918A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
voice
information
information processing
image
Prior art date
Application number
PCT/JP2019/029985
Other languages
French (fr)
Japanese (ja)
Inventor
Jun Rekimoto (暦本 純一)
Original Assignee
Sony Corporation (ソニー株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation (ソニー株式会社)
Priority to CN201980065946.7A (published as CN112840397A)
Publication of WO2020079918A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features

Definitions

  • the present disclosure relates to an information processing device and an information processing method.
  • However, the situations in which voice commands can be used may be limited. For example, operating a smartphone or the like by voice in a public space such as a train or a library is difficult for the people nearby to accept. In addition, speaking confidential information such as personal information aloud in a public space poses a risk of information leakage. For these reasons, voice interfaces that use voice commands tend to be limited to places where the effect of an utterance on the surroundings is clear, such as smart speakers used at home and car navigation devices used in vehicles.
  • If the above devices could be operated without actually producing a voice, they could be used regardless of location.
  • Likewise, if a wearable computer had a function for operating devices without producing a voice, wearing it at all times would make such services available anywhere. For this reason, research is under way on recognition technology for unvoiced utterances, which performs voice recognition without the user producing a voice.
  • For example, Patent Document 1 below discloses a technology that detects the movement and position of the vocal organs with electromagnetic waves to identify speech. In addition to the technique of Patent Document 1, research is also under way on throat (pharyngeal) microphones, which are attached to the throat to reliably capture speech in noisy environments.
  • However, the recognition technologies for non-voiced utterances mentioned above still require at least a whisper-level voice, so their use in public spaces remains limited.
  • Moreover, if the volume of the whisper is reduced to bring it closer to a truly unvoiced state, recognition accuracy may decrease.
  • the present disclosure proposes a new and improved information processing apparatus and information processing method that allow a user to obtain intended acoustic information without vocalization.
  • An information processing apparatus including:
  • An information processing method executed by a processor including:
  • <<1. Embodiments of the present disclosure>> <1.1. Overview>
  • In recent years, devices that can be controlled by voice commands have become widespread. For example, on smartphones, car navigation devices, and the like, it has become common to use search functions via voice commands, and documents can now be created by transcribing content input by voice. In addition, speaker-type voice interface devices known as smart speakers, which operate by voice commands, have become widespread.
  • However, the situations in which voice commands can be used may be limited. For example, operating a smartphone or the like by voice in a public space such as a train or a library is difficult for the people nearby to accept. In addition, speaking confidential information such as personal information aloud in a public space poses a risk of information leakage. For these reasons, voice interfaces that use voice commands tend to be limited to places where the effect of an utterance on the surroundings is clear, such as smart speakers used at home and car navigation devices used in vehicles.
  • If the above devices could be operated without actually producing a voice, they could be used regardless of location.
  • Likewise, if a wearable computer had a function for operating devices without producing a voice, wearing it at all times would make such services available anywhere. For this reason, research is under way on recognition technology for unvoiced utterances, which performs voice recognition without the user producing a voice.
  • For example, Patent Document 1 below discloses a technology that detects the movement and position of the vocal organs with electromagnetic waves to identify speech.
  • In addition, research is under way on throat (pharyngeal) microphones, which are attached to the throat to reliably capture speech in noisy environments.
  • However, the recognition technologies for non-voiced utterances mentioned above still require at least a whisper-level voice, so their use in public spaces remains limited.
  • Moreover, if the volume of the whisper is reduced to bring it closer to a truly unvoiced state, recognition accuracy may decrease.
  • The embodiment of the present disclosure was conceived with the above points in mind and proposes a technology that enables the user to obtain the intended acoustic information without uttering a voice.
  • the present embodiment will be sequentially described in detail.
  • FIG. 1 is a diagram illustrating a configuration example of a voiceless speech system according to an embodiment of the present disclosure.
  • the voiceless speech system 1000 according to the present embodiment includes a mobile terminal 10, an ultrasonic echo device 20, and a voice input / output device 30.
  • Various devices may be connected to the mobile terminal 10.
  • the ultrasonic echo device 20 and the voice input / output device 30 are connected to the mobile terminal 10, and information cooperation is performed between the devices.
  • the ultrasonic echo device 20 and the voice input / output device 30 are wirelessly connected to the mobile terminal 10 according to the present embodiment.
  • the mobile terminal 10 performs short-distance wireless communication with the ultrasonic echo device 20 and the voice input / output device 30 using Bluetooth (registered trademark).
  • the ultrasonic echo device 20 and the audio input / output device 30 may be connected to the mobile terminal 10 by wire or may be connected via a network.
  • the mobile terminal 10 is an information processing device capable of recognition processing based on machine learning.
  • the recognition process according to the present embodiment is, for example, a voice recognition process.
  • the voice recognition process is performed, for example, on information about a voice generated from an image (still image / moving image).
  • For example, the mobile terminal 10 converts an image showing the state of the inside of the oral cavity of the user 12 (hereinafter also referred to as an echo image) into information regarding voice, and performs voice recognition processing on the converted information regarding voice.
  • More specifically, a plurality of time-series images showing the time-series change in the intraoral state when the user 12 changes the intraoral state without uttering a voice are converted into information regarding voice. Thereby, the mobile terminal 10 according to the present embodiment can realize voice recognition without vocalization.
  • the plurality of time-series images are echo images showing changes in the state of the oral cavity when the user moves at least one of the mouth and tongue without uttering.
  • the plurality of time-series images showing the time-series changes in the oral cavity state of the user 12 are also referred to as time-series echo images.
  • Information related to voice is, for example, information that can be recognized by a voice recognition device (hereinafter, also referred to as acoustic information).
  • The acoustic information is, for example, a spectrogram that represents, in the three dimensions of frequency, amplitude, and time, the time-series change in characteristics of the voice such as its pitch and strength.
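  • For concreteness, the following sketch (not part of the disclosure) computes such a spectrogram from a recorded waveform using the librosa library; the file name and parameter values are illustrative assumptions.

```python
# Sketch only: a spectrogram as a (frequency x time) grid of amplitudes,
# assuming the librosa library; file name and parameters are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("uttered_voice.wav", sr=16000)          # waveform and sample rate
stft = librosa.stft(y, n_fft=512, hop_length=128)            # short-time Fourier transform
spectrogram = np.abs(stft)                                    # amplitude per (frequency, time) bin
log_spec = librosa.amplitude_to_db(spectrogram, ref=np.max)   # log scale, closer to perceived loudness
print(log_spec.shape)                                         # (frequency bins, time frames)
```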
  • Information related to audio is converted from images using algorithms acquired by machine learning.
  • the machine learning according to this embodiment is performed by, for example, deep learning.
  • the algorithm acquired by the machine learning is, for example, a neural network (NN: Neural Network).
  • An image is used as an input for the machine learning. Therefore, the machine learning is performed using a convolutional neural network (CNN: Convolutional Neural Network) suitable for deep learning of image processing.
  • a time-series echo image when the user 12 makes a voice is used for machine learning.
  • The first algorithm is a first neural network (hereinafter also referred to as NN1) that converts a time-series echo image obtained when the user 12 changes the state of the oral cavity without producing a sound into acoustic information (first acoustic information).
  • The second algorithm is a second neural network (hereinafter also referred to as NN2) that converts the acoustic information converted by NN1 into more accurate acoustic information (second acoustic information).
  • the more accurate acoustic information is, for example, acoustic information obtained by converting the uttered voice that is the voice when the user 12 actually utters.
  • the details of NN1 and NN2 will be described later.
  • There are two types of time-series echo images according to the present embodiment: the time-series echo image that is converted into acoustic information by NN1 and the time-series echo image that is used for machine learning.
  • The time-series echo image converted into acoustic information is the time-series echo image of the oral cavity when the user 12 changes the intraoral state without making a sound, and is therefore hereinafter also referred to as an unvoiced time-series echo image.
  • Since the time-series echo image used for machine learning is the time-series echo image of the oral cavity when the user 12 utters, it is hereinafter also referred to as a vocalization time-series echo image.
  • There are several kinds of acoustic information according to the present embodiment.
  • the acoustic information (first acoustic information) converted by the NN1 is a spectrogram obtained by converting an unvoiced time-series echo image, and is hereinafter referred to as an unvoiced image spectrogram.
  • Since the acoustic information (second acoustic information) converted by NN2 is a higher-accuracy spectrogram obtained by converting the unvoiced image spectrogram, it is hereinafter referred to as a high-accuracy unvoiced image spectrogram.
  • the learning information (first learning information) used for the machine learning of NN1 is a vocalization time-series echo image and vocalized voice.
  • The learning information (second learning information) used for machine learning of NN2 includes acoustic information (third acoustic information) obtained by converting the vocalization time-series echo image via NN1 and acoustic information (fourth acoustic information) corresponding to the uttered voice.
  • Since the acoustic information (third acoustic information) obtained via NN1 is a spectrogram converted from the vocalization time-series echo image by NN1, it is hereinafter referred to as a vocalization image spectrogram.
  • The acoustic information (fourth acoustic information) corresponding to the uttered voice is a spectrogram corresponding to the voice when the user 12 actually utters, and is hereinafter referred to as an uttered voice spectrogram.
  • the vocal image spectrogram (third acoustic information) is used as an input of the NN2 machine learning, and the vocalized speech spectrogram (fourth acoustic information) is used as an output of the NN2 machine learning.
  • the mobile terminal 10 also has a function of controlling the overall operation of the voiceless speech system 1000.
  • the mobile terminal 10 controls the overall operation of the voiceless speech system 1000 based on the information associated with each device.
  • the mobile terminal 10 controls the process related to voice recognition in the mobile terminal 10 and the operation of the voice input / output device 30 based on the information received from the ultrasonic echo device 20 and the voice input / output device 30.
  • the mobile terminal 10 may control the operation of the ultrasonic echo device 20.
  • the mobile terminal 10 is realized by, for example, a smartphone as shown in FIG.
  • the mobile terminal 10 is not limited to a smartphone.
  • the mobile terminal 10 may be a terminal device such as a tablet terminal, a PC, a wearable terminal, or an agent device that has the functions of the mobile terminal 10 implemented as an application. That is, the mobile terminal 10 can be realized as an arbitrary terminal device.
  • the ultrasonic echo device 20 is a device that acquires an echo image in the oral cavity of the user 12.
  • the ultrasonic echo device 20 acquires an echo image by using an ultrasonic examination technique widely used in medicine.
  • For example, the ultrasonic echo device 20 includes an ultrasonic output device capable of outputting ultrasonic waves; the ultrasonic output device attached to the body surface of the user 12 outputs ultrasonic waves into the body of the user 12, and an echo image is acquired based on the ultrasonic waves reflected by the internal organs. The ultrasonic echo device 20 then transmits the acquired echo image to the mobile terminal 10.
  • the ultrasonic echo device 20 is realized as a neckband type device as shown in FIG. 1, for example.
  • the ultrasonic wave output unit 22 of the ultrasonic wave echo device 20 is provided with an ultrasonic wave output device.
  • Because of its structure as a neckband type device, the ultrasonic echo device 20 shown in FIG. 1 is provided with two ultrasonic wave output units 22, namely ultrasonic wave output units 22a and 22b.
  • the number of ultrasonic wave output units 22 is not limited to two, and the ultrasonic wave echo device 20 may include at least one ultrasonic wave output unit 22.
  • When the user 12 wears the ultrasonic echo device 20 so that the ultrasonic wave output unit 22 is located under the chin, ultrasonic waves are output toward the oral cavity of the user 12. Thereby, the ultrasonic echo device 20 can acquire an echo image of the oral cavity of the user 12. Voice is uttered after the vibration produced by the vocal cords is shaped by the tongue and the opening of the mouth. Therefore, it can be said that the echo image of the oral cavity of the user 12 acquired by the ultrasonic echo device 20 contains effective information as an image to be converted into acoustic information.
  • FIG. 2 is a diagram showing an echo image according to this embodiment.
  • the echo image 40 is an echo image in the oral cavity of the user 12 acquired by the ultrasonic echo device 20.
  • the tongue tip 402, the tongue surface 404, and the tongue root 406 are shown.
  • The ultrasonic echo device 20 continuously acquires the echo images 40, thereby acquiring a plurality of time-series images (time-series echo images) that show the time-series change in the intraoral state of the user 12.
  • Voice input / output device 30 is a device capable of inputting / outputting voice.
  • the voice input / output device 30 acquires, for example, the voice uttered by the user 12. Then, the voice input / output device 30 transmits the acquired voice to the mobile terminal 10.
  • the voice input / output device 30 also receives, for example, voice data indicating the content recognized by the mobile terminal 10 from the mobile terminal 10. Then, the voice input / output device 30 outputs the received voice data as voice.
  • the voice input / output device 30 is realized by, for example, a wearable terminal.
  • the voice input / output device 30 is preferably a wearable terminal such as an earphone or a bone conduction earphone capable of inputting / outputting voice. Since the voice input / output device 30 is an earphone, a bone conduction earphone, or the like, the amount of voice leaked to the outside can be reduced.
  • the voice input / output device 30 has a structure that allows the user 12 to listen to externally generated voice in addition to the voice output from the voice input / output device 30.
  • For example, the voice input / output device 30 has an opening 32, so the user 12 can hear external sounds through the opening 32 even while wearing the device. Thus, even if the user 12 always wears the voice input / output device 30 having this structure, daily life is not impaired. Further, even when the voice indicating the content recognized by the mobile terminal 10 is output from a speaker such as a smart speaker instead of the voice input / output device 30, the user 12 can hear that voice.
  • In the present embodiment, the voice output function for outputting voice and the voice input function for acquiring voice are realized by one device, but they may be realized by independent devices.
  • FIG. 3 is a diagram showing an outline of functions of the voiceless speech system according to the present embodiment.
  • the voiceless speech system 1000 first acquires NN1 and NN2 in advance by machine learning based on the vocalization time series echo image and the vocalized voice.
  • the ultrasonic echo device 20 acquires the unvoiced time series echo image 42.
  • the acquired unvoiced time-series echo image 42 is converted to the unvoiced image spectrogram 72 via the first neural network 122 (NN1).
  • the unvoiced image spectrogram 72 is a combination of a plurality of acoustic feature quantities 70 in chronological order. Details of the acoustic feature amount 70 will be described later.
  • the converted unvoiced image spectrogram 72 is converted into a high-accuracy unvoiced image spectrogram 74 via the second neural network 124 (NN2).
  • the converted high-accuracy unvoiced image spectrogram 74 is input to the recognition unit 114 of the mobile terminal 10. Then, the recognition unit 114 performs voice recognition processing based on the input high-accuracy unvoiced image spectrogram 74.
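  • As a rough illustration of this data flow only (not the actual implementation), the sketch below wires the two stages together; nn1, nn2, and recognize are hypothetical stand-ins for the trained networks and the recognition unit 114.

```python
# Minimal sketch of the inference flow of FIG. 3. All callables are hypothetical
# stand-ins; nn1 maps echo images to acoustic features, nn2 refines the spectrogram.

def silent_speech_pipeline(unvoiced_echo_frames, nn1, nn2, recognize):
    """unvoiced_echo_frames: array of shape (num_frames, H, W) from the echo device."""
    unvoiced_spectrogram = nn1(unvoiced_echo_frames)   # NN1: echo images -> unvoiced image spectrogram
    refined_spectrogram = nn2(unvoiced_spectrogram)    # NN2: -> high-accuracy unvoiced image spectrogram
    return recognize(refined_spectrogram)              # recognition unit 114: voice recognition
```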
  • FIG. 4 is a block diagram showing a functional configuration example of the voiceless speech system according to the present embodiment.
  • the mobile terminal 10 includes a communication unit 100, a control unit 110, and a storage unit 120.
  • the information processing apparatus according to this embodiment has at least the control unit 110.
  • the communication unit 100 has a function of communicating with an external device. For example, the communication unit 100 outputs the information received from the external device to the control unit 110 in the communication with the external device. Specifically, the communication unit 100 outputs the echo image received from the ultrasonic echo device 20 to the control unit 110. Further, the communication unit 100 outputs the voice received from the voice input / output device 30 to the control unit 110.
  • the communication unit 100 transmits information input from the control unit 110 to an external device in communication with the external device. Specifically, the communication unit 100 transmits the information regarding the acquisition of the echo image input from the control unit 110 to the ultrasonic echo device 20. Further, the communication unit 100 transmits information regarding the input / output of the voice input from the control unit 110 to the voice input / output device 30.
  • Control unit 110 has a function of controlling the operation of the mobile terminal 10. For example, the control unit 110 converts a plurality of time-series images indicating the intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
  • the algorithm has a first neural network, and the control unit 110 converts a plurality of input time-sequential images in the unvoiced state into first acoustic information via the first neural network.
  • the control unit 110 inputs the unvoiced time series echo image input from the communication unit 100 to the NN1.
  • the NN1 converts the input unvoiced time series echo image into an unvoiced image spectrogram.
  • the control unit 110 can perform voice recognition processing by converting the spectrogram into a voice waveform. Therefore, the control unit 110 can perform voice recognition processing based on the unvoiced time-series echo image and control a device that can be operated by voice even if the user 12 does not make a voice.
  • the algorithm further has a second neural network, and the control unit 110 converts the first acoustic information into second acoustic information corresponding to the voice at the time of utterance, via the second neural network.
  • the control unit 110 inputs the unvoiced image spectrogram output from NN1 to NN2.
  • the NN2 converts the input unvoiced image spectrogram into a high-precision unvoiced image spectrogram corresponding to the uttered voice.
  • the voice indicated by the unvoiced image spectrogram output from the NN1 is “Ulay music.”
  • the voice indicated by the high-precision unvoiced image spectrogram corresponding to the voiced voice is “Play music.”.
  • In this case, the unvoiced image spectrogram indicating the voice "Ulay music." is converted into a high-precision unvoiced image spectrogram indicating the voice "Play music.". That is, NN2 has the role of correcting the voice indicated by the unvoiced image spectrogram that NN1 converted from the unvoiced time-series echo image.
  • control unit 110 has a machine learning unit 112, a recognition unit 114, and a processing control unit 116, as shown in FIG.
  • the machine learning unit 112 has a function of performing machine learning using learning information.
  • the machine learning unit 112 acquires an algorithm for converting an echo image into a spectrogram by machine learning. Specifically, the machine learning unit 112 acquires NN1 which is an algorithm for converting the unvoiced time-series echo image into an unvoiced image spectrogram.
  • the machine learning unit 112 also acquires NN2, which is an algorithm for converting the unvoiced image spectrogram into a high-accuracy unvoiced image spectrogram.
  • NN1 is obtained by machine learning using the first learning information that includes a voice when uttered and a plurality of time-series images when uttered.
  • the NN1 is obtained by machine learning using the voice uttered by the user 12 and the utterance time-series echo image when the user 12 utters the voice as the first learning information.
  • the control part 110 can convert an echo image into a spectrogram via NN1.
  • the first learning information is acquired, for example, by causing the user 12 to read a text or the like. Accordingly, it is possible to acquire an echo image showing a time series change and a speech waveform corresponding to the echo image.
  • the speech waveform can be converted into an acoustic feature amount.
  • When the control unit 110 inputs a plurality of unvoiced time-series images to NN1, NN1 generates a plurality of acoustic feature amounts, one per unit time, from the plurality of unvoiced time-series images.
  • the first acoustic information is generated by synthesizing the generated plural acoustic feature amounts in time series.
  • For example, NN1 generates a plurality of acoustic feature amounts per unit time from the unvoiced time-series echo images input by the control unit 110 and synthesizes the generated acoustic feature amounts in time-series order to generate an unvoiced image spectrogram.
  • FIG. 5 is a diagram showing an example of generation of the acoustic feature quantity according to the present embodiment.
  • NN1 selects the time-series image at the central time of a unit time from among the plurality of unvoiced time-series images acquired in that unit time, and generates the acoustic feature amount for that unit time from the selected image. For example, NN1 selects the echo image at the central time of the unit time from the unvoiced time-series echo images acquired in the unit time and generates an acoustic feature amount per unit time from the selected echo image.
  • The unit time according to the present embodiment is, for example, the time in which between 5 and 13 echo images are acquired.
  • Here, suppose the unit time is the time in which 13 echo images are acquired.
  • NN1 selects, from the unvoiced time-series echo images 42, the echo image 424 located at the center of the unvoiced time-series echo images 422 acquired in one unit time, and generates the acoustic feature amount 70 from the selected echo image 424.
  • the NN1 repeats the generation processing of the acoustic feature amount 70 by shifting the start time of the unit time, and synthesizes the plurality of generated acoustic feature amounts 70 to acquire the unvoiced image spectrogram 78.
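  • A sketch of this windowing scheme is shown below, assuming a hypothetical nn1_feature function that maps the echo image at the window's central time to one acoustic feature vector; the window length of 13 and a stride of one frame are illustrative choices based on the example above.

```python
import numpy as np

def echo_frames_to_spectrogram(frames, nn1_feature, window=13, stride=1):
    """frames: (num_frames, H, W) unvoiced time-series echo images.
    nn1_feature: hypothetical function mapping the echo image at the central
    time of a unit-time window to one acoustic feature vector (e.g. 64-dim).
    Returns the feature vectors stacked in time order (the unvoiced image spectrogram)."""
    features = []
    for start in range(0, len(frames) - window + 1, stride):
        center = start + window // 2            # image at the central time of the unit time
        features.append(nn1_feature(frames[center]))
    return np.stack(features, axis=0)            # (num_windows, feature_dim)
```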
  • Thereby, NN1 can learn the mouth movement corresponding to the smallest units of pronunciation, such as "th", and the recognition unit 114 can in turn recognize the pronunciation more accurately.
  • The acoustic feature amounts are obtained, for example, by reducing the dimensionality of the voice waveform with a mel-scale spectrogram, MFCC (mel frequency cepstrum coefficients), a short-time FFT (SFFT), or a neural network (auto encoder).
  • The technology related to the auto encoder is disclosed in a paper by Jesse Engel and 6 others ("Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders", URL: https://arxiv.org/abs/1704.01279).
  • These can be interconverted with acoustic waveforms.
  • For example, a mel-scale spectrogram can be converted back into an acoustic waveform by using a technique called the Griffin-Lim algorithm.
  • As the acoustic feature amount, any other representation capable of expressing a short segment of the speech waveform as a vector can also be used.
  • In the present embodiment, the number of dimensions of the acoustic feature amount is about 64, but the number of dimensions may be changed according to the desired sound quality of the reproduced voice.
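  • For concreteness, the sketch below derives 64-dimensional mel-scale features from a voice waveform and converts them back into a waveform using the librosa library, whose mel inversion uses the Griffin-Lim algorithm internally; the file name and parameter values are illustrative assumptions.

```python
import librosa

y, sr = librosa.load("uttered_voice.wav", sr=16000)

# 64-dimensional mel-scale spectrogram as the acoustic feature sequence
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=128, n_mels=64)

# Convert the features back into an acoustic waveform (Griffin-Lim based inversion)
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=512, hop_length=128)
```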
  • NN2 is obtained by machine learning using second learning information that includes the third acoustic information, which is generated by inputting a plurality of time-series images at the time of utterance to NN1, and the fourth acoustic information, which corresponds to the voice at the time of utterance.
  • NN2 is obtained by machine learning using a vocalization image spectrogram generated by inputting a vocalization time-series echo image to NN1 and a vocalization spectrogram corresponding to a vocalization as second learning information.
  • the control unit 110 can convert the unvoiced image spectrogram output from the NN1 via the NN2 into a more accurate spectrogram.
  • NN2 may convert the unvoiced image spectrogram into a spectrogram of the same length. For example, if the unvoiced image spectrogram is a spectrogram corresponding to the command issued by the user 12 to the smart speaker or the like, the NN 2 converts the unvoiced image spectrogram into a spectrogram of the same length.
  • a fixed value may be set for the length of the spectrogram input to NN2.
  • When the length of the spectrogram is insufficient, the control unit 110 may insert a voiceless portion to make up the missing length before inputting the spectrogram to NN2.
  • Since the root mean square error is used as the loss function, NN2 is trained so that its converted output matches the corresponding target output as closely as possible.
  • The purpose of using NN2 is to bring the unvoiced image spectrogram generated from the unvoiced time-series echo images corresponding to a command closer to the uttered voice spectrogram generated from the voice when that command is actually uttered.
  • Since only a specified number of unvoiced time-series echo images is input to NN1, NN1 cannot capture context over a time span longer than the span corresponding to that specified number.
  • With NN2, the conversion can be performed while taking the context of the entire command into account.
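  • The sketch below illustrates the fixed-length convention and the loss described above: an input spectrogram shorter than the prescribed length is padded with a voiceless portion (assumed here to be all zeros), and a root-mean-square error compares NN2's output against the target; the fixed length of 184 follows the example of FIG. 6.

```python
import numpy as np

def pad_to_fixed_length(spectrogram, fixed_len=184):
    """spectrogram: (time, feature_dim). Pads the time axis with a voiceless
    (here: all-zero) portion up to the prescribed fixed length."""
    time_len, dim = spectrogram.shape
    if time_len >= fixed_len:
        return spectrogram[:fixed_len]
    padding = np.zeros((fixed_len - time_len, dim), dtype=spectrogram.dtype)
    return np.concatenate([spectrogram, padding], axis=0)

def rms_error(predicted, target):
    """Root-mean-square error used as the loss function for NN2."""
    return float(np.sqrt(np.mean((predicted - target) ** 2)))
```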
  • FIG. 6 is a diagram showing the structure of the second neural network according to the present embodiment.
  • FIG. 6 shows an example in which an unvoiced image spectrogram 72 having a length of 184 and a dimension number of 64 and having a prescribed length is converted into a high-precision unvoiced image spectrogram 74 of the same length.
  • The first-stage 1D-Convolution Bank 80 is a one-dimensional CNN composed of eight different NNs with filter sizes ranging from 1 to 8.
  • Features with different time widths are extracted by using multiple filter sizes.
  • the features are, for example, phonetic symbol level features, word level features, and the like.
  • The subsequent stage is a U-Network, which is a combination of 1D-Convolution and 1D-Deconvolution.
  • the U-Network can recognize global information from the information converted by the Convolution / Deconvolution. However, since local information tends to be lost, U-Network has a structure for guaranteeing the local information.
  • For example, the U-Network converts an acoustic feature amount 802 with a length of 184 and 128 dimensions into an acoustic feature amount 804 with a length of 96 and 256 dimensions.
  • The U-Network then converts the acoustic feature amount 804 into an acoustic feature amount 806 with a length of 46 and 512 dimensions.
  • Further, the U-Network converts the acoustic feature amount 806 into an acoustic feature amount 808 with a length of 23 and 1024 dimensions.
  • In this way, local features are extracted by reducing the spatial size while increasing the feature depth.
  • U-Network restores the size and number of dimensions of acoustic features in the reverse order of the extraction of local features.
  • The U-Network also integrates information copied directly from the input side into the output side of the NN. For example, as shown in FIG. 6, the U-Network expands the acoustic feature amount 808 with a length of 23 and 1024 dimensions into an acoustic feature amount 810 with a length of 46 and 512 dimensions and integrates it with the acoustic feature amount 812 copied from the acoustic feature amount 806.
  • The U-Network then expands the acoustic feature amount 810 integrated with the acoustic feature amount 812 into an acoustic feature amount 814 with a length of 96 and 256 dimensions and integrates it with the acoustic feature amount 816 copied from the acoustic feature amount 804. Further, the U-Network expands the acoustic feature amount 814 integrated with the acoustic feature amount 816 into an acoustic feature amount 818 with a length of 184 and 128 dimensions and integrates it with the acoustic feature amount 820 copied from the acoustic feature amount 802.
  • The U-Network approach described above is commonly used in NNs that learn conversions between two-dimensional images (for example, conversion from a monochrome image to a color image).
  • In the present embodiment, the method is applied to a one-dimensional sequence of acoustic feature amounts.
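  • A condensed PyTorch sketch of this kind of architecture is shown below: a 1D convolution bank with kernel sizes 1 to 8, followed by a small 1D U-shaped encoder/decoder with skip connections. The channel counts, the use of stride-2 (transposed) convolutions for down/up-sampling, and the padding scheme are assumptions made so that the example runs; the exact layer sizes of FIG. 6 are not reproduced.

```python
import torch
import torch.nn as nn

class ConvBank1d(nn.Module):
    """1D convolution bank: parallel Conv1d branches with kernel sizes 1..8,
    concatenated along the channel axis (features with different time widths)."""
    def __init__(self, in_ch, ch_per_branch=16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, ch_per_branch, kernel_size=k, padding="same")  # 'same' keeps time length (PyTorch >= 1.9)
            for k in range(1, 9)
        )

    def forward(self, x):                                  # x: (batch, in_ch, time)
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

class UNet1d(nn.Module):
    """Small 1D U-shaped network: strided convolutions halve the time length while
    increasing depth; transposed convolutions restore it, and encoder features are
    copied to the decoder as skip connections."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.bank = ConvBank1d(feat_dim)                   # -> 8 * 16 = 128 channels
        self.down1 = nn.Conv1d(128, 256, 4, stride=2, padding=1)
        self.down2 = nn.Conv1d(256, 512, 4, stride=2, padding=1)
        self.down3 = nn.Conv1d(512, 1024, 4, stride=2, padding=1)
        self.up3 = nn.ConvTranspose1d(1024, 512, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(512 + 512, 256, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose1d(256 + 256, 128, 4, stride=2, padding=1)
        self.out = nn.Conv1d(128 + 128, feat_dim, kernel_size=1)

    def forward(self, x):                                  # x: (batch, feat_dim, time), time divisible by 8
        e0 = self.bank(x)
        e1 = torch.relu(self.down1(e0))
        e2 = torch.relu(self.down2(e1))
        e3 = torch.relu(self.down3(e2))
        d2 = torch.relu(self.up3(e3))
        d1 = torch.relu(self.up2(torch.cat([d2, e2], dim=1)))   # skip connection from encoder
        d0 = torch.relu(self.up1(torch.cat([d1, e1], dim=1)))
        return self.out(torch.cat([d0, e0], dim=1))

# Example: refine an unvoiced image spectrogram with 64 dimensions and length 184
nn2 = UNet1d(feat_dim=64)
refined = nn2(torch.randn(1, 64, 184))                     # output has the same shape: (1, 64, 184)
```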
  • The amount of second learning information used for NN2 is the number of combinations of the uttered voice and the vocalization time-series echo image that the user 12 creates as learning information. For example, when the user 12 speaks 300 times to create learning information, 300 input/output combinations are created. However, 300 combinations may not be enough to train NN2. Therefore, if the amount of second learning information is not sufficient, data augmentation may be performed, as sketched below. Data augmentation can increase the amount of second learning information by perturbing the input acoustic feature amounts with random numbers while keeping the output fixed.
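  • A minimal sketch of this augmentation, assuming an additive random perturbation of the input features (the exact form of the perturbation is not specified in the disclosure):

```python
import numpy as np

def augment_pairs(input_specs, target_specs, copies=10, noise_scale=0.01, seed=0):
    """Expand (input, output) training pairs for NN2 by perturbing only the input
    acoustic features with random noise while keeping the target output fixed."""
    rng = np.random.default_rng(seed)
    aug_inputs, aug_targets = [], []
    for x, y in zip(input_specs, target_specs):
        for _ in range(copies):
            aug_inputs.append(x + noise_scale * rng.standard_normal(x.shape))
            aug_targets.append(y)              # the output stays fixed
    return aug_inputs, aug_targets
```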
  • The machine learning of NN1 and NN2 can be performed more effectively when it depends on a specific speaker, so it is desirable that the learning be speaker-dependent. Alternatively, only NN1 may be made speaker-dependent while NN2 learns the data of a plurality of speakers at once, so that combined learning is performed.
  • the recognition unit 114 has a function of performing recognition processing. For example, the recognition unit 114 accesses the storage unit 120 and performs conversion processing using NN1. Specifically, the recognition unit 114 inputs the unvoiced time series echo image acquired by the ultrasonic echo device 20 input from the communication unit 100 to the NN1. The recognition unit 114 also accesses the storage unit 120 and performs conversion processing using NN2. Specifically, the recognition unit 114 inputs the unvoiced image spectrogram output from NN1 to NN2. The recognition unit 114 also performs a voice recognition process based on the high-accuracy unvoiced image spectrogram output from the NN2. Then, the recognition unit 114 outputs the result of the voice recognition process to the process control unit 116.
  • the recognition unit 114 may perform voice recognition processing using only NN1.
  • the recognition unit 114 may access the storage unit 120, perform conversion processing using NN1, and perform voice recognition processing based on the unvoiced image spectrogram output from NN1.
  • the high-accuracy unvoiced image spectrogram output from NN2 is more accurate than the unvoiced image spectrogram output from NN1. Therefore, the recognition unit 114 can perform the voice recognition process with higher accuracy by performing the voice recognition process using not only the NN1 but also the NN2.
  • The processing control unit 116 has a function of controlling the processing in the control unit 110. For example, the process control unit 116 determines the process to be executed based on the result of the voice recognition process by the recognition unit 114. Specifically, when the result of the voice recognition process indicates that the user 12 has specified a process to be performed by the control unit 110, the process control unit 116 executes the process specified by the user 12. Further, when the result of the voice recognition process indicates that the user 12 has asked a question, the process control unit 116 executes a process of answering the question.
  • When the process executed by the process control unit 116 is a process of outputting a voice to the user, the process control unit 116 transmits the voice to the voice input / output device 30 worn by the user and causes the voice input / output device 30 to output it.
  • the voiceless speech system 1000 according to the present embodiment can perform voice communication with the user 12 without leaking voice to the outside.
  • the storage unit 120 has a function of storing data regarding processing in the mobile terminal 10.
  • the storage unit 120 stores a first neural network 122 and a second neural network 124, which are algorithms generated by machine learning in the control unit 110.
  • the control unit 110 accesses the storage unit 120 and uses the first neural network 122 when converting the unvoiced time-series echo image into the unvoiced image spectrogram.
  • the control unit 110 accesses the storage unit 120 and uses the second neural network 124 when converting the unvoiced image spectrogram into a high-accuracy unvoiced image spectrogram.
  • the storage unit 120 may store learning information used by the control unit 110 for machine learning.
  • the data stored in the storage unit 120 is not limited to the above example.
  • the storage unit 120 may store programs such as various applications.
  • the ultrasonic echo device 20 includes a communication unit 200, a control unit 210, and an echo acquisition unit 220.
  • the communication unit 200 has a function of communicating with an external device. For example, the communication unit 200 outputs the information received from the external device to the control unit 210 in the communication with the external device. Specifically, the communication unit 200 outputs information regarding the acquisition of the echo image received from the mobile terminal 10 to the control unit 210.
  • the communication unit 200 also transmits information input from the control unit 210 to an external device in communication with the external device. Specifically, the communication unit 200 transmits the echo image input from the control unit 210 to the mobile terminal 10.
  • Control unit 210 has a function of controlling the overall operation of the ultrasonic echo device 20. For example, the control unit 210 controls the echo image acquisition processing by the echo acquisition unit 220. Further, the control unit 210 controls the process of transmitting the echo image acquired by the echo acquisition unit 220 to the mobile terminal 10 by the communication unit 200.
  • Echo acquisition unit 220 has a function of acquiring an echo image.
  • the echo acquisition unit 220 acquires an echo image using the ultrasonic output device provided in the ultrasonic output unit 22.
  • the echo acquisition unit 220 causes the ultrasonic wave output device to output ultrasonic waves into the body of the user 12 and acquires an echo image based on the ultrasonic waves reflected by the organs inside the body of the user 12.
  • For example, the echo acquisition unit 220 can acquire an echo image showing the state of the inside of the oral cavity of the user 12 by causing the ultrasonic wave output device to output ultrasonic waves from under the chin of the user 12 toward the inside of the oral cavity.
  • the voice input / output device 30 includes a communication unit 300, a control unit 310, a voice input unit 320, and a voice output unit 330.
  • the communication unit 300 has a function of communicating with an external device. For example, the communication unit 300 outputs the information received from the external device to the control unit 310 in the communication with the external device. Specifically, the communication unit 300 outputs the audio data received from the mobile terminal 10 to the control unit 310.
  • the communication unit 300 also transmits information input from the control unit 310 to the external device in communication with the external device. Specifically, the communication unit 300 transmits the voice data input from the control unit 310 to the mobile terminal 10.
  • Control unit 310 has a function of controlling the overall operation of the voice input / output device 30. For example, the control unit 310 controls the voice acquisition process by the voice input unit 320. In addition, the control unit 310 controls the process in which the communication unit 300 transmits the voice acquired by the voice input unit 320 to the mobile terminal 10. The control unit 310 also controls the audio output processing by the audio output unit 330. For example, the control unit 310 causes the audio output unit 330 to output, as audio, the audio data that the communication unit 300 received from the mobile terminal 10.
  • the voice input unit 320 has a function of acquiring a voice generated outside.
  • the voice input unit 320 acquires, for example, a voiced voice that is a voice when the user 12 speaks. Then, the voice input unit 320 outputs the acquired vocalized voice to the control unit 310.
  • the voice input unit 320 can be realized by, for example, a microphone.
  • Audio output unit 330 has a function of outputting a voice received from an external device.
  • the voice output unit 330 receives, for example, voice data generated based on the result of the voice recognition process in the mobile terminal 10 from the control unit 310, and outputs a voice corresponding to the input voice data.
  • the audio output unit 330 can be realized by, for example, a speaker.
  • FIG. 7 is a flowchart showing the flow of machine learning for acquiring the first neural network according to the present embodiment.
  • the mobile terminal 10 acquires the vocalization time series echo image from the ultrasonic echo device 20 as learning information (S100). Further, the mobile terminal 10 acquires the vocalized voice as the learning information from the voice input / output device 30 (S102). Next, the mobile terminal 10 performs machine learning using the acquired learning information (S104). Then, the mobile terminal 10 sets the algorithm generated by the machine learning to NN1 (S106).
  • FIG. 8 is a flowchart showing the flow of machine learning for acquiring the second neural network according to the present embodiment.
  • the mobile terminal 10 inputs the utterance time series echo image to the NN1 (S200).
  • the mobile terminal 10 acquires the vocal image spectrogram output from the NN1 as learning information (S202).
  • the mobile terminal 10 acquires a uttered voice spectrogram from the uttered voice as learning information (S204).
  • the mobile terminal 10 performs machine learning using the acquired learning information (S206).
  • the mobile terminal 10 sets the algorithm generated by the machine learning to NN2 (S208).
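  • The two training flows can be summarized in code as below. This is a hedged sketch: nn1, nn2, and the data containers are hypothetical placeholders standing in for the machine learning unit 112, and the mean-squared-error objective for NN2 follows the loss described earlier; it is not the actual implementation.

```python
import torch
import torch.nn as nn

def train_nn1(nn1, echo_windows, voiced_features, epochs=10, lr=1e-3):
    """S100-S106: learn to map vocalization time-series echo images to the
    acoustic features of the simultaneously recorded uttered voice."""
    opt = torch.optim.Adam(nn1.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in zip(echo_windows, voiced_features):
            opt.zero_grad()
            loss_fn(nn1(x), y).backward()
            opt.step()
    return nn1

def train_nn2(nn1, nn2, echo_sequences, voiced_spectrograms, epochs=10, lr=1e-3):
    """S200-S208: feed vocalization echo images through the trained NN1 to obtain
    vocalization image spectrograms (S200-S202), then learn to map them to the
    uttered voice spectrograms (S204-S206)."""
    opt = torch.optim.Adam(nn2.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for frames, target in zip(echo_sequences, voiced_spectrograms):
            with torch.no_grad():
                vocal_image_spec = nn1(frames)     # use the already-trained NN1
            opt.zero_grad()
            loss_fn(nn2(vocal_image_spec), target).backward()
            opt.step()
    return nn2
```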
  • FIG. 9 is a flowchart showing a flow of processing in the mobile terminal according to the present embodiment.
  • the mobile terminal 10 acquires an unvoiced time series echo image (S300).
  • the mobile terminal 10 inputs the acquired unvoiced time-series echo image to the NN1 and generates a plurality of audio feature amounts from the unvoiced time-series echo image (S302).
  • the mobile terminal 10 synthesizes the plurality of generated voice feature amounts in time series to generate an unvoiced image spectrogram (S304).
  • After generating the unvoiced image spectrogram from the unvoiced time-series echo image, the mobile terminal 10 inputs the generated unvoiced image spectrogram into NN2 and converts it into a high-precision unvoiced image spectrogram (S306). After the conversion, the recognition unit 114 of the mobile terminal 10 recognizes the content indicated by the high-accuracy unvoiced image spectrogram (S308). Then, the mobile terminal 10 executes a process based on the content recognized by the recognition unit 114 (S310).
  • In the above example, the high-precision unvoiced image spectrogram converted by NN2 is output to the recognition unit 114 of the mobile terminal 10, but the present technology is not limited to this.
  • For example, the high-precision unvoiced image spectrogram may be converted into a voice waveform and then output as voice from a voice output device such as a speaker.
  • Thereby, the user 12 can control an information device with a voice input function, such as a smart speaker, via the voice output device.
  • the high-accuracy unvoiced image spectrogram may be output to an external voice recognition device instead of the recognition unit 114 of the mobile terminal 10.
  • the high-accuracy unvoiced image spectrogram may be input to the voice recognition unit of the smart speaker via communication. Accordingly, the user 12 can control the information device with the voice input function such as the smart speaker without causing the mobile terminal 10 to emit the sound wave into the air.
  • the voiceless speech system 1000 can be applied to, for example, training for a speaker to move his mouth or tongue without speaking.
  • the voiceless speech system 1000 visually feeds back the content recognized from the voiceless time series echo image acquired by the ultrasonic echo device 20 to the speaker.
  • the speaker can improve the way of moving the mouth and tongue based on the feedback.
  • For example, the voiceless utterance system 1000 displays the unvoiced time-series echo images on a display device or the like, so that the speaker can check the displayed images and learn how to move the mouth and tongue.
  • In addition, the speaker can learn how the movements of his or her mouth and tongue are recognized by the voiceless utterance system 1000.
  • the content recognized by the voiceless speech system 1000 may be fed back as text.
  • the voiceless speech system 1000 according to the present embodiment can be applied as a vocalization assisting device for a person with a defective vocal cord or a deaf person.
  • For example, there is a technique in which a button-controlled vibrator is pressed against the pharynx to substitute for the vocal cords of a person who has lost vocal cord function. With this technique, a person who has lost the function of the vocal cords can utter a voice without vibrating the vocal cords.
  • However, since the vibrator emits a loud sound, the sound of the utterance that has passed through the oral cavity may be disturbed.
  • In addition, it is difficult for the speaker to adjust the volume of that sound, and it may be unpleasant for the speaker.
  • In the voiceless speech system 1000 according to the present embodiment, by contrast, the information acquired by the ultrasonic echo is converted into acoustic information and the acoustic information is output as a voice waveform, so no unpleasant sound is produced.
  • the speaker can also adjust the volume of the voice generated from the voiceless speech system 1000. Therefore, even a person who has lost the function of the vocal cords can more comfortably use the voiceless speech system 1000 according to the present embodiment.
  • For example, the voiceless speech system 1000 recognizes the state of the oral cavity of a person with defective vocal cords and outputs the recognized content as a voice from a speaker, so that even a person with defective vocal cords can communicate by voice.
  • The voiceless speech system 1000 according to the present embodiment is also effective for people who do not have sufficient lung capacity to vibrate the vocal cords, such as elderly people, and not only for people with vocal cord defects. For example, an elderly person who cannot produce a sufficient volume of voice may find conversation difficult, but with the voiceless speech system 1000 such a person can regain speaking ability and converse more easily.
  • the deaf person can confirm how he / she is making a voice. Further, with the voiceless speech system 1000, the state of the oral cavity can be confirmed, so that the deaf person can practice speaking while confirming how to move the mouth and tongue.
  • the voiceless speech system 1000 according to the present embodiment can be applied to expand the functions of a hearing aid. By installing the voiceless speech system 1000 in a hearing aid, the convenience of the user of the hearing aid can be improved.
  • FIG. 10 is a block diagram showing a hardware configuration example of the information processing apparatus according to the present embodiment.
  • the information processing apparatus 900 shown in FIG. 10 can realize, for example, the mobile terminal 10, the ultrasonic echo device 20, and the voice input / output device 30 shown in FIGS. 1 and 4, respectively.
  • Information processing by the mobile terminal 10, the ultrasonic echo device 20, and the voice input / output device 30 according to the present embodiment is realized by cooperation of software and hardware described below.
  • the information processing apparatus 900 includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, and a RAM (Random Access Memory) 903.
  • the information processing apparatus 900 also includes a host bus 904a, a bridge 904, an external bus 904b, an interface 905, an input device 906, an output device 907, a storage device 908, a drive 909, a connection port 910, and a communication device 911.
  • the hardware configuration shown here is an example, and some of the components may be omitted.
  • the hardware configuration may further include components other than the components shown here.
  • the CPU 901 functions as, for example, an arithmetic processing device or a control device, and controls the overall operation of each component or a part thereof based on various programs recorded in the ROM 902, the RAM 903, or the storage device 908.
  • the ROM 902 is a means for storing a program read by the CPU 901, data used for calculation, and the like.
  • the RAM 903 temporarily or permanently stores, for example, a program read by the CPU 901 and various parameters that change appropriately when the program is executed. These are connected to each other by a host bus 904a including a CPU bus and the like.
  • the CPU 901, the ROM 902, and the RAM 903 can realize the functions of the control unit 110, the control unit 210, and the control unit 310 described with reference to FIG. 4, for example, in cooperation with software.
  • the CPU 901, the ROM 902, and the RAM 903 are mutually connected, for example, via a host bus 904a capable of high-speed data transmission.
  • the host bus 904a is connected to the external bus 904b having a relatively low data transmission rate via the bridge 904, for example.
  • the external bus 904b is also connected to various components via the interface 905.
  • The input device 906 is realized by a device to which information is input by the user, such as a mouse, a keyboard, a touch panel, a button, a microphone, a switch, or a lever. Further, the input device 906 may be, for example, a remote control device that uses infrared rays or other radio waves, or an externally connected device such as a mobile phone or PDA that supports operation of the information processing device 900. The input device 906 may also include, for example, an input control circuit that generates an input signal based on the information input by the user using the above-described input means and outputs the input signal to the CPU 901. By operating the input device 906, the user of the information processing device 900 can input various data to the information processing device 900 and instruct processing operations.
  • the input device 906 may be formed by a device that detects information about the user.
  • More specifically, the input device 906 may include an image sensor (for example, a camera), a depth sensor (for example, a stereo camera), an acceleration sensor, a gyro sensor, a geomagnetic sensor, an optical sensor, a sound sensor, a distance-measuring sensor (for example, a ToF (Time of Flight) sensor), a force sensor, and the like.
  • The input device 906 may acquire information about the state of the information processing device 900 itself, such as its posture and moving speed, and information about the surrounding environment of the information processing device 900, such as the brightness and noise around it.
  • The input device 906 may also include a GNSS module that receives a GNSS (Global Navigation Satellite System) signal from a GNSS satellite (for example, a GPS signal from a GPS (Global Positioning System) satellite) and measures position information including the latitude, longitude, and altitude of the device. As for position information, the input device 906 may detect the position by transmission and reception via Wi-Fi (registered trademark), a mobile phone, PHS, or smartphone, or by short-range communication or the like. The input device 906 can realize, for example, the functions of the echo acquisition unit 220 and the voice input unit 320 described with reference to FIG. 4.
  • the output device 907 is formed of a device capable of visually or audibly notifying the user of the acquired information.
  • Such devices include display devices such as CRT display devices, liquid crystal display devices, plasma display devices, EL display devices, laser projectors, LED projectors, and lamps; audio output devices such as speakers and headphones; and printer devices.
  • the output device 907 outputs results obtained by various processes performed by the information processing device 900, for example.
  • the display device visually displays the results obtained by the various processes performed by the information processing device 900 in various formats such as text, images, tables, and graphs.
  • the audio output device converts an audio signal composed of reproduced audio data, acoustic data, and the like into an analog signal and outputs it audibly.
  • the output device 907 can realize the function of the audio output unit 330 described with reference to FIG. 4, for example.
  • the storage device 908 is a device for data storage formed as an example of a storage unit of the information processing device 900.
  • the storage device 908 is realized by, for example, a magnetic storage device such as an HDD, a semiconductor storage device, an optical storage device, a magneto-optical storage device, or the like.
  • the storage device 908 may include a storage medium, a recording device that records data in the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded in the storage medium, and the like.
  • the storage device 908 stores programs executed by the CPU 901, various data, various data acquired from the outside, and the like.
  • the storage device 908 can realize the function of the storage unit 120 described with reference to FIG. 4, for example.
  • the drive 909 is a reader / writer for a storage medium, and is built in or externally attached to the information processing device 900.
  • the drive 909 reads out information recorded on a removable storage medium such as a mounted magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, and outputs it to the RAM 903.
  • the drive 909 can also write information in a removable storage medium.
  • The connection port 910 is a port for connecting an external device, such as a USB (Universal Serial Bus) port, an IEEE 1394 port, a SCSI (Small Computer System Interface) port, an RS-232C port, or an optical audio terminal.
  • The communication device 911 is, for example, a communication interface formed of a communication device or the like for connecting to the network 920.
  • The communication device 911 is, for example, a communication card for a wired or wireless LAN (Local Area Network), LTE (Long Term Evolution), Bluetooth (registered trademark), or WUSB (Wireless USB).
  • The communication device 911 may be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), a modem for various kinds of communication, or the like.
  • The communication device 911 can send and receive signals and the like to and from the Internet and other communication devices, for example, in accordance with a predetermined protocol such as TCP/IP.
  • The communication device 911 can realize, for example, the functions of the communication unit 100, the communication unit 200, and the communication unit 300 described with reference to FIG. 4.
  • The network 920 is a wired or wireless transmission path for information transmitted from a device connected to the network 920.
  • The network 920 may include a public line network such as the Internet, a telephone line network, or a satellite communication network, various LANs (Local Area Networks) including Ethernet (registered trademark), a WAN (Wide Area Network), and the like.
  • The network 920 may include a dedicated line network such as an IP-VPN (Internet Protocol-Virtual Private Network).
  • Each component described above may be realized using a general-purpose member or by hardware specialized for the function of that component. The hardware configuration to be used can therefore be changed as appropriate according to the technical level at the time of implementing the present embodiment.
  • As described above, the mobile terminal 10 converts a plurality of time-series images showing the intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state, based on an algorithm acquired by machine learning. The mobile terminal 10 can thereby convert images showing the intraoral state when the user intentionally moves at least one of the mouth and the tongue, without uttering a voice, into acoustic information.
  • Each device described in this specification may be realized as a single device, or part or all of the devices may be realized as separate devices.
  • The mobile terminal 10, the ultrasonic echo device 20, and the voice input/output device 30 illustrated in FIG. 1 may each be realized as independent devices.
  • The mobile terminal 10 may be realized as a server device connected to the ultrasonic echo device 20 and the voice input/output device 30 via a network or the like.
  • The function of the control unit 110 included in the mobile terminal 10 may be included in a server device connected via a network or the like.
  • Each device described in this specification may be realized using software, hardware, or a combination of software and hardware.
  • The programs forming the software are stored in advance, for example, in a recording medium (a non-transitory medium) provided inside or outside each device. Each program is then read into the RAM when executed by a computer and executed by a processor such as a CPU.
  • The effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technique according to the present disclosure may have other effects that are apparent to those skilled in the art from the description of this specification, in addition to or instead of the above effects.
  • (1) An information processing apparatus comprising a control unit that converts a plurality of time-series images showing an intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
  • (2) The information processing apparatus according to (1), wherein the algorithm has a first neural network, and the control unit converts a plurality of input time-series images at the time of non-utterance into first acoustic information via the first neural network.
  • (3) The information processing apparatus according to (2), wherein the first neural network generates a plurality of acoustic feature amounts per unit time from the plurality of input time-series images at the time of non-utterance, and generates the first acoustic information by synthesizing the generated plurality of acoustic feature amounts in time-series order.
  • (4) The information processing apparatus according to (3), wherein the first neural network selects a time-series image at the central time of the unit time from the plurality of time-series images at the time of non-utterance acquired in the unit time, and generates the acoustic feature amount per unit time from the selected time-series image.
  • (5) The information processing apparatus according to any one of (2) to (4), wherein the first neural network is obtained by the machine learning using first learning information including a voice at the time of utterance and a plurality of time-series images at the time of utterance.
  • (6) The information processing apparatus according to any one of (2) to (5), wherein the algorithm further has a second neural network, and the control unit converts the first acoustic information into second acoustic information corresponding to a voice at the time of utterance via the second neural network.
  • (7) The information processing apparatus according to (6), wherein the second neural network is obtained by the machine learning using second learning information including third acoustic information generated by inputting the plurality of time-series images at the time of utterance to the first neural network and fourth acoustic information corresponding to the voice at the time of utterance.
  • (8) The information processing apparatus according to any one of (2) to (7), wherein the acoustic information is a spectrogram.
  • (9) The information processing apparatus according to any one of (1) to (8), wherein the plurality of time-series images show changes in the intraoral state when the user moves at least one of the mouth and the tongue without uttering.
  • (10) The information processing apparatus according to any one of (1) to (9), wherein the machine learning is performed by deep learning.
  • (11) The information processing apparatus according to any one of (1) to (10), wherein the machine learning is performed using a convolutional neural network.
  • (12) An information processing method executed by a processor, the method comprising converting a plurality of time-series images showing an intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state based on an algorithm acquired by machine learning.

Abstract

Provided are an information processing device and an information processing method which make it possible to obtain acoustic information intended by a user without uttering a voice. The information processing device is provided with a control unit (110) which converts a plurality of time-series images, acquired by ultrasonic echo, indicating the state of the inside of an oral cavity into information corresponding to the state of the inside of the oral cavity on the basis of an algorithm acquired by machine learning.

Description

Information processing apparatus and information processing method
 The present disclosure relates to an information processing device and an information processing method.
 In recent years, with the improvement of voice recognition accuracy, devices that can be controlled by voice commands have become widespread. For example, in smartphones, car navigation devices, and the like, it has become common to use search functions via voice commands. It has also become possible to create documents by transcribing content input by voice. Furthermore, speaker-type voice interface devices called smart speakers, which operate in response to voice commands, have become widespread.
 However, the situations in which voice commands can be used may be limited. For example, operating a smartphone or the like by voice in a public space, such as on a train or in a library, is difficult for the people around to accept. In addition, speaking confidential information such as personal information aloud in a public space poses a risk of information leakage. Therefore, voice interfaces using voice commands tend to be limited to use in places where the influence of utterances on the surroundings is clear, such as smart speakers used at home and car navigation devices used in vehicles.
 For example, if the above devices could be operated without actually producing a voice, they could be used regardless of location. Specifically, with a wearable computer having a function that allows devices to be operated without producing a voice, a user who always wears the wearable computer could obtain services at all times regardless of the place. Therefore, research is under way on recognition technology for unvoiced utterances, which enables voice recognition without producing a voice.
 In relation to the above-mentioned recognition technology for unvoiced utterances, for example, Patent Document 1 below discloses a technology for identifying speech by detecting the movement and location of the speech organs with electromagnetic waves. In addition to the technique disclosed in Patent Document 1, research is also under way on pharyngeal microphones and microphones attached to the throat for reliably acquiring speech in noisy environments.
Patent Document 1: Japanese Translation of PCT Application Publication No. 2000-504848
 However, the above-mentioned recognition technology for unvoiced utterances still requires the user to produce at least a whisper, so its use in public spaces remains limited. In addition, if the whispering volume is reduced to come closer to unvoiced speech, the recognition accuracy may decrease.
 Therefore, the present disclosure proposes a new and improved information processing apparatus and information processing method that allow a user to obtain intended acoustic information without vocalization.
 According to the present disclosure, there is provided an information processing apparatus including a control unit that converts a plurality of time-series images showing an intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
 Further, according to the present disclosure, there is provided an information processing method executed by a processor, the method including converting a plurality of time-series images showing an intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
 As described above, according to the present disclosure, it is possible to obtain intended acoustic information without the user speaking. Note that the above effects are not necessarily limiting, and, in addition to or in place of the above effects, any of the effects described in this specification or other effects that can be understood from this specification may be exhibited.
FIG. 1 is a diagram showing a configuration example of a voiceless speech system according to an embodiment of the present disclosure. FIG. 2 is a diagram showing an echo image according to the embodiment. FIG. 3 is a diagram showing an outline of the functions of the voiceless speech system according to the embodiment. FIG. 4 is a block diagram showing a functional configuration example of the voiceless speech system according to the embodiment. FIG. 5 is a diagram showing an example of generation of acoustic feature amounts according to the embodiment. FIG. 6 is a diagram showing the structure of a second neural network according to the embodiment. FIG. 7 is a flowchart showing the flow of machine learning for acquiring a first neural network according to the embodiment. FIG. 8 is a flowchart showing the flow of machine learning for acquiring a second neural network according to the embodiment. FIG. 9 is a flowchart showing the flow of processing in the mobile terminal according to the embodiment. FIG. 10 is a block diagram showing a hardware configuration example of an information processing apparatus according to the embodiment.
 Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In the present specification and the drawings, constituent elements having substantially the same functional configuration are designated by the same reference numerals, and duplicate description is omitted.
 The description will be given in the following order.
 1. Embodiment of the present disclosure
  1.1. Overview
  1.2. Configuration of the voiceless speech system
  1.3. Functions of the voiceless speech system
  1.4. Processing of the voiceless speech system
 2. Modification
 3. Application example
 4. Hardware configuration example
 5. Summary
<<1. Embodiment of the present disclosure>>
<1.1. Overview>
 In recent years, with the improvement of voice recognition accuracy, devices that can be controlled by voice commands have become widespread. For example, in smartphones, car navigation devices, and the like, it has become common to use search functions via voice commands. It has also become possible to create documents by transcribing content input by voice. Furthermore, speaker-type voice interface devices called smart speakers, which operate in response to voice commands, have become widespread.
 However, the situations in which voice commands can be used may be limited. For example, operating a smartphone or the like by voice in a public space, such as on a train or in a library, is difficult for the people around to accept. In addition, speaking confidential information such as personal information aloud in a public space poses a risk of information leakage. Therefore, voice interfaces using voice commands tend to be limited to use in places where the influence of utterances on the surroundings is clear, such as smart speakers used at home and car navigation devices used in vehicles.
 For example, if the above devices could be operated without actually producing a voice, they could be used regardless of location. Specifically, with a wearable computer having a function that allows devices to be operated without producing a voice, a user who always wears the wearable computer could obtain services at all times regardless of the place. Therefore, research is under way on recognition technology for unvoiced utterances, which enables voice recognition without producing a voice.
 In relation to the recognition technology for unvoiced utterances described above, for example, a technology for identifying speech by detecting the movement and location of the speech organs with electromagnetic waves has been disclosed. In addition, research is also under way on pharyngeal microphones and microphones attached to the throat for reliably acquiring speech in noisy environments.
 However, the recognition technology for unvoiced utterances mentioned above still requires the user to produce at least a whisper, so its use in public spaces remains limited. In addition, if the whispering volume is reduced to come closer to unvoiced speech, the recognition accuracy may decrease.
 The embodiment of the present disclosure has been conceived in view of the above points and proposes a technology that enables a user to obtain intended acoustic information without uttering a voice. Hereinafter, the present embodiment will be described in detail.
<1.2. Configuration of the voiceless speech system>
 First, the configuration of the voiceless speech system according to the embodiment of the present disclosure will be described. FIG. 1 is a diagram illustrating a configuration example of the voiceless speech system according to the embodiment of the present disclosure. As illustrated in FIG. 1, the voiceless speech system 1000 according to the present embodiment includes a mobile terminal 10, an ultrasonic echo device 20, and a voice input/output device 30. Various devices may be connected to the mobile terminal 10. For example, the ultrasonic echo device 20 and the voice input/output device 30 are connected to the mobile terminal 10, and information is exchanged between the devices. In the present embodiment, the ultrasonic echo device 20 and the voice input/output device 30 are wirelessly connected to the mobile terminal 10. For example, the mobile terminal 10 performs short-range wireless communication with the ultrasonic echo device 20 and the voice input/output device 30 using Bluetooth (registered trademark). The ultrasonic echo device 20 and the voice input/output device 30 may instead be connected to the mobile terminal 10 by wire or via a network.
(1) Mobile terminal 10
 The mobile terminal 10 is an information processing device capable of recognition processing based on machine learning. The recognition processing according to the present embodiment is, for example, voice recognition processing. The voice recognition processing is performed, for example, on information about voice generated from images (still images or moving images). Specifically, the mobile terminal 10 converts images (hereinafter also referred to as echo images) showing the state of the inside of the oral cavity of the user 12 into information regarding voice, and performs voice recognition processing on the converted information regarding voice.
 In the present embodiment, a plurality of time-series images showing the time-series change of the intraoral state when the user 12 changes the intraoral state without uttering a voice is converted into information regarding voice. The mobile terminal 10 according to the present embodiment can thereby realize voice recognition without vocalization. The plurality of time-series images are echo images showing changes in the intraoral state when the user moves at least one of the mouth and the tongue without uttering. Hereinafter, the plurality of time-series images showing the time-series change of the intraoral state of the user 12 are also referred to as time-series echo images.
 Information regarding voice is, for example, information that can be recognized by a voice recognition device (hereinafter also referred to as acoustic information). The acoustic information is, for example, a spectrogram that three-dimensionally represents, by frequency, amplitude, and time, the time-series change of voice characteristics such as pitch and intensity.
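 For readers unfamiliar with this kind of acoustic information, the following minimal Python sketch (not part of the disclosed embodiment) computes a magnitude spectrogram from a recorded waveform with the short-time Fourier transform using librosa; the file path, sampling rate, FFT size, and hop length are arbitrary assumptions made only for illustration.

```python
import numpy as np
import librosa

# Load a recorded utterance (path and sampling rate are assumptions for illustration).
waveform, sr = librosa.load("uttered_voice.wav", sr=16000)

# Short-time Fourier transform: rows are frequency bins, columns are time frames,
# and the magnitude encodes how strongly each frequency is present at each moment.
stft = librosa.stft(waveform, n_fft=512, hop_length=128)
spectrogram = np.abs(stft)

print(spectrogram.shape)  # (frequency bins, time frames)
```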
 Information regarding voice is converted from images using an algorithm acquired by machine learning. The machine learning according to the present embodiment is performed by, for example, deep learning. The algorithm acquired by the machine learning is, for example, a neural network (NN). Since images are used as the input for the machine learning, the machine learning is performed using a convolutional neural network (CNN), which is suitable for deep learning on images. In the present embodiment, time-series echo images captured while the user 12 utters a voice are used for the machine learning.
 There are two algorithms (neural networks) in the present embodiment. The first algorithm is a first neural network (hereinafter also referred to as NN1) that converts the time-series echo images captured when the user 12 changes the intraoral state without producing a voice into acoustic information (first acoustic information). The second algorithm is a second neural network (hereinafter also referred to as NN2) that converts the acoustic information produced by NN1 into more accurate acoustic information (second acoustic information). The more accurate acoustic information is, for example, acoustic information corresponding to the uttered voice, that is, the voice produced when the user 12 actually speaks. Details of NN1 and NN2 will be described later.
 As described above, there are two types of time-series echo images in the present embodiment: time-series echo images converted into acoustic information by NN1, and time-series echo images used for machine learning. The time-series echo images converted into acoustic information are intraoral time-series echo images captured when the user 12 changes the intraoral state without producing a voice, and are hereinafter referred to as unvoiced time-series echo images. The time-series echo images used for machine learning are intraoral time-series echo images captured while the user 12 utters, and are hereinafter referred to as voiced time-series echo images.
 Also as described above, there are several kinds of acoustic information in the present embodiment. The acoustic information converted by NN1 (first acoustic information) is a spectrogram converted from unvoiced time-series echo images and is hereinafter referred to as an unvoiced image spectrogram. The acoustic information converted by NN2 (second acoustic information) is a more accurate spectrogram converted from the unvoiced image spectrogram and is hereinafter referred to as a high-accuracy unvoiced image spectrogram.
 Machine learning is performed for each of NN1 and NN2, but the learning information used for each differs. The learning information used for the machine learning of NN1 (first learning information) consists of voiced time-series echo images and the corresponding uttered voice. The learning information used for the machine learning of NN2 (second learning information) consists of acoustic information obtained by converting the voiced time-series echo images via NN1 (third acoustic information) and acoustic information corresponding to the uttered voice (fourth acoustic information).
 The acoustic information obtained by converting the voiced time-series echo images via NN1 (third acoustic information) is hereinafter referred to as a voiced image spectrogram. The acoustic information corresponding to the uttered voice (fourth acoustic information) is a spectrogram corresponding to the voice actually produced by the user 12 and is hereinafter referred to as an uttered voice spectrogram. The voiced image spectrogram (third acoustic information) is used as the input for the machine learning of NN2, and the uttered voice spectrogram (fourth acoustic information) is used as the output.
 The mobile terminal 10 also has a function of controlling the overall operation of the voiceless speech system 1000. For example, the mobile terminal 10 controls the overall operation of the voiceless speech system 1000 based on the information exchanged between the devices. Specifically, based on information received from the ultrasonic echo device 20 and the voice input/output device 30, the mobile terminal 10 controls the processing related to voice recognition in the mobile terminal 10 and the operation of the voice input/output device 30. The mobile terminal 10 may also control the operation of the ultrasonic echo device 20.
 The mobile terminal 10 is realized by, for example, a smartphone, as shown in FIG. 1. However, the mobile terminal 10 is not limited to a smartphone. For example, the mobile terminal 10 may be a terminal device such as a tablet terminal, a PC, a wearable terminal, or an agent device in which the functions of the mobile terminal 10 are implemented as an application. That is, the mobile terminal 10 can be realized as an arbitrary terminal device.
(2) Ultrasonic echo device 20
 The ultrasonic echo device 20 is a device that acquires echo images of the inside of the oral cavity of the user 12. The ultrasonic echo device 20 acquires echo images using ultrasonic examination techniques widely used in medicine. The ultrasonic echo device 20 includes an ultrasonic output device capable of outputting ultrasonic waves; the ultrasonic output device, attached to the body surface of the user 12, outputs ultrasonic waves into the body of the user 12, and echo images are acquired based on the ultrasonic waves reflected by the organs inside the body. The ultrasonic echo device 20 then transmits the acquired echo images to the mobile terminal 10.
 The ultrasonic echo device 20 according to the present embodiment is realized as, for example, a neckband-type device as shown in FIG. 1. The ultrasonic output units 22 of the ultrasonic echo device 20 are each provided with an ultrasonic output device. Because of its structure as a neckband-type device, the ultrasonic echo device 20 shown in FIG. 1 includes two ultrasonic output units 22, namely the ultrasonic output units 22a and 22b. The number of ultrasonic output units 22 is not limited to two, and the ultrasonic echo device 20 only needs to include at least one ultrasonic output unit 22.
 When the user 12 wears the ultrasonic echo device 20 so that the ultrasonic output units 22 are located under the chin, the ultrasonic waves are output toward the oral cavity of the user 12. The ultrasonic echo device 20 can thereby acquire echo images of the inside of the oral cavity of the user 12. Voice is produced when the vibration of the vocal cords is shaped by the tongue and the opening of the mouth. Therefore, it can be said that the echo images of the inside of the oral cavity of the user 12 acquired by the ultrasonic echo device 20 contain information that is useful for conversion into acoustic information.
 Here, the echo images acquired by the ultrasonic echo device 20 will be described. FIG. 2 is a diagram showing an echo image according to the present embodiment. The echo image 40 is an echo image of the inside of the oral cavity of the user 12 acquired by the ultrasonic echo device 20. In the echo image 40 shown in FIG. 2, the tongue tip 402, the tongue surface 404, and the tongue root 406 are shown. In the present embodiment, the ultrasonic echo device 20 continuously acquires echo images 40, thereby acquiring a plurality of time-series images (time-series echo images) showing the time-series change of the intraoral state of the user 12.
(3) Voice input/output device 30
 The voice input/output device 30 is a device capable of inputting and outputting voice. The voice input/output device 30 acquires, for example, the voice uttered by the user 12 and transmits the acquired voice to the mobile terminal 10. The voice input/output device 30 also receives, for example, voice data indicating the content recognized by the mobile terminal 10 from the mobile terminal 10 and outputs the received voice data as voice.
 The voice input/output device 30 according to the present embodiment is realized by, for example, a wearable terminal. Specifically, the voice input/output device 30 is preferably a wearable terminal capable of inputting and outputting voice, such as an earphone or a bone-conduction earphone. When the voice input/output device 30 is an earphone, a bone-conduction earphone, or the like, the amount of sound leaking to the outside can be reduced.
 Furthermore, it is more desirable that the voice input/output device 30 have a structure that allows the user 12 to hear externally generated sound in addition to the sound output from the voice input/output device 30. For example, as shown in FIG. 1, the voice input/output device 30 has an opening 32. Therefore, the user 12 can hear external sound through the opening 32 even while wearing the voice input/output device 30. Accordingly, even if the user 12 always wears the voice input/output device 30 having this structure, the user can spend daily life comfortably and without hindrance. In addition, even when the voice indicating the content recognized by the mobile terminal 10 is output not from the voice input/output device 30 but from a speaker such as a smart speaker, the user 12 can hear that voice.
 In the present embodiment, an example in which the voice output function for outputting voice and the voice input function for acquiring voice are realized by a single device is described; however, the voice output function and the voice input function may be realized by separate, independent devices.
<1.3. Functions of the voiceless speech system>
 The configuration of the voiceless speech system 1000 has been described above. Next, the functions of the voiceless speech system 1000 will be described.
<1.3.1. Overview of functions>
 FIG. 3 is a diagram showing an outline of the functions of the voiceless speech system according to the present embodiment. The voiceless speech system 1000 first acquires NN1 and NN2 in advance by machine learning based on voiced time-series echo images and the uttered voice. When the user 12 changes the intraoral state without producing a voice, the ultrasonic echo device 20 acquires the unvoiced time-series echo images 42. The acquired unvoiced time-series echo images 42 are then converted into the unvoiced image spectrogram 72 via the first neural network 122 (NN1). The unvoiced image spectrogram 72 is a combination of a plurality of acoustic feature amounts 70 joined in chronological order. Details of the acoustic feature amounts 70 will be described later.
 After the conversion by NN1, the resulting unvoiced image spectrogram 72 is converted into the high-accuracy unvoiced image spectrogram 74 via the second neural network 124 (NN2). After the conversion by NN2, the resulting high-accuracy unvoiced image spectrogram 74 is input to the recognition unit 114 of the mobile terminal 10. The recognition unit 114 then performs voice recognition processing based on the input high-accuracy unvoiced image spectrogram 74.
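 Although the embodiment does not disclose source code, the overall flow described above can be pictured with the following minimal Python sketch; the callables nn1, nn2, and recognize are hypothetical stand-ins for the trained first neural network, the trained second neural network, and the voice recognition processing, and the array shapes are assumptions.

```python
import numpy as np

def silent_speech_pipeline(echo_frames, nn1, nn2, recognize):
    """echo_frames: unvoiced time-series echo images, shape (num_frames, height, width).

    nn1, nn2, and recognize are assumed callables standing in for the trained
    networks and the voice recognition process; they are not the disclosed code.
    """
    unvoiced_spectrogram = nn1(echo_frames)          # NN1: echo images -> unvoiced image spectrogram
    refined_spectrogram = nn2(unvoiced_spectrogram)  # NN2: refine toward the uttered-voice spectrogram
    return recognize(refined_spectrogram)            # voice recognition on the refined spectrogram
```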
<1.3.2. Functional configuration example>
 FIG. 4 is a block diagram showing a functional configuration example of the voiceless speech system according to the present embodiment.
(1) Mobile terminal 10
 As shown in FIG. 4, the mobile terminal 10 includes a communication unit 100, a control unit 110, and a storage unit 120. The information processing apparatus according to the present embodiment has at least the control unit 110.
(1-1) Communication unit 100
 The communication unit 100 has a function of communicating with external devices. For example, in communication with an external device, the communication unit 100 outputs information received from the external device to the control unit 110. Specifically, the communication unit 100 outputs echo images received from the ultrasonic echo device 20 to the control unit 110, and outputs voice received from the voice input/output device 30 to the control unit 110.
 In communication with an external device, the communication unit 100 also transmits information input from the control unit 110 to the external device. Specifically, the communication unit 100 transmits information regarding the acquisition of echo images input from the control unit 110 to the ultrasonic echo device 20, and transmits information regarding the input and output of voice input from the control unit 110 to the voice input/output device 30.
(1-2) Control unit 110
 The control unit 110 has a function of controlling the operation of the mobile terminal 10. For example, the control unit 110 converts a plurality of time-series images showing the intraoral state acquired by ultrasonic echo into information corresponding to the intraoral state based on an algorithm acquired by machine learning. The algorithm has a first neural network, and the control unit 110 converts the input plurality of unvoiced time-series images into first acoustic information via the first neural network. For example, the control unit 110 inputs the unvoiced time-series echo images input from the communication unit 100 to NN1. NN1 converts the input unvoiced time-series echo images into an unvoiced image spectrogram. The control unit 110 can perform voice recognition processing by converting the spectrogram into a voice waveform. Therefore, even if the user 12 does not produce a voice, the control unit 110 can perform voice recognition processing based on the unvoiced time-series echo images and control devices that can be operated by voice.
 The algorithm further has a second neural network, and the control unit 110 converts the first acoustic information into second acoustic information corresponding to the voice at the time of utterance via the second neural network. For example, the control unit 110 inputs the unvoiced image spectrogram output from NN1 to NN2. NN2 converts the input unvoiced image spectrogram into a high-accuracy unvoiced image spectrogram corresponding to the uttered voice. As a specific example, suppose that the voice indicated by the unvoiced image spectrogram output from NN1 is "Ulay music." and the voice indicated by the corresponding high-accuracy unvoiced image spectrogram is "Play music.". In this case, when the unvoiced image spectrogram indicating the voice "Ulay music." is input to NN2, it is converted, taking the context and the like into consideration, into a high-accuracy unvoiced image spectrogram indicating the voice "Play music.". That is, NN2 plays the role of correcting the voice indicated by the unvoiced image spectrogram that NN1 converted from the unvoiced time-series echo images.
 To realize the above functions, the control unit 110 according to the present embodiment has a machine learning unit 112, a recognition unit 114, and a processing control unit 116, as shown in FIG. 4.
Machine learning unit 112
 The machine learning unit 112 has a function of performing machine learning using learning information. The machine learning unit 112 acquires, by machine learning, the algorithms for converting echo images into spectrograms. Specifically, the machine learning unit 112 acquires NN1, which is an algorithm for converting unvoiced time-series echo images into an unvoiced image spectrogram, and NN2, which is an algorithm for converting the unvoiced image spectrogram into a high-accuracy unvoiced image spectrogram.
 NN1 is obtained by machine learning using the first learning information, which includes the voice at the time of utterance and the plurality of time-series images at the time of utterance. For example, NN1 is obtained by machine learning using, as the first learning information, the voice uttered by the user 12 and the voiced time-series echo images captured while the user 12 uttered that voice. The control unit 110 can thereby convert echo images into a spectrogram via NN1.
 The first learning information is acquired, for example, by having the user 12 read a text or the like aloud. In this way, echo images showing the time-series change and the speech waveform corresponding to those echo images can be acquired. The speech waveform can be converted into acoustic feature amounts.
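 One way to picture how such first learning information could be organized is sketched below: each unit-time window of echo images recorded during the read-aloud session is paired with the acoustic feature of the simultaneously recorded speech at the window's central time. The function name, the window length, and the assumption that the audio features have already been resampled to the echo frame rate are illustrative only and are not taken from the embodiment.

```python
import numpy as np

def build_nn1_training_pairs(echo_frames, audio_features, window=13):
    """Pair echo-image windows with time-aligned acoustic features.

    echo_frames:    (num_frames, height, width) echo images recorded while reading text aloud
    audio_features: (num_frames, feature_dim) acoustic features computed from the recorded
                    speech waveform and resampled to the echo frame rate (assumed upstream)
    Returns a list of (echo_window, target_feature) training examples for NN1.
    """
    pairs = []
    for start in range(len(echo_frames) - window + 1):
        center = start + window // 2
        echo_window = echo_frames[start:start + window]
        target = audio_features[center]   # feature at the window's central time
        pairs.append((echo_window, target))
    return pairs
```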
 When the control unit 110 inputs a plurality of unvoiced time-series images to NN1, NN1 generates a plurality of acoustic feature amounts per unit time from the input unvoiced time-series images and generates the first acoustic information by synthesizing the generated acoustic feature amounts in time-series order. For example, NN1 generates a plurality of acoustic feature amounts per unit time from the unvoiced time-series echo images input by the control unit 110 and generates the unvoiced image spectrogram by combining the generated acoustic feature amounts in time-series order.
 Here, the acoustic feature amounts generated by NN1 will be described. FIG. 5 is a diagram showing an example of generation of acoustic feature amounts according to the present embodiment. NN1 selects, from the plurality of unvoiced time-series images acquired in a unit time, the time-series image at the central time of the unit time, and generates the acoustic feature amount for that unit time from the selected time-series image. For example, NN1 selects, from the unvoiced time-series echo images acquired in a unit time, the echo image at the central time of the unit time, and generates the acoustic feature amount for that unit time from the selected echo image. The unit time according to the present embodiment is, for example, the time in which 5 to 13 echo images are acquired; in the present embodiment, the unit time is the time in which 13 echo images are acquired. Specifically, as shown in FIG. 5, NN1 selects, from the unvoiced time-series echo images 42, the echo image 424 at the center of the unvoiced time-series echo images 422 acquired in a unit time, and generates the acoustic feature amount 70 from the echo image 424. NN1 then repeats the generation of the acoustic feature amount 70 while shifting the start time of the unit time, and obtains the unvoiced image spectrogram 78 by combining the plurality of generated acoustic feature amounts 70.
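 A minimal sketch of this sliding-window procedure is given below, with a hypothetical nn1_feature callable standing in for the per-window output of NN1; the window length of 13 frames follows the embodiment, and everything else is an assumption for illustration.

```python
import numpy as np

def assemble_unvoiced_spectrogram(echo_frames, nn1_feature, window=13):
    """Slide a unit-time window over unvoiced echo images, let NN1 produce one
    acoustic feature per window (conceptually tied to the window's center frame),
    and concatenate the features in time order into an unvoiced image spectrogram."""
    features = []
    for start in range(len(echo_frames) - window + 1):
        echo_window = echo_frames[start:start + window]   # 13 consecutive echo images
        features.append(nn1_feature(echo_window))         # one feature vector per unit time
    return np.stack(features, axis=0)                     # (time frames, feature dims)
```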
 With this, NN1 can learn mouth movements corresponding to the smallest units of pronunciation, such as "th". Furthermore, the recognition unit 114 can recognize the pronunciation more accurately.
 As the acoustic feature amount, a Mel-scale spectrogram, MFCC (Mel-frequency cepstral coefficients), a short-time FFT (STFT), or a representation whose dimensionality has been reduced by processing the speech waveform with a neural network (autoencoder) can be used. The technology related to the autoencoder is disclosed in a paper by Jesse Engel and six others ("Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders", URL: https://arxiv.org/abs/1704.01279). These representations can be mutually converted to and from acoustic waveforms. For example, a Mel-scale spectrogram can be reconstructed into an acoustic waveform using a technique called the Griffin-Lim algorithm. Other representations capable of dividing the speech waveform into short segments and expressing them as vectors can also be used as the acoustic feature amount. In the present embodiment, the number of dimensions of the acoustic feature amount is about 64, but the number of dimensions may be changed according to the sound quality of the reproduced voice.
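 As a concrete illustration of one of these interchangeable representations, the following sketch computes a 64-band Mel-scale spectrogram with librosa and reconstructs an approximate waveform from it with the Griffin-Lim algorithm; the audio file path, sampling rate, and FFT parameters are assumptions, and this is not the embodiment's own implementation.

```python
import librosa

# Load a recorded utterance (path and sampling rate are assumptions for illustration).
waveform, sr = librosa.load("uttered_voice.wav", sr=16000)

# 64-band Mel-scale spectrogram as the acoustic feature sequence.
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=512,
                                     hop_length=128, n_mels=64)

# Approximate inversion back to a waveform: Mel -> linear spectrogram -> Griffin-Lim.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=512)
reconstructed = librosa.griffinlim(linear, hop_length=128)
```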
 NN2 is obtained by machine learning using the second learning information, which includes the third acoustic information generated by inputting the plurality of voiced time-series images to NN1 and the fourth acoustic information corresponding to the voice at the time of utterance. For example, NN2 is obtained by machine learning using, as the second learning information, the voiced image spectrogram generated by inputting the voiced time-series echo images to NN1 and the uttered voice spectrogram corresponding to the uttered voice. The control unit 110 can thereby convert the unvoiced image spectrogram output from NN1 into a more accurate spectrogram via NN2.
 NN2 may convert the unvoiced image spectrogram into a spectrogram of the same length. For example, when the unvoiced image spectrogram is a spectrogram corresponding to a command that the user 12 issues to a smart speaker or the like, NN2 converts the unvoiced image spectrogram into a spectrogram of the same length.
 A fixed value may be set for the length of the spectrogram input to NN2. When a spectrogram shorter than this fixed length is to be input to NN2, the control unit 110 may insert a silent portion of the missing length into the spectrogram before inputting it to NN2. In addition, since NN2 uses the mean squared error as its loss function, NN2 is trained so that its output matches the target spectrogram as closely as possible.
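 The padding and loss described here might look like the following PyTorch sketch; the fixed length of 184 frames and 64 feature dimensions follow the example described for NN2 below, while the nn2 model, the optimizer, and the tensor layout are placeholders rather than the disclosed implementation.

```python
import torch
import torch.nn.functional as F

FIXED_LEN, FEAT_DIM = 184, 64  # prescribed spectrogram length and feature dimensionality

def pad_to_fixed_length(spectrogram):
    """Append silent (zero) frames so a short spectrogram reaches the fixed input length."""
    missing = FIXED_LEN - spectrogram.shape[0]
    if missing > 0:
        silence = torch.zeros(missing, FEAT_DIM)
        spectrogram = torch.cat([spectrogram, silence], dim=0)
    return spectrogram

def nn2_training_step(nn2, optimizer, voiced_image_spec, uttered_voice_spec):
    """One NN2 update: mean squared error between NN2's output and the target spectrogram."""
    x = pad_to_fixed_length(voiced_image_spec)
    y = pad_to_fixed_length(uttered_voice_spec)
    optimizer.zero_grad()
    loss = F.mse_loss(nn2(x.unsqueeze(0)), y.unsqueeze(0))  # add a batch dimension
    loss.backward()
    optimizer.step()
    return loss.item()
```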
 The intent of using NN2 is to adjust the unvoiced image spectrogram generated from the unvoiced time-series echo images corresponding to a command so that it becomes closer to the uttered voice spectrogram generated from the voice when the command is actually uttered. Since NN1 takes only a prescribed number of unvoiced time-series echo images as input, it cannot capture context over a time span longer than the one corresponding to that prescribed number of images. NN2, on the other hand, can perform the conversion while taking the context of the entire command into account.
 Here, the specific structure of NN2 will be described. FIG. 6 is a diagram showing the structure of the second neural network according to the present embodiment. FIG. 6 shows an example of converting an unvoiced image spectrogram 72 with a prescribed length of 184 and 64 dimensions into a high-accuracy unvoiced image spectrogram 74 of the same length. The first-stage 1D-Convolution Bank 80 is a one-dimensional CNN composed of eight NNs whose filter sizes differ in the range of 1 to 8. By using multiple filter sizes, features of different time widths are extracted, such as phoneme-level features and word-level features. The output of this filter bank is transformed by an NN called a U-Network, which combines 1D convolution and 1D deconvolution. The U-Network can recognize global information from the information transformed by the convolution/deconvolution layers. However, because local information tends to be lost in that process, the U-Network has a structure for preserving such local information.
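 A minimal PyTorch sketch of such a convolution bank is shown below, under the assumption that the input is laid out as (batch, feature dimension, time) and that the eight branch outputs are concatenated along the channel axis; the channel counts are illustrative and are not taken from the embodiment.

```python
import torch
import torch.nn as nn

class ConvBank1D(nn.Module):
    """Eight parallel 1D convolutions with filter sizes 1..8, capturing features
    at different time widths (e.g. phoneme-like versus word-like spans)."""

    def __init__(self, in_channels=64, channels_per_branch=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, channels_per_branch, kernel_size=k, padding=k // 2)
            for k in range(1, 9)
        ])

    def forward(self, x):                             # x: (batch, in_channels, time)
        outs = [branch(x) for branch in self.branches]
        length = min(o.shape[-1] for o in outs)       # even kernels pad one extra frame; trim
        outs = [o[..., :length] for o in outs]
        return torch.cat(outs, dim=1)                 # concatenate along the channel axis
```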
 例えば、図6に示すように、U-Networkは、長さが184で次元数が128の音響特徴量802を、長さが96で次元数が256の音響特徴量804とする。次いで、U-Networkは、当該音響特徴量804を、長さが46で次元数が512の音響特徴量806とする。さらに、U-Networkは、当該音響特徴量806を、長さが23で次元数が1024の音響特徴量808とする。これにより、空間的な大きさが小さくなる代わりに、空間的な深さが深くなることで、局所的な特徴が抽出される。 For example, as shown in FIG. 6, in the U-Network, an acoustic feature amount 802 having a length of 184 and a dimensionality of 128 is set as an acoustic feature amount 804 having a length of 96 and a dimensionality of 256. Next, the U-Network sets the acoustic feature amount 804 as the acoustic feature amount 806 having a length of 46 and a dimension number of 512. Further, the U-Network sets the acoustic feature amount 806 as the acoustic feature amount 808 having a length of 23 and a dimension number of 1024. As a result, the local feature is extracted by increasing the spatial depth instead of decreasing the spatial size.
 局所的な特量の抽出後、U-Networkは、局所的な特徴の抽出時とは逆の順序で、音響特徴量のサイズと次元数を元に戻していく。この時、U-Networkは、入力をそのまま出力にコピーした情報もNNに統合する。例えば、図6に示すように、U-Networkは、長さが23で次元数が1024の音響特徴量808を、長さが46で次元数が512の音響特徴量810とし、音響特徴量806をコピーした音響特徴量812と統合する。次いで、U-Networkは、音響特徴量812と統合された音響特徴量810を、長さが96で次元数が256の音響特徴量814とし、音響特徴量804をコピーした音響特徴量816と統合する。さらに、U-Networkは、音響特徴量816と統合された音響特徴量814を、長さが184で次元数が128の音響特徴量818とし、音響特徴量802をコピーした音響特徴量820と統合する。 After extracting local features, U-Network restores the size and number of dimensions of acoustic features in the reverse order of the extraction of local features. At this time, the U-Network also integrates the information in which the input is directly copied to the output into the NN. For example, as shown in FIG. 6, in the U-Network, an acoustic feature amount 808 having a length of 23 and a dimension number of 1024 is set as an acoustic feature amount 810 of a length of 46 and a dimension number of 512, and the acoustic feature amount 806 Is integrated with the copied acoustic feature amount 812. Next, the U-Network integrates the acoustic feature amount 810 integrated with the acoustic feature amount 812 into the acoustic feature amount 814 having a length of 96 and a dimension number of 256, and integrates the acoustic feature amount 804 with the copied acoustic feature amount 816. To do. Further, the U-Network integrates the acoustic feature amount 814 integrated with the acoustic feature amount 816 into an acoustic feature amount 818 having a length 184 and a dimensionality of 128, and integrates the acoustic feature amount 802 with a copied acoustic feature amount 820. To do.
 The method using the U-Network described above is commonly used in NNs that learn two-dimensional image conversion (for example, conversion from a monochrome image to a color image); in the present embodiment, the method is applied to one-dimensional acoustic feature sequences.
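 A minimal one-dimensional sketch of such an encoder–decoder with skip connections is shown below, again assuming PyTorch. The channel progression (128 → 256 → 512 → 1024 and back) follows the description above, but the kernel sizes, strides, paddings, and the final 1×1 projection to 64 spectrogram bins are assumptions; as a result the intermediate lengths here (92/46/23) differ slightly from the 96/46/23 shown in FIG. 6.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetwork1D(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv1d(128, 256, 4, stride=2, padding=1)
        self.down2 = nn.Conv1d(256, 512, 4, stride=2, padding=1)
        self.down3 = nn.Conv1d(512, 1024, 4, stride=2, padding=1)
        self.up3 = nn.ConvTranspose1d(1024, 512, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(512 + 512, 256, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose1d(256 + 256, 128, 4, stride=2, padding=1)
        self.out = nn.Conv1d(128 + 128, 64, 1)        # assumed projection to 64 spectrogram bins

    @staticmethod
    def _match(x, ref):
        # Crop or zero-pad the upsampled sequence to the skip connection's length.
        diff = ref.size(-1) - x.size(-1)
        return F.pad(x, (0, diff)) if diff >= 0 else x[..., :ref.size(-1)]

    def forward(self, x):                              # x: (batch, 128, 184)
        e1 = F.relu(self.down1(x))                     # (batch, 256, 92)
        e2 = F.relu(self.down2(e1))                    # (batch, 512, 46)
        e3 = F.relu(self.down3(e2))                    # (batch, 1024, 23)
        u3 = torch.cat([self._match(F.relu(self.up3(e3)), e2), e2], dim=1)   # skip copy of e2
        u2 = torch.cat([self._match(F.relu(self.up2(u3)), e1), e1], dim=1)   # skip copy of e1
        u1 = torch.cat([self._match(F.relu(self.up1(u2)), x), x], dim=1)     # skip copy of input
        return self.out(u1)                            # (batch, 64, 184)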
 Note that the number of items of second learning information used for NN2 is the number of combinations of uttered voice and utterance time-series echo images that the user 12 created for learning. For example, when the user 12 speaks 300 times to create learning information, 300 input/output pairs are created. However, 300 pairs may not be enough to train NN2. In that case, data augmentation may be performed: by perturbing the input acoustic features with random numbers while keeping the output fixed, the amount of second learning information can be increased.
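 The sketch below illustrates this kind of augmentation in Python/NumPy. The publication only states that the inputs are perturbed by random numbers while the outputs are fixed; the Gaussian noise, its scale, and the number of copies per pair are assumptions.

import numpy as np

def augment_pairs(pairs, copies_per_pair=10, noise_scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    augmented = list(pairs)                       # keep the original recorded pairs
    for x, y in pairs:                            # x: NN1 output spectrogram, y: voiced spectrogram
        for _ in range(copies_per_pair):
            noisy = x + noise_scale * rng.standard_normal(x.shape)
            augmented.append((noisy, y))          # the output y is left unchanged
    return augmented

# e.g. 300 recorded pairs -> 300 * (1 + 10) = 3300 training pairs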
 Note that the machine learning for NN1 and NN2 is performed more effectively when it depends on a specific speaker, so it is desirable to train them in a speaker-dependent manner. Composite learning is also possible, for example making NN1 depend only on a specific speaker while training NN2 on the information of a plurality of speakers collectively.
・Recognition unit 114
 The recognition unit 114 has a function of performing recognition processing. For example, the recognition unit 114 accesses the storage unit 120 and performs conversion processing using NN1; specifically, it inputs into NN1 the unvoiced time-series echo images acquired by the ultrasonic echo device 20 and received via the communication unit 100. The recognition unit 114 also accesses the storage unit 120 and performs conversion processing using NN2; specifically, it inputs into NN2 the unvoiced image spectrogram output from NN1. The recognition unit 114 then performs voice recognition processing based on the high-precision unvoiced image spectrogram output from NN2, and outputs the result of the voice recognition processing to the processing control unit 116.
 Note that the recognition unit 114 may perform voice recognition processing using only NN1. For example, the recognition unit 114 may access the storage unit 120, perform conversion processing using NN1, and perform voice recognition processing based on the unvoiced image spectrogram output from NN1. In this way, the present embodiment can perform voice recognition based on the output of NN1 alone. However, the high-precision unvoiced image spectrogram output from NN2 is more accurate than the unvoiced image spectrogram output from NN1, so the recognition unit 114 can perform voice recognition with higher accuracy by using NN2 in addition to NN1.
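 The overall recognition flow might be sketched as follows. All function names are hypothetical; the publication does not define this API, and the optional NN2 stage simply reflects the two modes described above.

def recognize(echo_frames, nn1, nn2, speech_recognizer, use_nn2=True):
    # unvoiced time-series echo images -> NN1 -> (optionally NN2) -> recognizer
    spectrogram = nn1(echo_frames)            # unvoiced image spectrogram
    if use_nn2:
        spectrogram = nn2(spectrogram)        # high-precision unvoiced image spectrogram
    return speech_recognizer(spectrogram)     # result passed on to the processing control unit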
・Processing control unit 116
 The processing control unit 116 has a function of controlling the processing in the control unit 110. For example, the processing control unit 116 determines the processing to execute based on the result of the voice recognition processing by the recognition unit 114. Specifically, when the result indicates that the user 12 has specified processing to be executed by the control unit 110, the processing control unit 116 executes the specified processing; when the result indicates a question from the user 12, the processing control unit 116 executes processing that answers the question.
 When the processing executed by the processing control unit 116 is processing that outputs voice to the user, the processing control unit 116 transmits the voice to the voice input/output device 30 worn by the user and causes the voice input/output device 30 to output it. In this way, the voiceless speech system 1000 according to the present embodiment can communicate with the user 12 by voice without leaking the voice to the outside.
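 A hypothetical sketch of this dispatch is shown below; the result structure, the two handler callables, and the device interface are all assumptions introduced only to illustrate the command/question branching and the private audio route.

def handle_recognition_result(result, execute_command, answer_question, device_30):
    # Dispatch on the kind of recognized content.
    if result["kind"] == "command":
        audio = execute_command(result["content"])     # operation specified by the user
    elif result["kind"] == "question":
        audio = answer_question(result["content"])     # synthesize an answer
    else:
        return None
    if audio is not None:
        device_30.send(audio)         # played back privately by the worn voice input/output device
    return audio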
(1-3) Storage unit 120
 The storage unit 120 has a function of storing data related to the processing in the mobile terminal 10. For example, the storage unit 120 stores the first neural network 122 and the second neural network 124, which are algorithms generated by machine learning in the control unit 110. The control unit 110 accesses the storage unit 120 and uses the first neural network 122 when converting unvoiced time-series echo images into an unvoiced image spectrogram, and uses the second neural network 124 when converting the unvoiced image spectrogram into a high-precision unvoiced image spectrogram.
 The storage unit 120 may also store the learning information that the control unit 110 uses for machine learning. The data stored in the storage unit 120 is not limited to the above examples; for example, the storage unit 120 may store programs such as various applications.
(2) Ultrasonic echo device 20
 As shown in FIG. 4, the ultrasonic echo device 20 includes a communication unit 200, a control unit 210, and an echo acquisition unit 220.
(2-1) Communication unit 200
 The communication unit 200 has a function of communicating with external devices. For example, in communication with an external device, the communication unit 200 outputs information received from the external device to the control unit 210; specifically, it outputs information regarding echo image acquisition received from the mobile terminal 10 to the control unit 210.
 In communication with an external device, the communication unit 200 also transmits information input from the control unit 210 to the external device; specifically, it transmits echo images input from the control unit 210 to the mobile terminal 10.
(2-2) Control unit 210
 The control unit 210 has a function of controlling the overall operation of the ultrasonic echo device 20. For example, the control unit 210 controls the echo image acquisition processing by the echo acquisition unit 220, and controls the processing in which the communication unit 200 transmits the echo images acquired by the echo acquisition unit 220 to the mobile terminal 10.
(2-3) Echo acquisition unit 220
 The echo acquisition unit 220 has a function of acquiring echo images. For example, the echo acquisition unit 220 acquires echo images using the ultrasonic output device provided in the ultrasonic output unit 22. Specifically, the echo acquisition unit 220 causes the ultrasonic output device to emit ultrasonic waves into the body of the user 12 and acquires echo images based on the ultrasonic waves reflected by the organs inside the body of the user 12. By causing the ultrasonic output device to emit ultrasonic waves from under the chin of the user 12 toward the inside of the oral cavity, the echo acquisition unit 220 can acquire echo images showing the state of the oral cavity of the user 12.
(3) Voice input/output device 30
 As shown in FIG. 4, the voice input/output device 30 includes a communication unit 300, a control unit 310, a voice input unit 320, and a voice output unit 330.
(3-1) Communication unit 300
 The communication unit 300 has a function of communicating with external devices. For example, in communication with an external device, the communication unit 300 outputs information received from the external device to the control unit 310; specifically, it outputs voice data received from the mobile terminal 10 to the control unit 310.
 In communication with an external device, the communication unit 300 also transmits information input from the control unit 310 to the external device; specifically, it transmits voice data input from the control unit 310 to the mobile terminal 10.
(3-2) Control unit 310
 The control unit 310 has a function of controlling the overall operation of the voice input/output device 30. For example, the control unit 310 controls the voice acquisition processing by the voice input unit 320, and controls the processing in which the communication unit 300 transmits the voice acquired by the voice input unit 320 to the mobile terminal 10. The control unit 310 also controls the voice output processing by the voice output unit 330; for example, it causes the voice output unit 330 to output, as sound, the voice data that the communication unit 300 received from the mobile terminal 10.
(3-3) Voice input unit 320
 The voice input unit 320 has a function of acquiring externally generated sound. For example, the voice input unit 320 acquires the uttered voice produced when the user 12 speaks, and outputs the acquired voice to the control unit 310. The voice input unit 320 can be realized by, for example, a microphone.
(3-4) Voice output unit 330
 The voice output unit 330 has a function of outputting voice received from an external device. For example, the voice output unit 330 receives from the control unit 310 voice data generated based on the result of the voice recognition processing in the mobile terminal 10, and outputs the voice corresponding to the input voice data. The voice output unit 330 can be realized by, for example, a speaker.
 <1.4.無音声発話システムの処理>
 以上、本実施形態に係る無音声発話システム1000の機能について説明した。続いて、無音声発話システム1000の処理について説明する。
<1.4. Processing of voiceless speech system>
The functions of the voiceless speech system 1000 according to the present embodiment have been described above. Next, the processing of the voiceless speech system 1000 will be described.
(1) Machine learning flow for acquiring the first neural network
 FIG. 7 is a flowchart showing the flow of machine learning for acquiring the first neural network according to the present embodiment. First, the mobile terminal 10 acquires utterance time-series echo images from the ultrasonic echo device 20 as learning information (S100), and acquires the uttered voice from the voice input/output device 30 as learning information (S102). Next, the mobile terminal 10 performs machine learning using the acquired learning information (S104), and sets the algorithm generated by the machine learning as NN1 (S106).
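 Expressed as code, this flow amounts to a supervised training loop in which the voiced time-series echo images are the inputs and spectrogram frames computed from the simultaneously recorded voice are the targets. The following PyTorch sketch is illustrative only; the L1 loss and Adam optimizer are assumptions, as the publication does not specify them.

import torch
import torch.nn as nn

def train_nn1(nn1, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(nn1.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for echo_frames, voiced_spec_frame in loader:   # S100 / S102: learning information pairs
            pred = nn1(echo_frames)                     # acoustic feature predicted from the images
            loss = loss_fn(pred, voiced_spec_frame)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return nn1                                          # S106: the trained algorithm becomes NN1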
(2) Machine learning flow for acquiring the second neural network
 FIG. 8 is a flowchart showing the flow of machine learning for acquiring the second neural network according to the present embodiment. First, the mobile terminal 10 inputs the utterance time-series echo images into NN1 (S200) and acquires the utterance image spectrogram output from NN1 as learning information (S202). The mobile terminal 10 also acquires an uttered-voice spectrogram from the uttered voice as learning information (S204). Next, the mobile terminal 10 performs machine learning using the acquired learning information (S206), and sets the algorithm generated by the machine learning as NN2 (S208).
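 A corresponding sketch for NN2 is shown below: NN1 is kept fixed and run over the voiced echo images to produce the input spectrograms, while spectrograms of the recorded voice serve as targets. The loss and optimizer are again assumptions.

import torch
import torch.nn as nn

def train_nn2(nn1, nn2, loader, epochs=10, lr=1e-4):
    nn1.eval()                                          # NN1 is only used to generate inputs here
    opt = torch.optim.Adam(nn2.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for echo_seq, voiced_spectrogram in loader:
            with torch.no_grad():
                noisy_spec = nn1(echo_seq)              # S200-S202: NN1 output as learning input
            pred = nn2(noisy_spec)
            loss = loss_fn(pred, voiced_spectrogram)    # S204: target from the recorded voice
            opt.zero_grad()
            loss.backward()
            opt.step()
    return nn2                                          # S208: the trained algorithm becomes NN2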
(3) Processing in the mobile terminal 10
 FIG. 9 is a flowchart showing the flow of processing in the mobile terminal according to the present embodiment. First, the mobile terminal 10 acquires unvoiced time-series echo images (S300). Next, the mobile terminal 10 inputs the acquired unvoiced time-series echo images into NN1 and generates a plurality of acoustic feature amounts from them (S302). The mobile terminal 10 then synthesizes the generated acoustic feature amounts in time-series order to generate an unvoiced image spectrogram (S304).
 After generating the unvoiced image spectrogram from the unvoiced time-series echo images, the mobile terminal 10 inputs the generated unvoiced image spectrogram into NN2 and converts it into a high-precision unvoiced image spectrogram (S306). After the conversion, the recognition unit 114 of the mobile terminal 10 recognizes the content indicated by the high-precision unvoiced image spectrogram (S308), and the mobile terminal 10 executes processing based on the content recognized by the recognition unit 114 (S310).
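 Put end to end, steps S300 to S310 might look like the following sketch. The helper names are hypothetical; per-frame NN1 outputs are simply concatenated in time order to form the unvoiced image spectrogram before NN2 refines it.

import torch

def process_unvoiced(echo_sequence, nn1, nn2, recognizer, execute):
    # S300: echo_sequence is a list of per-unit-time image stacks
    features = [nn1(frames) for frames in echo_sequence]      # S302: per-frame acoustic features
    spectrogram = torch.cat(features, dim=-1)                 # S304: unvoiced image spectrogram
    refined = nn2(spectrogram)                                # S306: high-precision spectrogram
    content = recognizer(refined)                             # S308: recognize the content
    return execute(content)                                   # S310: run the requested processing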
<<2. Modifications>>
 The embodiment of the present disclosure has been described above. Next, modifications of the embodiment of the present disclosure will be described. The modifications described below may be applied to the embodiment alone or in combination, and may be applied in place of, or in addition to, the configurations described in the embodiment.
 In the embodiment described above, the high-precision unvoiced image spectrogram converted by NN2 is output to the recognition unit 114 of the mobile terminal 10. Alternatively, the high-precision unvoiced image spectrogram may be converted into a voice waveform and output as sound from a voice output device such as a speaker. This allows the user 12 to control information devices with a voice input function, such as smart speakers, via the voice output device.
 The high-precision unvoiced image spectrogram may also be output to an external voice recognition device instead of the recognition unit 114 of the mobile terminal 10. For example, the high-precision unvoiced image spectrogram may be input, via communication, to the voice recognition unit of a smart speaker. This allows the user 12 to control information devices with a voice input function, such as smart speakers, without the mobile terminal 10 radiating sound waves into the air.
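 One common way to turn a magnitude spectrogram into an audible waveform is Griffin-Lim phase reconstruction; the publication does not specify a vocoder, so this is only an illustrative choice, and it assumes the spectrogram is a linear-frequency magnitude STFT with a known hop length and sampling rate.

import librosa
import soundfile as sf

def spectrogram_to_speaker(magnitude_spec, sr=16000, hop_length=160, path="out.wav"):
    waveform = librosa.griffinlim(magnitude_spec, hop_length=hop_length)
    sf.write(path, waveform, sr)      # play this file through a speaker / feed a smart speaker
    return waveform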
<<3. Application examples>>
 Modifications of the embodiment of the present disclosure have been described above. Next, application examples of the voiceless speech system 1000 according to the embodiment of the present disclosure will be described.
<3.1. First application example>
 First, a first application example according to the present embodiment will be described. The voiceless speech system 1000 according to the present embodiment can be applied, for example, to training in which a speaker moves the mouth and tongue without vocalizing. For example, the voiceless speech system 1000 visually feeds back to the speaker the content recognized from the unvoiced time-series echo images acquired by the ultrasonic echo device 20, and the speaker can improve how they move their mouth and tongue based on the feedback. Specifically, by displaying the unvoiced time-series echo images on a display device or the like, the voiceless speech system 1000 lets the speaker check the displayed images and learn how to move the mouth and tongue. Furthermore, by feeding back by voice the content recognized from the unvoiced time-series echo images, the speaker can learn how particular mouth and tongue movements are recognized by the voiceless speech system 1000. The content recognized by the voiceless speech system 1000 may also be fed back as text.
<3.2. Second application example>
 Next, a second application example according to the present embodiment will be described. The voiceless speech system 1000 according to the present embodiment can be applied as a speech support device for people whose vocal cords are missing and for hearing-impaired people. In recent years, techniques have been proposed in which a button-controlled vibrator pressed against the pharynx substitutes for the vocal cords of a person who has lost vocal cord function, enabling the person to produce speech without vibrating the vocal cords. However, because the vibrator emits a loud sound, it can interfere with the acoustics of the speech passing through the oral cavity; it is also difficult for the speaker to adjust the volume of that sound, which can be unpleasant for the speaker. In contrast, in the voiceless speech system 1000 according to the present embodiment, the information acquired by ultrasonic echo is converted into acoustic information and the acoustic information is produced as a voice waveform, so no sound that interferes with the speech acoustics and no unpleasant sound is generated. The speaker can also adjust the volume of the voice produced by the voiceless speech system 1000. Therefore, even a person who has lost vocal cord function can use the voiceless speech system 1000 according to the present embodiment more comfortably.
 A person whose vocal cords are missing cannot produce a voice but can still move the mouth and tongue to change the state of the oral cavity. The voiceless speech system 1000 can therefore recognize the intraoral state of such a person and output the recognized content as voice from a speaker, allowing the person to communicate with others by voice. The voiceless speech system 1000 according to the present embodiment is also effective not only for people with missing vocal cords but also for people who do not have enough lung capacity to vibrate the vocal cords sufficiently, such as elderly people. For example, an elderly person who cannot speak with sufficient volume may find conversation difficult, but the voiceless speech system 1000 can give such a person the ability to produce speech and converse easily.
 A hearing-impaired person can produce a voice but finds it difficult to confirm whether that voice is being conveyed accurately to others. By using the feedback of the voiceless speech system 1000 of the present embodiment described in the first application example, a hearing-impaired person can confirm how they are producing speech. Since the voiceless speech system 1000 can also show the state of the oral cavity, a hearing-impaired person can practice speaking while checking how they move their mouth and tongue.
<3.3. Third application example>
 Next, a third application example according to the present embodiment will be described. The voiceless speech system 1000 according to the present embodiment can be applied to expanding the functions of a hearing aid. By incorporating the voiceless speech system 1000 into a hearing aid, the convenience for the hearing aid user can be improved.
<<4. Hardware configuration example>>
 Finally, a hardware configuration example of the information processing apparatus according to the present embodiment will be described with reference to FIG. 10. FIG. 10 is a block diagram showing a hardware configuration example of the information processing apparatus according to the present embodiment. The information processing apparatus 900 shown in FIG. 10 can realize, for example, the mobile terminal 10, the ultrasonic echo device 20, and the voice input/output device 30 shown in FIGS. 1 and 4. Information processing by the mobile terminal 10, the ultrasonic echo device 20, and the voice input/output device 30 according to the present embodiment is realized by cooperation between software and the hardware described below.
 As shown in FIG. 10, the information processing apparatus 900 includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, and a RAM (Random Access Memory) 903. The information processing apparatus 900 also includes a host bus 904a, a bridge 904, an external bus 904b, an interface 905, an input device 906, an output device 907, a storage device 908, a drive 909, a connection port 910, and a communication device 911. The hardware configuration shown here is an example; some of the components may be omitted, and components other than those shown here may be added.
 The CPU 901 functions, for example, as an arithmetic processing device or a control device, and controls all or part of the operation of each component based on various programs recorded in the ROM 902, the RAM 903, or the storage device 908. The ROM 902 stores programs read by the CPU 901, data used for computation, and the like. The RAM 903 temporarily or permanently stores, for example, programs read by the CPU 901 and various parameters that change as appropriate when those programs are executed. These components are interconnected by the host bus 904a, which includes a CPU bus and the like. The CPU 901, the ROM 902, and the RAM 903 can realize, for example, the functions of the control unit 110, the control unit 210, and the control unit 310 described with reference to FIG. 4, in cooperation with software.
 The CPU 901, the ROM 902, and the RAM 903 are interconnected, for example, via the host bus 904a, which is capable of high-speed data transmission. The host bus 904a is in turn connected, for example via the bridge 904, to the external bus 904b, whose data transmission speed is relatively low, and the external bus 904b is connected to various components via the interface 905.
 The input device 906 is realized by a device through which the user inputs information, such as a mouse, keyboard, touch panel, button, microphone, switch, or lever. The input device 906 may also be, for example, a remote control device using infrared rays or other radio waves, or an externally connected device such as a mobile phone or PDA that supports operation of the information processing apparatus 900. The input device 906 may further include, for example, an input control circuit that generates an input signal based on the information input by the user using the above input means and outputs it to the CPU 901. By operating the input device 906, the user of the information processing apparatus 900 can input various data to the information processing apparatus 900 and instruct it to perform processing operations.
 The input device 906 may also be formed by devices that detect information about the user. For example, the input device 906 may include various sensors such as an image sensor (for example, a camera), a depth sensor (for example, a stereo camera), an acceleration sensor, a gyro sensor, a geomagnetic sensor, an optical sensor, a sound sensor, a distance sensor (for example, a ToF (Time of Flight) sensor), and a force sensor. The input device 906 may also acquire information about the state of the information processing apparatus 900 itself, such as its posture and moving speed, and information about the surrounding environment of the information processing apparatus 900, such as the brightness and noise around it. The input device 906 may further include a GNSS module that receives signals from GNSS (Global Navigation Satellite System) satellites (for example, GPS signals from GPS (Global Positioning System) satellites) and measures position information including the latitude, longitude, and altitude of the device. Regarding position information, the input device 906 may detect the position by transmission and reception with Wi-Fi (registered trademark), a mobile phone, PHS, or smartphone, or by short-range communication or the like. The input device 906 can realize, for example, the functions of the echo acquisition unit 220 and the voice input unit 320 described with reference to FIG. 4.
 The output device 907 is formed by a device capable of notifying the user of acquired information visually or audibly. Such devices include display devices such as CRT display devices, liquid crystal display devices, plasma display devices, EL display devices, laser projectors, LED projectors, and lamps; audio output devices such as speakers and headphones; and printer devices. The output device 907 outputs, for example, results obtained by various kinds of processing performed by the information processing apparatus 900. Specifically, a display device visually displays the results obtained by the various kinds of processing performed by the information processing apparatus 900 in various formats such as text, images, tables, and graphs, while an audio output device converts an audio signal composed of reproduced voice data, acoustic data, or the like into an analog signal and outputs it audibly. The output device 907 can realize, for example, the function of the voice output unit 330 described with reference to FIG. 4.
 The storage device 908 is a device for data storage formed as an example of the storage unit of the information processing apparatus 900. The storage device 908 is realized by, for example, a magnetic storage device such as an HDD, a semiconductor storage device, an optical storage device, or a magneto-optical storage device. The storage device 908 may include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like. The storage device 908 stores programs executed by the CPU 901, various data, various data acquired from the outside, and the like. The storage device 908 can realize, for example, the function of the storage unit 120 described with reference to FIG. 4.
 The drive 909 is a reader/writer for storage media and is built into or externally attached to the information processing apparatus 900. The drive 909 reads information recorded on a mounted removable storage medium such as a magnetic disk, optical disc, magneto-optical disc, or semiconductor memory, and outputs it to the RAM 903. The drive 909 can also write information to the removable storage medium.
 The connection port 910 is a port for connecting external devices, such as a USB (Universal Serial Bus) port, an IEEE 1394 port, a SCSI (Small Computer System Interface) port, an RS-232C port, or an optical audio terminal.
 The communication device 911 is, for example, a communication interface formed by a communication device or the like for connecting to the network 920. The communication device 911 is, for example, a communication card for wired or wireless LAN (Local Area Network), LTE (Long Term Evolution), Bluetooth (registered trademark), or WUSB (Wireless USB). The communication device 911 may also be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), a modem for various kinds of communication, or the like. The communication device 911 can transmit and receive signals and the like to and from the Internet and other communication devices in accordance with a predetermined protocol such as TCP/IP. The communication device 911 can realize, for example, the functions of the communication unit 100, the communication unit 200, and the communication unit 300 described with reference to FIG. 4.
 The network 920 is a wired or wireless transmission path for information transmitted from devices connected to the network 920. For example, the network 920 may include public networks such as the Internet, telephone networks, and satellite communication networks, various LANs (Local Area Networks) including Ethernet (registered trademark), and WANs (Wide Area Networks). The network 920 may also include a dedicated line network such as an IP-VPN (Internet Protocol-Virtual Private Network).
 An example of a hardware configuration capable of realizing the functions of the information processing apparatus 900 according to the present embodiment has been shown above. Each of the above components may be realized using general-purpose members or by hardware specialized for the function of each component. The hardware configuration to be used can therefore be changed as appropriate according to the technical level at the time the present embodiment is implemented.
<<5. Summary>>
 As described above, the mobile terminal 10 according to the present embodiment converts a plurality of time-series images showing the intraoral state, acquired by ultrasonic echo, into information corresponding to the intraoral state based on an algorithm acquired by machine learning. This allows the mobile terminal 10 to convert images showing the state of the oral cavity when the user intentionally moves at least one of the mouth and the tongue, without producing a voice, into acoustic information.
 It is therefore possible to provide a new and improved information processing apparatus and information processing method that enable a user to obtain intended acoustic information without vocalizing.
 The preferred embodiments of the present disclosure have been described above in detail with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally belong to the technical scope of the present disclosure.
 For example, each device described in this specification may be realized as a single device, or some or all of them may be realized as separate devices. For example, the mobile terminal 10, the ultrasonic echo device 20, and the voice input/output device 30 shown in FIG. 1 may each be realized as an independent device. The mobile terminal 10 may also be realized, for example, as a server device connected to the ultrasonic echo device 20 and the voice input/output device 30 via a network or the like, and the functions of the control unit 110 of the mobile terminal 10 may be provided by a server device connected via a network or the like.
 The series of processes performed by each device described in this specification may be realized using software, hardware, or a combination of software and hardware. The programs constituting the software are stored in advance, for example, on recording media (non-transitory media) provided inside or outside each device. Each program is then read into RAM when executed by a computer, for example, and executed by a processor such as a CPU.
 The processes described in this specification using flowcharts do not necessarily have to be executed in the illustrated order. Some processing steps may be executed in parallel, additional processing steps may be adopted, and some processing steps may be omitted.
 The effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technology according to the present disclosure may exhibit other effects that are apparent to those skilled in the art from the description of this specification, in addition to or instead of the above effects.
 Note that the following configurations also belong to the technical scope of the present disclosure.
(1)
 An information processing apparatus including a control unit that converts a plurality of time-series images showing an intraoral state, acquired by ultrasonic echo, into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
(2)
 The information processing apparatus according to (1), in which the algorithm includes a first neural network, and the control unit converts, via the first neural network, an input plurality of unvoiced time-series images into first acoustic information.
(3)
 The information processing apparatus according to (2), in which the first neural network generates a plurality of acoustic feature amounts per unit time from the input plurality of unvoiced time-series images and generates the first acoustic information by synthesizing the generated acoustic feature amounts in time-series order.
(4)
 The information processing apparatus according to (3), in which the first neural network selects, from the plurality of unvoiced time-series images acquired in the unit time, the time-series image at the central time of the unit time, and generates the acoustic feature amount for that unit time from the selected time-series image.
(5)
 The information processing apparatus according to any one of (2) to (4), in which the first neural network is obtained by the machine learning using first learning information including voice produced during utterance and a plurality of time-series images captured during the utterance.
(6)
 The information processing apparatus according to any one of (2) to (5), in which the algorithm further includes a second neural network, and the control unit converts, via the second neural network, the first acoustic information into second acoustic information corresponding to voice produced during utterance.
(7)
 The information processing apparatus according to (6), in which the second neural network is obtained by the machine learning using second learning information including third acoustic information generated by inputting the plurality of time-series images captured during utterance into the first neural network, and fourth acoustic information corresponding to the voice produced during the utterance.
(8)
 The information processing apparatus according to any one of (2) to (7), in which the acoustic information is a spectrogram.
(9)
 The information processing apparatus according to any one of (1) to (8), in which the plurality of time-series images show changes in the intraoral state when the user moves at least one of the mouth and the tongue without vocalizing.
(10)
 The information processing apparatus according to any one of (1) to (9), in which the machine learning is performed by deep learning.
(11)
 The information processing apparatus according to any one of (1) to (10), in which the machine learning is performed using a convolutional neural network.
(12)
 An information processing method executed by a processor, the method including converting a plurality of time-series images showing an intraoral state, acquired by ultrasonic echo, into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
10 Mobile terminal
20 Ultrasonic echo device
30 Voice input/output device
100 Communication unit
110 Control unit
112 Machine learning unit
114 Recognition unit
116 Processing control unit
120 Storage unit
122 First neural network
124 Second neural network
200 Communication unit
210 Control unit
220 Echo acquisition unit
300 Communication unit
310 Control unit
320 Voice input unit
330 Voice output unit
1000 Voiceless speech system

Claims (12)

  1.  An information processing apparatus comprising a control unit that converts a plurality of time-series images showing an intraoral state, acquired by ultrasonic echo, into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
  2.  The information processing apparatus according to claim 1, wherein the algorithm includes a first neural network, and the control unit converts, via the first neural network, an input plurality of unvoiced time-series images into first acoustic information.
  3.  The information processing apparatus according to claim 2, wherein the first neural network generates a plurality of acoustic feature amounts per unit time from the input plurality of unvoiced time-series images and generates the first acoustic information by synthesizing the generated acoustic feature amounts in time-series order.
  4.  The information processing apparatus according to claim 3, wherein the first neural network selects, from the plurality of unvoiced time-series images acquired in the unit time, the time-series image at the central time of the unit time, and generates the acoustic feature amount for that unit time from the selected time-series image.
  5.  The information processing apparatus according to claim 2, wherein the first neural network is obtained by the machine learning using first learning information including voice produced during utterance and a plurality of time-series images captured during the utterance.
  6.  The information processing apparatus according to claim 2, wherein the algorithm further includes a second neural network, and the control unit converts, via the second neural network, the first acoustic information into second acoustic information corresponding to voice produced during utterance.
  7.  The information processing apparatus according to claim 6, wherein the second neural network is obtained by the machine learning using second learning information including third acoustic information generated by inputting the plurality of time-series images captured during utterance into the first neural network, and fourth acoustic information corresponding to the voice produced during the utterance.
  8.  The information processing apparatus according to claim 2, wherein the acoustic information is a spectrogram.
  9.  The information processing apparatus according to claim 1, wherein the plurality of time-series images show changes in the intraoral state when the user moves at least one of the mouth and the tongue without vocalizing.
  10.  The information processing apparatus according to claim 1, wherein the machine learning is performed by deep learning.
  11.  The information processing apparatus according to claim 1, wherein the machine learning is performed using a convolutional neural network.
  12.  An information processing method executed by a processor, the method comprising converting a plurality of time-series images showing an intraoral state, acquired by ultrasonic echo, into information corresponding to the intraoral state based on an algorithm acquired by machine learning.
PCT/JP2019/029985 2018-10-18 2019-07-31 Information processing device and information processing method WO2020079918A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201980065946.7A CN112840397A (en) 2018-10-18 2019-07-31 Information processing apparatus and information processing method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-196739 2018-10-18
JP2018196739 2018-10-18

Publications (1)

Publication Number Publication Date
WO2020079918A1 true WO2020079918A1 (en) 2020-04-23

Family

ID=70283869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/029985 WO2020079918A1 (en) 2018-10-18 2019-07-31 Information processing device and information processing method

Country Status (2)

Country Link
CN (1) CN112840397A (en)
WO (1) WO2020079918A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022064590A1 (en) * 2020-09-24 2022-03-31 Siシナジーテクノロジー株式会社 Trained autoencoder, trained autoencoder generation method, non-stationary vibration detection method, non-stationary vibration detection device, and computer program
JP7490199B2 (en) 2020-09-24 2024-05-27 Siシナジーテクノロジー株式会社 Trained autoencoder, trained autoencoder generation method, non-stationary vibration detection method, non-stationary vibration detection device, and computer program

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114429766A (en) * 2022-01-29 2022-05-03 北京百度网讯科技有限公司 Method, device and equipment for adjusting playing volume and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61226023A (en) * 1985-03-29 1986-10-07 リオン株式会社 Language function diagnostic apparatus by ultrasonic wave
JPH11219421A (en) * 1998-01-30 1999-08-10 Toshiba Corp Image recognizing device and method therefor
JP2007111335A (en) * 2005-10-21 2007-05-10 Yamaha Corp Oral cavity sensor and phoneme discrimination device
WO2013031677A1 (en) * 2011-08-26 2013-03-07 国立大学法人豊橋技術科学大学 Pronunciation movement visualization device and pronunciation learning device
US20160284347A1 (en) * 2015-03-27 2016-09-29 Google Inc. Processing audio waveforms
JP2018502319A (en) * 2015-07-07 2018-01-25 三菱電機株式会社 Method for distinguishing one or more components of a signal

Also Published As

Publication number Publication date
CN112840397A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
JP5750380B2 (en) Speech translation apparatus, speech translation method, and speech translation program
US20100131268A1 (en) Voice-estimation interface and communication system
US20160314781A1 (en) Computer-implemented method, computer system and computer program product for automatic transformation of myoelectric signals into audible speech
JP2008083376A (en) Voice translation device, voice translation method, voice translation program and terminal device
US9131876B2 (en) Portable sound source playing apparatus for testing hearing ability and method of testing hearing ability using the apparatus
US11842736B2 (en) Subvocalized speech recognition and command execution by machine learning
US11727949B2 (en) Methods and apparatus for reducing stuttering
EP3982358A2 (en) Whisper conversion for private conversations
JP2024504316A (en) Synthetic speech generation
WO2020079918A1 (en) Information processing device and information processing method
JP2009178783A (en) Communication robot and its control method
JP2016105142A (en) Conversation evaluation device and program
WO2021149441A1 (en) Information processing device and information processing method
WO2021153101A1 (en) Information processing device, information processing method, and information processing program
JP5347505B2 (en) Speech estimation system, speech estimation method, and speech estimation program
JP7218143B2 (en) Playback system and program
JP7339151B2 (en) Speech synthesizer, speech synthesis program and speech synthesis method
WO2020208926A1 (en) Signal processing device, signal processing method, and program
JP2015187738A (en) Speech translation device, speech translation method, and speech translation program
JP6696878B2 (en) Audio processing device, wearable terminal, mobile terminal, and audio processing method
US20240087597A1 (en) Source speech modification based on an input speech characteristic
JP2019087798A (en) Voice input device
WO2021028758A1 (en) Acoustic device and method for operating same
JP2000206986A (en) Language information detector
JP7070402B2 (en) Information processing equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19873987

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19873987

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP