WO2019244298A1 - Attribute identification device, attribute identification method, and program recording medium - Google Patents
Attribute identification device, attribute identification method, and program recording medium
- Publication number
- WO2019244298A1 (PCT/JP2018/023594)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- attribute
- attribute information
- value
- information
- speaker
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/0002—Remote monitoring of patients using telemetry, e.g. transmission of vital signals via a communication network
- A61B5/0015—Remote monitoring of patients using telemetry, e.g. transmission of vital signals via a communication network characterised by features of the telemetry system
- A61B5/0022—Monitoring a patient using a global network, e.g. telephone networks, internet
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/117—Identification of persons
- A61B5/1171—Identification of persons based on the shapes or appearances of their bodies or parts thereof
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7253—Details of waveform analysis characterised by using transforms
- A61B5/7257—Details of waveform analysis characterised by using transforms using Fourier transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/01—Indexing scheme relating to G06F3/01
- G06F2203/011—Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- the present invention relates to an attribute identification device, an attribute identification method, and a program recording medium.
- this type of voice processing device may estimate the attribute information as a discrete value or may estimate the attribute information as a continuous value.
- Patent Document 1 describes a technique for estimating age as an attribute of a person from a face image signal.
- The age estimation technique described in Patent Literature 1 first estimates age as a discrete value from a face image signal and also estimates age as a continuous value. It then calculates a final estimated value by integrating the discrete-value and continuous-value estimation results.
- Patent Document 1 has a problem that the identification accuracy of the attribute of a person is not sufficient.
- When estimating age as an attribute of a person from a face image signal, the technique described in Patent Document 1 estimates a first estimated value as a discrete value and a second estimated value as a continuous value, and integrates them to calculate a final estimated value.
- The technique described in Patent Literature 1 obtains the first estimated value and the second estimated value independently. For this reason, the two values may differ significantly. In such a case, even after integration, both estimates remain plausible and it is difficult to narrow them down to a single estimated value. Therefore, the age identification accuracy may be impaired.
- the present invention has been made in view of the above problems, and an object of the present invention is to provide an attribute identification device, an attribute identification method, and a program recording medium that further enhance the accuracy of attribute identification of a person.
- An attribute identification device includes first attribute identification means for identifying, based on a biological signal, first attribute information that is a value range of a specific attribute, and second attribute identification means for identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
- An attribute identification method identifies, based on a biological signal, first attribute information that is a value range of a specific attribute, and identifies second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
- A program recording medium records a program that causes a computer to execute a process of identifying, based on a biological signal, first attribute information that is a value range of a specific attribute, and a process of identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
- According to the present invention, it is possible to provide an attribute identification device, an attribute identification method, and a program recording medium that further improve the accuracy of attribute identification of a person.
- FIG. 2 is a block diagram illustrating a hardware configuration of a computer device that implements the device according to each embodiment of the present invention.
- FIG. 2 is a block diagram illustrating a functional configuration of the audio processing device according to the first embodiment of the present invention.
- FIG. 3 is a diagram illustrating an example of first attribute information output by a first attribute identification unit of the audio processing device according to the first embodiment of the present invention.
- FIG. 4 is a diagram illustrating an example of the second attribute identification unit of the audio processing device according to the first embodiment of the present invention.
- FIG. 5 is a diagram illustrating another example of the first attribute information output by the first attribute identification unit of the audio processing device according to the first embodiment of the present invention.
- FIG. 6 is a flowchart illustrating an operation of the audio processing device according to the first embodiment of the present invention.
- FIG. 7 is a block diagram showing a functional configuration of an attribute identification device according to an embodiment with a minimum configuration.
- FIG. 1 is a block diagram showing a hardware configuration of a computer device 10 for realizing a voice processing device and a voice processing method in each embodiment of the present invention.
- each component of the audio processing device described below indicates a block of a functional unit.
- Each component of the audio processing device can be realized by, for example, an arbitrary combination of a computer device 10 and software as shown in FIG.
- the computer device 10 includes a processor 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, a storage device 14, an input / output interface 15, and a bus 16.
- the storage device 14 stores the program 18.
- the processor 11 uses the RAM 12 to execute the program 18 according to the audio processing device.
- the program 18 may be stored in the ROM 13. Further, the program 18 may be recorded on the recording medium 20 and read by the drive device 17, or may be transmitted from an external device via a network.
- the input / output interface 15 exchanges data with peripheral devices (keyboard, mouse, display device, etc.) 19.
- the input / output interface 15 can function as a unit for acquiring or outputting data.
- the bus 16 connects each component.
- each unit of the audio processing device can be realized as hardware (dedicated circuit).
- the audio processing device can be realized by a combination of a plurality of devices.
- a program that causes the configuration of each embodiment to operate so as to realize the functions of the present embodiment and other embodiments (more specifically, a program that causes a computer to execute the processing illustrated in FIG. 6 and the like) is recorded on a recording medium.
- a processing method of reading a program recorded on the recording medium as a code and executing the program on a computer is also included in the scope of each embodiment. That is, a computer-readable recording medium is also included in the scope of each embodiment.
- not only a recording medium on which the above-described program is recorded, but also the program itself is included in each embodiment.
- As the recording medium, for example, a floppy (registered trademark) disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM (Compact Disc ROM), a magnetic tape, a nonvolatile memory card, or a ROM can be used.
- Not only a program that executes processing by itself as recorded on the recording medium, but also a program that runs on an OS (Operating System) and executes processing in cooperation with other software or the functions of an extension board, is included in the category of each embodiment.
- FIG. 2 is a block diagram illustrating a functional configuration of the audio processing device 100 according to the first embodiment.
- the speech processing device 100 includes a speech section detection unit 110, a speaker feature calculation unit 120, a first attribute identification unit 130, and a second attribute identification unit 140.
- the voice section detection unit 110 receives a voice signal from the outside.
- the voice signal is a signal representing a voice based on the utterance of the speaker.
- the acquired signal is not limited to a voice signal, and may be a biological signal emitted from the body due to a biological phenomenon such as a heartbeat, brain wave, pulse, respiration, or sweating.
- The voice section detection unit 110 detects voice sections included in the received voice signal and segments the signal. At this time, the voice section detection unit 110 may partition the voice signal into fixed lengths or into different lengths. For example, the voice section detection unit 110 may determine a section in which the volume of the voice signal remains below a predetermined value for a predetermined time as silence, and treat the portions before and after that section as different voice sections for segmentation. The voice section detection unit 110 then outputs the segmented speech signal, which is the segmentation result (the processing result of the voice section detection unit 110), to the speaker characteristic calculation unit 120.
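- The silence-based segmentation described above can be sketched as follows in Python. This is a minimal illustration, not the patent's implementation; the frame length, silence threshold, and minimum silence duration are assumed values chosen for the example.

```python
import numpy as np

def segment_by_silence(signal, rate, frame_ms=25, silence_thresh=0.01, min_silence_ms=200):
    """Split a waveform into voice sections at sustained low-volume (silent) runs."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    # Per-frame RMS volume
    rms = np.array([
        np.sqrt(np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    silent = rms < silence_thresh
    min_run = max(1, int(min_silence_ms / frame_ms))
    segments, start, run = [], None, 0
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
            # Close the current section once silence has lasted long enough
            if run >= min_run and start is not None:
                segments.append(signal[start * frame_len:(i - run + 1) * frame_len])
                start = None
        else:
            if start is None:
                start = i  # a new voice section begins
            run = 0
    if start is not None:
        segments.append(signal[start * frame_len:n_frames * frame_len])
    return segments
```

A signal containing two bursts of speech separated by a sufficiently long pause would be split into two segmented speech signals, one per voice section.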
- the reception of the audio signal refers to, for example, reception of an audio signal from an external device or another processing device, or delivery of a processing result of the audio signal processing from another program.
- the output is, for example, transmission to an external device or another processing device, or delivery of the processing result of the voice section detection unit 110 to another program.
- the speaker characteristic calculation unit 120 receives the segmented speech signal from the speech section detection unit 110.
- The speaker characteristic calculation unit 120 calculates, based on the received segmented audio signal, a speaker characteristic expressing the individuality contained in that signal.
- The speaker characteristic calculation unit 120 outputs the calculated speaker characteristic (its processing result). That is, the speaker characteristic calculation unit 120 serves as speaker characteristic calculation means that calculates a speaker characteristic indicating the speaker's individuality based on a voice signal representing speech, which is a biological signal.
- a speaker characteristic calculated for a certain audio signal is referred to as a speaker characteristic of the audio signal.
- the speaker feature calculation unit 120 calculates a feature vector based on i-vector representing the individuality of the voice quality of the speaker, based on the segmented speech signal received from the speech section detection unit 110.
- the speaker characteristic calculation unit 120 may use, for example, a method described in Non-Patent Document 1 as a method of calculating a feature vector based on i-vectors representing individuality of a speaker's voice quality.
- The speaker feature calculated by the speaker feature calculation unit 120 may be any vector that can be calculated by performing a predetermined operation on the segmented speech signal and that indicates the individuality of the speaker; the i-vector is one example.
- the speaker feature calculation unit 120 calculates a feature vector representing a frequency analysis result of the audio signal based on the segmented audio signal received from the audio section detection unit 110.
- The speaker feature calculation unit 120 obtains, as a feature representing a frequency analysis result, for example, a frequency filter bank feature obtained by a fast Fourier transform (FFT) process and a filter bank process, or a mel frequency cepstrum coefficient (MFCC) obtained by additionally applying a discrete cosine transform process.
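- The filter bank and MFCC computation mentioned above can be sketched as follows in Python. The FFT size, filter count, and cepstral coefficient count are illustrative assumptions, not values specified in this document.

```python
import numpy as np

def mfcc(frame, rate, n_filters=26, n_ceps=13):
    """Filter-bank feature and MFCCs for one audio frame (illustrative parameters)."""
    # Power spectrum via FFT of the windowed frame
    n_fft = 512
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2 / n_fft
    # Mel-spaced triangular filter bank
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    log_fbank = np.log(fbank @ spec + 1e-10)   # frequency filter bank feature
    # Discrete cosine transform of the log filter-bank energies -> MFCC
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_fbank, dct @ log_fbank
```

In practice a library implementation would normally be used; the point here is only the FFT, filter bank, and DCT chain the text describes.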
- the first attribute identification unit 130 receives the speaker characteristics output by the speaker characteristic calculation unit 120.
- the first attribute identification unit 130 estimates (identifies) specific attribute information using the speaker characteristics and outputs the information as first attribute information.
- the specific attribute information may be, for example, information indicating the age group of the speaker.
- the first attribute identification unit 130 serves as first attribute identification means for identifying first attribute information that is a value range of a specific attribute from the biological signal based on the biological signal. Note that the identification includes estimation of attribute values, classification based on a range of attribute values, and the like.
- The first attribute identification unit 130 may use, for example, a neural network as an identifier.
- the first attribute identification unit 130 may use a probability model such as a Gaussian mixture distribution or an identification model such as a linear discriminant analysis or a support vector machine as a classifier.
- The discriminator of the first attribute identification unit 130 is trained on learning data in which a speaker feature of a speech signal is associated with a class (described later in detail) that includes the speaker's attribute value.
- Through this learning, a classifier is generated whose input is a speaker characteristic and whose output is a class (first attribute information).
- the first attribute identifying unit 130 calculates attribute information to be output based on the input speaker characteristics and the weight coefficient of the neural network.
- FIG. 3 is a diagram illustrating an example of the first attribute information output by the first attribute identification unit 130.
- the first attribute identifying unit 130 determines a class based on a range of possible values of the attribute to be estimated, scores each class, and outputs a vector having the score as a value as first attribute information.
- the score is a value indicating the correlation between the result calculated by the classifier and the attribute information to be estimated. That is, the score is a value indicating the likelihood of the estimation result calculated by the classifier.
- the first attribute identification unit 130 determines a class based on a range of possible values of the attribute to be estimated.
- the value of the attribute to be estimated is a natural number from “10” to “60”.
- For example, the first attribute identification unit 130 defines the class including “10” to “20” as C1, the class including “21” to “40” as C2, and the class including “41” to “60” as C3.
- The first attribute identification unit 130 scores each class using the classifier and outputs a vector having the scores as values as the estimated values of classes C1 to C3.
- As illustrated in FIG. 3, for example, the first attribute identification unit 130 calculates the scores of classes C1, C2, and C3 as 0.1, 0.7, and 0.2, respectively.
- Alternatively, the first attribute identification unit 130 may output only a single class number as the estimated value.
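- The class definition and score vector of FIG. 3 can be illustrated with a short Python sketch. The softmax normalization below is one common way to turn raw classifier outputs into likelihood-like scores; it is an assumption for the example, not something this document prescribes.

```python
import math

# Age classes from the FIG. 3 example: attribute values are 10 to 60
AGE_CLASSES = {"C1": (10, 20), "C2": (21, 40), "C3": (41, 60)}

def to_class(age):
    """Map an age to the class C1-C3 whose value range contains it."""
    for name, (lo, hi) in AGE_CLASSES.items():
        if lo <= age <= hi:
            return name
    raise ValueError("age outside the assumed 10-60 range")

def normalize_scores(raw):
    """Turn raw classifier outputs into a score vector summing to 1 (softmax)."""
    exps = [math.exp(v) for v in raw]
    total = sum(exps)
    return [e / total for e in exps]
```

A score vector such as (0.1, 0.7, 0.2) then indicates that class C2 is the most likely value range for the attribute.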
- the second attribute identifying unit 140 receives the speaker characteristics output by the speaker characteristic calculating unit 120 and the first attribute information output by the first attribute identifying unit 130.
- the second attribute identification unit 140 estimates (identifies) specific attribute information (second attribute information) using the received speaker characteristics and the first attribute information.
- the specific attribute information may be, for example, information indicating the age of the speaker.
- the second attribute identification unit 140 serves as second attribute identification means for identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
- the second attribute identification unit 140 may use, for example, a neural network as an identifier.
- The discriminator of the second attribute identification unit 140 is trained on learning data in which the speaker characteristics of an audio signal, the speaker's attribute value, and the class including that attribute value are associated with one another. Through this learning, an identifier is generated whose inputs are the speaker characteristic and the first attribute information (the output of the first attribute identification unit 130 for that speaker characteristic), and whose output is the attribute information (attribute value) that is the estimation result.
- the second attribute identifying unit 140 calculates attribute information to be output based on the input including the speaker characteristics and the first attribute information and the weight coefficient of the neural network.
- the second attribute identification unit 140 calculates the estimation result as a continuous value.
- the second attribute identification unit 140 can improve the accuracy of attribute identification by using the first attribute information output by the first attribute identification unit 130 as an input.
- The reason is that the second attribute identification unit 140 estimates attribute information using the result estimated by the first attribute identification unit 130 as prior information, and is therefore more likely to output a value close to the true value than when estimating from speaker features alone, without prior information.
- In general, a discriminator that learns to minimize the residual at the learning stage tends to produce estimates biased toward the center of the overall distribution. That is, when the true value is lower than the average, it is easily overestimated, and when the true value is higher than the average, it is easily underestimated.
- The above-described bias can be reduced by using the attribute value range estimated by the first attribute identification unit 130 as prior information.
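- One plausible realization of feeding the first attribute information into the second identifier is to concatenate it with the speaker feature at the input of a small regression network. The following Python sketch shows only the forward pass with random, untrained weights; the dimensions and architecture are assumptions for illustration, not taken from this document.

```python
import numpy as np

rng = np.random.default_rng(0)

def second_stage_forward(speaker_feat, first_attr_scores, w1, b1, w2, b2):
    """Forward pass of a small regression network whose input concatenates the
    speaker feature with the first-stage class scores used as prior information."""
    x = np.concatenate([speaker_feat, first_attr_scores])
    h = np.maximum(0.0, w1 @ x + b1)   # ReLU hidden layer
    return float(w2 @ h + b2)          # continuous attribute estimate (e.g. age)

# Illustrative dimensions: a 400-dim speaker feature plus 3 class scores (C1-C3)
d_in, d_h = 400 + 3, 32
w1, b1 = rng.normal(0.0, 0.05, (d_h, d_in)), np.zeros(d_h)
w2, b2 = rng.normal(0.0, 0.05, d_h), 0.0
```

After training with a residual-minimizing loss, the class scores act as the prior information described above, pulling the continuous estimate toward the identified value range.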
- the second attribute identification unit 140 may calculate the estimation result as a discrete value.
- the second attribute identification unit 140 calculates a class whose value range is narrower (limited) than the class defined by the first attribute identification unit 130 as a discrete value estimation result.
- the discriminator of the second attribute discriminating unit 140 previously learns learning data in which a speaker feature related to an input speech signal is associated with a class including a speaker attribute value.
- the second attribute identifying unit 140 uses a class defined in a range narrower than the attribute value range defined by the first attribute identifying unit 130 for the learning data.
- For example, the second attribute identification unit 140 determines the classes in increments of “5” so that each value range is narrower than a width of “10”.
- The second attribute identification unit 140 uses the classes determined in this way for the learning data. Through this learning, a discriminator is generated whose inputs are the speaker characteristic and the first attribute information (the output of the first attribute identification unit 130 for that speaker characteristic), and whose output is the attribute information (class) that is the estimation result.
- the second attribute identifying unit 140 calculates the estimation result as a discrete value.
- the second attribute identification section 140 may have a multi-stage configuration.
- FIG. 4 is a diagram illustrating an example of the second attribute identification unit 140 having a multi-stage configuration. As shown in FIG. 4, the second attribute identification unit 140 may include a processing unit 141 that performs a discriminant analysis and a processing unit 142 that performs a regression analysis.
- The second attribute identification unit 140 calculates a discrete value as a temporary estimated value (temporary attribute information) in the processing unit 141, and the processing unit 142 then uses the temporary estimated value to calculate a continuous-value estimate.
- The processing unit 141 learns learning data in which a speaker feature of a voice signal is associated with a class including the speaker's attribute value. Through this learning, a classifier is generated whose inputs are the speaker feature and the first attribute information output by the first attribute identification unit 130, and whose output is a class (temporary estimated value). At this time, the processing unit 141 uses the speaker characteristics and the first attribute information to calculate a tentative estimated value indicated by, for example, a class of width “5” as described above.
- The processing unit 142 learns learning data in which the speaker characteristics of the audio signal, the speaker's attribute value, and the class including that attribute value are associated. Through this learning, a discriminator is generated whose inputs are the speaker characteristic, the first attribute information, and the temporary estimated value, and whose output is the second attribute information (attribute value) that is the estimation result.
- the processing unit 142 calculates an estimated value of a continuous value using the speaker characteristics, the first attribute information, and the temporary estimated value output from the processing unit 141.
- the processing unit 142 may calculate a class that is narrower than the attribute value range determined by the processing unit 141 as a discrete value using the tentative estimated value calculated by the processing unit 141 and output the class.
- As described above, the second attribute identification unit 140 estimates the class with an attribute value range defined to be narrower (finer) than the range determined by the first attribute identification unit 130, or estimates the attribute value as a continuous value. It can therefore be said that the second attribute identification unit 140 has the function of outputting a final value close to the true value. Further, although the speech processing device 100 includes a plurality of attribute identification units, the second attribute identification unit 140 calculates the final estimated value, so a single estimated value can be obtained. In this way, because the second attribute identification unit 140 calculates the attribute information using the first attribute information output from the first attribute identification unit 130 in addition to the speaker characteristics, the voice processing device 100 can output a highly accurate attribute estimation result.
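- The multi-stage flow of FIG. 4 can be summarized in a short Python sketch, where `discriminant` (standing in for processing unit 141) and `regression` (standing in for processing unit 142) are hypothetical trained models passed in as callables.

```python
def multi_stage_identify(speaker_feat, first_info, discriminant, regression):
    """Sketch of the multi-stage second attribute identification unit (FIG. 4):
    a discriminant-analysis stage yields a temporary discrete estimate, which a
    regression stage refines into a continuous value."""
    tentative = discriminant(speaker_feat, first_info)       # processing unit 141
    return regression(speaker_feat, first_info, tentative)   # processing unit 142
```

With stub models in place of the trained stages, the flow can be exercised end to end; the speaker feature and first attribute information are visible to both stages, as the text describes.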
- the first attribute identification unit 130 outputs one piece of attribute information.
- the first attribute identification unit 130 may output a plurality of pieces of attribute information.
- FIG. 5 is a diagram illustrating another example of the first attribute information output by the first attribute identification unit 130.
- The first attribute identification unit 130 may determine classes covering mutually different value ranges based on the range of possible values of the attribute to be estimated, calculate an estimated value for each class, and output them.
- For example, as value ranges different from the above C1 to C3, the first attribute identification unit 130 defines the class including “10” to “30” as D1, the class including “31” to “50” as D2, and the class including “51” to “60” as D3.
- When the classifier used by the first attribute identification unit 130 is a neural network, the first attribute identification unit 130 may be configured with two output layers corresponding to classes C1 to C3 and classes D1 to D3, respectively.
- The first attribute identification unit 130 can improve the accuracy of attribute identification by determining a plurality of class sets whose value ranges differ. For example, consider identifying which of the above classes C1 to C3 contains the attribute value, focusing on class C2, which includes “21” to “40”. In this case, identification accuracy near the boundaries, at “21” and “40”, is lower than near “30” at the center of C2's value range. That is, “21” may be misidentified between classes C1 and C2, and “40” between classes C2 and C3.
- Therefore, classes D1 to D3 are separately defined so that values close to the C-class boundaries, such as “21” and “40”, lie near the centers of their ranges. That is, the first attribute identification unit 130 divides the attribute values in two or more ways so that the boundaries of the value ranges differ, and identifies the value range under each division. As a result, values near the boundaries of classes C1 to C3 can be identified as accurately as values near the centers, improving identification accuracy.
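- The two overlapping class partitions can be sketched in Python as follows; note how a value near a C-class boundary falls near the center of a D-class range.

```python
# Two overlapping partitions of the age range 10-60: C1-C3 (FIG. 3) and D1-D3
# (FIG. 5). Values near a C boundary ("21", "40") lie near a D-range center.
C = {"C1": (10, 20), "C2": (21, 40), "C3": (41, 60)}
D = {"D1": (10, 30), "D2": (31, 50), "D3": (51, 60)}

def classify(value, partition):
    """Return the name of the class whose value range contains `value`."""
    for name, (lo, hi) in partition.items():
        if lo <= value <= hi:
            return name
    raise ValueError("value outside the assumed 10-60 range")
```

For instance, “21” sits at the edge of C2 but well inside D1, so combining both partitions reduces the chance of a boundary misidentification.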
- As described above, in the present embodiment, the first attribute identification unit 130 roughly estimates the attribute value as the first attribute information, and the second attribute identification unit 140 estimates the attribute value in detail using the first attribute information.
- the attribute value can be accurately estimated for the audio signal. That is, the speech processing device 100 according to the present embodiment can improve the accuracy of attribute identification of a person.
- the voice processing device 100 receives one or more voice signals from the outside and provides the voice signal to the voice section detection unit 110.
- the voice section detection unit 110 partitions the received voice signal, and outputs the partitioned voice signal to the speaker characteristic calculation unit 120 (Step S101).
- the speaker characteristic calculating unit 120 calculates a speaker characteristic for each of the received one or more segmented audio signals (step S102).
- the first attribute identification unit 130 identifies and outputs first attribute information based on the received one or more speaker characteristics (step S103).
- the second attribute identifying unit 140 identifies and outputs the second attribute information based on the received one or more speaker characteristics and the first attribute information (Step S104).
- the sound processing device 100 ends a series of processes.
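- The flow of steps S101 to S104 can be sketched as a small Python pipeline in which the four callables stand in for the voice section detection unit 110, speaker characteristic calculation unit 120, first attribute identification unit 130, and second attribute identification unit 140 (all hypothetical stand-ins, not the actual implementation).

```python
def run_pipeline(audio_signal, detect_sections, calc_speaker_feature,
                 identify_first, identify_second):
    """Run the FIG. 6 flow: segment (S101), extract speaker features (S102),
    identify the first attribute information (S103), then the second (S104)."""
    results = []
    for segment in detect_sections(audio_signal):      # step S101
        feat = calc_speaker_feature(segment)           # step S102
        first = identify_first(feat)                   # step S103
        results.append(identify_second(feat, first))   # step S104
    return results
```

Each segmented voice section yields one second-attribute estimate, matching the per-segment processing the flowchart describes.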
- As described above, according to the present embodiment, the accuracy of attribute identification of a person can be improved. This is because the voice processing device 100 uses the first attribute information roughly estimated by the first attribute identification unit 130, and the second attribute identification unit 140 estimates and outputs the attribute information in more detail.
- by refining the attribute estimate stepwise, the estimated value can be obtained with consistent accuracy regardless of where the attribute lies within its range of possible values.
- the audio processing device 100 is an example of an attribute identification device that identifies specific attribute information from an audio signal.
- the voice processing device 100 can be used as an age identification device.
- the attribute information may be information indicating the gender of the speaker, information indicating the age group of the speaker, or information indicating the physique of the speaker.
- the voice processing device 100 can be used as an emotion identification device when the specific attribute information is information indicating the emotion of the speaker when speaking. Further, the voice processing device 100 can be used, for example, as part of a voice search device or a voice display device that includes a mechanism for identifying, among a plurality of stored voice signals, the voice signals corresponding to a specific emotion based on emotion information estimated using emotion characteristics.
- the emotion information includes, for example, information indicating an emotional expression, information indicating the character of the speaker, and the like.
- FIG. 7 is a block diagram showing the functional configuration of the attribute identification device 100 according to an embodiment having the minimum configuration of the present invention.
- the attribute identification device 100 includes a first attribute identification unit 130 and a second attribute identification unit 140.
- the first attribute identifying unit 130 identifies, based on a biological signal, first attribute information that is a range of values of a specific attribute.
- the second attribute identification section 140 identifies second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
- since the second attribute identifying unit 140 uses the first attribute information output by the first attribute identifying unit 130 as an input, the accuracy of attribute identification of a person can be further improved.
- the voice processing device and the like described above have the effect of increasing the accuracy of attribute identification of a person, and are useful as voice processing devices, attribute identification devices, and the like.
- REFERENCE SIGNS LIST: 100 voice processing device; 110 voice section detection unit; 120 speaker characteristic calculation unit; 130 first attribute identification unit; 140 second attribute identification unit
Abstract
Description
<First Embodiment>
The hardware constituting the speech processing device or attribute identification device according to the first and other embodiments of the present invention will be described. FIG. 1 is a block diagram showing the hardware configuration of a computer device 10 that realizes the speech processing device and the speech processing method in each embodiment of the present invention. In each embodiment of the present invention, each component of the speech processing device described below represents a functional block. Each component of the speech processing device can be realized by any combination of a computer device 10, as shown for example in FIG. 1, and software.
That is, the speaker characteristic calculation unit 120 serves as speaker characteristic calculation means that calculates a speaker characteristic representing the individuality of a speaker based on a voice signal, which is a biological signal, representing speech. Hereinafter, the speaker characteristic calculated for a given voice signal is referred to as the speaker characteristic of that voice signal.
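As a rough illustration of the speaker characteristic calculation means, the following sketch condenses a variable-length sequence of per-frame features into a fixed-length vector of per-dimension means and standard deviations. This is a hypothetical stand-in for the statistical speaker representations (such as the i-vectors cited in the references); the frame features themselves are assumed inputs:

```python
import math

def speaker_characteristic(frames):
    """Condense variable-length frame features into one fixed-length vector.

    `frames` is a list of equal-length feature vectors (one per audio frame);
    the result concatenates the per-dimension mean and standard deviation,
    a simple stand-in for i-vector style speaker embeddings.
    """
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return means + stds

frames = [[1.0, 2.0], [3.0, 4.0]]  # two frames of 2-dimensional features
print(speaker_characteristic(frames))  # → [2.0, 3.0, 1.0, 1.0]
```

The key property, as in the specification, is that utterances of different lengths map to vectors of the same size, so downstream attribute identifiers can consume them uniformly.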
<Operation of the First Embodiment>
Next, the operation of the speech processing device 100 in the first embodiment will be described with reference to the flowchart of FIG. 6. FIG. 6 is a flowchart showing an example of the operation of the speech processing device 100.
<Effects of the First Embodiment>
As described above, according to the speech processing device 100 of the present embodiment, the accuracy of attribute identification of a person can be improved. This is because the second attribute identification unit 140 estimates and outputs attribute information in more detail using the first attribute information coarsely estimated by the first attribute identification unit 130.
An embodiment with the minimum configuration of the present invention will be described.
110 voice section detection unit
120 speaker characteristic calculation unit
130 first attribute identification unit
140 second attribute identification unit
Claims (8)
- first attribute identification means for identifying, based on a biological signal, first attribute information that is a range of values of a specific attribute from the biological signal; and
second attribute identification means for identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information;
an attribute identification device comprising the above. - The second attribute identification means identifies, as the second attribute information,
at least one of a value of the specific attribute or a range of attribute values narrower than that identified by the first attribute identification means,
the attribute identification device according to claim 1. - The first attribute identification means, as the first attribute information,
divides the range of attribute values in two or more ways so that the boundary values of the ranges differ, and identifies a range of attribute values in each division,
the attribute identification device according to claim 1 or 2. - The second attribute identification means
identifies a range of attribute values from the biological signal and the first attribute information as provisional attribute information, and
identifies the second attribute information from the biological signal and the provisional attribute information,
the attribute identification device according to any one of claims 1 to 3. - The device further comprises speaker characteristic calculation means for calculating, based on a voice signal representing speech that is the biological signal, a speaker characteristic representing the individuality of a speaker,
wherein the first attribute identification means identifies the first attribute information from the speaker characteristic, and
the second attribute identification means identifies the second attribute information from the speaker characteristic and the first attribute information,
the attribute identification device according to any one of claims 1 to 4. - The specific attribute information is
information representing at least one of the age, gender, physique, emotion, and personality of a person identified from the biological signal,
the attribute identification device according to any one of claims 1 to 5. - An attribute identification method comprising: identifying, based on a biological signal, first attribute information that is a range of values of a specific attribute from the biological signal; and
identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information;
an attribute identification method. - A process of identifying, based on a biological signal, first attribute information that is a range of values of a specific attribute from the biological signal; and
a process of identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information;
a program recording medium recording a program that causes a computer to execute the above processes.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP18923160.8A EP3813061A4 (en) | 2018-06-21 | 2018-06-21 | ATTRIBUTE IDENTIFICATION DEVICE, ATTRIBUTE IDENTIFICATION METHOD AND PROGRAM STORAGE MEDIUM |
PCT/JP2018/023594 WO2019244298A1 (ja) | 2018-06-21 | 2018-06-21 | 属性識別装置、属性識別方法、およびプログラム記録媒体 |
US17/253,763 US20210264939A1 (en) | 2018-06-21 | 2018-06-21 | Attribute identifying device, attribute identifying method, and program storage medium |
JP2020525165A JP7160095B2 (ja) | 2018-06-21 | 2018-06-21 | 属性識別装置、属性識別方法、およびプログラム |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2018/023594 WO2019244298A1 (ja) | 2018-06-21 | 2018-06-21 | 属性識別装置、属性識別方法、およびプログラム記録媒体 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019244298A1 true WO2019244298A1 (ja) | 2019-12-26 |
Family
ID=68982796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2018/023594 WO2019244298A1 (ja) | 2018-06-21 | 2018-06-21 | 属性識別装置、属性識別方法、およびプログラム記録媒体 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20210264939A1 (ja) |
EP (1) | EP3813061A4 (ja) |
JP (1) | JP7160095B2 (ja) |
WO (1) | WO2019244298A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2021056335A (ja) * | 2019-09-30 | 2021-04-08 | 株式会社なごみテクノロジー | 評価システム |
WO2023013081A1 (ja) * | 2021-08-06 | 2023-02-09 | 日本電信電話株式会社 | 学習装置、推定装置、学習方法及び学習プログラム |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11699447B2 (en) * | 2020-06-22 | 2023-07-11 | Rovi Guides, Inc. | Systems and methods for determining traits based on voice analysis |
CN114863939B (zh) * | 2022-07-07 | 2022-09-13 | 四川大学 | 一种基于声音的大熊猫属性识别方法及系统 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0955665A (ja) * | 1995-08-14 | 1997-02-25 | Toshiba Corp | 音声符号化装置 |
JP4273359B2 (ja) | 2007-09-28 | 2009-06-03 | Necソフト株式会社 | 年齢推定システム及び年齢推定方法 |
JP2010510534A (ja) * | 2006-11-16 | 2010-04-02 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 音声アクティビティ検出システム及び方法 |
US8160877B1 (en) * | 2009-08-06 | 2012-04-17 | Narus, Inc. | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016177394A (ja) * | 2015-03-19 | 2016-10-06 | カシオ計算機株式会社 | 情報処理装置、年齢推定方法、及びプログラム |
SG11201805830TA (en) * | 2016-01-12 | 2018-08-30 | Hitachi Int Electric Inc | Congestion-state-monitoring system |
JP6980994B2 (ja) * | 2016-10-07 | 2021-12-15 | 日本電気株式会社 | 情報処理装置、制御方法、及びプログラム |
TWI611374B (zh) * | 2017-05-04 | 2018-01-11 | Chunghwa Telecom Co Ltd | 垂直式影像人流計數之性別與年齡辨識方法 |
-
2018
- 2018-06-21 US US17/253,763 patent/US20210264939A1/en active Pending
- 2018-06-21 JP JP2020525165A patent/JP7160095B2/ja active Active
- 2018-06-21 WO PCT/JP2018/023594 patent/WO2019244298A1/ja active Application Filing
- 2018-06-21 EP EP18923160.8A patent/EP3813061A4/en not_active Withdrawn
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0955665A (ja) * | 1995-08-14 | 1997-02-25 | Toshiba Corp | 音声符号化装置 |
JP2010510534A (ja) * | 2006-11-16 | 2010-04-02 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 音声アクティビティ検出システム及び方法 |
JP4273359B2 (ja) | 2007-09-28 | 2009-06-03 | Necソフト株式会社 | 年齢推定システム及び年齢推定方法 |
US8160877B1 (en) * | 2009-08-06 | 2012-04-17 | Narus, Inc. | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting |
Non-Patent Citations (5)
Title |
---|
CHEN, CHIH-CHANG ET AL.: "Gender-to-Age Hierarchical Recognition for Speech", PROC. 2011 IEEE 54TH INTERNATIONAL MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, 7 August 2011 (2011-08-07), pages 1 - 4, XP031941418, DOI: 10.1109/MWSCAS.2011.6026475 *
NAJIM DEHAK; PATRICK KENNY; REDA DEHAK; PIERRE DUMOUCHEL; PIERRE OUELLET: "Front-End Factor Analysis for Speaker Verification", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 19, no. 4, 2011, pages 788 - 798, XP055566628, DOI: 10.1109/TASL.2010.2064307
PONTES, JHONY K. ET AL.: "A Flexible Hierarchical Approach for Facial Age Estimation based on Multiple Features", PATTERN RECOGNITION, vol. 54, June 2016 (2016-06-01), pages 34 - 51, XP029439146, DOI: 10.1016/j.patcog.2015.12.003 * |
POORJAM, AMIR HOSSEIN ET AL.: "Multitask Speaker Profiling for Estimating Age, Height, Weight and Smoking Habits from Spontaneous Telephone Speech Signals", PROC. 2014 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND KNOWLEDGE ENGINEERING, 29 October 2014 (2014-10-29), pages 7 - 12, XP032711179, DOI: 10.1109/ICCKE.2014.6993339 * |
See also references of EP3813061A4 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2021056335A (ja) * | 2019-09-30 | 2021-04-08 | 株式会社なごみテクノロジー | 評価システム |
WO2023013081A1 (ja) * | 2021-08-06 | 2023-02-09 | 日本電信電話株式会社 | 学習装置、推定装置、学習方法及び学習プログラム |
Also Published As
Publication number | Publication date |
---|---|
JP7160095B2 (ja) | 2022-10-25 |
EP3813061A1 (en) | 2021-04-28 |
JPWO2019244298A1 (ja) | 2021-07-08 |
US20210264939A1 (en) | 2021-08-26 |
EP3813061A4 (en) | 2021-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7152514B2 (ja) | 声紋識別方法、モデルトレーニング方法、サーバ、及びコンピュータプログラム | |
WO2019244298A1 (ja) | 属性識別装置、属性識別方法、およびプログラム記録媒体 | |
JP6303971B2 (ja) | 話者交替検出装置、話者交替検出方法及び話者交替検出用コンピュータプログラム | |
WO2019102884A1 (ja) | ラベル生成装置、モデル学習装置、感情認識装置、それらの方法、プログラム、および記録媒体 | |
US11875799B2 (en) | Method and device for fusing voiceprint features, voice recognition method and system, and storage medium | |
EP3469582A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
US20160071520A1 (en) | Speaker indexing device and speaker indexing method | |
CN112435684B (zh) | 语音分离方法、装置、计算机设备和存储介质 | |
JP7342915B2 (ja) | 音声処理装置、音声処理方法、およびプログラム | |
US11837236B2 (en) | Speaker recognition based on signal segments weighted by quality | |
JP6553015B2 (ja) | 話者属性推定システム、学習装置、推定装置、話者属性推定方法、およびプログラム | |
Besbes et al. | Multi-class SVM for stressed speech recognition | |
JP6676009B2 (ja) | 話者判定装置、話者判定情報生成方法、プログラム | |
JP2017097188A (ja) | 話者らしさ評価装置、話者識別装置、話者照合装置、話者らしさ評価方法、プログラム | |
Karthikeyan et al. | Hybrid machine learning classification scheme for speaker identification | |
JP6373621B2 (ja) | 話し方評価装置、話し方評価方法、プログラム | |
JP6724290B2 (ja) | 音響処理装置、音響処理方法、及び、プログラム | |
Jeon et al. | Nonnegative matrix factorization based adaptive noise sensing over wireless sensor networks | |
JP7107377B2 (ja) | 音声処理装置、音声処理方法、およびプログラム | |
JP6636374B2 (ja) | 登録発話分割装置、話者らしさ評価装置、話者識別装置、登録発話分割方法、話者らしさ評価方法、プログラム | |
Sekkate et al. | A multiresolution-based fusion strategy for improving speech emotion recognition efficiency | |
Płonkowski | Using bands of frequencies for vowel recognition for Polish language | |
CN114678037B (zh) | 一种重叠语音的检测方法、装置、电子设备及存储介质 | |
WO2022249450A1 (ja) | 学習方法、検出方法、それらの装置、およびプログラム | |
CN111862946B (zh) | 一种订单处理方法、装置、电子设备及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18923160 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2020525165 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2018923160 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2018923160 Country of ref document: EP Effective date: 20210121 |