WO2012020717A1 - Speech segment determination device, speech segment determination method, and speech segment determination program - Google Patents

Speech segment determination device, speech segment determination method, and speech segment determination program

Info

Publication number
WO2012020717A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
standard
section
speech segment
determination
Prior art date
Application number
PCT/JP2011/068003
Other languages
English (en)
Japanese (ja)
Inventor
隆行 荒川
田中 大介
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to US13/814,141 priority Critical patent/US9293131B2/en
Priority to JP2012528661A priority patent/JP5725028B2/ja
Publication of WO2012020717A1 publication Critical patent/WO2012020717A1/fr

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal

Definitions

  • The present invention relates to a speech segment determination device, a speech segment determination method, and a speech segment determination program.
  • Speech segment determination techniques are used to improve voice transmission efficiency in mobile communication and the like by removing or compressing non-speech segments in which the speaker is not speaking.
  • Speech segment determination techniques are also used to estimate noise during non-speech segments with a noise canceller, an echo canceller, or the like.
  • Speech segment determination techniques are further widely used to improve performance and reduce the processing load in speech recognition systems.
  • A typical speech segment determination system calculates a feature value per unit time from the time series of an input sound and compares that feature value with a threshold, thereby classifying the time series of the input sound into speech segments and non-speech segments. Feature values such as the following are used for speech segment determination.
  • Patent Document 1 discloses using the smoothed fluctuation of spectral power as the feature value.
  • Non-Patent Document 1 discloses using the SNR shown in its Section 4.3.3 or the average SNR shown in its Section 4.3.5 as the feature value.
  • Other feature values in use include the number of zero crossings shown in Section B.3.1.4 of Non-Patent Document 2, the likelihood ratio between a speech GMM (Gaussian Mixture Model) and a silence GMM shown in Non-Patent Document 3, and combinations of multiple feature values such as the one shown in Patent Document 2.
  • In Patent Document 2, the user is prompted to make a reference utterance, forced alignment is performed on that utterance to determine its speech segments and non-speech segments, and a method is disclosed of updating the weights for multiple feature values so as to minimize the error with respect to the determined speech and non-speech segments.
  • An object of the present invention is to provide a speech segment determination device, a speech segment determination method, and a speech segment determination program that update the parameters used for speech segment determination without imposing a burden on the user and that are robust against noise.
  • One aspect of the present invention is a speech segment determination device comprising: first speech segment determination means for determining a time-series speech segment (first speech segment) and non-speech segment (first non-speech segment) of an input sound by comparing a time-series feature value of the input sound with a threshold; second speech segment determination means for determining the time-series speech segments and non-speech segments of the first non-speech segment after a standard speech acquired from standard speech storage means has been superimposed on the time series of the first non-speech segment, by comparing the time-series feature value of the superimposed first non-speech segment with the threshold; and threshold updating means for updating the threshold so that the mismatch rate between the determination result of the second speech segment determination means and the correct answer calculated from the standard speech becomes small.
  • Another aspect of the present invention is a speech segment determination method, and a speech segment determination program that causes a computer to execute: a first speech segment determination step of determining the time-series speech segment (first speech segment) and non-speech segment (first non-speech segment) of an input sound by comparing the time-series feature value of the input sound with a threshold; a second speech segment determination step of determining the time-series speech segments and non-speech segments of the first non-speech segment after the standard speech acquired from standard speech storage means has been superimposed on its time series, by comparing the time-series feature value of the superimposed segment with the threshold; and a threshold update step of updating the threshold so that the mismatch rate between the determination result of the second speech segment determination step and the correct answer calculated from the standard speech becomes small.
  • According to the present invention, the parameters used for speech segment determination are updated without imposing a burden on the user, and a speech segment determination device, a speech segment determination method, and a speech segment determination program that are robust against noise can be provided.
  • Each unit constituting the speech segment determination device 1 of each embodiment is realized by any combination of hardware and software, centering on a control unit, a memory, a program loaded into the memory, a storage unit such as a hard disk that stores the program, and a network connection interface. Unless otherwise noted, the realization method and apparatus are not limited.
  • The control unit is composed of a CPU (Central Processing Unit) or the like, and controls the entire speech segment determination device 1 by operating the operating system.
  • Programs and data are loaded into the memory from a recording medium mounted on a drive device or the like.
  • The recording medium is, for example, an optical disk, a flexible disk, a magneto-optical disk, an external hard disk, or a semiconductor memory, and records a computer program in a computer-readable form.
  • The computer program may also be downloaded from an external computer (not shown) connected to a communication network.
  • The block diagrams used in the description of each embodiment show blocks in functional units, not configurations in hardware units. These functional blocks are realized by any combination of hardware and software. In these drawings, the components of each embodiment may be described as being realized by one physically coupled device, but the means for realizing them is not limited thereto.
  • FIG. 1 is a diagram showing the configuration of the first exemplary embodiment of the present invention.
  • A speech segment determination device 1 according to the first exemplary embodiment of the present invention includes an input sound acquisition unit 101, a threshold storage unit 102, a first speech segment determination unit 103, a standard speech storage unit 104, a standard speech superimposing unit 105, a second speech segment determination unit 106, a determination result comparison unit 107, and a threshold update unit 108.
  • The input sound acquisition unit 101 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The input sound acquisition unit 101 is connected to or integrally formed with a device such as a microphone, and is configured to acquire the time series of the input sound.
  • The threshold storage unit 102 is realized by a storage device such as an optical disk device or a magnetic disk device.
  • The threshold storage unit 102 is configured to store a threshold related to speech segment determination.
  • Specifically, the threshold storage unit 102 stores the threshold used when the first speech segment determination unit 103 determines whether the time series of the input sound is a speech segment or a non-speech segment.
  • The first speech segment determination unit 103 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The first speech segment determination unit 103 is configured to determine whether the time series of the input sound acquired by the input sound acquisition unit 101 is a speech segment or a non-speech segment, using the threshold stored in the threshold storage unit 102.
  • A section determined by the first speech segment determination unit 103 to be a speech segment is called a first speech segment, and a section determined to be a non-speech segment is called a first non-speech segment.
  • The standard speech storage unit 104 is realized by a storage device such as an optical disk device or a magnetic disk device.
  • The standard speech storage unit 104 stores speech data (standard speech) whose utterance content and duration (length) are known in advance, together with that content and length information.
  • The standard speech superimposing unit 105 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The standard speech superimposing unit 105 is configured to superimpose the standard speech stored in the standard speech storage unit 104 on the time series of the input sound determined to be a non-speech segment by the first speech segment determination unit 103. The detailed operation of the standard speech superimposing unit 105 will be described later.
  • The second speech segment determination unit 106 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The second speech segment determination unit 106 is configured to determine again, using the threshold stored in the threshold storage unit 102, whether each portion of the time series is a speech segment or a non-speech segment, for the time series of the input sound after the standard speech superimposing unit 105 has superimposed the standard speech (that is, for the time series of the input sound determined to be a non-speech segment by the first speech segment determination unit 103).
  • The determination result comparison unit 107 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The determination result comparison unit 107 is configured to compare the determination result of the second speech segment determination unit 106 with the correct speech segment and non-speech segment lengths (the correct answer) obtained from the length information of the standard speech stored in the standard speech storage unit 104, and to output the comparison result to the threshold update unit 108.
  • The threshold update unit 108 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The threshold update unit 108 is configured to update the threshold stored in the threshold storage unit 102 based on the comparison result output from the determination result comparison unit 107.
  • First, the input sound acquisition unit 101 acquires the time series of the input sound (step S1 in FIG. 2).
  • For example, the input sound acquisition unit 101 may acquire analog data captured by a microphone or the like as digital data sampled at 8000 Hz with 16-bit linear PCM.
  • Next, the first speech segment determination unit 103 determines the first speech segments and non-speech segments from the time series of the input sound (step S2 in FIG. 2).
  • For example, in the utterance illustrated in FIG. 3, the spoken portions "Hello" and "This is Mori" are speech segments, and the sections before, after, and between them are non-speech segments.
  • The first speech segment determination unit 103 may calculate, from the time series of the input signal, a feature value indicating speech likelihood for each short unit time such as 10 milliseconds, compare its magnitude with the threshold stored in the threshold storage unit 102, and use the result for speech segment determination.
  • The first speech segment determination unit 103 may use, for example, the amplitude power as the feature value indicating speech likelihood.
  • The amplitude power Pt is calculated by, for example, (Equation 1) below, where N is the number of sample points per unit time and xt is the value of the input sound data (waveform data) at time t. The published equation image is not reproduced in this text; a standard frame-power form consistent with this description is Pt = (1/N) Σ xt², with the sum taken over the N sample points in the unit time.
  • The first speech segment determination unit 103 determines that a unit time is in the speech state when its amplitude power is at least the threshold, and in the non-speech state when its amplitude power is less than the threshold.
  • Here the first speech segment determination unit 103 uses the amplitude power as the feature value indicating speech likelihood, but other feature values such as the number of zero crossings, the likelihood ratio between a speech model and a non-speech model, the pitch frequency, or the SN ratio may be used.
  • The first speech segment determination unit 103 then determines, from the speech and non-speech states determined for each unit time, sections in which the same state continues as speech segments and non-speech segments.
  • The beginning of a speech segment is the time at which the continuing non-speech state changes to the speech state; this time is also the end of a non-speech segment. The end of a speech segment is the time at which the continuing speech state is interrupted and changes to the non-speech state; this time is also the beginning of a non-speech segment.
  • Speech segments and non-speech segments are thus delimited whenever a continuing state is interrupted. Note that, to prevent excessively short speech or non-speech segments from occurring, the first speech segment determination unit 103 may refrain from confirming a state change until the new state has continued for a certain period of time (a minimal code sketch of this first-stage determination follows).
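  • As an illustration only, the following Python sketch implements a first-stage determination of the kind described above: frame power compared against a threshold, followed by merging of runs shorter than a minimum duration. The function names, the 80-sample frame (10 ms at the 8000 Hz sampling rate mentioned above), and the 5-frame minimum are assumptions for the sketch, not part of the published specification.

```python
import numpy as np

def frame_power(x, frame_len):
    """Mean-square amplitude power P_t of each unit-time frame
    (Equation 1 as reconstructed above)."""
    n_frames = len(x) // frame_len
    frames = np.asarray(x, dtype=float)[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)

def first_stage_vad(x, theta, frame_len=80, min_frames=5):
    """Label each frame speech (True) or non-speech (False) by comparing P_t
    with the threshold theta, then merge runs shorter than min_frames into the
    preceding state so that very short speech/non-speech segments do not occur."""
    labels = frame_power(x, frame_len) >= theta
    out = labels.copy()
    run_start = 0
    for i in range(1, len(labels) + 1):
        at_boundary = (i == len(labels)) or (labels[i] != labels[run_start])
        if at_boundary:
            if i - run_start < min_frames and run_start > 0:
                out[run_start:i] = out[run_start - 1]  # absorb the short run
            run_start = i
    return out
```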
  • Next, the standard speech superimposing unit 105 superimposes the standard speech on the time series of the input sound in the sections determined to be non-speech segments by the first speech segment determination unit 103 (step S3 in FIG. 2). For example, the standard speech superimposing unit 105 may calculate the sum for each sample point as in (Equation 2) below; the published equation image is not reproduced in this text, but a sample-wise sum consistent with the description is yt = xt + st, where st is the standard speech waveform.
  • The standard speech superimposing unit 105 may select the standard speech to superimpose according to the length of the non-speech segment, from standard speeches of multiple lengths prepared in the standard speech storage unit 104. The standard speech superimposing unit 105 may also superimpose the standard speech multiple times when the non-speech segment is longer than a predetermined value.
  • Conversely, the standard speech superimposing unit 105 may leave a non-speech segment without superimposed speech.
  • For example, the standard speech superimposing unit 105 may superimpose the standard speech on the first and third non-speech segments but not on the second non-speech segment, because the length of the second non-speech segment is shorter than the predetermined value (a sketch of this step follows).
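  • A minimal sketch of this superposition step, under the same assumptions as the previous sketch; the sample-wise sum yt = xt + st is the reconstructed Equation 2, and all names are illustrative:

```python
import numpy as np

def superimpose_standard_speech(x, nonspeech_spans, standard_voices, min_len):
    """Add a stored standard speech waveform sample-by-sample (y_t = x_t + s_t)
    into each non-speech span long enough to receive one; spans shorter than
    min_len are left untouched, as the description above allows."""
    y = np.asarray(x, dtype=float).copy()
    placed = []  # (start, end) sample ranges that received standard speech
    for start, end in nonspeech_spans:
        gap = end - start
        if gap < min_len:
            continue  # too short: do not superimpose
        fitting = [s for s in standard_voices if len(s) <= gap]
        if not fitting:
            continue
        s = max(fitting, key=len)  # pick the longest stored voice that fits
        y[start:start + len(s)] += s
        placed.append((start, start + len(s)))
    return y, placed
```

  • The placed ranges give the correct answer used in the comparison step below: those samples are known to contain speech, while the remainder of each span is known to be non-speech.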
  • Next, the second speech segment determination unit 106 determines the second speech segments and non-speech segments from the time series of the input sound on which the standard speech has been superimposed (step S4 in FIG. 2). The method for determining speech and non-speech segments is the same as in step S2 of FIG. 2.
  • The threshold used by the second speech segment determination unit 106 is the same value as that used by the first speech segment determination unit 103.
  • Next, the determination result comparison unit 107 compares the determination result of the second speech segments and non-speech segments with the correct-answer determination result (step S5 in FIG. 2).
  • The determination result comparison unit 107 performs the comparison using, for example, a false rejection rate (FRR) and a false acceptance rate (FAR), defined by (Equation 3) and (Equation 4) respectively. The equation images are not reproduced in this text; in line with the usual definitions, the FRR is the proportion of frames known from the standard speech to be speech that were determined to be non-speech, and the FAR is the proportion of frames known to be non-speech that were determined to be speech.
  • The determination result comparison unit 107 may calculate the FRR and FAR for each non-speech segment determined in step S2 of FIG. 2. The determination result comparison unit 107 may also compare the determination results using another mismatch rate that represents the degree of mismatch between the segments.
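  • A sketch of the comparison, assuming the usual frame-level definitions of the two rates given above (the exact published Equations 3 and 4 may differ):

```python
import numpy as np

def mismatch_rates(pred_speech, true_speech):
    """FRR: fraction of frames known (from the superimposed standard speech)
    to be speech that were judged non-speech. FAR: fraction of frames known
    to be non-speech that were judged speech. Both inputs are per-frame
    boolean arrays over the same span."""
    pred = np.asarray(pred_speech, dtype=bool)
    true = np.asarray(true_speech, dtype=bool)
    frr = float(np.mean(~pred[true])) if true.any() else 0.0
    far = float(np.mean(pred[~true])) if (~true).any() else 0.0
    return frr, far
```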
  • Next, the threshold update unit 108 updates the threshold used for speech segment determination based on the comparison result (step S6 in FIG. 2).
  • The threshold update unit 108 updates the threshold θ using, for example, (Equation 5) below, in which ε is a step size and α is a parameter that controls whether importance is placed on the false rejection rate (FRR) or the false acceptance rate (FAR). These two parameters may be set to predetermined values in advance, or may be generated according to the conditions and environment.
  • In this way, the threshold update unit 108 updates the threshold so that the mismatch rate between the determination result of the second speech segments and non-speech segments and the correct-answer determination result becomes small.
  • The threshold update unit 108 may update the value of the threshold θ using only one of the false rejection rate (FRR) and the false acceptance rate (FAR), or may update it using another mismatch rate.
  • The threshold updating method is not particularly limited. The processing from step S1 to step S6 in FIG. 2 may be performed for each utterance of the user, each time a speech segment or non-speech segment is determined, or at fixed time intervals.
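  • Since (Equation 5) itself is not reproduced in this text, the following sketch uses one hypothesised form that matches the description (step size ε, weight α between FRR and FAR); the actual published update rule may differ:

```python
def update_threshold(theta, frr, far, eps=0.05, alpha=0.5):
    """Hypothesised Equation 5-style update: a high false rejection rate
    (speech being missed) lowers the threshold, a high false acceptance rate
    (noise being accepted) raises it. eps is the step size; alpha controls
    which error type is emphasised, as described above."""
    return theta - eps * (alpha * frr - (1.0 - alpha) * far)

# One adaptation cycle (steps S1-S6 of FIG. 2); helper names are illustrative:
# labels = first_stage_vad(x, theta)
# y, placed = superimpose_standard_speech(x, spans_of(labels), voices, min_len)
# frr, far = mismatch_rates(first_stage_vad(y, theta), truth_from(placed))
# theta = update_threshold(theta, frr, far)
```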
  • As described above, the speech segment determination device 1 superimposes the standard speech on the time series of the input sound in the sections that the first speech segment determination unit 103, using the threshold, determined to be non-speech segments. The second speech segment determination unit 106 then uses the threshold again to divide the time series on which the standard speech has been superimposed into speech segments and non-speech segments. The speech segment determination device 1 can judge whether the threshold is appropriate by comparing the result of the second speech segment determination with the correct-answer information known from the standard speech, and can thereby update the threshold used for speech segment determination to an appropriate value.
  • FIG. 4 is a diagram showing the configuration of the second exemplary embodiment of the present invention.
  • A speech segment determination device 1 according to the second exemplary embodiment of the present invention includes a gain/frequency characteristic acquisition unit 201 and a gain/frequency characteristic correction unit 202 in addition to the configuration of the first exemplary embodiment.
  • The gain/frequency characteristic acquisition unit 201 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The gain/frequency characteristic acquisition unit 201 is configured to acquire at least one of gain information and frequency characteristic information from the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103.
  • The gain/frequency characteristic acquisition unit 201 can acquire the gain by, for example, the following methods. The gain/frequency characteristic acquisition unit 201 may calculate the amplitude power for each unit time using (Equation 1) above and take the average value over the entire speech segment.
  • Alternatively, the gain/frequency characteristic acquisition unit 201 may take the maximum value of the amplitude power in the speech segment.
  • Frequency characteristic acquisition methods for the gain/frequency characteristic acquisition unit 201 include the following.
  • The gain/frequency characteristic acquisition unit 201 may perform a Fourier transform for each unit time to obtain the spectral power of each frequency band, and take the average value of each band over the entire speech segment.
  • Alternatively, the gain/frequency characteristic acquisition unit 201 may take the maximum value of the spectral power in the speech segment for each frequency band.
  • The gain/frequency characteristic correction unit 202 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The gain/frequency characteristic correction unit 202 is configured to correct the gain and frequency characteristics of the standard speech using at least one of the gain information and frequency characteristic information acquired by the gain/frequency characteristic acquisition unit 201.
  • For example, the gain/frequency characteristic correction unit 202 may correct the gain of the standard speech by multiplying it by a factor such that the gain obtained in advance for the standard speech becomes equal to the gain obtained for the input sound.
  • Frequency characteristic correction methods for the gain/frequency characteristic correction unit 202 include the following. For example, the gain/frequency characteristic correction unit 202 may correct the frequency characteristics by multiplying the standard speech by a factor for each frequency band such that the frequency characteristics obtained in advance for the standard speech become equal to the frequency characteristics obtained for the input sound.
  • Next, the operation of the present embodiment will be described with reference to the flowchart of FIG. 5. In the second exemplary embodiment, after the processing up to step S2 of FIG. 2 of the first embodiment is performed, a different process follows according to the determination result of the first speech segment determination unit 103.
  • The gain/frequency characteristic acquisition unit 201 acquires gain/frequency characteristic information from the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103 (step S3 in FIG. 5).
  • Then the gain/frequency characteristic correction unit 202 corrects the standard speech using the gain/frequency characteristic information acquired by the gain/frequency characteristic acquisition unit 201 (step S4 in FIG. 5).
  • Next, the standard speech superimposing unit 105 superimposes the corrected standard speech on the time series of the input sound determined to be a non-speech segment by the first speech segment determination unit 103 (step S5 in FIG. 5).
  • Thereafter, the speech segment determination device 1 performs the same processing as that from step S4 onward in FIG. 2 of the first embodiment.
  • As described above, the speech segment determination device 1 acquires gain/frequency characteristic information from the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103 and corrects the standard speech accordingly. The standard speech can thus be brought closer to the speaker characteristics and acoustic environment of the user's utterance, so the speech segment determination device 1 of the present embodiment can update the threshold with higher accuracy (a sketch of one possible correction follows).
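  • A sketch of one way to perform both corrections, assuming the averaging variants described above (match the overall amplitude power, then match the average spectrum band by band); all names and the 80-sample frame are illustrative:

```python
import numpy as np

def avg_band_spectrum(x, frame_len=80):
    """Average magnitude spectrum over all unit-time frames of a segment."""
    n = len(x) // frame_len
    frames = np.asarray(x, dtype=float)[:n * frame_len].reshape(n, frame_len)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def correct_standard_speech(standard, input_speech, frame_len=80):
    """Scale the standard speech so its gain matches the input speech segment,
    then multiply each frequency band by a factor so that its average spectrum
    matches too."""
    standard = np.asarray(standard, dtype=float)
    input_speech = np.asarray(input_speech, dtype=float)
    gain = np.sqrt(np.mean(input_speech ** 2) / (np.mean(standard ** 2) + 1e-12))
    s = standard * gain
    band_factor = avg_band_spectrum(input_speech, frame_len) / (
        avg_band_spectrum(s, frame_len) + 1e-12)
    n = len(s) // frame_len
    frames = s[:n * frame_len].reshape(n, frame_len)
    shaped = np.fft.irfft(np.fft.rfft(frames, axis=1) * band_factor,
                          n=frame_len, axis=1)
    out = s.copy()
    out[:n * frame_len] = shaped.ravel()
    return out
```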
  • FIG. 6 is a diagram showing the configuration of the third exemplary embodiment of the present invention.
  • Referring to FIG. 6, the speech segment determination device 1 includes a feature extraction unit 301, a standard speech storage unit 302, and a standard speech selection unit 303 in addition to the configuration of the first exemplary embodiment. Since the other configurations are the same as those of the first embodiment, description thereof is omitted.
  • The feature extraction unit 301 is configured to obtain, from the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103, a feature value for discriminating the speaker characteristics (the individuality information that each person has) and the acoustic environment.
  • The feature value is one obtained from the time series of the input sound, such as a spectrum or a cepstrum.
  • The feature extraction unit 301 may calculate the feature value for each unit time of the time series of the input sound and take the average value over the entire speech segment.
  • The standard speech storage unit 302 is realized by a storage device such as an optical disk device or a magnetic disk device, and stores a plurality of standard speeches with different feature values, recorded under different speaker characteristics and acoustic environments.
  • The feature values here are, as described above, those obtained from a time series of sound, such as a spectrum or a cepstrum.
  • The standard speech storage unit 302 may store each standard speech in advance in association with its feature value.
  • The standard speech selection unit 303 is realized by a dedicated device such as a logic circuit, by the CPU of an information processing device that executes a program, or the like.
  • The standard speech selection unit 303 is configured to select, from the standard speech storage unit 302, a standard speech close to the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103.
  • For example, the standard speech selection unit 303 may select from the standard speech storage unit 302 the standard speech whose feature value is closest to the feature value extracted from the time series of the input sound by the feature extraction unit 301.
  • Alternatively, the standard speech selection unit 303 may obtain the similarity between the time series of the input sound and each standard speech with a predetermined correlation function, and select a standard speech whose similarity to the input sound is equal to or greater than a predetermined value.
  • The standard speech selection unit 303 may also select a standard speech from the standard speech storage unit 302 based on a similarity calculated using another known method; the selection method is not particularly limited. Next, the operation of the present embodiment will be described with reference to the flowchart of FIG. 7. In the third exemplary embodiment of the present invention, after the processing up to step S2 of FIG. 2 of the first embodiment is performed, a different process follows according to the determination result of the first speech segment determination unit 103.
  • The feature extraction unit 301 obtains, from the time series of the input sound, a feature value for discriminating the speaker characteristics and acoustic environment (step S3 in FIG. 7). Then, the standard speech selection unit 303 selects, from the plurality of standard speeches stored in the standard speech storage unit 302, a standard speech close to the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103 (step S4 in FIG. 7). Since the subsequent steps are the same as the operations in the first embodiment, description thereof is omitted.
  • As described above, the speech segment determination device 1 obtains, from the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103, a feature value for discriminating the speaker characteristics and acoustic environment, and selects from the plurality of standard speeches one close to that time series. As a result, the speech segment determination device 1 of the present embodiment can bring the standard speech closer to the speaker characteristics and acoustic environment of the user's utterance, and can therefore update the threshold with higher accuracy (a sketch of one possible selection follows).
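  • A sketch of the selection, using an average cepstrum as the speaker/environment feature and Euclidean distance as the similarity, which are only two of the possibilities the description leaves open:

```python
import numpy as np

def cepstral_mean(x, frame_len=80, n_ceps=12):
    """Average cepstrum of a segment, a simple proxy for speaker
    characteristics and acoustic environment."""
    n = len(x) // frame_len
    frames = np.asarray(x, dtype=float)[:n * frame_len].reshape(n, frame_len)
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-12)
    ceps = np.fft.irfft(log_spec, n=frame_len, axis=1)[:, :n_ceps]
    return ceps.mean(axis=0)

def select_standard_voice(input_speech, standard_voices):
    """Return the stored standard voice whose feature is closest to that of
    the input speech segment."""
    target = cepstral_mean(input_speech)
    return min(standard_voices,
               key=lambda s: np.linalg.norm(cepstral_mean(s) - target))
```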
  • FIG. 8 is a diagram showing the configuration of the fourth exemplary embodiment of the present invention.
  • Referring to FIG. 8, the speech segment determination device 1 includes a speech recognition unit 401 and a recognition result comparison unit 402 in addition to the configuration of the first exemplary embodiment. Since the other configurations are the same as those of the first embodiment, description thereof is omitted.
  • The speech recognition unit 401 is configured to perform speech recognition on the time series of the input sound determined to be a speech segment by the first speech segment determination unit 103, and to obtain the word string corresponding to each speech segment in the time series of the input sound.
  • The recognition result comparison unit 402 is configured to compare the degree of coincidence (or the degree of mismatch) between the speech recognition result obtained by the speech recognition unit 401 and the sections determined to be speech segments by the first speech segment determination unit 103.
  • First, the first speech segment determination unit 103 determines the time-series speech segments and non-speech segments of the input sound (steps S1 and S2 in FIG. 9).
  • The subsequent processing differs depending on whether a portion of the input sound was determined to be a speech segment or a non-speech segment.
  • For the portions determined to be non-speech segments, the speech segment determination device 1 performs the same processing as step S3 and the subsequent steps in FIG. 2 described in the first embodiment.
  • For the portions determined to be speech segments, the speech segment determination device 1 performs the following processing.
  • The speech recognition unit 401 performs speech recognition on the time series of the input sound determined to be speech segments, and obtains the word string corresponding to each speech segment (steps S7 and S8 in FIG. 9).
  • The speech recognition unit 401 may add a margin before and after the time series of the input sound determined to be a speech segment. In the example of FIG. 3, the word string corresponding to the first speech segment is "Hello" and the word string corresponding to the next speech segment is "This is Mori".
  • The speech recognition unit 401 determines from which time point to which time point of the input sound the recognized word string corresponds, and outputs this correspondence information for the word string that is the recognition result to the recognition result comparison unit 402.
  • The recognition result comparison unit 402 compares the acquired correspondence information with the speech segments determined by the first speech segment determination unit 103 (step S9 in FIG. 9).
  • The recognition result comparison unit 402 performs the comparison using, for example, the false rejection rate (FRR) defined by (Equation 3) or the false acceptance rate (FAR) defined by (Equation 4).
  • The threshold update unit 108 updates the threshold based on the determination result of the second speech segment determination unit 106 and the comparison result of the recognition result comparison unit 402 (step S6 in FIG. 9). At this time, the threshold update unit 108 may adopt, of the mismatch rates from the second speech segment determination unit 106 and from the recognition result comparison unit 402, the one with the larger (or the smaller) false rejection rate or false acceptance rate (a sketch of this combination follows).
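  • The combination of the two mismatch estimates could, for instance, look like the following sketch, reusing the hypothesised update form from the first embodiment; whether the larger or the smaller rate is adopted is a design choice the text leaves open:

```python
def combined_update(theta, rates_second_stage, rates_recognition,
                    eps=0.05, alpha=0.5, pick=max):
    """Adopt, per error type, the larger (pick=max, conservative) or smaller
    (pick=min) of the mismatch rates from the second-stage determination and
    from the recognition-result comparison, then apply the same hypothesised
    Equation 5-style update as before. Each rates_* argument is (FRR, FAR)."""
    frr = pick(rates_second_stage[0], rates_recognition[0])
    far = pick(rates_second_stage[1], rates_recognition[1])
    return theta - eps * (alpha * frr - (1.0 - alpha) * far)
```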
  • FIG. 10 is a diagram showing the configuration of the fifth exemplary embodiment of the present invention.
  • A speech segment determination device 1 according to the fifth exemplary embodiment of the present invention includes a first speech segment determination unit 103, a second speech segment determination unit 106, and a threshold update unit 108.
  • That is, there is provided a speech segment determination device comprising: first speech segment determination means for determining the time-series speech segments and non-speech segments of an input sound by comparing the value of a feature obtained from the time series of the input sound with a threshold; second speech segment determination means for determining the time-series speech segments and non-speech segments after the standard speech has been superimposed on a section determined to be a non-speech segment by the first speech segment determination means, by comparing the threshold with the feature value obtained from the time series after the superposition; and threshold update means for updating the threshold based on the determination result of the second speech segment determination means.
  • According to this configuration, the parameters used for speech segment determination are updated without imposing a burden on the user, and a speech segment determination device, a speech segment determination method, and a speech segment determination program that are robust against noise are provided.
  • This application claims priority based on Japanese Patent Application No. 2010-179180 filed on August 10, 2010, the entire disclosure of which is incorporated herein.
  • (Appendix 1) A speech segment determination device comprising: first speech segment determination means for determining a time-series speech segment (first speech segment) and non-speech segment (first non-speech segment) of an input sound by comparing the time-series feature value of the input sound with a threshold; second speech segment determination means for determining the time-series speech segments and non-speech segments of the first non-speech segment after superposition of the standard speech, by comparing with the threshold the time-series feature value of the first non-speech segment after the standard speech acquired from standard speech storage means has been superimposed on the time series of the first non-speech segment; and threshold updating means for updating the threshold so that the mismatch rate between the determination result of the second speech segment determination means and the correct answer calculated from the standard speech becomes small.
  • (Appendix 2) The speech segment determination device according to Appendix 1, further comprising gain/frequency characteristic correction means for correcting the gain or frequency characteristic value of the standard speech superimposed on the non-speech segment, using at least one of the gain and frequency characteristics acquired from the time series of the input sound of the first speech segment, so that the gain or frequency characteristics of the standard speech become equal to those of the input sound.
  • (Appendix 3) The speech segment determination device according to Appendix 1 or 2, further comprising standard speech selection means for selecting, from standard speech storage means that stores a plurality of standard speeches each having a different feature value, a standard speech whose feature value is similar to the time-series feature value of the input sound of the first speech segment, as the standard speech to be superimposed on the first non-speech segment.
  • (Appendix 4) The speech segment determination device according to any one of Appendices 1 to 3, further comprising: speech recognition means for obtaining the word string segments corresponding to the time series of the input sound of the first speech segment; and determination result comparison means for determining the mismatch rate between the first speech segment and the word string segments obtained by the speech recognition means, wherein the threshold updating means updates the threshold based on the mismatch rate determined by the determination result comparison means and the mismatch rate between the determination of the second speech segment determination means and the correct answer calculated from the standard speech.
  • (Appendix 5) A speech segment determination program causing a computer to execute: a first speech segment determination step of determining a time-series speech segment (first speech segment) and non-speech segment (first non-speech segment) of an input sound by comparing the time-series feature value of the input sound with a threshold; a second speech segment determination step of determining the time-series speech segments and non-speech segments of the first non-speech segment after superposition of the standard speech, by comparing with the threshold the time-series feature value of the first non-speech segment after the standard speech acquired from standard speech storage means has been superimposed on the time series of the first non-speech segment; and a threshold update step of updating the threshold so that the mismatch rate between the determination result in the second speech segment determination step and the correct answer calculated from the standard speech becomes small.
  • (Appendix 6) The speech segment determination program according to Appendix 5, further causing the computer to execute a step of correcting the gain or frequency characteristic value of the standard speech superimposed on the non-speech segment, using at least one of the gain and frequency characteristics acquired from the time series of the input sound of the first speech segment, so that the gain or frequency characteristics of the standard speech become equal to those of the input sound.
  • (Appendix 7) The speech segment determination program according to Appendix 5 or 6, further causing the computer to execute a step of selecting, from standard speech storage means that stores a plurality of standard speeches each having a different feature value, a standard speech whose feature value is similar to the time-series feature value of the input sound of the first speech segment, as the standard speech to be superimposed on the non-speech segment.
  • (Appendix 8) The speech segment determination program according to any one of Appendices 5 to 7, further causing the computer to execute: a speech recognition step of obtaining the word string segments corresponding to the time series of the input sound of the first speech segment; a determination result comparison step of determining the mismatch rate between the first speech segment and the word string segments; and a threshold update step of updating the threshold based on the mismatch rate determined in the determination result comparison step and the mismatch rate between the determination in the second speech segment determination step and the correct answer calculated from the standard speech.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)

Abstract

The present invention provides a speech segment determination device that is robust against noise and updates the parameters used in speech segment determination without burdening the user; a speech segment determination method and a speech segment determination program are also described. The speech segment determination device comprises the following: first speech segment determination means that determines a speech segment (first speech segment) and a non-speech segment (first non-speech segment) in a time series of an input sound by comparing a threshold value with a feature value of the time series of the input sound; second speech segment determination means that determines speech segments and non-speech segments in the time series of the first non-speech segment after a standard speech, acquired from standard speech storage means, has been superimposed on that time series, this determination being performed by comparing the threshold value with a feature value of the time series of the first non-speech segment after the standard speech has been superimposed; and threshold updating means that updates the threshold value such that the mismatch rate between the determination result of the second speech segment determination means and the correct answer calculated from the standard speech is reduced.
PCT/JP2011/068003 2010-08-10 2011-08-02 Speech segment determination device, speech segment determination method, and speech segment determination program WO2012020717A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/814,141 US9293131B2 (en) 2010-08-10 2011-08-02 Voice activity segmentation device, voice activity segmentation method, and voice activity segmentation program
JP2012528661A JP5725028B2 (ja) 2010-08-10 2011-08-02 Speech segment determination device, speech segment determination method, and speech segment determination program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010179180 2010-08-10
JP2010-179180 2010-08-10

Publications (1)

Publication Number Publication Date
WO2012020717A1 true WO2012020717A1 (fr) 2012-02-16

Family

ID=45567684

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/068003 WO2012020717A1 (fr) 2010-08-10 2011-08-02 Speech segment determination device, speech segment determination method, and speech segment determination program

Country Status (3)

Country Link
US (1) US9293131B2 (fr)
JP (1) JP5725028B2 (fr)
WO (1) WO2012020717A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016145940A (ja) * 2015-02-09 2016-08-12 沖電気工業株式会社 Target sound segment detection device and program, noise estimation device and program, and SNR estimation device and program

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5910379B2 (ja) * 2012-07-12 2016-04-27 ソニー株式会社 Information processing device, information processing method, display control device, and display control method
JP6235280B2 (ja) * 2013-09-19 2017-11-22 株式会社東芝 Simultaneous speech processing device, method, and program
US9747368B1 (en) * 2013-12-05 2017-08-29 Google Inc. Batch reconciliation of music collections
US9916846B2 (en) * 2015-02-10 2018-03-13 Nice Ltd. Method and system for speech detection
TWI682385B (zh) * 2018-03-16 2020-01-11 緯創資通股份有限公司 Voice service control device and method thereof
CN113393865B (zh) * 2020-03-13 2022-06-03 阿里巴巴集团控股有限公司 Power consumption control, mode configuration and VAD method, device, and storage medium
CN114363280B (zh) * 2022-03-18 2022-06-17 深圳市欧乐智能实业有限公司 Mobile phone chat assistance system based on summarized transmission of multiple speech segments

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10210075A (ja) * 1997-01-20 1998-08-07 Logic Corp Sound presence detection device and method
WO2010070840A1 (fr) * 2008-12-17 2010-06-24 日本電気株式会社 Sound detection device, sound detection program, and parameter adjustment method
WO2011070972A1 (fr) * 2009-12-10 2011-06-16 日本電気株式会社 Voice recognition system, voice recognition method, and voice recognition program

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1116300A (fr) * 1977-12-28 1982-01-12 Hiroaki Sakoe Spoken-utterance identification/comparison system
US6453291B1 (en) * 1999-02-04 2002-09-17 Motorola, Inc. Apparatus and method for voice activity detection in a communication system
US6249757B1 (en) * 1999-02-16 2001-06-19 3Com Corporation System for detecting voice activity
JP3878482B2 (ja) * 1999-11-24 2007-02-07 富士通株式会社 Voice detection device and voice detection method
US20020103636A1 (en) * 2001-01-26 2002-08-01 Tucker Luke A. Frequency-domain post-filtering voice-activity detector
US7366667B2 (en) * 2001-12-21 2008-04-29 Telefonaktiebolaget Lm Ericsson (Publ) Method and device for pause limit values in speech recognition
JP4348970B2 (ja) * 2003-03-06 2009-10-21 ソニー株式会社 Information detection device and method, and program
JP4798601B2 (ja) 2004-12-28 2011-10-19 株式会社国際電気通信基礎技術研究所 Speech segment detection device and speech segment detection program
EP1681670A1 (fr) * 2005-01-14 2006-07-19 Dialog Semiconductor GmbH Voice activation
JP2007017620A (ja) 2005-07-06 2007-01-25 Kyoto Univ Utterance segment detection device, computer program therefor, and recording medium
US8175868B2 (en) * 2005-10-20 2012-05-08 Nec Corporation Voice judging system, voice judging method and program for voice judgment
KR100930584B1 (ko) * 2007-09-19 2009-12-09 한국전자통신연구원 Method and apparatus for speech discrimination using voiced-sound features of human speech
US9026443B2 (en) * 2010-03-26 2015-05-05 Nuance Communications, Inc. Context based voice activity detection sensitivity

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10210075A (ja) * 1997-01-20 1998-08-07 Logic Corp Sound presence detection device and method
WO2010070840A1 (fr) * 2008-12-17 2010-06-24 日本電気株式会社 Sound detection device, sound detection program, and parameter adjustment method
WO2011070972A1 (fr) * 2009-12-10 2011-06-16 日本電気株式会社 Voice recognition system, voice recognition method, and voice recognition program

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016145940A (ja) * 2015-02-09 2016-08-12 沖電気工業株式会社 Target sound segment detection device and program, noise estimation device and program, and SNR estimation device and program

Also Published As

Publication number Publication date
US9293131B2 (en) 2016-03-22
US20130132078A1 (en) 2013-05-23
JP5725028B2 (ja) 2015-05-27
JPWO2012020717A1 (ja) 2013-10-28

Similar Documents

Publication Publication Date Title
JP5725028B2 (ja) Speech segment determination device, speech segment determination method, and speech segment determination program
JP5621783B2 (ja) Speech recognition system, speech recognition method, and speech recognition program
US9536525B2 (en) Speaker indexing device and speaker indexing method
JP4246792B2 (ja) Voice quality conversion device and voice quality conversion method
US9099082B2 (en) Apparatus for correcting error in speech recognition
JP5949550B2 (ja) Speech recognition device, speech recognition method, and program
CN107871499B (zh) Speech recognition method and system, computer device, and computer-readable storage medium
US20120065978A1 (en) Voice processing device
JP5717097B2 (ja) Hidden Markov model learning device for speech synthesis and speech synthesis device
US7505950B2 (en) Soft alignment based on a probability of time alignment
CN115428066A (zh) Synthesized speech processing
JP5234117B2 (ja) Voice detection device, voice detection program, and parameter adjustment method
JP4858663B2 (ja) Speech recognition method and speech recognition device
Mporas et al. A hybrid architecture for automatic segmentation of speech waveforms
JP4464797B2 (ja) Speech recognition method, apparatus implementing the method, program, and recording medium therefor
JP6811865B2 (ja) Speech recognition device and speech recognition method
JP2005321539A (ja) Speech recognition method, device and program therefor, and recording medium
JP4242320B2 (ja) Speech recognition method, device and program therefor, and recording medium
EP2107554A1 (fr) Multilingual codebooks for speech recognition
JP5229738B2 (ja) Speech recognition device and speech conversion device
JP2015040931A (ja) Signal processing device, speech processing device, signal processing method, and speech processing method
JP2006235298A (ja) Speech recognition network generation method, speech recognition device, and program therefor
JPWO2003042648A1 (ja) Speech encoding device, speech decoding device, speech encoding method, and speech decoding method
Müller et al. Invariant integration features combined with speaker-adaptation methods.
JP2005004018A (ja) Speech recognition device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11816382

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 13814141

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2012528661

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11816382

Country of ref document: EP

Kind code of ref document: A1