WO2016152132A1 - Speech processing device, speech processing system, speech processing method, and recording medium - Google Patents


Info

Publication number
WO2016152132A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
segments
voice
unit
speech
Prior art date
Application number
PCT/JP2016/001593
Other languages
French (fr)
Japanese (ja)
Inventor
孝文 越仲
鈴木 隆之
Original Assignee
日本電気株式会社
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2017507495A (patent JP6784255B2)
Publication of WO2016152132A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval

Definitions

  • the present invention relates to a voice processing device, a voice processing system, a voice processing method, and a recording medium that extract frequent patterns from voice data.
  • in fingerprint identification, which is a typical example, fingerprint images collected at the crime scene are sequentially compared with a large number of known fingerprint images to estimate who was involved in the crime.
  • a technique similar to fingerprint examination that deals with voice instead is called voiceprint examination or voice examination.
  • Patent Document 1 describes a technique for extracting speech data of unknown words that are candidate keywords to be registered in a speech recognition dictionary from speech data.
  • the technique described in Patent Document 1 detects, as an utterance section, a section in which the speech power value of the speech data remains above a threshold th1 for at least a certain time, and then divides each utterance section into sections in which the power value remains above a threshold th2 for at least a certain time.
  • the technique described in Patent Document 1 acquires a phoneme string from the divided speech data, performs clustering, calculates an evaluation value, detects an unknown word, and registers it in the dictionary.
  • Patent Document 2 describes a technique for determining a factor causing misrecognition and notifying a user.
  • the technique described in Patent Document 2 divides a mel cepstrum coefficient (Mel-Frequency Cepstrum Coefficients; hereinafter referred to as “MFCC”) vector sequence extracted by a feature extraction unit into segments for each phoneme using a set of standard models.
  • the technique described in Patent Document 2 investigates the cause of misrecognition, creates a character string of a message to be presented to the user according to the analysis result, and notifies the user by displaying the message on a display.
  • in the technique described in Patent Document 1, an unknown word that is a keyword candidate can be selected, but a phrase containing a sentence (for example, a sentence such as "Prepare a ransom") cannot be selected.
  • in the technique described in Patent Document 2, a vector sequence for each segment that is erroneously recognized can be analyzed, but a desired phrase cannot be selected.
  • the techniques described in Patent Documents 1 and 2 have a problem that a desired phrase cannot be selected.
  • An object of the present invention is to provide an audio processing device, an audio processing system, an audio processing method, and a recording medium that can solve the above-described problems and can select a desired phrase.
  • a speech processing apparatus according to one aspect of the present invention includes: first generation means for generating, from speech data, a plurality of segments such that adjacent segments at least partially overlap; second generation means for classifying the plurality of segments based on phoneme similarity to generate clusters; selection means for selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and extraction means for extracting the segments included in the selected cluster.
  • the speech processing method generates a plurality of segments in which adjacent segments at least partially overlap from speech data, classifies the plurality of segments based on phoneme similarity, and generates a cluster. Based on the size of the cluster, a cluster satisfying a predetermined condition is selected, and segments included in the selected cluster are extracted.
  • a recording medium according to one aspect of the present invention is a computer-readable recording medium storing a program that causes a computer to execute: a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap; a process of classifying the plurality of segments based on phoneme similarity to generate clusters; a process of selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and a process of extracting the segments included in the selected cluster.
  • the present invention has an effect that a desired phrase can be selected in a voice processing device, a voice processing system, a voice processing method, and a program.
  • in the voice examination method related to the present invention, for example, a telephone call containing a kidnapper's ransom demand or a terrorist's crime notice is recorded, and the recorded voice is compared with known voices in an attempt to identify the caller.
  • unlike fingerprints, which remain unchanged throughout life, voice changes each time depending on what is spoken. Therefore, in the voice examination method, a portion (section) of the voice in which the same content is spoken is cut out and compared. For example, in a kidnapper's ransom demand the phrase "prepare the money" is expected to appear frequently, so such a phrase is found and cut out and compared with another voice in which "prepare the money" is spoken.
  • FIG. 1 is a block diagram illustrating a configuration example of a voice processing device 10 according to the first embodiment of the present invention.
  • the speech processing apparatus 10 includes a generation unit 11, a clustering unit 12, a selection unit 13, and an extraction unit 14.
  • the generation unit 11 is also referred to as a first generation unit.
  • the clustering unit 12 is also referred to as a second generation unit.
  • the generation unit 11 generates, from the audio data stored in an external storage device, a plurality of segments in which adjacent segments at least partially overlap. For example, the generation unit 11 subdivides the audio data stored in the external storage device into short time units and generates a plurality of segments using the subdivided audio data. The time length of the plurality of segments generated by the generation unit 11 may be a fixed time length. The generation unit 11 may also divide one piece of audio data a plurality of times with different time lengths and generate segments of various time lengths using the divided audio data.
  • the clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster.
  • the selection unit 13 selects at least one cluster from the generated clusters based on the size of each cluster.
  • the extraction unit 14 extracts segments included in the selected cluster.
  • the size of a cluster is, for example, the total time length of the segments, the result obtained by multiplying the number of appearances of the segment content (also referred to as a phrase) by the segment length, or the like.
  • FIG. 2 is a schematic block diagram showing a configuration example of the computer 1000 in each embodiment and specific example of the present invention.
  • the computer 1000 includes a CPU (Central Processing Unit) 1001, a main storage device 1002, an auxiliary storage device 1003, an interface 1004, an input device 1005, and a display device 1006.
  • the voice processing apparatus 10 and the like of each embodiment and a specific example are mounted on a computer 1000. Operations of the voice processing device 10 and the like are stored in the auxiliary storage device 1003 in the form of a program.
  • the CPU 1001 reads out the program from the auxiliary storage device 1003, develops it in the main storage device 1002, and executes the above processing according to the program. For example, the CPU 1001 reads out the above program from the auxiliary storage device 1003 and develops it in the main storage device 1002, thereby realizing the functions of the generation unit 11, clustering unit 12, selection unit 13, and extraction unit 14.
  • the auxiliary storage device 1003 is an example of a tangible medium that is not temporary.
  • other examples of the non-transitory tangible medium include a magnetic disk, a magneto-optical disk, a CD (Compact Disc)-ROM (Read Only Memory), a DVD (Digital Versatile Disc)-ROM, and a semiconductor memory connected via the interface 1004.
  • when this program is distributed to the computer 1000 via a communication line, the computer 1000 that has received the distribution may load the program into the main storage device 1002 and execute the above processing.
  • the interface 1004 is connected to the CPU 1001 and is connected to a network or an external storage medium. External data may be taken into the CPU 1001 via the interface 1004.
  • the input device 1005 is, for example, a keyboard, a mouse, a touch panel, or a microphone.
  • the display device 1006 displays a screen corresponding to drawing data processed by a CPU 1001 or a GPU (Graphics Processing Unit) (not shown) such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display. It is a device to do. Note that the hardware configuration illustrated in FIG. 2 is merely an example, and each unit illustrated in FIG. 2 may be configured with independent logic circuits.
  • the program may be for realizing a part of the above-described processing.
  • the program may be a differential program that realizes the above-described processing in combination with another program already stored in the auxiliary storage device 1003.
  • FIG. 3 is a flowchart showing an operation example of the speech processing apparatus 10 according to the first embodiment of the present invention.
  • the generation unit 11 generates a plurality of segments from the audio data stored in the external storage device (step S101). At this time, the generation unit 11 generates a plurality of segments so that adjacent segments have at least temporal overlap.
  • the time length of the segment may be a constant value in the range of 1 second to several seconds, for example, depending on the assumed time length of the phrase.
  • the generation unit 11 may generate segments of various time lengths by dividing the audio data a plurality of times with different time lengths. The generation unit 11 may also divide the audio data at predetermined change points or silent sections using the method described in Non-Patent Document 1 and generate variable-length segments using the plurality of divided audio data.
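As an illustration of this segment generation (not part of the patent text), a minimal Python sketch is shown below; the window length, hop length, and sampling rate are assumed values, and the overlap between adjacent segments comes from using a hop shorter than the window.

```python
import numpy as np

def make_segments(samples, sample_rate, seg_len_s=2.0, hop_s=0.5):
    """Cut audio into fixed-length segments so that adjacent segments overlap.

    seg_len_s and hop_s are illustrative values; a hop shorter than the
    segment length produces the partial overlap described above.
    """
    seg_len = int(seg_len_s * sample_rate)
    hop = int(hop_s * sample_rate)
    segments = []
    for start in range(0, max(len(samples) - seg_len, 0) + 1, hop):
        segments.append((start / sample_rate, samples[start:start + seg_len]))
    return segments  # list of (start time in seconds, waveform slice)

# Several passes with different segment lengths, as the text suggests:
# audio = np.asarray(...)  # mono waveform loaded elsewhere
# all_segments = []
# for L in (1.0, 2.0, 3.0):
#     all_segments += make_segments(audio, 8000, seg_len_s=L, hop_s=L / 4)
```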
  • the clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster (step S102). That is, the clustering unit 12 clusters a plurality of segments.
  • the clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11, and generates a cluster in which the segments with high similarity are collected.
  • the method described in Non-Patent Document 1 can be used.
  • the similarity index is an index for measuring the similarity of phonemes constituting a segment.
  • the similarity index is, for example, an index that uses statistics of the acoustic features, such as the Bhattacharyya distance calculated from the mean and variance of the acoustic feature sequence, the Kullback-Leibler divergence, or the log-likelihood ratio. These indices do not consider the order of the acoustic feature sequence within a segment.
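As one hedged example of such a statistics-based index (illustrative only, not code from the patent), the Bhattacharyya distance between two segments can be computed from the per-segment mean and variance of their feature sequences, assuming diagonal covariances.

```python
import numpy as np

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariance.

    mean*, var* are 1-D arrays holding the mean and variance of a segment's
    acoustic feature (e.g. MFCC) sequence; the order of frames is ignored.
    """
    v = (var1 + var2) / 2.0
    term_mean = 0.125 * np.sum((mean1 - mean2) ** 2 / v)
    term_var = 0.5 * (np.sum(np.log(v))
                      - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
    return term_mean + term_var
```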
  • as the similarity index, an index that considers order, that is, temporal order, may also be used.
  • the method using the similarity index is, for example, a DP matching method that calculates the degree of similarity by obtaining an optimal correspondence of each acoustic feature amount between segments by dynamic programming (hereinafter referred to as “DP”).
  • DP dynamic programming
  • the acoustic feature amount is, for example, MFCC.
  • MFCC is widely used for voice recognition and the like.
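As a hedged illustration of such DP matching (this is generic dynamic time warping, not the patent's own implementation), two MFCC sequences can be aligned and scored as follows.

```python
import numpy as np

def dp_matching_distance(feats_a, feats_b):
    """Align two MFCC sequences (frames x dims) by dynamic programming and
    return the average per-frame distance along the optimal path."""
    n, m = len(feats_a), len(feats_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(feats_a[i - 1] - feats_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)

# A smaller distance means the two segments are more likely to contain the same
# phrase; a similarity can be defined, for example, as the negative distance.
```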
  • when the clustering in the clustering unit 12 has converged (Yes in step S103), the selection unit 13 selects a cluster that satisfies a predetermined condition from the clusters generated by the clustering unit 12, based on the size of each cluster (step S104). In this selection, the selection unit 13 compares cluster sizes from the viewpoint of finding frequently occurring phrases, and selects at least one cluster in descending order of size. Examples of the predetermined condition include a larger number of segments, a longer total time length of the segments, and the like. That is, the selection unit 13 selects, for example, a cluster including more segments or a cluster having a longer total segment length as a cluster that satisfies the predetermined condition.
  • the case where the clustering converges is, for example, a situation where Step S101 and Step S102 are executed a predetermined number of times, a situation where the increase or decrease of the predetermined evaluation value related to clustering is a predetermined value or less, and the like.
  • the case where clustering converges may also be a situation in which segments no longer move between clusters, which corresponds to the situation in which the change in the predetermined evaluation value related to clustering is equal to or less than a certain value.
  • the extraction unit 14 extracts segments from the one or more segments included in the cluster selected by the selection unit 13 (step S105). Thereby, the extraction unit 14 can extract, from the audio data, the segments of the portions corresponding to a desired phrase.
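A minimal sketch of steps S104 and S105 follows; the data classes and the choice of total segment length as the size measure are illustrative assumptions, not prescribed by the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    start: float      # seconds into the source audio
    duration: float   # seconds

@dataclass
class Cluster:
    phrase_id: int
    segments: List[Segment] = field(default_factory=list)

def cluster_size(cluster: Cluster) -> float:
    # One of the size criteria mentioned above: total time length of segments.
    # The number of segments, or count x length, could be used instead.
    return sum(seg.duration for seg in cluster.segments)

def select_and_extract(clusters, top_k=1):
    """Step S104: pick the largest cluster(s); step S105: return their segments."""
    ranked = sorted(clusters, key=cluster_size, reverse=True)
    return [seg for c in ranked[:top_k] for seg in c.segments]
```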
  • if the clustering in the clustering unit 12 has not converged (No in step S103), the process returns to step S101.
  • step S101 and step S102 depend on each other and may be repeated a predetermined number of times or until convergence.
  • FIG. 4 is a diagram illustrating an example of a method in which the speech processing apparatus 10 extracts a phrase using the HMM. That is, the speech processing apparatus 10 learns an HMM as shown in FIG. 4 based on the maximum likelihood estimation method using speech data stored in the external storage device.
  • as a result, a one-way (left-to-right) HMM expressing the first phrase (phrase 1 in FIG. 4), the second phrase (phrase 2 in FIG. 4), ..., and the Nth phrase (phrase N in FIG. 4) is automatically formed, and at the same time the segments belonging to each phrase are also obtained.
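As an illustration only, the sketch below fits one left-to-right Gaussian HMM to the segments of a single phrase cluster by maximum-likelihood (EM) training; the hmmlearn library, the number of states, and the transition initialization are assumptions, not named in the patent.

```python
import numpy as np
from hmmlearn import hmm  # assumed dependency, not specified in the patent

def fit_left_to_right_hmm(segment_mfccs, n_states=5):
    """Fit one left-to-right Gaussian HMM to the MFCC sequences of the segments
    belonging to a single phrase cluster (maximum-likelihood / EM training).

    segment_mfccs: list of arrays, each of shape (frames, n_mfcc).
    """
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, init_params="mc")
    # Left-to-right topology: start in state 0, only self-loops and forward steps.
    model.startprob_ = np.r_[1.0, np.zeros(n_states - 1)]
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    model.transmat_ = trans
    X = np.vstack(segment_mfccs)
    lengths = [len(s) for s in segment_mfccs]
    model.fit(X, lengths)  # zero transitions stay zero, preserving the topology
    return model
```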
  • as described above, in the present embodiment, the generation unit 11 generates a plurality of segments from the speech data so that adjacent segments at least partially overlap, and the clustering unit 12 classifies the plurality of segments based on phoneme similarity to generate clusters.
  • the selection unit 13 selects at least one cluster from the clusters based on the size of each cluster.
  • since the extraction unit 14 extracts the segments included in the selected cluster, segments corresponding to a desired phrase can be extracted from the speech data. The reason is that, because the generation unit 11 generates a plurality of segments from the audio data so that adjacent segments at least partially overlap, anything from a unit shorter than a word to a phrase longer than a word can be generated as a single segment.
  • with the speech processing apparatus 10 in the present embodiment, frequent phrases necessary for voice examination can be found and selected at low cost, even by someone who is not an expert examiner.
  • the reason is that the speech processing apparatus 10 generates segments such that adjacent segments at least partially overlap from the given speech data, clusters the segments, selects a cluster containing many similar segments, and then extracts the segments included in the cluster selected in this way.
  • the extracted segment is a segment generated by the generation unit 11 and is partial audio data including a desired phrase in the audio data. This is because the speech processing apparatus 10 can automatically find frequently occurring phrases in the speech data.
  • FIG. 5 is a block diagram illustrating a configuration example of the sound processing device 20 according to the second embodiment of the present invention.
  • the speech processing apparatus 20 according to the second embodiment of the present invention includes a normalization learning unit 15, a speech data normalization unit 16, a speech data processing unit 17, first to Nth speech data storage units (101-1 to 101-N, where N is a positive integer), an unspecified acoustic model storage unit 102, and first to Nth parameter storage units (103-1 to 103-N).
  • the normalization learning unit 15 is also referred to as a third generation unit.
  • the first to Nth audio data storage units (101-1 to 101-N) are referred to as the audio data storage unit 101 when they are not distinguished or collectively referred to.
  • the first to Nth parameter storage units (103-1 to 103-N) are referred to as the parameter storage unit 103 when they are not distinguished or collectively referred to.
  • the audio data storage unit 101 stores audio data having different properties. That is, the first audio data storage unit 101-1, the second audio data storage unit 101-2, ..., and the Nth audio data storage unit 101-N each store audio data having different properties. The audio data of different properties stored in the first audio data storage unit 101-1, the second audio data storage unit 101-2, ..., and the Nth audio data storage unit 101-N are each audio data classified based on acoustic properties.
  • the unspecified acoustic model storage unit 102 stores the unspecified acoustic model learned by the normalization learning unit 15.
  • the unspecified acoustic model is a model obtained by normalizing a difference between audio data having different properties stored in the audio data storage unit 101.
  • the parameter storage unit 103 stores parameters for normalizing the differences between the audio data. That is, the first parameter storage unit 103-1, the second parameter storage unit 103-2, ..., and the Nth parameter storage unit 103-N each store parameters for normalizing the differences between the audio data.
  • the normalization learning unit 15 performs normalization learning using audio data having different properties stored in the audio data storage unit 101.
  • normalization learning is an acoustic model learning method described in Non-Patent Document 2, for example.
  • each phoneme i is defined by an average vector ⁇ i of acoustic feature values.
  • the average vector can be changed depending on the property of speech data. That is, in the present embodiment, the average vector (unspecified acoustic model) ⁇ i is expressed by affine transformation as shown in the following formula (1).
  • here, s = 1, 2, ..., N.
  • A_s and b_s are parameters for normalizing the differences in the properties of the audio data.
  • Equation (1) provides the unspecified acoustic model μ_i, which is not affected by differences in the properties of the speech data, and the parameters A_s and b_s for normalizing those differences. The normalization learning unit 15 stores the unspecified acoustic model μ_i in the unspecified acoustic model storage unit 102. In addition, the normalization learning unit 15 stores the parameters A_s and b_s in the parameter storage unit 103. Specifically, the normalization learning unit 15 stores the parameters A_1 and b_1 in the first parameter storage unit 103-1, ..., and the parameters A_N and b_N in the Nth parameter storage unit 103-N.
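Formula (1) itself is not reproduced in this text; judging from the surrounding description, it expresses the phoneme mean observed in data of property s as an affine transform of the shared (unspecified) mean. A hedged numpy sketch of that assumed form follows; the estimation of A_s, b_s, and μ_i (EM-style normalization training as in Non-Patent Document 2) is omitted.

```python
import numpy as np

def adapted_mean(mu_i, A_s, b_s):
    """Assumed form of formula (1): the mean vector of phoneme i as observed in
    speech data of property s, modeled as an affine transform of the shared
    (unspecified acoustic model) mean mu_i.

    mu_i: (d,) shared mean, A_s: (d, d) matrix, b_s: (d,) bias for source s.
    """
    return A_s @ mu_i + b_s
```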
  • Non-Patent Document 2 describes a method of normalizing the difference between speakers assuming that the nature of speech data varies depending on the speaker.
  • however, the difference in the properties of speech data is not limited to the speaker; various assumptions are possible, such as background noise, the microphone, or the communication line.
  • the normalization learning unit 15 generates a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101 and stores the normalization parameter in the parameter storage unit 103.
  • the normalization learning unit 15 normalizes the differences between the audio data of different properties stored in the audio data storage unit 101 to learn the unspecified acoustic model, and stores the learned unspecified acoustic model in the unspecified acoustic model storage unit 102.
  • the normalization learning unit 15 generates the normalization parameter by estimating a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101.
  • the normalization learning unit 15 stores the unspecified acoustic model in the unspecified acoustic model storage unit 102 at each iteration.
  • the audio data normalization unit 16 refers to the parameters stored in the parameter storage unit 103, normalizes the audio data stored in each of the audio data storage units 101, and sends the result to the audio data processing unit 17. Specifically, to the time series x_1, x_2, ..., x_t, ... (t is a positive integer) of acoustic features of the sth audio data, the sth parameters are applied using expression (2), which is a transformation corresponding to the inverse of expression (1).
  • the parameters that define the normalization may differ depending on the phoneme class (fricative, plosive, etc.), or may differ depending on the preceding and following phonemes in consideration of context dependency.
  • the audio data normalization unit 16 may normalize not only the mean vector of the acoustic features but also the variance. The present invention is not limited to these, and various techniques known for normalization learning may be applied.
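As a hedged illustration of expression (2), whose exact form is likewise not reproduced in this text, the sketch below applies the assumed inverse affine transform to the feature frames of the sth audio data; variance normalization, mentioned above as an option, is not shown.

```python
import numpy as np

def normalize_frames(frames, A_s, b_s):
    """Map feature frames x_1, x_2, ... of the sth audio data back into the
    space of the unspecified acoustic model, roughly x_hat = A_s^{-1} (x - b_s).

    frames: (T, d) array of acoustic feature vectors for one recording.
    """
    A_inv = np.linalg.inv(A_s)
    return (frames - b_s) @ A_inv.T
```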
  • the voice data processing unit 17 has the same configuration and effects as the voice processing apparatus 10 in the first embodiment. That is, the voice data processing unit 17 performs the processing of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14 illustrated in FIG. 1 in the same manner as in the first embodiment, and outputs segments containing phrases that appear frequently in the normalized voice data.
  • FIG. 6 is a flowchart showing an operation example of the speech processing apparatus 20 according to the second embodiment of the present invention.
  • the operation of the audio data processing unit 17 in this embodiment, that is, steps S204 to S208, is the same as the operation of the audio processing device 10 in the first embodiment, that is, steps S101 to S105, and therefore its description is omitted.
  • the normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S201).
  • the normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the speech data in the unspecified acoustic model storage unit 102 (step S202).
  • the audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and normalizes the audio data stored in the audio data storage unit 101, respectively (step S203).
  • the voice data processing unit 17 performs the same processing as steps S101 to S105 of the voice processing apparatus 10 in the first embodiment shown in FIG. 3, and outputs segments including phrases that frequently appear in the voice data (steps S204 to S208).
  • the normalization learning unit 15 reads each speech data from the speech data storage unit 101, performs normalization learning, and sets the normalization parameter of each speech data. Store in the parameter storage unit 103.
  • the normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102.
  • the voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in the voice data storage unit 101, respectively.
  • the voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 20 in the present embodiment can normalize speech data that has not been normalized and select a desired phrase.
  • the normalization learning unit 15 learns to normalize the differences in acoustic properties among the first speech data, the second speech data, ..., and the Nth speech data.
  • the voice data processing unit 17 extracts segments including phrases that frequently appear in the voice data. Therefore, the voice processing device 20 can more accurately extract phrases that frequently appear in the voice data.
  • this is because the situation in which the clustering unit 12 in the speech data processing unit 17 is affected by differences in the properties of the speech data and generates inappropriate clusters (for example, speaker clusters) can be reduced.
  • FIG. 7 is a block diagram illustrating a configuration example of the sound processing device 30 according to the third embodiment of the present invention.
  • the speech processing device 30 according to the third embodiment of the present invention includes, in addition to the configuration of the speech processing device 20 according to the second embodiment, an unclassified speech data storage unit 104 and a speech data classification unit 18.
  • the audio data classification unit 18 is also described as a fourth generation unit.
  • the uncategorized voice data storage unit 104 stores voice data.
  • the voice data classification unit 18 classifies the voice data stored in the voice data storage unit 104 based on the acoustic properties, and stores the voice data in the voice data storage unit 101.
  • the voice data classifying unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 into N clusters based on differences in acoustic properties, for example, differences in speaker, and stores each cluster in the voice data storage unit 101. That is, the voice data classification unit 18 generates N clusters by classifying the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties. Then, the voice data classifying unit 18 stores the first cluster in the first voice data storage unit, the second cluster in the second voice data storage unit, ..., and the Nth cluster in the Nth voice data storage unit.
  • the audio data stored in the unclassified audio data storage unit 104 may be a mixture of audio data having various acoustic properties.
  • N may be a predetermined constant, or may be automatically determined by the audio data classification unit 18 according to the processing target. These can be implemented by applying a known clustering method.
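As an illustration only (the patent does not prescribe a specific clustering method), the following sketch groups recordings into N clusters with a plain k-means over per-recording mean feature vectors; the feature choice is an assumption standing in for "difference in acoustic properties", such as different speakers.

```python
import numpy as np

def classify_recordings(recording_feats, n_clusters, n_iter=50, seed=0):
    """Assign each recording to one of n_clusters by k-means on the mean of its
    acoustic feature frames.

    recording_feats: list of (frames, dims) arrays, one per recording.
    Returns an array with a cluster index for each recording.
    """
    X = np.stack([f.mean(axis=0) for f in recording_feats])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```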
  • the audio data storage unit 101 stores the audio data classified by the audio data classification unit 18.
  • FIG. 8 is a flowchart showing an operation example of the speech processing apparatus 30 according to the third embodiment of the present invention.
  • the operation of the audio data processing unit 17 in this embodiment, that is, steps S306 to S310, is the same as the operation of the audio processing device 10 in the first embodiment, that is, steps S101 to S105, and therefore its description is omitted.
  • the voice data classification unit 18 classifies the voice data stored in the voice data storage unit 104 based on the acoustic properties, and stores the voice data in the voice data storage unit 101 (step S301).
  • the normalization learning unit 15 reads each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S302).
  • the normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the voice data in the unspecified acoustic model storage unit 102 (step S303).
  • the speech data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the speech data stored in each of the speech data storage units 101 (step S305).
  • when the results of the speech data classification unit 18 and the normalization learning unit 15 have not converged (No in step S304), the process returns to step S301. Thereby, the speech data classification unit 18 and the normalization learning unit 15 can be executed alternately and repeatedly until the results converge.
  • the results output by the speech data classification unit 18 and the normalization learning unit 15 may depend on each other. Therefore, they may be operated repeatedly a predetermined number of times or until the results converge.
  • Such an operation can be carried out efficiently based on an optimization criterion such as likelihood maximization following the method described in Non-Patent Document 3.
  • the voice data processing unit 17 performs the same processing as steps S101 to S105 of the voice processing apparatus 10 in the first embodiment shown in FIG. 3, and outputs segments including phrases that frequently appear in the voice data (steps S306 to S310).
  • the audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on the acoustic properties and stores the result in the audio data storage unit 101.
  • the normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameters of each voice data in the parameter storage unit 103.
  • the normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102.
  • the sound data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the sound data stored in the sound data storage unit 101, respectively.
  • the voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 30 in the present embodiment can classify and normalize speech data that has not been classified and normalized, and select a desired phrase.
  • in addition, in this embodiment, the speech data classification unit 18 classifies the speech data into N clusters based on differences in acoustic properties, and the normalization learning unit 15 is configured to perform normalization learning using that result. Therefore, the voice processing device 30 in the present embodiment can reduce the cost of preparing the voice data compared with the voice processing device 20 in the second embodiment. The reason is that the speech processing apparatus 30 in the present embodiment does not need the speech data to be divided in advance according to differences in acoustic properties (for example, for each speaker), and a set of various speech data can be given and processed all at once.
  • FIG. 9 is a block diagram showing a configuration example of the speech processing system 40 in the fourth embodiment of the present invention.
  • the voice processing system 40 in the fourth embodiment includes a voice processing device 41, a voice input device 42, an instruction input device 43, and an output device 44.
  • the voice processing device 41 executes, on the input voice, the processing of the voice processing device 10 in the first embodiment of the present invention, the processing of the voice processing device 20 in the second embodiment, or the processing of the voice processing device 30 in the third embodiment (hereinafter referred to as the "phrase extraction processing described in the first to third embodiments of the present invention").
  • the voice input device 42 inputs voice.
  • the audio input device 42 is an arbitrary device that functions as an interface for inputting arbitrary audio data to the audio processing device 41, that is, a microphone that receives audio signals as data, a memory that records audio data, and the like.
  • the voice input device 42 is, for example, the input device 1005 shown in FIG.
  • the output device 44 outputs the result of the processing performed by the voice processing device 41.
  • the output device 44 is an output device such as a monitor or a speaker that outputs the processing result of the voice processing device 41 by visual or auditory means in accordance with an instruction input from the instruction input device 43 by the operator.
  • when the output device 44 is a monitor, for example, a list of clusters is displayed in order of size, and the contents of a specific cluster are displayed as waveform diagrams, spectrograms, or the like, arranged side by side so that a plurality of segments can be compared.
  • when the output device 44 is a speaker, the output device 44 outputs by reproducing the sound.
  • the output device 44 is realized by a display device 1006, for example.
  • the instruction input device 43 receives the instruction information from the operator and controls information displayed on the display device.
  • the instruction input device 43 is a user interface that receives instruction information from the operator, such as operations on the information output from the output device 44 and execution of processing by the voice processing device 41.
  • an arbitrary input device such as a mouse, a keyboard, or a touch panel can be used.
  • the instruction input device 43 receives the instruction information from the operator and controls the voice processing device 41 to execute the process.
  • the voice input device 42 inputs arbitrary voice data to the voice processing device 41.
  • the speech processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention on the input speech data, selects clusters including frequently occurring phrases, and further extracts the segments included in the selected clusters.
  • the output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to an instruction input from the instruction input device 43 by the operator. That is, the output device 44 outputs the processing result in a form that the operator desires to view.
  • the instruction input device 43 controls the voice processing device 41 to execute processing in accordance with the instruction information input from the operator.
  • the voice input device 42 inputs arbitrary voice data to the voice processing device 41.
  • as described above, the voice processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention, selects clusters including phrases (segments) that frequently appear, and extracts the segments contained in the selected clusters.
  • the output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to the instruction input from the instruction input device 43 by the operator. Therefore, the speech processing system 40 in the present embodiment can output clusters and segments including frequently occurring phrases included in the speech data.
  • the voice processing system 40 allows an operator to easily perform analysis work such as identification of a person from voice. This is because the voice processing system 40 in the present embodiment is configured such that the processing result is output to the output device 44 in a form that the operator wants to browse.
  • in addition, since the speech processing system 40 according to the present embodiment can analyze frequently appearing phrases, it is possible to analyze the tendency of talks or topics that a specific person often speaks about.
  • FIG. 10 is a diagram illustrating an example of audio data stored in the external storage device.
  • the external storage device is realized by, for example, the voice input device 42 in the fourth embodiment.
  • the external storage device stores voice data and a voice data ID that is an identifier of the voice data.
  • for example, for the voice data ID "1", the external storage device stores the voice data "... We have kept your child. Prepare the ransom. ...".
  • the contents of the audio data stored by the external storage device are not limited to those shown in FIG. 10.
  • FIG. 11 is a diagram illustrating an example of a method in which the generation unit 11 generates a segment from audio data.
  • the generation unit 11 generates a plurality of segments from the voice data shown in FIG. 10, that is, the voice data with ID "1": "... We have kept your child. Prepare the ransom. ...".
  • in FIG. 11, for example, segment 1 contains "kept" and segment 2 is an adjacent segment that partially overlaps segment 1.
  • the generation unit 11 subdivides the audio data arbitrarily (a predetermined time or the like), and generates a plurality of segments using these.
  • in this way, the generation unit 11 generates a plurality of segments such that adjacent segments at least partially overlap.
  • FIG. 12 is a diagram illustrating an example of a method in which the clustering unit 12 generates a cluster in which a plurality of segments are collected.
  • as shown in FIG. 12, a cluster includes a cluster ID that is an identifier of the segment content (phrase), the segment content, and the number of times the segment content (phrase) appears in all of the audio data.
  • the cluster ID and the segment number shown in FIG. 11 are assumed to be the same.
  • the cluster with cluster ID "1" indicates, for example, that the phrase "kept" appears 20 times in all of the audio data. That is, the clustering unit 12 calculates the similarity between the segments generated by the generation unit 11 as illustrated in FIG. 11, and generates clusters in which highly similar segments, that is, groups of the same segment, are collected.
  • the selection unit 13 compares clusters using the number of segments included in the cluster and the total time length, and selects a cluster that satisfies a predetermined condition. For example, the selection unit 13 compares the number of segments included in each cluster among the plurality of clusters generated by the clustering unit 12, that is, the number of appearances of the phrase. As shown in FIG. 12, the selection unit 13 selects “Ransom” with an appearance count of 35 and “Prepare a ransom” with an appearance count of 30. Next, the selection unit 13 performs comparison based on the size of each cluster. For example, the selection unit 13 uses the result of multiplying the number of appearances and the segment length, that is, the time length as the size of each cluster, and selects the cluster having the largest size of each cluster.
  • the selection unit 13 compares, for example, a cluster with a cluster ID of 7 and a cluster with a cluster ID of 8.
  • in this case, the selection unit 13 compares the product of the appearance count of 35 and the time length of "ransom" with the product of the appearance count of 30 and the time length of "prepare the ransom".
  • the selection part 13 may compare and select only the time length of a segment, when comparing the clusters with the same appearance frequency. Note that the selection unit 13 is not limited to the above method, and the size may be defined and compared based on various indexes such as the number of appearances, the time length, and the number of phonemes.
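As an illustration of the comparison above (not from the patent), the tiny sketch below computes the product of appearance count and time length for the two clusters; the per-occurrence durations are made-up values, and only the counts of 35 and 30 come from the text.

```python
# Assumed per-occurrence durations; only the counts (35 and 30) come from the text.
ransom = {"phrase": "ransom", "count": 35, "seconds_per_occurrence": 0.8}
prepare = {"phrase": "prepare the ransom", "count": 30, "seconds_per_occurrence": 1.6}

def size(cluster):
    # Cluster size as appearance count multiplied by the segment time length.
    return cluster["count"] * cluster["seconds_per_occurrence"]

largest = max((ransom, prepare), key=size)
print(largest["phrase"], size(largest))  # -> prepare the ransom 48.0
```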
  • the extraction unit 14 extracts a segment from the selected cluster.
  • in this example, the audio data of the segments whose content is "prepare the ransom" is extracted. From the audio data of these segments, it can be seen that the phrase "prepare the ransom" is frequently included in the audio data.
  • DESCRIPTION OF SYMBOLS: 10 Speech processing apparatus; 11 Generation unit; 12 Clustering unit; 13 Selection unit; 14 Extraction unit; 15 Normalization learning unit; 16 Speech data normalization unit; 17 Speech data processing unit; 18 Speech data classification unit; 20 Speech processing apparatus; 30 Speech processing apparatus; 40 Speech processing system; 41 Speech processing device; 42 Speech input device; 43 Instruction input device; 44 Output device; 101 Speech data storage unit; 102 Unspecified acoustic model storage unit; 103 Parameter storage unit; 1000 Computer; 1001 CPU; 1002 Main storage device; 1003 Auxiliary storage device; 1004 Interface; 1005 Input device; 1006 Display device

Abstract

Provided is a speech processing device that allows frequently appearing phrases required for speech evaluation to be accurately selected from speech data. The speech processing device comprises: a generation unit for generating a plurality of segments from the speech data with adjacent segments at least partially overlapping each other; a clustering unit for generating a cluster by sorting the plurality of segments on the basis of phonological similarity; a selection unit for selecting a cluster that meets a prescribed condition on the basis of the size of the cluster; and an extraction unit for extracting a segment included in the selected cluster.

Description

Speech processing device, speech processing system, speech processing method, and recording medium
 The present invention relates to a speech processing device, a speech processing system, a speech processing method, and a recording medium that extract frequent patterns from speech data.
 In recent years, scientific methods based on forensic science have been widely used in police criminal investigations. In fingerprint identification, which is a typical example, fingerprint images collected at the crime scene are sequentially compared with a large number of known fingerprint images to estimate who was involved in the crime. A technique similar to fingerprint examination that deals with voice instead is called voiceprint examination or voice examination.
 Patent Document 1 describes a technique for extracting, from speech data, the speech data of unknown words that are candidate keywords to be registered in a speech recognition dictionary. The technique described in Patent Document 1 detects, as an utterance section, a section in which the speech power value of the speech data remains above a threshold th1 for at least a certain time, and divides each utterance section into sections in which the power value remains above a threshold th2 for at least a certain time. The technique then acquires a phoneme string from the divided speech data, performs clustering, calculates an evaluation value, detects unknown words, and registers them in the dictionary.
 Patent Document 2 describes a technique for determining the factor causing misrecognition and notifying the user. The technique described in Patent Document 2 divides a mel-frequency cepstral coefficient (hereinafter "MFCC") vector sequence extracted by a feature extraction unit into segments for each phoneme using a set of standard models. The technique then investigates the cause of the misrecognition, creates a message character string to be presented to the user according to the analysis result, and notifies the user by displaying the message on a display.
 Patent Document 1: International Publication No. 2009/136440. Patent Document 2: JP 2004-325635 A.
 However, with the technique described in Patent Document 1, an unknown word that is a keyword candidate can be selected, but a phrase containing a sentence (for example, a sentence such as "Prepare a ransom") cannot be selected. With the technique described in Patent Document 2, a vector sequence for each erroneously recognized segment can be analyzed, but a desired phrase cannot be selected. In other words, the techniques described in Patent Documents 1 and 2 have the problem that a desired phrase cannot be selected.
 An object of the present invention is to solve the above problem and to provide a speech processing device, a speech processing system, a speech processing method, and a recording medium that can select a desired phrase.
 A speech processing apparatus according to one aspect of the present invention includes: first generation means for generating, from speech data, a plurality of segments such that adjacent segments at least partially overlap; second generation means for classifying the plurality of segments based on phoneme similarity to generate clusters; selection means for selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and extraction means for extracting the segments included in the selected cluster.
 A speech processing method according to one aspect of the present invention generates, from speech data, a plurality of segments in which adjacent segments at least partially overlap, classifies the plurality of segments based on phoneme similarity to generate clusters, selects a cluster that satisfies a predetermined condition based on the size of the cluster, and extracts the segments included in the selected cluster.
 A recording medium according to one aspect of the present invention is a computer-readable recording medium storing a program that causes a computer to execute: a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap; a process of classifying the plurality of segments based on phoneme similarity to generate clusters; a process of selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and a process of extracting the segments included in the selected cluster.
 The present invention has the effect that a desired phrase can be selected in a speech processing device, a speech processing system, a speech processing method, and a program.
FIG. 1 is a block diagram showing a configuration example of the speech processing device according to the first embodiment of the present invention.
FIG. 2 is a schematic block diagram showing a configuration example of the computer in each embodiment and specific example of the present invention.
FIG. 3 is a flowchart showing an operation example of the speech processing device according to the first embodiment of the present invention.
FIG. 4 is a diagram showing an example of a method in which the speech processing device according to the first embodiment of the present invention extracts a phrase using an HMM.
FIG. 5 is a block diagram showing a configuration example of the speech processing device according to the second embodiment of the present invention.
FIG. 6 is a flowchart showing an operation example of the speech processing device according to the second embodiment of the present invention.
FIG. 7 is a block diagram showing a configuration example of the speech processing device according to the third embodiment of the present invention.
FIG. 8 is a flowchart showing an operation example of the speech processing device according to the third embodiment of the present invention.
FIG. 9 is a block diagram showing a configuration example of the speech processing system according to the fourth embodiment of the present invention.
FIG. 10 is a diagram showing an example of the speech data stored in the external storage device in the specific example of the present invention.
FIG. 11 is a diagram showing an example of a method in which the generation unit in the specific example of the present invention divides the speech data.
FIG. 12 is a diagram showing an example of a method in which the clustering unit in the specific example of the present invention generates a cluster in which a plurality of segments are grouped.
 First, in order to facilitate understanding of the embodiments of the present invention, the background of the present invention will be described.
 In the voice examination method related to the present invention, for example, a telephone call containing a kidnapper's ransom demand or a terrorist's crime notice is recorded, and the recorded voice is compared with known voices in an attempt to identify the caller.
 Unlike fingerprints, which remain unchanged throughout life, voice changes each time depending on what is spoken. Therefore, in the voice examination method, a portion (section) of the voice in which the same content is spoken is cut out and compared. For example, in a kidnapper's ransom demand the phrase "prepare the money" is expected to appear frequently, so such a phrase is found and cut out and compared with another voice in which "prepare the money" is spoken.
 What phrase to use is decided on a case-by-case basis. In the case of a kidnapper, the above-mentioned "prepare the money" and the like appear frequently, so they are considered appropriate. In the case of bank-transfer fraud, phrases related to money would also be appropriate, but they probably differ from those of a kidnapper. In the case of terrorists there may be other, better phrases, and different phrases would likely be better for the intelligence activities of the military or other government agencies. The selection of such frequently appearing phrases has so far relied on the experience and intuition of skilled examiners. In such cases, however, a skilled examiner must spend time carefully observing the speech, and obtaining the desired phrases necessary for voice examination incurs a large human cost.
 According to the embodiments of the present invention described below, the above problems are solved and a desired phrase can be selected.
 Hereinafter, embodiments and specific examples of the present invention will be described with reference to the drawings. In each embodiment and specific example, the same components are denoted by the same reference signs, and descriptions thereof are omitted as appropriate.
 [First Embodiment]
 Hereinafter, a first mode for carrying out the present invention (hereinafter referred to as the "first embodiment") will be described in detail with reference to the drawings.
 [Description of Configuration]
 FIG. 1 is a block diagram showing a configuration example of the speech processing device 10 according to the first embodiment of the present invention. Referring to FIG. 1, the speech processing device 10 according to the first embodiment of the present invention includes a generation unit 11, a clustering unit 12, a selection unit 13, and an extraction unit 14. Here, the generation unit 11 is also referred to as a first generation unit, and the clustering unit 12 is also referred to as a second generation unit.
 The generation unit 11 generates, from the speech data stored in an external storage device, a plurality of segments in which adjacent segments at least partially overlap. For example, the generation unit 11 subdivides the speech data stored in the external storage device into short time units and generates a plurality of segments using the subdivided speech data. The time length of the plurality of segments generated by the generation unit 11 may be a fixed time length. The generation unit 11 may also divide one piece of speech data a plurality of times with different time lengths and generate segments of various time lengths using the divided speech data.
 The clustering unit 12 classifies the plurality of segments based on a predetermined similarity index to generate clusters.
 The selection unit 13 selects at least one cluster from the generated clusters based on the size of each cluster. The extraction unit 14 extracts the segments included in the selected cluster. Here, the size of a cluster is, for example, the total time length of the segments, the result obtained by multiplying the number of appearances of the segment content (also called a phrase) by the segment length, or the like.
 FIG. 2 is a schematic block diagram showing a configuration example of the computer 1000 in each embodiment and specific example of the present invention. The computer 1000 includes a CPU (Central Processing Unit) 1001, a main storage device 1002, an auxiliary storage device 1003, an interface 1004, an input device 1005, and a display device 1006.
 各実施形態および具体例の音声処理装置10等は、コンピュータ1000に実装される。音声処理装置10等の動作は、プログラムの形式で補助記憶装置1003に記憶されている。CPU1001は、プログラムを補助記憶装置1003から読み出して主記憶装置1002に展開し、そのプログラムに従って上記の処理を実行する。例えばCPU1001は、上記プログラムを補助記憶装置1003から読み出して主記憶装置1002に展開することで、生成部11、クラスタリング部12、選択部13、および抽出部14の各部の機能を実現する。 The voice processing apparatus 10 and the like of each embodiment and a specific example are mounted on a computer 1000. Operations of the voice processing device 10 and the like are stored in the auxiliary storage device 1003 in the form of a program. The CPU 1001 reads out the program from the auxiliary storage device 1003, develops it in the main storage device 1002, and executes the above processing according to the program. For example, the CPU 1001 reads out the above program from the auxiliary storage device 1003 and develops it in the main storage device 1002, thereby realizing the functions of the generation unit 11, clustering unit 12, selection unit 13, and extraction unit 14.
 補助記憶装置1003は、一時的でない有形の媒体の一例である。一時的でない有形の媒体の他の例として、インターフェース1004を介して接続される磁気ディスク、光磁気ディスク、CD(Compact Disc)-ROM(Read Only Memory)、DVD(Digital Versatile Disc)-ROM、半導体メモリ等が挙げられる。また、このプログラムが通信回線によってコンピュータ1000に配信される場合、配信を受けたコンピュータ1000がそのプログラムを主記憶装置1002に展開し、上記の処理を実行しても良い。 The auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include a magnetic disk, a magneto-optical disk, a CD (Compact Disc)-ROM (Read Only Memory), a DVD (Digital Versatile Disc)-ROM, and a semiconductor memory connected via the interface 1004. When the program is distributed to the computer 1000 over a communication line, the computer 1000 that has received the distribution may load the program into the main storage device 1002 and execute the above processing.
 インターフェース1004は、CPU1001に接続され、ネットワークあるいは外部記憶媒体に接続される。外部データがインターフェース1004を介してCPU1001に取り込まれても良い。入力デバイス1005は、例えばキーボード、マウス、タッチパネル、又はマイクである。ディスプレイ装置1006は、例えばLCD(Liquid Crystal Display)やCRT(Cathode Ray Tube)ディスプレイのような、CPU1001やGPU(Graphics Processing Unit)(図示せず)等により処理された描画データに対応する画面を表示する装置である。なお、図2に示すハードウェア構成は、一例にすぎず、図2が示す各部それぞれが独立した論理回路で構成されていても良い。 The interface 1004 is connected to the CPU 1001 and to a network or an external storage medium. External data may be taken into the CPU 1001 via the interface 1004. The input device 1005 is, for example, a keyboard, a mouse, a touch panel, or a microphone. The display device 1006 is a device, such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display, that displays a screen corresponding to drawing data processed by the CPU 1001, a GPU (Graphics Processing Unit) (not shown), or the like. Note that the hardware configuration shown in FIG. 2 is merely an example, and each unit shown in FIG. 2 may be configured as an independent logic circuit.
 また、プログラムは、前述の処理の一部を実現するためのものであっても良い。さらに、プログラムは、補助記憶装置1003に既に記憶されている他のプログラムとの組み合わせで前述の処理を実現する差分プログラムであっても良い。 Further, the program may be for realizing a part of the above-described processing. Furthermore, the program may be a differential program that realizes the above-described processing in combination with another program already stored in the auxiliary storage device 1003.
 [動作の説明]
 図3を用いて、本実施形態の動作について説明する。図3は、本発明の第1の実施形態における音声処理装置10の動作例を示すフローチャートである。
[Description of operation]
The operation of this embodiment will be described with reference to FIG. FIG. 3 is a flowchart showing an operation example of the speech processing apparatus 10 according to the first embodiment of the present invention.
 生成部11は、外部記憶装置が記憶する音声データから複数のセグメントを生成する(ステップS101)。このとき、生成部11は、隣接するセグメントが、少なくとも時間的に重なりを持つように、複数のセグメントを生成する。セグメントの時間長は、想定するフレーズの時間長に応じて、例えば1秒から数秒の範囲の一定値としてもよい。 The generation unit 11 generates a plurality of segments from the audio data stored in the external storage device (step S101). At this time, the generation unit 11 generates a plurality of segments so that adjacent segments have at least temporal overlap. The time length of the segment may be a constant value in the range of 1 second to several seconds, for example, depending on the assumed time length of the phrase.
 また、生成部11は、一つの音声データに対して異なる時間長で複数回の分割を行い、種々の時間長のセグメントを生成してもよい。また、生成部11は、非特許文献1に記載された方法などを用いて所定の変化点や無音区間などにおいて音声データを分割し、分割した複数の音声データを用いて、可変長のセグメントを生成してもよい。 The generation unit 11 may also divide one piece of audio data multiple times with different time lengths to generate segments of various lengths. Alternatively, the generation unit 11 may divide the audio data at predetermined change points, silent intervals, or the like, using, for example, the method described in Non-Patent Document 1, and generate variable-length segments from the divided audio data.
 クラスタリング部12は、所定の類似度指標に基づき、複数のセグメントを分類してクラスタを生成する(ステップS102)。すなわち、クラスタリング部12は、複数のセグメントをクラスタリングする。クラスタリング部12は、生成部11が生成した複数のセグメントから、各セグメント間の類似度を計算し、類似度の高いセグメント同士をまとめたクラスタを生成する。クラスタリング部12による類似度指標やクラスタ生成の具体的な方法については、例えば非特許文献1に記載の方法を用いることができる。 The clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster (step S102). That is, the clustering unit 12 clusters a plurality of segments. The clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11, and generates a cluster in which the segments with high similarity are collected. As a specific method of similarity index and cluster generation by the clustering unit 12, for example, the method described in Non-Patent Document 1 can be used.
 ここで、類似度指標とは、セグメントを構成する音韻の類似性を測る指標である。類似度指標は、例えば、音響特徴量系列の平均と分散から計算されるバタチャリャ距離、カルバック・ライブラーのダイバージェンス、対数尤度比など、音響特徴量の統計量を用いる指標である。これらの指標は、セグメント内の音響特徴量系列の順序を考慮しない。 Here, the similarity index is an index that measures the similarity of the phonemes constituting the segments. The similarity index is, for example, an index based on statistics of the acoustic features, such as the Bhattacharyya distance computed from the mean and variance of an acoustic feature sequence, the Kullback-Leibler divergence, or a log-likelihood ratio. These indices do not take into account the order of the acoustic features within a segment.
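 As an illustration of such an order-insensitive statistic (a sketch assuming each segment is summarized by the mean and diagonal variance of its acoustic feature vectors; this exact formulation is an assumption, not taken from the specification), the Bhattacharyya distance between two segments could be computed as follows:

    import numpy as np

    def bhattacharyya_distance(mu1, var1, mu2, var2):
        # Bhattacharyya distance between two diagonal Gaussians summarizing two segments.
        var = (var1 + var2) / 2.0
        term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
        term2 = 0.5 * (np.sum(np.log(var))
                       - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
        return term1 + term2

    # mu/var per segment could be feats.mean(axis=0) and feats.var(axis=0) + 1e-6
    # for an (n_frames, n_dims) MFCC array.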
 また、類似度指標を用いる方法は、例えば、順序、すなわち時刻順を考慮する指標を用いてもよい。類似度指標を用いる方法は、例えば、セグメント間で各音響特徴量の最適な対応関係を動的計画法(Dynamic Programming;以降「DP」と記載)で求めて類似度を計算するDPマッチング法がある。ここで、音響特徴量とは、例えば、MFCCである。MFCCは音声認識などで広く用いられている。 A similarity index that takes order, i.e., time order, into account may also be used. One such method is DP matching, which computes the similarity by finding the optimal correspondence between the acoustic features of two segments using dynamic programming (hereinafter "DP"). Here, the acoustic features are, for example, MFCCs, which are widely used in speech recognition and related fields.
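 For an order-aware comparison, a minimal DP (DTW-style) matching sketch is shown below; the Euclidean frame distance and the three-way recursion are assumptions made here for illustration, not the specification's own implementation:

    import numpy as np

    def dp_matching_cost(a, b):
        # DP alignment cost between two feature sequences a: (Ta, D) and b: (Tb, D).
        Ta, Tb = len(a), len(b)
        D = np.full((Ta + 1, Tb + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, Ta + 1):
            for j in range(1, Tb + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])            # local frame distance
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[Ta, Tb] / (Ta + Tb)   # length-normalized; smaller means more similar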
 選択部13は、クラスタリング部12におけるクラスタリングが収束した場合(ステップS103でYes)、クラスタリング部12が生成したクラスタの中から、各クラスタのサイズに基づき、所定の条件を満たすクラスタを選択する(ステップS104)。選択部13は、この選択において、頻出するフレーズを発見するという観点からクラスタのサイズを比較し、サイズの大きい順に、少なくとも1つのクラスタを選ぶ。所定の条件とは、より多くのセグメントを含む、セグメントの総時間長がより長い、等が挙げられる。つまり、選択部13は、例えば、より多くのセグメントを含むクラスタ、あるいはセグメントの総時間長がより長いクラスタを所定の条件を満たすクラスタとして選ぶ。 When the clustering in the clustering unit 12 has converged (Yes in step S103), the selection unit 13 selects, from the clusters generated by the clustering unit 12, a cluster that satisfies a predetermined condition based on the size of each cluster (step S104). In this selection, the selection unit 13 compares the cluster sizes from the viewpoint of finding frequently occurring phrases, and picks at least one cluster in descending order of size. Examples of the predetermined condition are containing more segments, having a longer total segment time length, and so on. That is, the selection unit 13 selects, for example, a cluster containing more segments, or a cluster whose total segment time length is longer, as the cluster satisfying the predetermined condition.
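 A minimal sketch of this selection step (illustrative only; here the cluster size is taken as the total duration of its segments, one of the measures the description mentions):

    def select_clusters(clusters, top_k=1):
        # clusters: list of clusters, each a list of segment durations in seconds.
        # Size = total duration (equivalently, occurrence count x mean segment length).
        return sorted(clusters, key=sum, reverse=True)[:top_k]

    # usage: largest = select_clusters([[1.2, 1.1, 1.3], [0.8, 0.9]], top_k=1)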
 ここで、クラスタリングが収束する場合とは、例えば、ステップS101及びステップS102が所定回数実行された状況、クラスタリングに関する所定の評価値の増減が一定の値以下になった状況等である。なお、クラスタリングが収束する場合とは、クラスタリングに関する所定の評価値の増減が一定の値以下になった状況に付随して、クラスタ間でセグメントの移動がなくなった状況であってもよい。 Here, the case where the clustering converges is, for example, a situation where Step S101 and Step S102 are executed a predetermined number of times, a situation where the increase or decrease of the predetermined evaluation value related to clustering is a predetermined value or less, and the like. Note that the case where clustering converges may be a situation in which the movement of a segment between clusters is lost in association with a situation in which the increase or decrease in a predetermined evaluation value related to clustering is equal to or less than a certain value.
 抽出部14は、選択部13で選択されたクラスタに含まれる1または複数のセグメントから、セグメントを抽出する(ステップS105)。これにより、抽出部14は、音声データから、所望のフレーズに該当する部分のセグメントを抽出することができる。 The extraction unit 14 extracts segments from one or more segments included in the cluster selected by the selection unit 13 (step S105). Thereby, the extraction part 14 can extract the segment of the part applicable to a desired phrase from audio | voice data.
 ここで、クラスタリング部12におけるクラスタリングが収束していない場合(ステップS103でNo)、ステップS101の処理に戻る。これは、ステップS101およびステップS102が相互に依存するため、所定回数、あるいは収束するまで反復してもよいことを示す。 Here, if the clustering in the clustering unit 12 has not converged (No in step S103), the process returns to step S101. This indicates that step S101 and step S102 depend on each other and may be repeated a predetermined number of times or until convergence.
 なお、生成部11とクラスタリング部12とは、図4が示す構造を有する隠れマルコフモデル(Hidden Markov Model;以降、「HMM」と記載する)を用いて一括実行することも可能である。図4は、音声処理装置10がHMMを用いてフレーズを抽出する方法の一例を示す図である。すなわち、音声処理装置10は、外部記憶装置が記憶する音声データを用いて、図4が示すようなHMMを最尤推定法などに基づき学習する。これにより、第1のフレーズ(図4のフレーズ1)、第2のフレーズ(図4のフレーズ2)、…、第Nのフレーズ(図4のフレーズN)を表現する一方向型HMM(Left-to-right HMM)が自動的に形成され、同時に各々に属するセグメントも獲得される。 Note that the processing of the generation unit 11 and the clustering unit 12 can also be performed jointly using a hidden Markov model (hereinafter "HMM") having the structure shown in FIG. 4. FIG. 4 is a diagram showing an example of how the speech processing device 10 extracts phrases using an HMM. That is, the speech processing device 10 trains an HMM such as the one shown in FIG. 4 from the audio data stored in the external storage device, for example by maximum-likelihood estimation. As a result, left-to-right HMMs representing the first phrase (phrase 1 in FIG. 4), the second phrase (phrase 2 in FIG. 4), ..., and the N-th phrase (phrase N in FIG. 4) are formed automatically, and the segments belonging to each phrase are obtained at the same time.
 抽出された音声データの該当部分を聴取することにより、頻出するフレーズを確認し、また音声鑑定に利用することができる。 Listening to the relevant part of the extracted voice data allows frequent phrases to be confirmed and used for voice appraisal.
 [効果の説明]
 以上のように、本実施形態に係る音声処理装置10によれば、生成部11が音声データから、隣接するセグメントが少なくとも一部重複するように、複数のセグメントを生成し、クラスタリング部12が音韻の類似性に基づき、複数のセグメントを分類してクラスタを生成する。そして、本実施形態に係る音声処理装置10によれば、選択部13がクラスタの中から、各クラスタのサイズに基づき少なくとも1つのクラスタを選択する。更に、本実施形態における音声処理装置10によれば、抽出部14が選択されたクラスタに含まれるセグメントを抽出するため、音声データの中から所望のフレーズに該当する部分のセグメントを抽出することが可能となる。その理由は、生成部11が音声データから隣接するセグメントが少なくとも一部が重複するように複数のセグメントを生成しているため、単語よりも短い語から単語よりも長いフレーズを1つのセグメントとして生成できるからである。
[Description of effects]
As described above, according to the speech processing device 10 of the present embodiment, the generation unit 11 generates, from the audio data, a plurality of segments such that adjacent segments at least partially overlap, and the clustering unit 12 classifies the segments based on phonemic similarity to generate clusters. The selection unit 13 then selects at least one cluster from among the clusters based on the size of each cluster. Furthermore, since the extraction unit 14 extracts the segments contained in the selected cluster, the speech processing device 10 can extract, from the audio data, the segments corresponding to a desired phrase. The reason is that, because the generation unit 11 generates the segments so that adjacent segments at least partially overlap, anything from a unit shorter than a word to a phrase longer than a word can be captured as a single segment.
 また、本実施形態における音声処理装置10を用いることで、熟練した鑑定者でなくとも音声鑑定に必要な頻出するフレーズを低コストで発見および選定できる。その理由は、音声処理装置10が、与えられた音声データから、隣接するセグメントが少なくとも一部重複するようにセグメントを生成し、このセグメントをクラスタリングし、類似した多数のセグメントを含むクラスタを選択するからである。そして、音声処理装置10はこのように選択されたクラスタに含まれるセグメントを抽出するからである。抽出されたセグメントは、生成部11が生成したセグメントであり、音声データのうちの所望のフレーズを含む部分的な音声データである。これにより、音声処理装置10は、音声データ中で頻出するフレーズを自動的に発見できるからである。 Further, by using the speech processing apparatus 10 in the present embodiment, frequent phrases necessary for speech appraisal can be found and selected at low cost even if not an expert appraiser. The reason is that the speech processing apparatus 10 generates segments such that adjacent segments at least partially overlap from given speech data, clusters the segments, and selects a cluster including many similar segments. Because. This is because the speech processing apparatus 10 extracts segments included in the cluster selected in this way. The extracted segment is a segment generated by the generation unit 11 and is partial audio data including a desired phrase in the audio data. This is because the speech processing apparatus 10 can automatically find frequently occurring phrases in the speech data.
 さらに、本実施形態における音声処理装置10を用いることで、定量的に頻度の高いフレーズを発見できるため、音声鑑定に有用な頻出フレーズを高い信頼性で発見できる。 Furthermore, by using the speech processing apparatus 10 according to the present embodiment, frequently frequent phrases can be found quantitatively, so that frequent phrases useful for speech appraisal can be found with high reliability.
 [第2の実施の形態]
 以下、本発明の第2の実施形態について図面を参照して詳細に説明する。
[Second Embodiment]
Hereinafter, a second embodiment of the present invention will be described in detail with reference to the drawings.
 [構成の説明]
 図5は、本発明の第2の実施形態に係る音声処理装置20の構成例を示すブロック図である。図5を参照すると、本発明の第2の実施形態に係る音声処理装置20は、正規化学習部15、音声データ正規化部16、音声データ処理部17、第1~第Nの音声データ記憶部(101-1~101-N(Nは正の整数))、不特定音響モデル記憶部102、及び第1~第Nのパラメタ記憶部(103-1~103-N(Nは正の整数))を備える。
ここで、正規化学習部15は、第3の生成部とも記載する。なお、本実施の形態では、第1~第Nの音声データ記憶部(101-1~101-N)の夫々を区別しない場合、または、総称する場合には、音声データ記憶部101と呼ぶ。また、第1~第Nのパラメタ記憶部(103-1~103-N)の夫々を区別しない場合、または、総称する場合には、パラメタ記憶部103と呼ぶ。
[Description of configuration]
FIG. 5 is a block diagram showing a configuration example of the speech processing device 20 according to the second embodiment of the present invention. Referring to FIG. 5, the speech processing device 20 according to the second embodiment of the present invention comprises a normalization learning unit 15, an audio data normalization unit 16, an audio data processing unit 17, first to N-th audio data storage units (101-1 to 101-N, where N is a positive integer), an unspecified acoustic model storage unit 102, and first to N-th parameter storage units (103-1 to 103-N, where N is a positive integer).
Here, the normalization learning unit 15 is also referred to as a third generation unit. In the present embodiment, the first to Nth audio data storage units (101-1 to 101-N) are referred to as the audio data storage unit 101 when they are not distinguished or collectively referred to. The first to Nth parameter storage units (103-1 to 103-N) are referred to as the parameter storage unit 103 when they are not distinguished or collectively referred to.
 音声データ記憶手段101は、性質の異なる音声データを各々記憶する。すなわち、第1の音声データ記憶部101-1、第2の音声データ記憶部101-2、・・・、及び第Nの音声データ記憶部101-Nは各々性質の異なる音声データを記憶する。また、第1の音声データ記憶部101-1、第2の音声データ記憶部101-2、・・・、及び第Nの音声データ記憶部101-Nが記憶する各々性質の異なる音声データは、それぞれ音響的な性質に基づいて分類された音声データである。 The audio data storage means 101 stores audio data having different properties. That is, the first audio data storage unit 101-1, the second audio data storage unit 101-2,..., And the Nth audio data storage unit 101-N each store audio data having different properties. Also, the audio data having different properties stored in the first audio data storage unit 101-1, the second audio data storage unit 101-2,..., And the Nth audio data storage unit 101-N are: Each of them is voice data classified based on acoustic characteristics.
 不特定音響モデル記憶部102は、正規化学習部15が学習した不特定音響モデルを記憶する。不特定音響モデルとは、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化することで得られるモデルである。 The unspecified acoustic model storage unit 102 stores the unspecified acoustic model learned by the normalization learning unit 15. The unspecified acoustic model is a model obtained by normalizing a difference between audio data having different properties stored in the audio data storage unit 101.
 パラメタ記憶部103は、音声データの差異を正規化するためのパラメタを各々記憶する。すなわち、第1のパラメタ記憶部103-1、第2のパラメタ記憶部103-2、・・・、及び第Nのパラメタ記憶部103-Nは、音声データの差異を正規化するためのパラメタを各々記憶する。 The parameter storage unit 103 stores parameters for normalizing the difference between the audio data. That is, the first parameter storage unit 103-1, the second parameter storage unit 103-2,..., And the Nth parameter storage unit 103-N have parameters for normalizing the difference of audio data. Remember each one.
 正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データを用いて、正規化学習を行う。 The normalization learning unit 15 performs normalization learning using audio data having different properties stored in the audio data storage unit 101.
 ここで正規化学習とは、例えば非特許文献2に記載された音響モデルの学習法である。
音響モデルは、音響特徴量の平均ベクトルμiによって各音韻iを規定するが、正規化学習では平均ベクトルが音声データの性質によって変わり得るとする。即ち、本実施の形態では、平均ベクトル(不特定音響モデル)μiを、以下の式(1)のようなアフィン変換(affine transformation)で表現する。
Here, normalization learning is an acoustic model learning method described in Non-Patent Document 2, for example.
In the acoustic model, each phoneme i is defined by an average vector μ i of acoustic feature values. In normalization learning, it is assumed that the average vector can be changed depending on the property of speech data. That is, in the present embodiment, the average vector (unspecified acoustic model) μ i is expressed by affine transformation as shown in the following formula (1).
  μ_i^(s) = A_s μ_i + b_s    (s = 1, 2, ..., N)    …(1)
 ここで、s=1、2、・・・、Nである。また、A_sおよびb_sは、夫々、音声データの性質の違いを正規化するためのパラメタである。 Here, s = 1, 2, ..., N, and A_s and b_s are parameters for normalizing the differences in the properties of the audio data.
 式(1)により、音声データの性質の違いに影響されない不特定音響モデルμ_iと、音声データの性質の違いを正規化するためのパラメタA_sおよびb_sが得られる。そして、正規化学習部15は、不特定音響モデルμ_iを不特定音響モデル記憶部102に格納する。また、正規化学習部15は、パラメタA_sおよびb_sを、パラメタ記憶部103に記憶する。具体的には、正規化学習部15は、パラメタA_1およびb_1を、第1のパラメタ記憶部103-1に格納し、パラメタA_Nおよびb_Nを第Nのパラメタ記憶部103-Nに格納する。非特許文献2には、話者によって音声データの性質が異なるとし、話者の違いを正規化する方法が記載されているが、音声データの性質の違いは話者に限らず、背景雑音、マイクや通信回線など、種々の想定が可能である。 From formula (1), the unspecified acoustic model μ_i, which is not affected by differences in the properties of the audio data, and the parameters A_s and b_s for normalizing those differences are obtained. The normalization learning unit 15 then stores the unspecified acoustic model μ_i in the unspecified acoustic model storage unit 102, and stores the parameters A_s and b_s in the parameter storage units 103. Specifically, the normalization learning unit 15 stores the parameters A_1 and b_1 in the first parameter storage unit 103-1, and the parameters A_N and b_N in the N-th parameter storage unit 103-N. Non-Patent Document 2 describes a method for normalizing differences between speakers on the assumption that the properties of audio data vary from speaker to speaker; however, the differences in the properties of audio data are not limited to speakers, and various other factors, such as background noise, microphones, and communication channels, can be assumed.
 すなわち、正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化するための正規化パラメタを生成し、パラメタ記憶部103に記憶する。また、正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化して不特定音響モデルを学習し、学習した不特定音響モデルを不特定音響モデル102に記憶する。ここで、正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化するための正規化パラメタを推定することで、該正規化パラメタを生成する。また、正規化学習部15は、例えば、反復計算を行う場合では、反復のたびに不特定音響モデルを不特定音響モデル102に記憶する。 That is, the normalization learning unit 15 generates a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101 and stores the normalization parameter in the parameter storage unit 103. In addition, the normalization learning unit 15 normalizes the difference between the audio data having different properties stored in the audio data storage unit 101 to learn the unspecified acoustic model, and stores the learned unspecified acoustic model in the unspecified acoustic model 102. To do. Here, the normalization learning unit 15 generates the normalization parameter by estimating a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101. In addition, for example, when performing iterative calculation, the normalization learning unit 15 stores the unspecified acoustic model in the unspecified acoustic model 102 at each iteration.
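 For illustration only (a toy sketch, not the estimation procedure of Non-Patent Document 2): assuming per-condition phoneme mean vectors and the shared means μ_i of the unspecified model are available, A_s and b_s of formula (1) could be fitted by least squares as follows:

    import numpy as np

    def estimate_affine_params(shared_means, condition_means):
        # Least-squares fit of condition_means ≈ shared_means @ A.T + b,
        # i.e. formula (1) applied row by row; both arrays are (n_phonemes, n_dims).
        n, d = shared_means.shape
        X = np.hstack([shared_means, np.ones((n, 1))])           # append 1 for the bias term
        W, *_ = np.linalg.lstsq(X, condition_means, rcond=None)  # W: (d + 1, d)
        return W[:d].T, W[d]                                     # A: (d, d), b: (d,)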
 音声データ正規化部16は、パラメタ記憶部103に記憶されたパラメタを参照し、各々音声データ記憶部101に記憶された音声データを正規化し、音声データ処理部17に送る。具体的には、第sの音声データの音響特徴量の時系列x_1、x_2、・・・、x_t、・・・(tは正の整数)に対して、第sのパラメタを用い、式(1)の逆変換に相当する変換である、式(2)を施す。 The audio data normalization unit 16 refers to the parameters stored in the parameter storage units 103, normalizes the audio data stored in each audio data storage unit 101, and sends the result to the audio data processing unit 17. Specifically, to the time series x_1, x_2, ..., x_t, ... (t is a positive integer) of acoustic features of the s-th audio data, it applies formula (2), which is the transformation corresponding to the inverse of formula (1), using the s-th parameters.
  x̂_t = A_s^{-1} (x_t − b_s)    …(2)
 正規化を規定するパラメタは、音韻のクラス(摩擦音、破裂音など)に応じて異なるものを用いてもよいし、文脈依存性を考慮して前後の音韻に応じて異なるものを用いてもよい。また、音声データ正規化部16は、音響特徴量の平均ベクトルだけでなく分散も正規化するようにしてもよい。またこれらに限らず、正規化学習に関して知られている各種の工夫を適用してよい。 The parameters that define the normalization may differ depending on the phoneme class (fricatives, plosives, and so on), or may differ depending on the preceding and following phonemes to take context dependency into account. The audio data normalization unit 16 may also normalize not only the mean vector of the acoustic features but also the variance. The method is not limited to these; various techniques known for normalization training may be applied.
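 A minimal sketch of applying formula (2) to a feature sequence (illustrative; the parameter shapes are assumptions):

    import numpy as np

    def normalize_features(feats, A, b):
        # Apply x_hat = A^-1 (x - b) to every frame; feats: (T, D), A: (D, D), b: (D,).
        return (feats - b) @ np.linalg.inv(A).T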
 音声データ処理部17は、第1の実施形態における音声処理装置10と同様の構成および効果を有する。すなわち、音声データ処理部17は、図1が示す生成部11、クラスタリング部12、選択部13、および抽出部14の処理を第1の実施形態と同様に実行し、正規化された音声データ中に頻出するフレーズを含むセグメントを出力する。 The voice data processing unit 17 has the same configuration and effects as the voice processing apparatus 10 in the first embodiment. That is, the voice data processing unit 17 performs the processing of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14 illustrated in FIG. 1 in the same manner as in the first embodiment, and in the normalized voice data A segment containing a phrase that appears frequently is output.
 [動作の説明]
 図6を用いて、本実施形態の動作について説明する。図6は、本発明の第2の実施形態における音声処理装置20の動作例を示すフローチャートである。ここで、図6が示すように、本実施形態における音声データ処理部17の動作、すなわちステップS204からステップS208は、第1の実施形態における音声処理装置10の動作、すなわちステップS101乃至ステップS105と同様であるため、説明を省略する。
[Description of operation]
The operation of this embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing an operation example of the speech processing device 20 according to the second embodiment of the present invention. As FIG. 6 shows, the operation of the audio data processing unit 17 in this embodiment, i.e., steps S204 to S208, is the same as the operation of the speech processing device 10 in the first embodiment, i.e., steps S101 to S105, so its description is omitted.
 正規化学習部15は、音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する(ステップS201)。 The normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S201).
 正規化学習部15は、正規化を行って音声データの性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する(ステップS202)。 The normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the speech data in the unspecified acoustic model storage unit 102 (step S202).
 音声データ正規化部16は、パラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する(ステップS203)。 The audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and normalizes the audio data stored in the audio data storage unit 101, respectively (step S203).
 音声データ処理部17は、図3が示す第1の実施形態における音声処理装置10のステップS101乃至ステップS105と同様の処理を実行し、音声データ中に頻出するフレーズを含むセグメントを出力する(ステップS204乃至ステップS208)。 The audio data processing unit 17 executes the same processing as steps S101 to S105 of the speech processing device 10 in the first embodiment shown in FIG. 3, and outputs segments containing phrases that frequently appear in the audio data (steps S204 to S208).
 [効果の説明]
 以上のように、本実施形態における音声処理装置20よれば、正規化学習部15が音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する。正規化学習部15が正規化を行って各々の音声データの音響的な性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する。また、音声データ正規化部16がパラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する。音声データ処理部17が正規化された音声データ中に頻出するフレーズを含むセグメントを出力する。そのため、本実施形態における音声処理装置20は、正規化されていない音声データを正規化し、所望のフレーズを選定することが可能である。
[Description of effects]
As described above, according to the speech processing apparatus 20 in the present embodiment, the normalization learning unit 15 reads each speech data from the speech data storage unit 101, performs normalization learning, and sets the normalization parameter of each speech data. Store in the parameter storage unit 103. The normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102. Also, the voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in the voice data storage unit 101, respectively. The voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 20 in the present embodiment can normalize speech data that has not been normalized and select a desired phrase.
 また、本実施形態における音声処理装置20によれば、正規化学習部15が第1の音声データ、第2の音声データ、…、第Nの音声データの音響的な性質の違い(例えば話者の違い)を正規化する学習を行う。音声データ正規化部16が音響的な性質の違いを解消した後に、音声データ処理部17が音声データ中に頻出するフレーズを含むセグメントを抽出する。そのため、音声処理装置20は、音声データ中に頻出するフレーズをより正確に抽出できる。理由としては、本実施形態における音声処理装置20は、音声データ処理部17の中のクラスタリング部12が音声データの性質の違いに影響されて不適切なクラスタ(例えば話者のクラスタ)を生成するような事態を低減することができるからである。 Further, according to the speech processing device 20 in the present embodiment, the normalization learning unit 15 determines the difference in acoustic properties between the first speech data, the second speech data,. Learning to normalize the difference. After the voice data normalization unit 16 eliminates the difference in acoustic properties, the voice data processing unit 17 extracts segments including phrases that frequently appear in the voice data. Therefore, the voice processing device 20 can more accurately extract phrases that frequently appear in the voice data. The reason is that in the speech processing apparatus 20 according to the present embodiment, the clustering unit 12 in the speech data processing unit 17 is affected by the difference in the properties of the speech data and generates an inappropriate cluster (for example, a speaker cluster). This is because such a situation can be reduced.
 [第3の実施の形態]
 以下、本発明の第3の実施形態について図面を参照して詳細に説明する。
[Third Embodiment]
Hereinafter, a third embodiment of the present invention will be described in detail with reference to the drawings.
 [構成の説明]
 図7は、本発明の第3の実施形態における音声処理装置30の構成例を示すブロック図である。図7を参照すると、本発明の第3の実施形態における音声処理装置30は、第2の実施形態における音声処理装置20の構成に加え、未分類音声データ記憶部104と、音声データ分類部18と、を備える。ここで、第2の実施形態における音声処理装置20の構成は既に説明しているため、説明を省略する。また、音声データ分類部18は、第4の生成部とも記載する。
[Description of configuration]
FIG. 7 is a block diagram showing a configuration example of the speech processing device 30 according to the third embodiment of the present invention. Referring to FIG. 7, the speech processing device 30 according to the third embodiment of the present invention comprises, in addition to the configuration of the speech processing device 20 according to the second embodiment, an unclassified audio data storage unit 104 and an audio data classification unit 18. Since the configuration of the speech processing device 20 according to the second embodiment has already been described, its description is omitted. The audio data classification unit 18 is also referred to as a fourth generation unit.
 未分類音声データ記憶部104は、音声データを記憶する。 The uncategorized voice data storage unit 104 stores voice data.
 音声データ分類部18は、未分類音声データ記憶部104が記憶する音声データを音響的な性質に基づいて分類し、音声データ記憶部101に記憶する。音声データ分類部18は、例えば、未分類音声データ記憶部104に記憶された音声データを音響的な性質の違い、例えば話者の違いに基づいてN個のクラスタに分類し、音声データ記憶部101に各々記憶する。すなわち、音声データ分類部18は、未分類音声データ記憶部104に記憶された音声データを音響的な性質に基づいて分類することで、N個のクラスタを生成する。そして、音声データ分類部18は、第1の音声データ記憶部に第1のクラスタを、第2の音声データ記憶部に第2のクラスタを、・・・、第Nの音声データ記憶部に第Nのクラスタを記憶する。 The audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on their acoustic properties and stores the result in the audio data storage units 101. For example, the audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 into N clusters based on differences in acoustic properties, such as differences between speakers, and stores them in the respective audio data storage units 101. That is, the audio data classification unit 18 generates N clusters by classifying the audio data stored in the unclassified audio data storage unit 104 based on their acoustic properties, and stores the first cluster in the first audio data storage unit, the second cluster in the second audio data storage unit, ..., and the N-th cluster in the N-th audio data storage unit.
 ここで、未分類音声データ記憶部104に記憶された音声データは、種々の音響的な性質を有する音声データが混在したものであってよい。またNはあらかじめ定められた定数としてもよいし、処理対象に応じて音声データ分類部18が自動的に決定するようにしてもよい。これらは公知のクラスタリングの方法を適用することにより実施可能である。 Here, the audio data stored in the unclassified audio data storage unit 104 may be a mixture of audio data having various acoustic properties. N may be a predetermined constant, or may be automatically determined by the audio data classification unit 18 according to the processing target. These can be implemented by applying a known clustering method.
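 For illustration only (a sketch under the assumption that each recording is summarized by its mean MFCC vector and that k-means is the clustering method; the description itself only states that known clustering methods may be applied):

    import numpy as np
    from sklearn.cluster import KMeans

    def classify_by_acoustic_property(recordings_feats, n_clusters):
        # recordings_feats: list of (T_i, D) feature arrays, one per recording.
        # Returns one cluster index per recording.
        summaries = np.stack([f.mean(axis=0) for f in recordings_feats])
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(summaries)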
 音声データ記憶部101は、音声データ分類部18によって分類された音声データを各々記憶する。 The audio data storage unit 101 stores the audio data classified by the audio data classification unit 18.
 [動作の説明]
 図8を用いて、本実施形態の動作について説明する。図8は、本発明の第3の実施形態における音声処理装置30の動作例を示すフローチャートである。ここで、図8が示すように、本実施形態における音声データ処理部17の動作、すなわちステップS306からステップS310は、第1の実施形態における音声処理装置10の動作、すなわちステップS101乃至ステップS105と同様であるため、説明を省略する。
[Description of operation]
The operation of this embodiment will be described with reference to FIG. 8. FIG. 8 is a flowchart showing an operation example of the speech processing device 30 according to the third embodiment of the present invention. As FIG. 8 shows, the operation of the audio data processing unit 17 in this embodiment, i.e., steps S306 to S310, is the same as the operation of the speech processing device 10 in the first embodiment, i.e., steps S101 to S105, so its description is omitted.
 音声データ分類部18は、音声データ記憶部104が記憶する音声データを音響的な性質に基づいて分類し、音声データ記憶部101に記憶する(ステップS301)。 The voice data classification unit 18 classifies the voice data stored in the voice data storage unit 104 based on the acoustic properties, and stores the voice data in the voice data storage unit 101 (step S301).
 正規化学習部15は、音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する(ステップS302)。 The normalization learning unit 15 reads each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S302).
 正規化学習部15は、正規化を行って音声データの性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する(ステップS303)。 The normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the voice data in the unspecified acoustic model storage unit 102 (step S303).
 音声データ分類部18および正規化学習部15の結果が収束した場合(ステップS304でYes)、音声データ正規化部16は、パラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する(ステップS305)。 When the results of the speech data classification unit 18 and the normalization learning unit 15 converge (Yes in step S304), the speech data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and each of the speech data The voice data stored in the storage unit 101 is normalized (step S305).
 ここで、音声データ分類部18および正規化学習部15の結果が収束していない場合(ステップS304でNo)、ステップS301のフローへ戻る。これにより、音声データ分類部18と正規化学習部15は、結果が収束するまで交互に反復実行できる。 Here, when the results of the speech data classification unit 18 and the normalization learning unit 15 have not converged (No in step S304), the process returns to the flow of step S301. Thereby, the voice data classification unit 18 and the normalization learning unit 15 can be repeatedly executed alternately until the result converges.
 なお、音声データ分類部18と正規化学習部15が各々出力する結果は、相互に依存することもある。そのため、音声データ分類部18と正規化学習部15との実行回数が所定の閾値になるまでもしくは収束するまで、交互に実行する反復的な動作としてもよい。このような動作は、非特許文献3に記載される方法にならい、尤度最大化などの最適化基準に基づき効率的に実施することが可能である。 Note that the results output by the audio data classification unit 18 and the normalization learning unit 15 may depend on each other. Therefore, the two may be executed alternately and iteratively until the number of executions reaches a predetermined threshold or until the results converge. Following the method described in Non-Patent Document 3, such an operation can be carried out efficiently based on an optimization criterion such as likelihood maximization.
 音声データ処理部17は、図3が示す第1の実施形態における音声処理装置10のステップS101乃至ステップS105と同様の処理を実行し、音声データ中に頻出するフレーズを含むセグメントを出力する(ステップS306乃至ステップS310)。 The audio data processing unit 17 executes the same processing as steps S101 to S105 of the speech processing device 10 in the first embodiment shown in FIG. 3, and outputs segments containing phrases that frequently appear in the audio data (steps S306 to S310).
 [効果の説明]
 以上のように、本実施形態における音声処理装置30によれば、音声データ分類部18が、音声データ記憶部104が記憶する音声データを音響的な性質に基づいて分類し、音声データ記憶部101に記憶する。そして、正規化学習部15が、音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する。正規化学習部15が正規化を行って各々の音声データの音響的な性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する。音声データ正規化部16がパラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する。音声データ処理部17が正規化された音声データ中に頻出するフレーズを含むセグメントを出力する。そのため、本実施形態における音声処理装置30は、分類および正規化されていない音声データを分類および正規化し、所望のフレーズを選定することが可能である。
[Description of effects]
As described above, according to the speech processing device 30 of the present embodiment, the audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on their acoustic properties and stores the result in the audio data storage units 101. The normalization learning unit 15 then reads the audio data from the audio data storage units 101, performs normalization training, and stores the normalization parameters of each set of audio data in the parameter storage units 103. The normalization learning unit 15 also stores, in the unspecified acoustic model storage unit 102, the unspecified acoustic model generated after normalization has removed the differences in the acoustic properties of the audio data. The audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage units 103 and normalizes the audio data stored in the audio data storage units 101. The audio data processing unit 17 outputs segments containing phrases that frequently appear in the normalized audio data. Therefore, the speech processing device 30 of the present embodiment can classify and normalize audio data that has not yet been classified or normalized, and can select a desired phrase.
 また、本実施形態における音声処理装置30によれば、音声データ分類部18が音声データを音響的な性質の違いに基づいてN個のクラスタに分類し、その結果を用いて正規化学習部15が正規化学習を行うように構成されている。そのため、本実施形態における音声処理装置30は、第2の実施形態における音声処理装置20と比べて、音声データの準備コストを低減できる。その理由としては、本実施形態における音声処理装置30は、音声データをあらかじめ音響的な性質の違いに応じて(例えば話者ごとに)分けておく必要がなく、雑多な音声データの集合を一括で与えて処理することができるからである。 Further, according to the speech processing device 30 of the present embodiment, the audio data classification unit 18 classifies the audio data into N clusters based on differences in acoustic properties, and the normalization learning unit 15 performs normalization training using that result. Therefore, the speech processing device 30 of the present embodiment can reduce the cost of preparing audio data compared with the speech processing device 20 of the second embodiment. The reason is that the speech processing device 30 of the present embodiment does not require the audio data to be divided in advance according to differences in acoustic properties (for example, per speaker), and a miscellaneous collection of audio data can be given and processed in one batch.
 [第4の実施の形態]
 [構成の説明]
 以下、本発明の第4の実施形態について図面を参照して詳細に説明する。
[Fourth Embodiment]
[Description of configuration]
Hereinafter, a fourth embodiment of the present invention will be described in detail with reference to the drawings.
 図9は、本発明の第4の実施形態における音声処理システム40の構成例を示すブロック図である。図9を参照すると、第4の実施形態における音声処理システム40は、音声処理装置41と、音声入力装置42と、指示入力装置43と、出力装置44とを備える。 FIG. 9 is a block diagram showing a configuration example of the speech processing system 40 in the fourth embodiment of the present invention. Referring to FIG. 9, the voice processing system 40 in the fourth embodiment includes a voice processing device 41, a voice input device 42, an instruction input device 43, and an output device 44.
 音声処理装置41は、入力された音声に対して本発明の第1の実施形態における音声処理装置10の処理、第2の実施の形態における音声処理装置20の処理、または、第3の実施形態における音声処理装置30の処理(以降、「本発明の第1乃至第3の実施形態に記載のフレーズ抽出処理」と記載)を実行する。 The voice processing device 41 processes the input voice by the processing of the voice processing device 10 in the first embodiment of the present invention, the processing of the voice processing device 20 in the second embodiment, or the third embodiment. The processing of the voice processing device 30 (hereinafter referred to as “phrase extraction processing described in the first to third embodiments of the present invention”) is executed.
 音声入力装置42は、音声を入力する。音声入力装置42は、任意の音声データを音声処理装置41に入力するインターフェースとして働く任意のデバイス、すなわち音声信号をデータとして収受するマイクや音声データを記録するメモリなどである。音声入力装置42は、例えば、図2が示す入力デバイス1005である。 The voice input device 42 inputs voice. The audio input device 42 is an arbitrary device that functions as an interface for inputting arbitrary audio data to the audio processing device 41, that is, a microphone that receives audio signals as data, a memory that records audio data, and the like. The voice input device 42 is, for example, the input device 1005 shown in FIG.
 出力装置44は、音声処理装置41が処理を実行した結果を出力する。出力装置44は、音声処理装置41の処理結果を、操作者が指示入力装置43から入力した指示に応じて視覚的あるいは聴覚的手段で出力する、モニターやスピーカーなどの出力デバイスである。出力装置44の出力方法は、出力装置44がモニターの場合、例えば、クラスタの一覧をサイズ順に表示する、特定のクラスタの内容を波形図、スペクトログラムなどにより表示する、複数のセグメントを比較できるように並べて表示する、などである。また、出力装置44がスピーカーの場合、出力装置44の出力方法は、音声を再生する、などである。出力装置44は、例えば、ディスプレイ装置1006で実現される。 The output device 44 outputs the result of the processing executed by the speech processing device 41. The output device 44 is an output device, such as a monitor or a speaker, that outputs the processing result of the speech processing device 41 by visual or auditory means in accordance with an instruction the operator inputs through the instruction input device 43. When the output device 44 is a monitor, its output methods include, for example, displaying a list of clusters in order of size, displaying the content of a specific cluster as a waveform diagram or spectrogram, and displaying a plurality of segments side by side so that they can be compared. When the output device 44 is a speaker, its output method is, for example, playing back the audio. The output device 44 is realized by, for example, the display device 1006.
 指示入力装置43は、操作者からの指示情報を受けて表示装置に表示する情報を制御する。指示入力装置43は、出力装置44が出力する情報に対する処理や音声処理装置41の処理の実行など、操作者の指示情報を受け取るユーザインタフェースであり、マウスやキーボード、タッチパネルなどの任意の入力デバイスが利用可能である。 The instruction input device 43 receives instruction information from the operator and controls the information displayed on the display device. The instruction input device 43 is a user interface that receives the operator's instruction information, such as operations on the information output by the output device 44 and commands to execute the processing of the speech processing device 41; any input device such as a mouse, keyboard, or touch panel can be used.
 [動作の説明]
 以下、本発明の第4の実施形態における音声処理システム40の動作例について説明する。
[Description of operation]
Hereinafter, an operation example of the voice processing system 40 according to the fourth embodiment of the present invention will be described.
 指示入力装置43は、操作者からの指示情報を受け取り、音声処理装置41に処理を実行するよう制御する。音声入力装置42は、任意の音声データを音声処理装置41に入力する。音声処理装置41は、入力された音声データに基づき、本発明の第1乃至第3の実施形態に記載のフレーズ抽出処理を実行し、頻繁に出現するフレーズを含んだクラスタを選択し、さらに選択されたクラスタに含まれるセグメントを抽出する。出力装置44は、音声処理装置41の処理結果を、操作者が指示入力装置43から入力した指示に応じて視覚的あるいは聴覚的手段で出力する。つまり、出力装置44は、操作者が閲覧したいと希望した形態で、処理結果を出力する。 The instruction input device 43 receives the instruction information from the operator and controls the voice processing device 41 to execute the process. The voice input device 42 inputs arbitrary voice data to the voice processing device 41. The speech processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention based on the input speech data, selects clusters including frequently occurring phrases, and further selects them. The segments included in the created cluster are extracted. The output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to an instruction input from the instruction input device 43 by the operator. That is, the output device 44 outputs the processing result in a form that the operator desires to view.
 [効果の説明]
 以上のように、本実施形態における音声処理システム40によれば、指示入力装置43が操作者から入力される指示情報に応じて、音声処理装置41に処理を実行するよう制御する。音声入力装置42が任意の音声データを音声処理装置41に入力する。音声処理装置41が入力された音声データに基づき、本発明の第1乃至第3の実施形態に記載のフレーズ抽出を実行し、頻繁に出現するフレーズ(セグメント)を含んだクラスタを選択し、さらに選択されたクラスタに含まれるセグメントを抽出する。出力装置44が音声処理装置41の処理結果を、操作者が指示入力装置43から入力した指示に応じて視覚的あるいは聴覚的手段で出力する。そのため、本実施形態における音声処理システム40は、音声データに含まれる頻繁に出現するフレーズを含むクラスタやセグメントを出力することが可能である。
[Description of effects]
As described above, according to the speech processing system 40 of the present embodiment, the instruction input device 43 controls the speech processing device 41 to execute processing in accordance with the instruction information input by the operator. The voice input device 42 inputs arbitrary audio data to the speech processing device 41. Based on the input audio data, the speech processing device 41 executes the phrase extraction described in the first to third embodiments of the present invention, selects clusters containing frequently occurring phrases (segments), and extracts the segments contained in the selected clusters. The output device 44 outputs the processing result of the speech processing device 41 by visual or auditory means in accordance with the instruction input by the operator through the instruction input device 43. Therefore, the speech processing system 40 of the present embodiment can output clusters and segments that contain frequently occurring phrases included in the audio data.
 また、本実施形態における音声処理システム40は、操作者が音声からの人物の特定などの分析作業が容易に行える。その理由としては、本実施形態における音声処理システム40は操作者が閲覧したい形態で、処理結果が出力装置44に出力されるように構成されているためである。また、本実施形態における音声処理システム40は、頻繁に出現するフレーズが視覚的、聴覚的に出力されることから、特定の人物がよく話す口癖や話題の傾向などを分析することができる。 In addition, the voice processing system 40 according to the present embodiment allows an operator to easily perform analysis work such as identification of a person from voice. This is because the voice processing system 40 in the present embodiment is configured such that the processing result is output to the output device 44 in a form that the operator wants to browse. In addition, the speech processing system 40 according to the present embodiment can frequently analyze phrases that frequently appear, so that it is possible to analyze a tendency of a talk or a topic that a specific person often speaks.
 (具体例)
 以下、本発明の第1の実施形態の具体例を説明する。図10乃至図12を用いて、音声処理装置10が音声データからフレーズを抽出する一例を説明する。
(Concrete example)
Hereinafter, a specific example of the first embodiment of the present invention will be described. An example in which the speech processing apparatus 10 extracts a phrase from speech data will be described with reference to FIGS.
 上記外部記憶装置が記憶する音声データからフレーズを抽出する一例の詳細について、図10乃至図12を用いて、説明する。図10は、外部記憶装置が記憶する音声データの一例を示す図である。ここで、外部記憶装置は、例えば、第4の実施形態における音声入力装置42によって実現される。 Details of an example of extracting a phrase from audio data stored in the external storage device will be described with reference to FIGS. FIG. 10 is a diagram illustrating an example of audio data stored in the external storage device. Here, the external storage device is realized by, for example, the voice input device 42 in the fourth embodiment.
 図10が示すように、外部記憶装置は、音声データとその音声データの識別子である音声データIDを記憶する。音声データIDが「1」の場合、外部記憶装置は、「・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・」という音声データを記憶する。ここで、外部記憶装置は、図10が示す音声データの内容に限らない。 As shown in FIG. 10, the external storage device stores audio data together with an audio data ID that identifies each piece of audio data. For the audio data ID "1", the external storage device stores the audio data "・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・" ("... We have your child. Prepare the ransom. The meeting place is ..."). The audio data stored in the external storage device is not limited to the content shown in FIG. 10.
 図11は、生成部11が音声データからセグメントを生成する方法の一例を示す図である。図11が示すように、生成部11は、図10が示す音声データ、すなわち音声データID「1」である「・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・」から、複数のセグメントを生成する。図11が示すように、セグメント1は「預かった」、セグメント2は「預かった。身代」である。図11が示すように、生成部11は、音声データを任意(所定の時間等)で細分化し、これらを用いて複数のセグメントを生成する。ここで、生成部11は、音声データから、セグメント同士が重複するようにセグメントを生成する。すなわち、図11が示すように、セグメント1は「預かった」、セグメント2は「預かった。身代」というように、セグメント1及び2では、「預かった」が重複している。これにより、音声処理装置10は、音声データ内から求められるフレーズを抽出できる。 FIG. 11 is a diagram showing an example of how the generation unit 11 generates segments from the audio data. As shown in FIG. 11, the generation unit 11 generates a plurality of segments from the audio data shown in FIG. 10, i.e., the audio data with audio data ID "1". As shown in FIG. 11, segment 1 is "預かった" and segment 2 is "預かった。身代". The generation unit 11 subdivides the audio data into arbitrary units (for example, a predetermined time) and generates a plurality of segments from them. Here, the generation unit 11 generates the segments from the audio data so that the segments overlap one another; that is, as FIG. 11 shows, segment 1 ("預かった") and segment 2 ("預かった。身代") share the overlapping portion "預かった". As a result, the speech processing device 10 can extract the desired phrases from the audio data.
 図12は、クラスタリング部12が、複数のセグメントをまとめたクラスタを生成する方法の一例を示す図である。図12が示すように、クラスタは、例えば、セグメントの内容(フレーズ)の識別子であるクラスタIDと、セグメントの内容と、全ての音声データ内で出現した、セグメントの内容(フレーズ)が出現した出現回数とを含む。なお、図12に示す通り、本具体例では、クラスタIDと、図11で示したセグメントの番号とは同じであるとして説明を行う。クラスタは、例えば、クラスタIDが「1」のフレーズ「預かった」が全ての音声データ内で20回出現したことを示す。すなわち、クラスタリング部12は、図11が示すように生成部11が生成した複数のセグメントから、各セグメント間の類似度を計算し、類似度の高い、すなわち同じセグメント同士をまとめたクラスタを生成する。 FIG. 12 is a diagram illustrating an example of a method in which the clustering unit 12 generates a cluster in which a plurality of segments are collected. As shown in FIG. 12, for example, a cluster is a cluster ID that is an identifier of a segment content (phrase), a segment content, and an appearance of a segment content (phrase) that appears in all audio data. Including the number of times. In this specific example, as shown in FIG. 12, the cluster ID and the segment number shown in FIG. 11 are assumed to be the same. The cluster indicates, for example, that the phrase “deposited” with the cluster ID “1” appears 20 times in all audio data. That is, the clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11 as illustrated in FIG. 11, and generates a cluster having a high similarity, that is, a group of the same segments. .
 選択部13は、クラスタに含まれるセグメントの個数および総時間長を用いてクラスタを比較し、所定の条件を満たすクラスタを選択する。選択部13は、例えば、クラスタリング部12が生成した複数のクラスタの中で、各クラスタに含まれるセグメントの数、つまり、フレーズの出現回数に基づき比較する。図12が示すように、選択部13は出現回数が35回の「身代金を」と出現回数30回の「身代金を用意しろ。」を選択する。次に、選択部13は、各クラスタのサイズに基づき比較する。選択部13は、例えば、出現回数とセグメントの長さ、すなわち時間長との掛け算の結果を各クラスタのサイズとし、各クラスタのサイズが一番大きいクラスタを選択する。 The selection unit 13 compares clusters using the number of segments included in the cluster and the total time length, and selects a cluster that satisfies a predetermined condition. For example, the selection unit 13 compares the number of segments included in each cluster among the plurality of clusters generated by the clustering unit 12, that is, the number of appearances of the phrase. As shown in FIG. 12, the selection unit 13 selects “Ransom” with an appearance count of 35 and “Prepare a ransom” with an appearance count of 30. Next, the selection unit 13 performs comparison based on the size of each cluster. For example, the selection unit 13 uses the result of multiplying the number of appearances and the segment length, that is, the time length as the size of each cluster, and selects the cluster having the largest size of each cluster.
 図12が示すように、選択部13は、例えば、クラスタIDが7のクラスタと、クラスタIDが8のクラスタとを比較する。選択部13は、出現回数35回と「身代金を」の時間長との掛け算の結果と、出現回数30回と「身代金を用意しろ。」の時間長との掛け算の結果とを比較し、セグメントの内容が「身代金を用意しろ。」であるクラスタを選択する。すなわち、「身代金を用意しろ。」というフレーズが所望のフレーズである。また、選択部13は、出現回数が同じクラスタ同士の比較の場合は、セグメントの時間長のみを比較して選定してもよい。なお、選択部13は、上記の方法に限定されず、出現回数や時間長その他音素の数等様々な指標に基づいてサイズを定義し、比較して良い。 As shown in FIG. 12, the selection unit 13 compares, for example, the cluster with cluster ID 7 and the cluster with cluster ID 8. The selection unit 13 compares the product of the appearance count of 35 and the time length of "身代金を" ("the ransom") with the product of the appearance count of 30 and the time length of "身代金を用意しろ。" ("Prepare the ransom."), and selects the cluster whose segment content is "身代金を用意しろ。". That is, the phrase "身代金を用意しろ。" ("Prepare the ransom.") is the desired phrase. When comparing clusters with the same appearance count, the selection unit 13 may compare only the time lengths of the segments. Note that the selection unit 13 is not limited to the above method; the size may be defined and compared based on various indices such as the appearance count, the time length, or the number of phonemes.
 そして、抽出部14は、選択されたクラスタからセグメントを抽出する。これにより、内容が「身代金を用意しろ」であるセグメントである音声データが、抽出される。このセグメントの音声データによって「身代金を用意しろ」というフレーズが頻繁に音声データ中に含まれていることがわかる。 Then, the extraction unit 14 extracts a segment from the selected cluster. As a result, audio data that is a segment whose content is “Prepare ransom” is extracted. It can be seen from the voice data of this segment that the phrase “Prepare ransom” is frequently included in the voice data.
 以上のように、本具体例における音声処理装置10では、例えば、「・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・」という音声データから頻出フレーズである「身代金を用意しろ」を抽出することが可能である。 As described above, the speech processing device 10 of this specific example can extract the frequently occurring phrase "身代金を用意しろ" ("Prepare the ransom") from audio data such as "・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・" ("... We have your child. Prepare the ransom. The meeting place is ...").
 以上、実施形態および具体例を用いて本願発明を説明したが、本発明は必ずしも上記実施形態および具体例に限定されるものではない。本発明の構成や詳細には、本発明のスコープ内で当業者が理解しうる(その技術的思想の範囲内において)様々な変更をし、実施することができる。 As mentioned above, although this invention was demonstrated using embodiment and a specific example, this invention is not necessarily limited to the said embodiment and specific example. Various changes and modifications that can be understood by those skilled in the art within the scope of the present invention (within the scope of the technical idea) can be made to the configuration and details of the present invention.
 この出願は、2015年3月25日に出願された日本出願特願2015-061854を基礎とする優先権を主張し、その開示の全てをここに取り込む。 This application claims priority based on Japanese Patent Application No. 2015-061854 filed on Mar. 25, 2015, the entire disclosure of which is incorporated herein.
 10  音声処理装置
 11  生成部
 12  クラスタリング部
 13  選択部
 14  抽出部
 15  正規化学習部
 16  音声データ正規化部
 17  音声データ処理部
 18  音声データ分類部
 20  音声処理装置
 30  音声処理装置
 40  音声処理システム
 41  音声処理装置
 42  音声入力装置
 43  指示入力装置
 44  出力装置
 101  音声データ記憶部
 102  不特定音響モデル記憶部
 103  パラメタ記憶部
 1000  コンピュータ
 1001  CPU
 1002  主記憶装置
 1003  補助記憶装置
 1004  インターフェース
 1005  入力デバイス
 1006  ディスプレイ装置
DESCRIPTION OF SYMBOLS 10 Speech processing apparatus 11 Generation part 12 Clustering part 13 Selection part 14 Extraction part 15 Normalization learning part 16 Speech data normalization part 17 Speech data processing part 18 Speech data classification part 20 Speech processing apparatus 30 Speech processing apparatus 40 Speech processing system 41 Audio processing device 42 Audio input device 43 Instruction input device 44 Output device 101 Audio data storage unit 102 Unspecified acoustic model storage unit 103 Parameter storage unit 1000 Computer 1001 CPU
1002 Main storage device 1003 Auxiliary storage device 1004 Interface 1005 Input device 1006 Display device

Claims (10)

  1.  音声データから、隣接するセグメントが少なくとも一部重複する複数のセグメントを生成する第1の生成手段と、
     前記複数のセグメントを音韻の類似性に基づき分類してクラスタを生成する第2の生成手段と、
     前記クラスタのサイズに基づいて、所定の条件を満たすクラスタを選択する選択手段と、
     前記選択されたクラスタに含まれるセグメントを抽出する抽出手段と
     を備える音声処理装置。
    First generation means for generating a plurality of segments at least partially overlapping adjacent segments from audio data;
    Second generating means for generating a cluster by classifying the plurality of segments based on phoneme similarity;
    Selection means for selecting a cluster that satisfies a predetermined condition based on the size of the cluster;
    An audio processing apparatus comprising: extraction means for extracting a segment included in the selected cluster.
  2.  複数の音声データに基づき、当該複数の音声データの音響的な性質の違いを正規化するための複数の正規化パラメタを生成する第3の生成手段と、
     前記複数の正規化パラメタを用いて、前記音声データを正規化する正規化手段とをさらに備え、
     前記第1の生成手段は、前記正規化された音声データから前記複数のセグメントを生成する請求項1に記載の音声処理装置。
    Third generation means for generating a plurality of normalization parameters for normalizing differences in acoustic properties of the plurality of sound data based on the plurality of sound data;
    Normalization means for normalizing the audio data using the plurality of normalization parameters;
    The audio processing apparatus according to claim 1, wherein the first generation unit generates the plurality of segments from the normalized audio data.
  3.  前記選択手段は、クラスタに含まれるセグメントの個数または総時間長を用いて前記クラスタを比較し、選択する請求項1または2に記載の音声処理装置。 The speech processing apparatus according to claim 1 or 2, wherein the selection means compares and selects the clusters using the number of segments included in the clusters or the total time length.
  4.  前記第2の生成手段は、前記セグメントを構成する音響特徴量の比較によりセグメント間の類似度を計算する請求項1乃至3のいずれか1項に記載の音声処理装置。 The speech processing apparatus according to any one of claims 1 to 3, wherein the second generation unit calculates a similarity between segments by comparing acoustic feature amounts constituting the segments.
  5.  前記第2の生成手段は、前記セグメント間のDP(Dynamic Programming)マッチングにより類似度を生成する請求項1または2に記載の音声処理装置。 The speech processing apparatus according to claim 1 or 2, wherein the second generation unit generates a similarity by DP (Dynamic Programming) matching between the segments.
  6.  音声データを音響的な性質の違いに基づいて分類してクラスタを生成する第4の生成手段をさらに備え、
     前記第3の生成手段は、前記クラスタに対して正規化パラメタを生成する請求項2記載の音声処理装置。
    A fourth generation means for generating a cluster by classifying the audio data based on a difference in acoustic properties;
    The speech processing apparatus according to claim 2, wherein the third generation unit generates a normalization parameter for the cluster.
  7.  前記第4の生成手段および前記学習手段は、相互の結果に基づき、前記結果が収束するまで又は実行回数が所定の閾値に達するまで交互に反復実行する請求項6記載の音声処理装置。 7. The speech processing apparatus according to claim 6, wherein the fourth generation means and the learning means are repeatedly executed alternately until the result converges or the number of executions reaches a predetermined threshold based on a mutual result.
  8.  A speech processing method comprising:
      generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap;
      classifying the plurality of segments based on phonological similarity to generate clusters;
      selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and
      extracting the segments included in the selected cluster.
  9.  A computer-readable recording medium storing a program that causes a computer to execute:
      a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap;
      a process of classifying the plurality of segments based on phonological similarity to generate clusters;
      a process of selecting one or more clusters that satisfy a predetermined condition based on the size of the clusters; and
      a process of extracting the segments included in the selected clusters.
  10.  A speech processing system comprising:
      an instruction input device that receives instruction information from an operator;
      a speech input device that inputs speech data to a speech processing device;
      the speech processing device according to any one of claims 1 to 7, which executes processing on the input speech data based on the instruction information; and
      an output device that outputs a processing result of the speech processing device,
      wherein the output device outputs the processing result according to the instruction information.
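
As an informal illustration only, a minimal Python/NumPy sketch of the processing recited in claims 1, 8, and 9 is given below: overlapping segments are generated from a feature sequence, grouped into clusters by similarity, and the segments of sufficiently large clusters are extracted. The MFCC-like random features, the fixed segment length and hop, the cosine-similarity threshold, and the greedy clustering rule are assumptions made for the sketch and are not part of the claimed method.

```python
import numpy as np

def make_overlapping_segments(features, seg_len=20, hop=10):
    """Cut a (num_frames, dim) feature array into segments; adjacent
    segments overlap by (seg_len - hop) frames."""
    return [features[s:s + seg_len]
            for s in range(0, len(features) - seg_len + 1, hop)]

def cluster_by_similarity(segments, threshold=0.8):
    """Greedy clustering: add a segment to the first cluster whose centroid
    is similar enough (cosine), otherwise start a new cluster."""
    clusters, centroids = [], []
    for idx, seg in enumerate(segments):
        vec = seg.mean(axis=0)                      # crude fixed-length summary
        vec = vec / (np.linalg.norm(vec) + 1e-9)
        for c, cen in enumerate(centroids):
            if float(vec @ cen) >= threshold:
                clusters[c].append(idx)
                cen = cen + (vec - cen) / len(clusters[c])     # approximate running mean
                centroids[c] = cen / (np.linalg.norm(cen) + 1e-9)
                break
        else:
            clusters.append([idx])
            centroids.append(vec)
    return clusters

def select_and_extract(segments, clusters, min_size=3):
    """Keep clusters whose size meets the condition and return their segments."""
    return [[segments[i] for i in c] for c in clusters if len(c) >= min_size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motif = rng.standard_normal((20, 13))           # a recurring "pattern"
    feats = np.concatenate([motif if i % 3 == 0 else rng.standard_normal((20, 13))
                            for i in range(10)])    # stand-in for MFCC frames
    segs = make_overlapping_segments(feats)
    frequent = select_and_extract(segs, cluster_by_similarity(segs))
    print(f"{len(segs)} segments, {len(frequent)} frequent cluster(s)")
```

Because the synthetic feature sequence repeats one motif several times, the segments aligned with that motif fall into a single cluster that passes the size condition, which is the "frequent pattern" behaviour the claims describe.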
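Claim 5 generates inter-segment similarity by DP (Dynamic Programming) matching. One plausible reading is a dynamic-time-warping distance between two variable-length feature sequences, converted to a similarity score; the sketch below assumes Euclidean frame distances, length normalization, and an exp(-d) conversion, none of which are specified by the claim.

```python
import numpy as np

def dp_matching_distance(a, b):
    """DTW distance between sequences a (m, dim) and b (n, dim) using
    Euclidean frame distances and the standard three-way recursion."""
    m, n = len(a), len(b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[m, n] / (m + n)                       # length-normalized path cost

def dp_matching_similarity(a, b):
    """Map the distance to a similarity in (0, 1]; identical sequences give 1.0."""
    return float(np.exp(-dp_matching_distance(a, b)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((18, 13))
    y = rng.standard_normal((25, 13))
    print(round(dp_matching_similarity(x, x), 3))     # identical sequences -> 1.0
    print(round(dp_matching_similarity(x, y), 3))     # unrelated sequences -> lower
```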
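Claims 2, 6, and 7 together describe clustering speech data by acoustic properties, learning a set of normalization parameters per cluster, and alternating the two steps until the results converge or the number of executions reaches a threshold. The sketch below illustrates such a loop using per-cluster mean/variance normalization and a nearest-mean reassignment rule; both are assumptions chosen for brevity rather than the learning and clustering actually described in the specification.

```python
import numpy as np

def learn_normalization(recordings, assignment, num_clusters):
    """One (mean, std) pair per cluster, pooled over the recordings assigned to it."""
    params, dim = [], recordings[0].shape[1]
    for c in range(num_clusters):
        members = [r for r, a in zip(recordings, assignment) if a == c]
        pooled = np.concatenate(members) if members else np.zeros((1, dim))
        params.append((pooled.mean(axis=0), pooled.std(axis=0) + 1e-9))
    return params

def assign_clusters(recordings, params):
    """Re-assign each recording to the cluster whose mean it is closest to."""
    return [int(np.argmin([np.linalg.norm(r.mean(axis=0) - mu) for mu, _ in params]))
            for r in recordings]

def normalize(recordings, assignment, params):
    """Apply the selected cluster's mean/variance normalization to each recording."""
    return [(r - params[a][0]) / params[a][1] for r, a in zip(recordings, assignment)]

def alternate(recordings, num_clusters=2, max_iter=10):
    """Alternate parameter learning and cluster assignment until the
    assignment stops changing or the iteration cap is reached (claim 7)."""
    assignment = [i % num_clusters for i in range(len(recordings))]  # arbitrary start
    for _ in range(max_iter):
        params = learn_normalization(recordings, assignment, num_clusters)
        new_assignment = assign_clusters(recordings, params)
        if new_assignment == assignment:              # convergence check
            break
        assignment = new_assignment
    return normalize(recordings, assignment, params), assignment

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    recs = [rng.standard_normal((50, 13)) + (5 if i % 2 else 0) for i in range(6)]
    normalized, labels = alternate(recs)
    print(labels)   # recordings with shifted statistics end up in their own cluster
```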
PCT/JP2016/001593 2015-03-25 2016-03-18 Speech processing device, speech processing system, speech processing method, and recording medium WO2016152132A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2017507495A JP6784255B2 (en) 2015-03-25 2016-03-18 Speech processing device, speech processing system, speech processing method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015061854 2015-03-25
JP2015-061854 2015-03-25

Publications (1)

Publication Number Publication Date
WO2016152132A1 (en) 2016-09-29

Family

ID=56978310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/001593 WO2016152132A1 (en) 2015-03-25 2016-03-18 Speech processing device, speech processing system, speech processing method, and recording medium

Country Status (2)

Country Link
JP (1) JP6784255B2 (en)
WO (1) WO2016152132A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008515012A (en) * 2004-09-28 2008-05-08 Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V. Apparatus and method for grouping time segments of music
JP2008533580A (en) * 2005-03-10 2008-08-21 Koninklijke Philips Electronics N.V. Summary of audio and/or visual data
JP2007140136A (en) * 2005-11-18 2007-06-07 Mitsubishi Electric Corp Music analysis device and music search device
JP2010032792A (en) * 2008-07-29 2010-02-12 Nippon Telegr & Teleph Corp <Ntt> Speech segment speaker classification device and method therefor, speech recognition device using the same and method therefor, program and recording medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613249A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Voice analysis method and equipment
CN113380273A (en) * 2020-08-10 2021-09-10 腾擎科研创设股份有限公司 System for detecting abnormal sound and judging formation reason
CN113178196A (en) * 2021-04-20 2021-07-27 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
JP6784255B2 (en) 2020-11-11
JPWO2016152132A1 (en) 2018-01-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 16768039
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2017507495
    Country of ref document: JP
    Kind code of ref document: A
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 16768039
    Country of ref document: EP
    Kind code of ref document: A1