WO2016152132A1 - Speech processing device, speech processing system, speech processing method, and recording medium - Google Patents


Info

Publication number
WO2016152132A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
segments
voice
unit
speech
Prior art date
Application number
PCT/JP2016/001593
Other languages
French (fr)
Japanese (ja)
Inventor
孝文 越仲
鈴木 隆之
Original Assignee
日本電気株式会社
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2017507495A (patent JP6784255B2)
Publication of WO2016152132A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval

Definitions

  • the present invention relates to a voice processing device, a voice processing system, a voice processing method, and a recording medium that extract frequent patterns from voice data.
  • in fingerprint identification, which is a typical example, fingerprint images collected at the crime scene are sequentially compared with a large number of known fingerprint images to estimate who was involved in the crime.
  • a technique similar to fingerprint examination that deals with voice instead is called voiceprint examination or voice examination.
  • Patent Document 1 describes a technique for extracting speech data of unknown words that are candidate keywords to be registered in a speech recognition dictionary from speech data.
  • the technique described in Patent Document 1 detects, as an utterance section, a section in which the speech power value of the speech data remains above a threshold th1 for at least a certain time, and then divides each utterance section into sections in which the power value remains above a threshold th2 for at least a certain time.
  • the technique described in Patent Document 1 acquires a phoneme string from the divided speech data, performs clustering, calculates an evaluation value, detects an unknown word, and registers it in the dictionary.
  • Patent Document 2 describes a technique for determining a factor causing misrecognition and notifying a user.
  • the technique described in Patent Document 2 divides a mel cepstrum coefficient (Mel-Frequency Cepstrum Coefficients; hereinafter referred to as “MFCC”) vector sequence extracted by a feature extraction unit into segments for each phoneme using a set of standard models.
  • the technique described in Patent Document 2 investigates the cause of misrecognition, creates a character string of a message to be presented to the user according to the analysis result, and notifies the user by displaying the message on a display.
  • in the technique described in Patent Document 1, an unknown word that is a keyword candidate can be selected, but a phrase containing a sentence (for example, a sentence such as "Prepare a ransom") cannot be selected.
  • in the technique described in Patent Document 2, a vector sequence for each segment that is erroneously recognized can be analyzed, but a desired phrase cannot be selected.
  • the techniques described in Patent Documents 1 and 2 have a problem that a desired phrase cannot be selected.
  • An object of the present invention is to provide an audio processing device, an audio processing system, an audio processing method, and a recording medium that can solve the above-described problems and can select a desired phrase.
  • a speech processing apparatus according to one aspect of the present invention includes: first generation means for generating, from speech data, a plurality of segments such that adjacent segments at least partially overlap; second generation means for classifying the plurality of segments based on phoneme similarity to generate clusters; selection means for selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and extraction means for extracting the segments included in the selected cluster.
  • the speech processing method generates a plurality of segments in which adjacent segments at least partially overlap from speech data, classifies the plurality of segments based on phoneme similarity, and generates a cluster. Based on the size of the cluster, a cluster satisfying a predetermined condition is selected, and segments included in the selected cluster are extracted.
  • a recording medium according to one aspect of the present invention is a computer-readable recording medium storing a program that causes a computer to execute: a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap; a process of classifying the plurality of segments based on phoneme similarity to generate clusters; a process of selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and a process of extracting the segments included in the selected cluster.
  • the present invention has an effect that a desired phrase can be selected in a voice processing device, a voice processing system, a voice processing method, and a program.
  • in the voice examination method related to the present invention, for example, a telephone call containing a kidnapper's ransom demand or a terrorist's crime notice is recorded, and the recorded voice is compared with known voices in an attempt to identify the caller.
  • unlike fingerprints, which remain unchanged throughout life, voice changes each time depending on what is spoken. Therefore, in the voice examination method, a portion (section) of the voice in which the same content is spoken is cut out and compared. For example, in a kidnapper's ransom demand the phrase "prepare the money" is expected to appear frequently, so such a phrase is found and cut out and compared with another voice in which "prepare the money" is spoken.
  • FIG. 1 is a block diagram illustrating a configuration example of a voice processing device 10 according to the first embodiment of the present invention.
  • the speech processing apparatus 10 includes a generation unit 11, a clustering unit 12, a selection unit 13, and an extraction unit 14.
  • the generation unit 11 is also referred to as a first generation unit.
  • the clustering unit 12 is also referred to as a second generation unit.
  • the generation unit 11 generates, from the audio data stored in an external storage device, a plurality of segments in which adjacent segments at least partially overlap. For example, the generation unit 11 subdivides the audio data stored in the external storage device into short time units and generates a plurality of segments using the subdivided audio data. The time length of the plurality of segments generated by the generation unit 11 may be a fixed time length. The generation unit 11 may also divide one piece of audio data a plurality of times with different time lengths and generate segments of various time lengths using the divided audio data.
  • the clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster.
  • the selection unit 13 selects at least one cluster from the generated clusters based on the size of each cluster.
  • the extraction unit 14 extracts segments included in the selected cluster.
  • the size of a cluster is, for example, the total time length of the segments, the result obtained by multiplying the number of appearances of the segment content (also referred to as a phrase) by the segment length, or the like.
  • FIG. 2 is a schematic block diagram showing a configuration example of the computer 1000 in each embodiment and specific example of the present invention.
  • the computer 1000 includes a CPU (Central Processing Unit) 1001, a main storage device 1002, an auxiliary storage device 1003, an interface 1004, an input device 1005, and a display device 1006.
  • the voice processing apparatus 10 and the like of each embodiment and a specific example are mounted on a computer 1000. Operations of the voice processing device 10 and the like are stored in the auxiliary storage device 1003 in the form of a program.
  • the CPU 1001 reads out the program from the auxiliary storage device 1003, develops it in the main storage device 1002, and executes the above processing according to the program. For example, the CPU 1001 reads out the above program from the auxiliary storage device 1003 and develops it in the main storage device 1002, thereby realizing the functions of the generation unit 11, clustering unit 12, selection unit 13, and extraction unit 14.
  • the auxiliary storage device 1003 is an example of a tangible medium that is not temporary.
  • other examples of the non-transitory tangible medium include a magnetic disk, a magneto-optical disk, a CD (Compact Disc)-ROM (Read Only Memory), a DVD (Digital Versatile Disc)-ROM, and a semiconductor memory connected via the interface 1004.
  • when this program is distributed to the computer 1000 via a communication line, the computer 1000 that has received the distribution may load the program into the main storage device 1002 and execute the above processing.
  • the interface 1004 is connected to the CPU 1001 and is connected to a network or an external storage medium. External data may be taken into the CPU 1001 via the interface 1004.
  • the input device 1005 is, for example, a keyboard, a mouse, a touch panel, or a microphone.
  • the display device 1006 displays a screen corresponding to drawing data processed by a CPU 1001 or a GPU (Graphics Processing Unit) (not shown) such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display. It is a device to do. Note that the hardware configuration illustrated in FIG. 2 is merely an example, and each unit illustrated in FIG. 2 may be configured with independent logic circuits.
  • the program may be for realizing a part of the above-described processing.
  • the program may be a differential program that realizes the above-described processing in combination with another program already stored in the auxiliary storage device 1003.
  • FIG. 3 is a flowchart showing an operation example of the speech processing apparatus 10 according to the first embodiment of the present invention.
  • the generation unit 11 generates a plurality of segments from the audio data stored in the external storage device (step S101). At this time, the generation unit 11 generates a plurality of segments so that adjacent segments have at least temporal overlap.
  • the time length of the segment may be a constant value in the range of 1 second to several seconds, for example, depending on the assumed time length of the phrase.
  • the generation unit 11 may generate segments of various time lengths by dividing the audio data a plurality of times with different time lengths. The generation unit 11 may also divide the audio data at predetermined change points or silent sections using the method described in Non-Patent Document 1 and generate variable-length segments using the plurality of divided audio data.
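As an illustration of this segment generation (not part of the patent text), a minimal Python sketch is shown below; the window length, hop length, and sampling rate are assumed values, and the overlap between adjacent segments comes from using a hop shorter than the window.

```python
import numpy as np

def make_segments(samples, sample_rate, seg_len_s=2.0, hop_s=0.5):
    """Cut audio into fixed-length segments so that adjacent segments overlap.

    seg_len_s and hop_s are illustrative values; a hop shorter than the
    segment length produces the partial overlap described above.
    """
    seg_len = int(seg_len_s * sample_rate)
    hop = int(hop_s * sample_rate)
    segments = []
    for start in range(0, max(len(samples) - seg_len, 0) + 1, hop):
        segments.append((start / sample_rate, samples[start:start + seg_len]))
    return segments  # list of (start time in seconds, waveform slice)

# Several passes with different segment lengths, as the text suggests:
# audio = np.asarray(...)  # mono waveform loaded elsewhere
# all_segments = []
# for L in (1.0, 2.0, 3.0):
#     all_segments += make_segments(audio, 8000, seg_len_s=L, hop_s=L / 4)
```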
  • the clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster (step S102). That is, the clustering unit 12 clusters a plurality of segments.
  • the clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11, and generates a cluster in which the segments with high similarity are collected.
  • the method described in Non-Patent Document 1 can be used.
  • the similarity index is an index for measuring the similarity of phonemes constituting a segment.
  • the similarity index is, for example, an index that uses statistics of the acoustic features, such as the Bhattacharyya distance calculated from the mean and variance of the acoustic feature sequence, the Kullback-Leibler divergence, or the log-likelihood ratio. These indices do not consider the order of the acoustic feature sequence within a segment.
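As one hedged example of such a statistics-based index (illustrative only, not code from the patent), the Bhattacharyya distance between two segments can be computed from the per-segment mean and variance of their feature sequences, assuming diagonal covariances.

```python
import numpy as np

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariance.

    mean*, var* are 1-D arrays holding the mean and variance of a segment's
    acoustic feature (e.g. MFCC) sequence; the order of frames is ignored.
    """
    v = (var1 + var2) / 2.0
    term_mean = 0.125 * np.sum((mean1 - mean2) ** 2 / v)
    term_var = 0.5 * (np.sum(np.log(v))
                      - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
    return term_mean + term_var
```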
  • as the similarity index, an index that considers order, that is, temporal order, may also be used.
  • the method using the similarity index is, for example, a DP matching method that calculates the degree of similarity by obtaining an optimal correspondence of each acoustic feature amount between segments by dynamic programming (hereinafter referred to as “DP”).
  • DP dynamic programming
  • the acoustic feature amount is, for example, MFCC.
  • MFCC is widely used for voice recognition and the like.
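As a hedged illustration of such DP matching (this is generic dynamic time warping, not the patent's own implementation), two MFCC sequences can be aligned and scored as follows.

```python
import numpy as np

def dp_matching_distance(feats_a, feats_b):
    """Align two MFCC sequences (frames x dims) by dynamic programming and
    return the average per-frame distance along the optimal path."""
    n, m = len(feats_a), len(feats_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(feats_a[i - 1] - feats_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)

# A smaller distance means the two segments are more likely to contain the same
# phrase; a similarity can be defined, for example, as the negative distance.
```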
  • when the clustering in the clustering unit 12 has converged (Yes in step S103), the selection unit 13 selects a cluster that satisfies a predetermined condition from the clusters generated by the clustering unit 12, based on the size of each cluster (step S104). In this selection, the selection unit 13 compares cluster sizes from the viewpoint of finding frequently occurring phrases, and selects at least one cluster in descending order of size. Examples of the predetermined condition include a larger number of segments, a longer total time length of the segments, and the like. That is, the selection unit 13 selects, for example, a cluster including more segments or a cluster having a longer total segment length as a cluster that satisfies the predetermined condition.
  • the case where the clustering converges is, for example, a situation where Step S101 and Step S102 are executed a predetermined number of times, a situation where the increase or decrease of the predetermined evaluation value related to clustering is a predetermined value or less, and the like.
  • the case where clustering converges may also be a situation in which segments no longer move between clusters, which corresponds to the situation in which the change in the predetermined evaluation value related to clustering is equal to or less than a certain value.
  • the extraction unit 14 extracts segments from the one or more segments included in the cluster selected by the selection unit 13 (step S105). Thereby, the extraction unit 14 can extract, from the audio data, the segments of the portions corresponding to a desired phrase.
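A minimal sketch of steps S104 and S105 follows; the data classes and the choice of total segment length as the size measure are illustrative assumptions, not prescribed by the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    start: float      # seconds into the source audio
    duration: float   # seconds

@dataclass
class Cluster:
    phrase_id: int
    segments: List[Segment] = field(default_factory=list)

def cluster_size(cluster: Cluster) -> float:
    # One of the size criteria mentioned above: total time length of segments.
    # The number of segments, or count x length, could be used instead.
    return sum(seg.duration for seg in cluster.segments)

def select_and_extract(clusters, top_k=1):
    """Step S104: pick the largest cluster(s); step S105: return their segments."""
    ranked = sorted(clusters, key=cluster_size, reverse=True)
    return [seg for c in ranked[:top_k] for seg in c.segments]
```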
  • if the clustering in the clustering unit 12 has not converged (No in step S103), the process returns to step S101.
  • step S101 and step S102 depend on each other and may be repeated a predetermined number of times or until convergence.
  • FIG. 4 is a diagram illustrating an example of a method in which the speech processing apparatus 10 extracts a phrase using the HMM. That is, the speech processing apparatus 10 learns an HMM as shown in FIG. 4 based on the maximum likelihood estimation method using speech data stored in the external storage device.
  • as a result, a one-way (left-to-right) HMM expressing the first phrase (phrase 1 in FIG. 4), the second phrase (phrase 2 in FIG. 4), ..., and the Nth phrase (phrase N in FIG. 4) is automatically formed, and at the same time the segments belonging to each phrase are also obtained.
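As an illustration only, the sketch below fits one left-to-right Gaussian HMM to the segments of a single phrase cluster by maximum-likelihood (EM) training; the hmmlearn library, the number of states, and the transition initialization are assumptions, not named in the patent.

```python
import numpy as np
from hmmlearn import hmm  # assumed dependency, not specified in the patent

def fit_left_to_right_hmm(segment_mfccs, n_states=5):
    """Fit one left-to-right Gaussian HMM to the MFCC sequences of the segments
    belonging to a single phrase cluster (maximum-likelihood / EM training).

    segment_mfccs: list of arrays, each of shape (frames, n_mfcc).
    """
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, init_params="mc")
    # Left-to-right topology: start in state 0, only self-loops and forward steps.
    model.startprob_ = np.r_[1.0, np.zeros(n_states - 1)]
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    model.transmat_ = trans
    X = np.vstack(segment_mfccs)
    lengths = [len(s) for s in segment_mfccs]
    model.fit(X, lengths)  # zero transitions stay zero, preserving the topology
    return model
```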
  • as described above, in the present embodiment, the generation unit 11 generates a plurality of segments from the speech data so that adjacent segments at least partially overlap, and the clustering unit 12 classifies the plurality of segments based on phoneme similarity to generate clusters.
  • the selection unit 13 selects at least one cluster from the clusters based on the size of each cluster.
  • since the extraction unit 14 extracts the segments included in the selected cluster, segments corresponding to a desired phrase can be extracted from the speech data. The reason is that, because the generation unit 11 generates a plurality of segments from the audio data so that adjacent segments at least partially overlap, anything from a unit shorter than a word to a phrase longer than a word can be generated as a single segment.
  • with the speech processing apparatus 10 in the present embodiment, frequent phrases necessary for voice examination can be found and selected at low cost, even by someone who is not an expert examiner.
  • the reason is that the speech processing apparatus 10 generates segments such that adjacent segments at least partially overlap from the given speech data, clusters the segments, selects a cluster containing many similar segments, and then extracts the segments included in the cluster selected in this way.
  • the extracted segment is a segment generated by the generation unit 11 and is partial audio data including a desired phrase in the audio data. This is because the speech processing apparatus 10 can automatically find frequently occurring phrases in the speech data.
  • FIG. 5 is a block diagram illustrating a configuration example of the sound processing device 20 according to the second embodiment of the present invention.
  • the speech processing apparatus 20 according to the second embodiment of the present invention includes a normalization learning unit 15, a speech data normalization unit 16, a speech data processing unit 17, first to Nth speech data storage units (101-1 to 101-N, where N is a positive integer), an unspecified acoustic model storage unit 102, and first to Nth parameter storage units (103-1 to 103-N).
  • the normalization learning unit 15 is also referred to as a third generation unit.
  • the first to Nth audio data storage units (101-1 to 101-N) are referred to as the audio data storage unit 101 when they are not distinguished or collectively referred to.
  • the first to Nth parameter storage units (103-1 to 103-N) are referred to as the parameter storage unit 103 when they are not distinguished or collectively referred to.
  • the audio data storage unit 101 stores audio data having different properties. That is, the first audio data storage unit 101-1, the second audio data storage unit 101-2, ..., and the Nth audio data storage unit 101-N each store audio data having different properties. The audio data of different properties stored in the first audio data storage unit 101-1, the second audio data storage unit 101-2, ..., and the Nth audio data storage unit 101-N are each audio data classified based on acoustic properties.
  • the unspecified acoustic model storage unit 102 stores the unspecified acoustic model learned by the normalization learning unit 15.
  • the unspecified acoustic model is a model obtained by normalizing a difference between audio data having different properties stored in the audio data storage unit 101.
  • the parameter storage unit 103 stores parameters for normalizing the differences between the audio data. That is, the first parameter storage unit 103-1, the second parameter storage unit 103-2, ..., and the Nth parameter storage unit 103-N each store parameters for normalizing the differences between the audio data.
  • the normalization learning unit 15 performs normalization learning using audio data having different properties stored in the audio data storage unit 101.
  • normalization learning is an acoustic model learning method described in Non-Patent Document 2, for example.
  • each phoneme i is defined by an average vector ⁇ i of acoustic feature values.
  • the average vector can be changed depending on the property of speech data. That is, in the present embodiment, the average vector (unspecified acoustic model) ⁇ i is expressed by affine transformation as shown in the following formula (1).
  • here, s = 1, 2, ..., N.
  • A_s and b_s are parameters for normalizing the differences in the properties of the audio data.
  • Equation (1) provides the unspecified acoustic model μ_i, which is not affected by differences in the properties of the speech data, and the parameters A_s and b_s for normalizing those differences. The normalization learning unit 15 stores the unspecified acoustic model μ_i in the unspecified acoustic model storage unit 102. In addition, the normalization learning unit 15 stores the parameters A_s and b_s in the parameter storage unit 103. Specifically, the normalization learning unit 15 stores the parameters A_1 and b_1 in the first parameter storage unit 103-1, ..., and the parameters A_N and b_N in the Nth parameter storage unit 103-N.
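Formula (1) itself is not reproduced in this text; judging from the surrounding description, it expresses the phoneme mean observed in data of property s as an affine transform of the shared (unspecified) mean. A hedged numpy sketch of that assumed form follows; the estimation of A_s, b_s, and μ_i (EM-style normalization training as in Non-Patent Document 2) is omitted.

```python
import numpy as np

def adapted_mean(mu_i, A_s, b_s):
    """Assumed form of formula (1): the mean vector of phoneme i as observed in
    speech data of property s, modeled as an affine transform of the shared
    (unspecified acoustic model) mean mu_i.

    mu_i: (d,) shared mean, A_s: (d, d) matrix, b_s: (d,) bias for source s.
    """
    return A_s @ mu_i + b_s
```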
  • Non-Patent Document 2 describes a method of normalizing the difference between speakers assuming that the nature of speech data varies depending on the speaker.
  • however, the difference in the properties of speech data is not limited to the speaker; various assumptions are possible, such as background noise, the microphone, or the communication line.
  • the normalization learning unit 15 generates a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101 and stores the normalization parameter in the parameter storage unit 103.
  • the normalization learning unit 15 normalizes the differences between the audio data of different properties stored in the audio data storage unit 101 to learn the unspecified acoustic model, and stores the learned unspecified acoustic model in the unspecified acoustic model storage unit 102.
  • the normalization learning unit 15 generates the normalization parameter by estimating a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101.
  • the normalization learning unit 15 stores the unspecified acoustic model in the unspecified acoustic model storage unit 102 at each iteration.
  • the audio data normalization unit 16 refers to the parameters stored in the parameter storage unit 103, normalizes the audio data stored in each of the audio data storage units 101, and sends the result to the audio data processing unit 17. Specifically, to the time series x_1, x_2, ..., x_t, ... (t is a positive integer) of acoustic features of the sth audio data, the sth parameters are applied using expression (2), which is a transformation corresponding to the inverse of expression (1).
  • the parameters that define the normalization may differ depending on the phoneme class (fricative, plosive, etc.), or may differ depending on the preceding and following phonemes in consideration of context dependency.
  • the audio data normalization unit 16 may normalize not only the mean vector of the acoustic features but also the variance. The present invention is not limited to these, and various techniques known for normalization learning may be applied.
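As a hedged illustration of expression (2), whose exact form is likewise not reproduced in this text, the sketch below applies the assumed inverse affine transform to the feature frames of the sth audio data; variance normalization, mentioned above as an option, is not shown.

```python
import numpy as np

def normalize_frames(frames, A_s, b_s):
    """Map feature frames x_1, x_2, ... of the sth audio data back into the
    space of the unspecified acoustic model, roughly x_hat = A_s^{-1} (x - b_s).

    frames: (T, d) array of acoustic feature vectors for one recording.
    """
    A_inv = np.linalg.inv(A_s)
    return (frames - b_s) @ A_inv.T
```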
  • the voice data processing unit 17 has the same configuration and effects as the voice processing apparatus 10 in the first embodiment. That is, the voice data processing unit 17 performs the processing of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14 illustrated in FIG. 1 in the same manner as in the first embodiment, and outputs segments containing phrases that appear frequently in the normalized voice data.
  • FIG. 6 is a flowchart showing an operation example of the speech processing apparatus 20 according to the second embodiment of the present invention.
  • the operation of the audio data processing unit 17 in this embodiment, that is, steps S204 to S208, is the same as the operation of the audio processing device 10 in the first embodiment, that is, steps S101 to S105, and therefore its description is omitted.
  • the normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S201).
  • the normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the speech data in the unspecified acoustic model storage unit 102 (step S202).
  • the audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and normalizes the audio data stored in the audio data storage unit 101, respectively (step S203).
  • the voice data processing unit 17 performs the same processing as steps S101 to S105 of the voice processing apparatus 10 in the first embodiment shown in FIG. 3, and outputs segments including phrases that frequently appear in the voice data (steps S204 to S208).
  • the normalization learning unit 15 reads each speech data from the speech data storage unit 101, performs normalization learning, and sets the normalization parameter of each speech data. Store in the parameter storage unit 103.
  • the normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102.
  • the voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in the voice data storage unit 101, respectively.
  • the voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 20 in the present embodiment can normalize speech data that has not been normalized and select a desired phrase.
  • the normalization learning unit 15 learns to normalize the differences in acoustic properties among the first speech data, the second speech data, ..., and the Nth speech data.
  • the voice data processing unit 17 extracts segments including phrases that frequently appear in the voice data. Therefore, the voice processing device 20 can more accurately extract phrases that frequently appear in the voice data.
  • this is because the situation in which the clustering unit 12 in the speech data processing unit 17 is affected by differences in the properties of the speech data and generates inappropriate clusters (for example, speaker clusters) can be reduced.
  • FIG. 7 is a block diagram illustrating a configuration example of the sound processing device 30 according to the third embodiment of the present invention.
  • the speech processing device 30 according to the third embodiment of the present invention includes, in addition to the configuration of the speech processing device 20 according to the second embodiment, an unclassified speech data storage unit 104 and a speech data classification unit 18.
  • the audio data classification unit 18 is also described as a fourth generation unit.
  • the uncategorized voice data storage unit 104 stores voice data.
  • the voice data classification unit 18 classifies the voice data stored in the voice data storage unit 104 based on the acoustic properties, and stores the voice data in the voice data storage unit 101.
  • the voice data classifying unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 into N clusters based on differences in acoustic properties, for example, differences in speaker, and stores each cluster in the voice data storage unit 101. That is, the voice data classification unit 18 generates N clusters by classifying the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties. Then, the voice data classifying unit 18 stores the first cluster in the first voice data storage unit, the second cluster in the second voice data storage unit, ..., and the Nth cluster in the Nth voice data storage unit.
  • the audio data stored in the unclassified audio data storage unit 104 may be a mixture of audio data having various acoustic properties.
  • N may be a predetermined constant, or may be automatically determined by the audio data classification unit 18 according to the processing target. These can be implemented by applying a known clustering method.
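As an illustration only (the patent does not prescribe a specific clustering method), the following sketch groups recordings into N clusters with a plain k-means over per-recording mean feature vectors; the feature choice is an assumption standing in for "difference in acoustic properties", such as different speakers.

```python
import numpy as np

def classify_recordings(recording_feats, n_clusters, n_iter=50, seed=0):
    """Assign each recording to one of n_clusters by k-means on the mean of its
    acoustic feature frames.

    recording_feats: list of (frames, dims) arrays, one per recording.
    Returns an array with a cluster index for each recording.
    """
    X = np.stack([f.mean(axis=0) for f in recording_feats])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```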
  • the audio data storage unit 101 stores the audio data classified by the audio data classification unit 18.
  • FIG. 8 is a flowchart showing an operation example of the speech processing apparatus 30 according to the third embodiment of the present invention.
  • the operation of the audio data processing unit 17 in this embodiment, that is, steps S306 to S310, is the same as the operation of the audio processing device 10 in the first embodiment, that is, steps S101 to S105, and therefore its description is omitted.
  • the voice data classification unit 18 classifies the voice data stored in the voice data storage unit 104 based on the acoustic properties, and stores the voice data in the voice data storage unit 101 (step S301).
  • the normalization learning unit 15 reads each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S302).
  • the normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the voice data in the unspecified acoustic model storage unit 102 (step S303).
  • the speech data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the speech data stored in each of the speech data storage units 101 (step S305).
  • when the results of the speech data classification unit 18 and the normalization learning unit 15 have not converged (No in step S304), the process returns to step S301. Thereby, the speech data classification unit 18 and the normalization learning unit 15 can be executed alternately and repeatedly until the results converge.
  • the results output by the speech data classification unit 18 and the normalization learning unit 15 may depend on each other. Therefore, they may be operated repeatedly a predetermined number of times or until the results converge.
  • Such an operation can be carried out efficiently based on an optimization criterion such as likelihood maximization following the method described in Non-Patent Document 3.
  • the voice data processing unit 17 performs the same processing as steps S101 to S105 of the voice processing apparatus 10 in the first embodiment shown in FIG. 3, and outputs segments including phrases that frequently appear in the voice data (steps S306 to S310).
  • the audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on the acoustic properties and stores the result in the audio data storage unit 101.
  • the normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameters of each voice data in the parameter storage unit 103.
  • the normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102.
  • the sound data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the sound data stored in the sound data storage unit 101, respectively.
  • the voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 30 in the present embodiment can classify and normalize speech data that has not been classified and normalized, and select a desired phrase.
  • in addition, in this embodiment, the speech data classification unit 18 classifies the speech data into N clusters based on differences in acoustic properties, and the normalization learning unit 15 is configured to perform normalization learning using that result. Therefore, the voice processing device 30 in the present embodiment can reduce the cost of preparing the voice data compared with the voice processing device 20 in the second embodiment. The reason is that the speech processing apparatus 30 in the present embodiment does not need the speech data to be divided in advance according to differences in acoustic properties (for example, for each speaker), and a set of various speech data can be given and processed all at once.
  • FIG. 9 is a block diagram showing a configuration example of the speech processing system 40 in the fourth embodiment of the present invention.
  • the voice processing system 40 in the fourth embodiment includes a voice processing device 41, a voice input device 42, an instruction input device 43, and an output device 44.
  • the voice processing device 41 executes, on the input voice, the processing of the voice processing device 10 in the first embodiment of the present invention, the processing of the voice processing device 20 in the second embodiment, or the processing of the voice processing device 30 in the third embodiment (hereinafter referred to as the "phrase extraction processing described in the first to third embodiments of the present invention").
  • the voice input device 42 inputs voice.
  • the audio input device 42 is an arbitrary device that functions as an interface for inputting arbitrary audio data to the audio processing device 41, that is, a microphone that receives audio signals as data, a memory that records audio data, and the like.
  • the voice input device 42 is, for example, the input device 1005 shown in FIG.
  • the output device 44 outputs the result of the processing performed by the voice processing device 41.
  • the output device 44 is an output device such as a monitor or a speaker that outputs the processing result of the voice processing device 41 by visual or auditory means in accordance with an instruction input from the instruction input device 43 by the operator.
  • when the output device 44 is a monitor, for example, a list of clusters is displayed in order of size, and the contents of a specific cluster are displayed as waveform diagrams, spectrograms, or the like, arranged side by side so that a plurality of segments can be compared.
  • when the output device 44 is a speaker, the output device 44 outputs by reproducing the sound.
  • the output device 44 is realized by a display device 1006, for example.
  • the instruction input device 43 receives the instruction information from the operator and controls information displayed on the display device.
  • the instruction input device 43 is a user interface that receives instruction information from the operator, such as operations on the information output from the output device 44 and execution of processing by the voice processing device 41.
  • an arbitrary input device such as a mouse, a keyboard, or a touch panel can be used.
  • the instruction input device 43 receives the instruction information from the operator and controls the voice processing device 41 to execute the process.
  • the voice input device 42 inputs arbitrary voice data to the voice processing device 41.
  • the speech processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention on the input speech data, selects clusters including frequently occurring phrases, and further extracts the segments included in the selected clusters.
  • the output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to an instruction input from the instruction input device 43 by the operator. That is, the output device 44 outputs the processing result in a form that the operator desires to view.
  • the instruction input device 43 controls the voice processing device 41 to execute processing in accordance with the instruction information input from the operator.
  • the voice input device 42 inputs arbitrary voice data to the voice processing device 41.
  • as described above, the voice processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention, selects clusters including phrases (segments) that frequently appear, and extracts the segments contained in the selected clusters.
  • the output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to the instruction input from the instruction input device 43 by the operator. Therefore, the speech processing system 40 in the present embodiment can output clusters and segments including frequently occurring phrases included in the speech data.
  • the voice processing system 40 allows an operator to easily perform analysis work such as identification of a person from voice. This is because the voice processing system 40 in the present embodiment is configured such that the processing result is output to the output device 44 in a form that the operator wants to browse.
  • in addition, since the speech processing system 40 according to the present embodiment can analyze frequently appearing phrases, it is possible to analyze the tendency of talks or topics that a specific person often speaks about.
  • FIG. 10 is a diagram illustrating an example of audio data stored in the external storage device.
  • the external storage device is realized by, for example, the voice input device 42 in the fourth embodiment.
  • the external storage device stores voice data and a voice data ID that is an identifier of the voice data.
  • for example, for the voice data ID "1", the external storage device stores the voice data "... We have kept your child. Prepare the ransom. ...".
  • the contents of the audio data stored by the external storage device are not limited to those shown in FIG. 10.
  • FIG. 11 is a diagram illustrating an example of a method in which the generation unit 11 generates a segment from audio data.
  • the generation unit 11 generates a plurality of segments from the voice data shown in FIG. 10, that is, the voice data with ID "1": "... We have kept your child. Prepare the ransom. ...".
  • in FIG. 11, for example, segment 1 contains "kept" and segment 2 is an adjacent segment that partially overlaps segment 1.
  • the generation unit 11 subdivides the audio data arbitrarily (a predetermined time or the like), and generates a plurality of segments using these.
  • in this way, the generation unit 11 generates a plurality of segments such that adjacent segments at least partially overlap.
  • FIG. 12 is a diagram illustrating an example of a method in which the clustering unit 12 generates a cluster in which a plurality of segments are collected.
  • as shown in FIG. 12, a cluster includes a cluster ID that is an identifier of the segment content (phrase), the segment content, and the number of times the segment content (phrase) appears in all of the audio data.
  • the cluster ID and the segment number shown in FIG. 11 are assumed to be the same.
  • the cluster with cluster ID "1" indicates, for example, that the phrase "kept" appears 20 times in all of the audio data. That is, the clustering unit 12 calculates the similarity between the segments generated by the generation unit 11 as illustrated in FIG. 11, and generates clusters in which highly similar segments, that is, groups of the same segment, are collected.
  • the selection unit 13 compares clusters using the number of segments included in the cluster and the total time length, and selects a cluster that satisfies a predetermined condition. For example, the selection unit 13 compares the number of segments included in each cluster among the plurality of clusters generated by the clustering unit 12, that is, the number of appearances of the phrase. As shown in FIG. 12, the selection unit 13 selects “Ransom” with an appearance count of 35 and “Prepare a ransom” with an appearance count of 30. Next, the selection unit 13 performs comparison based on the size of each cluster. For example, the selection unit 13 uses the result of multiplying the number of appearances and the segment length, that is, the time length as the size of each cluster, and selects the cluster having the largest size of each cluster.
  • the selection unit 13 compares, for example, a cluster with a cluster ID of 7 and a cluster with a cluster ID of 8.
  • in this case, the selection unit 13 compares the product of the appearance count of 35 and the time length of "ransom" with the product of the appearance count of 30 and the time length of "prepare the ransom".
  • the selection part 13 may compare and select only the time length of a segment, when comparing the clusters with the same appearance frequency. Note that the selection unit 13 is not limited to the above method, and the size may be defined and compared based on various indexes such as the number of appearances, the time length, and the number of phonemes.
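As an illustration of the comparison above (not from the patent), the tiny sketch below computes the product of appearance count and time length for the two clusters; the per-occurrence durations are made-up values, and only the counts of 35 and 30 come from the text.

```python
# Assumed per-occurrence durations; only the counts (35 and 30) come from the text.
ransom = {"phrase": "ransom", "count": 35, "seconds_per_occurrence": 0.8}
prepare = {"phrase": "prepare the ransom", "count": 30, "seconds_per_occurrence": 1.6}

def size(cluster):
    # Cluster size as appearance count multiplied by the segment time length.
    return cluster["count"] * cluster["seconds_per_occurrence"]

largest = max((ransom, prepare), key=size)
print(largest["phrase"], size(largest))  # -> prepare the ransom 48.0
```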
  • the extraction unit 14 extracts a segment from the selected cluster.
  • in this example, the audio data of the segments whose content is "prepare the ransom" is extracted. From the audio data of these segments, it can be seen that the phrase "prepare the ransom" is frequently included in the audio data.
  • DESCRIPTION OF SYMBOLS: 10 Speech processing apparatus; 11 Generation unit; 12 Clustering unit; 13 Selection unit; 14 Extraction unit; 15 Normalization learning unit; 16 Speech data normalization unit; 17 Speech data processing unit; 18 Speech data classification unit; 20 Speech processing apparatus; 30 Speech processing apparatus; 40 Speech processing system; 41 Speech processing device; 42 Speech input device; 43 Instruction input device; 44 Output device; 101 Speech data storage unit; 102 Unspecified acoustic model storage unit; 103 Parameter storage unit; 1000 Computer; 1001 CPU; 1002 Main storage device; 1003 Auxiliary storage device; 1004 Interface; 1005 Input device; 1006 Display device

Abstract

Provided is a speech processing device that allows frequently appearing phrases required for speech evaluation to be accurately selected from speech data. The speech processing device comprises: a generation unit for generating a plurality of segments from the speech data with adjacent segments at least partially overlapping each other; a clustering unit for generating a cluster by sorting the plurality of segments on the basis of phonological similarity; a selection unit for selecting a cluster that meets a prescribed condition on the basis of the size of the cluster; and an extraction unit for extracting a segment included in the selected cluster.

Description

Speech processing device, speech processing system, speech processing method, and recording medium
 The present invention relates to a speech processing device, a speech processing system, a speech processing method, and a recording medium that extract frequent patterns from speech data.
 In recent years, scientific methods based on forensic science have been widely used in police criminal investigations. In fingerprint identification, which is a typical example, fingerprint images collected at the crime scene are sequentially compared with a large number of known fingerprint images to estimate who was involved in the crime. A technique similar to fingerprint examination that deals with voice instead is called voiceprint examination or voice examination.
 Patent Document 1 describes a technique for extracting, from speech data, the speech data of unknown words that are candidate keywords to be registered in a speech recognition dictionary. The technique described in Patent Document 1 detects, as an utterance section, a section in which the speech power value of the speech data remains above a threshold th1 for at least a certain time, and divides each utterance section into sections in which the power value remains above a threshold th2 for at least a certain time. The technique then acquires a phoneme string from the divided speech data, performs clustering, calculates an evaluation value, detects unknown words, and registers them in the dictionary.
 Patent Document 2 describes a technique for determining the factor causing misrecognition and notifying the user. The technique described in Patent Document 2 divides a mel-frequency cepstral coefficient (hereinafter "MFCC") vector sequence extracted by a feature extraction unit into segments for each phoneme using a set of standard models. The technique then investigates the cause of the misrecognition, creates a message character string to be presented to the user according to the analysis result, and notifies the user by displaying the message on a display.
 Patent Document 1: International Publication No. 2009/136440. Patent Document 2: JP 2004-325635 A.
 However, with the technique described in Patent Document 1, an unknown word that is a keyword candidate can be selected, but a phrase containing a sentence (for example, a sentence such as "Prepare a ransom") cannot be selected. With the technique described in Patent Document 2, a vector sequence for each erroneously recognized segment can be analyzed, but a desired phrase cannot be selected. In other words, the techniques described in Patent Documents 1 and 2 have the problem that a desired phrase cannot be selected.
 An object of the present invention is to solve the above problem and to provide a speech processing device, a speech processing system, a speech processing method, and a recording medium that can select a desired phrase.
 A speech processing apparatus according to one aspect of the present invention includes: first generation means for generating, from speech data, a plurality of segments such that adjacent segments at least partially overlap; second generation means for classifying the plurality of segments based on phoneme similarity to generate clusters; selection means for selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and extraction means for extracting the segments included in the selected cluster.
 A speech processing method according to one aspect of the present invention generates, from speech data, a plurality of segments in which adjacent segments at least partially overlap, classifies the plurality of segments based on phoneme similarity to generate clusters, selects a cluster that satisfies a predetermined condition based on the size of the cluster, and extracts the segments included in the selected cluster.
 A recording medium according to one aspect of the present invention is a computer-readable recording medium storing a program that causes a computer to execute: a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap; a process of classifying the plurality of segments based on phoneme similarity to generate clusters; a process of selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and a process of extracting the segments included in the selected cluster.
 The present invention has the effect that a desired phrase can be selected in a speech processing device, a speech processing system, a speech processing method, and a program.
FIG. 1 is a block diagram showing a configuration example of the speech processing device according to the first embodiment of the present invention.
FIG. 2 is a schematic block diagram showing a configuration example of the computer in each embodiment and specific example of the present invention.
FIG. 3 is a flowchart showing an operation example of the speech processing device according to the first embodiment of the present invention.
FIG. 4 is a diagram showing an example of a method in which the speech processing device according to the first embodiment of the present invention extracts a phrase using an HMM.
FIG. 5 is a block diagram showing a configuration example of the speech processing device according to the second embodiment of the present invention.
FIG. 6 is a flowchart showing an operation example of the speech processing device according to the second embodiment of the present invention.
FIG. 7 is a block diagram showing a configuration example of the speech processing device according to the third embodiment of the present invention.
FIG. 8 is a flowchart showing an operation example of the speech processing device according to the third embodiment of the present invention.
FIG. 9 is a block diagram showing a configuration example of the speech processing system according to the fourth embodiment of the present invention.
FIG. 10 is a diagram showing an example of the speech data stored in the external storage device in the specific example of the present invention.
FIG. 11 is a diagram showing an example of a method in which the generation unit in the specific example of the present invention divides the speech data.
FIG. 12 is a diagram showing an example of a method in which the clustering unit in the specific example of the present invention generates a cluster in which a plurality of segments are grouped.
 First, in order to facilitate understanding of the embodiments of the present invention, the background of the present invention will be described.
 In the voice examination method related to the present invention, for example, a telephone call containing a kidnapper's ransom demand or a terrorist's crime notice is recorded, and the recorded voice is compared with known voices in an attempt to identify the caller.
 Unlike fingerprints, which remain unchanged throughout life, voice changes each time depending on what is spoken. Therefore, in the voice examination method, a portion (section) of the voice in which the same content is spoken is cut out and compared. For example, in a kidnapper's ransom demand the phrase "prepare the money" is expected to appear frequently, so such a phrase is found and cut out and compared with another voice in which "prepare the money" is spoken.
 What phrase to use is decided on a case-by-case basis. In the case of a kidnapper, the above-mentioned "prepare the money" and the like appear frequently, so they are considered appropriate. In the case of bank-transfer fraud, phrases related to money would also be appropriate, but they probably differ from those of a kidnapper. In the case of terrorists there may be other, better phrases, and different phrases would likely be better for the intelligence activities of the military or other government agencies. The selection of such frequently appearing phrases has so far relied on the experience and intuition of skilled examiners. In such cases, however, a skilled examiner must spend time carefully observing the speech, and obtaining the desired phrases necessary for voice examination incurs a large human cost.
 According to the embodiments of the present invention described below, the above problems are solved and a desired phrase can be selected.
 Hereinafter, embodiments and specific examples of the present invention will be described with reference to the drawings. In each embodiment and specific example, the same components are denoted by the same reference signs, and descriptions thereof are omitted as appropriate.
 [First Embodiment]
 Hereinafter, a first mode for carrying out the present invention (hereinafter referred to as the "first embodiment") will be described in detail with reference to the drawings.
 [Description of Configuration]
 FIG. 1 is a block diagram showing a configuration example of the speech processing device 10 according to the first embodiment of the present invention. Referring to FIG. 1, the speech processing device 10 according to the first embodiment of the present invention includes a generation unit 11, a clustering unit 12, a selection unit 13, and an extraction unit 14. Here, the generation unit 11 is also referred to as a first generation unit, and the clustering unit 12 is also referred to as a second generation unit.
 The generation unit 11 generates, from the speech data stored in an external storage device, a plurality of segments in which adjacent segments at least partially overlap. For example, the generation unit 11 subdivides the speech data stored in the external storage device into short time units and generates a plurality of segments using the subdivided speech data. The time length of the plurality of segments generated by the generation unit 11 may be a fixed time length. The generation unit 11 may also divide one piece of speech data a plurality of times with different time lengths and generate segments of various time lengths using the divided speech data.
 The clustering unit 12 classifies the plurality of segments based on a predetermined similarity index to generate clusters.
 The selection unit 13 selects at least one cluster from the generated clusters based on the size of each cluster. The extraction unit 14 extracts the segments included in the selected cluster. Here, the size of a cluster is, for example, the total time length of the segments, the result obtained by multiplying the number of appearances of the segment content (also called a phrase) by the segment length, or the like.
 FIG. 2 is a schematic block diagram showing a configuration example of the computer 1000 in each embodiment and specific example of the present invention. The computer 1000 includes a CPU (Central Processing Unit) 1001, a main storage device 1002, an auxiliary storage device 1003, an interface 1004, an input device 1005, and a display device 1006.
 各実施形態および具体例の音声処理装置10等は、コンピュータ1000に実装される。音声処理装置10等の動作は、プログラムの形式で補助記憶装置1003に記憶されている。CPU1001は、プログラムを補助記憶装置1003から読み出して主記憶装置1002に展開し、そのプログラムに従って上記の処理を実行する。例えばCPU1001は、上記プログラムを補助記憶装置1003から読み出して主記憶装置1002に展開することで、生成部11、クラスタリング部12、選択部13、および抽出部14の各部の機能を実現する。 The voice processing apparatus 10 and the like of each embodiment and a specific example are mounted on a computer 1000. Operations of the voice processing device 10 and the like are stored in the auxiliary storage device 1003 in the form of a program. The CPU 1001 reads out the program from the auxiliary storage device 1003, develops it in the main storage device 1002, and executes the above processing according to the program. For example, the CPU 1001 reads out the above program from the auxiliary storage device 1003 and develops it in the main storage device 1002, thereby realizing the functions of the generation unit 11, clustering unit 12, selection unit 13, and extraction unit 14.
 補助記憶装置1003は、一時的でない有形の媒体の一例である。一時的でない有形の媒体の他の例として、インターフェース1004を介して接続される磁気ディスク、光磁気ディスク、CD(Compact Disc)-ROM(Read Only Memory)、DVD(Digital Versatile Disc)-ROM、半導体メモリ等が挙げられる。また、このプログラムが通信回線によってコンピュータ1000に配信される場合、配信を受けたコンピュータ1000がそのプログラムを主記憶装置1002に展開し、上記の処理を実行しても良い。 The auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include a magnetic disk, a magneto-optical disk, a CD (Compact Disc)-ROM (Read Only Memory), a DVD (Digital Versatile Disc)-ROM, and a semiconductor memory connected via the interface 1004. When the program is distributed to the computer 1000 over a communication line, the computer 1000 that has received the distribution may load the program into the main storage device 1002 and execute the above processing.
 インターフェース1004は、CPU1001に接続され、ネットワークあるいは外部記憶媒体に接続される。外部データがインターフェース1004を介してCPU1001に取り込まれても良い。入力デバイス1005は、例えばキーボード、マウス、タッチパネル、又はマイクである。ディスプレイ装置1006は、例えばLCD(Liquid Crystal Display)やCRT(Cathode Ray Tube)ディスプレイのような、CPU1001やGPU(Graphics Processing Unit)(図示せず)等により処理された描画データに対応する画面を表示する装置である。なお、図2に示すハードウェア構成は、一例にすぎず、図2が示す各部それぞれが独立した論理回路で構成されていても良い。 The interface 1004 is connected to the CPU 1001 and to a network or an external storage medium. External data may be taken into the CPU 1001 via the interface 1004. The input device 1005 is, for example, a keyboard, a mouse, a touch panel, or a microphone. The display device 1006 is a device, such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display, that displays a screen corresponding to drawing data processed by the CPU 1001, a GPU (Graphics Processing Unit) (not shown), or the like. Note that the hardware configuration shown in FIG. 2 is merely an example, and each unit shown in FIG. 2 may be configured as an independent logic circuit.
 また、プログラムは、前述の処理の一部を実現するためのものであっても良い。さらに、プログラムは、補助記憶装置1003に既に記憶されている他のプログラムとの組み合わせで前述の処理を実現する差分プログラムであっても良い。 Further, the program may be for realizing a part of the above-described processing. Furthermore, the program may be a differential program that realizes the above-described processing in combination with another program already stored in the auxiliary storage device 1003.
 [動作の説明]
 図3を用いて、本実施形態の動作について説明する。図3は、本発明の第1の実施形態における音声処理装置10の動作例を示すフローチャートである。
[Description of operation]
The operation of this embodiment will be described with reference to FIG. FIG. 3 is a flowchart showing an operation example of the speech processing apparatus 10 according to the first embodiment of the present invention.
 生成部11は、外部記憶装置が記憶する音声データから複数のセグメントを生成する(ステップS101)。このとき、生成部11は、隣接するセグメントが、少なくとも時間的に重なりを持つように、複数のセグメントを生成する。セグメントの時間長は、想定するフレーズの時間長に応じて、例えば1秒から数秒の範囲の一定値としてもよい。 The generation unit 11 generates a plurality of segments from the audio data stored in the external storage device (step S101). At this time, the generation unit 11 generates a plurality of segments so that adjacent segments have at least temporal overlap. The time length of the segment may be a constant value in the range of 1 second to several seconds, for example, depending on the assumed time length of the phrase.
 また、生成部11は、一つの音声データに対して異なる時間長で複数回の分割を行い、種々の時間長のセグメントを生成してもよい。また、生成部11は、非特許文献1に記載された方法などを用いて所定の変化点や無音区間などにおいて音声データを分割し、分割した複数の音声データを用いて、可変長のセグメントを生成してもよい。 The generation unit 11 may also divide one piece of audio data multiple times with different time lengths to generate segments of various lengths. Alternatively, the generation unit 11 may divide the audio data at predetermined change points, silent intervals, or the like, using, for example, the method described in Non-Patent Document 1, and generate variable-length segments from the divided audio data.
 クラスタリング部12は、所定の類似度指標に基づき、複数のセグメントを分類してクラスタを生成する(ステップS102)。すなわち、クラスタリング部12は、複数のセグメントをクラスタリングする。クラスタリング部12は、生成部11が生成した複数のセグメントから、各セグメント間の類似度を計算し、類似度の高いセグメント同士をまとめたクラスタを生成する。クラスタリング部12による類似度指標やクラスタ生成の具体的な方法については、例えば非特許文献1に記載の方法を用いることができる。 The clustering unit 12 classifies a plurality of segments based on a predetermined similarity index to generate a cluster (step S102). That is, the clustering unit 12 clusters a plurality of segments. The clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11, and generates a cluster in which the segments with high similarity are collected. As a specific method of similarity index and cluster generation by the clustering unit 12, for example, the method described in Non-Patent Document 1 can be used.
 ここで、類似度指標とは、セグメントを構成する音韻の類似性を測る指標である。類似度指標は、例えば、音響特徴量系列の平均と分散から計算されるバタチャリャ距離、カルバック・ライブラーのダイバージェンス、対数尤度比など、音響特徴量の統計量を用いる指標である。これらの指標は、セグメント内の音響特徴量系列の順序を考慮しない。 Here, the similarity index is an index that measures the similarity of the phonemes constituting the segments. The similarity index is, for example, an index based on statistics of the acoustic features, such as the Bhattacharyya distance computed from the mean and variance of an acoustic feature sequence, the Kullback-Leibler divergence, or a log-likelihood ratio. These indices do not take into account the order of the acoustic features within a segment.
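 As an illustration of such an order-insensitive statistic (a sketch assuming each segment is summarized by the mean and diagonal variance of its acoustic feature vectors; this exact formulation is an assumption, not taken from the specification), the Bhattacharyya distance between two segments could be computed as follows:

    import numpy as np

    def bhattacharyya_distance(mu1, var1, mu2, var2):
        # Bhattacharyya distance between two diagonal Gaussians summarizing two segments.
        var = (var1 + var2) / 2.0
        term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
        term2 = 0.5 * (np.sum(np.log(var))
                       - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
        return term1 + term2

    # mu/var per segment could be feats.mean(axis=0) and feats.var(axis=0) + 1e-6
    # for an (n_frames, n_dims) MFCC array.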
 また、類似度指標を用いる方法は、例えば、順序、すなわち時刻順を考慮する指標を用いてもよい。類似度指標を用いる方法は、例えば、セグメント間で各音響特徴量の最適な対応関係を動的計画法(Dynamic Programming;以降「DP」と記載)で求めて類似度を計算するDPマッチング法がある。ここで、音響特徴量とは、例えば、MFCCである。MFCCは音声認識などで広く用いられている。 A similarity index that takes order, i.e., time order, into account may also be used. One such method is DP matching, which computes the similarity by finding the optimal correspondence between the acoustic features of two segments using dynamic programming (hereinafter "DP"). Here, the acoustic features are, for example, MFCCs, which are widely used in speech recognition and related fields.
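 For an order-aware comparison, a minimal DP (DTW-style) matching sketch is shown below; the Euclidean frame distance and the three-way recursion are assumptions made here for illustration, not the specification's own implementation:

    import numpy as np

    def dp_matching_cost(a, b):
        # DP alignment cost between two feature sequences a: (Ta, D) and b: (Tb, D).
        Ta, Tb = len(a), len(b)
        D = np.full((Ta + 1, Tb + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, Ta + 1):
            for j in range(1, Tb + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])            # local frame distance
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[Ta, Tb] / (Ta + Tb)   # length-normalized; smaller means more similar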
 選択部13は、クラスタリング部12におけるクラスタリングが収束した場合(ステップS103でYes)、クラスタリング部12が生成したクラスタの中から、各クラスタのサイズに基づき、所定の条件を満たすクラスタを選択する(ステップS104)。選択部13は、この選択において、頻出するフレーズを発見するという観点からクラスタのサイズを比較し、サイズの大きい順に、少なくとも1つのクラスタを選ぶ。所定の条件とは、より多くのセグメントを含む、セグメントの総時間長がより長い、等が挙げられる。つまり、選択部13は、例えば、より多くのセグメントを含むクラスタ、あるいはセグメントの総時間長がより長いクラスタを所定の条件を満たすクラスタとして選ぶ。 When the clustering in the clustering unit 12 has converged (Yes in step S103), the selection unit 13 selects, from the clusters generated by the clustering unit 12, a cluster that satisfies a predetermined condition based on the size of each cluster (step S104). In this selection, the selection unit 13 compares the cluster sizes from the viewpoint of finding frequently occurring phrases, and picks at least one cluster in descending order of size. Examples of the predetermined condition are containing more segments, having a longer total segment time length, and so on. That is, the selection unit 13 selects, for example, a cluster containing more segments, or a cluster whose total segment time length is longer, as the cluster satisfying the predetermined condition.
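 A minimal sketch of this selection step (illustrative only; here the cluster size is taken as the total duration of its segments, one of the measures the description mentions):

    def select_clusters(clusters, top_k=1):
        # clusters: list of clusters, each a list of segment durations in seconds.
        # Size = total duration (equivalently, occurrence count x mean segment length).
        return sorted(clusters, key=sum, reverse=True)[:top_k]

    # usage: largest = select_clusters([[1.2, 1.1, 1.3], [0.8, 0.9]], top_k=1)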
 ここで、クラスタリングが収束する場合とは、例えば、ステップS101及びステップS102が所定回数実行された状況、クラスタリングに関する所定の評価値の増減が一定の値以下になった状況等である。なお、クラスタリングが収束する場合とは、クラスタリングに関する所定の評価値の増減が一定の値以下になった状況に付随して、クラスタ間でセグメントの移動がなくなった状況であってもよい。 Here, the case where the clustering converges is, for example, a situation where Step S101 and Step S102 are executed a predetermined number of times, a situation where the increase or decrease of the predetermined evaluation value related to clustering is a predetermined value or less, and the like. Note that the case where clustering converges may be a situation in which the movement of a segment between clusters is lost in association with a situation in which the increase or decrease in a predetermined evaluation value related to clustering is equal to or less than a certain value.
 抽出部14は、選択部13で選択されたクラスタに含まれる1または複数のセグメントから、セグメントを抽出する(ステップS105)。これにより、抽出部14は、音声データから、所望のフレーズに該当する部分のセグメントを抽出することができる。 The extraction unit 14 extracts segments from one or more segments included in the cluster selected by the selection unit 13 (step S105). Thereby, the extraction part 14 can extract the segment of the part applicable to a desired phrase from audio | voice data.
 ここで、クラスタリング部12におけるクラスタリングが収束していない場合(ステップS103でNo)、ステップS101の処理に戻る。これは、ステップS101およびステップS102が相互に依存するため、所定回数、あるいは収束するまで反復してもよいことを示す。 Here, if the clustering in the clustering unit 12 has not converged (No in step S103), the process returns to step S101. This indicates that step S101 and step S102 depend on each other and may be repeated a predetermined number of times or until convergence.
 なお、生成部11とクラスタリング部12とは、図4が示す構造を有する隠れマルコフモデル(Hidden Markov Model;以降、「HMM」と記載する)を用いて一括実行することも可能である。図4は、音声処理装置10がHMMを用いてフレーズを抽出する方法の一例を示す図である。すなわち、音声処理装置10は、外部記憶装置が記憶する音声データを用いて、図4が示すようなHMMを最尤推定法などに基づき学習する。これにより、第1のフレーズ(図4のフレーズ1)、第2のフレーズ(図4のフレーズ2)、…、第Nのフレーズ(図4のフレーズN)を表現する一方向型HMM(Left-to-right HMM)が自動的に形成され、同時に各々に属するセグメントも獲得される。 Note that the processing of the generation unit 11 and the clustering unit 12 can also be performed jointly using a hidden Markov model (hereinafter "HMM") having the structure shown in FIG. 4. FIG. 4 is a diagram showing an example of how the speech processing device 10 extracts phrases using an HMM. That is, the speech processing device 10 trains an HMM such as the one shown in FIG. 4 from the audio data stored in the external storage device, for example by maximum-likelihood estimation. As a result, left-to-right HMMs representing the first phrase (phrase 1 in FIG. 4), the second phrase (phrase 2 in FIG. 4), ..., and the N-th phrase (phrase N in FIG. 4) are formed automatically, and the segments belonging to each phrase are obtained at the same time.
 抽出された音声データの該当部分を聴取することにより、頻出するフレーズを確認し、また音声鑑定に利用することができる。 Listening to the relevant part of the extracted voice data allows frequent phrases to be confirmed and used for voice appraisal.
 [効果の説明]
 以上のように、本実施形態に係る音声処理装置10によれば、生成部11が音声データから、隣接するセグメントが少なくとも一部重複するように、複数のセグメントを生成し、クラスタリング部12が音韻の類似性に基づき、複数のセグメントを分類してクラスタを生成する。そして、本実施形態に係る音声処理装置10によれば、選択部13がクラスタの中から、各クラスタのサイズに基づき少なくとも1つのクラスタを選択する。更に、本実施形態における音声処理装置10によれば、抽出部14が選択されたクラスタに含まれるセグメントを抽出するため、音声データの中から所望のフレーズに該当する部分のセグメントを抽出することが可能となる。その理由は、生成部11が音声データから隣接するセグメントが少なくとも一部が重複するように複数のセグメントを生成しているため、単語よりも短い語から単語よりも長いフレーズを1つのセグメントとして生成できるからである。
[Description of effects]
As described above, according to the speech processing device 10 of the present embodiment, the generation unit 11 generates, from the audio data, a plurality of segments such that adjacent segments at least partially overlap, and the clustering unit 12 classifies the segments based on phonemic similarity to generate clusters. The selection unit 13 then selects at least one cluster from among the clusters based on the size of each cluster. Furthermore, since the extraction unit 14 extracts the segments contained in the selected cluster, the speech processing device 10 can extract, from the audio data, the segments corresponding to a desired phrase. The reason is that, because the generation unit 11 generates the segments so that adjacent segments at least partially overlap, anything from a unit shorter than a word to a phrase longer than a word can be captured as a single segment.
 また、本実施形態における音声処理装置10を用いることで、熟練した鑑定者でなくとも音声鑑定に必要な頻出するフレーズを低コストで発見および選定できる。その理由は、音声処理装置10が、与えられた音声データから、隣接するセグメントが少なくとも一部重複するようにセグメントを生成し、このセグメントをクラスタリングし、類似した多数のセグメントを含むクラスタを選択するからである。そして、音声処理装置10はこのように選択されたクラスタに含まれるセグメントを抽出するからである。抽出されたセグメントは、生成部11が生成したセグメントであり、音声データのうちの所望のフレーズを含む部分的な音声データである。これにより、音声処理装置10は、音声データ中で頻出するフレーズを自動的に発見できるからである。 Further, by using the speech processing apparatus 10 in the present embodiment, frequent phrases necessary for speech appraisal can be found and selected at low cost even if not an expert appraiser. The reason is that the speech processing apparatus 10 generates segments such that adjacent segments at least partially overlap from given speech data, clusters the segments, and selects a cluster including many similar segments. Because. This is because the speech processing apparatus 10 extracts segments included in the cluster selected in this way. The extracted segment is a segment generated by the generation unit 11 and is partial audio data including a desired phrase in the audio data. This is because the speech processing apparatus 10 can automatically find frequently occurring phrases in the speech data.
 さらに、本実施形態における音声処理装置10を用いることで、定量的に頻度の高いフレーズを発見できるため、音声鑑定に有用な頻出フレーズを高い信頼性で発見できる。 Furthermore, by using the speech processing apparatus 10 according to the present embodiment, frequently frequent phrases can be found quantitatively, so that frequent phrases useful for speech appraisal can be found with high reliability.
 [第2の実施の形態]
 以下、本発明の第2の実施形態について図面を参照して詳細に説明する。
[Second Embodiment]
Hereinafter, a second embodiment of the present invention will be described in detail with reference to the drawings.
 [構成の説明]
 図5は、本発明の第2の実施形態に係る音声処理装置20の構成例を示すブロック図である。図5を参照すると、本発明の第2の実施形態に係る音声処理装置20は、正規化学習部15、音声データ正規化部16、音声データ処理部17、第1~第Nの音声データ記憶部(101-1~101-N(Nは正の整数))、不特定音響モデル記憶部102、及び第1~第Nのパラメタ記憶部(103-1~103-N(Nは正の整数))を備える。
ここで、正規化学習部15は、第3の生成部とも記載する。なお、本実施の形態では、第1~第Nの音声データ記憶部(101-1~101-N)の夫々を区別しない場合、または、総称する場合には、音声データ記憶部101と呼ぶ。また、第1~第Nのパラメタ記憶部(103-1~103-N)の夫々を区別しない場合、または、総称する場合には、パラメタ記憶部103と呼ぶ。
[Description of configuration]
FIG. 5 is a block diagram showing a configuration example of the speech processing device 20 according to the second embodiment of the present invention. Referring to FIG. 5, the speech processing device 20 according to the second embodiment of the present invention comprises a normalization learning unit 15, an audio data normalization unit 16, an audio data processing unit 17, first to N-th audio data storage units (101-1 to 101-N, where N is a positive integer), an unspecified acoustic model storage unit 102, and first to N-th parameter storage units (103-1 to 103-N, where N is a positive integer).
Here, the normalization learning unit 15 is also referred to as a third generation unit. In the present embodiment, the first to Nth audio data storage units (101-1 to 101-N) are referred to as the audio data storage unit 101 when they are not distinguished or collectively referred to. The first to Nth parameter storage units (103-1 to 103-N) are referred to as the parameter storage unit 103 when they are not distinguished or collectively referred to.
 音声データ記憶手段101は、性質の異なる音声データを各々記憶する。すなわち、第1の音声データ記憶部101-1、第2の音声データ記憶部101-2、・・・、及び第Nの音声データ記憶部101-Nは各々性質の異なる音声データを記憶する。また、第1の音声データ記憶部101-1、第2の音声データ記憶部101-2、・・・、及び第Nの音声データ記憶部101-Nが記憶する各々性質の異なる音声データは、それぞれ音響的な性質に基づいて分類された音声データである。 The audio data storage means 101 stores audio data having different properties. That is, the first audio data storage unit 101-1, the second audio data storage unit 101-2,..., And the Nth audio data storage unit 101-N each store audio data having different properties. Also, the audio data having different properties stored in the first audio data storage unit 101-1, the second audio data storage unit 101-2,..., And the Nth audio data storage unit 101-N are: Each of them is voice data classified based on acoustic characteristics.
 不特定音響モデル記憶部102は、正規化学習部15が学習した不特定音響モデルを記憶する。不特定音響モデルとは、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化することで得られるモデルである。 The unspecified acoustic model storage unit 102 stores the unspecified acoustic model learned by the normalization learning unit 15. The unspecified acoustic model is a model obtained by normalizing a difference between audio data having different properties stored in the audio data storage unit 101.
 パラメタ記憶部103は、音声データの差異を正規化するためのパラメタを各々記憶する。すなわち、第1のパラメタ記憶部103-1、第2のパラメタ記憶部103-2、・・・、及び第Nのパラメタ記憶部103-Nは、音声データの差異を正規化するためのパラメタを各々記憶する。 The parameter storage unit 103 stores parameters for normalizing the difference between the audio data. That is, the first parameter storage unit 103-1, the second parameter storage unit 103-2,..., And the Nth parameter storage unit 103-N have parameters for normalizing the difference of audio data. Remember each one.
 正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データを用いて、正規化学習を行う。 The normalization learning unit 15 performs normalization learning using audio data having different properties stored in the audio data storage unit 101.
 ここで正規化学習とは、例えば非特許文献2に記載された音響モデルの学習法である。
音響モデルは、音響特徴量の平均ベクトルμiによって各音韻iを規定するが、正規化学習では平均ベクトルが音声データの性質によって変わり得るとする。即ち、本実施の形態では、平均ベクトル(不特定音響モデル)μiを、以下の式(1)のようなアフィン変換(affine transformation)で表現する。
Here, normalization learning is an acoustic model learning method described in Non-Patent Document 2, for example.
In the acoustic model, each phoneme i is defined by an average vector μ i of acoustic feature values. In normalization learning, it is assumed that the average vector can be changed depending on the property of speech data. That is, in the present embodiment, the average vector (unspecified acoustic model) μ i is expressed by affine transformation as shown in the following formula (1).
  μ_i^(s) = A_s μ_i + b_s    (s = 1, 2, ..., N)    …(1)
 ここで、s=1、2、・・・、Nである。また、A_sおよびb_sは、夫々、音声データの性質の違いを正規化するためのパラメタである。 Here, s = 1, 2, ..., N, and A_s and b_s are parameters for normalizing the differences in the properties of the audio data.
 式(1)により、音声データの性質の違いに影響されない不特定音響モデルμ_iと、音声データの性質の違いを正規化するためのパラメタA_sおよびb_sが得られる。そして、正規化学習部15は、不特定音響モデルμ_iを不特定音響モデル記憶部102に格納する。また、正規化学習部15は、パラメタA_sおよびb_sを、パラメタ記憶部103に記憶する。具体的には、正規化学習部15は、パラメタA_1およびb_1を、第1のパラメタ記憶部103-1に格納し、パラメタA_Nおよびb_Nを第Nのパラメタ記憶部103-Nに格納する。非特許文献2には、話者によって音声データの性質が異なるとし、話者の違いを正規化する方法が記載されているが、音声データの性質の違いは話者に限らず、背景雑音、マイクや通信回線など、種々の想定が可能である。 From formula (1), the unspecified acoustic model μ_i, which is not affected by differences in the properties of the audio data, and the parameters A_s and b_s for normalizing those differences are obtained. The normalization learning unit 15 then stores the unspecified acoustic model μ_i in the unspecified acoustic model storage unit 102, and stores the parameters A_s and b_s in the parameter storage units 103. Specifically, the normalization learning unit 15 stores the parameters A_1 and b_1 in the first parameter storage unit 103-1, and the parameters A_N and b_N in the N-th parameter storage unit 103-N. Non-Patent Document 2 describes a method for normalizing differences between speakers on the assumption that the properties of audio data vary from speaker to speaker; however, the differences in the properties of audio data are not limited to speakers, and various other factors, such as background noise, microphones, and communication channels, can be assumed.
 すなわち、正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化するための正規化パラメタを生成し、パラメタ記憶部103に記憶する。また、正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化して不特定音響モデルを学習し、学習した不特定音響モデルを不特定音響モデル102に記憶する。ここで、正規化学習部15は、音声データ記憶部101が記憶する性質の異なる音声データの差異を正規化するための正規化パラメタを推定することで、該正規化パラメタを生成する。また、正規化学習部15は、例えば、反復計算を行う場合では、反復のたびに不特定音響モデルを不特定音響モデル102に記憶する。 That is, the normalization learning unit 15 generates a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101 and stores the normalization parameter in the parameter storage unit 103. In addition, the normalization learning unit 15 normalizes the difference between the audio data having different properties stored in the audio data storage unit 101 to learn the unspecified acoustic model, and stores the learned unspecified acoustic model in the unspecified acoustic model 102. To do. Here, the normalization learning unit 15 generates the normalization parameter by estimating a normalization parameter for normalizing the difference between the audio data having different properties stored in the audio data storage unit 101. In addition, for example, when performing iterative calculation, the normalization learning unit 15 stores the unspecified acoustic model in the unspecified acoustic model 102 at each iteration.
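 For illustration only (a toy sketch, not the estimation procedure of Non-Patent Document 2): assuming per-condition phoneme mean vectors and the shared means μ_i of the unspecified model are available, A_s and b_s of formula (1) could be fitted by least squares as follows:

    import numpy as np

    def estimate_affine_params(shared_means, condition_means):
        # Least-squares fit of condition_means ≈ shared_means @ A.T + b,
        # i.e. formula (1) applied row by row; both arrays are (n_phonemes, n_dims).
        n, d = shared_means.shape
        X = np.hstack([shared_means, np.ones((n, 1))])           # append 1 for the bias term
        W, *_ = np.linalg.lstsq(X, condition_means, rcond=None)  # W: (d + 1, d)
        return W[:d].T, W[d]                                     # A: (d, d), b: (d,)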
 音声データ正規化部16は、パラメタ記憶部103に記憶されたパラメタを参照し、各々音声データ記憶部101に記憶された音声データを正規化し、音声データ処理部17に送る。具体的には、第sの音声データの音響特徴量の時系列x_1、x_2、・・・、x_t、・・・(tは正の整数)に対して、第sのパラメタを用い、式(1)の逆変換に相当する変換である、式(2)を施す。 The audio data normalization unit 16 refers to the parameters stored in the parameter storage units 103, normalizes the audio data stored in each audio data storage unit 101, and sends the result to the audio data processing unit 17. Specifically, to the time series x_1, x_2, ..., x_t, ... (t is a positive integer) of acoustic features of the s-th audio data, it applies formula (2), which is the transformation corresponding to the inverse of formula (1), using the s-th parameters.
  x̂_t = A_s^{-1} (x_t − b_s)    …(2)
 正規化を規定するパラメタは、音韻のクラス(摩擦音、破裂音など)に応じて異なるものを用いてもよいし、文脈依存性を考慮して前後の音韻に応じて異なるものを用いてもよい。また、音声データ正規化部16は、音響特徴量の平均ベクトルだけでなく分散も正規化するようにしてもよい。またこれらに限らず、正規化学習に関して知られている各種の工夫を適用してよい。 The parameters that define the normalization may differ depending on the phoneme class (fricatives, plosives, and so on), or may differ depending on the preceding and following phonemes to take context dependency into account. The audio data normalization unit 16 may also normalize not only the mean vector of the acoustic features but also the variance. The method is not limited to these; various techniques known for normalization training may be applied.
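 A minimal sketch of applying formula (2) to a feature sequence (illustrative; the parameter shapes are assumptions):

    import numpy as np

    def normalize_features(feats, A, b):
        # Apply x_hat = A^-1 (x - b) to every frame; feats: (T, D), A: (D, D), b: (D,).
        return (feats - b) @ np.linalg.inv(A).T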
 音声データ処理部17は、第1の実施形態における音声処理装置10と同様の構成および効果を有する。すなわち、音声データ処理部17は、図1が示す生成部11、クラスタリング部12、選択部13、および抽出部14の処理を第1の実施形態と同様に実行し、正規化された音声データ中に頻出するフレーズを含むセグメントを出力する。 The voice data processing unit 17 has the same configuration and effects as the voice processing apparatus 10 in the first embodiment. That is, the voice data processing unit 17 performs the processing of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14 illustrated in FIG. 1 in the same manner as in the first embodiment, and in the normalized voice data A segment containing a phrase that appears frequently is output.
 [動作の説明]
 図6を用いて、本実施形態の動作について説明する。図6は、本発明の第2の実施形態における音声処理装置20の動作例を示すフローチャートである。ここで、図6が示すように、本実施形態における音声データ処理部17の動作、すなわちステップS204からステップS208は、第1の実施形態における音声処理装置10の動作、すなわちステップS101乃至ステップS105と同様であるため、説明を省略する。
[Description of operation]
The operation of this embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing an operation example of the speech processing device 20 according to the second embodiment of the present invention. As FIG. 6 shows, the operation of the audio data processing unit 17 in this embodiment, i.e., steps S204 to S208, is the same as the operation of the speech processing device 10 in the first embodiment, i.e., steps S101 to S105, so its description is omitted.
 正規化学習部15は、音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する(ステップS201)。 The normalization learning unit 15 reads out each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S201).
 正規化学習部15は、正規化を行って音声データの性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する(ステップS202)。 The normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the speech data in the unspecified acoustic model storage unit 102 (step S202).
 音声データ正規化部16は、パラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する(ステップS203)。 The audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and normalizes the audio data stored in the audio data storage unit 101, respectively (step S203).
 音声データ処理部17は、図3が示す第1の実施形態における音声処理装置10のステップS101乃至ステップS105と同様の処理を実行し、音声データ中に頻出するフレーズを含むセグメントを出力する(ステップS204乃至ステップS208)。 The audio data processing unit 17 executes the same processing as steps S101 to S105 of the speech processing device 10 in the first embodiment shown in FIG. 3, and outputs segments containing phrases that frequently appear in the audio data (steps S204 to S208).
 [効果の説明]
 以上のように、本実施形態における音声処理装置20よれば、正規化学習部15が音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する。正規化学習部15が正規化を行って各々の音声データの音響的な性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する。また、音声データ正規化部16がパラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する。音声データ処理部17が正規化された音声データ中に頻出するフレーズを含むセグメントを出力する。そのため、本実施形態における音声処理装置20は、正規化されていない音声データを正規化し、所望のフレーズを選定することが可能である。
[Description of effects]
As described above, according to the speech processing apparatus 20 in the present embodiment, the normalization learning unit 15 reads each speech data from the speech data storage unit 101, performs normalization learning, and sets the normalization parameter of each speech data. Store in the parameter storage unit 103. The normalization learning unit 15 performs normalization to eliminate the difference in acoustic properties of the respective voice data, and stores the unspecified acoustic model generated in the unspecified acoustic model storage unit 102. Also, the voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in the voice data storage unit 101, respectively. The voice data processing unit 17 outputs a segment including a phrase that frequently appears in the normalized voice data. Therefore, the speech processing apparatus 20 in the present embodiment can normalize speech data that has not been normalized and select a desired phrase.
 また、本実施形態における音声処理装置20によれば、正規化学習部15が第1の音声データ、第2の音声データ、…、第Nの音声データの音響的な性質の違い(例えば話者の違い)を正規化する学習を行う。音声データ正規化部16が音響的な性質の違いを解消した後に、音声データ処理部17が音声データ中に頻出するフレーズを含むセグメントを抽出する。そのため、音声処理装置20は、音声データ中に頻出するフレーズをより正確に抽出できる。理由としては、本実施形態における音声処理装置20は、音声データ処理部17の中のクラスタリング部12が音声データの性質の違いに影響されて不適切なクラスタ(例えば話者のクラスタ)を生成するような事態を低減することができるからである。 Further, according to the speech processing device 20 in the present embodiment, the normalization learning unit 15 determines the difference in acoustic properties between the first speech data, the second speech data,. Learning to normalize the difference. After the voice data normalization unit 16 eliminates the difference in acoustic properties, the voice data processing unit 17 extracts segments including phrases that frequently appear in the voice data. Therefore, the voice processing device 20 can more accurately extract phrases that frequently appear in the voice data. The reason is that in the speech processing apparatus 20 according to the present embodiment, the clustering unit 12 in the speech data processing unit 17 is affected by the difference in the properties of the speech data and generates an inappropriate cluster (for example, a speaker cluster). This is because such a situation can be reduced.
 [第3の実施の形態]
 以下、本発明の第3の実施形態について図面を参照して詳細に説明する。
[Third Embodiment]
Hereinafter, a third embodiment of the present invention will be described in detail with reference to the drawings.
 [構成の説明]
 図7は、本発明の第3の実施形態における音声処理装置30の構成例を示すブロック図である。図7を参照すると、本発明の第3の実施形態における音声処理装置30は、第2の実施形態における音声処理装置20の構成に加え、未分類音声データ記憶部104と、音声データ分類部18と、を備える。ここで、第2の実施形態における音声処理装置20の構成は既に説明しているため、説明を省略する。また、音声データ分類部18は、第4の生成部とも記載する。
[Description of configuration]
FIG. 7 is a block diagram showing a configuration example of the speech processing device 30 according to the third embodiment of the present invention. Referring to FIG. 7, the speech processing device 30 according to the third embodiment of the present invention comprises, in addition to the configuration of the speech processing device 20 according to the second embodiment, an unclassified audio data storage unit 104 and an audio data classification unit 18. Since the configuration of the speech processing device 20 according to the second embodiment has already been described, its description is omitted. The audio data classification unit 18 is also referred to as a fourth generation unit.
 未分類音声データ記憶部104は、音声データを記憶する。 The uncategorized voice data storage unit 104 stores voice data.
 音声データ分類部18は、未分類音声データ記憶部104が記憶する音声データを音響的な性質に基づいて分類し、音声データ記憶部101に記憶する。音声データ分類部18は、例えば、未分類音声データ記憶部104に記憶された音声データを音響的な性質の違い、例えば話者の違いに基づいてN個のクラスタに分類し、音声データ記憶部101に各々記憶する。すなわち、音声データ分類部18は、未分類音声データ記憶部104に記憶された音声データを音響的な性質に基づいて分類することで、N個のクラスタを生成する。そして、音声データ分類部18は、第1の音声データ記憶部に第1のクラスタを、第2の音声データ記憶部に第2のクラスタを、・・・、第Nの音声データ記憶部に第Nのクラスタを記憶する。 The audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on their acoustic properties and stores the result in the audio data storage units 101. For example, the audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 into N clusters based on differences in acoustic properties, such as differences between speakers, and stores them in the respective audio data storage units 101. That is, the audio data classification unit 18 generates N clusters by classifying the audio data stored in the unclassified audio data storage unit 104 based on their acoustic properties, and stores the first cluster in the first audio data storage unit, the second cluster in the second audio data storage unit, ..., and the N-th cluster in the N-th audio data storage unit.
 ここで、未分類音声データ記憶部104に記憶された音声データは、種々の音響的な性質を有する音声データが混在したものであってよい。またNはあらかじめ定められた定数としてもよいし、処理対象に応じて音声データ分類部18が自動的に決定するようにしてもよい。これらは公知のクラスタリングの方法を適用することにより実施可能である。 Here, the audio data stored in the unclassified audio data storage unit 104 may be a mixture of audio data having various acoustic properties. N may be a predetermined constant, or may be automatically determined by the audio data classification unit 18 according to the processing target. These can be implemented by applying a known clustering method.
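 For illustration only (a sketch under the assumption that each recording is summarized by its mean MFCC vector and that k-means is the clustering method; the description itself only states that known clustering methods may be applied):

    import numpy as np
    from sklearn.cluster import KMeans

    def classify_by_acoustic_property(recordings_feats, n_clusters):
        # recordings_feats: list of (T_i, D) feature arrays, one per recording.
        # Returns one cluster index per recording.
        summaries = np.stack([f.mean(axis=0) for f in recordings_feats])
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(summaries)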
 音声データ記憶部101は、音声データ分類部18によって分類された音声データを各々記憶する。 The audio data storage unit 101 stores the audio data classified by the audio data classification unit 18.
 [動作の説明]
 図8を用いて、本実施形態の動作について説明する。図8は、本発明の第3の実施形態における音声処理装置30の動作例を示すフローチャートである。ここで、図8が示すように、本実施形態における音声データ処理部17の動作、すなわちステップS306からステップS310は、第1の実施形態における音声処理装置10の動作、すなわちステップS101乃至ステップS105と同様であるため、説明を省略する。
[Description of operation]
The operation of this embodiment will be described with reference to FIG. 8. FIG. 8 is a flowchart showing an operation example of the speech processing device 30 according to the third embodiment of the present invention. As FIG. 8 shows, the operation of the audio data processing unit 17 in this embodiment, i.e., steps S306 to S310, is the same as the operation of the speech processing device 10 in the first embodiment, i.e., steps S101 to S105, so its description is omitted.
 音声データ分類部18は、音声データ記憶部104が記憶する音声データを音響的な性質に基づいて分類し、音声データ記憶部101に記憶する(ステップS301)。 The voice data classification unit 18 classifies the voice data stored in the voice data storage unit 104 based on the acoustic properties, and stores the voice data in the voice data storage unit 101 (step S301).
 正規化学習部15は、音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する(ステップS302)。 The normalization learning unit 15 reads each voice data from the voice data storage unit 101, performs normalization learning, and stores the normalization parameter of each voice data in the parameter storage unit 103 (step S302).
 正規化学習部15は、正規化を行って音声データの性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する(ステップS303)。 The normalization learning unit 15 stores the unspecified acoustic model generated after performing normalization to eliminate the difference in the properties of the voice data in the unspecified acoustic model storage unit 102 (step S303).
 音声データ分類部18および正規化学習部15の結果が収束した場合(ステップS304でYes)、音声データ正規化部16は、パラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する(ステップS305)。 When the results of the speech data classification unit 18 and the normalization learning unit 15 converge (Yes in step S304), the speech data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103, and each of the speech data The voice data stored in the storage unit 101 is normalized (step S305).
 ここで、音声データ分類部18および正規化学習部15の結果が収束していない場合(ステップS304でNo)、ステップS301のフローへ戻る。これにより、音声データ分類部18と正規化学習部15は、結果が収束するまで交互に反復実行できる。 Here, when the results of the speech data classification unit 18 and the normalization learning unit 15 have not converged (No in step S304), the process returns to the flow of step S301. Thereby, the voice data classification unit 18 and the normalization learning unit 15 can be repeatedly executed alternately until the result converges.
 なお、音声データ分類部18と正規化学習部15が各々出力する結果は、相互に依存することもある。そのため、音声データ分類部18と正規化学習部15との実行回数が所定の閾値になるまでもしくは収束するまで、交互に実行する反復的な動作としてもよい。このような動作は、非特許文献3に記載される方法にならい、尤度最大化などの最適化基準に基づき効率的に実施することが可能である。 Note that the results output by the audio data classification unit 18 and the normalization learning unit 15 may depend on each other. Therefore, the two may be executed alternately and iteratively until the number of executions reaches a predetermined threshold or until the results converge. Following the method described in Non-Patent Document 3, such an operation can be carried out efficiently based on an optimization criterion such as likelihood maximization.
 音声データ処理部17は、図3が示す第1の実施形態における音声処理装置10のステップS101乃至ステップS105と同様の処理を実行し、音声データ中に頻出するフレーズを含むセグメントを出力する(ステップS306乃至ステップS310)。 The audio data processing unit 17 executes the same processing as steps S101 to S105 of the speech processing device 10 in the first embodiment shown in FIG. 3, and outputs segments containing phrases that frequently appear in the audio data (steps S306 to S310).
 [効果の説明]
 以上のように、本実施形態における音声処理装置30によれば、音声データ分類部18が、音声データ記憶部104が記憶する音声データを音響的な性質に基づいて分類し、音声データ記憶部101に記憶する。そして、正規化学習部15が、音声データ記憶部101から各々音声データを読み出し、正規化学習を行って、各々の音声データの正規化パラメタをパラメタ記憶部103に記憶する。正規化学習部15が正規化を行って各々の音声データの音響的な性質の差異を解消した上で生成した不特定音響モデルを不特定音響モデル記憶部102に記憶する。音声データ正規化部16がパラメタ記憶部103に記憶された正規化パラメタを参照し、それぞれ音声データ記憶部101に記憶された音声データを正規化する。音声データ処理部17が正規化された音声データ中に頻出するフレーズを含むセグメントを出力する。そのため、本実施形態における音声処理装置30は、分類および正規化されていない音声データを分類および正規化し、所望のフレーズを選定することが可能である。
[Description of effects]
As described above, according to the speech processing device 30 of the present embodiment, the audio data classification unit 18 classifies the audio data stored in the unclassified audio data storage unit 104 based on their acoustic properties and stores the result in the audio data storage units 101. The normalization learning unit 15 then reads the audio data from the audio data storage units 101, performs normalization training, and stores the normalization parameters of each set of audio data in the parameter storage units 103. The normalization learning unit 15 also stores, in the unspecified acoustic model storage unit 102, the unspecified acoustic model generated after normalization has removed the differences in the acoustic properties of the audio data. The audio data normalization unit 16 refers to the normalization parameters stored in the parameter storage units 103 and normalizes the audio data stored in the audio data storage units 101. The audio data processing unit 17 outputs segments containing phrases that frequently appear in the normalized audio data. Therefore, the speech processing device 30 of the present embodiment can classify and normalize audio data that has not yet been classified or normalized, and can select a desired phrase.
 また、本実施形態における音声処理装置30によれば、音声データ分類部18が音声データを音響的な性質の違いに基づいてN個のクラスタに分類し、その結果を用いて正規化学習部15が正規化学習を行うように構成されている。そのため、本実施形態における音声処理装置30は、第2の実施形態における音声処理装置20と比べて、音声データの準備コストを低減できる。その理由としては、本実施形態における音声処理装置30は、音声データをあらかじめ音響的な性質の違いに応じて(例えば話者ごとに)分けておく必要がなく、雑多な音声データの集合を一括で与えて処理することができるからである。 Further, according to the speech processing device 30 of the present embodiment, the audio data classification unit 18 classifies the audio data into N clusters based on differences in acoustic properties, and the normalization learning unit 15 performs normalization training using that result. Therefore, the speech processing device 30 of the present embodiment can reduce the cost of preparing audio data compared with the speech processing device 20 of the second embodiment. The reason is that the speech processing device 30 of the present embodiment does not require the audio data to be divided in advance according to differences in acoustic properties (for example, per speaker), and a miscellaneous collection of audio data can be given and processed in one batch.
 [第4の実施の形態]
 [構成の説明]
 以下、本発明の第4の実施形態について図面を参照して詳細に説明する。
[Fourth Embodiment]
[Description of configuration]
Hereinafter, a fourth embodiment of the present invention will be described in detail with reference to the drawings.
 図9は、本発明の第4の実施形態における音声処理システム40の構成例を示すブロック図である。図9を参照すると、第4の実施形態における音声処理システム40は、音声処理装置41と、音声入力装置42と、指示入力装置43と、出力装置44とを備える。 FIG. 9 is a block diagram showing a configuration example of the speech processing system 40 in the fourth embodiment of the present invention. Referring to FIG. 9, the voice processing system 40 in the fourth embodiment includes a voice processing device 41, a voice input device 42, an instruction input device 43, and an output device 44.
 音声処理装置41は、入力された音声に対して本発明の第1の実施形態における音声処理装置10の処理、第2の実施の形態における音声処理装置20の処理、または、第3の実施形態における音声処理装置30の処理(以降、「本発明の第1乃至第3の実施形態に記載のフレーズ抽出処理」と記載)を実行する。 The voice processing device 41 processes the input voice by the processing of the voice processing device 10 in the first embodiment of the present invention, the processing of the voice processing device 20 in the second embodiment, or the third embodiment. The processing of the voice processing device 30 (hereinafter referred to as “phrase extraction processing described in the first to third embodiments of the present invention”) is executed.
 音声入力装置42は、音声を入力する。音声入力装置42は、任意の音声データを音声処理装置41に入力するインターフェースとして働く任意のデバイス、すなわち音声信号をデータとして収受するマイクや音声データを記録するメモリなどである。音声入力装置42は、例えば、図2が示す入力デバイス1005である。 The voice input device 42 inputs voice. The audio input device 42 is an arbitrary device that functions as an interface for inputting arbitrary audio data to the audio processing device 41, that is, a microphone that receives audio signals as data, a memory that records audio data, and the like. The voice input device 42 is, for example, the input device 1005 shown in FIG.
 出力装置44は、音声処理装置41が処理を実行した結果を出力する。出力装置44は、音声処理装置41の処理結果を、操作者が指示入力装置43から入力した指示に応じて視覚的あるいは聴覚的手段で出力する、モニターやスピーカーなどの出力デバイスである。出力装置44の出力方法は、出力装置44がモニターの場合、例えば、クラスタの一覧をサイズ順に表示する、特定のクラスタの内容を波形図、スペクトログラムなどにより表示する、複数のセグメントを比較できるように並べて表示する、などである。また、出力装置44がスピーカーの場合、出力装置44の出力方法は、音声を再生する、などである。出力装置44は、例えば、ディスプレイ装置1006で実現される。 The output device 44 outputs the result of the processing executed by the speech processing device 41. The output device 44 is an output device, such as a monitor or a speaker, that outputs the processing result of the speech processing device 41 by visual or auditory means in accordance with an instruction the operator inputs through the instruction input device 43. When the output device 44 is a monitor, its output methods include, for example, displaying a list of clusters in order of size, displaying the content of a specific cluster as a waveform diagram or spectrogram, and displaying a plurality of segments side by side so that they can be compared. When the output device 44 is a speaker, its output method is, for example, playing back the audio. The output device 44 is realized by, for example, the display device 1006.
 指示入力装置43は、操作者からの指示情報を受けて表示装置に表示する情報を制御する。指示入力装置43は、出力装置44が出力する情報に対する処理や音声処理装置41の処理の実行など、操作者の指示情報を受け取るユーザインタフェースであり、マウスやキーボード、タッチパネルなどの任意の入力デバイスが利用可能である。 The instruction input device 43 receives instruction information from the operator and controls the information displayed on the display device. The instruction input device 43 is a user interface that receives the operator's instruction information, such as operations on the information output by the output device 44 and commands to execute the processing of the speech processing device 41; any input device such as a mouse, keyboard, or touch panel can be used.
 [動作の説明]
 以下、本発明の第4の実施形態における音声処理システム40の動作例について説明する。
[Description of operation]
Hereinafter, an operation example of the voice processing system 40 according to the fourth embodiment of the present invention will be described.
 指示入力装置43は、操作者からの指示情報を受け取り、音声処理装置41に処理を実行するよう制御する。音声入力装置42は、任意の音声データを音声処理装置41に入力する。音声処理装置41は、入力された音声データに基づき、本発明の第1乃至第3の実施形態に記載のフレーズ抽出処理を実行し、頻繁に出現するフレーズを含んだクラスタを選択し、さらに選択されたクラスタに含まれるセグメントを抽出する。出力装置44は、音声処理装置41の処理結果を、操作者が指示入力装置43から入力した指示に応じて視覚的あるいは聴覚的手段で出力する。つまり、出力装置44は、操作者が閲覧したいと希望した形態で、処理結果を出力する。 The instruction input device 43 receives the instruction information from the operator and controls the voice processing device 41 to execute the process. The voice input device 42 inputs arbitrary voice data to the voice processing device 41. The speech processing device 41 executes the phrase extraction processing described in the first to third embodiments of the present invention based on the input speech data, selects clusters including frequently occurring phrases, and further selects them. The segments included in the created cluster are extracted. The output device 44 outputs the processing result of the sound processing device 41 by visual or auditory means according to an instruction input from the instruction input device 43 by the operator. That is, the output device 44 outputs the processing result in a form that the operator desires to view.
 [効果の説明]
 以上のように、本実施形態における音声処理システム40によれば、指示入力装置43が操作者から入力される指示情報に応じて、音声処理装置41に処理を実行するよう制御する。音声入力装置42が任意の音声データを音声処理装置41に入力する。音声処理装置41が入力された音声データに基づき、本発明の第1乃至第3の実施形態に記載のフレーズ抽出を実行し、頻繁に出現するフレーズ(セグメント)を含んだクラスタを選択し、さらに選択されたクラスタに含まれるセグメントを抽出する。出力装置44が音声処理装置41の処理結果を、操作者が指示入力装置43から入力した指示に応じて視覚的あるいは聴覚的手段で出力する。そのため、本実施形態における音声処理システム40は、音声データに含まれる頻繁に出現するフレーズを含むクラスタやセグメントを出力することが可能である。
[Description of effects]
As described above, according to the speech processing system 40 of the present embodiment, the instruction input device 43 controls the speech processing device 41 to execute processing in accordance with the instruction information input by the operator. The voice input device 42 inputs arbitrary audio data to the speech processing device 41. Based on the input audio data, the speech processing device 41 executes the phrase extraction described in the first to third embodiments of the present invention, selects clusters containing frequently occurring phrases (segments), and extracts the segments contained in the selected clusters. The output device 44 outputs the processing result of the speech processing device 41 by visual or auditory means in accordance with the instruction input by the operator through the instruction input device 43. Therefore, the speech processing system 40 of the present embodiment can output clusters and segments that contain frequently occurring phrases included in the audio data.
 また、本実施形態における音声処理システム40は、操作者が音声からの人物の特定などの分析作業が容易に行える。その理由としては、本実施形態における音声処理システム40は操作者が閲覧したい形態で、処理結果が出力装置44に出力されるように構成されているためである。また、本実施形態における音声処理システム40は、頻繁に出現するフレーズが視覚的、聴覚的に出力されることから、特定の人物がよく話す口癖や話題の傾向などを分析することができる。 In addition, the voice processing system 40 according to the present embodiment allows an operator to easily perform analysis work such as identification of a person from voice. This is because the voice processing system 40 in the present embodiment is configured such that the processing result is output to the output device 44 in a form that the operator wants to browse. In addition, the speech processing system 40 according to the present embodiment can frequently analyze phrases that frequently appear, so that it is possible to analyze a tendency of a talk or a topic that a specific person often speaks.
 (具体例)
 以下、本発明の第1の実施形態の具体例を説明する。図10乃至図12を用いて、音声処理装置10が音声データからフレーズを抽出する一例を説明する。
(Concrete example)
Hereinafter, a specific example of the first embodiment of the present invention will be described. An example in which the speech processing apparatus 10 extracts a phrase from speech data will be described with reference to FIGS.
 上記外部記憶装置が記憶する音声データからフレーズを抽出する一例の詳細について、図10乃至図12を用いて、説明する。図10は、外部記憶装置が記憶する音声データの一例を示す図である。ここで、外部記憶装置は、例えば、第4の実施形態における音声入力装置42によって実現される。 Details of an example of extracting a phrase from audio data stored in the external storage device will be described with reference to FIGS. FIG. 10 is a diagram illustrating an example of audio data stored in the external storage device. Here, the external storage device is realized by, for example, the voice input device 42 in the fourth embodiment.
 図10が示すように、外部記憶装置は、音声データとその音声データの識別子である音声データIDを記憶する。音声データIDが「1」の場合、外部記憶装置は、「・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・」という音声データを記憶する。ここで、外部記憶装置は、図10が示す音声データの内容に限らない。 As shown in FIG. 10, the external storage device stores audio data together with an audio data ID that identifies each piece of audio data. For the audio data ID "1", the external storage device stores the audio data "・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・" ("... We have your child. Prepare the ransom. The meeting place is ..."). The audio data stored in the external storage device is not limited to the content shown in FIG. 10.
 図11は、生成部11が音声データからセグメントを生成する方法の一例を示す図である。図11が示すように、生成部11は、図10が示す音声データ、すなわち音声データID「1」である「・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・」から、複数のセグメントを生成する。図11が示すように、セグメント1は「預かった」、セグメント2は「預かった。身代」である。図11が示すように、生成部11は、音声データを任意(所定の時間等)で細分化し、これらを用いて複数のセグメントを生成する。ここで、生成部11は、音声データから、セグメント同士が重複するようにセグメントを生成する。すなわち、図11が示すように、セグメント1は「預かった」、セグメント2は「預かった。身代」というように、セグメント1及び2では、「預かった」が重複している。これにより、音声処理装置10は、音声データ内から求められるフレーズを抽出できる。 FIG. 11 is a diagram showing an example of how the generation unit 11 generates segments from the audio data. As shown in FIG. 11, the generation unit 11 generates a plurality of segments from the audio data shown in FIG. 10, i.e., the audio data with audio data ID "1". As shown in FIG. 11, segment 1 is "預かった" and segment 2 is "預かった。身代". The generation unit 11 subdivides the audio data into arbitrary units (for example, a predetermined time) and generates a plurality of segments from them. Here, the generation unit 11 generates the segments from the audio data so that the segments overlap one another; that is, as FIG. 11 shows, segment 1 ("預かった") and segment 2 ("預かった。身代") share the overlapping portion "預かった". As a result, the speech processing device 10 can extract the desired phrases from the audio data.
 図12は、クラスタリング部12が、複数のセグメントをまとめたクラスタを生成する方法の一例を示す図である。図12が示すように、クラスタは、例えば、セグメントの内容(フレーズ)の識別子であるクラスタIDと、セグメントの内容と、全ての音声データ内で出現した、セグメントの内容(フレーズ)が出現した出現回数とを含む。なお、図12に示す通り、本具体例では、クラスタIDと、図11で示したセグメントの番号とは同じであるとして説明を行う。クラスタは、例えば、クラスタIDが「1」のフレーズ「預かった」が全ての音声データ内で20回出現したことを示す。すなわち、クラスタリング部12は、図11が示すように生成部11が生成した複数のセグメントから、各セグメント間の類似度を計算し、類似度の高い、すなわち同じセグメント同士をまとめたクラスタを生成する。 FIG. 12 is a diagram illustrating an example of a method in which the clustering unit 12 generates a cluster in which a plurality of segments are collected. As shown in FIG. 12, for example, a cluster is a cluster ID that is an identifier of a segment content (phrase), a segment content, and an appearance of a segment content (phrase) that appears in all audio data. Including the number of times. In this specific example, as shown in FIG. 12, the cluster ID and the segment number shown in FIG. 11 are assumed to be the same. The cluster indicates, for example, that the phrase “deposited” with the cluster ID “1” appears 20 times in all audio data. That is, the clustering unit 12 calculates the similarity between the segments from the plurality of segments generated by the generation unit 11 as illustrated in FIG. 11, and generates a cluster having a high similarity, that is, a group of the same segments. .
 選択部13は、クラスタに含まれるセグメントの個数および総時間長を用いてクラスタを比較し、所定の条件を満たすクラスタを選択する。選択部13は、例えば、クラスタリング部12が生成した複数のクラスタの中で、各クラスタに含まれるセグメントの数、つまり、フレーズの出現回数に基づき比較する。図12が示すように、選択部13は出現回数が35回の「身代金を」と出現回数30回の「身代金を用意しろ。」を選択する。次に、選択部13は、各クラスタのサイズに基づき比較する。選択部13は、例えば、出現回数とセグメントの長さ、すなわち時間長との掛け算の結果を各クラスタのサイズとし、各クラスタのサイズが一番大きいクラスタを選択する。 The selection unit 13 compares clusters using the number of segments included in the cluster and the total time length, and selects a cluster that satisfies a predetermined condition. For example, the selection unit 13 compares the number of segments included in each cluster among the plurality of clusters generated by the clustering unit 12, that is, the number of appearances of the phrase. As shown in FIG. 12, the selection unit 13 selects “Ransom” with an appearance count of 35 and “Prepare a ransom” with an appearance count of 30. Next, the selection unit 13 performs comparison based on the size of each cluster. For example, the selection unit 13 uses the result of multiplying the number of appearances and the segment length, that is, the time length as the size of each cluster, and selects the cluster having the largest size of each cluster.
 図12が示すように、選択部13は、例えば、クラスタIDが7のクラスタと、クラスタIDが8のクラスタとを比較する。選択部13は、出現回数35回と「身代金を」の時間長との掛け算の結果と、出現回数30回と「身代金を用意しろ。」の時間長との掛け算の結果とを比較し、セグメントの内容が「身代金を用意しろ。」であるクラスタを選択する。すなわち、「身代金を用意しろ。」というフレーズが所望のフレーズである。また、選択部13は、出現回数が同じクラスタ同士の比較の場合は、セグメントの時間長のみを比較して選定してもよい。なお、選択部13は、上記の方法に限定されず、出現回数や時間長その他音素の数等様々な指標に基づいてサイズを定義し、比較して良い。 As shown in FIG. 12, the selection unit 13 compares, for example, the cluster with cluster ID 7 and the cluster with cluster ID 8. The selection unit 13 compares the product of the appearance count of 35 and the time length of "身代金を" ("the ransom") with the product of the appearance count of 30 and the time length of "身代金を用意しろ。" ("Prepare the ransom."), and selects the cluster whose segment content is "身代金を用意しろ。". That is, the phrase "身代金を用意しろ。" ("Prepare the ransom.") is the desired phrase. When comparing clusters with the same appearance count, the selection unit 13 may compare only the time lengths of the segments. Note that the selection unit 13 is not limited to the above method; the size may be defined and compared based on various indices such as the appearance count, the time length, or the number of phonemes.
 そして、抽出部14は、選択されたクラスタからセグメントを抽出する。これにより、内容が「身代金を用意しろ」であるセグメントである音声データが、抽出される。このセグメントの音声データによって「身代金を用意しろ」というフレーズが頻繁に音声データ中に含まれていることがわかる。 Then, the extraction unit 14 extracts a segment from the selected cluster. As a result, audio data that is a segment whose content is “Prepare ransom” is extracted. It can be seen from the voice data of this segment that the phrase “Prepare ransom” is frequently included in the voice data.
 以上のように、本具体例における音声処理装置10では、例えば、「・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・」という音声データから頻出フレーズである「身代金を用意しろ」を抽出することが可能である。 As described above, the speech processing device 10 of this specific example can extract the frequently occurring phrase "身代金を用意しろ" ("Prepare the ransom") from audio data such as "・・・子どもを預かった。身代金を用意しろ。待ち合わせ場所は・・・" ("... We have your child. Prepare the ransom. The meeting place is ...").
 以上、実施形態および具体例を用いて本願発明を説明したが、本発明は必ずしも上記実施形態および具体例に限定されるものではない。本発明の構成や詳細には、本発明のスコープ内で当業者が理解しうる(その技術的思想の範囲内において)様々な変更をし、実施することができる。 As mentioned above, although this invention was demonstrated using embodiment and a specific example, this invention is not necessarily limited to the said embodiment and specific example. Various changes and modifications that can be understood by those skilled in the art within the scope of the present invention (within the scope of the technical idea) can be made to the configuration and details of the present invention.
 この出願は、2015年3月25日に出願された日本出願特願2015-061854を基礎とする優先権を主張し、その開示の全てをここに取り込む。 This application claims priority based on Japanese Patent Application No. 2015-061854 filed on Mar. 25, 2015, the entire disclosure of which is incorporated herein.
 10  音声処理装置
 11  生成部
 12  クラスタリング部
 13  選択部
 14  抽出部
 15  正規化学習部
 16  音声データ正規化部
 17  音声データ処理部
 18  音声データ分類部
 20  音声処理装置
 30  音声処理装置
 40  音声処理システム
 41  音声処理装置
 42  音声入力装置
 43  指示入力装置
 44  出力装置
 101  音声データ記憶部
 102  不特定音響モデル記憶部
 103  パラメタ記憶部
 1000  コンピュータ
 1001  CPU
 1002  主記憶装置
 1003  補助記憶装置
 1004  インターフェース
 1005  入力デバイス
 1006  ディスプレイ装置
DESCRIPTION OF SYMBOLS 10 Speech processing apparatus 11 Generation part 12 Clustering part 13 Selection part 14 Extraction part 15 Normalization learning part 16 Speech data normalization part 17 Speech data processing part 18 Speech data classification part 20 Speech processing apparatus 30 Speech processing apparatus 40 Speech processing system 41 Audio processing device 42 Audio input device 43 Instruction input device 44 Output device 101 Audio data storage unit 102 Unspecified acoustic model storage unit 103 Parameter storage unit 1000 Computer 1001 CPU
1002 Main storage device 1003 Auxiliary storage device 1004 Interface 1005 Input device 1006 Display device

Claims (10)

  1.  音声データから、隣接するセグメントが少なくとも一部重複する複数のセグメントを生成する第1の生成手段と、
     前記複数のセグメントを音韻の類似性に基づき分類してクラスタを生成する第2の生成手段と、
     前記クラスタのサイズに基づいて、所定の条件を満たすクラスタを選択する選択手段と、
     前記選択されたクラスタに含まれるセグメントを抽出する抽出手段と
     を備える音声処理装置。
    First generation means for generating a plurality of segments at least partially overlapping adjacent segments from audio data;
    Second generating means for generating a cluster by classifying the plurality of segments based on phoneme similarity;
    Selection means for selecting a cluster that satisfies a predetermined condition based on the size of the cluster;
    An audio processing apparatus comprising: extraction means for extracting a segment included in the selected cluster.
  2.  複数の音声データに基づき、当該複数の音声データの音響的な性質の違いを正規化するための複数の正規化パラメタを生成する第3の生成手段と、
     前記複数の正規化パラメタを用いて、前記音声データを正規化する正規化手段とをさらに備え、
     前記第1の生成手段は、前記正規化された音声データから前記複数のセグメントを生成する請求項1に記載の音声処理装置。
    Third generation means for generating a plurality of normalization parameters for normalizing differences in acoustic properties of the plurality of sound data based on the plurality of sound data;
    Normalization means for normalizing the audio data using the plurality of normalization parameters;
    The audio processing apparatus according to claim 1, wherein the first generation unit generates the plurality of segments from the normalized audio data.
  3.  前記選択手段は、クラスタに含まれるセグメントの個数または総時間長を用いて前記クラスタを比較し、選択する請求項1または2に記載の音声処理装置。 The speech processing apparatus according to claim 1 or 2, wherein the selection means compares and selects the clusters using the number of segments included in the clusters or the total time length.
  4.  前記第2の生成手段は、前記セグメントを構成する音響特徴量の比較によりセグメント間の類似度を計算する請求項1乃至3のいずれか1項に記載の音声処理装置。 The speech processing apparatus according to any one of claims 1 to 3, wherein the second generation unit calculates a similarity between segments by comparing acoustic feature amounts constituting the segments.
  5.  前記第2の生成手段は、前記セグメント間のDP(Dynamic Programming)マッチングにより類似度を生成する請求項1または2に記載の音声処理装置。 The speech processing apparatus according to claim 1 or 2, wherein the second generation unit generates a similarity by DP (Dynamic Programming) matching between the segments.
  6.  音声データを音響的な性質の違いに基づいて分類してクラスタを生成する第4の生成手段をさらに備え、
     前記第3の生成手段は、前記クラスタに対して正規化パラメタを生成する請求項2記載の音声処理装置。
    A fourth generation means for generating a cluster by classifying the audio data based on a difference in acoustic properties;
    The speech processing apparatus according to claim 2, wherein the third generation unit generates a normalization parameter for the cluster.
  7.  前記第4の生成手段および前記学習手段は、相互の結果に基づき、前記結果が収束するまで又は実行回数が所定の閾値に達するまで交互に反復実行する請求項6記載の音声処理装置。 7. The speech processing apparatus according to claim 6, wherein the fourth generation means and the learning means are repeatedly executed alternately until the result converges or the number of executions reaches a predetermined threshold based on a mutual result.
  8.  A speech processing method comprising:
      generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap;
      classifying the plurality of segments based on phonological similarity to generate clusters;
      selecting a cluster that satisfies a predetermined condition based on the size of the cluster; and
      extracting the segments included in the selected cluster.
  9.  A computer-readable recording medium storing a program that causes a computer to execute:
      a process of generating, from speech data, a plurality of segments in which adjacent segments at least partially overlap;
      a process of classifying the plurality of segments based on phonological similarity to generate clusters;
      a process of selecting one or more clusters that satisfy a predetermined condition based on the size of the clusters; and
      a process of extracting the segments included in the selected clusters.
  10.  A speech processing system comprising:
      an instruction input device that receives instruction information from an operator;
      a speech input device that inputs speech data to a speech processing device;
      the speech processing device according to any one of claims 1 to 7, which executes processing on the input speech data based on the instruction information; and
      an output device that outputs a processing result of the speech processing device,
      wherein the output device outputs the processing result according to the instruction information.
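
As an informal illustration only, a minimal Python/NumPy sketch of the processing recited in claims 1, 8, and 9 is given below: overlapping segments are generated from a feature sequence, grouped into clusters by similarity, and the segments of sufficiently large clusters are extracted. The MFCC-like random features, the fixed segment length and hop, the cosine-similarity threshold, and the greedy clustering rule are assumptions made for the sketch and are not part of the claimed method.

```python
import numpy as np

def make_overlapping_segments(features, seg_len=20, hop=10):
    """Cut a (num_frames, dim) feature array into segments; adjacent
    segments overlap by (seg_len - hop) frames."""
    return [features[s:s + seg_len]
            for s in range(0, len(features) - seg_len + 1, hop)]

def cluster_by_similarity(segments, threshold=0.8):
    """Greedy clustering: add a segment to the first cluster whose centroid
    is similar enough (cosine), otherwise start a new cluster."""
    clusters, centroids = [], []
    for idx, seg in enumerate(segments):
        vec = seg.mean(axis=0)                      # crude fixed-length summary
        vec = vec / (np.linalg.norm(vec) + 1e-9)
        for c, cen in enumerate(centroids):
            if float(vec @ cen) >= threshold:
                clusters[c].append(idx)
                cen = cen + (vec - cen) / len(clusters[c])     # approximate running mean
                centroids[c] = cen / (np.linalg.norm(cen) + 1e-9)
                break
        else:
            clusters.append([idx])
            centroids.append(vec)
    return clusters

def select_and_extract(segments, clusters, min_size=3):
    """Keep clusters whose size meets the condition and return their segments."""
    return [[segments[i] for i in c] for c in clusters if len(c) >= min_size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motif = rng.standard_normal((20, 13))           # a recurring "pattern"
    feats = np.concatenate([motif if i % 3 == 0 else rng.standard_normal((20, 13))
                            for i in range(10)])    # stand-in for MFCC frames
    segs = make_overlapping_segments(feats)
    frequent = select_and_extract(segs, cluster_by_similarity(segs))
    print(f"{len(segs)} segments, {len(frequent)} frequent cluster(s)")
```

Because the synthetic feature sequence repeats one motif several times, the segments aligned with that motif fall into a single cluster that passes the size condition, which is the "frequent pattern" behaviour the claims describe.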
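Claim 5 generates inter-segment similarity by DP (Dynamic Programming) matching. One plausible reading is a dynamic-time-warping distance between two variable-length feature sequences, converted to a similarity score; the sketch below assumes Euclidean frame distances, length normalization, and an exp(-d) conversion, none of which are specified by the claim.

```python
import numpy as np

def dp_matching_distance(a, b):
    """DTW distance between sequences a (m, dim) and b (n, dim) using
    Euclidean frame distances and the standard three-way recursion."""
    m, n = len(a), len(b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[m, n] / (m + n)                       # length-normalized path cost

def dp_matching_similarity(a, b):
    """Map the distance to a similarity in (0, 1]; identical sequences give 1.0."""
    return float(np.exp(-dp_matching_distance(a, b)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((18, 13))
    y = rng.standard_normal((25, 13))
    print(round(dp_matching_similarity(x, x), 3))     # identical sequences -> 1.0
    print(round(dp_matching_similarity(x, y), 3))     # unrelated sequences -> lower
```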
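Claims 2, 6, and 7 together describe clustering speech data by acoustic properties, learning a set of normalization parameters per cluster, and alternating the two steps until the results converge or the number of executions reaches a threshold. The sketch below illustrates such a loop using per-cluster mean/variance normalization and a nearest-mean reassignment rule; both are assumptions chosen for brevity rather than the learning and clustering actually described in the specification.

```python
import numpy as np

def learn_normalization(recordings, assignment, num_clusters):
    """One (mean, std) pair per cluster, pooled over the recordings assigned to it."""
    params, dim = [], recordings[0].shape[1]
    for c in range(num_clusters):
        members = [r for r, a in zip(recordings, assignment) if a == c]
        pooled = np.concatenate(members) if members else np.zeros((1, dim))
        params.append((pooled.mean(axis=0), pooled.std(axis=0) + 1e-9))
    return params

def assign_clusters(recordings, params):
    """Re-assign each recording to the cluster whose mean it is closest to."""
    return [int(np.argmin([np.linalg.norm(r.mean(axis=0) - mu) for mu, _ in params]))
            for r in recordings]

def normalize(recordings, assignment, params):
    """Apply the selected cluster's mean/variance normalization to each recording."""
    return [(r - params[a][0]) / params[a][1] for r, a in zip(recordings, assignment)]

def alternate(recordings, num_clusters=2, max_iter=10):
    """Alternate parameter learning and cluster assignment until the
    assignment stops changing or the iteration cap is reached (claim 7)."""
    assignment = [i % num_clusters for i in range(len(recordings))]  # arbitrary start
    for _ in range(max_iter):
        params = learn_normalization(recordings, assignment, num_clusters)
        new_assignment = assign_clusters(recordings, params)
        if new_assignment == assignment:              # convergence check
            break
        assignment = new_assignment
    return normalize(recordings, assignment, params), assignment

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    recs = [rng.standard_normal((50, 13)) + (5 if i % 2 else 0) for i in range(6)]
    normalized, labels = alternate(recs)
    print(labels)   # recordings with shifted statistics end up in their own cluster
```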
PCT/JP2016/001593 2015-03-25 2016-03-18 Speech processing device, speech processing system, speech processing method, and recording medium WO2016152132A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2017507495A JP6784255B2 (en) 2015-03-25 2016-03-18 Speech processing device, speech processing system, speech processing method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015061854 2015-03-25
JP2015-061854 2015-03-25

Publications (1)

Publication Number Publication Date
WO2016152132A1 (en) 2016-09-29

Family

ID=56978310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/001593 WO2016152132A1 (en) 2015-03-25 2016-03-18 Speech processing device, speech processing system, speech processing method, and recording medium

Country Status (2)

Country Link
JP (1) JP6784255B2 (en)
WO (1) WO2016152132A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008515012A (en) * 2004-09-28 2008-05-08 Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V. Apparatus and method for grouping time segments of music
JP2008533580A (en) * 2005-03-10 2008-08-21 Koninklijke Philips Electronics N.V. Summary of audio and/or visual data
JP2007140136A (en) * 2005-11-18 2007-06-07 Mitsubishi Electric Corp Music analysis device and music search device
JP2010032792A (en) * 2008-07-29 2010-02-12 Nippon Telegr & Teleph Corp <Ntt> Speech segment speaker classification device and method therefor, speech recognition device using the same and method therefor, program and recording medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613249A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Voice analysis method and equipment
CN113380273A (en) * 2020-08-10 2021-09-10 腾擎科研创设股份有限公司 System for detecting abnormal sound and judging formation reason
CN113178196A (en) * 2021-04-20 2021-07-27 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
JP6784255B2 (en) 2020-11-11
JPWO2016152132A1 (en) 2018-01-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 16768039
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2017507495
    Country of ref document: JP
    Kind code of ref document: A
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 16768039
    Country of ref document: EP
    Kind code of ref document: A1