WO2014203328A1

WO2014203328A1 - Voice data search system, voice data search method, and computer-readable storage medium

Info

Publication number: WO2014203328A1
Application number: PCT/JP2013/066690
Authority: WO
Inventors: 龍武田; 直之神田; 藤田　雄介; 康成大淵
Original assignee: 株式会社日立製作所
Priority date: 2013-06-18
Filing date: 2013-06-18
Publication date: 2014-12-24

Abstract

A technique for achieving an accurate voice data search is provided. The present invention receives a search keyword and calculates first score values, which are score values between the search keyword and pieces of voice section data included in indexed voice information data, thereby obtaining a plurality of search result candidates. The present invention then identifies pieces of voice section data adjacent to each of the plurality of search result candidates on the basis of dialog sequence data. Further, the present invention obtains information related to the search keyword and calculates second score values, which are score values between the related information and the adjacent pieces of voice section data. The present invention then corrects the first score values using the second score values, outputs the corrected score values, sorts the plurality of search result candidates by use of the corrected score values, and outputs the sorted plurality of search result candidates.

Description

Speech data retrieval system, speech data retrieval method, and computer-readable storage medium

The present invention relates to a voice data search system, a voice data search method, and a computer-readable storage medium, for example, a technique for searching for a specific keyword from voice data.

In a call center, voice call data for thousands of hours a day, specifically, operator and customer voices are often recorded in pairs. These are recorded for operator training and confirmation of received contents, and a voice database is used as necessary. In particular, customer voices contain information such as product names, product defects, and complaints that need to be heard efficiently and put together in reports. In many conventional voice databases, information on the time when voice is recorded is given to the voice data, and desired voice data is searched based on the time information. In the search based on the time information, it is necessary to know in advance the time when the desired voice is uttered, so that it is not suitable for use in searching for a voice with a specific utterance. When searching for a voice with a specific utterance, in the conventional method, it is necessary to listen to the voice data from the beginning to the end.

Therefore, technology for detecting the position where a specific keyword is spoken in a speech database has been developed. In the subword search method, one of the representative methods, speech data is first converted into a subword string by subword recognition processing. Here, “subword” is a general term for a unit smaller than a word, such as a phoneme or a syllable. In the subword search method, the subword representation of the input keyword is compared with the subword recognition result of the speech data, and the distance between the subword strings is calculated according to some criterion. By sorting the search results in descending order using the calculated distance as a score, the time at which the keyword is spoken is detected in the voice data. Patent Document 1, for example, discloses such a technique. Japanese Patent Laid-Open No. 2004-133867 discloses that, when the appearance location (time) of a search result obtained by searching a speech database for an input search keyword is close to that of a search result obtained with a co-occurrence keyword related to the input search keyword, the evaluation value (score value) of the search result for the input search keyword is increased.

JP 2009-295101A

However, in Patent Document 1, a keyword and its co-occurrence keyword are searched only from the voice data of the speaker to be searched, and a score is given based on the search result. In general, subword recognition is difficult for customer's voice due to the influence of noise and the diversity of speaker characteristics, and keyword misdetection increases. For this reason, the technique such as Patent Document 1 has a problem that unnecessary search results rise to the top and search accuracy decreases.

The present invention has been made in view of such a situation, and provides a technique for realizing high-accuracy speech data retrieval.

In order to solve the above-described problems, the present invention executes: (i) processing for performing subword recognition on the data in the speech sections of the search target data, using an acoustic model and a language model generated from learning speech data, and generating index voice information data that includes speech section data, silent section information, voice file channel information indicating the channel in which the speech section data is spoken, and voice metadata information; (ii) processing for generating dialogue order data indicating the utterance order of the speech section data, based on the voice file channel information and the voice metadata information; (iii) processing for accepting a search keyword, calculating first score values, which are score values between the search keyword and the speech section data included in the index voice information data, and obtaining a plurality of search result candidates; (iv) processing for identifying the speech section data surrounding each of the plurality of search result candidates, based on the dialogue order data; (v) processing for acquiring related information related to the search keyword and calculating second score values, which are score values between the related information and the surrounding speech section data; (vi) processing for correcting the first score values using the second score values and outputting corrected score values; and (vii) processing for sorting the plurality of search result candidates using the corrected score values and outputting them.

Further features related to the present invention will become apparent from the description of the present specification and the accompanying drawings. The embodiments of the present invention can be achieved and realized by elements and combinations of various elements and the following detailed description and appended claims.

According to the present invention, it is possible to improve the search accuracy by correcting the score using the status of the speech section of another speaker before and after the speech section of the keyword search result, for example, the related keyword and the silent section length.

A diagram showing the configuration of the voice data search device 1 according to the first embodiment of the present invention.
A flowchart for explaining the voice data registration process performed by the indexing/voice information extraction unit 106 in this embodiment of the present invention.
A diagram showing an example of divided voice data.
A diagram illustrating a configuration example of the information stored in the index/voice information data (storage unit) 107.
A diagram showing an example of a subword N-gram (when N = 3).
A diagram showing an example of a conversation on the same time.
A diagram showing a configuration example of the dialogue order data 109 constructed by the dialogue order analysis unit.
A flowchart for explaining the keyword input process.
A diagram showing an example of conversion into subwords.
A flowchart for explaining the related information input process.
A flowchart for explaining the processing by the candidate position evaluation unit 112 according to the embodiment of the present invention.
A diagram showing a configuration example of the display format of the search results displayed by the search result display unit.
A diagram showing a configuration example of the voice data search device 2 according to the second embodiment of the present invention.
A flowchart for explaining the related information data construction process.
A diagram illustrating a configuration example of the related information data stored in the related information data (storage unit) 1404.
A flowchart for explaining the processing by the related information data selection unit 1305.
A diagram showing a configuration example of the voice data search system according to the third embodiment of the present invention.
A diagram illustrating an example of the format of the voice data stored in the storage device 1719.
A diagram showing the schematic configuration of a general content cloud system.
A diagram showing the schematic configuration of a voice data search system realized by incorporating the functions of the voice data search device 1 into the content cloud system in the fourth embodiment of the present invention.

When extracting a keyword from a customer's voice, the present invention uses the situation of the other speaker's speech sections before and after the speech section of the keyword search result, for example, the related keyword information and silent section length information included in the operator's utterances, to correct the search score value of the search result obtained with the input search keyword, thereby improving the search accuracy. The present invention is based on the observation that, when confirming whether customer voice data is, for example, a complaint, a call center practitioner pays attention to the operator's response status, for example, silent section length and emotion information.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the accompanying drawings, functionally identical elements may be denoted by the same numbers. The attached drawings show specific embodiments and implementation examples based on the principle of the present invention, but they are provided for understanding the present invention and are not to be used to interpret the present invention in a limiting manner.

This embodiment has been described in sufficient detail for those skilled in the art to practice the present invention, but other implementations and configurations are possible without departing from the scope and spirit of the technical idea of the present invention. It is necessary to understand that the configuration and structure can be changed and various elements can be replaced. Therefore, the following description should not be interpreted as being limited to this.

Furthermore, as will be described later, the embodiment of the present invention may be implemented by software running on a general-purpose computer, or may be implemented by dedicated hardware or a combination of software and hardware.

In the following description, each piece of information of the present invention is described in a “table” format. However, the information does not necessarily have to be expressed as a table and may be expressed in another data structure, such as a list, a DB, or a queue. Therefore, “table”, “list”, “DB”, “queue”, and the like may be referred to simply as “information” to indicate that they do not depend on the data structure.

In addition, when explaining the contents of each piece of information, the expressions “identification information”, “identifier”, “name”, and “ID” can be used, and these are interchangeable.

In the following, each process in the embodiment of the present invention will be described using a “program” as the subject (operation subject). However, since a program is executed by a processor and performs its defined processing while using a memory and a communication port (communication control device), the description may also be made with the processor as the subject. Further, the processing disclosed with the program as the subject may be processing performed by a computer such as a management server or an information processing apparatus. Part or all of the program may be realized by dedicated hardware, or may be modularized. Various programs may be installed in each computer by a program distribution server or a storage medium.

(1) First Embodiment
The first embodiment relates to a stand-alone voice data retrieval apparatus.

<Configuration of voice data retrieval device>
FIG. 1 is a diagram showing a configuration of a speech data retrieval apparatus 1 according to the first embodiment of the present invention.

The speech data search apparatus 1 includes learning-labeled speech data (storage unit) 101, an acoustic/language model learning unit 102, an acoustic model (storage unit) 103, a language model (storage unit) 104, search target data (storage unit) 105, an indexing/speech information extraction unit 106, index/speech information data (storage unit) 107, a dialogue order analysis unit 108, dialogue order data (storage unit) 109, a keyword input unit 110, a related information input unit 111, a candidate position evaluation unit 112, a search result integration unit 113, and a search result display unit 114.

The learning-labeled speech data (storage unit) 101 is learning data prepared in advance. It stores the speech waveforms of an unspecified number of speakers together with labels consisting of text that transcribes the utterance content. As long as the voice data is accompanied by transcribed text, it may be an audio track extracted from TV, a read-speech corpus, or ordinary conversation. An ID for identifying the speaker and labels such as the presence or absence of noise may also be attached.

The acoustic/language model learning unit 102 sets the parameters of each statistical model using the learning-labeled speech data 101. The problem of recognizing speech data can be reduced to, for example, a posterior probability maximization search problem. In this search problem, a solution is obtained based on an acoustic model and a language model learned from a large amount of learning data. The parameters of the acoustic model and the language model are estimated using the learning-labeled speech data 101. For example, an HMM (Hidden Markov Model) may be adopted as the acoustic model, and an N-gram model may be adopted as the language model. Since the detailed methods of speech recognition, of constructing acoustic and language models, and of estimating their parameters are well-known techniques, description thereof is omitted. They are described, for example, in Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, “Speech Recognition System”, Ohmsha, 2001.

The acoustic model (storage unit) 103 stores the parameters of a statistical model that expresses voice features (for example, the features of the sound “A”). The language model (storage unit) 104 stores the parameters of a statistical model that expresses language features (features of the connections between words: for example, that the word “ha” tends to follow the word “dinner”, or that the word “dinner” tends to be connected with “eating”).

The search target data (storage unit) 105 stores the voice data to be searched, such as audio extracted from TV, conference audio, and audio recorded on a telephone line (for example, call recordings). The audio data may be recorded in a plurality of files by type, may be recorded in a plurality of channels, and may be given metadata information such as a speaker identification ID.

The indexing/speech information extraction unit 106 detects utterance sections in the search target data 105 and performs subword recognition using the acoustic model 103 and the language model 104. It then generates index/speech information data that includes the subword recognition result, an N-gram index based on the subwords, and other information, and stores it in the index/speech information data (storage unit) 107.

The dialogue order analysis unit 108 reads the utterance section information, the audio file channel information, and the metadata information detected by the indexing/speech information extraction unit 106 from the index/speech information data (storage unit) 107, generates dialogue order data using this information, and stores it in the dialogue order data (storage unit) 109. More specifically, with reference to the metadata, it identifies which person's utterance data appears after the utterance of a specific person and associates the index voice data with the dialogue order information. For example, if the stored data is call recording data, the two directions of a call may be recorded in different channels of the same audio file, or the conversations of multiple speakers may be recorded in separate files that are linked by metadata. Here, a set of files in which conversations on the same time are recorded is first obtained based on the channel information and the metadata information. The processing up to this point is the preprocessing part of the speech data retrieval apparatus.

The keyword input unit 110 receives a search keyword input by the user, converts it to a subword string if necessary, and outputs the converted subword string to the candidate position evaluation unit 112.

The related information input unit 111 receives and analyzes data input by the user (related words and related information of the search keyword), and outputs various parameters used in the search, such as related keywords, silent section information, and weights, to the candidate position evaluation unit 112. For example, when the input search keyword is “Shinkansen”, the related words can include station names, departure times, route names, and the like, and the related information can include silent section length information and utterance length information (for example, information that the utterance time of the customer is at least twice the utterance time of the operator).

Using the subword string of the search keyword output from the keyword input unit 110, the related keywords, silent section information, and their parameters output from the related information input unit 111 (hereinafter referred to as related information), and the index/voice information data 107, the candidate position evaluation unit 112 lists the parts of the search target data 105 where the keyword is likely to have been spoken and calculates the distance (score) between each part and the subword string of the keyword.

The search result integration unit 113 sorts the search candidates output by the candidate position evaluation unit 112 based on the score and outputs them to the search result display unit 114 as the search result. A well-known sorting algorithm such as quick sort or radix sort can be used. The sorted search result includes, for each search candidate, the file name and time at which it is determined to have been uttered, and its score. This search result is sent to the search result display unit 114, but it may instead be sent to another application. The search result display unit 114 arranges the file name, time, score, and the like of each search candidate and transmits the search result output by the search result integration unit 113 to the output device. The processing up to this point is the search processing part of the voice data search apparatus 1.

The search result display unit 114 converts the search results, in descending order of score, into the display format of the display and displays them.

In the present embodiment, the voice data search device 1 has been described as a single device, but may be configured by a system including a terminal (browser) and a computer (server). In this case, for example, the terminal (browser) executes processing of the keyword input unit 110 and the search result display unit 114, and the computer (server) executes processing of other processing units.

In this embodiment, the search target data 105, the learning-labeled voice data 101, the acoustic model 103, the language model 104, the index/voice information data 107, and the dialogue order data 109 have been described as being stored and generated in the same apparatus. However, the computer that performs this preprocessing and the computer that performs the processing from the keyword input unit 110 to the search result display unit 114 may be configured separately. For example, the search target data 105 may be stored in external storage, and the index/voice information data 107, the dialogue order data 109, the acoustic model 103, and the language model 104 may be created in advance by another computer and copied to the computer that executes the search processing.

<Audio data registration process>
FIG. 2 is a flowchart for explaining audio data registration processing executed by the indexing / audio information extracting unit 106 in the present embodiment of the present invention.

First, the indexing/voice information extraction unit 106 selects all voices (data for each channel; in the case of telephone conversation data, for example, the channels are an uplink ch (customer utterances) and a downlink ch (operator utterances)) (step 201), and divides the audio data of the plurality of files of the search target data 105 into appropriate lengths (step 202). For example, when the time during which the audio power is equal to or less than a predetermined threshold θp continues for a predetermined threshold θt or more, the audio data may be divided at that position. FIG. 3 shows the audio data divided in this way. In FIG. 3, information indicating the original file, the start time (301), and the end time (302) of the divided section is given to each audio section. Various other methods of dividing voice data are widely known, such as a method using the number of zero crossings, a method using a GMM (Gaussian Mixture Model), and a method using a voice recognition technique. In the present embodiment, any of these methods may be used. Although only the speech section information is extracted here, voice information such as emotion and speech speed may also be extracted. Since these can be realized by combining known techniques, details are omitted.
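The power-threshold division of step 202 can be sketched as follows. This is only a minimal illustration, assuming that the audio has already been reduced to one power value per frame and that θt is expressed as a number of frames; the function name `split_on_silence` and the frame-based representation are hypothetical and do not appear in the specification.

```python
from typing import List, Tuple


def split_on_silence(power: List[float], theta_p: float, theta_t: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges (end exclusive) of speech sections.

    A run of frames whose power is <= theta_p and that lasts at least
    theta_t frames is treated as a dividing silence (assumed convention).
    """
    sections = []
    start = None      # first frame of the section currently being built
    silent_run = 0    # length of the current low-power run
    for i, p in enumerate(power):
        if p <= theta_p:
            silent_run += 1
            # Close the open section once the silence is long enough.
            if start is not None and silent_run >= theta_t:
                sections.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        # Close the last section, dropping any short trailing silence.
        sections.append((start, len(power) - silent_run))
    return sections


frames = [0, 0, 5, 6, 0, 7, 0, 0, 0, 8, 9, 0]
print(split_on_silence(frames, theta_p=1, theta_t=3))
```

In this sketch, a low-power run shorter than θt (the single quiet frame between 6 and 7) stays inside the section, which matches the condition that only sufficiently long silences divide the data.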

Subsequently, the indexing/speech information extraction unit 106 performs subword conversion processing on all speech sections (step 203). Specifically, the audio data is converted into subword units. Next, the converted subword string (subword recognition result), the times corresponding to the subword N-grams, the speech section information, other voice information (such as the voice time length), and metadata (speaker ID, operator ID, date, customer telephone number information, channel information of each speaker, and the like) are stored in the index/voice information data 107 (step 204). Note that the audio data registration process needs to be performed only once, at the time of initial operation. When this voice data registration process is completed, keyword search becomes possible. Here, only the so-called 1-best recognition result is stored in the index table, but a plurality of speech recognition results may be output in the N-best format or the network format.

<Index and utterance related information>
FIG. 4 is a diagram illustrating a configuration example of information stored in the index / audio information data (storage unit) 107.

The index / speech information data has ID 401, file name-ch 402, N-gram index 403, subword recognition result 404, speech information 405, and other metadata 406 as configuration items.

ID 401 is the management number of the database and indicates the ID of the audio file.

File name-ch 402 is an audio file name and channel number. For example, xxx.wav 0ch indicates the file name of the operator's utterances and the channel number on which they were uttered. Similarly, xxx.wav 1ch indicates the file name of the customer's utterances and the channel number on which they were uttered.

The N-gram index 403 is a column that records, for each subword N-gram of the audio file, pairs of an S-ID (the ID included in the subword recognition result) and the appearance position of the N-gram. From the N-gram index 403 in FIG. 4, it can be seen that the subword N-gram w-En appears at position 0 of the subword string whose S-ID is 0 and at position 11 of the subword string whose S-ID is 5.

The subword recognition result 404 is information including S-ID that is a subword ID and subword string information.

In the voice information 405, an utterance section ID (S-ID) in the voice file, a subword recognition result of the section, and an utterance section and its length are recorded.

Other metadata 406 records various metadata attached to the audio file. In FIG. 4, an operator ID (OID), a customer ID (CID), an utterance date, and an utterance time are stored as other metadata.

<Example of subword N-gram>
FIG. 5 is a diagram illustrating an example of the subword N-gram (when N = 3). Each subword N-gram is composed of a triplet (501) of subwords, and an index is created by shifting the subword one by one from the beginning. The subword N-gram index information and the construction method are well known in the field of normal text search technology, and thus description thereof is omitted here.
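As an illustration of the index described above, the following minimal sketch builds a subword N-gram index that maps each N-gram to pairs of S-ID and appearance position. The function name `build_ngram_index` and the phoneme symbols in the sample data are hypothetical; the specification does not prescribe a concrete data structure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def build_ngram_index(subword_strings: Dict[int, List[str]], n: int = 3) -> Dict[NGram, List[Tuple[int, int]]]:
    """Map each subword N-gram to a list of (S-ID, position) pairs."""
    index = defaultdict(list)
    for s_id, subwords in subword_strings.items():
        # Shift the window one subword at a time from the beginning.
        for pos in range(len(subwords) - n + 1):
            index[tuple(subwords[pos:pos + n])].append((s_id, pos))
    return dict(index)


# Hypothetical recognition results keyed by S-ID.
recognized = {0: ["k", "a", "k", "a"], 5: ["a", "k", "a"]}
index = build_ngram_index(recognized)
print(index[("a", "k", "a")])
```

Storing the position together with the S-ID lets a later search step jump directly to the place inside a section where a keyword N-gram was recognized.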

<Example of conversation on the same time>
FIG. 6 is a diagram illustrating an example of a conversation on the same time. A number 601 is assigned to each utterance section of each file. Then, based on the time information of each utterance, the utterances existing around a given utterance section of a given file are linked. This may be done by focusing on each utterance section and listing the utterance sections of other audio files or channels that fall within an appropriate time range. For example, it can be seen that utterance sections 2 and 3 of xxx.wav ch1 exist in the vicinity (before and after) of utterance section 0 of xxx.wav ch0.

<Configuration example of conversation order data>
FIG. 7 is a diagram illustrating a configuration example of the dialogue order data 109 constructed by the dialogue order analysis unit 108.

The dialogue order data includes a registration ID 701, an original file name-ch 702, an utterance section ID 703, and a related dialogue ID 704 as configuration items. FIG. 7 shows, for example, that the utterance section of xxx.wav-ch0 lies between xxx.wav-ch1-2 and xxx.wav-ch1-3.
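The linking of neighboring utterance sections can be sketched as follows. This is a minimal illustration, assuming that each section carries start and end times in seconds and that "an appropriate time range" is modeled by a fixed `margin`; the `Section` type, the `link_neighbors` function, and the sample times are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Section(NamedTuple):
    file_ch: str   # original file name and channel, e.g. "xxx.wav-ch0"
    s_id: int      # utterance section ID
    start: float   # start time in seconds
    end: float     # end time in seconds


def link_neighbors(sections: List[Section], margin: float) -> Dict[str, List[str]]:
    """For each section, list sections of other channels within `margin` seconds."""
    links = {}
    for a in sections:
        links[f"{a.file_ch}-{a.s_id}"] = [
            f"{b.file_ch}-{b.s_id}"
            for b in sections
            if b.file_ch != a.file_ch
            and b.end >= a.start - margin
            and b.start <= a.end + margin
        ]
    return links


sections = [
    Section("xxx.wav-ch0", 0, 10.0, 12.0),
    Section("xxx.wav-ch1", 2, 8.0, 9.5),
    Section("xxx.wav-ch1", 3, 12.5, 14.0),
    Section("xxx.wav-ch1", 4, 20.0, 22.0),
]
print(link_neighbors(sections, margin=1.0)["xxx.wav-ch0-0"])
```

The resulting mapping plays the role of the related dialogue ID column: given a candidate section, the surrounding sections of the other speaker can be looked up directly.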

When the construction process of the voice data index / voice information data 107 and the dialogue order data 109 is completed, the system is ready to accept keywords by the user.

<Keyword input process>
FIG. 8 is a flowchart for explaining the keyword input process.

The keyword input unit 110 receives data input by the user. Data may be input by an input device such as a keyboard or a touch pad, or a keyword input by another computer may be received via a network. Alternatively, it may be input by voice and converted into a keyword character string using voice recognition.

Next, the keyword input unit 110 determines whether or not the input data is a subword string (step 801), and if it is not a subword string, converts the keyword into a subword (phoneme) string and outputs it (step 802).

<Subword conversion example>
FIG. 9 is a diagram illustrating an example of conversion into subwords. The input keyword is converted into a subword string 901 based on rules and data prepared in advance. Since converting an input word into a subword string can be done with known techniques, the details are omitted here.

<Related information input processing>
FIG. 10 is a flowchart for explaining the related information input process.

The related information input unit 111 first receives data input by the user, and determines whether or not the related word information for search is included (step 1001).

If the related word information for search is included, the related information input unit 111 further determines whether or not it is a sub-word string (step 1002).

If the related word information for search is not a subword string, the related information input unit 111 converts the corresponding word into a subword string (step 1003).

Then, the related information input unit 111 converts the various parameters input by the user, such as the use flag of the silent section information, the average and variance of the silent section length, and the weights, into a search format and outputs them (step 1004).

<Candidate position evaluation process>
FIG. 11 is a flowchart for explaining processing by the candidate position evaluation unit 112 according to this embodiment of the present invention. In the candidate position evaluation process, the candidate position evaluation unit 112 corrects the score of each search candidate obtained with the input keyword by using the speech information around the candidate, based on the dialogue order data, the voice information data, and the information from the related information input unit. For this purpose, the candidate position evaluation unit 112 accesses the dialogue order data using the S-ID and the file name of the candidate section, and acquires the file names and section IDs of the utterance sections around the candidate section. Using these file names and section IDs, information such as the subword recognition sequence and the voice length of each section can be acquired from the voice information data. Hereinafter, the candidate position evaluation process will be described in more detail with reference to FIG. 11.

The candidate position evaluation unit 112 receives the keyword subword string from the keyword input unit 110 and uses the index/voice information data 107 to enumerate candidates for the locations where the keyword is uttered in the voice data (search result candidates) (step 1101). For example, the keyword subword string can be divided into N-grams, allowing overlap, and the corresponding locations in the N-gram index of the index table can be taken as candidates. Since the N-gram index is a search method widely used in the field of document search, description thereof is omitted here.
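Step 1101 can be sketched as a lookup of every overlapping keyword N-gram in the index. This is a minimal illustration, assuming an index that maps N-grams to (S-ID, position) pairs; the function name `enumerate_candidates` and the rule for estimating the keyword start position are hypothetical.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def enumerate_candidates(
    keyword: List[str],
    index: Dict[NGram, List[Tuple[int, int]]],
    n: int = 3,
) -> Set[Tuple[int, int]]:
    """Return (S-ID, estimated keyword start position) pairs hit by any keyword N-gram."""
    candidates = set()
    # Divide the keyword into overlapping N-grams.
    for offset in range(len(keyword) - n + 1):
        gram = tuple(keyword[offset:offset + n])
        for s_id, pos in index.get(gram, []):
            # Shift back by the N-gram's offset to estimate where the keyword starts.
            candidates.add((s_id, max(pos - offset, 0)))
    return candidates


# Hypothetical index contents.
index = {("a", "k", "a"): [(0, 1), (5, 0)], ("k", "a", "k"): [(0, 0)]}
print(sorted(enumerate_candidates(["k", "a", "k", "a"], index)))
```

Because a hit on any single N-gram is enough, this step tolerates recognition errors in part of the keyword; the precise distance is left to the next step.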

Next, the candidate position evaluation unit 112 calculates the distance between each search candidate and the subword string of the search keyword (step 1102). The distance is calculated by, for example, the end point-free Viterbi algorithm or DP matching, and the section corresponding to the keyword subword string with the minimum distance is obtained at the same time. Since the end point-free Viterbi algorithm is a known algorithm, description thereof is omitted here. When there are a plurality of recognition results in one section, as in the N-best recognition result, the distance between phonemes is calculated for each recognition result, and the weighted sum is used as the score of the section. In this way, a score based on the distance is given to each search candidate.
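The following minimal sketch shows DP matching with free end points: the keyword must be matched in full, but the match may start and end anywhere in the recognized subword string. Unit costs for substitution, insertion, and deletion are an assumption made for illustration; the specification does not state the cost between subwords, and the function name `endpoint_free_distance` is hypothetical.

```python
from typing import List, Tuple


def endpoint_free_distance(keyword: List[str], recognized: List[str]) -> Tuple[int, int]:
    """Return (minimum edit distance, end position) of the best keyword match.

    Row 0 is all zeros so that a match may start at any position, and the
    minimum over the last row lets it end at any position.
    """
    prev = [0] * (len(recognized) + 1)
    for i in range(1, len(keyword) + 1):
        cur = [i] + [0] * len(recognized)
        for j in range(1, len(recognized) + 1):
            substitute = prev[j - 1] + (keyword[i - 1] != recognized[j - 1])
            cur[j] = min(substitute, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


print(endpoint_free_distance(list("abc"), list("xxabcxx")))
print(endpoint_free_distance(list("abc"), list("xxabdxx")))
```

A smaller distance means a closer match; how the distance is turned into the score used for ranking is left open here, as it is in the text.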

The candidate position evaluation unit 112 determines whether or not a related word is input to the data from the related information input unit 111 (step 1103). If a related word has been input, the process proceeds to step 1104. If no related word is input, the process proceeds to step 1105.

In step 1104, the candidate position evaluation unit 112 calculates the distance between the subword recognition sequence of the surrounding utterances and the subword string of the related word by the same method as in step 1102.

Then, the candidate position evaluation unit 112 determines whether or not silent section information is included in the input data from the related information input unit 111 (specifically, whether or not a flag indicating the use of silent section information is set in the input data) (step 1105). If silent section information is included, the process proceeds to step 1106. If not, the process proceeds to step 1107.

In step 1106, the candidate position evaluation unit 112 calculates a score based on the silent section information by inputting the silent section lengths of the peripheral utterances of each search candidate obtained in step 1102 into a Gaussian distribution type score function that has the input average silent section length and variance as parameters. Since such a score function is a well-known technique in the field of machine learning, details are omitted. Alternatively, the difference between the silent section length input by the user and the silent section length around each search candidate can be normalized to obtain a score. That is, in step 1106, the score is calculated by obtaining a relative relationship regarding the length of the silent section around the speech section in which each search candidate is included.
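A Gaussian distribution type score function of the kind mentioned for step 1106 might look like the following minimal sketch. It assumes an unnormalized Gaussian so that the score is 1.0 when the observed silence equals the expected mean; the function name and the choice of seconds as the unit are assumptions, not details given in the embodiment.

```python
import math


def gaussian_silence_score(observed_len, mean_len, variance):
    """Score in (0, 1] that peaks when observed_len equals mean_len.

    observed_len: silent section length around a search candidate (seconds, assumed)
    mean_len, variance: parameters supplied as related information
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    return math.exp(-((observed_len - mean_len) ** 2) / (2.0 * variance))


if __name__ == "__main__":
    # A 3.0 s silence scored against an expected 2.0 s with variance 0.5.
    print(gaussian_silence_score(3.0, 2.0, 0.5))
```

Dropping the normalization constant keeps scores comparable across different variances, which may or may not be what a given deployment wants.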

Finally, the candidate position evaluation unit 112 corrects the search candidate score calculated in step 1102 using the scores calculated in step 1104 and / or step 1106 (step 1107). Specifically, the corrected score of the candidate section is calculated according to the following Formula 1.
[Formula 1]
S = W_{key} \cdot S_{key} + \sum_{i} W_{i} \cdot S_{i}

As shown in Formula 1, the corrected score S is obtained by adding the product of the weight W_key, which is obtained from the data from the related information input unit 111, and the keyword score S_key to the products of the weight W_i and the score S_i of each related word or silent section length. When the speech information data 107 includes the recognition reliability of the subword, the score can be corrected by further multiplying the keyword score by the reliability.
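The weighted combination described for step 1107 can be illustrated with a short sketch. It assumes that a larger score means a better match and that each candidate carries a list of (weight, score) pairs for its related words or silent section length; the names `corrected_score` and `rank_candidates` are hypothetical.

```python
def corrected_score(s_key, w_key, related_scores):
    """Weighted keyword score plus the weighted related-information scores.

    related_scores: iterable of (w_i, s_i) pairs.
    """
    return w_key * s_key + sum(w_i * s_i for w_i, s_i in related_scores)


def rank_candidates(candidates):
    """candidates: list of (candidate_id, s_key, w_key, related_scores).

    Returns (candidate_id, corrected score) pairs, highest score first.
    """
    scored = [(cid, corrected_score(s, w, rel)) for cid, s, w, rel in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Candidate "b" has a weaker keyword score but supporting related information.
    print(rank_candidates([
        ("a", 0.9, 1.0, []),
        ("b", 0.8, 1.0, [(0.5, 0.6)]),
    ]))
```

In this toy input the candidate with supporting related information overtakes the one with the higher keyword score alone, which is the effect the correction is meant to achieve.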

Here, the candidate position evaluation process has been described using related words and silent section information as examples of the related information. In addition to these, the correction score may be calculated using information on utterance length, that is, the ratio between the length of the target speech segment and the lengths of the speech segments of other speakers before and after it (for example, information that the utterance time of the customer is at least twice the utterance time of the operator).

<Example of search result display>
FIG. 12 is a diagram illustrating a configuration example of a display format of a search result displayed by the search result display unit 114.

The search result display screen includes a file name 1201 in which it is determined that the search keyword is uttered, a time 1202 in the file, a score 1203, and an audio file playback button 1204 as display configuration items.

When the sound file playback button 1204 is pressed on the search result display screen, the sound around the search position is played back, and the user can confirm the contents of the search result by listening to the actual sound.

Note that, instead of being displayed, the search result can be formatted for another application and transmitted to another computer or the like.

As described above, according to the first embodiment, the score can be corrected using the status of the speech sections of other speakers before and after the speech section of a keyword search result, for example, related keywords and silent section lengths. The search accuracy of voice data search can thereby be improved.

(2) Second Embodiment
The second embodiment of the present invention relates to a voice data search device 2 that automatically generates the related information input by the user in the first embodiment, so that voice data can be searched without imposing a burden on the user.

<Configuration of voice data retrieval device>
FIG. 13 is a diagram illustrating a configuration example showing the voice data search device 2 according to the second embodiment. In the voice data search device 2 of FIG. 13, the blocks given the same reference numerals shown in FIG. 1 already described have the same functions, and their descriptions are omitted.

The voice data search device 2 according to the second embodiment is the voice data search device 1 according to the first embodiment to which learning data labeled speech 1301, text data 1302, a related information data construction unit 1303, related information data 1304, and a related information selection unit 1305 are added.

The learning data labeled speech (storage unit) 1301 stores labeled speech data for learning generated from actual dialogue data. In addition to the transcription text of the speech signal, it may include a file list related to the other speaker's voice data, dialogue order data, and the like.

The text data (storage unit) 1302 stores related word information and data prepared for confirming the co-occurrence degree. It includes, for example, a word dictionary, a thesaurus, an emotion word dictionary, web text data such as Wikipedia, general wording, a product name list, a model number list, and the like.

The related information data construction unit 1303 uses the learning data labeled speech 1301 and the text data 1302 to analyze the relationship between the co-occurrence words or words and the silence interval length, and stores the information as the related information data 1304.

<Related information data construction process>
FIG. 14 is a flowchart for explaining the related information data construction process. Here, the processing of steps 1401 to 1403 is executed for all words included in the text data.

The related information data construction unit 1303 assigns attribute values of each word, for example, information such as anger, emotion, product name, part of speech, using the emotion word dictionary and the product name list (step 1401).

Next, the related information data construction unit 1303 extracts the related words from the text data and the labeled speech data for each word in the word dictionary, and statistically analyzes the co-occurrence degree and weight parameters (step 1402). With respect to related words and their weights, the co-occurrence degree of each word can be automatically acquired by applying an analysis technique such as singular value decomposition or LDA (Latent Dirichlet Allocation) to web information or a thesaurus. The co-occurrence degree is the information obtained by counting the number of times the co-occurrence keyword appears in the utterance sections before and after the target utterance section and calculating its ratio to the total number of occurrences.
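The counting-based co-occurrence degree of step 1402 could be sketched as below. The sketch assumes that the dialogue is available as a list of transcribed utterances in dialogue order, that words are matched by simple substring search, and that the ratio is taken over the number of utterances containing the target word. All of these, along with the function name, are assumptions made for illustration.

```python
def cooccurrence_degree(dialogue, target, related):
    """Fraction of utterances containing `target` whose neighbor contains `related`.

    dialogue: list of utterance transcriptions in dialogue order (assumed input).
    Only the utterances immediately before and after are inspected.
    """
    total = 0
    hits = 0
    for i, utterance in enumerate(dialogue):
        if target not in utterance:
            continue
        total += 1
        neighbors = dialogue[max(i - 1, 0):i] + dialogue[i + 1:i + 2]
        if any(related in neighbor for neighbor in neighbors):
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy dialogue, invented for illustration only.
    dialogue = ["it is broken", "i am sorry", "thanks", "broken again", "please wait"]
    print(cooccurrence_degree(dialogue, "broken", "sorry"))
```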

Finally, for each word in the word dictionary, the related information data construction unit 1303 enumerates all appearance positions of the word from the transcription labels attached to the learning data labeled speech 1301, tabulates the silent section lengths of the other speakers around the utterances that include the word, and calculates statistics such as the average and variance (step 1403). Here, the silent section length may be a value given by manually listening to an audio file, or may be a value automatically detected using a speech section detection technique. Further, the appearance frequency of the silent section length, or the prior probability itself, corresponds to the weight parameter in the score. Furthermore, for a related word of a certain word, the appearance frequency in the surrounding utterance sections can be counted and the appearance probability calculated, and this can be used as a second weight parameter of the related word. Since these can be estimated by a known statistical technique such as maximum likelihood estimation, the details are omitted here.
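The statistics of step 1403 can be illustrated with the maximum likelihood estimates of the mean and variance. This is a minimal sketch assuming the silent section lengths have already been collected as a list of numbers in seconds; the function name is hypothetical.

```python
def silence_statistics(lengths):
    """Maximum likelihood mean and variance of silent section lengths.

    The variance divides by n (not n - 1), which is the maximum likelihood
    estimate under a Gaussian assumption.
    """
    n = len(lengths)
    if n == 0:
        raise ValueError("at least one silent section length is required")
    mean = sum(lengths) / n
    variance = sum((x - mean) ** 2 for x in lengths) / n
    return mean, variance


if __name__ == "__main__":
    print(silence_statistics([1.0, 2.0, 3.0]))
```

The resulting mean and variance are the kind of parameters that a Gaussian distribution type score function takes as input.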

When the amount of data is small, the words are clustered according to the similarity such as their attributes and subword distances, and the appearance frequency and average silence interval length of each class are calculated. For example, various statistics are calculated using a class “anger” instead of the word “do not play” and a class “product” instead of the word “model number 103487”. As another classification method, it is conceivable to construct a database for each subword N-gram of each word.

<Example of related information data>
FIG. 15 is a diagram illustrating a configuration example of related information data stored in the related information data (storage unit) 1304.

The related information data includes a registration ID 1501, a word 1502, a sub word 1503, an attribute 1504, related voice information 1505, and various parameters 1506 as configuration information.

In the related voice information 1505, related words are managed by word IDs. These values are all generated by the related information data construction unit 1303. For example, one registered word (phrase) is associated with the word (phrase) whose ID is “0”. That is, for example, when the customer utters “Do not play”, the operator often utters “I am sorry”, so the latter is registered as the related voice information of the former.

<Related information selection process>
FIG. 16 is a flowchart for explaining processing by the related information data selection unit 1305.

First, when a search keyword is input from the keyword input unit 110, the related information data selection unit 1305 determines whether a word corresponding to the input search keyword exists in the related information data (storage unit) 1304 (step 1601). If a corresponding word exists in the related information data (storage unit) 1304, the process proceeds to step 1602. On the other hand, if it does not exist, the process proceeds to step 1603.

In step 1602, the related information data selection unit 1305 acquires information (related words, silent sections and their parameters) related to the input search keyword from the related information data (storage unit) 1304.

If there is no corresponding word in the related information data (storage unit) 1304, in step 1603 the related information data selection unit 1305 selects a group of similar words using the subword distance and attributes and, from their stored information, predicts and outputs the parameters of the input keyword. In a simple method, the related speech information and parameters of the word having the nearest phoneme distance to the input keyword may be output. For example, if "I'm sorry" is registered in the related information data but the input word is a variant of that phrase that is not itself registered, the related voice information about "I'm sorry" is output. In addition, when a database based on subword N-grams has also been constructed, the input keyword can be decomposed into subword N-grams and the average value of the parameters of the related speech information for each N-gram can be used. Furthermore, when keyword attribute values are also accepted from the keyword input unit, the average values of the various parameters of words having the same attribute can be used. The related speech information and the parameters generated in this way are used by the candidate position evaluation unit 112. If there are a plurality of similar words, they may be displayed on the display unit and selected by the user.
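The selection flow of steps 1601 to 1603, in its simplest form (exact lookup, then fallback to the registered word with the nearest phoneme distance), might be sketched as follows. The database is modeled as a dictionary keyed by subword tuples, the phoneme distance is approximated by a plain edit distance, and the sample entries are invented for illustration; none of these details come from the embodiment.

```python
def edit_distance(a, b):
    """Levenshtein distance between two subword sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def select_related_info(keyword_subwords, related_db):
    """Return the related-information record for the keyword.

    related_db: dict mapping a tuple of subwords to a record (assumed layout).
    """
    key = tuple(keyword_subwords)
    if key in related_db:  # steps 1601 -> 1602: exact match
        return related_db[key]
    if not related_db:
        return None
    # Step 1603 (simple method): use the word with the nearest distance.
    nearest = min(related_db, key=lambda k: edit_distance(key, k))
    return related_db[nearest]


if __name__ == "__main__":
    # Hypothetical entries; the romanized words are toy data.
    db = {
        tuple("sumimasen"): {"related": ["complaint"], "mean_silence": 1.2},
        tuple("arigato"): {"related": ["thanks"], "mean_silence": 0.4},
    }
    print(select_related_info(list("sumimasenn"), db))
```

Averaging parameters over several similar words, or over subword N-grams, would be a natural extension of the single nearest-word fallback shown here.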

As described above, according to the second embodiment, the user can search without feeling a burden by automatically generating the related information input by the user in the first embodiment.

(3) Third Embodiment
The third embodiment relates to a system that can be introduced into a call center by adding a telephone line call recording device to the voice data search device 1.

<Configuration of voice data retrieval device>
FIG. 17 is a diagram illustrating a configuration example of a speech data search system according to the third embodiment. The voice data search system 3 according to the third embodiment corresponds to an example in which the voice data search device 1 according to the first embodiment is applied to a call center.

The voice data retrieval system 3 includes a private branch exchange (PBX) device 1703, a call recording device 1704, a storage device 1719 for storing call management data 1705 and search target data 1706, storage devices 1720 and 1721 for storing data used for the search, and a computer 1722 that includes a CPU 1723 and a main storage device 1724 and performs voice data search. Each device is connected via a telephone line, a network, or a computer bus.

The PBX device 1703 is connected to a customer telephone 1701 (hereinafter referred to as a customer telephone) through a public telephone line network. The PBX device 1703 is connected to the operator's telephone 1702.

The call recording device 1704 has a general-purpose computer configuration such as a CPU, a memory, and a control program. Also, the call recording device 1704 acquires a voice signal based only on the customer's utterance from the PBX device 1703 or the telephone 1702 used by the operator. Further, the call recording device 1704 acquires a voice signal from the telephone 1702 only by the operator's utterance. It is also possible to acquire a voice signal of only the operator's utterance by preparing a headset and a recording device separately. Thereafter, the call recording device 1704 performs A / D conversion on the audio signal only from the customer and the audio signal only from the operator, and converts it into digital data such as WAV format. The conversion to audio data may be performed by real time processing. These search target data 1706 are stored in the storage device 1719 together with the call management data 1705.

FIG. 18 is a diagram showing an example of a format of audio data stored in the storage device 1719. The audio file includes operator ID 1801, customer speaker ID 1802, call time 1803, call time length 1804, and 16-bit signed binary waveform data 1805 as information. These audio files are transferred to a storage device and stored as call management data 1705 and search target data 1706. The call duration, customer speaker ID, and operator ID can be acquired from the PBX device 1703 or the like.

Returning to FIG. 17, the storage device 1720 stores at least a language model 1707, an acoustic model 1708, index / voice information data 1709, and dialogue order data 1710 as data used in the search.

Further, the storage device 1721 stores learning voice data 1711 (corresponding to the learning data labeled voice 101 in FIG. 1). Here, the language model 1707 and the acoustic model 1708 may be calculated by another computer using the learning speech data 1711.

The computer 1722 executes the central processing of the voice data search system 3 according to the third embodiment. The memory of the computer 1722 stores an indexing module 1715 including the functions of the indexing / speech information extraction unit 106 and the dialog order analysis unit 108; a search module 1716 including the functions of the keyword input unit 110, the related information input unit 111, the candidate position evaluation unit 112, the search result integration unit 113, and the search result display unit 114; and an acoustic / language model learning module 1717 including the function of the acoustic model / language model learning unit 102. The function of each module is loaded into the memory 1724 as appropriate and realized under the control of the CPU 1723. The voice data search system 3 operates appropriately by carrying out the procedure described in the first embodiment.

The index / voice information data 1709 and the dialogue order data 1710 may be updated by accessing the search target data 1706 at regular intervals, indexing only the difference data, and adding the result to the index / voice information data 1709 (index table).
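Indexing only the difference data could be sketched as follows. The sketch assumes that files are identified by name, that the set of already indexed names is kept by the caller, and that `index_fn` stands in for the indexing module; all names are hypothetical and the scheduling of the periodic access is left out.

```python
def index_new_files(all_files, indexed_files, index_fn):
    """Index only the files that have not been indexed yet.

    all_files: iterable of file names currently in the search target data
    indexed_files: mutable set of file names already indexed (updated in place)
    index_fn: callable that indexes one file (stand-in for the indexing module)
    Returns the list of newly indexed file names.
    """
    new_files = sorted(set(all_files) - set(indexed_files))
    for name in new_files:
        index_fn(name)
        indexed_files.add(name)
    return new_files


if __name__ == "__main__":
    indexed = {"call_001.wav"}
    processed = []
    print(index_new_files(["call_001.wav", "call_002.wav"], indexed, processed.append))
```

Calling the function again with the same file list indexes nothing, so it can be invoked at each regular interval without reprocessing existing data.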

As described above, the voice data search system 3 that can introduce the voice data search apparatus 1 according to the first embodiment into the call center can be constructed.

(4) Fourth Embodiment
The fourth embodiment relates to a configuration in which the voice data search device 1 is incorporated in a content cloud system. First, the outline of the content cloud system will be described, and then the incorporation into the content cloud system based on the module division of the voice data search device 1 will be described.

<Configuration of content cloud system>
FIG. 19 is a diagram showing a schematic configuration of a general content cloud system. The content cloud system operates on a general computer including one or more CPUs, memories, and storage devices, and the system itself is composed of various modules. Specifically, it includes an ETL (Extract Transform Load) module 1903, content storage 1904, a search engine (module) 1905, a metadata server (module) 1906, and a multimedia server (module) 1907. Each module may be executed by an independent computer. In this case, each storage and module are connected by a network or the like, and the system is realized by distributed processing in which data is communicated among them. In the content cloud system, the application program 1908 transmits a request to the search engine or the like via a network or the like. In response, the content cloud system transmits information corresponding to the request to the application 1908.

The content cloud system accepts data in any format, such as audio data 1901, medical data 1901, and / or mail data 1901, as input. The various data are, for example, call center call voice, mail data, document data, and the like, and may be structured or unstructured. Data input to the content cloud system is temporarily stored in the storage 1902.

The ETL 1903 in the content cloud system monitors the storage 1902. When the accumulation of various data 1901 in the storage 1902 is completed, the ETL 1903 operates the information extraction processing module corresponding to the data and archives the extracted information (metadata) in the content storage 1904. The ETL 1903 includes, for example, a text index module, an image recognition module, and the like. Examples of metadata include time, an N-gram index, an image recognition result (object name), an image feature amount and its related words, and speech recognition results. Any program that extracts some information (metadata) can be used as these information extraction modules, and publicly known techniques can be adopted. Therefore, description of the various information extraction modules is omitted here. If necessary, the metadata may be compressed by a data compression algorithm. Further, after information is extracted by the various modules, the data file name, data registration date, original data type, metadata text information, and the like may be registered in an RDB (Relational Data Base).

The content storage 1904 stores the information extracted by the ETL 1903 and the pre-processing data 1901 temporarily stored in the storage 1902.

When there is a request from the application program 1908, the search engine 1905 searches the text based on the index created by the ETL 1903, for example, if it is a text search, and transmits the search result to the application program 1908. Here, a publicly known technique can be applied to the search engine and its algorithm. The search engine may include a module that searches not only text but also data such as images and sounds.

The metadata server 1906 manages the metadata stored in the RDB. For example, if the ETL 1903 has registered the file name of the data, the data registration date, the type of the original data, metadata text information, and the like in the RDB, then upon a request from the application 1908 the metadata server 1906 sends the corresponding information to the application 1908.

In the multimedia server 1907, pieces of information between metadata extracted by the ETL 1903 are associated with each other, and the metadata is structured in a graph format and stored. As an example of the association, the original voice file, image data, related words, and the like are expressed in a network format with respect to the voice recognition result “apple” stored in the content storage 1904. When the multimedia server 1907 receives a request from the application 1908, the multimedia server 1907 transmits meta information corresponding to the request to the application 1908. For example, when there is a request for “apple”, related meta information such as an image of an apple, an average market price, and an artist's song name is provided based on the constructed graph structure.

<Voice data search system>
FIG. 20 is a diagram showing a schematic configuration of a voice data search system realized by incorporating the function of the voice data search device 1 into the content cloud system.

Various functions of the speech data retrieval apparatus 1 are modularized into an indexing module 2001 (indexing / speech information extraction unit 106 and dialogue order analysis unit 108) and a search module 2002 (keyword input unit 110, related information input unit 111, candidate position evaluation unit 112, and search result integration unit 113).

Also, the acoustic model 103 and the language model 104 are created in advance by another computer and copied to the content cloud system. At this time, the indexing module 2001 can be registered in the ETL 1903, and the search module 2002 can be registered in the multimedia server 1907.

When the audio data is input, the indexing module 2001 is called from the ETL 1903, performs an indexing process on the audio data, and outputs the index / audio information data to the content storage.

When the search module 2002 receives a keyword from the application program 1908 or the multimedia server control program (not shown), the search module 2002 uses the index / voice information data 2003 (corresponding to 107) and returns a list of the file names, times, and scores at which the keyword is spoken. The processing of the indexing module 2001 and the search module 2002 is only a part of the processing of the voice data search apparatus 1 and will not be described here.

Also, the search module 2002 can be set in the search engine 1905. In this case, when a request is made from the application program 1908 to the search engine 1905, the search module 2002 transmits the file name, time, and score at which the keyword is spoken in the voice data to the search engine 1905.

As described above, the voice data search device 1 described in the first embodiment can be incorporated into a content cloud system.

(5) Summary
(i) The functions or configurations according to the first to fourth embodiments can be appropriately combined with each other, and the embodiments are not independent.

Therefore, for example, it is needless to say that the voice data search device 2 according to the second embodiment can be introduced into the system according to the third embodiment or incorporated into the content cloud system according to the fourth embodiment.

(ii) In the embodiment of the present invention, dialogue order data indicating the utterance order of the voice segment data of the search target data is generated based on the voice file channel information and the voice metadata information included in the index voice information data. When a search keyword is actually input by the user, the score value (first score value) between the search keyword and the voice section data included in the index voice information data is calculated, and a plurality of search result candidates are acquired. In addition, the voice segment data around each of the plurality of search result candidates is specified based on the dialogue order data. Furthermore, related information related to the search keyword is acquired (either input by the user or acquired from the related information data storage unit (DB)), and the score value (second score value) between the related information and the speech section data around each search result candidate is calculated. The first score value is corrected using the second score value, and the plurality of search result candidates are sorted and output using the corrected score value. As described above, the score value between the search keyword and the search target data is corrected with the score value based on the related information, so that the search accuracy can be improved.
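The use of dialogue order data to find the voice segment data around a search result candidate could be sketched as below. The sketch assumes that each voice segment carries a channel label and start and end times taken from the channel information and metadata, and that the dialogue order is simply the segments sorted by start time. The `Segment` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: str
    channel: str  # e.g. "customer" or "operator" (assumed labels)
    start: float  # seconds from the start of the call
    end: float


def build_dialogue_order(segments):
    """Order the voice segments of all channels by utterance time."""
    return sorted(segments, key=lambda s: (s.start, s.end))


def adjacent_other_speaker(order, seg_id):
    """Return the nearest preceding and following segments on another channel.

    Either element is None when no such segment exists.
    Raises StopIteration if seg_id is not in the dialogue order.
    """
    idx = next(i for i, s in enumerate(order) if s.seg_id == seg_id)
    target = order[idx]
    before = next((s for s in reversed(order[:idx]) if s.channel != target.channel), None)
    after = next((s for s in order[idx + 1:] if s.channel != target.channel), None)
    return before, after


if __name__ == "__main__":
    order = build_dialogue_order([
        Segment("o1", "operator", 2.5, 4.0),
        Segment("c1", "customer", 0.0, 2.0),
        Segment("c2", "customer", 4.2, 6.0),
    ])
    print(adjacent_other_speaker(order, "c2"))
```

The segments returned here are the ones against which a second score value would be computed, and the gaps between their times and the target segment give the surrounding silent section lengths.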

Here, as related information, not only related words (words having a high co-occurrence degree) related to the search keyword, but also information on the silent section length and information on the ratio between the length of the speech section to be searched and the lengths of the speech sections of other speakers before and after it can be used. By correcting the score value using such information, the search accuracy can be improved even when the user does not know a related word having a high co-occurrence degree with the input search keyword. When the score value correction is performed using the silent section length information, the second score value is calculated by obtaining the relative relationship of the silent section lengths around the speech section (specified by the dialogue order data) that includes each of the search candidates. In this way, score value correction other than that based on related words with a high co-occurrence degree can be realized.

In addition to the case where the user inputs related information, for each of a plurality of words, the attribute of the word, the co-occurrence word, the co-occurrence degree of the co-occurrence word, and the silent section information of the co-occurrence word, A related information database to be stored may be provided. In this case, related information related to the search keyword is acquired from the related information database. As a result, it is possible to save the user from inputting related information, and to improve the search accuracy. More specifically, when a search keyword input by the user is registered in the related information database, a co-occurrence word corresponding to the search keyword is acquired. On the other hand, when the search keyword is not registered in the related information database, a word similar to the search keyword is selected using the phoneme distance information, and a co-occurrence word corresponding to the similar word is acquired. By doing in this way, even if the search keyword itself is not registered in the related information database, the search candidate score value can be corrected, so that the search accuracy can be improved.

(iii) The present invention can also be realized by software program codes that implement the functions of the embodiments. In this case, a storage medium in which the program code is recorded is provided to the system or apparatus, and the computer (or CPU or MPU) of the system or apparatus reads the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the program code itself and the storage medium storing the program code constitute the present invention. As a storage medium for supplying such program code, for example, a flexible disk, CD-ROM, DVD-ROM, hard disk, optical disk, magneto-optical disk, CD-R, magnetic tape, nonvolatile memory card, ROM, or the like is used.

Also, based on the instructions of the program code, an OS (operating system) running on the computer may perform part or all of the actual processing, and the functions of the above-described embodiments may be realized by that processing. Further, after the program code read from the storage medium is written in the memory on the computer, the CPU of the computer or the like may perform part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments may be realized by that processing.

Further, the program code of the software that realizes the functions of the embodiments may be distributed via a network and stored in a storage means such as a hard disk or a memory of the system or apparatus, or in a storage medium such as a CD-RW or CD-R, and the computer (or CPU or MPU) of the system or apparatus may read and execute the program code stored in the storage means or the storage medium when used.

Finally, it should be understood that the processes and techniques described herein are not inherently related to any particular equipment, and can be implemented by any suitable combination of components. In addition, various types of general-purpose devices can be used in accordance with the teachings described herein. It may also prove useful to build a dedicated device to perform the method steps described herein. Various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in the embodiments. For example, some components may be deleted from all the components shown in the embodiments. Furthermore, constituent elements of different embodiments may be appropriately combined. Although the present invention has been described with reference to specific examples, these are in all respects illustrative rather than restrictive. Those skilled in the art will appreciate that there are numerous combinations of hardware, software, and firmware that are suitable for implementing the present invention. For example, the described software can be implemented in a wide range of programming or scripting languages such as assembler, C/C++, Perl, shell, PHP, and Java (registered trademark).

Furthermore, in the above-described embodiment, control lines and information lines are those that are considered necessary for the explanation, and not all control lines and information lines on the product are necessarily shown. All the components may be connected to each other.

101 Learning data labeled speech
102 Acoustic model / language model learning unit
103 Acoustic model
104 Language model
105 Search target data
106 Indexing / speech information extraction unit
107 Index / speech information data
108 Dialogue order analysis unit
109 Dialogue order data
110 Keyword input unit
111 Related information input unit
112 Candidate position evaluation unit
113 Search result integration unit
114 Search result display unit

Claims

A speech data retrieval system for retrieving speech data,
A storage device for storing search target data;
A memory for storing a program for realizing the voice data search process;
A processor that reads the program from the memory and executes the voice data search process according to the program,
The processor is
Processing for performing subword recognition on speech segment data of the search target data using an acoustic model and a language model generated from learning speech data, and generating index audio information data including the speech segment data, silence segment information, audio file channel information indicating the channel on which the speech segment data was uttered, and audio metadata information;
Processing for generating dialogue order data indicating the utterance order of the voice segment data based on the voice file channel information and voice metadata information;
A process of receiving a search keyword, calculating a first score value that is a score value of the search keyword and voice segment data included in the index voice information data, and acquiring a plurality of search result candidates;
A process for identifying the voice section data around each of the plurality of search result candidates based on the dialogue order data;
A process of acquiring related information related to the search keyword and calculating a second score value that is a score value of the related information and the surrounding speech segment data;
A process of correcting the first score value using the second score value and outputting a corrected score value;
A process of sorting and outputting the plurality of search result candidates using the corrected score value;
A speech data retrieval system characterized by
The voice data search system according to claim 1,
wherein the related information includes at least one of: a related word related to the search keyword; information on a silent section length; and information on the ratio between the length of the voice section to be searched and the lengths of the voice sections of other speakers before and after that voice section.
The voice data search system according to claim 2,
further comprising a related information database that stores, for each of a plurality of words, a word attribute, co-occurrence words, the co-occurrence degree of each co-occurrence word, and information on the silent section of each co-occurrence word,
wherein the processor acquires the related information related to the search keyword from the related information database.
The voice data search system according to claim 3,
wherein, when the search keyword is registered in the related information database, the processor acquires a co-occurrence word corresponding to the search keyword, and, when the search keyword is not registered in the related information database, the processor selects a word similar to the search keyword using phoneme distance information and acquires a co-occurrence word corresponding to the similar word.
The voice data search system according to claim 2,
wherein, when calculating the second score value using the information on the silent section length, the processor calculates the second score value by obtaining the relative relationship between the lengths of the silent sections around the voice section that includes each of the plurality of search result candidates.
The voice data search system according to claim 1,
further comprising a private branch exchange and at least one operator telephone connected directly to the private branch exchange,
wherein the private branch exchange is connected to a public telephone network, and
customer voice data acquired from a plurality of customer telephones connected via the public telephone network and voice data acquired from the operator telephone are stored in the storage device as the search target data.
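
For illustration only, and not as part of the claims, the score correction and sorting recited in claim 1 can be sketched in Python as follows. The names `Segment`, `Candidate`, `dialogue_order`, `neighbors`, `second_score`, and `rescore`, the weight `alpha`, and the window `width` are all hypothetical. The claims do not fix a correction formula, so the linear form `first_score + alpha * second_score` is an assumption, as is approximating the dialogue order by sorting the sections of all channels by start time.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: int
    channel: str   # hypothetical channel label, e.g. "operator" / "customer"
    start: float   # utterance start time in seconds
    text: str      # recognized content of the voice section


@dataclass
class Candidate:
    seg_id: int
    first_score: float  # keyword-vs-section score from the first search pass


def dialogue_order(segments):
    """Assumed dialogue order: section ids of all channels sorted by start time."""
    return [s.seg_id for s in sorted(segments, key=lambda s: s.start)]


def neighbors(order, seg_id, width=1):
    """Ids of sections within `width` turns before and after `seg_id`."""
    i = order.index(seg_id)
    return [s for s in order[max(0, i - width): i + width + 1] if s != seg_id]


def second_score(related, text):
    """Assumed second score: summed weights of related words found in `text`."""
    return sum(weight for word, weight in related.items() if word in text)


def rescore(candidates, segments, related, alpha=0.5, width=1):
    """Correct each first score with neighbor evidence, then sort descending."""
    by_id = {s.seg_id: s for s in segments}
    order = dialogue_order(segments)
    corrected = []
    for c in candidates:
        s2 = sum(second_score(related, by_id[n].text)
                 for n in neighbors(order, c.seg_id, width))
        corrected.append((c.seg_id, c.first_score + alpha * s2))
    return sorted(corrected, key=lambda r: r[1], reverse=True)
```

Under these assumptions, a candidate with a lower first score can overtake a higher-scoring one when an adjacent turn of the other speaker contains a related word, which is the effect the correction step is intended to produce.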
A voice data search method for searching for desired voice data in search target data stored in a storage device, the method comprising:
a step in which a processor that executes voice data search processing performs subword recognition on voice section data of the search target data using an acoustic model and a language model generated from learning speech data, and generates index voice information data including the voice section data, silent section information, voice file channel information indicating the channel on which the voice section data was uttered, and voice metadata information;
a step in which the processor generates dialogue order data indicating the utterance order of the voice section data based on the voice file channel information and the voice metadata information;
a step in which the processor receives a search keyword, calculates a first score value that is a score value between the search keyword and the voice section data included in the index voice information data, and acquires a plurality of search result candidates;
a step in which the processor identifies, based on the dialogue order data, the voice section data surrounding each of the plurality of search result candidates;
a step in which the processor acquires related information related to the search keyword and calculates a second score value that is a score value between the related information and the surrounding voice section data;
a step in which the processor corrects the first score value using the second score value and outputs a corrected score value; and
a step in which the processor sorts the plurality of search result candidates using the corrected score value and outputs the sorted search result candidates.
The voice data search method according to claim 7,
wherein the related information includes at least one of: a related word related to the search keyword; information on a silent section length; and information on the ratio between the length of the voice section to be searched and the lengths of the voice sections of other speakers before and after that voice section.
The voice data search method according to claim 8,
further comprising a step in which the processor acquires the related information related to the search keyword from a related information database that stores, for each of a plurality of words, a word attribute, co-occurrence words, the co-occurrence degree of each co-occurrence word, and information on the silent section of each co-occurrence word.
The voice data search method according to claim 9,
wherein, in the step of acquiring the related information, the processor acquires a co-occurrence word corresponding to the search keyword when the search keyword is registered in the related information database, and, when the search keyword is not registered in the related information database, selects a word similar to the search keyword using phoneme distance information and acquires a co-occurrence word corresponding to the similar word.
The voice data search method according to claim 8,
wherein, in the step of calculating the second score value, when calculating the second score value using the information on the silent section length, the processor calculates the second score value by obtaining the relative relationship between the lengths of the silent sections around the voice section that includes each of the plurality of search result candidates.
A computer-readable storage medium storing a program for causing a computer to execute the voice data search method according to claim 7.
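
For illustration only, and not as part of the claims, the related-word acquisition of claims 4 and 10 can be sketched in Python as follows. The layout of `related_db` (a dictionary mapping each registered word to its phoneme sequence and its co-occurrence words) and the function names `edit_distance` and `cooccurrence_words` are hypothetical. The claims refer only to "phoneme distance information"; using the Levenshtein edit distance over phoneme symbols is an assumption made here, and the database is assumed to be non-empty.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete a phoneme from a
                           cur[j - 1] + 1,             # insert a phoneme from b
                           prev[j - 1] + (pa != pb)))  # substitute or match
        prev = cur
    return prev[-1]


def cooccurrence_words(keyword, keyword_phonemes, related_db):
    """Return co-occurrence words for `keyword`.

    If the keyword is registered, its own entry is used. Otherwise the
    registered word with the smallest phoneme distance is used instead.
    """
    entry = related_db.get(keyword)
    if entry is None:
        nearest = min(
            related_db,
            key=lambda w: edit_distance(keyword_phonemes,
                                        related_db[w]["phonemes"]),
        )
        entry = related_db[nearest]
    return entry["cooccurrence"]
```

A practical system would probably weight substitutions by acoustic similarity between phonemes rather than treating every mismatch as cost 1; the uniform cost is used here only to keep the sketch short.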