WO2010023939A1 - Text mining device, text mining method, and computer-readable recording medium
- Publication number
- WO2010023939A1 (PCT application PCT/JP2009/004211)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text data
- reliability
- text
- mining
- unique
- Prior art date
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- The present invention relates to a text mining device and a text mining method that take, as the mining target, text data obtained by computer processing.
- Text mining has attracted attention as a technique for extracting useful information from large amounts of text data.
- In text mining, a collection of unformatted sentences is divided into words and phrases using natural language analysis techniques, and feature words are extracted. The appearance frequency and correlations of the feature words are then analyzed, and the resulting useful information is presented to the analyst. Text mining thus makes it possible to analyze huge amounts of text data that could never be analyzed manually.
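As a minimal illustration of the frequency-analysis step just described, the occurrences of feature words across a set of already-divided texts can be counted as follows (the word division itself would be produced by a natural language analysis technique such as morphological analysis; the words here are invented):

```python
from collections import Counter

def feature_word_frequencies(tokenized_texts):
    """Count how often each feature word appears across a collection of
    already-tokenized texts."""
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(tokens)
    return counts

# Invented questionnaire responses, already divided into words.
responses = [["battery", "short", "battery"], ["screen", "battery", "dark"]]
frequencies = feature_word_frequencies(responses)
```

An analyst would then inspect the most frequent words, or their correlations, rather than reading every response.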
- One field in which such text mining is applied is free-description questionnaires.
- In this application, text mining is executed on text data obtained by typing in questionnaire responses or by performing character recognition on them (see, for example, Patent Document 1, Patent Document 2, and Non-Patent Document 1).
- An analyst can perform various analyses and verify hypotheses using the results of the text mining.
- Another field in which text mining is applied is corporate call centers.
- A call center accumulates a large number of voice recordings of calls between customers and operators, as well as memos created by key input or the like while an operator responds.
- This information has become an important source of knowledge for companies seeking to understand consumer needs and possible improvements to their products and services.
- In a call center, text mining is executed on text data obtained by speech recognition of calls (speech recognition text data) or on text data obtained from the call memos created by operators (call memo text data). Which text data is subjected to text mining is determined by the kind of analysis the analyst requires.
- Speech recognition text data covers all calls between operators and consumers. Therefore, when the purpose is to extract consumer demands regarding products and services, all consumer utterances must be covered, so text mining is performed on the speech recognition text data.
- Call memo text data covers a narrower range, but it includes matters that the operator judged to be important during the call, as well as matters that the operator recognized and decided based on the content of the call. Therefore, when an analysis focusing on such operator-added information is required, for example to extract the know-how of expert operators so that it can be shared with other operators, or to identify judgment errors of new operators, text mining is executed on the call memo text data.
- However, speech recognition text data contains recognition errors in most cases.
- Consequently, feature words may not be extracted accurately because of the influence of these recognition errors.
- To address this, it has been proposed to perform text mining using speech recognition text data in which a reliability is assigned to each word candidate obtained by speech recognition (see, for example, Non-Patent Document 2).
- In the text mining described in Patent Document 3, a correction based on this reliability is applied when the extracted feature words are counted, reducing the influence of recognition errors.
- The speech recognition text data and the call memo text data described in the call center example are pieces of information obtained from the same event (a call) via different channels. Although the channels differ, the information source is the same. Therefore, text mining that exploits the characteristics of both and uses them in a complementary manner is expected to enable more sophisticated analysis than text mining on only one text data, or simple text mining on both.
- Specifically, the speech recognition text data can be separated into a part common to the call memo text data and a part unique to the call voice, that is, a part not described in the call memo text data.
- Likewise, the call memo text data can be divided into a part common to the speech recognition text data and a part unique to the call memo, that is, a part not appearing in the speech recognition text data.
- Suppose, for example, that text mining is executed on the part of the speech recognition text data unique to the call voice.
- This text mining intensively analyzes information that appears in the call voice but is missing from the call memo. Through this analysis, information that should have been recorded in the call memo but was omitted is extracted. The extracted information can be used to improve the description guidelines for call memos.
- Next, suppose that text mining is executed on the part of the call memo text data unique to the call memo.
- This text mining focuses on information that appears in the call memo but not in the speech recognition text data of the call voice. This analysis makes it possible to extract the judgment know-how of experienced operators more reliably than when text mining is performed only on the call memo text data.
- the extracted judgment know-how can be used as educational material for new operators.
- Such text mining, performed on a plurality of text data obtained from the same event via different channels (hereinafter referred to as “cross-channel text mining”), can also be used in other settings.
- For example, cross-channel text mining can be used to analyze a corporate image from reported content, or to analyze the content of conversations in communication settings such as meetings.
- In the former case, text mining is executed on speech recognition text data obtained from the utterances of, for example, an announcer, together with text data such as the utterance manuscript or a newspaper article.
- In the latter case, text mining is executed on speech recognition text data obtained by speech recognition of the participants' conversation, together with text data such as documents referenced by the participants on the spot and memos or minutes made by the participants.
- the mining target need not be speech recognition text data or text data created by key input.
- character recognition text data (see Non-Patent Document 3) obtained by character recognition of the above-described questionnaires, minutes, etc. are also subject to mining.
- text data generated by computer processing such as voice recognition or character recognition contains errors.
- the recognition error is included in the speech recognition text data.
- these errors affect the discrimination between the common part and the unique part between text data, and thus may greatly reduce the reliability of the mining result.
- Patent Document 3 discloses a technique for reducing a case where a recognition error in speech recognition affects text mining.
- However, this technique does not take cross-channel text mining into account. Even if the technique disclosed in Patent Document 3 were applied to cross-channel text mining, it would be difficult to improve the reliability of the mining results, because the influence of recognition errors on the discrimination between the common parts and the unique parts of the text data would not be removed.
- An object of the present invention is to solve the above problems and to provide a text mining device, a text mining method, and a computer-readable recording medium that, in text mining on a plurality of text data including text data generated by computer processing, can suppress the influence of computer processing errors on the mining result.
- In order to achieve the above object, a text mining device according to the present invention is a text mining device that executes text mining on a plurality of text data including text data generated by computer processing, wherein a reliability is set for each of the plurality of text data. The device includes: a unique part extraction unit that extracts, for each of the plurality of text data, a unique part of that text data with respect to the other text data; a unique reliability setting unit that, using the reliability set for each of the plurality of text data, sets for each unique part a unique reliability indicating the reliability of that unique part; and a mining processing unit that executes text mining on each unique part using the unique reliability.
- Further, a text mining method according to the present invention is a text mining method for executing text mining on a plurality of text data including text data generated by computer processing, and includes the steps of: (A) setting a reliability for each of the plurality of text data; (B) extracting, for each of the plurality of text data, a unique part of that text data with respect to the other text data; (C) using the reliability set for each of the plurality of text data, setting for each unique part a unique reliability indicating the reliability of that unique part; and (D) executing text mining on each unique part using the unique reliability.
- Further, a computer-readable recording medium according to the present invention records a program including instructions for causing a computer device to execute the steps of: (A) setting a reliability for each of the plurality of text data; (B) extracting, for each of the plurality of text data, a unique part of that text data with respect to the other text data; (C) using the reliability set for each of the plurality of text data, setting for each unique part a unique reliability indicating the reliability of that unique part; and (D) executing text mining on each unique part using the unique reliability.
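The steps (A) through (D) shared by the device, method, and recording medium above can be sketched as a generic pipeline; the four callables stand in for the concrete procedures described in the embodiments, and all names are illustrative:

```python
def cross_channel_text_mining(data_sets, set_reliability, extract_unique,
                              set_unique_reliability, mine):
    """Generic skeleton of steps (A)-(D) for a list of text data sets."""
    # (A) set a reliability for each text data
    reliabilities = [set_reliability(d) for d in data_sets]
    results = []
    for i, data in enumerate(data_sets):
        others = data_sets[:i] + data_sets[i + 1:]
        # (B) extract the unique part of this text data w.r.t. the others
        unique = extract_unique(data, others)
        # (C) set a unique reliability for each element of the unique part
        unique_rel = set_unique_reliability(unique, reliabilities, i)
        # (D) execute text mining on the unique part using its reliability
        results.append(mine(unique, unique_rel))
    return results
```

With trivial stand-ins (perfect reliabilities, set difference for extraction), the skeleton already yields, for each text data, the mining result of its unique part.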
- FIG. 1 is a block diagram showing a schematic configuration of a text mining apparatus according to Embodiment 1 of the present invention.
- FIG. 2 is a diagram illustrating an example of data that is a target of text mining in Embodiment 1 of the present invention.
- FIG. 3 is a diagram illustrating an example of speech recognition text data in which reliability is set.
- FIG. 4 is a diagram illustrating an example of speech recognition text data in which the reliability is set and the language is English.
- FIG. 5 is a diagram for explaining a unique part extraction process executed by the text mining apparatus according to Embodiment 1 of the present invention.
- FIG. 6 is a diagram illustrating an example of a setting state of the unique reliability.
- FIG. 7 is a diagram illustrating an example of a result of the text mining process.
- FIG. 8 is a flowchart showing the flow of processing performed by the text mining method according to Embodiment 1 of the present invention.
- FIG. 9 is a block diagram showing a schematic configuration of the text mining device according to Embodiment 2 of the present invention.
- FIG. 10 is a diagram illustrating an example of the unique part extracted by the text mining device according to Embodiment 2 of the present invention.
- FIG. 11 is a flowchart showing a flow of processing performed by the text mining method according to Embodiment 2 of the present invention.
- FIG. 12 is a block diagram showing a schematic configuration of the text mining device according to Embodiment 3 of the present invention.
- FIG. 13 is a flowchart showing a flow of processing performed by the text mining method according to Embodiment 3 of the present invention.
- Embodiment 1. The text mining device, text mining method, and program according to Embodiment 1 of the present invention will be described below with reference to FIGS. 1 to 8. First, the configuration of the text mining device according to Embodiment 1 of the present invention will be described with reference to FIGS. 1 to 7.
- FIG. 1 is a block diagram showing a schematic configuration of a text mining apparatus according to Embodiment 1 of the present invention.
- FIG. 2 is a diagram illustrating an example of data that is a target of text mining in Embodiment 1 of the present invention.
- FIG. 3 is a diagram illustrating an example of speech recognition text data in which reliability is set.
- FIG. 4 is a diagram illustrating an example of speech recognition text data in which the reliability is set and the language is English.
- FIG. 5 is a diagram for explaining a unique part extraction process executed by the text mining apparatus according to Embodiment 1 of the present invention.
- FIG. 6 is a diagram illustrating an example of a setting state of the unique reliability.
- FIG. 7 is a diagram illustrating an example of a result of the text mining process.
- the text mining apparatus 1 shown in FIG. 1 executes text mining on a plurality of text data including text data generated by computer processing.
- the text mining device 1 includes a unique part extraction unit 6, a unique reliability setting unit 7, and a mining processing unit 8.
- reliability is set for each of the multiple text data.
- the “reliability” here indicates the degree of appropriateness of the words constituting the text data.
- the “reliability” is an indicator of whether each word constituting the text data is correct as a processing result of the computer processing.
- the unique part extraction unit 6 extracts a unique part of each text data with respect to other text data for each of the plurality of text data.
- The unique part of one text data with respect to the other text data means a word or phrase constituting that text data which is not included in the other text data at all, or which appears there only rarely even if included.
- The unique reliability setting unit 7 uses the reliability set for each of the plurality of text data to set, for each unique part of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique part.
- The mining processing unit 8 executes text mining on each unique part of each text data with respect to the other text data, using the unique reliability.
- In this way, the text mining device 1 sets, for the unique part of each text data, a unique reliability indicating the reliability of that unique part.
- The unique reliability is obtained from the reliability set for each text data, and serves as an index of whether the unique part is correct as a result of the computer processing. This unique reliability is referred to during the text mining process.
- Therefore, the influence of computer processing errors can easily be removed from the mining result.
- As a result, a highly reliable mining result in which the influence of computer processing errors is suppressed can be obtained.
- computer processing refers to analysis processing executed by a computer according to a certain algorithm.
- text data obtained by computer processing refers to text data automatically generated by computer processing.
- Specific examples of computer processing include processing such as speech recognition processing, character recognition processing, and machine translation processing.
- The text mining device 1 receives three types of data: the call voice data D1, the call memo text data D2, and the incidental information D3 shown in FIG. 2.
- the call voice data D1 is voice data in which communication between the operator and the customer at the call center is recorded.
- “A” indicates an operator
- “B” indicates a customer.
- the call memo text data D2 is text data created as a memo by the operator during a call and is not text data obtained by computer processing.
- The incidental information D3 is data accompanying the call voice data D1 and the call memo text data D2; only a part of it is shown in FIG. 2.
- The incidental information D3 is mainly used in the calculation of the feature degree, described later.
- the call voice data D1 is one unit (one record) from the start to the end of the call between the operator and the customer, and the call memo text data D2 and the accompanying information D3 are created for each record.
- FIG. 2 shows one record of call voice data D1, corresponding call memo text data D2, and incidental information D3.
- The call voice data D1(l) of the record with record number l, together with the corresponding call memo text data D2(l) and incidental information D3(l), forms one set, and a plurality of such sets are input to the text mining device 1.
- In addition to the unique part extraction unit 6, the unique reliability setting unit 7, and the mining processing unit 8, the text mining device 1 includes a data input unit 2, a speech recognition unit 3, and a language processing unit 5.
- an input device 13 and an output device 14 are connected to the text mining device 1.
- Specific examples of the input device 13 include a keyboard and a mouse.
- Specific examples of the output device 14 include a display device such as a liquid crystal display, a printer, and the like.
- the input device 13 and the output device 14 may be attached to another computer device connected to the text mining device 1 via a network.
- The call voice data D1(l), the corresponding call memo text data D2(l), and the corresponding incidental information D3(l) of each record l are input to the data input unit 2.
- these data may be directly input to the data input unit 2 from an external computer device via a network, or may be provided in a state stored in a recording medium.
- In the former case, an interface that connects the text mining device 1 to the outside is used as the data input unit 2.
- In the latter case, a reading device that reads the recording medium is used as the data input unit 2.
- When these data are input, the data input unit 2 outputs the call voice data D1(l) to the speech recognition unit 3 and the call memo text data D2(l) to the language processing unit 5. The data input unit 2 also outputs the incidental information D3(l) to the mining processing unit 8.
- the voice recognition unit 3 performs voice recognition on the call voice data D1 (l) and generates voice recognition text data.
- the voice recognition unit 3 includes a reliability setting unit 4.
- the reliability setting unit 4 sets the reliability for each word constituting the speech recognition text data.
- the speech recognition text data in which the reliability is set is output to the unique part extraction unit 6.
- Processing in the speech recognition unit 3 will be described with reference to FIGS. 3 and 4, using conversations included in the call voice data D1 shown in FIG. 2.
- Of the many conversations included in the call voice data D1, “Do you have a storage function?” and “Is there white?” are used as examples.
- The speech recognition unit 3 performs speech recognition on the call voice data D1(l) of each record l. Then, as shown in FIG. 3, the speech recognition unit 3 extracts candidate words w_i for each time frame m.
- the numbers attached to the horizontal axis are frame numbers, and the frame numbers are continuous in one record l.
- When a plurality of candidates exist for a frame, the speech recognition unit 3 extracts a plurality of words.
- In the example of FIG. 3, two candidates, “storage” and “thermal insulation”, are extracted in frame number 20, and two candidates, “color” and “white”, are extracted in frame number 33.
- The same applies when the language is English: the speech recognition unit 3 extracts candidate words w_i for each time frame m. For example, when the conversation is “Does it have a heat retaining function?” and “Do you have white color?”, corresponding to the English translation of the example of FIG. 3, the speech recognition unit 3 extracts the candidate words w_i as shown in FIG. 4.
- the speech recognition unit 3 does not have to extract all words as candidates.
- For example, regardless of the language, the speech recognition unit 3 may skip words that are meaningless on their own, such as particles and prepositions, and extract only independent words such as nouns, verbs, and adverbs as candidates.
- The reliability setting unit 4 sets a reliability R_Call(w_i, l, m) for each word w_i.
- The reliability R_Call(w_i, l, m) is not particularly limited, as long as it serves as an index of whether the words constituting the speech recognition text data are correct as a recognition result.
- For example, the “confidence measure” disclosed in Non-Patent Document 2 described above can be used as the reliability R_Call(w_i, l, m). Specifically, suppose that the input speech, or the acoustic features observed from it, is given. In this case, the reliability R_Call(w_i, l, m) of a word w_i can be calculated as the posterior probability of w_i, using the Forward-Backward algorithm on the word graph obtained as the recognition result for the input speech or acoustic features.
- For each word w_i, the reliability setting unit 4 also calculates a per-record reliability R_Call(w_i, l) from the reliability R_Call(w_i, l, m) obtained above. Specifically, the reliability setting unit 4 performs this calculation for all words w_i using the following equation (Equation 1).
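(Equation 1) itself is not reproduced above. Purely as an illustration, assuming the per-record reliability to be the average of the frame-level reliabilities of the frames in which a word was hypothesized, the calculation can be sketched as follows (all confidence values are invented):

```python
def record_reliability(frame_confidences):
    """Aggregate frame-level reliabilities R_Call(w_i, l, m) into a
    per-record reliability R_Call(w_i, l) by averaging over the frames
    in which each word appears (an assumed stand-in for Equation 1)."""
    per_word = {}
    for (word, frame), conf in frame_confidences.items():
        per_word.setdefault(word, []).append(conf)
    return {w: sum(cs) / len(cs) for w, cs in per_word.items()}

# Candidate words per frame as in FIG. 3, with invented confidences.
frames = {("storage", 20): 0.6, ("thermal insulation", 20): 0.4,
          ("white", 33): 0.8, ("color", 33): 0.5}
r_call = record_reliability(frames)
```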
- Alternatively, speech recognition may be performed in advance by a speech recognition device external to the text mining device 1, and speech recognition text data in which a reliability has been set for each word may be input to the text mining device 1.
- In this case, the text mining device 1 need not include the speech recognition unit 3, and the speech recognition text data is input to the unique part extraction unit 6 via the data input unit 2.
- When the text mining device 1 includes the speech recognition unit 3, however, the language model and the acoustic model used for speech recognition can easily be adjusted, and the recognition accuracy of the speech recognition can be improved.
- The language processing unit 5 performs language processing such as morphological analysis, dependency analysis, synonym processing, and unnecessary word removal on the call memo text data.
- Specifically, the language processing unit 5 divides the call memo text data into words w_j corresponding to the words w_i of the speech recognition text data, thereby generating a word string.
- the word string is output to the unique part extraction unit 6.
- the unique part extraction unit 6 extracts a unique part of the speech recognition text data for the call memo text data and a unique part of the call memo text data for the speech recognition text data.
- these unique parts are referred to as “unique part of speech recognition text data” and “unique part of call memo text data”, respectively.
- For the extraction, the unique part extraction unit 6 first extracts, from the group of words constituting each text data, the words that do not match any word constituting the other text data. Next, the unique part extraction unit 6 sets the extracted words as the unique part of that text data with respect to the other text data.
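A simplified, set-based sketch of this matching follows; the actual unit compares the word strings produced by speech recognition and language processing, but the word sets here suffice to show the principle:

```python
def extract_unique_parts(words_a, words_b):
    """Return, for each of two word sets, the words that do not match
    any word of the other set (a simplified sketch of the unique part
    extraction)."""
    return words_a - words_b, words_b - words_a

# Speech recognition candidates include both "color" and "white" for
# one frame, while the corresponding call memo contains only "white".
speech_words = {"color", "white"}
memo_words = {"white"}
unique_speech, unique_memo = extract_unique_parts(speech_words, memo_words)
```

Here “color” becomes a unique part element of the speech recognition text data, while “white”, being common to both, is not extracted as a unique part of either.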
- the unique part extraction unit 6 extracts “white” as a unique part of the speech recognition text data.
- On the other hand, the speech recognition text data obtained from the call voice data D1(1) does not include “color”, but the corresponding call memo text data D2(1) does include “color”.
- In this case, the unique part extraction unit 6 extracts “color” as a unique part of the call memo text data D2(1).
- Further, the speech recognition text data obtained from the call voice data D1(3) has two candidates, “color” and “white”, for the same frame number (see FIG. 3).
- The corresponding call memo text data D2(3) includes only “white”.
- In this case, the unique part extraction unit 6 extracts “color” as a unique part of the speech recognition text data, but does not extract “white” as a unique part of either text data.
- the unique part of the speech recognition text data and the unique part of the call memo text data extracted in this way are input to the unique reliability setting unit 7.
- That is, the words w_i extracted as the unique part of the speech recognition text data (hereinafter referred to as “unique part elements w_i”) and the words w_j extracted as the unique part of the call memo text data (hereinafter referred to as “unique part elements w_j”) are input to the unique reliability setting unit 7.
- The unique reliability setting unit 7 first uses the word string output from the language processing unit 5 to set a reliability R_Memo(w_j, l) for each word w_j constituting the call memo text data.
- Since the call memo text data is not generated by computer processing, the reliability of a word included in the call memo text data is “1.0”.
- Likewise, the reliability of a word not included in the call memo text data is “0.0”.
- Next, the unique reliability setting unit 7 sets a unique reliability C_Call(w_i, l) for each unique part element w_i and a unique reliability C_Memo(w_j, l) for each unique part element w_j.
- Specifically, the unique reliability setting unit 7 applies the reliability R_Call(w_i, l), the reliability R_Memo(w_j, l), the reliability R_Call(w_j, l), and the reliability R_Memo(w_i, l) to the following equations (Equation 2) and (Equation 3).
- In this way, the unique reliability C_Call(w_i, l) and the unique reliability C_Memo(w_j, l) are calculated.
- The calculated unique reliability C_Call(w_i, l) and unique reliability C_Memo(w_j, l) are input to the mining processing unit 8 together with the unique part elements w_i and w_j.
- In other words, when the unique reliability setting unit 7 sets the unique reliability for the unique part of one text data, it multiplies the reliability set in that text data by the value obtained by subtracting the reliability set in the other text data from 1. The unique reliability obtained in this way is easy to set and reliably represents the reliability of the unique part.
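In code form, the rule just described, multiplying one text data's reliability by one minus the other's, is simply the following (the numeric reliability values are invented):

```python
def unique_reliability(r_own, r_other):
    """Unique reliability of a unique part element: the reliability set
    in its own text data times (1 - the reliability set in the other
    text data), as described for (Equation 2) and (Equation 3)."""
    return r_own * (1.0 - r_other)

# A word recognized with reliability 0.5 and absent from the call memo
# (memo reliability 0.0) keeps a unique reliability of 0.5; a word that
# the memo contains with reliability 1.0 drops to 0.0.
c_unique = unique_reliability(0.5, 0.0)
c_common = unique_reliability(0.9, 1.0)
```

A likely recognition error that also appears in the other channel is thus suppressed, while a well-supported word missing from the other channel keeps its weight.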
- The mining processing unit 8 performs so-called cross-channel text mining using the unique reliability C_Call(w_i, l) and the unique reliability C_Memo(w_j, l). That is, the mining processing unit 8 performs a mining process on the unique part elements w_i and a mining process on the unique part elements w_j.
- the mining processing unit 8 extracts feature words as the mining process and calculates the feature degree.
- “Feature words” are the words and phrases extracted by the mining process.
- In the present embodiment, the feature words are extracted from the words determined to be unique part elements w_i or unique part elements w_j.
- The “feature degree” indicates the degree to which an extracted feature word is characteristic of an arbitrary category (for example, the set of records having a specific value in the incidental information D3).
- Specifically, the mining processing unit 8 includes a mining processing management unit 9, a feature word counting unit 10, a feature degree calculation unit 11, and a mining result output unit 12.
- The feature word counting unit 10 extracts feature words from the unique part elements w_i and w_j, and counts how many times each extracted feature word appears in the corresponding text data and in all the text data. The appearance frequency and the total appearance frequency are thereby obtained (see FIG. 7).
- In doing so, the feature word counting unit 10 extracts the feature words using the unique reliability C_Call(w_i, l) and the unique reliability C_Memo(w_j, l). For example, a threshold is set for the unique reliability, and only unique part elements whose unique reliability is equal to or higher than the threshold are extracted as feature words. In the example of FIG. 7, the threshold is set to 0.4, and the unique part element “black”, whose unique reliability is 0.3, is excluded from the feature words.
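The threshold-based selection in the example of FIG. 7 can be sketched as follows (the reliability of 0.3 for “black” and the threshold of 0.4 come from the text; the other value is invented):

```python
def select_feature_words(unique_elements, threshold=0.4):
    """Keep only unique part elements whose unique reliability is at or
    above the threshold."""
    return {w for w, c in unique_elements.items() if c >= threshold}

# "black", with unique reliability 0.3, falls below the 0.4 threshold
# and is excluded from the feature words.
candidates = {"white": 0.8, "black": 0.3}
feature_words = select_feature_words(candidates)
```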
- The threshold may be set as appropriate, but it is preferable to perform an experiment in advance and set the threshold based on the experimental results. Specifically, speech data and text data in which the unique parts are set in advance are used as experimental data, and the unique reliability C_Call(w_i, l) and the unique reliability C_Memo(w_j, l) are calculated for them. A threshold is then set so that the unique parts preset in each data set are extracted. In this case, a threshold can be set for each unique reliability. To increase the reliability of the set threshold, it is preferable to prepare as much experimental data as possible.
- the feature word counting unit 10 can count feature words for a plurality of records.
- The number of records over which feature words are counted is not particularly limited. Note that when cross-channel mining is not performed, the feature word counting unit 10 counts the appearance frequency not of the unique part elements but of all the words included in the speech recognition text data or the call memo text data (except words that are meaningless on their own).
- the feature degree calculation unit 11 calculates the feature degree (see FIG. 7) using the appearance frequency and the total appearance frequency obtained by the feature word counting unit 10.
- the calculation method of the feature degree is not particularly limited, and can be performed using various statistical analysis techniques according to the purpose of mining.
- The feature degree calculation unit 11 calculates, as the feature amount of a word, a statistical measure of the word in a specific category, such as its appearance frequency, the log likelihood ratio, the χ2 value, the χ2 value with Yates' correction, the pointwise mutual information, SE, or ESC, and the obtained value can be used as the feature degree.
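- As one concrete instance of the measures listed above, a χ2 value can be computed from a 2×2 contingency table of a word against a target category. This is a generic textbook sketch, not the patent's specific formula, and the count values are hypothetical.

```python
# chi-square statistic of a word w for a target category, from the counts:
#   a: records in the category containing w,   b: other records containing w,
#   c: records in the category without w,      d: other records without w.
def chi_square(a, b, c, d):
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / den

# Hypothetical counts: the word appears in 30 of 50 in-category records
# and in 10 of 50 out-of-category records.
print(round(chi_square(30, 10, 20, 40), 3))  # 16.667
```

A large value indicates that the word's distribution is strongly skewed toward the category, which is what the feature degree is meant to capture.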
- Examples of the specific category include a set of records having a specific value that the analyst designates in the incidental information D3.
- statistical analysis techniques such as multiple regression analysis, principal component analysis, factor analysis, discriminant analysis, and cluster analysis can be used for calculating the degree of feature.
- The mining processing management unit 9 receives the mining conditions input by the user via the input device 13, and operates the feature word counting unit 10 and the feature degree calculation unit 11 according to the received conditions. For example, when the user instructs that text mining be performed only on the unique part of the speech recognition text data, the mining processing management unit 9 causes the feature word counting unit 10 to extract feature words from the unique partial elements w i of the speech recognition text data and to count the feature words. Further, the mining processing management unit 9 causes the feature degree calculation unit 11 to calculate the feature degree.
- the mining result output unit 12 outputs a mining result as shown in FIG.
- the mining result includes a feature word, an appearance frequency, a total appearance frequency, an inherent reliability, and a feature degree.
- the mining results for both the voice recognition text data and the call memo text data are output.
- When the output device 14 is a display device, the mining result is displayed on its display screen.
- FIG. 8 is a flowchart showing the flow of processing performed by the text mining method according to Embodiment 1 of the present invention.
- the text mining method according to the first embodiment can be implemented by operating the text mining apparatus 1 shown in FIG. Therefore, hereinafter, the text mining method according to the first embodiment will be described together with the description of the operation of the text mining apparatus 1 shown in FIG. 1 with appropriate reference to FIGS.
- the language processing unit 5 executes language processing on the call memo text data (step A1).
- the call memo text data becomes a word string of the word w j , and is output to the unique part extraction unit 6 and the unique reliability setting unit 7 in a state of being a word string.
- Next, the speech recognition unit 3 performs speech recognition and creates the speech recognition text data by extracting words w i as candidates (step A2).
- Then, the reliability setting unit 4 sets a reliability R Call (w i , l, m) for each word w i constituting the speech recognition text data.
- Next, the reliability setting unit 4 calculates the reliability R Call (w i , l) for each record l by applying the reliability R Call (w i , l, m) to the above equation (Equation 1) (step A3).
- The execution order of steps A2 and A3 is not limited to the above: steps A2 and A3 may be executed before step A1, or may be executed simultaneously with step A1.
- In step A4, the unique part extraction unit 6 compares the words w j of the call memo text data with the words w i of the speech recognition text data, and extracts the unique part (unique partial elements w i ) of the speech recognition text data and the unique part (unique partial elements w j ) of the call memo text data (step A4).
- the unique part extraction unit 6 inputs the extracted unique part element w i and the unique part element w j to the unique reliability setting unit 7.
- Next, the unique reliability setting unit 7 uses the word string output from the language processing unit 5 to set the reliability R Memo (w j , l) for each word w j constituting the call memo text data D2 (l) in each record l (step A5).
- Then, the unique reliability setting unit 7 sets the unique reliability C Call (w i , l) for the unique partial elements w i and the unique reliability C Memo (w j , l) for the unique partial elements w j (step A6).
- Specifically, the unique reliability setting unit 7 applies the reliability R Call (w i , l) and the reliability R Memo (w j , l) to the above equations (Equation 2) and (Equation 3) to calculate the unique reliability C Call (w i , l) and the unique reliability C Memo (w j , l).
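- (Equation 2) and (Equation 3) themselves are not reproduced in this excerpt; elsewhere, the document states that the unique reliability is obtained by multiplying the reliability set in one text data by a value obtained by subtracting the other text data's reliability from 1. A minimal sketch under that reading:

```python
def unique_reliability(r_own, r_other):
    """Unique reliability of a word in one text data with respect to another:
    own reliability times (1 - the other text data's reliability)."""
    return r_own * (1.0 - r_other)

# A word recognized with reliability 0.9 in the call and effectively absent
# from the memo (other-side reliability 0.0) keeps a high unique reliability:
print(unique_reliability(0.9, 0.0))  # 0.9
```

Under this rule, a word that is confidently recognized on one channel but clearly absent from the other retains a high unique reliability, while a word that the other channel also supports is discounted.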
- The unique reliability setting unit 7 inputs the calculated unique reliability C Call (w i , l) and unique reliability C Memo (w j , l) to the feature word counting unit 10.
- Next, the mining processing unit 8 executes the mining process (step A7). Specifically, first, the feature word counting unit 10 extracts feature words from the unique partial elements w i and the unique partial elements w j using the unique reliability C Call (w i , l) and the unique reliability C Memo (w j , l). Further, the feature word counting unit 10 counts the appearance frequency and the total appearance frequency. Then, the feature degree calculation unit 11 calculates the feature degree of each extracted feature word. By executing step A7, the data shown in FIG. 7 are obtained.
- Thereafter, the mining result output unit 12 outputs the result obtained in step A7 to the output device 14 (step A8).
- the text mining device 1 ends the process.
- As described above, in Embodiment 1, the mining process for the unique parts is performed using the unique reliability set for the unique part of each text data. For this reason, the influence that recognition errors occurring during speech recognition have on the mining result is extremely small.
- the program according to the first embodiment may be a program including instructions that cause a computer to execute steps A1 to A8 shown in FIG.
- the text mining apparatus 1 can be realized by installing the program according to the first embodiment on a computer and executing the program.
- The CPU (Central Processing Unit) of the computer functions as the speech recognition unit 3, the language processing unit 5, the unique part extraction unit 6, the unique reliability setting unit 7, and the mining processing unit 8, and performs the processing of steps A1 to A8.
- The program according to the first embodiment is supplied in a state of being stored in a computer-readable recording medium such as an optical disk, a magnetic disk, a magneto-optical disk, a semiconductor memory, or a floppy disk.
- FIG. 9 is a block diagram showing a schematic configuration of the text mining device according to Embodiment 2 of the present invention.
- FIG. 10 is a diagram illustrating an example of the unique part extracted by the text mining device according to Embodiment 2 of the present invention.
- The text mining device 20 according to Embodiment 2 differs from the text mining device 1 according to Embodiment 1 shown in FIG. 1 in the configuration of the unique part extraction unit 6.
- In Embodiment 2, whether or not a word is a unique part is determined more strictly than in Embodiment 1.
- the difference from the first embodiment will be mainly described.
- In the text mining device 20, the unique part extraction unit 6 calculates a score S call (w i , l) or a score S Memo (w j , l) for each word constituting each text data, and extracts the unique part of each text data based on the calculated values.
- the score S call (w i , l) indicates the degree to which each word w i constituting the speech recognition text data corresponds to its unique part.
- The score S Memo (w j , l) indicates the degree to which each word w j constituting the call memo text data corresponds to its unique part.
- the unique part extraction unit 6 includes a frequency calculation unit 15, a score calculation unit 16, and a unique part determination unit 17.
- The frequency calculation unit 15 uses the word string output from the language processing unit 5 to set the reliability R Memo (w j , l) for each word w j constituting the call memo text data. The reliability R Memo (w j , l) set here is also input to the unique reliability setting unit 7, because it is needed to calculate the unique reliability.
- The frequency calculation unit 15 obtains the appearance frequencies N Call (w i ) and N Memo (w j ) of each word w i and each word w j from the reliability R Call (w i , l) set by the reliability setting unit 4 and the reliability R Memo (w j , l). In addition, using both the reliability R Call (w i , l) and the reliability R Memo (w j , l), the frequency calculation unit 15 also obtains the co-occurrence frequency N Call, Memo (w i , w j ) over all records (record (1) to record (L)).
- Specifically, the frequency calculation unit 15 obtains the appearance frequency N Call (w i ) of the word w i using the following equation (Equation 4), and obtains the appearance frequency N Memo (w j ) of the word w j using the following equation (Equation 5). Also, the frequency calculation unit 15 obtains the co-occurrence frequency N Call, Memo (w i , w j ) using the following equation (Equation 6). After that, the frequency calculation unit 15 outputs the appearance frequency N Call (w i ), the appearance frequency N Memo (w j ), and the co-occurrence frequency N Call, Memo (w i , w j ) to the score calculation unit 16.
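- (Equation 4) to (Equation 6) are not reproduced in this excerpt. A common form, assumed here purely for illustration, accumulates the per-record reliabilities as soft counts, with the co-occurrence frequency summing the product of the two channels' reliabilities:

```python
# Assumed soft-count form (not the patent's exact equations):
#   N_Call(w_i)           = sum over records l of R_Call(w_i, l)
#   N_Memo(w_j)           = sum over records l of R_Memo(w_j, l)
#   N_Call,Memo(w_i, w_j) = sum over l of R_Call(w_i, l) * R_Memo(w_j, l)
def soft_counts(r_call, r_memo):
    """r_call, r_memo: per-record reliabilities of w_i and w_j,
    aligned over records (1) to (L)."""
    n_call = sum(r_call)
    n_memo = sum(r_memo)
    n_cooc = sum(a * b for a, b in zip(r_call, r_memo))
    return n_call, n_memo, n_cooc

print(soft_counts([1.0, 0.5], [0.5, 1.0]))  # (1.5, 1.5, 1.0)
```

With counts of this kind, a low-reliability recognition result contributes only weakly to the frequencies, which is consistent with the document's aim of suppressing the influence of recognition errors.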
- The score calculation unit 16 calculates the score S call (w i , l) and the score S Memo (w j , l) using the appearance frequency N Call (w i ), the appearance frequency N Memo (w j ), and the co-occurrence frequency N Call, Memo (w i , w j ). Specifically, the score calculation unit 16 first calculates the mutual information I (w i ; w j ), treating w i and w j as discrete random variables.
- Let P Call (w i ) and P Memo (w j ) be the marginal probability distribution functions used in the mutual information I (w i ; w j ).
- P Call (w i ) is calculated by the following equation (Equation 8).
- P Memo (w j ) is calculated by the following equation (Equation 9).
- P Call (w i ) is a marginal probability distribution function for a probability event that the word w i appears in the speech recognition text data Call in a certain record.
- P Memo (w j ) is a marginal probability distribution function for the probability event that the word w j appears in the call memo text data Memo in a certain record.
- the score calculation unit 16 calculates the score S call (w i , l) and the score S Memo (w j , l) using the mutual information amount I (w i ; w j ).
- For the score S call (w i , l) and the score S Memo (w j , l), a function that decreases monotonically with respect to the mutual information I (w i ; w j ) is used.
- The score S call (w i , l) is calculated by the following equation (Equation 11), and the score S Memo (w j , l) is calculated by the following equation (Equation 12).
- Here, the constant used in (Equation 11) and (Equation 12) is an arbitrary constant larger than 0 (zero).
- the calculated score S call (w i , l) and score S Memo (w j , l) are output to the unique part determination unit 17.
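- (Equation 7) to (Equation 12) are not reproduced in this excerpt. The sketch below assumes probabilities estimated over L records and one possible monotonically decreasing mapping, exp(-alpha * I); both the probability estimates and the mapping are illustrative assumptions, not the patent's formulas.

```python
import math

def score(n_call, n_memo, n_cooc, num_records, alpha=1.0):
    """Score a word pair from soft counts over `num_records` records:
    high mutual information (strong cross-channel co-occurrence) gives a
    low score, so unique parts end up with high scores."""
    p_call = n_call / num_records    # assumed marginal P_Call(w_i)
    p_memo = n_memo / num_records    # assumed marginal P_Memo(w_j)
    p_joint = n_cooc / num_records   # assumed joint probability
    if p_joint == 0.0:
        return 1.0  # never co-occurs: maximally unique under this sketch
    mi = math.log(p_joint / (p_call * p_memo))
    return math.exp(-alpha * max(mi, 0.0))

# Strong co-occurrence across the channels lowers the score:
print(round(score(5.0, 5.0, 5.0, 10), 6))  # 0.5
```

The monotonic decrease matches the text: words that the two channels share carry high mutual information and low scores, while channel-specific words keep scores near the maximum.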
- The score calculated in this way varies depending on the reliability values set in the speech recognition text data and the call memo text data. That is, the score varies depending on recognition errors in the speech recognition.
- the method for calculating the score S call (w i , l) and the score S Memo (w j , l) is not limited to the above calculation method.
- The score S call (w i , l) and the score S Memo (w j , l) may be calculated by any method, as long as the resulting values can be used to determine whether a word is a unique part.
- the unique part determination unit 17 compares the score S call (w i , l) and the score S Memo (w j , l) with a preset threshold value, and determines whether the part is a unique part.
- The unique part determination unit 17 determines that a word is a unique part when its score is equal to or greater than the threshold value. For example, as shown in FIG. 10, assume that a score is calculated for each word w i constituting the speech recognition text data and each word w j constituting the call memo text data, and that a threshold value of 0.500 is set for both the score S call (w i , l) and the score S Memo (w j , l).
- the unique part determination unit 17 extracts “advertisement” and “white” as unique parts of the speech recognition text data. Further, the unique part determination unit 17 extracts “future”, “color variation”, “increase”, “new”, “addition”, and “examination” as unique parts of the call memo text data.
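- The determination step can be sketched as follows; the threshold 0.500 follows the FIG. 10 example, while the word scores shown are hypothetical stand-ins for the values in the figure.

```python
# Words whose score is at or above the preset threshold are determined
# to be unique parts (threshold 0.500 as in the FIG. 10 example).
SCORE_THRESHOLD = 0.500

def determine_unique_parts(scores):
    """scores: dict mapping word -> score S(w, l) for one record."""
    return [w for w, s in scores.items() if s >= SCORE_THRESHOLD]

# Hypothetical speech-recognition-side scores; "advertisement" and "white"
# clear the threshold, as in the FIG. 10 result.
call_scores = {"advertisement": 0.8, "white": 0.65, "tomorrow": 0.2}
print(determine_unique_parts(call_scores))  # ['advertisement', 'white']
```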
- the size of the threshold used for determining the unique portion is not particularly limited, and may be appropriately selected based on the result of the text mining process.
- the threshold value is set based on an experimental result obtained in advance.
- The threshold value in this case can also be set in the same manner as the threshold for the unique reliability in Embodiment 1. That is, using speech data in which the unique part is set in advance and text data in which the unique part is set in advance as experimental data, the score S call (w i , l) and the score S Memo (w j , l) are calculated. Then, the threshold value is set so that the unique parts preset in each data are extracted. In this case, a threshold value can be set for each score. In order to increase the reliability of the set threshold value, it is preferable to prepare as much experimental data as possible.
- the unique part determination unit 17 determines the word w i (unique partial element w i ) determined as the unique part of the speech recognition text data and the word w j (unique partial element w j ) determined as the unique part of the call memo text data. ) Is input to the inherent reliability setting unit 7.
- The unique reliability setting unit 7 functions in the same manner as in Embodiment 1 except for the process of setting the reliability R Memo (w j , l), and sets the unique reliability C Call (w i , l) and the unique reliability C Memo (w j , l) for each unique part.
- the mining processing unit 8 also functions in the same manner as in the first embodiment and executes mining.
- FIG. 11 is a flowchart showing a flow of processing performed by the text mining method according to Embodiment 2 of the present invention.
- the text mining method according to the second embodiment can be implemented by operating the text mining device 20 shown in FIG. For this reason, hereinafter, the text mining method according to the second embodiment will be described together with the description of the operation of the text mining apparatus 20 shown in FIG. 9 with appropriate reference to FIGS. 9 and 10.
- First, language processing is performed by the language processing unit 5 (step A11), speech recognition is performed by the speech recognition unit 3 (step A12), and the reliability R Call (w i , l) is calculated by the reliability setting unit 4 (step A13).
- Steps A11 to A13 are the same as steps A1 to A3 shown in FIG. 8 in the first embodiment.
- Next, the frequency calculation unit 15 uses the word string output from the language processing unit 5 to set the reliability R Memo (w j , l) for each word w j constituting the call memo text data (step A14).
- Step A14 is performed by the same processing as Step A5 shown in FIG. 8 in the first embodiment.
- In step A15, the frequency calculation unit 15 obtains the appearance frequencies N Call (w i ) and N Memo (w j ) and the co-occurrence frequency N Call, Memo (w i , w j ) over all records (record (1) to record (L)) from the reliability R Call (w i , l) of each word w i and the reliability R Memo (w j , l) of each word w j (step A15).
- In step A15, the above equations (Equation 4) to (Equation 6) are used.
- Next, the score calculation unit 16 calculates the scores S call (w i , l) and S Memo (w j , l) using the appearance frequencies N Call (w i ) and N Memo (w j ) and the co-occurrence frequency N Call, Memo (w i , w j ) (step A16).
- the scores S call (w i , l) and S Memo (w j , l) are calculated for each of the records (1) to (L).
- Specifically, in step A16, the score calculation unit 16 calculates the mutual information I (w i ; w j ) using the above equations (Equation 7) to (Equation 10), and then applies it to the above equations (Equation 11) and (Equation 12). As a result of step A16, the data shown in FIG. 10 are obtained.
- Next, the unique part determination unit 17 determines, for each word of records (1) to (L), whether the corresponding score S call (w i , l) or score S Memo (w j , l) is equal to or greater than the preset threshold value, and determines that a word at or above the threshold is a unique part (step A17). Information specifying the words determined to be unique parts in step A17 is sent to the unique reliability setting unit 7.
- Step A18 is the same as step A6 shown in FIG. 8 in the first embodiment.
- Thereafter, the mining processing unit 8 performs the mining process (step A19), and the mining result output unit 12 outputs the mining result (step A20).
- the program according to the second embodiment may be a program including instructions that cause a computer to execute steps A11 to A20 shown in FIG.
- the text mining device 20 can be realized by installing the program according to the second embodiment on a computer and executing the program.
- the CPU (Central Processing Unit) of the computer functions as the speech recognition unit 3, the language processing unit 5, the specific part extraction unit 6, and the mining processing unit 8, and performs the processing of Step A11 to Step A20.
- The program according to the second embodiment is supplied in a state stored in a computer-readable recording medium, for example, an optical disk, a magnetic disk, a magneto-optical disk, a semiconductor memory, or a floppy disk, or via a network.
- FIG. 12 is a block diagram showing a schematic configuration of the text mining device according to Embodiment 3 of the present invention.
- the text mining device 21 shown in FIG. 12 uses voice recognition text data and text data (character recognition text data) obtained by performing character recognition as mining targets. For this reason, the text mining device 21 receives image data D4 of a document output from an optical reading device such as a scanner.
- a character recognition unit 22 is provided.
- the text mining device 21 is applied to a call center as in the first embodiment.
- the document that is the basis of the image data D4 (l) corresponding to each record l is a memo created by handwriting by an operator, a FAX sent from a customer, or the like.
- The character recognition unit 22 performs character recognition on each image data D4 (l) corresponding to each record l, and generates character recognition text data corresponding to each record l. Moreover, the character recognition unit 22 extracts the words constituting the character recognition text data and sets a reliability for each word.
- the reliability in this case may be an index indicating whether or not the words constituting the character recognition text data are correct as the recognition result of the input image.
- As this reliability, for example, the posterior probability of each word given the input image data D4 (l), or given character recognition features observed from the input image data D4 (l), can be used. Specifically, the "estimated posterior probability" disclosed in Non-Patent Document 3 described above can be used as the posterior probability in this case.
- The text mining device 21 is configured in the same manner as the text mining device 1 shown in FIG. 1 except for the points described above. Therefore, the data input unit 2, the speech recognition unit 3, the unique part extraction unit 6, the unique reliability setting unit 7, and the mining processing unit 8 function in the same manner as in Embodiment 1. In Embodiment 3, the extraction of the unique parts and the setting of the unique reliabilities are performed on the speech recognition text data and the character recognition text data, and then cross-channel mining is executed.
- Character recognition may instead be performed in advance by a character recognition device external to the text mining device 21, and character recognition text data in which a reliability is set for each word may be input to the text mining device 21. In that case, the text mining device 21 does not need to include the character recognition unit 22, and the character recognition text data is input to the unique part extraction unit 6 via the data input unit 2.
- FIG. 13 is a flowchart showing a flow of processing performed by the text mining method according to Embodiment 3 of the present invention.
- the text mining method according to the third embodiment can be implemented by operating the text mining device 21 shown in FIG.
- the text mining method according to the third embodiment will be described together with the description of the operation of the text mining device 21 shown in FIG. 12 with appropriate reference to FIG.
- First, call voice data D1 (l), image data D4 (l), and incidental information D3 (l) are input to the data input unit 2 of the text mining device 21, one set for each of records (1) to (L).
- the character recognition unit 22 performs character recognition on each of the image data D4 (l) corresponding to each record l (step A21).
- As a result, character recognition text data is generated from the image data D4 (l); furthermore, the words w j constituting the character recognition text data are extracted, and a reliability is set for each word w j .
- Next, speech recognition text data is generated by the speech recognition unit 3 (step A22), and the reliability R Call (w i , l) is calculated by the reliability setting unit 4 (step A23).
- Step A22 and step A23 are the same as steps A2 and A3 shown in FIG. 8 in Embodiment 1, respectively.
- When character recognition text data created in advance is input to the text mining device 21, step A21 is omitted.
- Similarly, when speech recognition text data created in advance is input, steps A22 and A23 are also omitted. Otherwise, steps A22 and A23 may be executed before step A21, or may be executed simultaneously with step A21.
- the unique part extraction unit 6 extracts the unique part element w i and the unique part element w j (step A24).
- Next, the unique reliability setting unit 7 sets the unique reliability for the unique partial elements w i and the unique reliability for the unique partial elements w j (step A25).
- Step A24 and step A25 are the same steps as steps A4 and A6 shown in FIG. 8, respectively. However, in setting the inherent reliability in step A25, the reliability set in step A21 is used.
- Thereafter, the mining processing unit 8 performs the mining process (step A26), and the mining result output unit 12 outputs the mining result (step A27).
- In Embodiment 3, the unique reliability is thus set for the unique part of the speech recognition text data and the unique part of the character recognition text data. Accordingly, even when one of the mining targets is character recognition text data, the influence on the mining result of recognition errors occurring during character recognition can be suppressed.
- the program according to the third embodiment may be a program including instructions that cause a computer to execute steps A21 to A27 shown in FIG.
- The text mining device 21 can be realized by installing the program according to the third embodiment on a computer and executing the program.
- The CPU (Central Processing Unit) of the computer functions as the speech recognition unit 3, the character recognition unit 22, the unique part extraction unit 6, and the mining processing unit 8, and performs the processing of steps A21 to A27.
- The program according to the third embodiment is supplied in a state stored in a computer-readable recording medium, for example, an optical disk, a magnetic disk, a magneto-optical disk, a semiconductor memory, or a floppy disk, or via a network.
- Embodiments 1 to 3 show examples in which the text mining device is applied to a call center, but the application example of the text mining device is not limited to this.
- For example, the text mining device can be applied to analyzing a corporate image from content reported on television or radio, or to analyzing the contents of conversations in settings such as meetings.
- the extraction of the unique part is performed on two text data, but the present invention is not limited to this.
- the extraction of the unique part may be performed on three or more text data.
- Embodiments 1 and 2 described above describe an example in which a set of speech recognition text data that may contain an error and call memo text data that does not contain an error is targeted for text mining.
- Embodiment 3 described above describes an example in which a combination of speech recognition text data that may contain an error and character recognition text data that may also contain an error is targeted for text mining.
- the present invention is not limited to the case where the above combination is targeted, and text mining can be executed for combinations other than the above combinations.
- In the present invention, a unique reliability is set for each unique part of a plurality of text data, and text mining is performed using it. Therefore, combinations other than those described above, for example a combination of speech recognition text data for an operator's call voice and speech recognition text data for a customer's call voice, can also be targets of text mining.
- The text data subjected to text mining may also be text data other than speech recognition text data, text data generated by key input (call memo text data), and character recognition text data. The present invention can be applied to any text data from which the constituent words can be extracted and for which a reliability can be set for each word. A specific example is text data obtained by machine translation.
- the text mining apparatus, text mining method, and computer-readable recording medium according to the present invention have the following characteristics.
- (1) A text mining device that executes text mining on a plurality of text data including text data generated by computer processing, wherein a reliability is set for each of the plurality of text data, the device comprising: a unique part extraction unit that extracts, for each of the plurality of text data, the unique part of that text data with respect to the other text data; a unique reliability setting unit that, using the reliability set for each of the plurality of text data, sets for each unique part of each text data with respect to the other text data a unique reliability indicating the reliability of that unique part; and a mining processing unit that executes text mining on each unique part of each text data with respect to the other text data using the unique reliability.
- (2) In the text mining device according to (1), the reliability is set to a numerical value of 1 or less for each of the plurality of text data, and when the unique reliability setting unit sets the unique reliability for a unique part of one text data with respect to another text data, the unique reliability is set by multiplying a value obtained by subtracting the reliability set in the other text data from 1 by the reliability set in the one text data.
- (3) The text mining device according to (1), wherein the unique part extraction unit extracts, for each text data, the words that do not match the words constituting the other text data from the word group constituting that text data, and sets the extracted words as the unique part of that text data with respect to the other text data.
- (4) The text mining device according to (1), wherein the unique part extraction unit uses the reliability set for each of the plurality of text data to calculate, for each text data, the degree to which each word constituting the text data corresponds to the unique part of that text data with respect to the other text data, and extracts the unique part of each text data with respect to the other text data based on the calculated degree.
- (5) The text mining device according to (1), wherein text data generated by speech recognition is used as the text data generated by the computer processing, and a reliability setting unit is further provided that sets the reliability of the text data generated by the speech recognition using a word graph or N-best word strings obtained at the time of the speech recognition.
- (6) A text mining method for executing text mining on a plurality of text data including text data generated by computer processing, comprising the steps of: (A) setting a reliability for each of the plurality of text data; (B) extracting, for each of the plurality of text data, the unique part of that text data with respect to the other text data; (C) using the reliability set for each of the plurality of text data, setting for each unique part of each text data with respect to the other text data a unique reliability indicating the reliability of that unique part; and (D) executing text mining on each unique part of each text data with respect to the other text data using the unique reliability.
- (7) In the text mining method according to (6), the reliability is set to a numerical value of 1 or less for each of the plurality of text data, and when the unique reliability is set for a unique part of one text data with respect to another text data, the unique reliability is set by multiplying a value obtained by subtracting the reliability set in the other text data from 1 by the reliability set in the one text data.
- (8) In the text mining method according to (6), for each text data, words that do not match the words constituting the other text data are extracted from the word group constituting that text data, and the extracted words are set as the unique part of that text data with respect to the other text data.
- (9) In the text mining method according to (6), the degree to which each word constituting each text data corresponds to the unique part of that text data with respect to the other text data is calculated, and the unique part of each text data with respect to the other text data is extracted based on the calculated degree.
- (10) The text mining method according to (6), wherein text data generated by speech recognition is used as the text data generated by the computer processing, and the method further comprises a step of setting the reliability of the text data generated by the speech recognition using a word graph or N-best word strings obtained at the time of the speech recognition.
- (11) A computer-readable recording medium storing a program for executing, using a computer device, text mining on a plurality of text data including text data generated by computer processing, the program including instructions that cause the computer device to execute the steps of: (A) setting a reliability for each of the plurality of text data; (B) extracting, for each of the plurality of text data, the unique part of that text data with respect to the other text data; (C) using the reliability set for each of the plurality of text data, setting for each unique part of each text data with respect to the other text data a unique reliability indicating the reliability of that unique part; and (D) executing text mining on each unique part of each text data with respect to the other text data using the unique reliability.
- The computer-readable recording medium according to (11), wherein the reliability is set to a numerical value of 1 or less for each of the plurality of text data, and, when setting the unique reliability for a unique portion of one text data with respect to another text data, the unique reliability is set by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
- The computer-readable recording medium according to (11), wherein, for each text data, a degree to which each word constituting that text data corresponds to a unique portion of that text data with respect to the other text data is calculated, and the unique portion of each text data with respect to the other text data is extracted based on the calculated degree.
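Read as a formula, the weighting rule in the claims above reduces to a one-line computation. A minimal sketch in Python (the function name and the example values are illustrative, not taken from the publication):

```python
def unique_reliability(r_own: float, r_other: float) -> float:
    """Unique reliability for a unique portion of one text with respect to
    another: the reliability of the owning text multiplied by (1 minus the
    reliability of the other text), per the multiplication rule in the claims."""
    if not (0.0 <= r_own <= 1.0 and 0.0 <= r_other <= 1.0):
        raise ValueError("reliabilities must be numerical values between 0 and 1")
    return r_own * (1.0 - r_other)

# e.g. a speech-recognized transcript (r = 0.9) versus another text (r = 0.7):
# a word unique to the transcript is weighted 0.9 * (1 - 0.7)
```

One reading of the rule: the more reliable the other text is, the more likely it is that a word missing from it is a recognition error rather than genuine content, so the mismatch is discounted accordingly.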
- DESCRIPTION OF SYMBOLS: 1 Text mining device (Embodiment 1); 2 Data input unit; 3 Speech recognition unit; 4 Reliability setting unit; 5 Language processing unit; 6 Unique-portion extraction unit; 7 Unique-reliability setting unit; 8 Mining processing unit; 9 Mining processing management unit; 10 Feature word counting unit; 11 Feature degree calculation unit; 12 Mining result output unit; 13 Input device; 14 Output device; 15 Frequency calculation unit; 16 Score calculation unit; 17 Unique-portion determination unit; 20 Text mining device (Embodiment 2); 21 Text mining device (Embodiment 3); 22 Character recognition unit; D1(l) Call voice data; D2(l) Call memo text data; D3(l) Additional information; D4(l) Image data
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
Description
A reliability is set for each of the plurality of text data, and the device includes:
a unique-portion extraction unit that, for each of the plurality of text data, extracts a unique portion of each text data with respect to the other text data;
a unique-reliability setting unit that, using the reliability set for each of the plurality of text data, sets, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and
a mining processing unit that, using the unique reliability, executes text mining on each unique portion of each text data with respect to the other text data; the device is characterized by comprising these units.
(a) setting a reliability for each of the plurality of text data;
(b) for each of the plurality of text data, extracting a unique portion of each text data with respect to the other text data;
(c) using the reliability set for each of the plurality of text data, setting, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and
(d) executing, using the unique reliability, text mining on each unique portion of each text data with respect to the other text data; the method is characterized by comprising these steps.
The program causes the computer device to execute:
(a) setting a reliability for each of the plurality of text data;
(b) for each of the plurality of text data, extracting a unique portion of each text data with respect to the other text data;
(c) using the reliability set for each of the plurality of text data, setting, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and
(d) executing, using the unique reliability, text mining on each unique portion of each text data with respect to the other text data; the recording medium is characterized by recording a program including instructions for executing these steps.
Hereinafter, a text mining device, a text mining method, and a program according to Embodiment 1 of the present invention will be described with reference to FIGS. 1 to 8. First, the configuration of the text mining device according to Embodiment 1 of the present invention will be described with reference to FIGS. 1 to 7.
Next, a text mining device, a text mining method, and a program according to Embodiment 2 of the present invention will be described with reference to FIGS. 9 to 11. First, the configuration of the text mining device according to Embodiment 2 of the present invention will be described with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the schematic configuration of the text mining device according to Embodiment 2 of the present invention. FIG. 10 is a diagram showing an example of unique portions extracted by the text mining device according to Embodiment 2 of the present invention.
Next, a text mining device, a text mining method, and a program according to Embodiment 3 of the present invention will be described with reference to FIGS. 12 and 13. First, the configuration of the text mining device according to Embodiment 3 of the present invention will be described with reference to FIG. 12. FIG. 12 is a block diagram showing the schematic configuration of the text mining device according to Embodiment 3 of the present invention.
A reliability is set for each of the plurality of text data, and the device includes:
a unique-portion extraction unit that, for each of the plurality of text data, extracts a unique portion of each text data with respect to the other text data;
a unique-reliability setting unit that, using the reliability set for each of the plurality of text data, sets, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and
a mining processing unit that, using the unique reliability, executes text mining on each unique portion of each text data with respect to the other text data. A text mining device characterized by comprising these units.
The text mining device according to (1) above, wherein the unique-reliability setting unit, when setting the unique reliability for a unique portion of one text data with respect to another text data, sets the unique reliability by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
The text mining device according to (1) above, wherein the unique portion of each text data with respect to the other text data is extracted based on the calculated degree.
The text mining device according to (1) above, further comprising a reliability setting unit that sets a reliability for the text data generated by the speech recognition, using a word graph or an N-best word string obtained during the speech recognition.
(a) setting a reliability for each of the plurality of text data;
(b) for each of the plurality of text data, extracting a unique portion of each text data with respect to the other text data;
(c) using the reliability set for each of the plurality of text data, setting, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and
(d) executing, using the unique reliability, text mining on each unique portion of each text data with respect to the other text data. A text mining method characterized by comprising these steps.
The text mining method according to (6) above, wherein, in step (c), when setting the unique reliability for a unique portion of one text data with respect to another text data, the unique reliability is set by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
The text mining method according to (6) above, wherein the unique portion of each text data with respect to the other text data is extracted based on the calculated degree.
The text mining method according to (6) above, further comprising a step of setting a reliability for the text data generated by the speech recognition, using a word graph or an N-best word string obtained during the speech recognition.
The program causes the computer device to execute:
(a) setting a reliability for each of the plurality of text data;
(b) for each of the plurality of text data, extracting a unique portion of each text data with respect to the other text data;
(c) using the reliability set for each of the plurality of text data, setting, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and
(d) executing, using the unique reliability, text mining on each unique portion of each text data with respect to the other text data. A computer-readable recording medium recording a program including instructions for executing these steps.
The computer-readable recording medium according to (11) above, wherein, in step (c), when setting the unique reliability for a unique portion of one text data with respect to another text data, the unique reliability is set by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
The computer-readable recording medium according to (11) above, wherein the unique portion of each text data with respect to the other text data is extracted based on the calculated degree.
The computer-readable recording medium according to (11) above, wherein the program further includes instructions that cause the computer device to execute a step of setting a reliability for the text data generated by the speech recognition, using a word graph or an N-best word string obtained during the speech recognition.
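The mismatch-based unique-portion extraction and the reliability-weighted mining step described above can be sketched together in a few lines of Python. This is an illustrative stand-in, not the publication's implementation: the function names and tokenized inputs are invented, and simple weighted frequency counting is used where the publication leaves the mining algorithm open.

```python
from collections import Counter

def unique_parts(words_a, words_b):
    """Words of text A that match no word of text B -- the simple
    mismatch-based unique-portion extraction."""
    vocab_b = set(words_b)
    return [w for w in words_a if w not in vocab_b]

def mine_unique(words_a, r_a, words_b, r_b):
    """Frequency counts over A's unique portion with respect to B, each
    occurrence weighted by the unique reliability r_a * (1 - r_b)."""
    weight = r_a * (1.0 - r_b)
    counts = Counter(unique_parts(words_a, words_b))
    return {word: n * weight for word, n in counts.items()}

# Speech-recognized transcript (reliability 0.8) vs. a call memo (0.6):
transcript = ["battery", "drains", "fast", "overnight"]
memo = ["battery", "drains"]
scores = mine_unique(transcript, 0.8, memo, 0.6)
# "fast" and "overnight" are each counted with weight 0.8 * (1 - 0.6)
```

The same call with the roles swapped yields the memo's unique portion with weight 0.6 * (1 - 0.8), so each text contributes its unique content at a strength reflecting both texts' reliabilities.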
2 Data input unit
3 Speech recognition unit
4 Reliability setting unit
5 Language processing unit
6 Unique-portion extraction unit
7 Unique-reliability setting unit
8 Mining processing unit
9 Mining processing management unit
10 Feature word counting unit
11 Feature degree calculation unit
12 Mining result output unit
13 Input device
14 Output device
15 Frequency calculation unit
16 Score calculation unit
17 Unique-portion determination unit
20 Text mining device (Embodiment 2)
21 Text mining device (Embodiment 3)
22 Character recognition unit
D1(l) Call voice data
D2(l) Call memo text data
D3(l) Additional information
D4(l) Image data
Claims (15)
- A text mining device that executes text mining on a plurality of text data including text data generated by computer processing, wherein a reliability is set for each of the plurality of text data, the device comprising: a unique-portion extraction unit that, for each of the plurality of text data, extracts a unique portion of each text data with respect to the other text data; a unique-reliability setting unit that, using the reliability set for each of the plurality of text data, sets, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and a mining processing unit that, using the unique reliability, executes text mining on each unique portion of each text data with respect to the other text data.
- The text mining device according to claim 1, wherein the reliability is set to a numerical value of 1 or less for each of the plurality of text data, and the unique-reliability setting unit, when setting the unique reliability for a unique portion of one text data with respect to another text data, sets the unique reliability by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
- The text mining device according to claim 1 or 2, wherein the unique-portion extraction unit extracts, for each text data, from among the words constituting that text data, words that do not match words constituting the other text data, and takes the extracted words as the unique portion of that text data with respect to the other text data.
- The text mining device according to claim 1 or 2, wherein the unique-portion extraction unit, using the reliability set for each of the plurality of text data, calculates, for each text data, a degree to which each word constituting that text data corresponds to a unique portion of that text data with respect to the other text data, and extracts the unique portion of each text data with respect to the other text data based on the calculated degree.
- The text mining device according to any one of claims 1 to 4, wherein text data generated by speech recognition is used as the text data generated by the computer processing, and the device further comprises a reliability setting unit that sets a reliability for the text data generated by the speech recognition, using a word graph or an N-best word string obtained during the speech recognition.
- A text mining method for executing text mining on a plurality of text data including text data generated by computer processing, comprising: (a) setting a reliability for each of the plurality of text data; (b) for each of the plurality of text data, extracting a unique portion of each text data with respect to the other text data; (c) using the reliability set for each of the plurality of text data, setting, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and (d) executing, using the unique reliability, text mining on each unique portion of each text data with respect to the other text data.
- The text mining method according to claim 6, wherein, in step (a), the reliability is set to a numerical value of 1 or less for each of the plurality of text data, and, in step (c), when setting the unique reliability for a unique portion of one text data with respect to another text data, the unique reliability is set by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
- The text mining method according to claim 6 or 7, wherein, in step (b), for each text data, words that do not match words constituting the other text data are extracted from among the words constituting that text data, and the extracted words are taken as the unique portion of that text data with respect to the other text data.
- The text mining method according to claim 6 or 7, wherein, in step (b), using the reliability set for each of the plurality of text data, a degree to which each word constituting each text data corresponds to a unique portion of that text data with respect to the other text data is calculated, and the unique portion of each text data with respect to the other text data is extracted based on the calculated degree.
- The text mining method according to any one of claims 6 to 9, wherein text data generated by speech recognition is used as the text data generated by the computer processing, and the method further comprises a step of setting a reliability for the text data generated by the speech recognition, using a word graph or an N-best word string obtained during the speech recognition.
- A computer-readable recording medium recording a program for executing, using a computer device, text mining on a plurality of text data including text data generated by computer processing, the program including instructions that cause the computer device to execute: (a) setting a reliability for each of the plurality of text data; (b) for each of the plurality of text data, extracting a unique portion of each text data with respect to the other text data; (c) using the reliability set for each of the plurality of text data, setting, for each unique portion of each text data with respect to the other text data, a unique reliability indicating the reliability of that unique portion; and (d) executing, using the unique reliability, text mining on each unique portion of each text data with respect to the other text data.
- The computer-readable recording medium according to claim 11, wherein, in step (a), the reliability is set to a numerical value of 1 or less for each of the plurality of text data, and, in step (c), when setting the unique reliability for a unique portion of one text data with respect to another text data, the unique reliability is set by multiplying the reliability set for the one text data by a value obtained by subtracting the reliability set for the other text data from 1.
- The computer-readable recording medium according to claim 11 or 12, wherein, in step (b), for each text data, words that do not match words constituting the other text data are extracted from among the words constituting that text data, and the extracted words are taken as the unique portion of that text data with respect to the other text data.
- The computer-readable recording medium according to claim 11 or 12, wherein, in step (b), using the reliability set for each of the plurality of text data, a degree to which each word constituting each text data corresponds to a unique portion of that text data with respect to the other text data is calculated, and the unique portion of each text data with respect to the other text data is extracted based on the calculated degree.
- The computer-readable recording medium according to any one of claims 11 to 14, wherein text data generated by speech recognition is used as the text data generated by the computer processing, and the program further includes instructions that cause the computer device to execute a step of setting a reliability for the text data generated by the speech recognition, using a word graph or an N-best word string obtained during the speech recognition.
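Claims 5, 10, and 15 leave open how a reliability is derived from the word graph or N-best word strings. One common approximation is sketched below, under the assumption that per-word confidence is the fraction of N-best hypotheses containing the word; this is an illustrative choice, not the publication's prescribed method, and all names are invented.

```python
def word_confidences(nbest):
    """Per-word confidence from an N-best list: the fraction of hypotheses
    in which the word occurs, a crude stand-in for posterior probability."""
    n = len(nbest)
    vocab = {w for hyp in nbest for w in hyp}
    return {w: sum(w in hyp for hyp in nbest) / n for w in vocab}

def text_reliability(nbest):
    """Reliability of the top hypothesis: the mean confidence of its words.
    The result is at most 1, matching the claims' requirement that the
    reliability be a numerical value of 1 or less."""
    conf = word_confidences(nbest)
    top = nbest[0]
    return sum(conf[w] for w in top) / len(top)

nbest = [
    ["battery", "drains", "fast"],   # top hypothesis
    ["battery", "trains", "fast"],
    ["battery", "drains", "past"],
]
r = text_reliability(nbest)  # (3/3 + 2/3 + 2/3) / 3
```

A word graph would support the same idea with finer-grained posterior estimates; the N-best variant is shown here only because it is compact.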
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/060,587 US8380741B2 (en) | 2008-08-29 | 2009-08-28 | Text mining apparatus, text mining method, and computer-readable recording medium |
JP2010526564A JP5472641B2 (ja) | 2008-08-29 | 2009-08-28 | Text mining device, text mining method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008222455 | 2008-08-29 | ||
JP2008-222455 | 2008-08-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010023939A1 true WO2010023939A1 (ja) | 2010-03-04 |
Family
ID=41721120
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/004211 WO2010023939A1 (ja) | 2008-08-29 | 2009-08-28 | テキストマイニング装置、テキストマイニング方法、及びコンピュータ読み取り可能な記録媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US8380741B2 (ja) |
JP (1) | JP5472641B2 (ja) |
WO (1) | WO2010023939A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2017090731A (ja) * | 2015-11-12 | 2017-05-25 | Nippon Telegraph And Telephone Corporation | Speech recognition result compression device, speech recognition result compression method, and program |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9400790B2 (en) * | 2009-12-09 | 2016-07-26 | At&T Intellectual Property I, L.P. | Methods and systems for customized content services with unified messaging systems |
US8538987B1 (en) * | 2011-06-07 | 2013-09-17 | Sprint Communications Company L.P. | Care agent call classification |
US10902481B1 (en) * | 2017-05-23 | 2021-01-26 | Walgreen Co. | Method and system for providing a seamless handoff from a voice channel to a call agent |
US20200137224A1 (en) * | 2018-10-31 | 2020-04-30 | International Business Machines Corporation | Comprehensive log derivation using a cognitive system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007026347A (ja) * | 2005-07-21 | 2007-02-01 | Nec Corp | Text mining device, text mining method, and text mining program |
WO2007066704A1 (ja) * | 2005-12-09 | 2007-06-14 | Nec Corporation | Text mining device, text mining method, and text mining program |
WO2007138872A1 (ja) * | 2006-05-26 | 2007-12-06 | Nec Corporation | Text mining device, text mining method, and text mining program |
JP2008039983A (ja) * | 2006-08-03 | 2008-02-21 | Nec Corp | Text mining device, text mining method, and text mining program |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6185527B1 (en) * | 1999-01-19 | 2001-02-06 | International Business Machines Corporation | System and method for automatic audio content analysis for word spotting, indexing, classification and retrieval |
JP2001101194A (ja) | 1999-09-27 | 2001-04-13 | Mitsubishi Electric Corp | テキストマイニング方法、テキストマイニング装置及びテキストマイニングプログラムが記録された記録媒体 |
US6973428B2 (en) * | 2001-05-24 | 2005-12-06 | International Business Machines Corporation | System and method for searching, analyzing and displaying text transcripts of speech after imperfect speech recognition |
US7010515B2 (en) * | 2001-07-12 | 2006-03-07 | Matsushita Electric Industrial Co., Ltd. | Text comparison apparatus |
JP3955522B2 (ja) | 2002-11-11 | 2007-08-08 | 株式会社ジャストシステム | データ分析装置及び方法、並びにプログラム |
JP2004178123A (ja) * | 2002-11-26 | 2004-06-24 | Hitachi Ltd | 情報処理装置、該情報処理装置を実現するためのプログラム |
US20050283357A1 (en) * | 2004-06-22 | 2005-12-22 | Microsoft Corporation | Text mining method |
US7461056B2 (en) * | 2005-02-09 | 2008-12-02 | Microsoft Corporation | Text mining apparatus and associated methods |
- 2009
- 2009-08-28 US US13/060,587 patent/US8380741B2/en active Active
- 2009-08-28 WO PCT/JP2009/004211 patent/WO2010023939A1/ja active Application Filing
- 2009-08-28 JP JP2010526564A patent/JP5472641B2/ja active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007026347A (ja) * | 2005-07-21 | 2007-02-01 | Nec Corp | Text mining device, text mining method, and text mining program |
WO2007066704A1 (ja) * | 2005-12-09 | 2007-06-14 | Nec Corporation | Text mining device, text mining method, and text mining program |
WO2007138872A1 (ja) * | 2006-05-26 | 2007-12-06 | Nec Corporation | Text mining device, text mining method, and text mining program |
JP2008039983A (ja) * | 2006-08-03 | 2008-02-21 | Nec Corp | Text mining device, text mining method, and text mining program |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2017090731A (ja) * | 2015-11-12 | 2017-05-25 | Nippon Telegraph And Telephone Corporation | Speech recognition result compression device, speech recognition result compression method, and program |
Also Published As
Publication number | Publication date |
---|---|
JPWO2010023939A1 (ja) | 2012-01-26 |
US8380741B2 (en) | 2013-02-19 |
US20110161367A1 (en) | 2011-06-30 |
JP5472641B2 (ja) | 2014-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5472640B2 (ja) | Text mining device, text mining method, and program | |
US10515156B2 (en) | Human-to-human conversation analysis | |
US8750489B2 (en) | System and method for automatic call segmentation at call center | |
CN102906735B (zh) | Voice-stream-enhanced note taking | |
US9014363B2 (en) | System and method for automatically generating adaptive interaction logs from customer interaction text | |
JP5440815B2 (ja) | Information analysis device, information analysis method, and program | |
US9904927B2 (en) | Funnel analysis | |
JP5496863B2 (ja) | Emotion estimation device, method therefor, program, and recording medium therefor | |
JP4453687B2 (ja) | Text mining device, text mining method, and text mining program | |
JP5472641B2 (ja) | Text mining device, text mining method, and program | |
Kopparapu | Non-linguistic analysis of call center conversations | |
WO2023124647A1 (zh) | Method for determining meeting minutes and related device | |
WO2011071174A1 (ja) | Text mining method, text mining device, and text mining program | |
Camelin et al. | Detection and interpretation of opinion expressions in spoken surveys | |
JP2012168669A (ja) | Interview support device, method, and program | |
JP6743108B2 (ja) | Pattern recognition model and pattern learning device, generation method thereof, FAQ extraction method and pattern recognition device using the same, and program | |
Boulis et al. | The role of disfluencies in topic classification of human-human conversations | |
Ikbal et al. | Intent focused summarization of caller-agent conversations | |
Renard et al. | A Tool to Evaluate Error Correction Resources and Processes Suited for Documents Improvement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09809594 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010526564 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13060587 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09809594 Country of ref document: EP Kind code of ref document: A1 |