WO2011071174A1 - Text mining method, text mining device, and text mining program - Google Patents

Text mining method, text mining device, and text mining program

Info

Publication number
WO2011071174A1
Authority
WO
WIPO (PCT)
Prior art keywords
topic
text
degree
feature
unit
Prior art date
Application number
PCT/JP2010/072310
Other languages
English (en)
Japanese (ja)
Inventor
晃裕 田村
開 石川
真一 安藤
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2011545274A priority Critical patent/JPWO2011071174A1/ja
Priority to US13/511,504 priority patent/US9135326B2/en
Publication of WO2011071174A1 publication Critical patent/WO2011071174A1/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data

Definitions

  • The present invention relates to a text mining method, a text mining apparatus, and a text mining program that perform text mining on a specific topic.
  • Text mining technology is a technology for analyzing the characteristics and trends of a text set.
  • A system to which text mining technology is applied (hereinafter referred to as a text mining system) calculates the feature degree of each element, such as a word or phrase, in each text of the text set, and identifies characteristic elements of the text set based on those feature degrees.
  • In the following description, the text set whose characteristics and trends are to be examined is referred to as the "focused text set".
  • The text mining system uses, for example, the frequency with which each element appears in the text as the feature degree of that element. In this case, elements that appear frequently in the focused text set are identified as characteristic elements of the focused text set.
  • When text mining is performed on the topic "inquiry content", the analyst first applies the topic analysis system described in Non-Patent Document 2 to each input call text to identify the parts corresponding to the topic "inquiry content". As shown in FIG. 17, the input call text is divided into utterances, and each utterance is given a topic and an identifier (an utterance index). After the topics have been identified by the topic analysis system, the analyst classifies the divided utterances into the portion indicated by utterance indexes "6" to "15", whose topic is "inquiry content", and the remaining portions. By performing text mining on the call text classified in this way, the analyst can analyze the content of the inquiry. The text mining method applied after the topic has been identified is described further below.
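The classification step described above can be sketched as follows. This is an illustrative sketch only: the `(utterance_index, topic, text)` tuple layout and the `split_by_topic` helper are assumptions, not the patent's data format.

```python
# Hypothetical sketch of splitting topic-labeled utterances, as in the
# "inquiry content" example above. The (utterance_index, topic, text)
# layout is assumed for illustration.
def split_by_topic(utterances, target_topic):
    """Partition utterances into those belonging to the target topic and the rest."""
    on_topic = [u for u in utterances if u[1] == target_topic]
    off_topic = [u for u in utterances if u[1] != target_topic]
    return on_topic, off_topic

utterances = [
    (5, "greeting", "Thank you for calling."),
    (6, "inquiry content", "My printer stopped working."),
    (15, "inquiry content", "It shows an error code."),
    (16, "treatment", "Please try restarting it."),
]
on_topic, off_topic = split_by_topic(utterances, "inquiry content")
# Text mining is then applied to the on-topic utterances only.
```
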
  • The data input unit 10 may receive, for each part of the text divided into these units (sometimes referred to as partial text), the topic to which the part belongs and a value indicating the degree to which the part is involved in that topic.
  • The value indicating the degree to which a part is involved in a topic is hereinafter referred to as the "topic participation degree".
  • That is, each part of each text in the input text set (that is, a unit such as a word, phrase, sentence, paragraph, or utterance) may be given topic information consisting of the topic to which the part belongs and its degree of involvement in that topic (the topic participation degree).
  • Here, "a part is involved in a topic" means that the part is associated with that topic.
  • The analysis setting input unit 20 is realized by an input device such as a keyboard, for example.
  • The analysis setting input unit 20 receives information on the topic to be analyzed (the analysis target topic) and on the subset of the input text set whose characteristics and trends are to be examined (hereinafter referred to as the focused text set). The analysis setting input unit 20 then notifies the analysis management unit 41 of this information.
  • The analysis setting input unit 20 may optionally accept a setting indicating that the target of text mining is to be narrowed down within the input text set. In this case, the analysis setting input unit 20 transmits this setting to the analysis management unit 41 described later, and in the subsequent processing the computer 40 operates on the narrowed-down target instead of the entire input text set.
  • The topic participation degree correction feature degree calculation unit 44 also uses the other topic participation degrees (the degrees for topics other than the analysis target topic) to correct the feature degree.
  • The feature degree is a value defined for each element and is an index indicating the degree to which the element appears in the focused text set; in other words, it represents how strongly each element appears in the focused text set.
  • Text 1 has two portions in which the verb "move" appears, and the appearance counts of those portions are corrected to 0.8 and 0.6, respectively.
  • Text 2 has one portion in which the verb "move" appears, and its appearance count is corrected to 0.3.
  • Text 5 has one portion in which the verb "move" appears, and its appearance count is corrected to 0.9.
  • Suppose the total number of words appearing in the texts of the focused text set is 1000.
  • The verb "move" appears four times in the focused text set, so its feature degree can be calculated as 4/1000.
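The arithmetic of this running example can be written out as follows. The corrected counts and the total of 1000 words are taken from the description above; combining the corrected counts by simple summation is an assumption about how the correction enters the feature degree, not a statement of the patent's exact formula.

```python
# Worked version of the example: the verb "move" appears in four portions,
# with topic-participation-corrected appearance counts as listed above.
corrected_counts = [0.8, 0.6, 0.3, 0.9]   # portions in texts 1, 1, 2, and 5
total_words = 1000                        # words in the focused text set

raw_feature = len(corrected_counts) / total_words        # 4/1000, uncorrected
# If the corrected counts are summed (assumption), the corrected
# feature degree becomes 2.6/1000.
corrected_feature = sum(corrected_counts) / total_words
```
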
  • The output unit 30 may output only the elements determined to be characteristic, or may output each such element together with its feature degree. The output unit 30 may also output pairs of an element and its feature degree for all elements, not only for the elements determined to be characteristic. The output unit 30 may output the results in an order rearranged according to the feature degrees, or without rearranging them. Note that, as described above, the analysis setting input unit 20 may accept, as options, settings for the language processing executed by the language processing unit 42 and various settings related to the calculation method executed by the topic participation degree calculation unit 43.
  • The analysis setting input unit 20 may also optionally accept various settings used in the correction executed by the appearance degree calculation unit 45 and various settings used when the feature degree calculation unit 46 calculates feature degrees.
  • The analysis setting input unit 20 may transmit the input information to the analysis management unit 41.
  • The analysis management unit 41 forwards the transmitted information, as appropriate, to each unit (more specifically, the language processing unit 42, the topic participation degree calculation unit 43, the appearance degree calculation unit 45, and the feature degree calculation unit 46), and each unit may use this information when performing its processing.
  • The analysis management unit 41, the language processing unit 42, the topic participation degree calculation unit 43, and the topic participation degree correction feature degree calculation unit 44 may be realized by a central processing unit of the computer 40 that operates according to a program (a text mining program).
  • The program is stored in the storage unit 50, such as a memory or an HDD; the central processing unit reads the program and, according to it, may operate as the analysis management unit 41, the language processing unit 42, the topic participation degree calculation unit 43, and the topic participation degree correction feature degree calculation unit 44 (more specifically, the appearance degree calculation unit 45 and the feature degree calculation unit 46).
  • Alternatively, the analysis management unit 41, the language processing unit 42, the topic participation degree calculation unit 43, and the topic participation degree correction feature degree calculation unit 44 (more specifically, the appearance degree calculation unit 45 and the feature degree calculation unit 46) may each be realized by dedicated hardware. Next, the operation will be described.
  • The data input unit 10 receives, as input, the set of texts to be subjected to text mining according to the embodiment of the present invention (that is, the input text set) (step A1).
  • The analysis setting input unit 20 receives, in accordance with user instructions, the various setting information necessary for performing text mining on the input text set (step A2).
  • The analysis setting input unit 20 may accept, as options, settings for the language processing executed by the language processing unit 42 and various settings related to the calculation method executed by the topic participation degree calculation unit 43.
  • The analysis setting input unit 20 may also optionally accept various settings used in the correction executed by the appearance degree calculation unit 45 and various settings used when the feature degree calculation unit 46 calculates feature degrees.
  • The topic participation degree calculation unit 43 calculates, for each part of each text to be mined, the topic participation degree with respect to the analysis target topic.
  • The topic participation degree calculation unit 43 may also calculate topic participation degrees for topics other than the analysis target topic (step A4).
  • The topic participation degree correction feature degree calculation unit 44 receives, through the analysis management unit 41, the analysis target topic and the information on the focused text set specified by the user in step A2 (that is, the analysis target topic and the focused text set received from the user by the analysis setting input unit 20). The topic participation degree correction feature degree calculation unit 44 then calculates the feature degree of each element with respect to the focused text set.
  • The analysis management unit 41 instructs the language processing unit 42, the topic participation degree calculation unit 43, and the topic participation degree correction feature degree calculation unit 44 (more specifically, the appearance degree calculation unit 45 and the feature degree calculation unit 46).
  • Each component unit executes its processing according to the instructed procedure. By performing processing based on an instruction to repeat the processing in this way, it is possible not only to run a single text mining trial with one analysis axis (such as one focused text set and one analysis target topic) but also to try text mining multiple times while changing the analysis axis.
  • The instruction to repeat the processing need not be an instruction to perform all of steps A1 to A7; it may be an instruction to change the processing according to the progress of the analysis.
  • The unit to which topic information is assigned (that is, the unit into which each text is divided) is not limited to the utterance.
  • The topic information may be given not in utterance units but in word, phrase, sentence, or paragraph units, for example.
  • The part indicated by utterance index "16" is involved in the topic "treatment" with a degree of 0.83.
  • The part indicated by utterance index "20" is involved in the topic "treatment" with a degree of 0.42 and in the topic "contact method" with a degree of 0.35.
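Topic information of this kind might be represented as follows. This is a sketch under the assumption that each utterance index carries a dictionary of topic participation degrees; the names and layout are illustrative, not the patent's format.

```python
# Hypothetical representation of per-utterance topic information with
# participation degrees, mirroring the two utterances described above.
topic_info = {
    16: {"treatment": 0.83},
    20: {"treatment": 0.42, "contact method": 0.35},
}

def participation(utterance_index, topic):
    """Topic participation degree of an utterance; 0.0 when no information is given."""
    return topic_info.get(utterance_index, {}).get(topic, 0.0)
```
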
  • The topic information need not be given for all topics; it may be given for only some topics, as illustrated in FIG. 8.
  • The example of FIG. 8 gives only the information related to the topic "treatment".
  • The input text may also be text to which no topic information is given. The processing described so far corresponds to the processing up to step A1. Subsequently, in order to perform the analysis desired by the user, the analysis setting input unit 20 receives from the user the various information necessary for performing text mining on a certain analysis target topic over the input text set.
  • The analysis setting input unit 20 may present each text of the input text set to the user, recognize the texts the user designates as the focused text set, and treat this as accepting the designation of the focused text set. Specifically, the analysis setting input unit 20 first presents each text of the input text set to the user. When the user designates, for example, "the set of texts handled by operator A" from the presented texts, the analysis setting input unit 20 recognizes the designated texts and sets the focused text set to that set. When analysis is performed by the method exemplified in analysis (2), the analysis setting input unit 20 may receive, as the focused text set, the designation of a set of text portions corresponding to a specific topic.
  • The analysis setting input unit 20 may also optionally accept various settings used in the correction executed by the appearance degree calculation unit 45 and various settings used when the feature degree calculation unit 46 calculates feature degrees.
  • The settings and information received by the analysis setting input unit 20 are transmitted to the analysis management unit 41.
  • Each unit (more specifically, the language processing unit 42, the topic participation degree calculation unit 43, the appearance degree calculation unit 45, and the feature degree calculation unit 46) may receive the various settings from the analysis management unit 41 and use them. Specific examples of the setting items are given below in the description of the processing of each unit in which the settings are used.
  • The analysis setting input unit 20 may optionally accept a setting for narrowing down the text mining target within the input text set.
  • In that case, each process is performed not on the entire input text set but on the narrowed-down text set.
  • In the following, the processing when the text mining target is not narrowed down is described as an example; the processing when the target is narrowed down is the same, except that the processing described for the "input text set" is instead performed on the text set resulting from the narrowing down in step A2.
  • Each element may also be a combination of a plurality of elements.
  • The "n" in "word n-gram" and in "n consecutive dependencies" is a natural number, and may be a manually set value, for example.
  • Specific language processing, such as morphological analysis, syntactic analysis, or dependency analysis, is performed according to the unit of the elements to be generated. For example, when words or word n-grams are included among the element units, the language processing unit 42 performs morphological analysis to generate the elements.
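As a sketch of the element generation described above, word n-grams can be produced from a tokenized text. Whitespace tokenization here is an illustrative stand-in for the morphological analysis an actual implementation would use (for example, for Japanese text).

```python
# Word n-gram generation over pre-tokenized text. "n" is the manually set
# natural number mentioned above; whitespace splitting is an illustrative
# stand-in for real morphological analysis.
def word_ngrams(tokens, n):
    """All contiguous n-token sequences of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the printer stopped working".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
```
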
  • The topic participation degree correction feature degree calculation unit 44 receives, through the analysis management unit 41, the analysis target topic and the information on the focused text set specified by the user. The topic participation degree correction feature degree calculation unit 44 then corrects, for each element generated in step A3, the feature degree with respect to the focused text set according to the topic participation degrees calculated in step A4. When analysis is performed by the method illustrated in analysis (2), the topic participation degree correction feature degree calculation unit 44 may also correct the feature degree using the other topic participation degrees calculated in step A4.
  • When analysis is performed by the method illustrated in analysis (1) (that is, the method that uses only the portions corresponding to the analysis target topic when calculating feature degrees), the feature degree calculation unit 46 may use, of the appearance counts corrected by the appearance degree calculation unit 45 in step A5, only the counts for each element's appearances in the portions corresponding to the analysis target topic.
  • Analysis may also be performed by the method exemplified in analysis (2) (that is, the method that uses, in addition to the portions corresponding to the analysis target topic, the portions corresponding to other topics).
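The difference between the two analysis variants can be sketched as follows. How occurrences from other topics are weighted (here, by a fixed `other_weight` factor) is an assumption for illustration, since the description leaves the exact combination to the implementation.

```python
# Sketch of a corrected appearance count for one element under
# analysis (1) (target-topic portions only) and analysis (2) (portions of
# other topics also contribute). Each occurrence is a (topic of the part,
# topic participation degree) pair; other_weight is an illustrative assumption.
def corrected_count(occurrences, target_topic, use_other_topics=False, other_weight=0.5):
    total = 0.0
    for topic, degree in occurrences:
        if topic == target_topic:
            total += degree
        elif use_other_topics:
            total += other_weight * degree
    return total

occurrences = [("inquiry content", 0.8), ("inquiry content", 0.6), ("treatment", 0.4)]
count_a1 = corrected_count(occurrences, "inquiry content")                         # analysis (1)
count_a2 = corrected_count(occurrences, "inquiry content", use_other_topics=True)  # analysis (2)
```
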

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided are a text mining method, a text mining device, and a text mining program capable of performing text mining on a specific topic with high accuracy. An element identifying means (81) calculates a feature degree, which is an index indicating the degree to which an element of text appears in a focused text set, that is, the text set to be analyzed. An output unit (30) identifies characteristic elements in the focused text set based on the calculated feature degree and outputs the identified elements. The element identifying means (81) corrects the feature degree according to the topic participation degree, which is a value indicating the degree to which each part of the analyzed text, partitioned into predetermined units, is involved in the analysis target topic.
PCT/JP2010/072310 2009-12-10 2010-12-07 Text mining method, text mining device, and text mining program WO2011071174A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2011545274A JPWO2011071174A1 (ja) 2009-12-10 2010-12-07 テキストマイニング方法、テキストマイニング装置及びテキストマイニングプログラム
US13/511,504 US9135326B2 (en) 2009-12-10 2010-12-07 Text mining method, text mining device and text mining program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009-280558 2009-12-10
JP2009280558 2009-12-10

Publications (1)

Publication Number Publication Date
WO2011071174A1 true WO2011071174A1 (fr) 2011-06-16

Family

ID=44145716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/072310 WO2011071174A1 (fr) 2009-12-10 2010-12-07 Text mining method, text mining device, and text mining program

Country Status (3)

Country Link
US (1) US9135326B2 (fr)
JP (1) JPWO2011071174A1 (fr)
WO (1) WO2011071174A1 (fr)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5940135B2 (ja) * 2014-12-02 2016-06-29 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation 話題提示方法、装置及びコンピュータ・プログラム。
JP6794921B2 (ja) * 2017-05-01 2020-12-02 トヨタ自動車株式会社 興味判定装置、興味判定方法、及びプログラム
CN112069394B (zh) * 2020-08-14 2023-09-29 上海风秩科技有限公司 文本信息的挖掘方法及装置
CN112101030B (zh) * 2020-08-24 2024-01-26 沈阳东软智能医疗科技研究院有限公司 建立术语映射模型、实现标准词映射的方法、装置及设备
US11876633B2 (en) * 2022-04-30 2024-01-16 Zoom Video Communications, Inc. Dynamically generated topic segments for a communication session

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008146461A (ja) * 2006-12-12 2008-06-26 Yahoo Japan Corp 会話記録ブログ化装置
JP2008204274A (ja) * 2007-02-21 2008-09-04 Nomura Research Institute Ltd 会話解析装置および会話解析プログラム
JP2008278088A (ja) * 2007-04-27 2008-11-13 Hitachi Ltd 動画コンテンツに関するコメント管理装置

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3579204B2 (ja) * 1997-01-17 2004-10-20 富士通株式会社 文書要約装置およびその方法
US5875446A (en) * 1997-02-24 1999-02-23 International Business Machines Corporation System and method for hierarchically grouping and ranking a set of objects in a query context based on one or more relationships
JP3918374B2 (ja) * 1999-09-10 2007-05-23 富士ゼロックス株式会社 文書検索装置および方法
US7490092B2 (en) * 2000-07-06 2009-02-10 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
US20020065857A1 (en) * 2000-10-04 2002-05-30 Zbigniew Michalewicz System and method for analysis and clustering of documents for search engine
US7269546B2 (en) * 2001-05-09 2007-09-11 International Business Machines Corporation System and method of finding documents related to other documents and of finding related words in response to a query to refine a search
JP2003016106A (ja) 2001-06-29 2003-01-17 Fuji Xerox Co Ltd 関連度値算出装置
US20030204496A1 (en) * 2002-04-29 2003-10-30 X-Mine, Inc. Inter-term relevance analysis for large libraries
US7243105B2 (en) * 2002-12-31 2007-07-10 British Telecommunications Public Limited Company Method and apparatus for automatic updating of user profiles
CN1629834A (zh) * 2003-12-17 2005-06-22 国际商业机器公司 计算机辅助写作、电子文档的浏览、检索和预订发布
US7596571B2 (en) * 2004-06-30 2009-09-29 Technorati, Inc. Ecosystem method of aggregation and search and related techniques
JP2006031198A (ja) 2004-07-14 2006-02-02 Nec Corp テキストマイニング装置及びそれに用いるテキストマイニング方法並びにそのプログラム
US8396864B1 (en) * 2005-06-29 2013-03-12 Wal-Mart Stores, Inc. Categorizing documents
US7739294B2 (en) * 2006-01-12 2010-06-15 Alexander David Wissner-Gross Method for creating a topical reading list
US7769751B1 (en) * 2006-01-17 2010-08-03 Google Inc. Method and apparatus for classifying documents based on user inputs
JP2007241348A (ja) 2006-03-06 2007-09-20 Advanced Telecommunication Research Institute International 用語収集装置、およびプログラム
US8201107B2 (en) * 2006-09-15 2012-06-12 Emc Corporation User readability improvement for dynamic updating of search results
CA2572116A1 (fr) * 2006-12-27 2008-06-27 Ibm Canada Limited - Ibm Canada Limitee Systeme et methode de traitement de communication multimodale dans un groupe de travail
US20090282038A1 (en) * 2008-09-23 2009-11-12 Michael Subotin Probabilistic Association Based Method and System for Determining Topical Relatedness of Domain Names


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016081412A (ja) * 2014-10-21 2016-05-16 日本電気株式会社 情報処理システム、情報処理プログラム、及び、情報処理方法
WO2020044558A1 (fr) * 2018-08-31 2020-03-05 富士通株式会社 Programme de génération de règle de classification, procédé de génération de règle de classification, et dispositif de génération de règle de classification
JPWO2020044558A1 (ja) * 2018-08-31 2021-04-30 富士通株式会社 分類規則生成プログラム、分類規則生成方法および分類規則生成装置
JP7044162B2 (ja) 2018-08-31 2022-03-30 富士通株式会社 分類規則生成プログラム、分類規則生成方法および分類規則生成装置

Also Published As

Publication number Publication date
US9135326B2 (en) 2015-09-15
US20120284016A1 (en) 2012-11-08
JPWO2011071174A1 (ja) 2013-04-22

Similar Documents

Publication Publication Date Title
CN110765244B (zh) 获取应答话术的方法、装置、计算机设备及存储介质
JP5901001B1 (ja) 音響言語モデルトレーニングのための方法およびデバイス
US10303683B2 (en) Translation of natural language questions and requests to a structured query format
US9236047B2 (en) Voice stream augmented note taking
WO2011071174A1 (fr) Text mining method, text mining device, and text mining program
Mairesse et al. Phrase-based statistical language generation using graphical models and active learning
JP5440815B2 (ja) 情報分析装置、情報分析方法、及びプログラム
US10754886B2 (en) Using multiple natural language classifier to associate a generic query with a structured question type
US10748528B2 (en) Language model generating device, language model generating method, and recording medium
WO2010023938A1 (fr) Appareil d'exploration de texte, procédé d'exploration de texte et support d'enregistrement lisible par un ordinateur
US11526512B1 (en) Rewriting queries
JP2012083543A (ja) 言語モデル生成装置、その方法及びそのプログラム
CN111326144B (zh) 语音数据处理方法、装置、介质和计算设备
Abad et al. Supporting analysts by dynamic extraction and classification of requirements-related knowledge
CN112836016B (zh) 会议纪要生成方法、装置、设备和存储介质
CN111161730B (zh) 语音指令匹配方法、装置、设备及存储介质
WO2010023939A1 (fr) Appareil d'exploration de texte, procédé d'exploration de texte et support d'enregistrement lisible par un ordinateur
JP4653598B2 (ja) 構文・意味解析装置、音声認識装置、及び構文・意味解析プログラム
JP4478042B2 (ja) 頻度情報付き単語集合生成方法、プログラムおよびプログラム記憶媒体、ならびに、頻度情報付き単語集合生成装置、テキスト索引語作成装置、全文検索装置およびテキスト分類装置
JP2018181259A (ja) 対話ルール照合装置、対話装置、対話ルール照合方法、対話方法、対話ルール照合プログラム、及び対話プログラム
CN114444491A (zh) 新词识别方法和装置
JP2008165718A (ja) 意図判定装置、意図判定方法、及びプログラム
KR102445172B1 (ko) 질의 해석 방법 및 장치
US12019999B2 (en) Providing a well-formed alternate phrase as a suggestion in lieu of a not well-formed phrase
JP3737817B2 (ja) 表現変換方法及び表現変換装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10836093

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011545274

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 13511504

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10836093

Country of ref document: EP

Kind code of ref document: A1