WO2018220688A1 - Dictionary generator, dictionary generation method, and program - Google Patents

Dictionary generator, dictionary generation method, and program

Info

Publication number
WO2018220688A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
identification information
unit
dictionary
name
Application number
PCT/JP2017/019947
Other languages
French (fr)
Japanese (ja)
Inventor
龍二 高山
桂介 甲斐
Original Assignee
株式会社Pfu
Application filed by 株式会社Pfu
Priority to PCT/JP2017/019947
Publication of WO2018220688A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools

Definitions

  • The present invention relates to a dictionary generation device, a dictionary generation method, and a program.
  • Patent Document 1 discloses a document processing apparatus that extracts morpheme groups having a dependency relationship from a document, classifies the extracted morpheme groups according to viewpoint, and classifies the document according to the classification result.
  • The dictionary generation device according to the present invention includes: an extraction unit that extracts, from a document file, identification information to be registered in a dictionary and nouns; a relationship evaluation unit that evaluates the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information and the nouns extracted by the extraction unit; and a group generation unit that generates a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation unit.
  • Preferably, the document file is a work-related document file. The extraction unit extracts identification information of a work object and the name of the work object from the work-related document file. The relationship evaluation unit evaluates the correlation between the identification information of the work object and the name of the work object based on their co-occurrence frequency. The group generation unit generates a group of identification information of work objects or a group of names of work objects.
  • Preferably, the document file is a request form or report for maintenance work or repair work. The extraction unit extracts identification information of a replacement part and the name of the replacement part from the document file. The relationship evaluation unit evaluates the correlation between the identification information of the replacement part and the name of the replacement part based on their co-occurrence frequency. The group generation unit generates a group of identification information of replacement parts or a group of names of replacement parts.
  • Preferably, the device further includes: a representative determination unit that determines, within each group generated by the group generation unit, a representative of the identification information or a representative word of the part names; and a dictionary output unit that, based on the groups generated by the group generation unit, outputs a dictionary associating the identification information included in each group, the representative word of the group or the representative of the identification information determined by the representative determination unit, and the nouns included in each group.
  • Preferably, the device further includes a group update unit that evaluates the correlation for subgroups within a group generated by the group generation unit and updates the group based on the increase or decrease in the evaluation result.
  • Preferably, the device further includes an analysis unit that generates statistical data on work using the dictionary output by the dictionary output unit.
  • The dictionary generation method according to the present invention includes: an extraction step of extracting, from a document file, identification information to be registered in a dictionary and nouns; a relationship evaluation step of evaluating the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information and the nouns extracted in the extraction step; and a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.
  • The program according to the present invention causes a computer to execute: an extraction step of extracting, from a document file, identification information to be registered in a dictionary and nouns; a relationship evaluation step of evaluating the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information and the nouns extracted in the extraction step; and a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.
  • FIG. 1 is a diagram illustrating the hardware configuration of the dictionary generation device 3.
  • FIG. 2 is a diagram illustrating the functional configuration of the dictionary generation device 3.
  • FIG. 3 is a diagram illustrating a more detailed functional configuration of the search unit 334.
  • FIG. 4 is a diagram illustrating the information registered in the dictionary.
  • FIG. 5 is a flowchart explaining the dictionary generation process (S10) performed by the dictionary generation device 3.
  • FIG. 6 is a flowchart explaining the dictionary addition process (S20).
  • FIG. 7 is a flowchart explaining the dictionary organizing process (S30).
  • FIG. 1 is a diagram illustrating a hardware configuration of the dictionary generation device 3.
  • The dictionary generation device 3 includes a CPU 300, a memory 302, an HDD 304, a network interface 306 (network IF 306), a display device 308, and an input device 310, and these components are connected to one another via a bus 312. That is, the dictionary generation device 3 is a computer device.
  • the CPU 300 is, for example, a central processing unit.
  • the memory 302 is, for example, a volatile memory and functions as a main storage device.
  • the HDD 304 is, for example, a hard disk drive device, and stores a computer program (for example, the dictionary generation program 32 in FIG.
  • the network IF 306 is an interface for performing wired or wireless communication.
  • the display device 308 is, for example, a liquid crystal display.
  • the input device 310 is, for example, a keyboard and a mouse.
  • FIG. 2 is a diagram illustrating a functional configuration of the dictionary generation device 3.
  • a dictionary generation program 32 is installed in the dictionary generation apparatus 3 of this example, and a work information storage unit 360 is configured.
  • the dictionary generation program 32 is stored in, for example, a recording medium such as a CD-ROM, and is installed in the dictionary generation apparatus 3 via this recording medium.
  • the dictionary generation program 32 includes an extraction unit 320, a relationship evaluation unit 322, a group generation unit 324, a representative determination unit 326, an important word determination unit 328, a group update unit 330, a dictionary output unit 332, a search unit 334, and an analysis unit 342.
  • Part or all of the dictionary generation program 32 may be realized by hardware such as an ASIC, or may be realized by partially borrowing an OS (Operating System) function.
  • The extraction unit 320 extracts, from the document file, identification information to be registered in the dictionary and nouns.
  • The target of dictionary registration may be anything to which identification information is assigned; one example is an object on which work is performed.
  • In this example, a part that is the target of maintenance work on a device is described as a specific example.
  • the extraction unit 320 of this example extracts part identification information (hereinafter referred to as part ID) and a noun that is likely to be a part name from a maintenance work or repair work request document or report document file.
  • the relationship evaluation unit 322 evaluates the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information extracted by the extraction unit 320 and the extracted noun.
  • the relationship evaluation unit 322 of this example counts the frequency (co-occurrence frequency) at which nouns that are likely to be component names and component IDs appear within a predetermined unit, and generates a co-occurrence frequency matrix.
  • The predetermined unit is a range of a document within which the component ID and the component name can be judged to appear together, for example a document file, an input field, a paragraph, or a sentence.
  • the relationship evaluation unit 322 of this example calculates a correlation coefficient between the component name and the component ID based on the generated co-occurrence frequency matrix.
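The counting and correlation steps described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify which correlation coefficient is used, so the sketch assumes the Pearson correlation of two presence/absence indicators (the phi coefficient). The record layout, the part IDs, and the names `cooccurrence` and `phi` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Each record stands for one "predetermined unit" (for example, one input field)
# and holds the part IDs and candidate part names found in it. Data is made up.
records = [
    {"ids": {"P-100"}, "names": {"fan"}},
    {"ids": {"P-100"}, "names": {"cooling fan"}},
    {"ids": {"P-100"}, "names": {"fan"}},
    {"ids": {"P-200"}, "names": {"power supply"}},
    {"ids": {"P-200"}, "names": {"PSU"}},
]


def cooccurrence(records):
    """Count, for each (name, id) pair, the number of units containing both."""
    counts = Counter()
    for rec in records:
        for name, pid in product(rec["names"], rec["ids"]):
            counts[(name, pid)] += 1
    return counts


def phi(records, name, pid):
    """Pearson correlation of the two presence indicators (assumed measure)."""
    n = len(records)
    n_name = sum(name in r["names"] for r in records)
    n_id = sum(pid in r["ids"] for r in records)
    n_both = sum(name in r["names"] and pid in r["ids"] for r in records)
    denom = sqrt(n_name * (n - n_name) * n_id * (n - n_id))
    return 0.0 if denom == 0 else (n * n_both - n_name * n_id) / denom


counts = cooccurrence(records)
print(counts[("fan", "P-100")])      # co-occurrence count for one pair
print(phi(records, "fan", "P-100"))  # positive: the name tends to appear with this ID
print(phi(records, "fan", "P-200"))  # negative: the name tends not to appear with this ID
```

Any other association measure computed from the same matrix could be swapped in for `phi` without changing the counting step.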
  • the group generation unit 324 generates a group of identification information or a group of nouns based on the evaluation result by the relationship evaluation unit 322. For example, the group generation unit 324 clusters the component names based on the co-occurrence frequency of the component name and the component ID counted by the relationship evaluation unit 322. The group generation unit 324 of this example calculates the distance between the component names using the co-occurrence frequency of the component ID as a vector for each component name, and groups the component names having the calculated distances close to each other. Note that other clustering methods, distance measures, and the like may be used.
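As a rough illustration of grouping part names by the distance between their co-occurrence vectors, the sketch below uses cosine distance and a greedy threshold rule. Both choices are assumptions, since the text leaves the clustering method and distance measure open; the threshold value, the data, and the function names are hypothetical.

```python
from math import sqrt

# One vector per part name: co-occurrence counts with each part ID (made-up values).
part_ids = ["P-100", "P-200"]
vectors = {
    "fan": [5, 0],
    "cooling fan": [3, 0],
    "power supply": [0, 4],
    "PSU": [1, 6],
}


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def group_names(vectors, threshold=0.3):
    """Greedy grouping: a name joins the first group that has a close enough member."""
    groups = []
    for name, vec in vectors.items():
        for group in groups:
            if any(cosine_distance(vec, vectors[m]) <= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


groups = group_names(vectors)
print(groups)
```

The greedy rule depends on input order and never merges two existing groups; a hierarchical or k-means style method would be a natural replacement for larger vocabularies.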
  • The representative determination unit 326 determines, from a group of identification information or a group of nouns generated by the group generation unit 324, a representative of the identification information or a representative word of the nouns. For example, the representative determination unit 326 determines the representative of the identification information or the representative word based on the appearance frequency of the identification information or nouns in the group. The representative determination unit 326 of this example takes, as the representative word, the most frequently used term (component name) in each group of component names generated by the group generation unit 324.
  • The important word determination unit 328 determines, from the nouns (part names) associated with the same identification information, an important word representing the name of the target corresponding to that identification information. For example, the important word determination unit 328 selects an important word from the group of nouns associated with the same identification information based on their appearance frequencies. The important word determination unit 328 of this example sets, as the important word, the term (part name) having the highest co-occurrence frequency with the same part ID.
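The two selections described above, a representative word per group and an important word per part ID, both reduce to picking a maximum. A minimal sketch, with made-up counts and hypothetical function names, and with no tie-breaking rule because the text does not give one:

```python
from collections import Counter

# Appearance counts of each part name, and co-occurrence counts per (name, id) pair.
name_freq = Counter({"fan": 12, "cooling fan": 4, "FAN unit": 1})
cooc = {
    ("fan", "P-100"): 7,
    ("cooling fan", "P-100"): 3,
    ("fan", "P-101"): 1,
    ("FAN unit", "P-101"): 2,
}


def representative_word(group, name_freq):
    """Most frequently appearing part name within one group."""
    return max(group, key=lambda name: name_freq[name])


def important_word(part_id, cooc):
    """Part name with the highest co-occurrence count for one part ID."""
    candidates = {name: c for (name, pid), c in cooc.items() if pid == part_id}
    return max(candidates, key=candidates.get)


print(representative_word(["fan", "cooling fan", "FAN unit"], name_freq))
print(important_word("P-100", cooc))
print(important_word("P-101", cooc))
```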
  • the group update unit 330 evaluates the correlation among the subgroups in the group generated by the group generation unit 324, and updates the group based on the increase / decrease in the evaluation result.
  • The group update unit 330 in this example searches the subgroups in a group for subgroups whose correlation coefficient is greater than or equal to a reference value. If such a subgroup is found, it is registered in the dictionary; if no subgroup reaching the reference value is found, the subgroup is deleted from the dictionary.
  • Based on the groups generated by the group generation unit 324 or the groups updated by the group update unit 330, the dictionary output unit 332 outputs, to the display device 308 or the work information storage unit 360, a dictionary that associates the identification information included in each group, the representative determined by the representative determination unit 326, and the nouns included in each group.
  • The dictionary output unit 332 of this example outputs to the work information storage unit 360 a dictionary (FIG. 4A) that associates the component IDs belonging to each group with the representative word of the group, and a dictionary (FIG. 4B) that associates the component names belonging to each group with the representative word of the group.
  • the search unit 334 searches the identification information or name using the dictionary output by the dictionary output unit 332. More specifically, as illustrated in FIG. 3, the search unit 334 includes a synonym search unit 336, an inventory search unit 338, and a component ID search unit 340.
  • the synonym search unit 336 refers to the dictionary and extracts part names belonging to the same group for the input part names. As a result, name identification becomes possible, and document files can be analyzed.
  • The inventory search unit 338 extracts the part name of a replacement part from the document file of a maintenance work request form, refers to the dictionary, extracts the part IDs belonging to the same group based on the representative word associated with the extracted part name, and checks the parts inventory.
  • the component ID search unit 340 refers to the dictionary and extracts component IDs belonging to the same group based on the representative words associated with the input component name.
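The dictionary lookups performed by the search units can be pictured as follows. This is a sketch that assumes the two tables of FIG. 4A and FIG. 4B are held as plain mappings to the representative word; the table contents and the function names `synonyms` and `part_ids_for` are hypothetical.

```python
# Two tables in the spirit of FIG. 4A (part ID -> representative word) and
# FIG. 4B (part name -> representative word). Contents are made up.
id_to_rep = {"P-100": "fan", "P-101": "fan", "P-200": "power supply"}
name_to_rep = {
    "fan": "fan",
    "cooling fan": "fan",
    "power supply": "power supply",
    "PSU": "power supply",
}


def synonyms(name):
    """Part names in the same group as the input name (synonym search)."""
    rep = name_to_rep.get(name)
    if rep is None:
        return []
    return sorted(n for n, r in name_to_rep.items() if r == rep)


def part_ids_for(name):
    """Part IDs in the same group as the input name (part ID search)."""
    rep = name_to_rep.get(name)
    if rep is None:
        return []
    return sorted(pid for pid, r in id_to_rep.items() if r == rep)


print(synonyms("cooling fan"))
print(part_ids_for("cooling fan"))
print(part_ids_for("unknown part"))
```

An inventory check along the lines of the inventory search unit would take the result of `part_ids_for` and query a stock table for each ID.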
  • The analysis unit 342 controls the synonym search unit 336 so that, for a plurality of input work document files (request form or report files), the names in each document file are mapped to their representative words, and generates statistical data relating to work based on the appearance frequency and the like.
  • For example, the analysis unit 342 receives report document files that include the contents of maintenance work (such as the part names of replacement parts) and the man-hours the work required, maps the part names in the input document files to representative words according to the dictionary, and totals the man-hours for each representative word to create the graph illustrated in FIG. 11.
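The man-hour aggregation described above can be sketched in a few lines. The report fields and the dictionary contents are made up, and keeping unknown part names under their own label is an assumption rather than something the text specifies.

```python
from collections import defaultdict

# Part name -> representative word, as produced by the dictionary (made-up contents).
name_to_rep = {"fan": "fan", "cooling fan": "fan", "PSU": "power supply"}

# One entry per report: the replaced part as written, and the man-hours spent.
reports = [
    {"part_name": "cooling fan", "man_hours": 1.5},
    {"part_name": "fan", "man_hours": 2.0},
    {"part_name": "PSU", "man_hours": 3.0},
    {"part_name": "gasket", "man_hours": 0.5},
]


def man_hours_by_representative(reports, name_to_rep):
    """Total man-hours per representative word after name identification."""
    totals = defaultdict(float)
    for report in reports:
        # Names missing from the dictionary are kept as-is (assumption).
        rep = name_to_rep.get(report["part_name"], report["part_name"])
        totals[rep] += report["man_hours"]
    return dict(totals)


totals = man_hours_by_representative(reports, name_to_rep)
print(totals)
```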
  • FIG. 5 is a flowchart for explaining dictionary generation processing (S10) by the dictionary generation device 3.
  • The transition of the data file in this flowchart is illustrated in FIG.
  • In step 100 (S100), the extraction unit 320 of the dictionary generation device 3 extracts a part ID and a part name from the input document file.
  • the component ID is information that can uniquely identify the component, such as a model number of the component or a component inventory management number.
  • In step 105 (S105), the dictionary generation program 32 repeats the process of S100 until extraction has been completed for all the document files, and then proceeds to S110.
  • In step 110 (S110), for the component IDs and component names extracted by the extraction unit 320, the relationship evaluation unit 322 counts the appearance frequency of each component name (FIG. 15A) and the appearance frequency of each component ID (FIG. 15B), and generates a co-occurrence frequency matrix of part names and part IDs (FIG. 15C) from the number of times a part ID and a part name appear together (co-occurrence frequency).
  • weighting may be performed in counting up the number of appearances.
  • the relationship evaluation unit 322 performs weighting so that when a plurality of sets of component IDs and component names appear simultaneously, the count number is smaller than when only one set of component IDs and component names appears.
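One way to realize the weighting described above is to split a single count evenly across all ID and name pairs found in the same unit. The 1/(number of pairs) weight is an assumption chosen for illustration; the text only requires that the count be smaller when several pairs appear together. The data and function name are hypothetical.

```python
from collections import defaultdict
from itertools import product


def weighted_cooccurrence(records):
    """Co-occurrence counts where each unit contributes a total weight of 1."""
    counts = defaultdict(float)
    for rec in records:
        pairs = list(product(rec["names"], rec["ids"]))
        if not pairs:
            continue
        weight = 1.0 / len(pairs)  # assumed weighting scheme
        for pair in pairs:
            counts[pair] += weight
    return dict(counts)


records = [
    {"ids": ["P-100"], "names": ["fan"]},                  # one unambiguous pair
    {"ids": ["P-100", "P-200"], "names": ["fan", "PSU"]},  # four candidate pairs
]
counts = weighted_cooccurrence(records)
print(counts[("fan", "P-100")])  # 1 from the first unit plus 0.25 from the second
print(counts[("PSU", "P-200")])  # 0.25 from the second unit only
```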
  • the important word determination unit 328 determines, from among the part names corresponding to the part ID, the part name having the maximum co-occurrence frequency as the important word of this part ID.
  • Alternatively, the important word determination unit 328 may calculate a correlation coefficient matrix from the co-occurrence frequency matrix of component IDs and component names, and determine the component name having the maximum correlation coefficient as the important word.
  • In step 120 (S120), the group generation unit 324 groups the component names by similarity, clustering them based on the co-occurrence frequency matrix of component names and component IDs generated by the relationship evaluation unit 322.
  • The clustering of this example is based on the co-occurrence frequency of the part name and the part ID, but is not limited to this; various clustering methods and distance measures can be adopted.
  • In step 125 (S125), for each group generated by the group generation unit 324, the representative determination unit 326 determines the component name with the highest appearance frequency among the component names belonging to the group as the representative word of that group.
  • In step 130 (S130), the dictionary output unit 332 performs name identification for each group generated by the group generation unit 324, using the representative word determined by the representative determination unit 326.
  • In step 135 (S135), the dictionary output unit 332 stores the part names, part IDs, representative word, and important words of each group after name identification as a dictionary in the work information storage unit 360.
  • FIG. 6 is a flowchart illustrating the dictionary addition process (S20).
  • In this example, a case where an entry is added to the dictionary generated in the dictionary generation process (S10) is described.
  • the transition of the data file in this flowchart is illustrated in FIG.
  • the group update unit 330 reads the component name and component ID from the work information storage unit 360.
  • the component name and component ID to be read may include the component name and component ID newly extracted from the document file in addition to the component name and component ID extracted when the dictionary is first created.
  • In step 205 (S205), the group update unit 330 reads the dictionary from the work information storage unit 360 and performs name identification on the read component names according to the dictionary.
  • In step 210 (S210), the group update unit 330 instructs the relationship evaluation unit 322 to generate a co-occurrence frequency matrix.
  • The relationship evaluation unit 322 counts, for the read component IDs and component names, the number of times a component ID and a component name appear together (co-occurrence frequency), and generates the co-occurrence frequency matrix of component names and component IDs (FIG. 16A).
  • In step 215 (S215), the group update unit 330 instructs the relationship evaluation unit 322 to generate a correlation coefficient matrix.
  • The relationship evaluation unit 322 calculates the correlation coefficient between each component ID and component name based on the counted co-occurrence frequencies, and generates a correlation coefficient matrix (FIG. 16B).
  • In step 220 (S220), the group update unit 330 instructs the group generation unit 324 to perform regrouping.
  • The group generation unit 324 groups the component names again based on the co-occurrence frequency matrix of component names and component IDs generated by the relationship evaluation unit 322.
  • In step 225 (S225), for each group generated by the group generation unit 324, the representative determination unit 326 again determines the part name with the highest appearance frequency among the part names belonging to the group as the representative word of that group.
  • In step 230 (S230), the group update unit 330 acquires, for the representative word of each group, the N component IDs with the highest correlation coefficients together with those coefficients, and sets these values as the baseline.
  • In step 235 (S235), the group update unit 330 synthesizes a frequency matrix of part names for each group, and searches for a combination whose correlation coefficient is higher than the set baseline.
  • In step 240 (S240), the dictionary generation program 32 proceeds to S245 when a combination whose correlation coefficient is higher than the baseline is found, and proceeds to S250 when none is found.
  • In step 245 (S245), the group update unit 330 additionally registers in the dictionary the combination whose correlation coefficient is higher than the baseline.
  • the dictionary generation program 32 repeats the processes of S205 to S245 a predetermined number of times, and after repeating the predetermined number of times, ends the dictionary addition process (S20).
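The baseline comparison in the steps above is described only briefly, so the following is one possible reading rather than the patented procedure. It assumes that "synthesizing" means treating the representative word and a candidate name as a single merged term, that the correlation measure is the phi coefficient, and that N = 1 (only the single best part ID is used). All names and data are hypothetical.

```python
from math import sqrt


def phi(units, names, pid):
    """Correlation between 'any of names appears' and 'pid appears' over all units."""
    n = len(units)
    a = [any(x in u["names"] for x in names) for u in units]
    b = [pid in u["ids"] for u in units]
    na, nb = sum(a), sum(b)
    nab = sum(x and y for x, y in zip(a, b))
    denom = sqrt(na * (n - na) * nb * (n - nb))
    return 0.0 if denom == 0 else (n * nab - na * nb) / denom


def combinations_above_baseline(units, representative, members, pid):
    """Members whose merge with the representative word beats the baseline."""
    baseline = phi(units, [representative], pid)
    return [
        m
        for m in members
        if m != representative and phi(units, [representative, m], pid) > baseline
    ]


units = [
    {"ids": {"P-100"}, "names": {"fan"}},
    {"ids": {"P-100"}, "names": {"fan"}},
    {"ids": {"P-100"}, "names": {"cooling fan"}},
    {"ids": {"P-200"}, "names": {"PSU"}},
    {"ids": {"P-200"}, "names": {"blower"}},
]
kept = combinations_above_baseline(units, "fan", ["fan", "cooling fan", "blower"], "P-100")
print(kept)
```

Under this reading, a name is worth adding to a group when merging it with the representative word strengthens the association with the group's part ID, and a name that weakens the association is a candidate for removal.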
  • FIG. 7 is a flowchart for explaining the dictionary organizing process (S30).
  • In this example, a case where the dictionary generated in the dictionary generation process (S10) is organized is described.
  • the transition of the data file in this flowchart is illustrated in FIG.
  • the group update unit 330 reads the component name and component ID from the work information storage unit 360.
  • the component name and component ID to be read may include the component name and component ID newly extracted from the document file in addition to the component name and component ID extracted when the dictionary is first created.
  • In step 305 (S305), the group update unit 330 instructs the relationship evaluation unit 322 to generate a co-occurrence frequency matrix.
  • The relationship evaluation unit 322 counts, for the read component IDs and component names, the number of times a component ID and a component name appear together (co-occurrence frequency), and generates the co-occurrence frequency matrix of component names and component IDs.
  • In step 310 (S310), the group update unit 330 instructs the relationship evaluation unit 322 to generate a correlation coefficient matrix.
  • The relationship evaluation unit 322 calculates the correlation coefficient between each component ID and component name based on the counted co-occurrence frequencies, and generates a correlation coefficient matrix.
  • In step 315 (S315), the group update unit 330 acquires, for the representative word of each group, the N component IDs with the highest correlation coefficients together with those coefficients, and sets these values as the baseline.
  • In step 320 (S320), the group update unit 330 synthesizes a frequency matrix of part names for each group, and searches for a combination whose correlation coefficient is higher than the set baseline.
  • In step 325 (S325), the dictionary generation program 32 proceeds to S330 when a combination whose correlation coefficient is higher than the baseline is found, and proceeds to S335 when none is found.
  • In step 330 (S330), the group update unit 330 updates the dictionary with the combination whose correlation coefficient is higher than the baseline.
  • In step 335 (S335), the group update unit 330 deletes the combination from the dictionary when no combination with a correlation coefficient higher than the baseline is found.
  • FIG. 8 is a diagram for more specifically explaining a part name extraction method by the extraction unit 320.
  • The extraction unit 320 performs morphological analysis, dependency analysis, case analysis, and the like, and obtains an analysis result. The extraction unit 320 then obtains part name candidates by matching against a dictionary, and narrows them down to part names that have a dependency relationship with "exchange", based on the segment dependencies and the morpheme dependencies. At this stage, cases containing a negation such as "not exchanged" and cases referring to a future plan such as "scheduled replacement" are excluded.
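The exclusion of negated and planned replacements can be illustrated with a much simpler stand-in than the dependency analysis described above. The sketch below works on English sentences with regular expressions and is only an approximation for illustration: the part ID format, the cue words, and the function name are all hypothetical, and a real implementation for Japanese text would rely on a morphological analyzer and dependency parser.

```python
import re

PART_ID = re.compile(r"\b[A-Z]{1,3}-\d{3,}\b")               # hypothetical ID format
REPLACED = re.compile(r"\breplac(?:ed|ement)\b", re.I)        # "exchange" cue
EXCLUDE = re.compile(r"\b(?:not|never|scheduled|planned|will be)\b", re.I)


def extract_replacements(text, known_names):
    """Return part IDs and names from sentences that report a completed replacement."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not REPLACED.search(sentence):
            continue  # sentence is not about replacement
        if EXCLUDE.search(sentence):
            continue  # negated or future replacement is skipped
        ids = PART_ID.findall(sentence)
        names = [n for n in known_names if n.lower() in sentence.lower()]
        if ids or names:
            results.append({"ids": ids, "names": names})
    return results


text = (
    "Replaced the cooling fan (P-100). "
    "The power supply was not replaced. "
    "Replacement of the roller R-2040 is scheduled for next month."
)
found = extract_replacements(text, ["cooling fan", "power supply", "roller"])
print(found)
```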
  • the extraction unit 320 performs morphological analysis on the input document file, matches it with the dictionary, labels the part names in the morpheme string as illustrated in FIG. 8A, and outputs them as learning data.
  • The extraction unit 320 performs sequence labeling using the learning data. Subsequently, as illustrated in FIG. 8B, the extraction unit 320 gives a document file that was not used for learning to the component name extraction model, predicts the morpheme labels, extracts component names that do not exist in the dictionary as candidates, and registers them in the dictionary. Prior to registration, a person may check the candidates visually, or filtering may be performed by setting a threshold on the appearance frequency or the like.
  • As described above, the dictionary generation device 3 extracts the part name and part ID of a replacement part that is the object of maintenance work from a maintenance work request form or report document file, and can efficiently create a dictionary of part names based on the co-occurrence frequency of the part name and part ID. Furthermore, the dictionary generation device 3 performs name identification on the request form or report document files using the created dictionary and analyzes the document files. This enables statistical evaluation of the maintenance work.
  • In the above embodiment, the component names are grouped and the representative word and the important word are determined for each group.
  • Alternatively, the component IDs may be grouped to determine a representative component ID and an important ID.

Abstract

Provided is a dictionary generator for efficiently generating a dictionary of the nouns included in a document file. The dictionary generator has: an extraction unit for extracting, from within a document file, a noun and identification information to be formed into a dictionary; a relationship evaluation unit for evaluating the correlation between the identification information and the noun on the basis of the co-occurrence frequency of the identification information extracted by the extraction unit and the extracted noun; and a group generation unit for generating a group of identification information or a group of nouns on the basis of the result of evaluation by the relationship evaluation unit.

Description

Dictionary generation device, dictionary generation method, and program
The present invention relates to a dictionary generation device, a dictionary generation method, and a program.
For example, Patent Document 1 discloses a document processing apparatus that extracts morpheme groups having a dependency relationship from a document, classifies the extracted morpheme groups according to viewpoint, and classifies the document according to the classification result.
Patent Document 1: Japanese Patent No. 3925003
The invention provides a dictionary generation device that efficiently generates a dictionary of the nouns contained in a document file.
The dictionary generation device according to the present invention includes: an extraction unit that extracts, from a document file, identification information to be registered in a dictionary and nouns; a relationship evaluation unit that evaluates the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information and the nouns extracted by the extraction unit; and a group generation unit that generates a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation unit.
Preferably, the document file is a work-related document file; the extraction unit extracts identification information of a work object and the name of the work object from the work-related document file; the relationship evaluation unit evaluates the correlation between the identification information of the work object and the name of the work object based on their co-occurrence frequency; and the group generation unit generates a group of identification information of work objects or a group of names of work objects.
Preferably, the document file is a request form or report for maintenance work or repair work; the extraction unit extracts identification information of a replacement part and the name of the replacement part from the document file; the relationship evaluation unit evaluates the correlation between the identification information of the replacement part and the name of the replacement part based on their co-occurrence frequency; and the group generation unit generates a group of identification information of replacement parts or a group of names of replacement parts.
Preferably, the device further includes: a representative determination unit that determines, within each group generated by the group generation unit, a representative of the identification information or a representative word of the part names; and a dictionary output unit that, based on the groups generated by the group generation unit, outputs a dictionary associating the identification information included in each group, the representative word of the group or the representative of the identification information determined by the representative determination unit, and the nouns included in each group.
Preferably, the device further includes a group update unit that evaluates the correlation for subgroups within a group generated by the group generation unit and updates the group based on the increase or decrease in the evaluation result.
Preferably, the device further includes an analysis unit that generates statistical data on work using the dictionary output by the dictionary output unit.
The dictionary generation method according to the present invention includes: an extraction step of extracting, from a document file, identification information to be registered in a dictionary and nouns; a relationship evaluation step of evaluating the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information and the nouns extracted in the extraction step; and a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.
The program according to the present invention causes a computer to execute: an extraction step of extracting, from a document file, identification information to be registered in a dictionary and nouns; a relationship evaluation step of evaluating the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information and the nouns extracted in the extraction step; and a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.
By focusing on the special relationship between the name of a target and its identification information, a dictionary of the nouns contained in a document file can be generated efficiently.
FIG. 1 is a diagram illustrating the hardware configuration of the dictionary generation device 3.
FIG. 2 is a diagram illustrating the functional configuration of the dictionary generation device 3.
FIG. 3 is a diagram illustrating the functional configuration of the search unit 334 in more detail.
FIG. 4 is a diagram illustrating the information registered in the dictionary.
FIG. 5 is a flowchart illustrating the dictionary generation process (S10) performed by the dictionary generation device 3.
FIG. 6 is a flowchart illustrating the dictionary addition process (S20).
FIG. 7 is a flowchart illustrating the dictionary organizing process (S30).
FIG. 8 is a diagram illustrating in more detail how the extraction unit 320 extracts part names.
FIG. 9 is a diagram illustrating the inventory check process.
FIG. 10 is a diagram illustrating the work analysis process.
FIG. 11 is a diagram illustrating work analysis results.
FIG. 12 is a diagram illustrating the transitions of the data files in the dictionary generation process.
FIG. 13 is a diagram illustrating the transitions of the data files in the dictionary addition process.
FIG. 14 is a diagram illustrating the transitions of the data files in the dictionary organizing process.
FIG. 15 is a diagram illustrating the co-occurrence frequency matrix of part names and part IDs.
FIG. 16 is a diagram illustrating how a correlation coefficient matrix is generated from the co-occurrence frequency matrix.
(Background and overview)
Analyzing documents such as reports requires a dictionary that records synonyms and the like, because a single thing may be referred to by multiple synonyms, near-synonyms, abbreviations, slang terms, foreign-language terms, and so on. Moreover, abbreviations and slang change over time, so maintaining such a dictionary is not easy.
The dictionary generation device 3 of the present embodiment therefore generates a dictionary efficiently by focusing on the relationship between a noun that names a target and the identification information of that target. Automatic dictionary generation usually focuses on the syntactic relationship between a predicate (verb) and its object (noun). The name of a thing (noun) and its identification information (ID), however, have a characteristic relationship that differs from syntax, so more efficient dictionary generation can be expected.
(Embodiment)
FIG. 1 is a diagram illustrating the hardware configuration of the dictionary generation device 3.
As illustrated in FIG. 1, the dictionary generation device 3 includes a CPU 300, a memory 302, an HDD 304, a network interface 306 (network IF 306), a display device 308, and an input device 310, which are connected to one another via a bus 312. In other words, the dictionary generation device 3 is a computer.
The CPU 300 is, for example, a central processing unit.
The memory 302 is, for example, a volatile memory and functions as the main storage device.
The HDD 304 is, for example, a hard disk drive and, as a nonvolatile storage device, stores computer programs (for example, the dictionary generation program 32 in FIG. 2) and other data files (for example, the dictionary files in FIG. 4).
The network IF 306 is an interface for wired or wireless communication.
The display device 308 is, for example, a liquid crystal display.
The input device 310 is, for example, a keyboard and a mouse.
FIG. 2 is a diagram illustrating the functional configuration of the dictionary generation device 3.
As illustrated in FIG. 2, the dictionary generation program 32 is installed on the dictionary generation device 3 of this example, and a work information storage unit 360 is configured on it. The dictionary generation program 32 is stored on a recording medium such as a CD-ROM and is installed on the dictionary generation device 3 from that medium.
The dictionary generation program 32 includes an extraction unit 320, a relationship evaluation unit 322, a group generation unit 324, a representative determination unit 326, an important word determination unit 328, a group update unit 330, a dictionary output unit 332, a search unit 334, and an analysis unit 342.
Part or all of the dictionary generation program 32 may be implemented in hardware such as an ASIC, or may be implemented by borrowing some functions of the OS (Operating System).
In the dictionary generation program 32, the extraction unit 320 extracts, from document files, nouns and the identification information of targets to be registered in the dictionary. A target may be anything to which identification information is assigned; one example is a work object on which work is performed. This example describes parts subject to equipment maintenance work as a concrete case.
The extraction unit 320 of this example extracts part identification information (hereinafter, part IDs) and nouns that are likely to be part names from document files of request forms or reports for maintenance or repair work.
The relationship evaluation unit 322 evaluates the correlation between identification information and nouns based on the co-occurrence frequency of the identification information and the nouns extracted by the extraction unit 320.
The relationship evaluation unit 322 of this example counts how often a part ID and a noun that is likely to be a part name appear together within a predetermined unit (the co-occurrence frequency) and generates a co-occurrence frequency matrix. Here, the predetermined unit is a range of a document within which a part ID and a part name can be judged to have appeared together, such as a document file, an input field, a paragraph, or a sentence. The relationship evaluation unit 322 of this example also calculates correlation coefficients between part names and part IDs from the generated co-occurrence frequency matrix.
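The counting described above can be sketched as follows. This is a minimal illustration, not the implementation in the source: the function name `cooccurrence_counts`, the input layout (one pair of name list and ID list per unit), and the sample part names and IDs are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_counts(units):
    """Count how often each (part name, part ID) pair appears in the same unit.

    `units` is a list of (names, ids) pairs, one per predetermined unit
    (for example, one per sentence or per input field). This layout is an
    assumption made for illustration.
    """
    counts = Counter()
    for names, ids in units:
        # Use sets so a pair is counted at most once per unit.
        for name in set(names):
            for part_id in set(ids):
                counts[(name, part_id)] += 1
    return counts


# Hypothetical sample data: three sentences taken from maintenance reports.
units = [
    (["fan"], ["P-100"]),
    (["cooling fan"], ["P-100"]),
    (["fan", "power supply"], ["P-100", "P-200"]),
]
counts = cooccurrence_counts(units)
```

Each key of `counts` corresponds to one cell of the co-occurrence frequency matrix; pairs that never appear together simply read as zero.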
The group generation unit 324 generates groups of identification information or groups of nouns based on the evaluation result of the relationship evaluation unit 322. For example, the group generation unit 324 clusters part names based on the co-occurrence frequencies of part names and part IDs counted by the relationship evaluation unit 322.
The group generation unit 324 of this example treats the co-occurrence frequencies with the part IDs as a vector for each part name, calculates the distances between part names, and groups part names whose calculated distances are small. Other clustering methods and distance measures may also be used.
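One way to picture this grouping is the sketch below. The source leaves the clustering method and distance measure open, so cosine distance, single-link merging, the threshold value, and the sample vectors are all assumptions chosen only to keep the example short.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def group_names(vectors, threshold):
    """Single-link grouping: names closer than `threshold` share a group."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links up to the root of the group.
        while parent[n] != n:
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_distance(vectors[a], vectors[b]) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical co-occurrence vectors: one row per part name,
# one column per part ID (here P-100 and P-200).
vectors = {
    "fan": [5, 0],
    "cooling fan": [3, 0],
    "power supply": [0, 4],
    "PSU": [1, 6],
}
groups = group_names(vectors, threshold=0.2)
```

With these sample vectors, names that co-occur with the same part ID end up in the same group even though the strings themselves share no characters, which is the point of using the ID as the linking signal.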
The representative determination unit 326 determines, from a group of identification information or nouns generated by the group generation unit 324, a representative of the identification information or a representative word for the nouns. For example, the representative determination unit 326 determines the representative of the identification information or the representative word based on how often each piece of identification information or noun appears within the group.
The representative determination unit 326 of this example takes the most frequent term (part name) within a group of part names generated by the group generation unit 324 as the representative word.
The important word determination unit 328 determines, from among the nouns (part names) associated with the same identification information, an important word that expresses the name of the target corresponding to that identification information. For example, the important word determination unit 328 selects the important word from the nouns associated with the same identification information based on their appearance frequencies.
The important word determination unit 328 of this example takes the term (part name) that has the highest co-occurrence frequency with a given part ID as the important word for that part ID.
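The difference between a representative word (chosen per group) and an important word (chosen per part ID) can be sketched as follows. The function names, the alphabetical tie-breaking rule, and the sample counts are assumptions for illustration; the source only states that the most frequent term is chosen.

```python
def representative_word(group, name_freq):
    """Pick the most frequent name in a group.

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    return min(group, key=lambda n: (-name_freq[n], n))


def important_word(part_id, cooc):
    """Pick the name that co-occurs most often with the given part ID."""
    candidates = {n: c for (n, pid), c in cooc.items() if pid == part_id}
    return min(candidates, key=lambda n: (-candidates[n], n))


# Hypothetical appearance counts and co-occurrence counts.
name_freq = {"fan": 7, "cooling fan": 3}
cooc = {
    ("fan", "P-100"): 4,
    ("cooling fan", "P-101"): 3,
    ("fan", "P-101"): 1,
}

rep = representative_word(["fan", "cooling fan"], name_freq)
key_p100 = important_word("P-100", cooc)
key_p101 = important_word("P-101", cooc)
```

In this sample the group as a whole is represented by one word, while each part ID keeps the name it is most often written with.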
The group update unit 330 evaluates the correlation for subgroups within a group generated by the group generation unit 324 and updates the group based on the increase or decrease in the evaluation result.
The group update unit 330 of this example searches the subgroups within a group for a subgroup whose correlation coefficient is at or above a reference value. If such a subgroup is found, it is additionally registered in the dictionary; if no such subgroup is found, the subgroup is deleted from the dictionary.
Based on the groups generated by the group generation unit 324 or the groups updated by the group update unit 330, the dictionary output unit 332 outputs, to the display device 308 or the work information storage unit 360, a dictionary in which the identification information included in each group, the representative determined by the representative determination unit 326, and the nouns included in each group are associated with one another.
As illustrated in FIG. 4, the dictionary output unit 332 of this example outputs to the work information storage unit 360 a dictionary that associates the part IDs belonging to each group with the representative word of that group (FIG. 4(A)) and a dictionary that associates the part names belonging to each group with the representative word of that group (FIG. 4(B)).
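The two dictionaries of FIG. 4 can be pictured as two lookup tables keyed by part ID and by part name. The sketch below assumes a simple in-memory layout; the field names `representative`, `names`, and `ids` and the sample groups are hypothetical and are not taken from the figure.

```python
def build_dictionaries(groups):
    """Build the ID-to-representative and name-to-representative tables.

    `groups` is a list of dicts with the keys 'representative', 'names',
    and 'ids'. This layout is an assumption made for illustration.
    """
    id_to_rep = {}
    name_to_rep = {}
    for group in groups:
        for part_id in group["ids"]:
            id_to_rep[part_id] = group["representative"]
        for name in group["names"]:
            name_to_rep[name] = group["representative"]
    return id_to_rep, name_to_rep


# Hypothetical groups after clustering and representative selection.
groups = [
    {"representative": "fan",
     "names": ["fan", "cooling fan"],
     "ids": ["P-100", "P-101"]},
    {"representative": "power supply",
     "names": ["power supply", "PSU"],
     "ids": ["P-200"]},
]
id_to_rep, name_to_rep = build_dictionaries(groups)
```

Keeping the representative word as the shared value in both tables is what lets a later lookup go from a name to its group and from there to the group's part IDs.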
The search unit 334 searches for identification information or names using the dictionary output by the dictionary output unit 332. More specifically, as illustrated in FIG. 3, the search unit 334 includes a synonym search unit 336, an inventory search unit 338, and a part ID search unit 340.
The synonym search unit 336 refers to the dictionary and extracts the part names that belong to the same group as an input part name. This makes it possible to normalize names and therefore to analyze document files.
As illustrated in FIG. 9, the inventory search unit 338 extracts the part name of a replacement part from the document file of a maintenance work request form, refers to the dictionary, extracts the part IDs belonging to the same group based on the representative word associated with the extracted part name, and checks the inventory of those parts.
The part ID search unit 340 refers to the dictionary and extracts the part IDs belonging to the same group based on the representative word associated with an input part name.
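The three lookups can be sketched on top of two plain tables. Everything named here is hypothetical: the function names, the table layout (name to representative word, part ID to representative word), and the `stock` table standing in for an inventory system that the source does not describe.

```python
def synonyms(name, name_to_rep):
    """Return all names in the same group as `name` (empty if unknown)."""
    rep = name_to_rep.get(name)
    if rep is None:
        return []
    return sorted(n for n, r in name_to_rep.items() if r == rep)


def part_ids(name, name_to_rep, id_to_rep):
    """Return all part IDs in the same group as `name` (empty if unknown)."""
    rep = name_to_rep.get(name)
    if rep is None:
        return []
    return sorted(pid for pid, r in id_to_rep.items() if r == rep)


def in_stock_ids(name, name_to_rep, id_to_rep, stock):
    """Return the part IDs for `name` that have a positive stock count."""
    return [pid for pid in part_ids(name, name_to_rep, id_to_rep)
            if stock.get(pid, 0) > 0]


# Hypothetical dictionary tables and inventory counts.
name_to_rep = {"fan": "fan", "cooling fan": "fan", "PSU": "power supply"}
id_to_rep = {"P-100": "fan", "P-101": "fan", "P-200": "power supply"}
stock = {"P-100": 0, "P-101": 3}

same_group = synonyms("cooling fan", name_to_rep)
ids = part_ids("cooling fan", name_to_rep, id_to_rep)
available = in_stock_ids("cooling fan", name_to_rep, id_to_rep, stock)
```

A request that mentions "cooling fan" can then be answered with a compatible part that is in stock even when the request never states a part ID.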
For multiple input work document files (request document files or report files), the analysis unit 342 controls the synonym search unit 336 to normalize the names in the document files to their representative words, and generates statistical data on the work based on, for example, the appearance frequencies of the representative words.
For example, as illustrated in FIG. 10, the analysis unit 342 takes as input report document files that contain the details of maintenance work (such as the part names of replacement parts) and the man-hours the work required, normalizes the part names in the input document files to representative words according to the dictionary, and totals the man-hours for each representative word to create the graph illustrated in FIG. 11.
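The aggregation step can be sketched as below. The record layout (one part name and one man-hour value per report), the fallback of keeping unknown names as they are, and the sample values are assumptions made for this example.

```python
from collections import defaultdict


def hours_by_representative(reports, name_to_rep):
    """Total the man-hours per representative word.

    `reports` is a list of (part name, man-hours) pairs. Names missing from
    the dictionary are kept unchanged, which is an assumption of this sketch.
    """
    totals = defaultdict(float)
    for part_name, hours in reports:
        rep = name_to_rep.get(part_name, part_name)
        totals[rep] += hours
    return dict(totals)


# Hypothetical report records and dictionary.
reports = [("fan", 1.5), ("cooling fan", 2.0), ("PSU", 3.0)]
name_to_rep = {"fan": "fan", "cooling fan": "fan", "PSU": "power supply"}
totals = hours_by_representative(reports, name_to_rep)
```

Without the normalization step, "fan" and "cooling fan" would be charted as two separate parts and the totals per part would be understated.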
FIG. 5 is a flowchart illustrating the dictionary generation process (S10) performed by the dictionary generation device 3. This flowchart describes, as a concrete example, the case where a dictionary is generated for the first time. The transitions of the data files in this flowchart are illustrated in FIG. 12.
As illustrated in FIG. 5, in step 100 (S100), the extraction unit 320 of the dictionary generation device 3 extracts part IDs and part names from an input document file. A part ID is information that uniquely identifies a part, such as a part model number or a part inventory management number.
In step 105 (S105), the dictionary generation program 32 repeats S100 until extraction has been completed for all document files, and then proceeds to S110.
In step 110 (S110), for the part IDs and part names extracted by the extraction unit 320, the relationship evaluation unit 322 counts the appearance frequency of each part name (FIG. 15(A)) and of each part ID (FIG. 15(B)), and generates a co-occurrence frequency matrix of part names and part IDs (FIG. 15(C)) from the number of times a part ID and a part name appear together (the co-occurrence frequency).
Weighting may be applied when counting appearances. For example, when multiple sets of part IDs and part names appear together, the relationship evaluation unit 322 applies a weight so that the count is smaller than when only one set of a part ID and a part name appears.
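A weighted count of this kind might look like the sketch below. The source only says that the count should be smaller when several sets appear together; dividing one unit of weight evenly across all name and ID pairs in the unit is an assumption chosen for illustration, as are the function name and sample data.

```python
from collections import defaultdict


def weighted_cooccurrence(units):
    """Count co-occurrences, down-weighting units that contain many pairs.

    Each unit contributes a total weight of 1.0, split evenly across its
    (name, ID) pairs. This weighting scheme is an assumption of the sketch.
    """
    counts = defaultdict(float)
    for names, ids in units:
        names, ids = set(names), set(ids)
        pairs = len(names) * len(ids)
        if pairs == 0:
            continue
        weight = 1.0 / pairs
        for name in names:
            for part_id in ids:
                counts[(name, part_id)] += weight
    return counts


# Hypothetical units: one unambiguous sentence and one with two sets.
units = [
    (["fan"], ["P-100"]),
    (["fan", "power supply"], ["P-100", "P-200"]),
]
weighted = weighted_cooccurrence(units)
```

A unit that mentions one name and one ID is strong evidence that they refer to the same part, whereas a unit that mentions two of each leaves four possible pairings, so each pairing receives a quarter of the weight.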
In step 115 (S115), the important word determination unit 328 determines, from among the part names corresponding to a part ID, the part name with the highest co-occurrence frequency as the important word for that part ID. Alternatively, the important word determination unit 328 may calculate a correlation coefficient matrix from the co-occurrence frequency matrix of part IDs and part names and determine the part name with the highest correlation coefficient as the important word.
In step 120 (S120), the group generation unit 324 groups part names by similarity through clustering, based on the co-occurrence frequency matrix of part names and part IDs generated by the relationship evaluation unit 322. The clustering in this example is based on the co-occurrence frequencies of part names and part IDs, but it is not limited to this, and various clustering methods and distance measures can be adopted.
In step 125 (S125), for each group generated by the group generation unit 324, the representative determination unit 326 determines the part name with the highest appearance frequency among the part names belonging to the group as the representative word of that group.
In step 130 (S130), for each group generated by the group generation unit 324, the dictionary output unit 332 normalizes the part names to the representative word determined by the representative determination unit 326.
In step 135 (S135), the dictionary output unit 332 stores the part names, part IDs, representative word, and important words of each group after normalization in the work information storage unit 360 as a dictionary.
FIG. 6 is a flowchart illustrating the dictionary addition process (S20). This flowchart describes, as a concrete example, the case where entries are added to the dictionary generated in the dictionary generation process (S10). The transitions of the data files in this flowchart are illustrated in FIG. 13.
As illustrated in FIG. 6, in step 200 (S200), the group update unit 330 reads part names and part IDs from the work information storage unit 360. In addition to the part names and part IDs extracted when the dictionary was first created, the part names and part IDs that are read may include ones newly extracted from document files.
In step 205 (S205), the group update unit 330 reads the dictionary from the work information storage unit 360 and normalizes the read part names according to that dictionary.
In step 210 (S210), the group update unit 330 instructs the relationship evaluation unit 322 to generate a co-occurrence frequency matrix.
In response to the instruction, the relationship evaluation unit 322 counts, for the read part IDs and part names, the number of times a part ID and a part name appear together (the co-occurrence frequency) and generates a co-occurrence frequency matrix of part names and part IDs (FIG. 16(A)).
In step 215 (S215), the group update unit 330 instructs the relationship evaluation unit 322 to generate a correlation coefficient matrix.
In response to the instruction, the relationship evaluation unit 322 calculates the correlation coefficients between part IDs and part names based on the counted co-occurrence frequencies and generates a correlation coefficient matrix (FIG. 16(B)).
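The text here does not state which correlation coefficient is used. As one possible reading, the sketch below computes the phi coefficient, which is the Pearson correlation of two binary variables ("the name appears in the unit" and "the ID appears in the unit"). The choice of phi, the function name, and the sample counts are all assumptions.

```python
import math


def phi_coefficient(n_both, n_name, n_id, n_units):
    """Phi coefficient between a part name and a part ID.

    n_both  : units that contain both the name and the ID
    n_name  : units that contain the name
    n_id    : units that contain the ID
    n_units : total number of units
    Returns 0.0 when the coefficient is undefined (zero denominator).
    """
    n11 = n_both
    n10 = n_name - n_both            # name without ID
    n01 = n_id - n_both              # ID without name
    n00 = n_units - n_name - n_id + n_both
    denom = math.sqrt(n_name * (n_units - n_name) * n_id * (n_units - n_id))
    if denom == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical counts over 10 units.
always_together = phi_coefficient(n_both=5, n_name=5, n_id=5, n_units=10)
independent = phi_coefficient(n_both=1, n_name=5, n_id=2, n_units=10)
```

Unlike a raw co-occurrence count, a coefficient of this kind also accounts for how often the name and the ID appear apart, so a very common name does not look strongly related to every ID.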
In step 220 (S220), the group update unit 330 instructs the group generation unit 324 to regroup.
The group generation unit 324 groups the part names again based on the co-occurrence frequency matrix of part names and part IDs generated by the relationship evaluation unit 322.
In step 225 (S225), for each group generated by the group generation unit 324, the representative determination unit 326 again determines the part name with the highest appearance frequency among the part names belonging to the group as the representative word of that group.
In step 230 (S230), for the representative word of each group, the group update unit 330 obtains the top N part IDs with the highest correlation coefficients together with those coefficients, and sets these values as the baseline.
In step 235 (S235), the group update unit 330 combines the frequency matrices of the part names for each group and searches for a combination whose correlation coefficient is higher than the baseline.
In step 240 (S240), the dictionary generation program 32 proceeds to S245 if a combination whose correlation coefficient is higher than the baseline is found, and proceeds to S250 if none is found.
In step 245 (S245), the group update unit 330 additionally registers in the dictionary the combination whose correlation coefficient is higher than the baseline.
In step 250 (S250), the dictionary generation program 32 repeats S205 to S245 a predetermined number of times and then ends the dictionary addition process (S20).
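The search in S230 to S245 can be pictured with the simplified sketch below. It is an interpretation, not the procedure in the source: it handles one representative word and one part ID, represents each name by the set of unit indices in which it appears, "combines" names by taking the union of those sets, and uses the phi coefficient as the correlation measure. All names and sample data are hypothetical.

```python
import math


def phi(a, b, n_units):
    """Phi coefficient between two sets of unit indices."""
    denom = math.sqrt(len(a) * (n_units - len(a))
                      * len(b) * (n_units - len(b)))
    if denom == 0:
        return 0.0
    return (len(a & b) * n_units - len(a) * len(b)) / denom


def find_improving_merges(rep_units, candidate_units, id_units, n_units):
    """Return the baseline and the candidates whose merge raises it."""
    baseline = phi(rep_units, id_units, n_units)
    accepted = []
    for name, units in sorted(candidate_units.items()):
        merged = rep_units | units  # combine the occurrences of both names
        if phi(merged, id_units, n_units) > baseline:
            accepted.append(name)
    return baseline, accepted


# Hypothetical occurrences over 10 units.
id_units = {0, 1, 2, 3}                  # units mentioning part ID P-100
rep_units = {0, 1}                       # units mentioning "fan"
candidate_units = {
    "cooling fan": {2, 3},               # fills in the units "fan" misses
    "cable": {7, 8},                     # unrelated to P-100
}
baseline, accepted = find_improving_merges(
    rep_units, candidate_units, id_units, n_units=10)
```

Merging a true synonym makes the combined name line up better with the part ID, so the coefficient rises above the baseline; merging an unrelated name dilutes it, so that combination is not registered.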
FIG. 7 is a flowchart illustrating the dictionary organizing process (S30). This flowchart describes, as a concrete example, the case where the dictionary generated in the dictionary generation process (S10) is organized. The transitions of the data files in this flowchart are illustrated in FIG. 14.
As illustrated in FIG. 7, in step 300 (S300), the group update unit 330 reads part names and part IDs from the work information storage unit 360. In addition to the part names and part IDs extracted when the dictionary was first created, the part names and part IDs that are read may include ones newly extracted from document files.
In step 305 (S305), the group update unit 330 instructs the relationship evaluation unit 322 to generate a co-occurrence frequency matrix.
In response to the instruction, the relationship evaluation unit 322 counts, for the read part IDs and part names, the number of times a part ID and a part name appear together (the co-occurrence frequency) and generates a co-occurrence frequency matrix of part names and part IDs.
In step 310 (S310), the group update unit 330 instructs the relationship evaluation unit 322 to generate a correlation coefficient matrix.
In response to the instruction, the relationship evaluation unit 322 calculates the correlation coefficients between part IDs and part names based on the counted co-occurrence frequencies and generates a correlation coefficient matrix.
In step 315 (S315), for the representative word of each group, the group update unit 330 obtains the top N part IDs with the highest correlation coefficients together with those coefficients, and sets these values as the baseline.
In step 320 (S320), the group update unit 330 combines the frequency matrices of the part names for each group and searches for a combination whose correlation coefficient is higher than the baseline.
In step 325 (S325), the dictionary generation program 32 proceeds to S330 if a combination whose correlation coefficient is higher than the baseline is found, and proceeds to S335 if none is found.
In step 330 (S330), the group update unit 330 updates the dictionary with the combination whose correlation coefficient is higher than the baseline.
In step 335 (S335), when no combination with a correlation coefficient higher than the baseline has been found, the group update unit 330 deletes the combination from the dictionary.
Performing the dictionary organizing process (S30) after the dictionary addition process (S20) keeps the dictionary properly updated.
FIG. 8 is a diagram illustrating in more detail how the extraction unit 320 extracts part names.
The extraction unit 320 performs morphological analysis, dependency analysis, case analysis, and the like, and obtains the analysis results. The extraction unit 320 then obtains part name candidates by matching against the dictionary and narrows them down to part names that have a dependency relationship with "replacement", based on phrase-level and morpheme-level dependencies. At this point, cases that carry a negative meaning, such as "was not replaced", and cases that refer to a future plan, such as "scheduled to be replaced", are excluded.
The extraction unit 320 performs morphological analysis on the input document files, matches the result against the dictionary, labels the part names in the morpheme sequence as illustrated in FIG. 8(A), and outputs the result as training data.
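The labeling step can be sketched as a dictionary match over a token sequence. The BIO label scheme (`B-PART`, `I-PART`, `O`), the longest-match rule, and the English tokens are assumptions for illustration; the source works on Japanese morphemes and does not name a label scheme here.

```python
def label_tokens(tokens, part_names):
    """Label tokens that match a dictionary part name with BIO tags.

    `part_names` is a set of token tuples, so multi-token names can match.
    The longest match at each position wins (an assumption of this sketch).
    """
    labels = ["O"] * len(tokens)
    max_len = max((len(p) for p in part_names), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in part_names:
                matched = length
                break
        if matched:
            labels[i] = "B-PART"                 # first token of the name
            for j in range(i + 1, i + matched):
                labels[j] = "I-PART"             # remaining tokens
            i += matched
        else:
            i += 1
    return labels


# Hypothetical token sequence and dictionary entries.
tokens = ["replaced", "the", "cooling", "fan", "today"]
part_names = {("fan",), ("cooling", "fan")}
labels = label_tokens(tokens, part_names)
```

Token and label pairs produced this way could serve as training data for a sequence labeling model, which can then propose part names that are not yet in the dictionary.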
The extraction unit 320 performs sequence labeling using this training data.
Next, as illustrated in FIG. 8(B), the extraction unit 320 feeds document files that were not used for training to the part name extraction model, has it predict the labels of the morphemes, extracts part names that are not in the dictionary as candidates, and registers them in the dictionary. Before registration, a person may check the candidates visually, or they may be filtered by setting a threshold on the appearance frequency or a similar measure.
As described above, the dictionary generation device 3 extracts the part names and part IDs of replacement parts subject to maintenance work from the document files of maintenance work request forms or reports, and can efficiently create a dictionary of part names based on the co-occurrence frequencies of part names and part IDs.
The dictionary generation device 3 also uses the created dictionary to normalize names in the document files of request forms or reports and then analyzes those files. This makes statistical evaluation of maintenance work possible.
(Modification)
In the above embodiment, part names are grouped and a representative word and important words are determined for each group. Alternatively, part IDs may be grouped and a representative part ID or an important ID may be determined.
3 Dictionary generation device
32 Dictionary generation program

Claims (8)

  1.  A dictionary generation device comprising:
     an extraction unit that extracts, from a document file, identification information to be registered in a dictionary and nouns;
     a relationship evaluation unit that evaluates the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information extracted by the extraction unit and the extracted nouns; and
     a group generation unit that generates a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation unit.
  2.  The dictionary generation device according to claim 1, wherein
     the document file is a document file relating to work,
     the extraction unit extracts identification information of a work object and a name of the work object from the document file relating to work,
     the relationship evaluation unit evaluates the correlation between the identification information of the work object and the name of the work object based on the co-occurrence frequency of the identification information of the work object and the name of the work object, and
     the group generation unit generates a group of identification information of work objects or a group of names of work objects.
  3.  The dictionary generation device according to claim 2, wherein
     the document file is a document file of a request form or report for maintenance work or repair work,
     the extraction unit extracts identification information of a replacement part and a name of the replacement part from the document file,
     the relationship evaluation unit evaluates the correlation between the identification information of the replacement part and the name of the replacement part based on the co-occurrence frequency of the identification information of the replacement part and the name of the replacement part, and
     the group generation unit generates a group of identification information of replacement parts or a group of names of replacement parts.
  4.  The dictionary generation device according to claim 1, further comprising:
     a representative determination unit that determines, within a group generated by the group generation unit, a representative of the identification information or a representative word of the part names; and
     a dictionary output unit that outputs, based on the groups generated by the group generation unit, a dictionary in which the identification information included in each group, the representative word or the representative of the identification information determined for the group by the representative determination unit, and the nouns included in each group are associated with one another.
  5.  The dictionary generation device according to claim 1, further comprising a group update unit that evaluates the correlation for subgroups within a group generated by the group generation unit and updates the group based on an increase or decrease in the evaluation result.
  6.  The dictionary generation device according to claim 4, further comprising an analysis unit that generates statistical data regarding work using the dictionary output by the dictionary output unit.
  7.  A dictionary generation method comprising:
     an extraction step of extracting, from a document file, identification information to be registered in a dictionary and nouns;
     a relationship evaluation step of evaluating the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information extracted in the extraction step and the extracted nouns; and
     a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.
  8.  A program that causes a computer to execute:
     an extraction step of extracting, from a document file, identification information to be registered in a dictionary and nouns;
     a relationship evaluation step of evaluating the correlation between the identification information and the nouns based on the co-occurrence frequency of the identification information extracted in the extraction step and the extracted nouns; and
     a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.
PCT/JP2017/019947 2017-05-29 2017-05-29 Dictionary generator, dictionary generation method, and program WO2018220688A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/019947 WO2018220688A1 (en) 2017-05-29 2017-05-29 Dictionary generator, dictionary generation method, and program

Publications (1)

Publication Number Publication Date
WO2018220688A1 (en) 2018-12-06

Family

ID=64455251

Country Status (1)

Country Link
WO (1) WO2018220688A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110969A (en) * 2019-04-10 2019-08-09 中国科学院国家空间科学中心 A kind of space environment forecast product gross examines appraisal procedure and system automatically

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006195756A (en) * 2005-01-13 2006-07-27 Just Syst Corp Information retrieval device, and device for presenting improvement method of information retrieval site
JP2007140861A (en) * 2005-11-17 2007-06-07 Konica Minolta Medical & Graphic Inc Information processing system, information processing method, and program
WO2014002776A1 (en) * 2012-06-25 2014-01-03 日本電気株式会社 Synonym extraction system, method, and recording medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17912201

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17912201

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP