WO2018220688A1

WO2018220688A1 - Dictionary generator, dictionary generation method, and program

Info

Publication number: WO2018220688A1
Application number: PCT/JP2017/019947
Authority: WO
Inventors: 龍二高山; 桂介甲斐
Original assignee: 株式会社Pfu
Priority date: 2017-05-29
Filing date: 2017-05-29
Publication date: 2018-12-06

Abstract

Provided is a dictionary generator for efficiently generating a dictionary of the nouns included in a document file. The dictionary generator has: an extraction unit for extracting, from within a document file, identification information and a noun to be formed into a dictionary; a relationship evaluation unit for evaluating the correlation between the identification information and the noun on the basis of the co-occurrence frequency of the identification information and the noun extracted by the extraction unit; and a group generation unit for generating a group of identification information or a group of nouns on the basis of the result of evaluation by the relationship evaluation unit.

Description

Dictionary generating apparatus, dictionary generating method, and program

The present invention relates to a dictionary generation device, a dictionary generation method, and a program.

For example, Patent Document 1 discloses a document processing apparatus that extracts a morpheme group having a dependency relationship from a document, classifies the extracted morpheme group according to its viewpoint, and classifies the document according to the classification result.

Patent No. 392503

An object of the present invention is to provide a dictionary generation device that efficiently generates a dictionary of the nouns contained in a document file.

The dictionary generation device according to the present invention includes: an extraction unit that extracts identification information and a noun from a document file; a relationship evaluation unit that evaluates the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information and the noun extracted by the extraction unit; and a group generation unit that generates a group of identification information or a group of nouns based on the evaluation result by the relationship evaluation unit.

Preferably, the document file is a work-related document file; the extraction unit extracts identification information of a work object and the name of the work object from the work-related document file; the relationship evaluation unit evaluates the correlation between the identification information of the work object and the name of the work object based on their co-occurrence frequency; and the group generation unit generates a group of identification information of work objects or a group of names of work objects.

Preferably, the document file is a document file of a request or report for maintenance work or repair work; the extraction unit extracts the identification information of a replacement part and the name of the replacement part from the document file; the relationship evaluation unit evaluates the correlation between the identification information of the replacement part and the name of the replacement part based on their co-occurrence frequency; and the group generation unit generates a group of identification information of replacement parts or a group of names of replacement parts.

Preferably, the device further includes: a representative determination unit that determines, in each group generated by the group generation unit, a representative of the identification information or a representative word of the part names; and a dictionary output unit that outputs, based on the groups generated by the group generation unit, a dictionary in which the identification information included in each group, the representative word of the group or the representative of the identification information determined by the representative determination unit, and the nouns included in each group are associated with each other.

Preferably, the apparatus further includes a group updating unit that evaluates the correlation among the subgroups within the group generated by the group generation unit and updates the group based on an increase or decrease in the evaluation result.

Preferably, the apparatus further includes an analysis unit that generates statistical data regarding work using the dictionary output by the dictionary output unit.

Further, the dictionary generation method according to the present invention includes: an extraction step of extracting, from a document file, identification information to be registered in a dictionary and a noun; a relationship evaluation step of evaluating the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information and the noun extracted in the extraction step; and a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.

The program according to the present invention causes a computer to execute: an extraction step of extracting, from a document file, identification information to be registered in a dictionary and a noun; a relationship evaluation step of evaluating the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information and the noun extracted in the extraction step; and a group generation step of generating a group of identification information or a group of nouns based on the evaluation result of the relationship evaluation step.

Focusing on the special relationship between the target name and identification information, it is possible to efficiently generate a dictionary of nouns contained in the document file.

FIG. 1 is a diagram illustrating the hardware configuration of the dictionary generation device 3.
FIG. 2 is a diagram illustrating the functional configuration of the dictionary generation device 3.
FIG. 3 is a diagram illustrating a more detailed functional configuration of the search unit 334.
FIG. 4 is a diagram illustrating the information registered in the dictionary.
FIG. 5 is a flowchart explaining the dictionary generation process (S10) by the dictionary generation device 3.
FIG. 6 is a flowchart explaining the dictionary addition process (S20).
FIG. 7 is a flowchart explaining the dictionary organizing process (S30).
FIG. 8 is a diagram explaining more specifically the method by which the extraction unit 320 extracts part names.
FIG. 9 is a diagram explaining the stock confirmation process.
FIG. 10 is a diagram explaining the work analysis process.
FIG. 11 is a diagram illustrating work analysis results.
FIG. 12 is a diagram illustrating the transition of the data files in the dictionary generation process.
FIG. 13 is a diagram illustrating the transition of the data files in the dictionary addition process.
FIG. 14 is a diagram illustrating the transition of the data files in the dictionary organizing process.
FIG. 15 is a diagram explaining the co-occurrence frequency matrix of component names and component IDs.
FIG. 16 is a diagram explaining the method of generating a correlation coefficient matrix from the co-occurrence frequency matrix.

(Background and overview)
As a prerequisite for analyzing documents such as reports, a dictionary containing synonyms and the like is required. This is because several different terms, such as synonyms, abbreviations, slang terms, and foreign-language words, may represent the same thing. Also, since abbreviations and slang words change with the times, it is not easy to maintain a dictionary.
Therefore, the dictionary generation device 3 of the present embodiment efficiently generates a dictionary by paying attention to the relationship between the noun that is the name of a target and the identification information of the target. In general, automatic dictionary generation often focuses on the syntactic relationship between a predicate (verb) and its object (noun). However, the name of a target (noun) and the identification information (ID) of the target have a characteristic relationship that differs from a syntactic one, so focusing on it can be expected to improve the efficiency of dictionary generation.

(Embodiment)
FIG. 1 is a diagram illustrating a hardware configuration of the dictionary generation device 3.
As illustrated in FIG. 1, the dictionary generation device 3 includes a CPU 300, a memory 302, an HDD 304, a network interface 306 (network IF 306), a display device 308, and an input device 310, and these components are connected to each other via a bus 312. That is, the dictionary generation device 3 is a computer device.
The CPU 300 is, for example, a central processing unit.
The memory 302 is, for example, a volatile memory and functions as a main storage device.
The HDD 304 is, for example, a hard disk drive device, and stores a computer program (for example, the dictionary generation program 32 in FIG. 2) and other data files (for example, the dictionary file in FIG. 4) as a nonvolatile recording device.
The network IF 306 is an interface for performing wired or wireless communication.
The display device 308 is, for example, a liquid crystal display.
The input device 310 is, for example, a keyboard and a mouse.

FIG. 2 is a diagram illustrating a functional configuration of the dictionary generation device 3.
As illustrated in FIG. 2, a dictionary generation program 32 is installed in the dictionary generation apparatus 3 of this example, and a work information storage unit 360 is configured. The dictionary generation program 32 is stored in, for example, a recording medium such as a CD-ROM, and is installed in the dictionary generation apparatus 3 via this recording medium.
The dictionary generation program 32 includes an extraction unit 320, a relationship evaluation unit 322, a group generation unit 324, a representative determination unit 326, an important word determination unit 328, a group update unit 330, a dictionary output unit 332, a search unit 334, and an analysis unit 342.
Part or all of the dictionary generation program 32 may be realized by hardware such as an ASIC, or may be realized by partially borrowing an OS (Operating System) function.

In the dictionary generation program 32, the extraction unit 320 extracts, from the document file, identification information to be registered in the dictionary and nouns. Anything to which identification information is given may be the target of dictionary registration; one example is a work object, that is, the target of a work operation. In this example, a part that is a target of maintenance work on a device will be described as a specific example.
The extraction unit 320 of this example extracts part identification information (hereinafter referred to as part ID) and a noun that is likely to be a part name from a maintenance work or repair work request document or report document file.

The relationship evaluation unit 322 evaluates the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information extracted by the extraction unit 320 and the extracted noun.
The relationship evaluation unit 322 of this example counts the frequency (co-occurrence frequency) at which nouns that are likely to be component names and component IDs appear together within a predetermined unit, and generates a co-occurrence frequency matrix. Here, the predetermined unit is a range of a document within which the component ID and the component name can be judged to appear together, for example, a document file, an input field, a paragraph, or a sentence. Further, the relationship evaluation unit 322 of this example calculates a correlation coefficient between the component name and the component ID based on the generated co-occurrence frequency matrix.
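As a minimal sketch of the counting described above, the following Python assumes that the extraction result is already available as a list of units (for example, sentences), each holding the component names and component IDs found in it. The function name `cooccurrence_matrix`, the ID format, and the sample data are hypothetical and do not come from the specification.

```python
from collections import Counter

def cooccurrence_matrix(units):
    """Count how often each (component name, component ID) pair shares a unit.

    `units` is a list of (names, ids) tuples, one per predetermined unit
    (document file, input field, paragraph, or sentence).
    """
    counts = Counter()
    for names, ids in units:
        # Sets ensure a pair is counted at most once per unit.
        for name in set(names):
            for part_id in set(ids):
                counts[(name, part_id)] += 1
    row_names = sorted({n for n, _ in counts})
    col_ids = sorted({i for _, i in counts})
    matrix = [[counts[(n, i)] for i in col_ids] for n in row_names]
    return row_names, col_ids, matrix

# Hypothetical extraction results: one entry per sentence.
units = [
    (["HDD"], ["P-001"]),
    (["hard disk"], ["P-001"]),
    (["HDD", "fan"], ["P-001", "P-002"]),
    (["fan"], ["P-002"]),
]
row_names, col_ids, matrix = cooccurrence_matrix(units)
```

Each row of `matrix` is the co-occurrence vector of one component name over the component IDs, which is the form used for the grouping step.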

The group generation unit 324 generates a group of identification information or a group of nouns based on the evaluation result by the relationship evaluation unit 322. For example, the group generation unit 324 clusters the component names based on the co-occurrence frequency of the component name and the component ID counted by the relationship evaluation unit 322.
The group generation unit 324 of this example calculates the distance between the component names using the co-occurrence frequency of the component ID as a vector for each component name, and groups the component names having the calculated distances close to each other. Note that other clustering methods, distance measures, and the like may be used.
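The grouping described above might look like the following sketch. The specification leaves the distance measure and clustering method open, so the cosine distance, the single-linkage merge rule, and the threshold of 0.3 are all assumptions chosen only for illustration; the vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def group_names(vectors, threshold=0.3):
    """Put names whose vectors are closer than `threshold` into one group."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links up to the root of the group.
        while parent[n] != n:
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_distance(vectors[a], vectors[b]) < threshold:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical rows of a co-occurrence matrix; columns are component IDs.
vectors = {
    "HDD": [5, 0, 1],
    "hard disk": [4, 0, 0],
    "fan": [0, 6, 0],
    "cooling fan": [0, 3, 1],
}
groups = group_names(vectors)
```

Names that co-occur with the same component IDs end up in the same group even though the strings themselves share no characters.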

The representative determining unit 326 determines, from the group of identification information or the group of nouns generated by the group generation unit 324, a representative of the identification information or a representative word of the nouns. For example, the representative determining unit 326 determines the representative of the identification information or the representative word based on the appearance frequency of the identification information or noun in the group.
The representative determining unit 326 of this example uses, as a representative word, the most frequently used term (component name) in the group from the group of component names generated by the group generating unit 324.
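A short sketch of this selection, assuming the appearance frequencies have already been counted. The tie-breaking rule (alphabetical order) is an assumption; the specification does not say how ties are resolved.

```python
from collections import Counter

def representative_word(group, frequency):
    """Pick the most frequent name in the group (ties: alphabetically first)."""
    return min(group, key=lambda name: (-frequency[name], name))

# Hypothetical appearance frequencies of component names.
frequency = Counter({"HDD": 12, "hard disk": 7, "disk drive": 3, "fan": 9})
rep = representative_word(["HDD", "hard disk", "disk drive"], frequency)
```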

The important word determination unit 328 determines an important word representing a target name corresponding to this identification information from nouns (part names) associated with the same identification information. For example, the important word determination unit 328 selects an important word from a group of nouns associated with the same identification information based on their appearance frequencies.
The important word determination unit 328 of this example sets, as the important word of a part ID, the term (part name) having the highest co-occurrence frequency with that part ID.
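The important-word choice can be sketched as a per-ID maximum over co-occurrence counts. The data layout (a mapping keyed by name and ID pairs) and the tie handling are assumptions for illustration only.

```python
def important_words(cooccurrence):
    """Map each part ID to the part name with the highest co-occurrence count."""
    by_id = {}
    for (name, part_id), count in cooccurrence.items():
        by_id.setdefault(part_id, []).append((count, name))
    # max() compares (count, name); on equal counts the name that sorts
    # last wins, which is an arbitrary choice made for this sketch.
    return {part_id: max(pairs)[1] for part_id, pairs in by_id.items()}

# Hypothetical co-occurrence counts.
cooccurrence = {
    ("HDD", "P-001"): 2,
    ("hard disk", "P-001"): 1,
    ("fan", "P-001"): 1,
    ("fan", "P-002"): 2,
    ("HDD", "P-002"): 1,
}
keywords = important_words(cooccurrence)
```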

The group update unit 330 evaluates the correlation among the subgroups in the group generated by the group generation unit 324, and updates the group based on an increase or decrease in the evaluation result.
The group updating unit 330 in this example searches the subgroups in a group for subgroups whose correlation coefficient is greater than or equal to a reference value. If such a subgroup is found, it is registered in the dictionary; if no such subgroup is found, the subgroup is deleted from the dictionary.

Based on the groups generated by the group generation unit 324 or the groups updated by the group update unit 330, the dictionary output unit 332 outputs, to the display device 308 or the work information storage unit 360, a dictionary that associates the identification information included in each group, the representative determined by the representative determination unit 326, and the nouns included in each group with each other.
As illustrated in FIG. 4, the dictionary output unit 332 of the present example outputs, to the work information storage unit 360, a dictionary (FIG. 4A) that associates the component IDs belonging to each group with the representative word of the group, and a dictionary (FIG. 4B) that associates the component names belonging to each group with the representative word of the group.
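The two dictionaries could be represented as plain mappings, as in the sketch below. How a component ID is assigned to a group is not spelled out in this passage; the sketch assumes an ID belongs to the group of every name it was extracted with. All names and data are hypothetical.

```python
def build_dictionaries(groups, ids_by_name):
    """Build an ID-to-representative and a name-to-representative mapping.

    groups:      {representative word: [component names in the group]}
    ids_by_name: {component name: [component IDs extracted with that name]}
    """
    id_to_rep, name_to_rep = {}, {}
    for rep, names in groups.items():
        for name in names:
            name_to_rep[name] = rep
            for part_id in ids_by_name.get(name, []):
                id_to_rep[part_id] = rep
    return id_to_rep, name_to_rep

groups = {"HDD": ["HDD", "hard disk"], "fan": ["fan", "cooling fan"]}
ids_by_name = {"HDD": ["P-001"], "hard disk": ["P-001", "P-003"], "fan": ["P-002"]}
id_to_rep, name_to_rep = build_dictionaries(groups, ids_by_name)
```

Here `id_to_rep` plays the role of the dictionary of FIG. 4A and `name_to_rep` that of FIG. 4B.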

The search unit 334 searches the identification information or name using the dictionary output by the dictionary output unit 332. More specifically, as illustrated in FIG. 3, the search unit 334 includes a synonym search unit 336, an inventory search unit 338, and a component ID search unit 340.
The synonym search unit 336 refers to the dictionary and extracts part names belonging to the same group for the input part names. As a result, name identification becomes possible, and document files can be analyzed.
As illustrated in FIG. 9, the inventory search unit 338 extracts the part name of a replacement part from the document file of a maintenance work request form, refers to the dictionary, extracts the part IDs belonging to the same group based on the representative word associated with the extracted part name, and checks the parts inventory.
The component ID search unit 340 refers to the dictionary and extracts component IDs belonging to the same group based on the representative words associated with the input component name.
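The three searches can be sketched as lookups that go through the representative word. The dictionary contents, the `STOCK` table, and the function names below are hypothetical; a real inventory check would query an inventory system rather than a literal mapping.

```python
# Hypothetical dictionaries in the style of FIG. 4A and FIG. 4B.
ID_TO_REP = {"P-001": "HDD", "P-003": "HDD", "P-002": "fan"}
NAME_TO_REP = {"HDD": "HDD", "hard disk": "HDD", "fan": "fan", "cooling fan": "fan"}
STOCK = {"P-001": 0, "P-002": 2, "P-003": 4}  # hypothetical inventory counts

def synonyms(name):
    """Return all component names in the same group as `name`."""
    rep = NAME_TO_REP.get(name)
    if rep is None:
        return []
    return sorted(n for n, r in NAME_TO_REP.items() if r == rep)

def part_ids(name):
    """Return all component IDs in the same group as `name`."""
    rep = NAME_TO_REP.get(name)
    if rep is None:
        return []
    return sorted(i for i, r in ID_TO_REP.items() if r == rep)

def inventory(name):
    """Return the stock count of every component ID in the group of `name`."""
    return {i: STOCK.get(i, 0) for i in part_ids(name)}
```

For example, `inventory("hard disk")` reports stock for both IDs of the group, so a request written with either name finds the interchangeable part that is actually in stock.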

The analysis unit 342 controls the synonym search unit 336 for a plurality of input work document files (request document files or report files) to match the names in the document files to representative words, and generates statistical data relating to the work based on the appearance frequency and the like.
For example, as illustrated in FIG. 10, the analysis unit 342 receives a document file of a report that includes the contents of maintenance work (part names of replacement parts, etc.) and the work man-hours required for the work, associates the part names in the input document file with representative words according to the dictionary, and totals the work man-hours for each representative word to create the graph illustrated in FIG. 11.
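A sketch of this aggregation, assuming each report has already been reduced to a (part name, man-hours) pair. The report data and the fallback of keeping unknown names as their own category are assumptions.

```python
from collections import defaultdict

def man_hours_by_representative(reports, name_to_rep):
    """Total the work man-hours per representative word."""
    totals = defaultdict(float)
    for part_name, hours in reports:
        # Names missing from the dictionary are kept under their own name.
        rep = name_to_rep.get(part_name, part_name)
        totals[rep] += hours
    return dict(totals)

name_to_rep = {"HDD": "HDD", "hard disk": "HDD", "fan": "fan", "cooling fan": "fan"}
reports = [("hard disk", 1.5), ("HDD", 2.0), ("cooling fan", 0.5), ("fan", 1.0)]
totals = man_hours_by_representative(reports, name_to_rep)
```

Without name identification the same four reports would produce four separate bars; with it they collapse into one total per part group.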

FIG. 5 is a flowchart for explaining the dictionary generation process (S10) by the dictionary generation device 3. In this flowchart, the case where a dictionary is generated for the first time is described as a specific example. The transition of the data files in this flowchart is illustrated in FIG. 12.
As illustrated in FIG. 5, in step 100 (S100), the extraction unit 320 of the dictionary generation device 3 extracts a part ID and a part name from the input document file. The component ID is information that can uniquely identify the component, such as a model number of the component or a component inventory management number.

In step 105 (S105), the dictionary generation program 32 repeats the process of S100 until the extraction process is completed from all the document files, and when the extraction process is completed, the process proceeds to the process of S110.

In step 110 (S110), the relationship evaluation unit 322 counts, for the component IDs and component names extracted by the extraction unit 320, the appearance frequency of each component name (FIG. 15A) and the appearance frequency of each component ID (FIG. 15B), and generates a co-occurrence frequency matrix of part names and part IDs (FIG. 15C) from the number of times a part ID and a part name appear simultaneously (co-occurrence frequency).
Note that weighting may be performed in counting up the number of appearances. For example, the relationship evaluation unit 322 performs weighting so that when a plurality of sets of component IDs and component names appear simultaneously, the count number is smaller than when only one set of component IDs and component names appears.
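One possible weighting is sketched below. The specification only requires that units containing several ID and name sets contribute less per pair than units containing a single set; dividing by the number of possible pairs in the unit is an assumption made for this sketch.

```python
from collections import defaultdict

def weighted_cooccurrence(units):
    """Count co-occurrences, down-weighting units that contain many pairs."""
    counts = defaultdict(float)
    for names, ids in units:
        names, ids = set(names), set(ids)
        pairs = len(names) * len(ids)
        if pairs == 0:
            continue
        weight = 1.0 / pairs  # assumed rule: 1 for a single pair, less otherwise
        for name in names:
            for part_id in ids:
                counts[(name, part_id)] += weight
    return dict(counts)

# Hypothetical units: one unambiguous sentence and one with two parts.
units = [
    (["HDD"], ["P-001"]),
    (["HDD", "fan"], ["P-001", "P-002"]),
]
weighted = weighted_cooccurrence(units)
```

The unambiguous sentence contributes a full count to ("HDD", "P-001"), while each of the four possible pairs in the second sentence receives only 0.25.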

In step 115 (S115), the important word determination unit 328 determines, from among the part names corresponding to a part ID, the part name having the maximum co-occurrence frequency as the important word of this part ID. Alternatively, the important word determination unit 328 may calculate a correlation coefficient matrix from the co-occurrence frequency matrix of the component IDs and the component names, and determine the component name having the maximum correlation coefficient as the important word.

In step 120 (S120), the group generation unit 324 groups the component names based on the similarity by clustering based on the co-occurrence frequency matrix of the component names and the component IDs generated by the relationship evaluation unit 322. The clustering of this example is based on the co-occurrence frequency of the part name and the part ID, but is not limited to this, and various clustering methods and distance measures can be adopted.

In step 125 (S125), the representative determining unit 326 determines, for each group generated by the group generating unit 324, the component name having the highest appearance frequency among the component names belonging to the group as the representative word of the group.

In step 130 (S130), the dictionary output unit 332 performs name identification, for each group generated by the group generation unit 324, using the representative word determined by the representative determining unit 326.
In step 135 (S135), the dictionary output unit 332 stores the part name, part ID, representative word, and important word of each group after name identification as a dictionary in the work information storage unit 360.

FIG. 6 is a flowchart illustrating the dictionary addition process (S20). In this flowchart, a case where an entry is added to the dictionary generated in the dictionary generation process (S10) will be described as a specific example. The transition of the data files in this flowchart is illustrated in FIG. 13.
As illustrated in FIG. 6, in step 200 (S200), the group update unit 330 reads the component names and component IDs from the work information storage unit 360. The component names and component IDs to be read may include component names and component IDs newly extracted from document files in addition to those extracted when the dictionary was first created.

In step 205 (S205), the group updating unit 330 reads the dictionary from the work information storage unit 360, and performs name identification of the read component names according to the read dictionary.

In step 210 (S210), the group update unit 330 instructs the relationship evaluation unit 322 to generate a co-occurrence frequency matrix.
In response to the instruction, the relationship evaluation unit 322 counts, for the read component IDs and component names, the number of times a component ID and a component name appear simultaneously (co-occurrence frequency), and generates a co-occurrence frequency matrix of component names and component IDs (FIG. 16A).

In step 215 (S215), the group update unit 330 instructs the relationship evaluation unit 322 to generate a correlation coefficient matrix.
In response to the instruction, the relationship evaluation unit 322 calculates a correlation coefficient between the component ID and the component name based on the counted co-occurrence frequencies, and generates a correlation coefficient matrix (FIG. 16B).
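This passage does not state which correlation coefficient is used. As one hedged possibility, the sketch below computes the phi coefficient (the Pearson correlation of two binary variables) from a co-occurrence count and the individual appearance counts of the name and the ID. The choice of phi, the function name, and the sample numbers are assumptions, not the method of the specification.

```python
import math

def phi_coefficient(both, name_total, id_total, n_units):
    """Phi coefficient of "name appears" versus "ID appears" over n_units units.

    both:       units containing both the name and the ID
    name_total: units containing the name
    id_total:   units containing the ID
    """
    n11 = both
    n10 = name_total - both          # name without the ID
    n01 = id_total - both            # ID without the name
    n00 = n_units - n11 - n10 - n01  # neither
    denom = math.sqrt(
        name_total * (n_units - name_total) * id_total * (n_units - id_total)
    )
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

# Hypothetical counts: 100 units, name in 10, ID in 10, both in 8.
phi = phi_coefficient(both=8, name_total=10, id_total=10, n_units=100)
```

Applying such a function to every cell of the co-occurrence frequency matrix would yield a matrix of the same shape whose entries are coefficients instead of counts.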

In step 220 (S220), the group update unit 330 instructs the group generation unit 324 to perform regrouping.
The group generation unit 324 groups the component names again based on the component name and component ID co-occurrence frequency matrix generated by the relationship evaluation unit 322.

In step 225 (S225), for each group generated by the group generation unit 324, the representative determination unit 326 again determines the part name having the highest appearance frequency among the part names belonging to the group as the representative word of the group.

In step 230 (S230), the group updating unit 330 acquires, for the representative word of each group, the N component IDs with the highest correlation coefficients together with those coefficients, and sets these values as the baseline.
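A sketch of the baseline step follows. The text does not say how the top N coefficients are reduced to a single comparison value, so taking the smallest of the top N as the threshold is an assumption, as are the default `n=3` and the sample coefficients.

```python
def top_n_baseline(correlations, n=3):
    """Return the top-n (ID, coefficient) pairs and an assumed baseline value.

    correlations: {component ID: correlation coefficient with the
                   representative word of one group}
    """
    ranked = sorted(correlations.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ranked[:n]
    # Assumption: the weakest of the top-n coefficients serves as the baseline.
    baseline = min(c for _, c in top) if top else 0.0
    return top, baseline

# Hypothetical coefficients for the representative word "HDD".
correlations = {"P-001": 0.9, "P-003": 0.7, "P-002": 0.1, "P-004": 0.4}
top, baseline = top_n_baseline(correlations)
```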

In step 235 (S235), the group updating unit 330 synthesizes a frequency matrix of part names for each group, and searches for a combination whose correlation coefficient is higher than the set baseline.

In step 240 (S240), the dictionary generation program 32 proceeds to the process of S245 when a combination whose correlation coefficient is higher than the baseline is found, and proceeds to the process of S250 when it is not found.

In step 245 (S245), the group updating unit 330 additionally registers a combination having a higher correlation coefficient than the baseline in the dictionary.
In step 250 (S250), the dictionary generation program 32 repeats the processes of S205 to S245 a predetermined number of times, and after repeating the predetermined number of times, ends the dictionary addition process (S20).

FIG. 7 is a flowchart for explaining the dictionary organizing process (S30). In this flowchart, a case where the dictionary generated in the dictionary generation process (S10) is organized will be described as a specific example. The transition of the data files in this flowchart is illustrated in FIG. 14.
As illustrated in FIG. 7, in step 300 (S300), the group update unit 330 reads the component names and component IDs from the work information storage unit 360. The component names and component IDs to be read may include component names and component IDs newly extracted from document files in addition to those extracted when the dictionary was first created.

In step 305 (S305), the group updating unit 330 instructs the relationship evaluation unit 322 to generate a co-occurrence frequency matrix.
In response to the instruction, the relationship evaluation unit 322 counts, for the read component IDs and component names, the number of times a component ID and a component name appear simultaneously (co-occurrence frequency), and generates a co-occurrence frequency matrix of component names and component IDs.

In step 310 (S310), the group update unit 330 instructs the relationship evaluation unit 322 to generate a correlation coefficient matrix.
In accordance with the instruction, the relationship evaluation unit 322 calculates a correlation coefficient between the component ID and the component name based on the counted co-occurrence frequency, and generates a correlation coefficient matrix.

In step 315 (S315), the group update unit 330 acquires, for the representative word of each group, the N component IDs with the highest correlation coefficients together with those coefficients, and sets these values as the baseline.

In step 320 (S320), the group updating unit 330 synthesizes a frequency matrix of part names for each group, and searches for a combination whose correlation coefficient is higher than the set baseline.

In step 325 (S325), the dictionary generation program 32 proceeds to the process of S330 when a combination whose correlation coefficient is higher than the baseline is found, and proceeds to the process of S335 when it is not found.

In step 330 (S330), the group updating unit 330 updates the dictionary with a combination having a higher correlation coefficient than the baseline.
In step 335 (S335), the group updating unit 330 deletes the combination from the dictionary when no combination having a correlation coefficient higher than the baseline is found.
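The keep-or-delete decision of S325 to S335 can be sketched as a filter against the baseline. The sketch assumes one correlation coefficient per dictionary entry and a single scalar baseline, which simplifies the "combination" search described above; all data are hypothetical.

```python
def organize_dictionary(dictionary, correlations, baseline):
    """Keep entries whose correlation exceeds the baseline; delete the rest.

    dictionary:   {part name: representative word}
    correlations: {part name: correlation coefficient of the combination}
    """
    kept, deleted = {}, []
    for name, rep in dictionary.items():
        if correlations.get(name, 0.0) > baseline:
            kept[name] = rep      # corresponds to updating the dictionary (S330)
        else:
            deleted.append(name)  # corresponds to deletion (S335)
    return kept, sorted(deleted)

dictionary = {"HDD": "HDD", "hard disk": "HDD", "screw": "HDD"}
correlations = {"HDD": 0.9, "hard disk": 0.7, "screw": 0.1}
kept, deleted = organize_dictionary(dictionary, correlations, baseline=0.4)
```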

After implementing the dictionary addition process (S20), the dictionary is properly updated by performing the dictionary organization process (S30).

FIG. 8 is a diagram for more specifically explaining a part name extraction method by the extraction unit 320.
The extraction unit 320 performs morphological analysis, dependency analysis, case analysis, and the like, and obtains an analysis result. Subsequently, the extraction unit 320 obtains part name candidates by matching against a dictionary, and further narrows them down to part names having a dependency relationship with “replacement”, based on the segment dependency and the morpheme dependency. At this time, cases that include a negative meaning, such as “not replaced”, or a meaning that indicates a future schedule, such as “scheduled replacement”, are excluded.
The extraction unit 320 performs morphological analysis on the input document file, matches it with the dictionary, labels the part names in the morpheme string as illustrated in FIG. 8A, and outputs them as learning data.
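The narrowing rule described above (keep only part names that depend on a replacement predicate, and drop negated or future cases) might be sketched as follows. The sketch assumes the dependency analysis has already produced, for each candidate part name, the tokens of the predicate phrase it depends on; the English marker word lists are hypothetical stand-ins for the output of a real morphological and dependency analyzer.

```python
REPLACEMENT = {"replaced", "replacement", "exchanged"}  # assumed predicate forms
NEGATION = {"not", "without"}                            # assumed negation markers
FUTURE = {"scheduled", "planned"}                        # assumed future markers

def replaced_parts(dependencies):
    """Return part names that were actually replaced.

    dependencies: list of (part name, tokens of the predicate it depends on)
    """
    result = []
    for name, tokens in dependencies:
        words = {t.lower() for t in tokens}
        if not words & REPLACEMENT:
            continue  # no dependency on a replacement predicate
        if words & NEGATION or words & FUTURE:
            continue  # "not replaced" or "scheduled replacement"
        result.append(name)
    return result

dependencies = [
    ("HDD", ["was", "replaced"]),
    ("fan", ["not", "replaced"]),
    ("battery", ["scheduled", "replacement"]),
    ("cable", ["was", "cleaned"]),
]
parts = replaced_parts(dependencies)
```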

The extraction unit 320 trains a sequence labeling model (the component name extraction model) using the learning data.
Subsequently, as illustrated in FIG. 8B, the extraction unit 320 gives a document file that was not used for learning to the component name extraction model, predicts the morpheme labels, extracts component names that do not exist in the dictionary as candidates, and registers them in the dictionary. Prior to registration, a person may visually confirm the candidates, or filtering may be performed by setting a threshold value for the appearance frequency or the like.
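The candidate extraction and frequency filtering after label prediction could be sketched as below. The label scheme (`B-PART`, `I-PART`, `O`) is an assumption, since the labels of FIG. 8 are not given in the text, and the predicted labels here are hand-written sample data rather than the output of a trained model.

```python
from collections import Counter

def new_candidates(labeled_tokens, known_names, min_count=2):
    """Collect labeled spans that are not in the dictionary and occur often enough."""
    spans, current = [], []
    for token, label in labeled_tokens:
        if label == "B-PART":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I-PART" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    counts = Counter(s for s in spans if s not in known_names)
    # Frequency threshold used as the filter before registration.
    return sorted(name for name, c in counts.items() if c >= min_count)

# Hypothetical predicted labels for two short reports.
predicted = [
    ("replaced", "O"), ("power", "B-PART"), ("supply", "I-PART"),
    ("and", "O"), ("fan", "B-PART"), ("checked", "O"),
    ("power", "B-PART"), ("supply", "I-PART"), ("riser", "B-PART"),
]
candidates = new_candidates(predicted, known_names={"fan"})
```

In this sample, "fan" is skipped because it is already in the dictionary, and "riser" is dropped by the frequency threshold, leaving only "power supply" for registration or manual review.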

As described above, the dictionary generation device 3 extracts the part name and part ID of the replacement part that is the object of maintenance work from the document file of a maintenance work request form or report, and can efficiently create a dictionary of part names based on the co-occurrence frequency of the part names and part IDs.
Furthermore, the dictionary generation device 3 performs name identification on the document file of the request form or report using the created dictionary, and analyzes the document file. Thereby, statistical evaluation regarding the maintenance work becomes possible.

(Modification)
In the above embodiment, the component names are grouped and the representative word and the important word are determined for each group. However, the component ID may be grouped to determine the representative of the component ID and the important ID.

3 Dictionary generation device
32 Dictionary generation program

Claims

1. A dictionary generation device comprising:
an extraction unit that extracts identification information and nouns from a document file;
a relationship evaluation unit that evaluates the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information extracted by the extraction unit and the extracted noun; and
a group generation unit that generates a group of identification information or a group of nouns based on an evaluation result by the relationship evaluation unit.
2. The dictionary generation device according to claim 1, wherein
the document file is a document file relating to work,
the extraction unit extracts identification information of a work object and the name of the work object from the work-related document file,
the relationship evaluation unit evaluates the correlation between the identification information of the work object and the name of the work object based on the co-occurrence frequency of the identification information of the work object and the name of the work object, and
the group generation unit generates a group of identification information of work objects or a group of names of work objects.
3. The dictionary generation device according to claim 2, wherein
the document file is a document file of a request or report for maintenance work or repair work,
the extraction unit extracts the identification information of a replacement part and the name of the replacement part from the document file,
the relationship evaluation unit evaluates the correlation between the identification information of the replacement part and the name of the replacement part based on the co-occurrence frequency of the identification information of the replacement part and the name of the replacement part, and
the group generation unit generates a group of replacement part identification information or a group of replacement part names.
4. The dictionary generation device according to claim 1, further comprising:
a representative determination unit that determines, in each group generated by the group generation unit, a representative of identification information or a representative word of part names; and
a dictionary output unit that outputs, based on the groups generated by the group generation unit, a dictionary in which the identification information included in each group, the representative word of the group or the representative of identification information determined by the representative determination unit, and the nouns included in each group are associated with each other.
5. The dictionary generation device according to claim 1, further comprising a group update unit that evaluates a correlation among subgroups within the group generated by the group generation unit, and updates the group based on an increase or decrease of the evaluation result.
6. The dictionary generation device according to claim 4, further comprising an analysis unit that generates statistical data regarding work using the dictionary output by the dictionary output unit.
7. A dictionary generation method comprising:
an extraction step of extracting identification information and nouns from a document file;
a relationship evaluation step of evaluating the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information extracted in the extraction step and the extracted noun; and
a group generation step of generating a group of identification information or a group of nouns based on an evaluation result in the relationship evaluation step.
8. A program that causes a computer to execute:
an extraction step of extracting identification information and nouns from a document file;
a relationship evaluation step of evaluating the correlation between the identification information and the noun based on the co-occurrence frequency of the identification information extracted in the extraction step and the extracted noun; and
a group generation step of generating a group of identification information or a group of nouns based on an evaluation result in the relationship evaluation step.