WO2015194140A1 - Document data processing device, document data processing method, and recording medium - Google Patents

Publication number
WO2015194140A1
WO2015194140A1 PCT/JP2015/002938 JP2015002938W WO2015194140A1 WO 2015194140 A1 WO2015194140 A1 WO 2015194140A1 JP 2015002938 W JP2015002938 W JP 2015002938W WO 2015194140 A1 WO2015194140 A1 WO 2015194140A1
Authority
WO
WIPO (PCT)
Prior art keywords
expression
document
range
data processing
predetermined
Prior art date
Application number
PCT/JP2015/002938
Other languages
French (fr)
Japanese (ja)
Inventor
綾子 久野
Original Assignee
日本電気株式会社
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2016529029A priority Critical patent/JP6677158B2/en
Publication of WO2015194140A1 publication Critical patent/WO2015194140A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Definitions

  • the present invention relates to a document data processing apparatus, a document data processing method, and a recording medium for evaluating whether or not document description items are sufficient.
  • An example of a technique for detecting an expression that tends to be deficient in an input document is disclosed in Patent Document 1.
  • the document data processing apparatus of Patent Document 1 includes an input unit, a storage unit, a search unit, and an output unit.
  • the document data processing apparatus of Patent Document 1 first inputs document data (input document) to be processed by an input means.
  • the document data processing apparatus of Patent Document 1 searches the input document, by the search means, for a predetermined expression stored in the storage means (hereinafter referred to as a “matching expression”).
  • the document data processing apparatus of Patent Document 1 reads out a message associated with the matching expression stored in the storage unit, and outputs the message by the output unit.
  • Patent Document 1 detects an expression that tends to be deficient in an input document by storing an expression that tends to be deficient in a storage unit in advance.
  • An example of a technique for detecting, from an input document, a lack of another expression that should have a dependency relationship with a certain expression is disclosed in Patent Document 2.
  • the data processing apparatus of Patent Document 2 includes an input unit, a syntax analysis unit, a storage unit, a determination unit, and an output unit.
  • the data processing apparatus of Patent Document 2 inputs document data by an input means.
  • Patent Document 2 executes syntax analysis of document data by syntax analysis means.
  • the data processing apparatus of Patent Document 2 determines, by the determination means, whether or not the most basic single element is missing in the syntax tree that is the result of the syntax analysis. Then, based on the determination result, the data processing apparatus of Patent Document 2 determines whether or not the document data lacks a phrase needed to form a grammatical “sentence”.
  • Patent Document 2 stores in advance the correspondence relationship of expressions that should be in a dependency relationship by the storage means.
  • the data processing apparatus of Patent Document 2 then determines whether or not the other expression that should be in the dependency relationship is actually in that relationship in the document data.
  • the data processing apparatus disclosed in Patent Document 2 outputs a determination result as to whether or not there is an expression that should be in a dependency relationship by the output means.
  • An example of a technique for detecting, from an input document, a lack of another word that should be described in relation to a certain word is disclosed in Patent Document 3.
  • the data processing apparatus disclosed in Patent Document 3 includes storage means, input means, determination means, and output means.
  • the data processing apparatus disclosed in Patent Document 3 first inputs document data by an input means.
  • the data processing apparatus of Patent Document 3 determines whether or not the first word is present in the input document by the determining means.
  • the data processing apparatus of Patent Document 3 determines whether or not the second word is present in the input document by the determination unit.
  • when the first word is present and the second word is not, the data processing apparatus disclosed in Patent Document 3 judges, by the determination unit, that the description including the second word is insufficient in the input document.
  • the data processing apparatus of Patent Document 3 outputs a determination result by an output unit.
  • the document data processing apparatus of Patent Document 1 determines the presence or absence of an expression that tends to be short of description, but does not determine whether or not there is actually a shortage of description in the input document. That is, the document data processing apparatus of Patent Document 1 has a problem that it cannot be determined whether or not there is actually a shortage of description in the input document.
  • the data processing apparatus disclosed in Patent Document 2 determines that a description is lacking when there is a syntactic deficiency in the input document data, or when an expression that should have a dependency relationship with an expression registered in advance in the storage means is missing. For this reason, when one of the expressions that should be in a dependency relationship is absent from one sentence but present in another sentence, the data processing apparatus of Patent Document 2 still determines that a description is lacking. That is, if the expressions that should be in a dependency relationship exist across sentences, it should be determined that there is no shortage of description, yet it is determined that there is one. As described above, the data processing apparatus of Patent Document 2 has a problem that it cannot determine whether there is a shortage of description in consideration of a plurality of sentences in the input document.
  • the data processing apparatus of Patent Document 3 has the same problem as the data processing apparatus of Patent Document 2.
  • the main object of the present invention is to determine whether or not an expression relating to a predetermined matter to be described in relation to a predetermined subject is described within an appropriate range when an expression relating to the predetermined subject exists in the document.
  • An object of the present invention is to provide a document data processing apparatus, a document data processing method, and a document data processing program.
  • the document data processing apparatus of the present invention includes co-occurrence range setting means for determining, based on the distribution of the shortest distance between an appearance position of a predetermined first expression relating to a predetermined subject in a first document and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document,
  • and detail shortage detection means for detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
  • the document data processing method of the present invention determines, based on the distribution of the shortest distance between an appearance position of a predetermined first expression relating to a predetermined subject in a first document and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document, and detects, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
  • the document data processing program of the present invention causes a computer to execute a co-occurrence range setting process for determining, based on the distribution of the shortest distance between an appearance position of a predetermined first expression relating to a predetermined subject in a first document and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document,
  • and a detail shortage detection process for detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
  • the “input document” is a document (document information, document data) input as a processing target to the document data processing apparatus according to each embodiment of the present invention.
  • the “document (input document)” may be, for example, a document including a character string, a symbol, a table, or the like listed below.
  • Documents written in a specific language including one or more sentences (simple sentences, complex sentences, compound sentences, etc.)
  • Tables or forms with multiple items arranged in rows or columns (for example, form sheets created with Microsoft Excel (registered trademark))
  • Documents that contain a mixture of the above specific-language text and tables or forms (for example, instruction manuals for various products)
  • the “document” may further include a figure and an image.
  • a plurality of “situation-limited words” may be associated with one pair of an “expression to be refined” and a “refining expression”.
  • FIG. 1 is a block diagram showing an example of the configuration of a document data processing apparatus 10 according to the first embodiment of the present invention.
  • the document data processing apparatus 10 includes a document input unit 101, a detailed expression database 102, a word extraction unit 103, a co-occurrence presence/absence check unit 104, a co-occurrence range setting unit 105, a detail shortage detection unit 106, and an output unit 107.
  • the document input unit 101 inputs a document (that is, an input document) that is a detection target of insufficient detailing to the document data processing apparatus 10.
  • the detailed expression database 102 stores in advance a “refinement table” including at least data in which an expression to be refined is associated with a refining expression.
  • the refinement table may further include a situation-limited word associated with a pair of an expression to be refined and a refining expression.
  • the detailed expression database 102 may also store in advance synonyms of the expression to be refined or of the refining expression, so that a synonym or the like is treated as being the same as the expression to be refined or the refining expression.
  • the word extraction unit 103 searches the input document for character strings that match the expressions to be refined, the refining expressions, and the situation-limited words stored in advance in the detailed expression database 102. The word extraction unit 103 then records the position of each matching character string in a storage device (not shown). The position of a character string is specified using a file name, page number, line number, sentence number, cell coordinates (cell number), or character number.
  • hereinafter, the position of each character string in the document that matches an expression to be refined is called a “location to be refined”, the position of each character string that matches a refining expression is called a “refining location”, and the position of each character string that matches a situation-limited word is called a “situation-limited location”. Note that even when the character strings are identical, occurrences at different positions in the document are treated as separate locations to be refined, refining locations, or situation-limited locations.
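The position recording described above can be sketched as follows. This is only an illustration under stated assumptions: positions are reduced to a 1-based line number and a character offset, matching is plain substring search, and the function name `find_locations` and the three-line sample document are hypothetical, not taken from the patent.

```python
from collections import defaultdict


def find_locations(lines, expressions):
    """Return {expression: [(line_number, char_offset), ...]} for every match.

    Line numbers are 1-based. Each occurrence is recorded as a separate
    location even when the matched character string is identical.
    """
    locations = defaultdict(list)
    for line_no, text in enumerate(lines, start=1):
        for expr in expressions:
            start = text.find(expr)
            while start != -1:
                locations[expr].append((line_no, start))
                start = text.find(expr, start + 1)
    return dict(locations)


# Hypothetical three-line document, for illustration only.
doc = [
    "The search system returns results.",
    "Required performance: under one second.",
    "The search system logs each query; the search system is audited.",
]
found = find_locations(doc, ["search system", "performance"])
for expr, positions in found.items():
    print(expr, positions)
```

Recording a page number, sentence number, or cell coordinate instead of a line number would only change what is stored in each tuple.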
  • when a character string registered as an expression to be refined or a refining expression is part of a compound word in the input document, the word extraction unit 103 may regard the entire compound word as a found instance of that expression. For example, when the character string “ID” is registered as an expression to be refined in the detailed expression database 102, the word extraction unit 103 may regard compound words that contain “ID” in the input document, such as “user ID” and “product ID”, as found instances of the expression to be refined.
  • the co-occurrence presence/absence check unit 104 calculates the “minimum co-occurrence distance” between the expression to be refined and the refining expression based on the position information of the locations to be refined and the refining locations.
  • the “minimum co-occurrence distance” is a distance between an expression to be refined and the corresponding refining expression.
  • specifically, the minimum co-occurrence distance is the “distance” from a location to be refined to the refining location closest to it, whether before or after it.
  • the “distance” may be anything that can represent the separation between the two expressions in the document as a numerical value, such as the number of characters, the number of lines, the difference in sentence numbers, or the number of pages between the two expressions.
  • the co-occurrence presence/absence check unit 104 also calculates the minimum co-occurrence distance between the location to be refined and the situation-limited location.
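A minimal sketch of the minimum co-occurrence distance follows, assuming the distance is measured in lines and each position is a line number. The function name and the sample line numbers are hypothetical. For each position of the expression that needs detail, the distance to the nearest position of the expression that supplies the detail is taken, looking both before and after.

```python
def min_cooccurrence_distances(target_lines, refining_lines):
    """For each target line, return the distance in lines to the nearest
    refining line (before or after). None means no refining line exists."""
    return [
        min((abs(t - r) for r in refining_lines), default=None)
        for t in target_lines
    ]


# Hypothetical line numbers, for illustration only.
distances = min_cooccurrence_distances([3, 10, 20], [4, 18])
print(distances)  # [1, 6, 2]
```

The same function applies to other distance units (characters, sentences, pages) as long as positions are expressed as comparable integers.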
  • the co-occurrence range setting unit 105 determines an “appropriate co-occurrence range” between the expression to be refined and the refining expression based on the minimum co-occurrence distance of each location to be refined.
  • the “appropriate co-occurrence range” is the range of positions in which the refining expression should appear relative to the expression to be refined.
  • for example, the co-occurrence range setting unit 105 builds, for each pair of an expression to be refined and a refining expression, a histogram of the appearance frequency of the minimum co-occurrence distance, and determines the minimum co-occurrence distance with the highest appearance frequency as the appropriate co-occurrence range for that pair.
  • the co-occurrence range setting unit 105 may instead determine the appropriate co-occurrence range for each pair of an expression to be refined and a refining expression based on the distribution of the distances from the locations to be refined to the refining locations. If the refining expression never appears for the expression to be refined, the co-occurrence range setting unit 105 determines “none” as the appropriate co-occurrence range.
  • the co-occurrence range setting unit 105 may also determine the minimum value, the maximum value, or the average value of the minimum co-occurrence distances as the appropriate co-occurrence range. Note that the wider the appropriate co-occurrence range is set, the fewer locations to be refined are determined to be insufficiently detailed.
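The range-setting options described above can be sketched as one function with a selectable strategy. Several points are assumptions made for this illustration: the function and strategy names are hypothetical, a tie in the histogram is broken by taking the smaller distance, `None` stands in for “none”, and the sample distances are illustrative values.

```python
from collections import Counter
from statistics import mean


def appropriate_range(distances, strategy="mode"):
    """Pick an appropriate co-occurrence range from minimum co-occurrence
    distances. Returns None when no distance was observed ("none")."""
    observed = [d for d in distances if d is not None]
    if not observed:
        return None
    if strategy == "mode":
        counts = Counter(observed)  # histogram of distances
        top = max(counts.values())
        # Assumption: ties are resolved in favour of the smaller distance.
        return min(d for d, c in counts.items() if c == top)
    if strategy == "min":
        return min(observed)
    if strategy == "max":
        return max(observed)
    if strategy == "mean":
        return mean(observed)
    raise ValueError(f"unknown strategy: {strategy}")


sample = [0, 0, 0, 5, 7]  # illustrative distances, in lines
print(appropriate_range(sample))          # 0
print(appropriate_range(sample, "max"))   # 7
```

Choosing `"max"` yields the widest range and therefore flags the fewest locations, which matches the remark that a wider range reduces the number of locations judged insufficient.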
  • the detail shortage detection unit 106 detects, for each pair of an expression to be refined and a refining expression, the locations to be refined at which the detail is insufficient (hereinafter referred to as “insufficient-detail locations”), based on the appropriate co-occurrence range determined by the co-occurrence range setting unit 105 and a “detail shortage detection rule”.
  • the “detail shortage detection rule” is a rule that defines when the detail is judged to be sufficient (or insufficient), for example, that the detail is not insufficient if a location to be refined and a refining location co-occur within the appropriate co-occurrence range.
  • the detail shortage detection rule is, for example, a rule that judges the detail to be insufficient for each location to be refined with which no refining location co-occurs within the appropriate co-occurrence range.
  • when the appropriate co-occurrence range is “none”, the detail shortage detection unit 106 regards all the locations to be refined that correspond to the relevant pair of an expression to be refined and a refining expression as insufficient-detail locations.
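The example rule above (flag a position when nothing that supplies the detail falls within the appropriate co-occurrence range) can be sketched as follows. Positions are line numbers, the function name and sample values are hypothetical, and treating a range of `None` (“none”) as flagging every position is an assumption drawn from reading this passage.

```python
def detect_insufficient(target_lines, refining_lines, allowed_range):
    """Return the target lines with no refining line within allowed_range."""
    if allowed_range is None:
        # Assumption: with no usable range, every target location is flagged.
        return list(target_lines)
    return [
        t for t in target_lines
        if not any(abs(t - r) <= allowed_range for r in refining_lines)
    ]


# Hypothetical line numbers, for illustration only.
flagged = detect_insufficient([3, 10, 20], [4, 18], allowed_range=1)
print(flagged)  # [10, 20]
```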
  • the detail shortage detection means 106 may treat each compound word that includes the expression to be refined as a variation of that expression.
  • in this case, the detail shortage detection rule may be a rule that judges a variation not to be insufficiently detailed if a refining location co-occurs within the appropriate co-occurrence range for at least one of the locations to be refined corresponding to that variation.
  • the detail shortage detection means 106 may perform detection of insufficient-detail locations when a location to be refined and the situation-limited location corresponding to the expression to be refined co-occur within a predetermined co-occurrence range.
  • alternatively, the detail shortage detection means 106 may perform detection of insufficient-detail locations when a situation-limited word co-occurs within the appropriate co-occurrence range of the expression to be refined and the refining expression set by the co-occurrence range setting means 105.
  • the output means 107 outputs the insufficient-detail locations extracted by the detail shortage detection means 106, for example, in a form that the user can identify.
  • the output mode is, for example, a list display recognizable by the user, information provision to an external device, or the like.
  • the output unit 107 may output every location to be refined in a form that lets the user tell whether or not it was judged to be an insufficient-detail location.
  • for example, the output unit 107 may vary the color, font, line thickness, and the like in the output between the locations to be refined that are insufficiently detailed and those that are not.
  • FIG. 2 is a flowchart showing the operation of the document data processing apparatus 10 in the first embodiment of the present invention.
  • the document input unit 101 inputs a document (input document) that is a target for detection of insufficient detail (step S101).
  • the word extraction unit 103 detects, from the input document, the locations to be refined and the refining locations that match the expressions to be refined and the refining expressions stored in the detailed expression database 102 (step S102).
  • the co-occurrence presence check unit 104 calculates the minimum co-occurrence distance for each pair of a location to be refined and a refining location extracted by the word extraction unit 103 (step S103).
  • the co-occurrence range setting means 105 determines, for each pair of an expression to be refined and a refining expression, an appropriate co-occurrence range based on the minimum co-occurrence distances between the locations to be refined and the refining locations (step S104).
  • the detail shortage detection means 106 detects insufficient-detail locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and the detail shortage detection rule (step S105).
  • the output means 107 outputs the insufficient-detail locations detected by the detail shortage detection means 106 (step S106).
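Steps S101 to S106 can be strung together in a compact sketch. Everything here is a simplification under stated assumptions: the document is a list of lines, matching is substring search at line granularity, the range is the most frequent minimum distance (ties go to the smaller value), and the function name `run_pipeline` and the ten-line sample document are hypothetical.

```python
from collections import Counter


def run_pipeline(lines, target_expr, refining_expr):
    """Return (appropriate_range, flagged_line_numbers) for one expression pair."""
    # S102: locate both expressions (1-based line numbers).
    targets = [i for i, text in enumerate(lines, 1) if target_expr in text]
    refiners = [i for i, text in enumerate(lines, 1) if refining_expr in text]
    # S103: minimum co-occurrence distance per target location.
    distances = [
        min((abs(t - r) for r in refiners), default=None) for t in targets
    ]
    # S104: most frequent distance becomes the appropriate range.
    observed = [d for d in distances if d is not None]
    if observed:
        counts = Counter(observed)
        top = max(counts.values())
        allowed = min(d for d, c in counts.items() if c == top)
    else:
        allowed = None  # corresponds to "none"
    # S105: flag targets whose nearest refining location is out of range.
    flagged = [
        t for t, d in zip(targets, distances) if allowed is None or d > allowed
    ]
    return allowed, flagged


# S101: hypothetical input document, for illustration only.
document = [
    "The search system indexes documents.",
    "Required performance: one second per query.",
    "Operators monitor the search system daily.",
    "Logs are kept for a year.",
    "Backups run nightly.",
    "The search system is restarted weekly.",
    "Backup performance is measured monthly.",
    "Reports are archived.",
    "Audits happen yearly.",
    "The search system is audited too.",
]
allowed, flagged = run_pipeline(document, "search system", "performance")
# S106: output the result.
print(f"range: {allowed} line(s); insufficient detail at lines {flagged}")
```

In this sample the distances are 1, 1, 1, and 3 lines, so the range becomes one line and only the occurrence on line 10 is reported.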
  • FIG. 3 is a diagram for explaining a specific example of the detailed table T1 in the first embodiment of the present invention.
  • the detailed expression database 102 stores the detailed table T1.
  • FIG. 4 is a diagram for explaining a specific example of the distribution of the minimum co-occurrence distance in the first embodiment of the present invention.
  • the refinement table T1 can store an expression to be refined C1, a refining expression C2, and a situation-limited word C3. However, since no situation-limited word is specified in the refinement table T1, the situation-limited word C3 column is blank.
  • the detailed expression database 102 stores in advance at least the refinement table T1, in which the expression to be refined C1 and the refining expression C2 are associated with each other.
  • the document input unit 101 inputs a document that is a target for detection of insufficient detail (step S101 in FIG. 2).
  • the word extraction means 103 refers to the refinement table T1 and detects, from the input document, the locations to be refined and the refining locations that match the stored expression to be refined C1 “search system” and the stored refining expression C2 “performance”, respectively (step S102 in FIG. 2).
  • the co-occurrence presence check means 104 calculates the minimum co-occurrence distance for each pair of a location to be refined corresponding to “search system” and a refining location corresponding to “performance” (step S103 in FIG. 2). When there are a plurality of refining locations corresponding to “performance”, the co-occurrence presence check means 104 takes the distance between the location to be refined and the refining location closest to it as the minimum co-occurrence distance.
  • the co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence presence check means 104, between the locations to be refined corresponding to “search system” and the refining locations corresponding to “performance” (step S104 in FIG. 2).
  • the minimum co-occurrence distance of “one line” has the highest frequency.
  • accordingly, the co-occurrence range setting unit 105 determines “one line” as the appropriate co-occurrence range.
  • the detail shortage detection means 106 detects insufficient-detail locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and the detail shortage detection rule (step S105 in FIG. 2).
  • here, a case is described in which the detail shortage detection rule is the rule “judge the detail to be insufficient for each location to be refined with which no refining location co-occurs within the appropriate co-occurrence range”. In this case, if there is no refining location corresponding to “performance” within one line before or after a location to be refined corresponding to “search system”, the detail shortage detection means 106 detects that location to be refined as an insufficient-detail location.
  • the output means 107 outputs the insufficient detail location detected by the detail lack detection means 106 (step S106 in FIG. 2).
  • FIG. 5 is a diagram for explaining a specific example of the detailed table T2 in the first embodiment of the present invention.
  • FIG. 6 is a diagram for explaining a specific example of the input document D1 in the first embodiment.
  • the expression to be refined C4 “csv” appears at positions P1, P2, P3, and P4.
  • “(page break)” indicates a symbol indicating page break
  • “:” and “(omitted)” indicate that a part of the document is omitted.
  • the document input unit 101 inputs the input document D1 that is a target for detection of insufficient detail (step S101 in FIG. 2).
  • the word extraction means 103 refers to the refinement table T2 and detects, from the input document D1, the locations that match the stored expression to be refined C4 “csv”, the stored refining expression C5 “character code”, and the stored situation-limited words C6 “input” and “output” (step S102 in FIG. 2).
  • the locations to be refined corresponding to the expression to be refined C4 “csv” are P1, P2, P3, and P4.
  • the co-occurrence presence/absence check unit 104 calculates the minimum co-occurrence distance for each pair of a location to be refined corresponding to “csv” and a refining location corresponding to “character code” (step S103 in FIG. 2). If there are a plurality of refining locations corresponding to “character code”, the co-occurrence presence check means 104 takes the distance between the location to be refined and the refining location closest to it as the minimum co-occurrence distance.
  • the co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence presence check means 104, between the locations to be refined corresponding to “csv” and the refining locations corresponding to “character code” (step S104 in FIG. 2). In the distribution of the minimum co-occurrence distance between the expression to be refined C4 “csv” and the refining expression C5 “character code”, the minimum co-occurrence distance with the highest appearance frequency is “one line”.
  • accordingly, the co-occurrence range setting unit 105 determines “one line” as the appropriate co-occurrence range between the expression to be refined C4 “csv” and the refining expression C5 “character code”.
  • the detail shortage detection means 106 detects insufficient-detail locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and the detail shortage detection rule (step S105 in FIG. 2).
  • here, a case is described in which the detail shortage detection rule is the rule “for each location to be refined, judge the detail to be insufficient if a situation-limited word co-occurs on the same page and no refining location co-occurs within the appropriate co-occurrence range”. In this case, when a situation-limited location corresponding to “input” or “output” co-occurs on the same page as a location to be refined corresponding to “csv”, and there is no refining location corresponding to “character code” within one line before or after it, the detail shortage detection means 106 detects that location to be refined as an insufficient-detail location.
  • the location to be refined P1 is not an insufficient-detail location, because a situation-limited location corresponding to “input” co-occurs on the same page and a refining location corresponding to “character code” co-occurs within one line before or after it.
  • the location to be refined P2 is an insufficient-detail location, because a situation-limited location corresponding to “input” co-occurs on the same page and no refining location corresponding to “character code” co-occurs within one line before or after it.
  • the location to be refined P4 is not an insufficient-detail location, because no situation-limited location corresponding to “input” or “output” co-occurs on the same page.
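The rule with situation-limited words can be sketched as below. The page and line coordinates are hypothetical, chosen only so that P1, P2, and P4 reproduce the outcomes stated above; P3 is left out because the passage does not state its outcome. Line numbers are assumed to run continuously through the document, and the function name is illustrative.

```python
def detect_with_situation_words(targets, refiners, situations, allowed_lines):
    """Flag a target when a situation-limited word is on the same page and
    no refining location lies within allowed_lines of it.

    targets: {label: (page, line)}; refiners, situations: [(page, line), ...]
    """
    flagged = []
    for label, (page, line) in targets.items():
        situation_on_page = any(p == page for p, _ in situations)
        detail_nearby = any(abs(l - line) <= allowed_lines for _, l in refiners)
        if situation_on_page and not detail_nearby:
            flagged.append(label)
    return flagged


# Hypothetical coordinates (page, line), for illustration only.
csv_locations = {"P1": (1, 3), "P2": (2, 40), "P4": (4, 120)}
character_code_locations = [(1, 4)]           # one line after P1
input_output_locations = [(1, 2), (2, 41)]    # none on page 4
flagged = detect_with_situation_words(
    csv_locations, character_code_locations, input_output_locations, 1
)
print(flagged)  # ['P2']
```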
  • the output means 107 outputs the insufficient detail location detected by the detail lack detection means 106 (step S106 in FIG. 2).
  • according to the document data processing apparatus 10 of the present embodiment, when an expression relating to a predetermined subject exists in a document, it is possible to determine whether or not an expression relating to a predetermined matter to be described in relation to that subject is described within an appropriate range. The reason is that the document data processing apparatus 10 determines the appropriate co-occurrence range based on the distribution of the minimum co-occurrence distance between the locations to be refined and the refining locations, and determines whether there is an insufficient-detail location based on the appropriate co-occurrence range.
  • here, the expression relating to the predetermined subject can be rephrased as the expression to be refined.
  • the expression relating to the predetermined matter to be described in relation to the predetermined subject can be rephrased as the refining expression.
  • whether or not these are described within an appropriate range can be rephrased as the presence or absence of an insufficient-detail location.
  • the document data processing apparatus 10 sets the appropriate co-occurrence range for such an enormous input document based on the distribution of the minimum co-occurrence distance between the locations to be refined and the refining locations. If any location to be refined has no refining location within the appropriate co-occurrence range, it is determined that there is an insufficient-detail location. For this reason, in the document data processing apparatus 10 of this embodiment, there exists an effect that the presence or absence of a refinement
  • the configuration of the document data processing apparatus in the present embodiment is the same as the configuration of the document data processing apparatus 10 in the first embodiment.
  • when a character string registered as an expression to be refined or a refining expression is part of a compound word in the input document, the document data processing apparatus 10 regards the entire compound word as a found instance of the expression to be refined or the refining expression.
  • FIG. 7 is a diagram for explaining a specific example of the detailed table T3 in the second embodiment of the present invention.
  • the detailed expression database 102 stores the detailed table T3 in advance.
  • the refinement table T3 stores “ID” as the expression to be refined C7 and “unchangeable” as the refining expression C8.
  • the situation limited word C9 is blank.
  • FIG. 8 is a diagram for explaining a specific example of the input document D2 in the second embodiment of the present invention.
  • the expression to be refined C7 “ID” appears at positions P5, P6, P7, P8, and P9.
  • Document input means 101 inputs input document D2 (step S101 in FIG. 2).
  • the word extraction means 103 detects, from the input document D2, the locations to be refined and the refining locations corresponding to the expression to be refined C7 “ID” and the refining expression C8 “unchangeable” stored in the refinement table T3 (step S102 in FIG. 2).
  • here, the input document D2 contains the compound words “user ID”, “product ID”, “store ID”, and “order ID”, each of which includes the expression to be refined C7 “ID”.
  • the word extraction means 103 therefore detects each entire compound word as a location to be refined. That is, the word extraction unit 103 detects, in the input document D2, the location to be refined P5 corresponding to “user ID”, the location to be refined P6 corresponding to “product ID”, the locations to be refined P7 and P9 corresponding to “store ID”, and the location to be refined P8 corresponding to “order ID”.
  • the co-occurrence presence check means 104 calculates the minimum co-occurrence distance between the refined location and the refined location for each pair of the refined location corresponding to “ID” and the refined location corresponding to “unchangeable”. Calculate (step S103 in FIG. 2). When there are a plurality of refinement locations corresponding to “unchangeable”, the co-occurrence presence check unit 104 determines the minimum co-occurrence of the distance between the refinement location and the refinement location closest to the refinement location. Calculate as distance.
  • The co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence presence / absence check means 104, between the refined locations corresponding to “ID” and the refining locations corresponding to “cannot be changed” (step S104 in FIG. 2).
  • In the input document D2, a minimum co-occurrence distance of 0 lines (same line) appears 3 times (P5, P6, and P7), a minimum co-occurrence distance of 5 lines (from P8 to the line of P7) appears 1 time, and a minimum co-occurrence distance of 7 lines (from P9 to the line of P7), counting one blank line, appears 1 time. Therefore, the co-occurrence range setting means 105 determines a minimum co-occurrence distance of 0 lines as the appropriate co-occurrence range between the refined expression C7 “ID” and the refining expression C8 “cannot be changed”.
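Choosing the most frequent minimum co-occurrence distance can be sketched with a frequency count. The function name is hypothetical, and breaking ties toward the smallest distance is an assumption made for this sketch only.

```python
from collections import Counter


def appropriate_range(min_distances):
    """Return the minimum co-occurrence distance with the highest frequency.

    Assumption: when several distances share the highest frequency,
    the smallest of them is used.
    """
    counts = Counter(min_distances)
    highest = max(counts.values())
    return min(d for d, n in counts.items() if n == highest)


# Distances for P5..P9: three at 0 lines, one at 5 lines, one at 7 lines.
result = appropriate_range([0, 0, 0, 5, 7])  # 0
```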
  • The detail shortage detection means 106 detects locations with insufficient detail based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and a detail shortage detection rule (step S105 in FIG. 2). Consider the case where the detail shortage detection rule is: “for each variation of a compound word including the refined expression, if a refining location co-occurs within the appropriate co-occurrence range for at least one refined location corresponding to that variation, the variation is determined not to be insufficiently detailed”. In this case, if no refined location corresponding to a specific compound word including “ID” has a refining location corresponding to “cannot be changed” within 0 lines (on the same line), the detail shortage detection means 106 detects those refined locations as locations with insufficient detail.
  • Among the refined locations P5 to P9, the refined locations P5, P6, and P7 are not locations with insufficient detail, because a refining location corresponding to “cannot be changed” exists on the same line.
  • The refined location P8 (“order ID”) has no refining location corresponding to “cannot be changed” on the same line, and no refined location other than P8 corresponds to “order ID”, so P8 is a location with insufficient detail.
  • The refined location P9 has no refining location corresponding to “cannot be changed” on the same line, but the refined location P7, which also corresponds to “store ID”, is not a location with insufficient detail, so P9 is not one either.
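The per-variation rule can be sketched as below. The dictionaries mirror the P5 to P9 example (three distances of 0 lines, one of 5, one of 7); the function name and data layout are assumptions for illustration.

```python
def detect_shortages(compound_at, distance_at, max_distance):
    """Return the positions flagged as insufficiently detailed.

    A compound-word variation is considered covered as soon as one of its
    positions has "cannot be changed" within max_distance lines.
    """
    covered = {
        word
        for pos, word in compound_at.items()
        if distance_at[pos] <= max_distance
    }
    return [pos for pos, word in compound_at.items() if word not in covered]


compound_at = {"P5": "user ID", "P6": "product ID", "P7": "store ID",
               "P8": "order ID", "P9": "store ID"}
distance_at = {"P5": 0, "P6": 0, "P7": 0, "P8": 5, "P9": 7}

flagged = detect_shortages(compound_at, distance_at, max_distance=0)
# flagged == ["P8"]; P9 is excused because P7 covers "store ID".
```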
  • the configuration of the document data processing apparatus in the present embodiment is the same as that of the document data processing apparatus 10 in the second embodiment.
  • the detailed expression database 102 stores in advance the same detailed table T3 as in the second embodiment.
  • The co-occurrence presence / absence check means 104 and the co-occurrence range setting means 105 determine the appropriate co-occurrence range while distinguishing the direction in which the refining expression lies as seen from the refined expression.
  • FIG. 9 is a diagram for explaining a specific example of the input document D3 in the third embodiment of the present invention.
  • The input document D3 is a document containing a table in which, for each of “user ID”, “product ID”, “store ID”, and “order ID” (the values in the “item” column C11), the values of the “attribute” column C10 and the “remarks” column C12 are described.
  • the values of the “attribute” of the items “user ID”, “product ID”, and “store ID” are “non-overlapping” and “cannot be changed”.
  • the value of the “attribute” of the item “order ID” is “non-overlapping”.
  • the values of “Remarks” in the items “User ID”, “Product ID”, and “Store ID” are blank.
  • the value of “Remarks” in the item “Order ID” is “Product ID cannot be changed even if the product is changed”.
  • The input document D3 includes, as values in the “item” column C11, the compound words “user ID”, “product ID”, “store ID”, and “order ID”, each of which contains the refined expression C7 “ID”. The refined expression C7 “ID” is associated with the refining expression C8 “cannot be changed” in the detailed table T3.
  • The word extraction means 103 detects, in the input document D3, the refined locations corresponding to the refined expression C7 “ID” and the refining locations corresponding to the refining expression C8 “cannot be changed”, both stored in the detailed table T3 (step S102 in FIG. 2). For the compound words “user ID”, “product ID”, “store ID”, and “order ID” that include the refined expression C7 “ID”, the word extraction means 103 detects the entire compound word as a refined location. That is, the word extraction means 103 detects the second row of the “item” column C11 as the refined location corresponding to “user ID” in the input document D3.
  • The word extraction means 103 detects the third row of the “item” column C11 and the fifth row of the “remarks” column C12 as the refined locations corresponding to “product ID” in the input document D3. Further, the word extraction means 103 detects the fourth row of the “item” column C11 as the refined location corresponding to “store ID”, and the fifth row of the “item” column C11 as the refined location corresponding to “order ID”. In addition, the word extraction means 103 detects the second, third, and fourth rows of the “attribute” column C10 as the refining locations corresponding to “cannot be changed” in the input document D3.
  • For each pair of a refined location corresponding to “ID” and a refining location corresponding to “cannot be changed” in the input document D3, the co-occurrence presence / absence check means 104 calculates the minimum co-occurrence distance between the two (step S103 in FIG. 2). When there are multiple refining locations corresponding to “cannot be changed”, the co-occurrence presence / absence check means 104 takes the distance between the refined location and the refining location closest to it as the minimum co-occurrence distance. In doing so, the co-occurrence presence / absence check means 104 distinguishes the direction of the refining expression as seen from the refined expression.
  • For the refined locations corresponding to “user ID”, “product ID”, and “store ID”, the co-occurrence presence / absence check means 104 detects a refining location corresponding to “cannot be changed” at a minimum co-occurrence distance of 0 rows (the same row), in column C10, which is to the left of column C11 containing the refined location.
  • For the refined location corresponding to “order ID”, the co-occurrence presence / absence check means 104 detects a refining location corresponding to “cannot be changed” at a minimum co-occurrence distance of 0 rows (the same row), in column C12, which is to the right of column C11 containing the refined location.
  • The co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence presence / absence check means 104, between the refined locations corresponding to “ID” and the refining locations corresponding to “cannot be changed” (step S104 in FIG. 2).
  • In addition to the minimum co-occurrence distance, the co-occurrence range setting means 105 distinguishes the direction in which the refining location co-occurs as seen from the refined location. In the input document D3, the distribution of the minimum co-occurrence distance between the refined expression C7 “ID” and the refining expression C8 “cannot be changed” shows that the refining expression most often appears on the left side at 0 rows (the same row).
  • The co-occurrence range setting means 105 therefore determines “0 rows on the left” as the appropriate co-occurrence range between the refined expression C7 “ID” and the refining expression C8 “cannot be changed”.
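The direction-aware distribution can be sketched for a table as follows. The rows are a hypothetical reconstruction of the table in D3 from the description above (FIG. 9 itself is not reproduced here), and the names `COLUMNS`, `ROWS`, and `nearest_detail` are assumptions. Only cells in the “item” column are examined, and distance is counted in rows.

```python
from collections import Counter

COLUMNS = ["attribute", "item", "remarks"]  # C10, C11, C12 from left to right
DETAIL = "cannot be changed"

# Hypothetical reconstruction of the table body described for D3.
ROWS = [
    {"attribute": "non-overlapping, cannot be changed", "item": "user ID", "remarks": ""},
    {"attribute": "non-overlapping, cannot be changed", "item": "product ID", "remarks": ""},
    {"attribute": "non-overlapping, cannot be changed", "item": "store ID", "remarks": ""},
    {"attribute": "non-overlapping", "item": "order ID",
     "remarks": "Product ID cannot be changed even if the product is changed"},
]


def nearest_detail(rows, row_idx, item_col="item"):
    """Return (row_distance, direction) of the nearest cell containing DETAIL."""
    item_pos = COLUMNS.index(item_col)
    best = None
    for r, row in enumerate(rows):
        for c, col in enumerate(COLUMNS):
            if c == item_pos or DETAIL not in row[col]:
                continue
            candidate = (abs(r - row_idx), "left" if c < item_pos else "right")
            # On equal distance the first candidate found is kept (assumption).
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best


distribution = Counter(nearest_detail(ROWS, i) for i in range(len(ROWS)))
appropriate = distribution.most_common(1)[0][0]  # (0, "left")
```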
  • The detail shortage detection means 106 detects locations with insufficient detail based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and a detail shortage detection rule (step S105 in FIG. 2). Here, the appropriate co-occurrence range also distinguishes the direction in which the refining location co-occurs relative to the refined location.
  • The detail shortage detection rule is the rule “a refined location is determined to be insufficiently detailed if no refining location co-occurs with it within the appropriate co-occurrence range”.
  • The detail shortage detection means 106 determines that a refined location corresponding to “ID” is not insufficiently detailed when a refining location corresponding to “cannot be changed” co-occurs on its left side within 0 rows.
  • Accordingly, the detail shortage detection means 106 determines that the refined locations corresponding to “user ID”, “product ID”, and “store ID” in the input document D3 are not insufficiently detailed.
  • The detail shortage detection means 106 determines that the refined location corresponding to “order ID” in the input document D3 is insufficiently detailed.
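The direction-aware check can be sketched as a simple predicate. The `observed` values restate the distances and directions described for D3; the function name and the tuple layout are assumptions, and the sketch only looks at the single nearest occurrence of “cannot be changed” for each item.

```python
def lacks_detail(observed, allowed):
    """observed: (row_distance, direction) of the nearest "cannot be changed",
    or None when there is none. allowed: (max_row_distance, direction)."""
    if observed is None:
        return True
    distance, direction = observed
    max_distance, allowed_direction = allowed
    return not (direction == allowed_direction and distance <= max_distance)


observed = {
    "user ID": (0, "left"),
    "product ID": (0, "left"),
    "store ID": (0, "left"),
    "order ID": (0, "right"),
}

flagged = [item for item, obs in observed.items()
           if lacks_detail(obs, allowed=(0, "left"))]
# flagged == ["order ID"]
```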
  • FIG. 10 is a block diagram showing an example of the configuration of the document data processing apparatus 11 in the fourth embodiment of the present invention.
  • the document data processing apparatus 11 includes a co-occurrence presence / absence check unit 114, a co-occurrence range setting unit 115, and a detail shortage detection unit 116.
  • For each character string related to a predetermined subject in the input document (refined expression), the co-occurrence presence / absence check unit 114 stores the distribution of the distance (minimum co-occurrence distance distribution) to the closest character string related to a predetermined matter that should be described in relation to that subject (refining expression).
  • Based on the minimum co-occurrence distance distribution, the co-occurrence range setting unit 115 determines an appropriate co-occurrence range between the refined expression and the refining expression in the input document.
  • When no refining expression exists within the appropriate co-occurrence range determined by the co-occurrence range setting unit 115 in the input document, the detail shortage detection unit 116 detects that the predetermined matter that should be described in relation to the predetermined subject is not described within an appropriate range.
  • FIG. 11 is a block diagram showing an example of the configuration of the document data processing apparatus 12 in the fifth embodiment of the present invention.
  • Based on a given minimum co-occurrence distance distribution, the document data processing apparatus 12 determines whether an expression concerning a predetermined matter that should be described in relation to a predetermined subject (refining expression) is described within an appropriate range in the input document.
  • The minimum co-occurrence distance distribution is the distribution of the distance from each character string related to the predetermined subject in a document (refined expression) to the refining expression closest to it.
  • The minimum co-occurrence distance distribution is output, for example, when a document data analysis apparatus 13 having the co-occurrence presence / absence check unit 114 of the fourth embodiment analyzes a reference document.
  • The reference document is a document that has the refined expression and the refining expression in common with the input document.
  • the document data processing apparatus 12 includes a co-occurrence range setting unit 125 and a detail shortage detection unit 126.
  • The co-occurrence range setting means 125 determines an appropriate co-occurrence range between the refined expression and the refining expression in the input document based on the given minimum co-occurrence distance distribution.
  • The appropriate co-occurrence range is, for example, the range in which the distance from the refined location is at least 0 and at most the distance with the highest appearance frequency in the minimum co-occurrence distance distribution.
  • Alternatively, the appropriate co-occurrence range is, for example, the range in which the distance from the refined location equals the distance with the highest appearance frequency in the minimum co-occurrence distance distribution.
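The two example definitions of the appropriate co-occurrence range differ only in the comparison applied to the most frequent distance. A minimal sketch, with hypothetical function names and a made-up distribution:

```python
from collections import Counter


def most_frequent_distance(distribution):
    """Distance with the highest appearance frequency."""
    return Counter(distribution).most_common(1)[0][0]


def within_up_to_mode(distance, distribution):
    """First definition: 0 <= distance <= most frequent distance."""
    return 0 <= distance <= most_frequent_distance(distribution)


def within_exactly_mode(distance, distribution):
    """Second definition: distance == most frequent distance."""
    return distance == most_frequent_distance(distribution)


# Made-up distribution whose most frequent distance is 2.
sample = [2, 2, 2, 0, 5]
```

With this sample, a distance of 0 is inside the range under the first definition but outside it under the second, while a distance of 5 is outside under both.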
  • When no refining expression exists within the appropriate co-occurrence range determined by the co-occurrence range setting means 125 in the input document, the detail shortage detection unit 126 detects that the predetermined matter that should be described in relation to the predetermined subject is not described within an appropriate range.
  • The document data processing apparatus 12 of the present embodiment can use a minimum co-occurrence distance distribution obtained by analyzing a reference document different from the input document.
  • The reference document may also be the same document as the input document. According to the document data processing apparatus 12 of the present embodiment, therefore, the minimum co-occurrence distance distribution of a reference document that is more suitable than the input document can be used to determine whether the refining expression is described within an appropriate range.
  • Note that once the minimum co-occurrence distance distribution of the reference document has been created, it can be reused any number of times. This eliminates the need to calculate the minimum co-occurrence distance distribution and determine the appropriate co-occurrence range for each input document.
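Reusing a distribution computed once from a reference document can be sketched as a save-and-load round trip. JSON is an assumption chosen for this sketch; the source does not specify any storage format, and the function names are hypothetical.

```python
import json
from collections import Counter


def dump_distribution(distribution):
    """Serialize a distance -> frequency mapping (JSON keys must be strings)."""
    return json.dumps({str(d): n for d, n in distribution.items()})


def load_distribution(text):
    """Restore the mapping, converting the keys back to integers."""
    return Counter({int(d): n for d, n in json.loads(text).items()})


reference = Counter([0, 0, 0, 5, 7])      # computed once from a reference document
stored = dump_distribution(reference)     # kept for later runs
restored = load_distribution(stored)      # reused for each new input document
```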
  • the document data processing apparatus in each embodiment described above may be realized by a dedicated apparatus, but can also be realized by a computer (information processing apparatus).
  • a computer information processing apparatus
  • the check unit 114, the co-occurrence range setting unit 115, the detail shortage detection unit 116, the co-occurrence range setting unit 125, and the detail shortage detection unit 126 can be regarded as functional (processing) units (software modules) of the software program.
  • FIG. 12 is a diagram illustrating an exemplary configuration of an information processing apparatus 1000 (computer) that can execute the document data processing apparatus 10 (11, 12) according to the embodiment of the present invention.
  • An information processing apparatus 1000 illustrated in FIG. 12 is a general computer in which the following configurations are connected via a bus 3008 (communication line).
  • The program is supplied when the drive device 3009 reads the recording medium 3010 on which the program is recorded.
  • Downloading the computer program via the communication I / F 3006 is also regarded as reading by the information processing apparatus 1000.
  • the computer program is read and interpreted by the CPU 3001 of the hardware and executed by the CPU 3001.
  • the computer program supplied to the apparatus may be stored in a volatile storage memory (RAM 3003) that can be read and written or a nonvolatile storage device such as the storage apparatus 3004.
  • the software program (computer program) can be regarded as constituting the present invention.
  • a computer-readable storage medium storing such a software program can also be understood as constituting the present invention.
  • a document data processing apparatus comprising: a detail insufficiency detecting means for detecting.
  • Appendix 2 The shortest distance is an appearance position of the second expression closest to the appearance position of the first expression among the appearance positions of the second expression before and after the appearance position of the first expression.
  • Appendix 3 The document data processing apparatus according to appendix 1 or appendix 2, further comprising co-occurrence presence / absence check means for recording the distribution in the first document.
  • Word extraction means for detecting the appearance position of the first expression and the appearance position of the second expression in the second document; 4.
  • the document data processing apparatus further comprising a detailed expression database that stores the first expression and the second expression in association with each other.
  • the first range is the shortest distance having the highest appearance frequency in the distribution, or the maximum value, the minimum value of the shortest distance having the highest appearance frequency when there are a plurality of shortest distances having the highest appearance frequency, or
  • the document data processing apparatus according to any one of Supplementary Note 1 to Supplementary Note 4, which includes an average value.
  • the detailing deficiency detection means causes the second expression to appear in any of the first ranges corresponding to the compound word.
  • the distribution is a direction of the appearance position of the second expression from the appearance position of the first expression, in addition to the information on the distance between the appearance position of the second expression and the appearance position of the first expression. Including further information
  • the document data processing apparatus according to any one of appendix 1 to appendix 6, wherein the co-occurrence range setting unit determines the first range based on distance and direction information included in the distribution.
  • the detailed shortage detection unit includes a predetermined third expression and the first expression appearing in a predetermined second range, and the second expression is the first document.
  • the predetermined matter to be described in relation to the subject is detected when it does not appear in a range, and is not described in an appropriate range.
  • Document data processing device (Appendix 9)
  • the co-occurrence range setting means determines the occurrence of the first synonym of the first expression or the second synonym of the second expression as the first expression or the 9.
  • the document data processing apparatus according to any one of supplementary notes 1 to 8, wherein the document data processing apparatus is regarded as an appearance of a second expression.
  • the document data processing apparatus according to any one of appendix 1 to appendix 9, further comprising: means. (Appendix 11) Distribution of the shortest distance between the appearance position of a predetermined first expression relating to a predetermined subject in the first document and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject. Based on the first position of the position where the second expression should appear relative to the position where the first expression appears in a second document that is the same document as the first document or a different document.
  • a document data processing method characterized by detecting. (Appendix 12) Distribution of the shortest distance between the appearance position of a predetermined first expression relating to a predetermined subject in the first document and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject. Based on the first position of the position where the second expression should appear relative to the position where the first expression appears in a second document that is the same document as the first document or a different document.
  • the second expression when the second expression does not appear in the first range, the predetermined matter to be described in relation to the subject is not described in an appropriate range.
  • a document data processing program that causes a computer to execute a detection process of insufficient detailing to be detected.
  • The present invention has been described above using the above-described embodiments as exemplary examples. However, the present invention is not limited to these embodiments. That is, various modes that can be understood by those skilled in the art can be applied within the scope of the present invention. This application claims priority based on Japanese Patent Application No. 2014-124850, filed on June 18, 2014, the entire disclosure of which is incorporated herein.

Abstract

[Problem] To determine, when an expression related to a predetermined subject is present within a document, whether or not an expression related to a predetermined item that should be described in association with the predetermined subject is described within an appropriate range. [Solution] A document data processing device is equipped with: a co-occurrence range setting means for determining a first range on the basis of the distribution in a first document with respect to minimum distance between the appearance position of a predetermined first expression related to a predetermined subject and the appearance position of a predetermined second expression related to a predetermined item that should be described in association with the subject, said first range being the range in which the second expression should be located relative to the appearance position of the first expression in a second document, which is the same document as the first document or a different document; and an insufficient details detection means for detecting that the predetermined item that should be described in association with the subject is not described within an appropriate range when the second expression does not appear within the first range in the second document.

Description

Document data processing apparatus, document data processing method, and recording medium
The present invention relates to a document data processing apparatus, a document data processing method, and a recording medium for evaluating whether the items described in a document are sufficient.
In recent years, systems have been developed that analyze an input document written in natural language or the like using an information processing apparatus and detect locations in the input document where the description is insufficient.
An example of a technique for detecting, in an input document, expressions that tend to be insufficiently described is disclosed in Patent Document 1.
The document data processing apparatus of Patent Document 1 includes input means, storage means, search means, and output means.
The document data processing apparatus of Patent Document 1 first inputs the document data to be processed (input document) through the input means.
Next, the document data processing apparatus of Patent Document 1 uses the search means to search the input document for predetermined expressions stored in the storage means (hereinafter referred to as “matching expressions”).
When a matching expression exists in the input document, the document data processing apparatus of Patent Document 1 reads the message associated with the matching expression from the storage means and outputs the message through the output means.
That is, the document data processing apparatus of Patent Document 1 stores expressions that tend to be insufficiently described in the storage means in advance as matching expressions, and thereby detects such expressions in the input document.
An example of a technique for detecting, in an input document, the absence of another expression that should be in a dependency relationship with a given expression is disclosed in Patent Document 2.
The data processing apparatus of Patent Document 2 includes input means, syntax analysis means, storage means, determination means, and output means.
The data processing apparatus of Patent Document 2 first inputs document data through the input means.
Next, the data processing apparatus of Patent Document 2 performs syntax analysis of the document data using the syntax analysis means.
Subsequently, the data processing apparatus of Patent Document 2 uses the determination means to determine whether the most basic single element is missing from the syntax tree produced by the syntax analysis. Based on that result, the determination means determines whether the document data lacks a clause needed to form a grammatical “sentence”.
The data processing apparatus of Patent Document 2 also stores in advance, in the storage means, the correspondence between expressions that should be in a dependency relationship.
Subsequently, when one of the expressions that should be in a dependency relationship is described in the document data, the data processing apparatus of Patent Document 2 determines whether the other expression is actually in a dependency relationship with it in the document data.
The data processing apparatus of Patent Document 2 then outputs, through the output means, the result of determining whether the expression that should be in the dependency relationship exists.
An example of a technique for detecting, in an input document, the absence of another word that should be described in relation to a given word is disclosed in Patent Document 3.
The data processing apparatus of Patent Document 3 includes storage means, input means, determination means, and output means.
The data processing apparatus of Patent Document 3 holds in advance, in the storage means, a text mining dictionary table in which a first word is registered and a related information table in which a second word that should be described in relation to the first word is registered.
The data processing apparatus of Patent Document 3 first inputs document data through the input means.
Next, the data processing apparatus of Patent Document 3 uses the determination means to determine whether the first word exists in the input document. When the first word exists in the input document, the determination means determines whether the second word exists in the input document. When the first word exists in the input document and the second word does not, the determination means determines that a description including the second word is missing from the input document.
The data processing apparatus of Patent Document 3 then outputs the determination result through the output means.
JP 2008-033887 A
Japanese Patent No. 5095128
JP 2007-310829 A
The document data processing apparatus of Patent Document 1 determines whether expressions that tend to be insufficiently described are present, but does not determine whether a description is actually insufficient in the input document. That is, the document data processing apparatus of Patent Document 1 has the problem that it cannot determine whether an insufficient description actually exists in the input document.
The data processing apparatus of Patent Document 2 determines that a description is insufficient when the input document data has a syntactic deficiency, or when an expression that should be in a dependency relationship with an expression registered in advance in the storage means is missing. Consequently, when one of the expressions that should be in a dependency relationship is absent from one sentence but present in another sentence, the apparatus determines that a description is insufficient. That is, when the expression that should be in the dependency relationship exists in a different sentence, the apparatus should determine that the description is not insufficient, yet it determines that it is. The data processing apparatus of Patent Document 2 thus has the problem that it cannot take multiple sentences in the input document into account when determining whether a description is insufficient.
The data processing apparatus of Patent Document 3 has the same problem as the data processing apparatus of Patent Document 2.
(Object of the invention)
The main object of the present invention is to provide a document data processing apparatus, a document data processing method, and a document data processing program that determine, when an expression relating to a predetermined subject exists in a document, whether an expression relating to a predetermined matter that should be described in relation to the predetermined subject is described within an appropriate range.
 本発明の文書データ処理装置は、第1の文書における、所定の主題に関する所定の第1の表現の出現位置と、主題に関連して記述されるべき所定の事項に関する所定の第2の表現の出現位置との最短距離の分布に基づいて、前記第1の文書と同じ文書か又は別の文書である第2の文書における、第1の表現の出現位置に対して第2の表現が出現すべき位置の第1の範囲を決定する共起範囲設定手段と、第2の文書において、第2の表現が第1の範囲に出現しない場合に、主題に関連して記述されるべき所定の事項が適切な範囲内に記述されていないことを検出する詳細化不足検出手段とを備えることを特徴とする。 The document data processing apparatus according to the present invention includes a position of an occurrence of a predetermined first expression relating to a predetermined subject in a first document and a predetermined second expression relating to a predetermined matter to be described in relation to the subject. Based on the distribution of the shortest distance from the appearance position, the second expression appears at the appearance position of the first expression in the second document that is the same document as the first document or a different document. A co-occurrence range setting means for determining the first range of the power position, and a predetermined matter to be described in relation to the subject when the second expression does not appear in the first range in the second document Is provided with detailing deficiency detection means for detecting that is not described in an appropriate range.
The document data processing method of the present invention comprises: determining, based on the distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document, which is either the same document as the first document or a different document; and detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
The document data processing program of the present invention causes a computer to execute: a co-occurrence range setting process of determining, based on the distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document, which is either the same document as the first document or a different document; and an insufficient-detailing detection process of detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
According to the present invention, when an expression relating to a predetermined subject exists in a document, it is possible to determine whether or not an expression relating to a predetermined matter to be described in relation to that subject is described within an appropriate range.
A block diagram showing an example of the configuration of the document data processing apparatus 10 according to the first embodiment of the present invention.
A flowchart showing the operation of the document data processing apparatus 10 according to the first embodiment of the present invention.
A diagram for explaining a specific example of the detailing-target table T1 in the first embodiment of the present invention.
A diagram for explaining a specific example of the distribution of the minimum co-occurrence distance in the first embodiment of the present invention.
A diagram for explaining a specific example of the detailing-target table T2 in the first embodiment of the present invention.
A diagram for explaining a specific example of the input document D1 in the first embodiment of the present invention.
A diagram for explaining a specific example of the detailing-target table T3 in the second embodiment of the present invention.
A diagram for explaining a specific example of the input document D2 in the second embodiment of the present invention.
A diagram for explaining a specific example of the input document D3 in the third embodiment of the present invention.
A block diagram showing an example of the configuration of the document data processing apparatus 11 according to the fourth embodiment of the present invention.
A block diagram showing an example of the configuration of the document data processing apparatus 12 according to the fifth embodiment of the present invention.
A block diagram showing an example of the configuration of an information processing apparatus for implementing the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings, equivalent components are denoted by the same reference numerals, and their description is omitted as appropriate.
First, terms used in common in the following description of the embodiments are explained.
An "input document" is a document (document information, document data) that is input, as a processing target, to the document data processing apparatus in each embodiment of the present invention.
In each embodiment of the present invention, a "document (input document)" may be, for example, any of the following documents containing character strings, symbols, tables, or the like.
・A document written in a specific language and containing one or more sentences (simple, complex, or compound sentences, etc.)
・A table or form in which multiple items are arranged in the row or column direction (for example, a form sheet created with Excel (registered trademark) from Microsoft Corporation of the United States)
・A document in which the above specific language and tables or forms are mixed (for example, instruction manuals for various products)
A "document" may further include figures and images.
"Detailing" means describing, in a document, the matters that should be explained in relation to a given subject.
The following terms are used in relation to "detailing".
・"Detailing-target expression": an expression indicating a subject for which related matters should be described
・"Detailing expression": an expression indicating a matter that should be described in relation to the subject
・"Situation-limiting word": an expression indicating a condition that limits the situations in which a "detailing-target expression" and a "detailing expression" should co-occur in a document
Each "expression" in the above "detailing-target expression", "detailing expression", and "situation-limiting word" is, for example, a noun or part of a noun. However, an "expression" is not limited to a noun or part of a noun, and may include any of characters (character strings), symbols (symbol strings), tables, forms, or figures, or a combination thereof.
The absence of at least one "detailing expression" corresponding to a "detailing-target expression" in a document is referred to as "insufficient detailing".
A plurality of "situation-limiting words" may be associated with one pair of a "detailing-target expression" and a "detailing expression".
(First Embodiment)
The configuration of the present embodiment will be described.
FIG. 1 is a block diagram showing an example of the configuration of the document data processing apparatus 10 according to the first embodiment of the present invention.
The document data processing apparatus 10 includes document input means 101, a detailing expression database 102, word extraction means 103, co-occurrence check means 104, co-occurrence range setting means 105, insufficient-detailing detection means 106, and output means 107.
The document input means 101 inputs, to the document data processing apparatus 10, the document in which insufficient detailing is to be detected (that is, the input document).
The detailing expression database 102 stores in advance a "detailing-target table" consisting at least of data that associates detailing-target expressions with detailing expressions. The detailing-target table may further contain situation-limiting words associated with a pair of a detailing-target expression and a detailing expression. The detailing expression database 102 may also store in advance synonyms and the like of detailing-target expressions or detailing expressions, so that such synonyms are treated as identical to the corresponding detailing-target expression or detailing expression.
The word extraction means 103 searches the input document for character strings that match each of the detailing-target expressions, detailing expressions, and situation-limiting words stored in advance in the detailing expression database 102. The word extraction means 103 then records the positions of the matching character strings in a storage device (not shown). The position of a character string is specified using a file name, page number, line number, sentence number, cell coordinates (cell number), character number, or the like.
Hereinafter, the position of each character string in the document that matches a detailing-target expression is called a "detailing-target location", the position of each character string that matches a detailing expression is called a "detailing location", and the position of each character string that matches a situation-limiting word is called a "situation-limiting location". Even when the matching character strings are identical, occurrences at different positions in the document are treated as separate detailing-target locations, detailing locations, or situation-limiting locations.
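The extraction and position-recording step described above can be sketched as follows. This is a minimal illustration, assuming that positions are recorded as a line number plus a character offset (only one of the position schemes the text allows) and that matching is plain substring search; the names `Location` and `find_locations` are hypothetical and do not appear in the source.

```python
from typing import NamedTuple


class Location(NamedTuple):
    expression: str  # registered expression that matched
    line: int        # 1-based line number in the document
    column: int      # 0-based character offset within the line


def find_locations(document: str, expressions: list[str]) -> list[Location]:
    """Record every occurrence of each registered expression.

    Identical strings at different positions yield separate locations.
    """
    found = []
    for line_no, text in enumerate(document.splitlines(), start=1):
        for expr in expressions:
            start = text.find(expr)
            while start != -1:
                found.append(Location(expr, line_no, start))
                start = text.find(expr, start + 1)
    return found


# Hypothetical three-line input document.
doc = (
    "The search system indexes files.\n"
    "Performance of the search system is measured.\n"
    "The search system logs queries."
)
locs = find_locations(doc, ["search system", "Performance"])
```

Here the three occurrences of "search system" are kept as three distinct locations, one per line, even though the matched string is the same.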
When a character string registered as a detailing-target expression or detailing expression appears in the input document as part of a compound word, the word extraction means 103 may regard the entire compound word as a found detailing-target expression or detailing expression. For example, when the character string "ID" is registered as a detailing-target expression in the detailing expression database 102, the word extraction means 103 may regard compound words containing "ID" in the input document, such as "user ID" and "product ID", as found detailing-target expressions.
The co-occurrence check means 104 calculates the "minimum co-occurrence distance" between a detailing-target expression and a detailing expression based on the position information of the detailing-target locations and the detailing locations. Here, the "minimum co-occurrence distance" is the distance between a detailing-target expression and its corresponding detailing expression. More precisely, it is the "distance" from a detailing-target location to the nearest of the detailing locations that precede or follow it. The "distance" may be any quantity that expresses numerically how far apart two expressions are within the document, such as the number of characters or lines between them, the difference in sentence numbers, or the number of pages. When a situation-limiting word is registered in the detailing expression database 102, the co-occurrence check means 104 also calculates the minimum co-occurrence distance between the detailing-target location and the situation-limiting location.
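The calculation can be sketched as below. This is only an illustration, assuming that the distance is measured as a difference in line numbers (the text equally allows characters, sentences, or pages); the function name `min_cooccurrence_distances` and the sample line numbers are hypothetical.

```python
def min_cooccurrence_distances(target_lines, detail_lines):
    """For each detailing-target location, return the distance in lines to
    the nearest detailing location before or after it.

    None is returned for a location when no detailing location exists.
    """
    result = []
    for t in target_lines:
        if detail_lines:
            result.append(min(abs(t - d) for d in detail_lines))
        else:
            result.append(None)
    return result


# Hypothetical line numbers of detailing-target and detailing locations.
distances = min_cooccurrence_distances([3, 10, 25], [4, 12, 13])
```

With these sample positions, the three detailing-target locations are 1, 2, and 12 lines away from their nearest detailing location.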
Based on the minimum co-occurrence distances calculated by the co-occurrence check means 104, the co-occurrence range setting means 105 determines the "appropriate co-occurrence range" between the detailing-target expression and the detailing expression for each detailing-target location. Here, the "appropriate co-occurrence range" is the range of positions at which the detailing expression should appear relative to the detailing-target expression.
For example, the co-occurrence range setting means 105 builds, for each pair of a detailing-target expression and a detailing expression, a histogram of the frequency of each minimum co-occurrence distance, and determines the most frequent minimum co-occurrence distance as the appropriate co-occurrence range for that pair. Alternatively, the co-occurrence range setting means 105 may determine the appropriate co-occurrence range for each pair based on the distribution of the distances from each detailing location to the detailing-target locations. When the detailing expression never appears for a detailing-target expression, the co-occurrence range setting means 105 sets the appropriate co-occurrence range to "none". When several minimum co-occurrence distances share the highest frequency, the co-occurrence range setting means 105 may determine the smallest of them, the largest of them, or their average, for example, as the appropriate co-occurrence range. The wider the appropriate co-occurrence range is set, the fewer detailing-target locations are judged to be insufficiently detailed.
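The histogram-based choice can be sketched as follows. This is an illustrative sketch under stated assumptions: distances are plain numbers, "none" is represented by Python's `None`, and the `tie_break` parameter is a hypothetical way of selecting among the three tie-handling options the text mentions.

```python
from collections import Counter
from statistics import mean


def appropriate_range(distances, tie_break="min"):
    """Pick the most frequent minimum co-occurrence distance.

    Returns None (standing for "none") when the detailing expression never
    appears, i.e. when no distance could be observed.
    """
    observed = [d for d in distances if d is not None]
    if not observed:
        return None
    counts = Counter(observed)          # histogram of distances
    top = max(counts.values())
    modes = [d for d, c in counts.items() if c == top]
    if tie_break == "min":
        return min(modes)
    if tie_break == "max":
        return max(modes)
    if tie_break == "mean":
        return mean(modes)
    raise ValueError(f"unknown tie_break: {tie_break}")


# Hypothetical distances in lines: 1 occurs most often.
chosen = appropriate_range([1, 1, 2, 5, 1])
```

Choosing `tie_break="max"` widens the range when there is a tie and therefore flags fewer locations, which matches the trade-off noted in the text.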
For each pair of a detailing-target expression and a detailing expression, the insufficient-detailing detection means 106 detects detailing-target locations at which insufficient detailing has occurred (hereinafter called "insufficiently detailed locations"), based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and an "insufficient-detailing detection rule".
The "insufficient-detailing detection rule" is a rule that defines under what conditions of co-occurrence between a detailing-target location and a detailing location within the appropriate co-occurrence range the location is judged not to be (or to be) insufficiently detailed. For example, the rule may state that a detailing-target location is judged insufficiently detailed unless a detailing location co-occurs with it within the appropriate co-occurrence range.
When the appropriate co-occurrence range is set to "none", the insufficient-detailing detection means 106 regards, for example, all detailing-target locations corresponding to the relevant pair of a detailing-target expression and a detailing expression as insufficiently detailed.
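The example rule and the "none" case can be sketched together. This is a minimal illustration, assuming line-number positions and `None` as the representation of "none"; the function name `insufficient_locations` and the sample positions are hypothetical.

```python
def insufficient_locations(target_lines, detail_lines, proper_range):
    """Return the detailing-target locations judged insufficiently detailed.

    A location is flagged when no detailing location lies within
    proper_range lines before or after it. When proper_range is None
    ("none"), every detailing-target location is flagged.
    """
    if proper_range is None:
        return list(target_lines)
    return [
        t for t in target_lines
        if not any(abs(t - d) <= proper_range for d in detail_lines)
    ]


# Hypothetical positions with an appropriate co-occurrence range of 1 line.
flagged = insufficient_locations([3, 10, 25], [4, 12, 13], 1)
```

In this sample, the location on line 3 has a detailing location one line away and passes, while the locations on lines 10 and 25 are flagged.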
The insufficient-detailing detection means 106 may detect a compound word containing a detailing-target expression as a variation of that detailing-target expression. In this case, the insufficient-detailing detection rule may be a rule under which a variation of the detailing-target expression is judged not to be insufficiently detailed if, for at least one of the detailing-target locations corresponding to that variation, a detailing location co-occurs within the appropriate co-occurrence range.
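The per-variation rule can be sketched as below. This is an assumption-laden illustration: variations are given as a mapping from compound word to the line numbers where it occurs, and a variation passes as soon as any one of its occurrences has a detailing location in range. The function name and sample data are hypothetical.

```python
def insufficient_variations(variation_lines: dict[str, list[int]],
                            detail_lines: list[int],
                            proper_range: int) -> list[str]:
    """Return the variations for which no occurrence has a detailing
    location within proper_range lines."""
    flagged = []
    for variation, lines in variation_lines.items():
        covered = any(
            abs(t - d) <= proper_range for t in lines for d in detail_lines
        )
        if not covered:
            flagged.append(variation)
    return flagged


# Hypothetical occurrences of two variations of the registered string "ID".
variations = {"user ID": [2, 40], "product ID": [15, 30]}
flagged_variations = insufficient_variations(variations, [3], 1)
```

Here "user ID" is detailed once (line 2, next to the detailing location on line 3), so its second occurrence on line 40 is not reported; "product ID" is never detailed and is flagged.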
A situation-limiting word may be registered in the detailing expression database 102. In this case, the insufficient-detailing detection means 106 performs detection of insufficiently detailed locations when a detailing-target location corresponding to the detailing-target expression and a situation-limiting location co-occur within a predetermined co-occurrence range. Alternatively, the insufficient-detailing detection means 106 performs detection of insufficiently detailed locations when a situation-limiting word co-occurs within the appropriate co-occurrence range of the detailing-target expression and the detailing expression set by the co-occurrence range setting means 105.
The output means 107 outputs the insufficiently detailed locations extracted by the insufficient-detailing detection means 106, for example in a form that the user can distinguish. Examples of output forms are a list display that the user can read and the provision of information to an external device. Alternatively, the output means 107 may output the detailing-target locations in a form that lets the user tell whether each was judged to be insufficiently detailed. For example, the output means 107 may output the detailing-target locations using a different color, font, line thickness, or the like for those that are insufficiently detailed and those that are not.
Next, the operation of the present embodiment will be described.
FIG. 2 is a flowchart showing the operation of the document data processing apparatus 10 according to the first embodiment of the present invention.
The document input means 101 inputs the document in which insufficient detailing is to be detected (the input document) (step S101).
The word extraction means 103 detects, in the input document, the detailing-target locations and detailing locations that match the detailing-target expressions and detailing expressions stored in the detailing expression database 102 (step S102).
For each pair of a detailing-target location and a detailing location extracted by the word extraction means 103, the co-occurrence check means 104 calculates the minimum co-occurrence distance between the detailing-target location and the detailing location (step S103).
For each pair of a detailing-target expression and a detailing expression, the co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances between the detailing-target locations and the detailing locations (step S104).
The insufficient-detailing detection means 106 detects insufficiently detailed locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and the insufficient-detailing detection rule (step S105).
The output means 107 outputs the insufficiently detailed locations detected by the insufficient-detailing detection means 106 (step S106).
Next, a specific example of the processing of the first embodiment of the present invention will be described.
FIG. 3 is a diagram for explaining a specific example of the detailing-target table T1 in the first embodiment of the present invention.
The detailing expression database 102 stores the detailing-target table T1.
FIG. 4 is a diagram for explaining a specific example of the distribution of the minimum co-occurrence distance in the first embodiment of the present invention.
The operation of the document data processing apparatus 10 when no situation-limiting word is specified is described below with reference to FIGS. 3 and 4.
The detailing-target table T1 can store a detailing-target expression C1, a detailing expression C2, and a situation-limiting word C3. Because no situation-limiting word is specified in the detailing-target table T1, however, the situation-limiting word C3 column is blank. The detailing expression database 102 stores in advance the detailing-target table T1, in which at least the detailing-target expression C1 and the detailing expression C2 are associated with each other.
In the detailing-target table T1, "search system" is stored as the detailing-target expression C1, "performance" is stored as the detailing expression C2, and the situation-limiting word C3 is blank.
The document input means 101 inputs the document in which insufficient detailing is to be detected (step S101 in FIG. 2).
The word extraction means 103 refers to the detailing-target table T1 and detects, in the input document, the detailing-target locations and detailing locations that match the stored detailing-target expression C1 "search system" and detailing expression C2 "performance", respectively (step S102 in FIG. 2).
For each pair of a detailing-target location corresponding to "search system" and a detailing location corresponding to "performance", the co-occurrence check means 104 calculates the minimum co-occurrence distance between the detailing-target location and the detailing location (step S103 in FIG. 2). When there are multiple detailing locations corresponding to "performance", the co-occurrence check means 104 calculates, as the minimum co-occurrence distance, the distance between the detailing-target location and the detailing location closest to it.
The co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence check means 104, between the detailing-target locations corresponding to "search system" and the detailing locations corresponding to "performance" (step S104 in FIG. 2). In FIG. 4, in the distribution of the minimum co-occurrence distance between the detailing-target expression C1 "search system" and the detailing expression C2 "performance", a minimum co-occurrence distance of "1 line" is the most frequent. When the most frequent minimum co-occurrence distance is to be used as the appropriate co-occurrence range, the co-occurrence range setting means 105 determines "1 line" as the appropriate co-occurrence range.
The insufficient-detailing detection means 106 detects insufficiently detailed locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and the insufficient-detailing detection rule (step S105 in FIG. 2). Consider the case where the insufficient-detailing detection rule is "a detailing-target location is judged insufficiently detailed unless a detailing location co-occurs with it within the appropriate co-occurrence range". In this case, when no detailing location corresponding to "performance" exists within one line before or after a detailing-target location corresponding to "search system", the insufficient-detailing detection means 106 detects that detailing-target location as an insufficiently detailed location.
The output means 107 outputs the insufficiently detailed locations detected by the insufficient-detailing detection means 106 (step S106 in FIG. 2).
Next, another specific example of the processing of the first embodiment of the present invention will be described.
The operation of the document data processing apparatus 10 when situation-limiting words are specified is described below with reference to FIGS. 5 and 6.
FIG. 5 is a diagram for explaining a specific example of the detailing-target table T2 in the first embodiment of the present invention.
In the detailing-target table T2, "csv" is stored as the detailing-target expression C4, "character code" is stored as the detailing expression C5, and "input" and "output" are stored as the situation-limiting words C6.
FIG. 6 is a diagram for explaining a specific example of the input document D1 in the first embodiment. In the input document D1, the detailing-target expression C4 "csv" appears at positions P1, P2, P3, and P4. In the input document D1, "(page break)" is a symbol indicating a page break, and ":" and "(omitted)" indicate that part of the document has been omitted.
The document input means 101 inputs the input document D1 in which insufficient detailing is to be detected (step S101 in FIG. 2).
The word extraction means 103 refers to the detailing-target table T2 and detects, in the input document D1, the detailing-target locations, detailing locations, and situation-limiting locations that match the stored detailing-target expression C4 "csv", detailing expression C5 "character code", and situation-limiting words C6 "input" and "output", respectively (step S102 in FIG. 2). In the input document D1, the detailing-target locations corresponding to the detailing-target expression C4 "csv" are the detailing-target locations P1, P2, P3, and P4.
For each pair of a detailing-target location corresponding to "csv" and a detailing location corresponding to "character code", the co-occurrence check means 104 calculates the minimum co-occurrence distance between the detailing-target location and the detailing location (step S103 in FIG. 2). When there are multiple detailing locations corresponding to "character code", the co-occurrence check means 104 calculates, as the minimum co-occurrence distance, the distance between the detailing-target location and the detailing location closest to it.
The co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence check means 104, between the detailing-target locations corresponding to "csv" and the detailing locations corresponding to "character code" (step S104 in FIG. 2). Assume that, in the distribution of the minimum co-occurrence distance between the detailing-target expression C4 "csv" and the detailing expression C5 "character code", the most frequent minimum co-occurrence distance is "1 line". When the most frequent minimum co-occurrence distance is to be used as the appropriate co-occurrence range, the co-occurrence range setting means 105 determines "1 line" as the appropriate co-occurrence range between the detailing-target expression C4 "csv" and the detailing expression C5 "character code".
The insufficient-detailing detection means 106 detects insufficiently detailed locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and the insufficient-detailing detection rule (step S105 in FIG. 2). Consider the case where the insufficient-detailing detection rule is "a detailing-target location is judged insufficiently detailed if a situation-limiting word co-occurs with it on the same page and no detailing location co-occurs with it within the appropriate co-occurrence range". In this case, when a situation-limiting location corresponding to "input" or "output" co-occurs on the same page as a detailing-target location corresponding to "csv", and no detailing location corresponding to "character code" exists within one line before or after it, the insufficient-detailing detection means 106 detects that detailing-target location as an insufficiently detailed location.
The refinement-target location P1 is not an insufficiently refined location, because a situation-limiting location corresponding to “input” co-occurs on the same page and a refining location corresponding to “character code” co-occurs within one line before or after it. The refinement-target location P2 is an insufficiently refined location, because a situation-limiting location corresponding to “input” co-occurs on the same page but no refining location corresponding to “character code” co-occurs within one line before or after it. The refinement-target location P3 is not an insufficiently refined location, because a situation-limiting location corresponding to “output” co-occurs on the same page and a refining location “character code” co-occurs within one line before or after it. The refinement-target location P4 is not an insufficiently refined location, because no situation-limiting location corresponding to “input” or “output” co-occurs on the same page.
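The rule applied to P1 through P4 can be sketched as follows. The page and line numbers below are hypothetical values chosen only so that the outcome matches the description; the function name and the tuple layout are likewise assumptions, not part of the embodiment.

```python
def find_insufficient(targets, situation_pages, refiner_lines, max_distance):
    """Flag a refinement-target location when a situation-limiting word
    appears on its page AND no refining location lies within max_distance
    lines of it. `targets` holds (name, page, line) tuples."""
    flagged = []
    for name, page, line in targets:
        limited = page in situation_pages
        refined = any(abs(line - r) <= max_distance for r in refiner_lines)
        if limited and not refined:
            flagged.append(name)
    return flagged


# Hypothetical layout: "input"/"output" appear on pages 1 and 2 only,
# and "character code" appears on lines 6 and 44.
targets = [("P1", 1, 5), ("P2", 1, 20), ("P3", 2, 45), ("P4", 3, 80)]
print(find_insufficient(targets, {1, 2}, [6, 44], max_distance=1))
```

In this sketch P4 is never flagged, however far it is from a refining location, because the situation-limiting condition is checked first.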
The output means 107 outputs the insufficiently refined locations detected by the insufficient-refinement detection means 106 (step S106 in FIG. 2).
As described above, according to the document data processing apparatus 10 of the present embodiment, when an expression relating to a predetermined subject exists in a document, it is possible to determine whether an expression relating to a predetermined matter that should be described in relation to that subject is described within an appropriate range. This is because the document data processing apparatus 10 determines the appropriate co-occurrence range based on the distribution of the minimum co-occurrence distance between refinement-target locations and refining locations, and determines the presence or absence of insufficiently refined locations within that range. Here, the expression relating to the predetermined subject corresponds to the refinement-target expression, and the expression relating to the predetermined matter that should be described in relation to the subject corresponds to the refining expression. Whether these are described within an appropriate range corresponds to the presence or absence of insufficiently refined locations.
In general, when the input document is a form sheet, a software program, or the like composed of an enormous number of character strings and symbol strings, it is difficult to detect insufficiently refined locations using a single criterion that is valid for the entire input document. The document data processing apparatus 10 of the present embodiment sets the appropriate co-occurrence range for such an enormous input document based on the distribution of the minimum co-occurrence distance between refinement-target locations and refining locations, and judges that an insufficiently refined location exists when a refinement-target location has no refining location within the appropriate co-occurrence range. The document data processing apparatus 10 of the present embodiment therefore has the effect that the presence or absence of insufficiently refined locations can be determined within an appropriate co-occurrence range that differs for each kind of refinement-target location.
(Second Embodiment)
Next, a second embodiment based on the first embodiment described above will be described. In the following description, components equivalent to those of the first embodiment are denoted by the same reference numerals, and their description is omitted as appropriate.
The configuration of the present embodiment will be described.
The configuration of the document data processing apparatus in the present embodiment is the same as that of the document data processing apparatus 10 in the first embodiment.
Next, the operation of the present embodiment will be described.
In the present embodiment, when a character string registered as a refinement-target expression or a refining expression is part of a compound word in the input document, the document data processing apparatus 10 regards the entire compound word as having been found as the refinement-target expression or the refining expression.
Next, a specific example of the processing in the present embodiment will be described.
FIG. 7 is a diagram for explaining a specific example of the refinement-target table T3 in the second embodiment of the present invention.
The refining expression database 102 stores the refinement-target table T3 in advance. The refinement-target table T3 stores “ID” as the refinement-target expression C7 and “unchangeable” as the refining expression C8. Since no situation-limiting word is specified in the refinement-target table T3, the situation-limiting word C9 is blank.
FIG. 8 is a diagram for explaining a specific example of the input document D2 in the second embodiment of the present invention. In the input document D2, the refinement-target expression C7 “ID” appears at positions P5, P6, P7, P8, and P9.
The document input means 101 inputs the input document D2 (step S101 in FIG. 2).
The word extraction means 103 detects, from the input document D2, the refinement-target locations and refining locations corresponding respectively to the refinement-target expression C7 “ID” and the refining expression C8 “unchangeable” stored in the refinement-target table T3 (step S102 in FIG. 2). Because the input document D2 contains the compound words “user ID”, “product ID”, “store ID”, and “order ID”, each of which includes the refinement-target expression C7 “ID”, the word extraction means 103 detects each entire compound word as a refinement-target location. That is, in the input document D2, the word extraction means 103 detects the refinement-target location P5 corresponding to “user ID”, the refinement-target location P6 corresponding to “product ID”, the refinement-target locations P7 and P9 corresponding to “store ID”, and the refinement-target location P8 corresponding to “order ID”.
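The description does not state how compound words are recognized. As a rough illustration only, the following Python sketch treats any run of katakana or kanji characters immediately preceding the registered expression as part of the compound word. This character-class heuristic, the function name, and the sample sentence are assumptions made here; a practical system would more likely rely on a morphological analyzer.

```python
import re

# Assumption: a compound word is the registered term preceded by a run of
# katakana (U+30A0-U+30FF) or kanji (U+4E00-U+9FFF) characters.
_PREFIX = r"[\u30A0-\u30FF\u4E00-\u9FFF]*"


def find_compounds(text, term):
    """Return (offset, compound) pairs for every occurrence of `term`,
    extended leftwards to cover the whole compound word."""
    pattern = re.compile(_PREFIX + re.escape(term))
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


sample = "ユーザIDと商品IDは変更不可。店舗IDも同様。"
print(find_compounds(sample, "ID"))
```

Hiragana particles such as “と” and “は” fall outside the character class, so they act as boundaries between compound words in this sketch.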
The co-occurrence presence check means 104 calculates, for each pair of a refinement-target location corresponding to “ID” and a refining location corresponding to “unchangeable”, the minimum co-occurrence distance between the refinement-target location and the refining location (step S103 in FIG. 2). When there are a plurality of refining locations corresponding to “unchangeable”, the co-occurrence presence check means 104 calculates, as the minimum co-occurrence distance, the distance between the refinement-target location and the refining location closest to it.
The co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence presence check means 104, between the refinement-target locations corresponding to “ID” and the refining locations corresponding to “unchangeable” (step S104 in FIG. 2). In the input document D2, the distribution of the minimum co-occurrence distance (in lines) between the refinement-target expression C7 “ID” and the refining expression C8 “unchangeable” is as follows: a minimum co-occurrence distance of 0 lines (same line) appears three times; a minimum co-occurrence distance of 5 lines (from P8 to the line of P7) appears once; and a minimum co-occurrence distance of 7 lines, including one blank line (from P9 to the line of P7), appears once. The co-occurrence range setting means 105 therefore determines a minimum co-occurrence distance of 0 lines as the appropriate co-occurrence range between the refinement-target expression C7 “ID” and the refining expression C8 “unchangeable”.
The insufficient-refinement detection means 106 detects insufficiently refined locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and on the insufficient-refinement detection rule (step S105 in FIG. 2). Consider the case where the rule is: “if, for at least one of the refinement-target locations corresponding to a given variation of a compound word containing the refinement-target expression, a refining location co-occurs within the appropriate co-occurrence range, that variation of the refinement-target expression is judged not to be insufficiently refined.” In this case, when no refining location corresponding to “unchangeable” exists within 0 lines (same line) for any of the refinement-target locations corresponding to a specific compound word containing “ID”, the insufficient-refinement detection means 106 detects those refinement-target locations as insufficiently refined locations. In the input document D2, among the refinement-target locations P5 to P9, the locations P5, P6, and P7 are not insufficiently refined locations, because a refining location corresponding to “unchangeable” exists on the same line. The refinement-target location P8 “order ID” is an insufficiently refined location, because no refining location corresponding to “unchangeable” exists on the same line and no location corresponding to “order ID” exists other than P8. The refinement-target location P9 has no refining location corresponding to “unchangeable” on the same line, but because the refinement-target location P7, which also corresponds to “store ID”, is not an insufficiently refined location, P9 is not an insufficiently refined location either.
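The per-variation rule can be sketched in Python using the distances given for the input document D2 (0 lines for P5, P6, and P7, 5 lines for P8, and 7 lines for P9). The function names and the tuple layout are assumptions made for this sketch; English glosses stand in for the compound words.

```python
from collections import Counter


def most_frequent_distance(occurrences):
    """Appropriate co-occurrence range: the most frequent minimum distance."""
    return Counter(d for _, _, d in occurrences).most_common(1)[0][0]


def insufficient_by_variation(occurrences, max_distance):
    """A compound-word variation passes if at least one of its occurrences
    has a refining location within max_distance; every occurrence of a
    variation that never passes is flagged."""
    satisfied = {word for _, word, d in occurrences if d <= max_distance}
    return [loc for loc, word, _ in occurrences if word not in satisfied]


# (location, compound word, minimum co-occurrence distance in lines)
d2 = [
    ("P5", "user ID", 0),
    ("P6", "product ID", 0),
    ("P7", "store ID", 0),
    ("P8", "order ID", 5),
    ("P9", "store ID", 7),
]
rng = most_frequent_distance(d2)
print(rng, insufficient_by_variation(d2, rng))
```

P9 is not flagged even though its own distance is 7 lines, because P7 already satisfies the rule for the same variation “store ID”.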
As described above, according to the document data processing apparatus of the present embodiment, by specifying a single refinement-target expression, insufficiently refined locations can be detected for each of a plurality of variations that contain that expression. Therefore, in addition to the effects of the first embodiment, the document data processing apparatus of the present embodiment has the effect that insufficiently refined locations can be appropriately narrowed down without registering each individual variation of the refinement-target expression.
(Third Embodiment)
Next, a third embodiment based on the second embodiment described above will be described. In the following description, components equivalent to those of the second embodiment are denoted by the same reference numerals, and their description is omitted as appropriate.
The configuration of the present embodiment will be described.
The configuration of the document data processing apparatus in the present embodiment is the same as that of the document data processing apparatus 10 in the second embodiment.
Next, the operation of the present embodiment will be described.
The refining expression database 102 stores in advance the same refinement-target table T3 as in the second embodiment.
In the present embodiment, the co-occurrence presence check means 104 and the co-occurrence range setting means 105 determine the appropriate co-occurrence range while distinguishing the direction in which the refining expression lies as seen from the refinement-target expression.
Next, a specific example of the processing in the present embodiment will be described.
FIG. 9 is a diagram for explaining a specific example of the input document D3 in the third embodiment of the present invention. The input document D3 is a document containing a table in which, for each of the values “user ID”, “product ID”, “store ID”, and “order ID” in the “item” column C11, a value in the “attribute” column C10 and a value in the “remarks” column C12 are described. The “attribute” values of the items “user ID”, “product ID”, and “store ID” are “no duplicates” and “unchangeable”. The “attribute” value of the item “order ID” is “no duplicates”. The “remarks” values of the items “user ID”, “product ID”, and “store ID” are blank. The “remarks” value of the item “order ID” is “the product ID is unchangeable even if the product is changed”.
The input document D3 contains, as values in the “item” column C11, the compound words “user ID”, “product ID”, “store ID”, and “order ID”, each of which includes the refinement-target expression C7 “ID”.
In the refinement-target table T3, the refinement-target expression C7 “ID” is associated with the refining expression C8 “unchangeable”.
The word extraction means 103 detects, from the input document D3, the refinement-target locations and refining locations corresponding respectively to the refinement-target expression C7 “ID” and the refining expression C8 “unchangeable” stored in the refinement-target table T3 (step S102 in FIG. 2). Because the input document D3 contains the compound words “user ID”, “product ID”, “store ID”, and “order ID”, each of which includes the refinement-target expression C7 “ID”, the word extraction means 103 detects each entire compound word as a refinement-target location. That is, in the input document D3, the word extraction means 103 detects the second row of the “item” column C11 as the refinement-target location corresponding to “user ID”. It also detects the third row of the “item” column C11 and the fifth row of the “remarks” column C12 as the refinement-target locations corresponding to “product ID”. It further detects the fourth row of the “item” column C11 as the refinement-target location corresponding to “store ID”, and the fifth row of the “item” column C11 as the refinement-target location corresponding to “order ID”.
In addition, the word extraction means 103 detects the second, third, and fourth rows of the “attribute” column C10 in the input document D3 as the refining locations corresponding to “unchangeable”.
The co-occurrence presence check means 104 then calculates, for each pair of a refinement-target location corresponding to “ID” and a refining location corresponding to “unchangeable” in the input document D3, the minimum co-occurrence distance between the refinement-target location and the refining location (step S103 in FIG. 2). When there are a plurality of refining locations corresponding to “unchangeable”, the co-occurrence presence check means 104 calculates, as the minimum co-occurrence distance, the distance between the refinement-target location and the refining location closest to it. In doing so, however, the co-occurrence presence check means 104 distinguishes the direction in which the refining expression lies as seen from the refinement-target expression. That is, for each of the refinement-target locations corresponding to “user ID”, “product ID”, and “store ID”, the co-occurrence presence check means 104 detects a refining location corresponding to “unchangeable” at a minimum co-occurrence distance of 0 rows (same row), in column C10, which is the column to the left of column C11 containing the refinement-target location. For the refinement-target location corresponding to “order ID”, the co-occurrence presence check means 104 detects a refining location corresponding to “unchangeable” at a minimum co-occurrence distance of 0 rows (same row), in column C12, which is the column to the right of column C11 containing the refinement-target location.
The co-occurrence range setting means 105 determines the appropriate co-occurrence range based on the minimum co-occurrence distances, calculated by the co-occurrence presence check means 104, between the refinement-target locations corresponding to “ID” and the refining locations corresponding to “unchangeable” (step S104 in FIG. 2). In doing so, the co-occurrence range setting means 105 takes into account not only the minimum co-occurrence distance but also the direction in which the refining location co-occurs relative to the refinement-target location. In the input document D3, the distribution of the minimum co-occurrence distance between the refinement-target expression C7 “ID” and the refining expression C8 “unchangeable” is as follows: a minimum co-occurrence distance of 0 rows (same row) on the left side appears three times, and a minimum co-occurrence distance of 0 rows (same row) on the right side appears once. The co-occurrence range setting means 105 therefore determines “0 rows, on the left side” as the appropriate co-occurrence range between the refinement-target expression C7 “ID” and the refining expression C8 “unchangeable”.
The insufficient-refinement detection means 106 detects insufficiently refined locations based on the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and on the insufficient-refinement detection rule (step S105 in FIG. 2). In the appropriate co-occurrence range, the direction in which the refining location co-occurs relative to the refinement-target location is also distinguished. Consider the case where the rule is: “for each refinement-target location, if no refining location co-occurs within the appropriate co-occurrence range, the location is judged to be insufficiently refined.” In this case, the insufficient-refinement detection means 106 judges that a refinement-target location corresponding to “ID” is not insufficiently refined when a refining location corresponding to “unchangeable” lies on its left side and co-occurs within 0 rows. That is, in the input document D3, the insufficient-refinement detection means 106 judges that the refinement-target locations corresponding to “user ID”, “product ID”, and “store ID” are not insufficiently refined. On the other hand, it judges that the refinement-target location corresponding to “order ID” in the input document D3 is insufficiently refined.
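The direction-aware variant can be sketched as follows. This is a simplified illustration: only the four “item” column occurrences of the input document D3 are modelled, the direction is encoded as the strings "left" and "right", and the function names are assumptions introduced here.

```python
from collections import Counter


def directional_range(occurrences):
    """Return the most frequent (direction, distance) pair."""
    pairs = Counter((direction, dist) for _, direction, dist in occurrences)
    return pairs.most_common(1)[0][0]


def insufficient_directional(occurrences):
    """Flag items whose nearest refining location lies in a different
    direction, or farther away, than the appropriate co-occurrence range."""
    direction, max_dist = directional_range(occurrences)
    return [
        item
        for item, d, dist in occurrences
        if d != direction or dist > max_dist
    ]


# (item, direction of nearest "unchangeable", distance in rows)
d3 = [
    ("user ID", "left", 0),
    ("product ID", "left", 0),
    ("store ID", "left", 0),
    ("order ID", "right", 0),
]
print(directional_range(d3), insufficient_directional(d3))
```

Although “order ID” has a refining location in the same row, it is flagged because that location lies to the right, in the remarks column, rather than to the left.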
As described above, according to the document data processing apparatus of the present embodiment, insufficiently refined locations can be detected while distinguishing the direction of the position at which a refining location co-occurs relative to a refinement-target location. Therefore, in addition to the effects of the second embodiment, the document data processing apparatus of the present embodiment has the effect that insufficiently refined locations can be appropriately narrowed down without increasing the number of variations of the refinement-target expression.
(Fourth Embodiment)
Next, a fourth embodiment representing a concept common to the embodiments and modifications described above will be described.
FIG. 10 is a block diagram showing an example of the configuration of the document data processing apparatus 11 in the fourth embodiment of the present invention.
The document data processing apparatus 11 includes a co-occurrence presence check means 114, a co-occurrence range setting means 115, and an insufficient-refinement detection means 116.
First, for character strings relating to a predetermined subject (refinement-target expressions) in an input document, the co-occurrence presence check means 114 stores the distribution of the distance (the distribution of the minimum co-occurrence distance) between each refinement-target expression and the nearest character string relating to a predetermined matter that should be described in relation to the predetermined subject (refining expression).
Next, the co-occurrence range setting means 115 determines the appropriate co-occurrence range between the refinement-target expression and the refining expression in the input document, based on the distribution of the minimum co-occurrence distance stored by the co-occurrence presence check means 114.
Then, when no refining expression exists in the input document within the appropriate co-occurrence range determined by the co-occurrence range setting means 115, the insufficient-refinement detection means 116 detects that the predetermined matter that should be described in relation to the predetermined subject is not described within an appropriate range.
As described above, according to the document data processing apparatus 11 of the present embodiment, when an expression relating to a predetermined subject (refinement-target expression) exists in a document, it is possible to determine whether an expression relating to a predetermined matter that should be described in relation to that subject (refining expression) is described within an appropriate range.
(Fifth Embodiment)
Next, a fifth embodiment representing a concept common to the embodiments and modifications described above will be described.
FIG. 11 is a block diagram showing an example of the configuration of the document data processing apparatus 12 in the fifth embodiment of the present invention.
Based on a given distribution of the minimum co-occurrence distance, the document data processing apparatus 12 determines whether, in an input document, an expression relating to a predetermined matter that should be described in relation to a predetermined subject (refining expression) is described within an appropriate range. The distribution of the minimum co-occurrence distance is the distribution, for character strings relating to the predetermined subject (refinement-target expressions) in a document, of the distance between each refinement-target expression and the refining expression nearest to it.
The distribution of the minimum co-occurrence distance is output, for example, by a document data analysis apparatus 13, which includes the co-occurrence presence check means 114 of the fourth embodiment, analyzing a reference document. The reference document is a document that shares its refinement-target expressions and refining expressions with the input document.
The document data processing apparatus 12 includes a co-occurrence range setting means 125 and an insufficient-refinement detection means 126.
First, the co-occurrence range setting means 125 determines the appropriate co-occurrence range between the refinement-target expression and the refining expression in the input document, based on the given distribution of the minimum co-occurrence distance. The appropriate co-occurrence range is, for example, the range in which the distance from the refinement-target location is at least 0 and at most the distance with the highest frequency of appearance in the distribution of the minimum co-occurrence distance. Alternatively, the appropriate co-occurrence range is, for example, the range in which the distance from the refinement-target location is equal to the distance with the highest frequency of appearance in the distribution of the minimum co-occurrence distance.
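The two alternative definitions of the appropriate co-occurrence range differ only in how a distance is compared with the most frequent distance. A minimal Python sketch, in which the function name and the `inclusive` flag are assumptions introduced for illustration:

```python
def in_appropriate_range(distance, mode_distance, inclusive=True):
    """Check whether `distance` falls in the appropriate co-occurrence range.

    inclusive=True : 0 <= distance <= mode_distance (first definition)
    inclusive=False: distance == mode_distance      (second definition)
    """
    if inclusive:
        return 0 <= distance <= mode_distance
    return distance == mode_distance


# With a most frequent distance of 1 line, a refining expression on the
# same line (distance 0) is accepted only under the first definition.
print(in_appropriate_range(0, 1), in_appropriate_range(0, 1, inclusive=False))
```

The first definition tolerates refining expressions that are closer than usual, while the second treats any deviation from the usual distance as being out of range.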
Next, when no refining expression exists in the input document within the appropriate co-occurrence range determined by the co-occurrence range setting means 125, the insufficient-refinement detection means 126 detects that the predetermined matter that should be described in relation to the predetermined subject is not described within an appropriate range.
As described above, according to the document data processing apparatus 12 of the present embodiment, when an expression relating to a predetermined subject (refinement-target expression) exists in a document, it is possible to determine whether an expression relating to a predetermined matter that should be described in relation to that subject (refining expression) is described within an appropriate range.
In addition, the document data processing apparatus 12 of the present embodiment can use a distribution of the minimum co-occurrence distance obtained by analyzing a reference document that is separate from the input document. The reference document may, of course, be the same document as the input document. Therefore, according to the document data processing apparatus 12 of the present embodiment, the distribution of the minimum co-occurrence distance in a reference document that is more suitable than the input document can be used to determine whether the refining expression is described within an appropriate range. Once the distribution of the minimum co-occurrence distance in the reference document has been created, it can be reused any number of times. This eliminates the need to calculate the distribution of the minimum co-occurrence distance and determine the appropriate co-occurrence range each time an input document is processed.
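Reuse of a distribution obtained from a reference document can be sketched as follows. The description does not specify how the distribution is stored or exchanged; the JSON serialization, the function names, and the tie-breaking rule (the smallest distance wins) are all assumptions made for this sketch.

```python
import json
from collections import Counter


def dump_distribution(distances):
    """Serialize a minimum co-occurrence distance distribution, as might be
    produced once by analyzing a reference document."""
    counts = Counter(distances)
    return json.dumps({str(d): c for d, c in sorted(counts.items())})


def load_range(serialized):
    """Recover the appropriate co-occurrence range (the most frequent
    distance; smallest on a tie) from a stored distribution."""
    counts = {int(d): c for d, c in json.loads(serialized).items()}
    best = max(counts.values())
    return min(d for d, c in counts.items() if c == best)


stored = dump_distribution([0, 0, 0, 5, 7])  # computed once
print(stored, load_range(stored))            # reused for each input document
```

Because only the stored distribution is needed, an input document can be checked without recomputing the minimum co-occurrence distances for that document.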
The document data processing apparatus in each of the embodiments described above may be realized by a dedicated apparatus, but it can also be realized by a computer (information processing apparatus).
In this case, among the means shown in FIGS. 1, 10, and 11, at least the word extraction means 103, the co-occurrence presence/absence check means 104, the co-occurrence range setting means 105, the insufficient-detail detection means 106, the co-occurrence presence/absence check means 114, the co-occurrence range setting means 115, the insufficient-detail detection means 116, the co-occurrence range setting means 125, and the insufficient-detail detection means 126 can be regarded as functional (processing) units (software modules) of a software program. An example of a hardware environment capable of realizing these functions (processes) is described with reference to FIG. 12. Note that the division of the means shown in these drawings is adopted for convenience of explanation, and various configurations are conceivable in an actual implementation.
FIG. 12 illustrates an example configuration of an information processing apparatus 1000 (computer) capable of implementing the document data processing apparatus 10 (11, 12) according to the embodiments of the present invention.
The information processing apparatus 1000 shown in FIG. 12 is a typical computer in which the following components are connected via a bus 3008 (communication line):
・ CPU (Central Processing Unit) 3001,
・ ROM (Read Only Memory) 3002,
・ RAM (Random Access Memory) 3003,
・ storage device 3004,
・ input/output user interface 3005 (an interface is hereinafter referred to as an "I/F"),
・ communication I/F 3006 for communicating with external devices and external networks,
・ drive device 3009 that reads information recorded on a recording medium 3010.
In the hardware environment described above, the embodiments are achieved by the following procedure. First, a computer program capable of realizing the functions of the block diagrams (FIGS. 1, 10, and 11) or the flowchart (FIG. 2) referred to in the description of the embodiments is supplied to the information processing apparatus 1000 shown in FIG. 12 when the drive device 3009 reads the recording medium 3010 on which the program is recorded. Downloading the computer program via the communication I/F 3006 is also regarded as reading by the information processing apparatus 1000. The computer program is then read out and interpreted by the CPU 3001 of the hardware and executed by the CPU 3001. The computer program supplied to the apparatus may be stored in readable and writable volatile memory (RAM 3003) or in a nonvolatile storage device such as the storage device 3004.
In such a case, the software program (computer program) can be regarded as constituting the present invention. Furthermore, a computer-readable storage medium storing the software program can also be regarded as constituting the present invention.
The present invention has been described above by way of example using the embodiments and their modifications. However, the technical scope of the present invention is not limited to the scope described in those embodiments and modifications. It is apparent to those skilled in the art that various changes or improvements can be made to the embodiments. New embodiments incorporating such changes or improvements can also fall within the technical scope of the present invention, as is clear from the matters recited in the claims.
Some or all of the above embodiments can also be described as in the following supplementary notes, but are not limited to them.
(Supplementary Note 1)
A document data processing apparatus comprising: co-occurrence range setting means for determining, based on a distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to an appearance position of the first expression in a second document, the second document being either the same document as the first document or a different document; and
insufficient-detail detection means for detecting, when the second expression does not appear in the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
(Supplementary Note 2)
The document data processing apparatus according to Supplementary Note 1, wherein the shortest distance is the distance from an appearance position of the first expression to the appearance position of the second expression that is closest to it among the appearance positions of the second expression located before and after the appearance position of the first expression.
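The shortest distance defined above can be computed with a sorted search over the positions of the second expression. The following is a minimal sketch, assuming that appearance positions are integer indices (for example word or sentence offsets, a choice this passage does not fix); the function name is hypothetical.

```python
from bisect import bisect_left


def shortest_distances(first_positions, second_positions):
    """For each position of the first expression, return the distance to the
    nearest position of the second expression, looking both before and after.

    Returns None for a position when the second expression never appears.
    """
    second_sorted = sorted(second_positions)
    result = []
    for p in first_positions:
        i = bisect_left(second_sorted, p)
        candidates = []
        if i < len(second_sorted):
            candidates.append(second_sorted[i] - p)      # nearest at or after p
        if i > 0:
            candidates.append(p - second_sorted[i - 1])  # nearest before p
        result.append(min(candidates) if candidates else None)
    return result


print(shortest_distances([5, 20], [2, 9, 30]))  # [3, 10]
```

For the position 5 the nearest second expression lies 3 positions before it, and for the position 20 the nearest one lies 10 positions after it.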
(Supplementary Note 3)
The document data processing apparatus according to Supplementary Note 1 or 2, further comprising co-occurrence presence/absence check means for recording the distribution in the first document.
(Supplementary Note 4)
The document data processing apparatus according to any one of Supplementary Notes 1 to 3, further comprising: word extraction means for detecting the appearance position of the first expression and the appearance position of the second expression in the second document; and
a detailing expression database that stores the first expression and the second expression in association with each other.
(Supplementary Note 5)
The document data processing apparatus according to any one of Supplementary Notes 1 to 4, wherein the first range includes the shortest distance with the highest appearance frequency in the distribution or, when a plurality of shortest distances share the highest appearance frequency, the maximum, the minimum, or the average of those shortest distances.
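The selection of the most frequent shortest distance, including the handling of ties, can be illustrated as follows. This is a sketch under assumptions: the distribution is given as a plain list of observed shortest distances, and the `tie` parameter and function name are hypothetical. How the first range is built around the returned value is not specified here beyond the requirement that the range include it.

```python
from collections import Counter
from statistics import mean


def representative_distance(distances, tie="max"):
    """Return the most frequent shortest distance.

    When several distances share the highest frequency, resolve the tie by
    taking their maximum, minimum, or mean, as selected by `tie`.
    """
    counts = Counter(distances)
    if not counts:
        raise ValueError("the distribution is empty")
    top = max(counts.values())
    modes = [d for d, n in counts.items() if n == top]
    if tie == "max":
        return max(modes)
    if tie == "min":
        return min(modes)
    if tie == "mean":
        return mean(modes)
    raise ValueError(f"unknown tie policy: {tie}")


observed = [1, 2, 2, 3, 3]  # distances 2 and 3 both occur twice
print(representative_distance(observed, "max"))   # 3
print(representative_distance(observed, "min"))   # 2
print(representative_distance(observed, "mean"))  # 2.5
```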
(Supplementary Note 6)
The document data processing apparatus according to any one of Supplementary Notes 1 to 5, wherein, when a compound word including the first expression appears in the second document, the insufficient-detail detection means detects that the matter to be described in relation to the subject limited by the compound word is not described within an appropriate range if the second expression appears in none of the first ranges corresponding to the compound word.
(Supplementary Note 7)
The document data processing apparatus according to any one of Supplementary Notes 1 to 6, wherein the distribution includes, in addition to information on the distance between the appearance position of the second expression and the appearance position of the first expression, information on the direction of the appearance position of the second expression as seen from the appearance position of the first expression, and
the co-occurrence range setting means determines the first range based on the distance and direction information included in the distribution.
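One way to carry direction as well as distance is to record signed offsets. The sketch below assumes integer positions, represents "before" as a negative offset and "after" as a positive one, and builds the range between zero and the most frequent offset; all of these choices, and the function names, are illustrative assumptions rather than the method fixed by the text.

```python
from collections import Counter


def signed_shortest_offset(p, second_positions):
    """Return the offset q - p of the nearest second expression, or None.

    Negative means the second expression precedes the first expression.
    If two positions are equally near, the preceding one is chosen (assumption).
    """
    offsets = [q - p for q in second_positions]
    if not offsets:
        return None
    return min(offsets, key=lambda off: (abs(off), off))


def directional_range(first_positions, second_positions):
    """Return (low, high) offsets spanning from 0 to the most frequent offset."""
    offsets = [signed_shortest_offset(p, second_positions) for p in first_positions]
    offsets = [off for off in offsets if off is not None]
    if not offsets:
        return None
    mode, _ = Counter(offsets).most_common(1)[0]
    return (min(0, mode), max(0, mode))


def appears_in_range(p, second_positions, offset_range):
    """True if some second expression lies within offset_range relative to p."""
    low, high = offset_range
    return any(low <= q - p <= high for q in second_positions)


rng = directional_range([10, 30, 50], [12, 32, 47])
print(rng)  # (0, 2): the second expression usually follows within 2 positions
```

With this range, a second expression that appears 3 positions before the first expression does not count, even though its absolute distance is small.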
(Supplementary Note 8)
The document data processing apparatus according to any one of Supplementary Notes 1 to 7, wherein the insufficient-detail detection means detects that the predetermined matter to be described in relation to the subject is not described within an appropriate range when, in the second document, a predetermined third expression and the first expression appear within a predetermined second range and the second expression does not appear in the first range.
(Supplementary Note 9)
The document data processing apparatus according to any one of Supplementary Notes 1 to 8, wherein the co-occurrence range setting means regards an appearance, in the second document, of a first synonym of the first expression or of a second synonym of the second expression as an appearance of the first expression or of the second expression, respectively.
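Treating synonyms as appearances of the original expression amounts to widening the match set before positions are collected. The following sketch assumes whitespace-separated tokens and a hypothetical synonym table keyed by expression; Japanese text would instead need a morphological analyzer to produce the tokens.

```python
def find_positions(tokens, expression, synonyms=None):
    """Return the token indices where the expression or one of its synonyms appears."""
    accepted = {expression} | set((synonyms or {}).get(expression, ()))
    return [i for i, token in enumerate(tokens) if token in accepted]


# Hypothetical synonym table: "fee" and "cost" count as appearances of "price".
synonym_table = {"price": ["fee", "cost"]}
tokens = "the fee is charged and the price is listed".split()

print(find_positions(tokens, "price"))                 # [6]
print(find_positions(tokens, "price", synonym_table))  # [1, 6]
```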
(Supplementary Note 10)
The document data processing apparatus according to any one of Supplementary Notes 1 to 9, further comprising output means for outputting each first expression in a manner that allows a user to identify whether the predetermined matter to be described in relation to the subject is described within an appropriate range for that first expression.
(Supplementary Note 11)
A document data processing method comprising: determining, based on a distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to an appearance position of the first expression in a second document, the second document being either the same document as the first document or a different document; and
detecting, when the second expression does not appear in the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
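The two steps of the method can be put together in a short end-to-end sketch. It assumes integer positions, takes the first range to be every distance from 0 up to the most frequent shortest distance (the largest one on a tie), and uses hypothetical function names; the text itself leaves these details open.

```python
from collections import Counter


def nearest(p, positions):
    """Distance from p to the nearest of positions, or None if there are none."""
    return min((abs(q - p) for q in positions), default=None)


def build_distribution(first_positions, second_positions):
    """Count the shortest distances observed in the first (reference) document."""
    distances = (nearest(p, second_positions) for p in first_positions)
    return Counter(d for d in distances if d is not None)


def range_limit(distribution):
    """Upper limit of the first range: the most frequent distance (largest on a tie)."""
    if not distribution:
        raise ValueError("the distribution is empty")
    top = max(distribution.values())
    return max(d for d, n in distribution.items() if n == top)


def detect_insufficient_detail(first_positions, second_positions, limit):
    """Return positions of the first expression with no second expression in range."""
    flagged = []
    for p in first_positions:
        d = nearest(p, second_positions)
        if d is None or d > limit:
            flagged.append(p)
    return flagged


# First document (reference): the second expression usually appears 2 positions away.
limit = range_limit(build_distribution([3, 15, 40], [5, 17, 48]))
# Second document (input): the first expression at position 30 lacks nearby detail.
print(detect_insufficient_detail([4, 30], [6], limit))  # [30]
```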
(Supplementary Note 12)
A document data processing program that causes a computer to execute: co-occurrence range setting processing of determining, based on a distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to an appearance position of the first expression in a second document, the second document being either the same document as the first document or a different document; and
insufficient-detail detection processing of detecting, when the second expression does not appear in the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
The present invention has been described above using the embodiments as exemplary examples. However, the present invention is not limited to those embodiments. That is, various aspects that can be understood by those skilled in the art can be applied to the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2014-124850, filed on June 18, 2014, the entire disclosure of which is incorporated herein.
DESCRIPTION OF SYMBOLS
 10 Document data processing apparatus
 101 Document input means
 102 Detailing expression database
 103 Word extraction means
 104 Co-occurrence presence/absence check means
 105 Co-occurrence range setting means
 106 Insufficient-detail detection means
 107 Output means
 11 Document data processing apparatus
 114 Co-occurrence presence/absence check means
 115 Co-occurrence range setting means
 116 Insufficient-detail detection means
 12 Document data processing apparatus
 13 Document data analysis apparatus
 125 Co-occurrence range setting means
 126 Insufficient-detail detection means

Claims (10)

  1.  A document data processing apparatus comprising:
    co-occurrence range setting means for determining, based on a distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to an appearance position of the first expression in a second document, the second document being either the same document as the first document or a different document; and
    insufficient-detail detection means for detecting, when the second expression does not appear in the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
  2.  The document data processing apparatus according to claim 1, wherein the shortest distance is the distance from an appearance position of the first expression to the appearance position of the second expression that is closest to it among the appearance positions of the second expression located before and after the appearance position of the first expression.
  3.  The document data processing apparatus according to claim 1 or claim 2, further comprising co-occurrence presence/absence check means for recording the distribution in the first document.
  4.  The document data processing apparatus according to any one of claims 1 to 3, further comprising:
    word extraction means for detecting the appearance position of the first expression and the appearance position of the second expression in the second document; and
    a detailing expression database that stores the first expression and the second expression in association with each other.
  5.  The document data processing apparatus according to any one of claims 1 to 4, wherein the first range includes the shortest distance with the highest appearance frequency in the distribution or, when a plurality of shortest distances share the highest appearance frequency, the maximum, the minimum, or the average of those shortest distances.
  6.  The document data processing apparatus according to any one of claims 1 to 5, wherein, when a compound word including the first expression appears in the second document, the insufficient-detail detection means detects that the matter to be described in relation to the subject limited by the compound word is not described within an appropriate range if the second expression appears in none of the first ranges corresponding to the compound word.
  7.  The document data processing apparatus according to any one of claims 1 to 6, wherein the distribution includes, in addition to information on the distance between the appearance position of the second expression and the appearance position of the first expression, information on the direction of the appearance position of the second expression as seen from the appearance position of the first expression, and
    the co-occurrence range setting means determines the first range based on the distance and direction information included in the distribution.
  8.  The document data processing apparatus according to any one of claims 1 to 7, wherein the insufficient-detail detection means detects that the predetermined matter to be described in relation to the subject is not described within an appropriate range when, in the second document, a predetermined third expression and the first expression appear within a predetermined second range and the second expression does not appear in the first range.
  9.  A document data processing method comprising:
    determining, based on a distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to an appearance position of the first expression in a second document, the second document being either the same document as the first document or a different document; and
    detecting, when the second expression does not appear in the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
  10.  A recording medium recording a computer-readable program that causes a computer to execute:
    co-occurrence range setting processing of determining, based on a distribution of the shortest distance in a first document between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range of positions at which the second expression should appear relative to an appearance position of the first expression in a second document, the second document being either the same document as the first document or a different document; and
    insufficient-detail detection processing of detecting, when the second expression does not appear in the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
PCT/JP2015/002938 2014-06-18 2015-06-11 Document data processing device, document data processing method, and recording medium WO2015194140A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016529029A JP6677158B2 (en) 2014-06-18 2015-06-11 Document data processing apparatus, document data processing method, and document data processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014-124850 2014-06-18
JP2014124850 2014-06-18

Publications (1)

Publication Number Publication Date
WO2015194140A1 true WO2015194140A1 (en) 2015-12-23

Family

ID=54935149

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/002938 WO2015194140A1 (en) 2014-06-18 2015-06-11 Document data processing device, document data processing method, and recording medium

Country Status (2)

Country Link
JP (1) JP6677158B2 (en)
WO (1) WO2015194140A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6172694B1 (en) * 2016-11-14 2017-08-02 国立大学法人名古屋大学 Report classification system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6325764A (en) * 1986-07-18 1988-02-03 Matsushita Electric Ind Co Ltd Documentation device
JPH1021236A (en) * 1996-07-04 1998-01-23 Ricoh Co Ltd Cooccurrence relation knowledge learning device
JP2009110405A (en) * 2007-10-31 2009-05-21 Toshiba Corp Document data processor

Also Published As

Publication number Publication date
JPWO2015194140A1 (en) 2017-04-20
JP6677158B2 (en) 2020-04-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15809261

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016529029

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15809261

Country of ref document: EP

Kind code of ref document: A1