WO2015194140A1

WO2015194140A1 - Document data processing device, document data processing method, and recording medium

Info

Publication number: WO2015194140A1
Application number: PCT/JP2015/002938
Authority: WO
Inventors: 綾子久野
Original assignee: 日本電気株式会社
Priority date: 2014-06-18
Filing date: 2015-06-11
Publication date: 2015-12-23
Also published as: JPWO2015194140A1; JP6677158B2

Abstract

[Problem] To determine, when an expression related to a predetermined subject is present within a document, whether or not an expression related to a predetermined item that should be described in association with the predetermined subject is described within an appropriate range. [Solution] A document data processing device is equipped with: a co-occurrence range setting means for determining a first range on the basis of the distribution in a first document with respect to minimum distance between the appearance position of a predetermined first expression related to a predetermined subject and the appearance position of a predetermined second expression related to a predetermined item that should be described in association with the subject, said first range being the range in which the second expression should be located relative to the appearance position of the first expression in a second document, which is the same document as the first document or a different document; and an insufficient details detection means for detecting that the predetermined item that should be described in association with the subject is not described within an appropriate range when the second expression does not appear within the first range in the second document.

Description

Document data processing apparatus, document data processing method, and recording medium

The present invention relates to a document data processing apparatus, a document data processing method, and a recording medium for evaluating whether or not document description items are sufficient.

In recent years, a system has been developed that detects an incomplete description in an input document by analyzing the input document described in a natural language or the like using an information processing device or the like.

Patent Document 1 discloses an example of a technique for detecting, in an input document, an expression that tends to be accompanied by an insufficient description.

The document data processing apparatus of Patent Document 1 includes an input unit, a storage unit, a search unit, and an output unit.

The document data processing apparatus of Patent Document 1 first inputs document data (input document) to be processed by an input means.

Next, the document data processing apparatus of Patent Document 1 uses the search means to search the input document for a predetermined expression stored in the storage means (hereinafter referred to as a “matching expression”).

When the matching expression exists in the input document, the document data processing apparatus of Patent Document 1 reads out a message associated with the matching expression stored in the storage unit, and outputs the message by the output unit.

That is, the document data processing apparatus disclosed in Patent Document 1 stores, in the storage means in advance, expressions that tend to be accompanied by an insufficient description, and thereby detects such expressions in an input document.

An example of a technique for detecting, from an input document, a lack of another expression that should have a dependency relationship with a certain expression is disclosed in Patent Document 2.

The data processing apparatus of Patent Document 2 includes an input unit, a syntax analysis unit, a storage unit, a determination unit, and an output unit.

First, the data processing apparatus of Patent Document 2 inputs document data by an input means.

Next, the data processing apparatus disclosed in Patent Document 2 executes syntax analysis of document data by syntax analysis means.

Subsequently, the data processing apparatus of Patent Document 2 uses the determination means to determine whether or not the most basic single element is missing from the syntax tree that results from the syntax analysis. Based on the determination result, the apparatus then determines whether or not the document data lacks a phrase needed to form a grammatical “sentence”.

Further, the data processing apparatus of Patent Document 2 stores in advance the correspondence relationship of expressions that should be in a dependency relationship by the storage means.

Subsequently, when one of the expressions that should be in the dependency relationship is described in the document data, the data processing apparatus of Patent Document 2 determines whether or not the other expression actually exists in the document data in that dependency relationship.

Subsequently, the data processing apparatus disclosed in Patent Document 2 outputs a determination result as to whether or not there is an expression that should be in a dependency relationship by the output means.

An example of a technique for detecting a lack of another word to be described in relation to a certain word from an input document is disclosed in Patent Document 3.

The data processing apparatus disclosed in Patent Document 3 includes storage means, input means, determination means, and output means.

In the data processing apparatus of Patent Document 3, the storage means stores in advance a text mining dictionary table in which a first word is registered and a relation information table in which a second word to be described in relation to the first word is registered.

The data processing apparatus disclosed in Patent Document 3 first inputs document data by an input means.

Next, the data processing apparatus of Patent Document 3 uses the determination means to determine whether or not the first word is present in the input document. When the first word is present, the determination means then determines whether or not the second word is present in the input document. When the first word is present in the input document and the second word is not, the determination means judges that the input document lacks a description including the second word.

Subsequently, the data processing apparatus of Patent Document 3 outputs a determination result by an output unit.

Patent Document 1: JP 2008-033887 A
Patent Document 2: Japanese Patent No. 5095128
Patent Document 3: JP 2007-310829 A

The document data processing apparatus of Patent Document 1 determines the presence or absence of an expression that tends to be short of description, but does not determine whether or not there is actually a shortage of description in the input document. That is, the document data processing apparatus of Patent Document 1 has a problem that it cannot be determined whether or not there is actually a shortage of description in the input document.

The data processing apparatus disclosed in Patent Document 2 determines that there is a shortage of description when the input document data is syntactically deficient or when an expression that should be in a dependency relationship with an expression registered in advance in the storage means is missing. For this reason, when one of the expressions that should be in a dependency relationship is absent from one sentence, the data processing apparatus of Patent Document 2 determines that there is a shortage of description even if that expression is present in another sentence. That is, when the expressions that should be in a dependency relationship appear across sentences, the apparatus should determine that there is no shortage of description, but it determines that there is one. As described above, the data processing apparatus of Patent Document 2 has a problem that it cannot determine whether there is a shortage of description while taking a plurality of sentences in the input document into account.

The data processing apparatus of Patent Document 3 has the same problem as the data processing apparatus of Patent Document 2.
(Object of invention)
The main object of the present invention is to provide a document data processing apparatus, a document data processing method, and a document data processing program that determine, when an expression relating to a predetermined subject exists in a document, whether or not an expression relating to a predetermined matter to be described in relation to the predetermined subject is described within an appropriate range.

The document data processing apparatus according to the present invention includes: co-occurrence range setting means for determining a first range on the basis of the distribution, in a first document, of the shortest distance between the appearance position of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, the first range being the range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and detailing shortage detection means for detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.

The document data processing method of the present invention includes: determining a first range on the basis of the distribution, in a first document, of the shortest distance between the appearance position of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, the first range being the range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.

The document data processing program of the present invention causes a computer to execute: a co-occurrence range setting process of determining a first range on the basis of the distribution, in a first document, of the shortest distance between the appearance position of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, the first range being the range of positions at which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and a detailing shortage detection process of detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.

According to the present invention, when an expression relating to a predetermined subject exists in a document, it is possible to determine whether or not an expression relating to a predetermined matter to be described in relation to the predetermined subject is described within an appropriate range.

FIG. 1 is a block diagram showing an example of the configuration of the document data processing apparatus 10 in the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the document data processing apparatus 10 in the first embodiment of the present invention.
FIG. 3 is a diagram for explaining a specific example of the detailed table T1 in the first embodiment of the present invention.
FIG. 4 is a diagram for explaining a specific example of the distribution of the minimum co-occurrence distance in the first embodiment of the present invention.
FIG. 5 is a diagram for explaining a specific example of the detailed table T2 in the first embodiment of the present invention.
FIG. 6 is a diagram for explaining a specific example of the input document D1 in the first embodiment of the present invention.
FIG. 7 is a diagram for explaining a specific example of the detailed table T3 in the second embodiment of the present invention.
FIG. 8 is a diagram for explaining a specific example of the input document D2 in the second embodiment of the present invention.
FIG. 9 is a diagram for explaining a specific example of the input document D3 in the third embodiment of the present invention.
FIG. 10 is a block diagram showing an example of the configuration of the document data processing apparatus 11 in the fourth embodiment of the present invention.
FIG. 11 is a block diagram showing an example of the configuration of the document data processing apparatus 12 in the fifth embodiment of the present invention.
FIG. 12 is a block diagram showing an example of the configuration of an information processing apparatus for implementing the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings, equivalent components are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

First, common terms used in the following description of each embodiment will be described.

The “input document” is a document (document information, document data) input as a processing target to the document data processing apparatus according to each embodiment of the present invention.

In each embodiment of the present invention, the “document (input document)” may be, for example, a document including a character string, a symbol, a table, or the like listed below.

・ Documents written in a specific language and including one or more sentences (simple sentences, complex sentences, compound sentences, etc.)
・ Tables or forms with multiple items arranged in rows or columns (for example, form sheets created with Microsoft Excel (registered trademark))
・ Documents that contain a mixture of the above specific-language text and tables or forms (for example, instruction manuals for various products)
The “document” may further include figures and images.

“Detailing” means describing, in a document, a matter that should be explained in relation to a subject.

The following terms are used in relation to “detailing”.

・ “Expression to be detailed”: an expression indicating a subject for which a related matter should be described
・ “Detailing expression”: an expression indicating the matter to be described in relation to the subject
・ “Situation-limiting word”: an expression indicating a condition that limits the situations in which the “expression to be detailed” and the “detailing expression” should co-occur in the document
Each “expression” in the above “expression to be detailed”, “detailing expression”, and “situation-limiting word” is, for example, a noun or a part of a noun. However, an “expression” is not limited to a noun or a part of a noun, and may include any of a character (character string), a symbol (symbol string), a table, a form, a figure, or a combination thereof.

The absence from the document of at least one “detailing expression” corresponding to an “expression to be detailed” is called “insufficient detailing”.

A plurality of “situation-limiting words” may be associated with one pair of an “expression to be detailed” and a “detailing expression”.
(First embodiment)
A configuration in the present embodiment will be described.

FIG. 1 is a block diagram showing an example of the configuration of a document data processing apparatus 10 according to the first embodiment of the present invention.

The document data processing apparatus 10 includes a document input unit 101, a detailed expression database 102, a word extraction unit 103, a co-occurrence presence/absence check unit 104, a co-occurrence range setting unit 105, a detail shortage detection unit 106, and an output unit 107.

The document input unit 101 inputs, to the document data processing apparatus 10, a document (that is, an input document) in which insufficient detailing is to be detected.

The detailed expression database 102 stores in advance a “detailed table” that includes at least data in which an expression to be detailed is associated with a detailing expression. The detailed table may further include a situation-limiting word associated with a pair of an expression to be detailed and a detailing expression. The detailed expression database 102 may also store in advance synonyms of the expression to be detailed or of the detailing expression, so that a synonym is treated as identical to the corresponding expression to be detailed or detailing expression.
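As a rough illustration only, the detailed table and the optional synonym handling described above could be represented as follows. The class name `DetailedTableRow`, the function `canonical`, and the synonym entry are hypothetical and do not appear in the specification; the sample rows borrow the values used in the specific examples of this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetailedTableRow:
    target: str                  # expression to be detailed (the subject)
    detailing: str               # detailing expression (the matter to describe)
    situation_words: tuple = ()  # optional situation-limiting words


# Hypothetical synonym map: a synonym is treated as the registered expression.
SYNONYMS = {"response time": "performance"}


def canonical(expression):
    """Map a synonym to its registered expression; other strings pass through."""
    return SYNONYMS.get(expression, expression)


detailed_table = [
    DetailedTableRow("search system", "performance"),
    DetailedTableRow("csv", "character code", ("input", "output")),
]
```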

The word extraction unit 103 searches the input document for character strings that match the expressions to be detailed, the detailing expressions, and the situation-limiting words stored in advance in the detailed expression database 102. The word extraction unit 103 then records the position of each matching character string in a storage device (not shown). The position of a character string is specified using a file name, page number, line number, sentence number, cell coordinates (cell number), or character number.

Hereinafter, the position in the document of each character string that matches an expression to be detailed is called a “location to be detailed”, the position of each character string that matches a detailing expression is called a “detailing location”, and the position of each character string that matches a situation-limiting word is called a “situation-limiting location”. Even when the character string is the same, occurrences at different positions in the document are treated as different locations to be detailed, detailing locations, or situation-limiting locations.
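The position recording described above can be sketched as follows, assuming that a position is a pair of a line number and a character offset. The function name `find_locations` and the sample text are hypothetical. Each occurrence is recorded separately, even when the character string is the same.

```python
def find_locations(lines, expression):
    """Return (line_number, character_offset) for every occurrence of expression."""
    locations = []
    for line_no, text in enumerate(lines, start=1):
        start = text.find(expression)
        while start != -1:
            locations.append((line_no, start))
            # Continue searching after the current hit on the same line.
            start = text.find(expression, start + 1)
    return locations


doc = ["The csv file is read.", "Output is csv; csv uses UTF-8."]
csv_locations = find_locations(doc, "csv")
print(csv_locations)  # [(1, 4), (2, 10), (2, 15)]
```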

When a character string registered as an expression to be detailed or a detailing expression is part of a compound word in the input document, the word extraction unit 103 may regard the entire compound word as an occurrence of that expression. For example, when the character string “ID” is registered in the detailed expression database 102, the word extraction unit 103 may regard compound words containing “ID” in the input document, such as “user ID” and “product ID”, as occurrences of the registered expression.

The co-occurrence presence/absence check unit 104 calculates the “minimum co-occurrence distance” between each expression to be detailed and its detailing expression on the basis of the position information of the locations to be detailed and the detailing locations. Here, the “minimum co-occurrence distance” is the “distance” from a location to be detailed to the nearest corresponding detailing location, whether that detailing location appears before or after the location to be detailed. The “distance” may be any numerical measure of the separation between two expressions in the document, such as the number of characters, the number of lines, the difference in sentence numbers, or the number of pages between the two expressions. When a situation-limiting word is registered in the detailed expression database 102, the co-occurrence presence/absence check unit 104 also calculates the minimum co-occurrence distance between the location to be detailed and the situation-limiting location.
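A minimal sketch of the minimum co-occurrence distance, assuming that the distance is measured as a difference in line numbers (one of the measures listed above). The function name and the use of `None` for "no detailing location exists" are assumptions made for illustration.

```python
def min_cooccurrence_distance(target_line, detailing_lines):
    """Distance in lines from a location to be detailed to the nearest
    detailing location, looking both before and after it."""
    if not detailing_lines:
        return None  # the detailing expression never appears
    return min(abs(target_line - d) for d in detailing_lines)


print(min_cooccurrence_distance(10, [3, 12, 40]))  # 2
```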

Based on the minimum co-occurrence distances calculated by the co-occurrence presence/absence check unit 104, the co-occurrence range setting unit 105 determines an “appropriate co-occurrence range” between the expression to be detailed and the detailing expression. Here, the “appropriate co-occurrence range” is the range of positions in which the detailing expression should appear relative to the expression to be detailed.

For example, the co-occurrence range setting unit 105 builds, for each pair of an expression to be detailed and a detailing expression, a histogram of the appearance frequency of the minimum co-occurrence distance, and determines the minimum co-occurrence distance with the highest appearance frequency as the appropriate co-occurrence range for that pair. Alternatively, the co-occurrence range setting unit 105 may determine the appropriate co-occurrence range for each pair on the basis of the distribution of the distances from the locations to be detailed to the detailing locations. If the detailing expression never appears for an expression to be detailed, the co-occurrence range setting unit 105 determines “none” as the appropriate co-occurrence range. When several minimum co-occurrence distances share the highest appearance frequency, the co-occurrence range setting unit 105 may determine, as the appropriate co-occurrence range, the smallest of those distances, the largest of those distances, or their average. Note that the wider the appropriate co-occurrence range is set, the fewer locations to be detailed are determined to be insufficiently detailed.
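The histogram-based choice of the appropriate co-occurrence range, including the handling of ties and of the “none” case, might look like the following sketch. The function name, the `tie` parameter, and the use of `None` to stand for “none” are illustrative assumptions, not part of the specification.

```python
from collections import Counter
from statistics import mean


def appropriate_range(distances, tie="max"):
    """Pick the most frequent minimum co-occurrence distance.

    distances: one value per location to be detailed (None = no detailing location).
    tie: how to resolve several equally frequent distances ("min", "max", "mean").
    """
    observed = [d for d in distances if d is not None]
    if not observed:
        return None  # corresponds to the range "none"
    counts = Counter(observed)  # histogram of appearance frequencies
    top = max(counts.values())
    modes = [d for d, c in counts.items() if c == top]
    if tie == "min":
        return min(modes)
    if tie == "mean":
        return mean(modes)
    return max(modes)


print(appropriate_range([1, 1, 0, 3, 1]))  # 1
```

Choosing "max" for ties gives the widest range and therefore flags the fewest locations, in line with the note above.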

The detail shortage detection unit 106 detects, for each pair of an expression to be detailed and a detailing expression, locations to be detailed at which insufficient detailing has occurred (hereinafter referred to as “insufficient detail locations”), on the basis of the appropriate co-occurrence range determined by the co-occurrence range setting unit 105 and a “detailing shortage detection rule”.

The “detailing shortage detection rule” is a rule that defines whether detailing is judged to be sufficient (or insufficient) depending on whether a detailing location co-occurs with a location to be detailed within the appropriate co-occurrence range. For example, the detailing shortage detection rule determines that detailing is insufficient at a location to be detailed if no detailing location co-occurs with it within the appropriate co-occurrence range.

When the appropriate co-occurrence range is set to “none”, the detail shortage detection unit 106 regards, for example, all locations to be detailed for the corresponding pair of an expression to be detailed and a detailing expression as insufficient detail locations.
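The example rule and the “none” case can be expressed compactly as below. This is only a sketch of one possible rule; the function name and the convention that `None` represents both a missing distance and the range “none” are assumptions.

```python
def is_insufficient(distance, allowed_range):
    """Example detailing shortage detection rule for one location to be detailed.

    distance: its minimum co-occurrence distance (None if no detailing location).
    allowed_range: the appropriate co-occurrence range (None stands for "none").
    """
    if allowed_range is None:
        return True  # range "none": every location to be detailed is flagged
    if distance is None:
        return True  # nothing to co-occur with
    return distance > allowed_range  # outside the appropriate range


print([is_insufficient(d, 1) for d in (0, 1, 4)])  # [False, False, True]
```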

In some cases, the detail shortage detection unit 106 detects a compound word that includes the expression to be detailed as a variation of that expression. In this case, the detailing shortage detection rule may be a rule that determines that detailing is not insufficient if, for at least one of the locations to be detailed corresponding to each variation, a detailing location co-occurs within the appropriate co-occurrence range.

A situation-limiting word may also be registered in the detailed expression database 102. In this case, the detail shortage detection unit 106 performs detection of insufficient detail locations when a location to be detailed and a situation-limiting location corresponding to the expression to be detailed co-occur within a predetermined co-occurrence range. Alternatively, the detail shortage detection unit 106 may perform the detection when the situation-limiting word co-occurs within the appropriate co-occurrence range that the co-occurrence range setting unit 105 has set for the expression to be detailed and the detailing expression.
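The gating by a situation-limiting word can be sketched as a simple precondition, assuming distances are measured in lines. The function name `rule_applies` and the parameter `situation_range` are hypothetical; passing a predetermined value gives the first variant described above, and passing the appropriate co-occurrence range of the pair gives the alternative.

```python
def rule_applies(target_line, situation_lines, situation_range):
    """True if some situation-limiting location lies within situation_range
    lines of the location to be detailed, so that detection is performed."""
    return any(abs(target_line - s) <= situation_range for s in situation_lines)


print(rule_applies(10, [2, 12], 2))  # True
print(rule_applies(10, [2], 2))      # False
```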

The output unit 107 outputs the insufficient detail locations detected by the detail shortage detection unit 106, for example in a manner that the user can distinguish. The output mode is, for example, a list display recognizable by the user or the provision of information to an external device. Alternatively, the output unit 107 may output the locations to be detailed in such a manner that the user can tell whether or not each one has been determined to be an insufficient detail location. For example, among the locations to be detailed, the output unit 107 may use a different color, font, line thickness, or the like for those that are insufficient detail locations and those that are not.

Next, the operation in this embodiment will be described.

FIG. 2 is a flowchart showing the operation of the document data processing apparatus 10 in the first embodiment of the present invention.

The document input unit 101 inputs a document (input document) in which insufficient detailing is to be detected (step S101).

The word extraction unit 103 detects, in the input document, the locations to be detailed and the detailing locations that match the expressions to be detailed and the detailing expressions stored in the detailed expression database 102 (step S102).

The co-occurrence presence/absence check unit 104 calculates, for each pair of a location to be detailed and a detailing location extracted by the word extraction unit 103, the minimum co-occurrence distance between them (step S103).

The co-occurrence range setting unit 105 determines, for each pair of an expression to be detailed and a detailing expression, an appropriate co-occurrence range on the basis of the minimum co-occurrence distances between the locations to be detailed and the detailing locations (step S104).

The detail shortage detection unit 106 detects insufficient detail locations on the basis of the appropriate co-occurrence range determined by the co-occurrence range setting unit 105 and the detailing shortage detection rule (step S105).

The output unit 107 outputs the insufficient detail locations detected by the detail shortage detection unit 106 (step S106).
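Steps S101 to S106 can be put together in a short end-to-end sketch. It assumes that positions and distances are line numbers, that no situation-limiting word is registered, and that the most frequent distance is used as the appropriate co-occurrence range (ties are resolved arbitrarily here). All function names and the sample document are hypothetical.

```python
from collections import Counter


def locations(lines, expression):
    """S102: line numbers (1-based) of the lines that contain the expression."""
    return [n for n, text in enumerate(lines, 1) if expression in text]


def detect_insufficient(lines, target, detailing):
    targets = locations(lines, target)
    details = locations(lines, detailing)
    # S103: minimum co-occurrence distance for each location to be detailed.
    dists = {t: min((abs(t - d) for d in details), default=None) for t in targets}
    observed = [d for d in dists.values() if d is not None]
    if not observed:
        return targets  # range "none": every location to be detailed is flagged
    # S104: the most frequent distance becomes the appropriate range.
    allowed = Counter(observed).most_common(1)[0][0]
    # S105: flag locations with no detailing location inside the range.
    return [t for t, d in dists.items() if d > allowed]


# S101: a hypothetical 12-line input document.
doc = ["(other text)"] * 12
for n in (1, 4, 7, 12):
    doc[n - 1] = "The search system handles queries."
for n in (2, 5, 8):
    doc[n - 1] = "Required performance: 100 ms per query."

# S106: output the insufficient detail locations.
print(detect_insufficient(doc, "search system", "performance"))  # [12]
```

In this sample, the locations on lines 1, 4, and 7 each have a detailing location one line away, so the range becomes one line and only line 12 is reported.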

Next, a specific example of processing according to the first embodiment of the present invention will be described.

FIG. 3 is a diagram for explaining a specific example of the detailed table T1 in the first embodiment of the present invention.

The detailed expression database 102 stores the detailed table T1.

FIG. 4 is a diagram for explaining a specific example of the distribution of the minimum co-occurrence distance in the first embodiment of the present invention.

Hereinafter, the operation of the document data processing apparatus 10 when no situation-limiting word is designated will be described with reference to FIGS. 3 and 4.

The detailed table T1 can store an expression to be detailed C1, a detailing expression C2, and a situation-limiting word C3. However, since no situation-limiting word is specified in the detailed table T1, the situation-limiting word C3 column is blank. The detailed expression database 102 stores in advance at least the detailed table T1, in which the expression to be detailed C1 and the detailing expression C2 are associated with each other.

In the detailed table T1, “search system” is stored as the expression to be detailed C1, “performance” is stored as the detailing expression C2, and the situation-limiting word C3 is blank.

The document input unit 101 inputs a document in which insufficient detailing is to be detected (step S101 in FIG. 2).

The word extraction unit 103 refers to the detailed table T1 and detects, in the input document, the locations to be detailed and the detailing locations that match the stored expression to be detailed C1 “search system” and detailing expression C2 “performance”, respectively (step S102 in FIG. 2).

The co-occurrence presence/absence check unit 104 calculates the minimum co-occurrence distance for each pair of a location to be detailed corresponding to “search system” and a detailing location corresponding to “performance” (step S103 in FIG. 2). When there are a plurality of detailing locations corresponding to “performance”, the co-occurrence presence/absence check unit 104 calculates, as the minimum co-occurrence distance, the distance between the location to be detailed and the detailing location closest to it.

The co-occurrence range setting unit 105 determines the appropriate co-occurrence range on the basis of the minimum co-occurrence distances, calculated by the co-occurrence presence/absence check unit 104, between the locations to be detailed corresponding to “search system” and the detailing locations corresponding to “performance” (step S104 in FIG. 2). In FIG. 4, in the distribution of the minimum co-occurrence distance between the expression to be detailed C1 “search system” and the detailing expression C2 “performance”, the minimum co-occurrence distance of “one line” has the highest frequency. When the minimum co-occurrence distance with the highest appearance frequency is used as the appropriate co-occurrence range, the co-occurrence range setting unit 105 determines “one line” as the appropriate co-occurrence range.

The detail shortage detection unit 106 detects insufficient detail locations on the basis of the appropriate co-occurrence range determined by the co-occurrence range setting unit 105 and the detailing shortage detection rule (step S105 in FIG. 2). Consider the case where the detailing shortage detection rule is “determine that detailing is insufficient if, for a location to be detailed, no detailing location co-occurs within the appropriate co-occurrence range”. In this case, if no detailing location corresponding to “performance” exists within one line before or after a location to be detailed corresponding to “search system”, the detail shortage detection unit 106 detects that location to be detailed as an insufficient detail location.

The output unit 107 outputs the insufficient detail locations detected by the detail shortage detection unit 106 (step S106 in FIG. 2).

Next, another specific example of the process according to the first embodiment of the present invention will be described.

Hereinafter, the operation of the document data processing apparatus 10 when a situation-limited word is designated will be described with reference to FIGS. 5 and 6.

FIG. 5 is a diagram for explaining a specific example of the detailed table T2 in the first embodiment of the present invention.

In the refinement table T2, “csv” is stored in the refined expression C4, “character code” is stored in the refined expression C5, and “input” and “output” are stored in the situation limited word C6.

FIG. 6 is a diagram for explaining a specific example of the input document D1 in the first embodiment. In the input document D1, the refined expression C4 “csv” appears at positions P1, P2, P3, and P4. In the input document D1, “(page break)” indicates a symbol indicating page break, and “:” and “(omitted)” indicate that a part of the document is omitted.

The document input unit 101 inputs the input document D1 that is a target for detection of insufficient detail (step S101 in FIG. 2).

The word extraction means 103 refers to the refinement table T2, and stores each of the memorized refined expression C4 “csv”, the refined expression C5 “character code”, the situation limited word C6 “input” and “output”. 2 is detected from the input document D1 (step S102 in FIG. 2). In the input document D1, the detailed parts corresponding to the detailed expression C4 “csv” are the detailed parts P1, P2, P3, and P4.

The co-occurrence presence / absence check unit 104 calculates the minimum co-occurrence distance between the refined location and the refined location for each pair of the refined location corresponding to “csv” and the refined location corresponding to “character code”. Calculate (step S103 in FIG. 2). If there are a plurality of detailed locations corresponding to the “character code”, the co-occurrence presence check means 104 determines the minimum co-occurrence of the distance between the detailed location and the detailed location closest to the detailed location. Calculate as distance.

The co-occurrence range setting means 105 determines the appropriate co-occurrence range on the basis of the minimum co-occurrence distances, calculated by the co-occurrence presence/absence check means 104, between the locations to be refined corresponding to “csv” and the refining locations corresponding to “character code” (step S104 in FIG. 2). In the distribution of the minimum co-occurrence distance between the expression to be refined C4 “csv” and the refining expression C5 “character code”, the minimum co-occurrence distance with the highest appearance frequency is “1 line”. When the minimum co-occurrence distance with the highest appearance frequency is used as the appropriate co-occurrence range, the co-occurrence range setting means 105 sets “1 line” as the appropriate co-occurrence range between the expression to be refined C4 “csv” and the refining expression C5 “character code”.
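Choosing the most frequent minimum co-occurrence distance in step S104 amounts to taking the mode of the distribution. The following sketch assumes the distances are already available as a list of integers; the function name, the sample values, and the tie-breaking choice (smallest distance wins) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def appropriate_range_from_mode(distances: Iterable[Optional[int]]) -> Optional[int]:
    """Return the most frequent minimum co-occurrence distance.

    None entries (targets with no refining location at all) are ignored.
    """
    counts = Counter(d for d in distances if d is not None)
    if not counts:
        return None
    top = max(counts.values())
    # Tie-break (an assumption of this sketch): prefer the smallest distance.
    return min(d for d, c in counts.items() if c == top)


# Hypothetical distances in lines for four occurrences of "csv".
print(appropriate_range_from_mode([1, 6, 1, 14]))
```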

The refinement insufficiency detection means 106 detects insufficient-refinement locations on the basis of the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and a refinement insufficiency detection rule (step S105 in FIG. 2). The following describes the case where the rule is: “For each location to be refined, if a situation-limiting word co-occurs on the same page and no refining location co-occurs within the appropriate co-occurrence range, the location is determined to be insufficiently refined.” In this case, when a situation-limiting location corresponding to “input” or “output” co-occurs on the same page as a location to be refined corresponding to “csv”, and no refining location corresponding to “character code” exists within one line before or after it, the refinement insufficiency detection means 106 detects that location to be refined as an insufficient-refinement location.

The location to be refined P1 is not an insufficient-refinement location, because a situation-limiting location corresponding to “input” co-occurs on the same page and a refining location corresponding to “character code” co-occurs within one line before or after it. The location to be refined P2 is an insufficient-refinement location, because a situation-limiting location corresponding to “input” co-occurs on the same page but no refining location corresponding to “character code” co-occurs within one line before or after it. The location to be refined P3 is not an insufficient-refinement location, because a situation-limiting location corresponding to “output” co-occurs on the same page and a refining location corresponding to “character code” co-occurs within one line before or after it. The location to be refined P4 is not an insufficient-refinement location, because no situation-limiting location corresponding to “input” or “output” co-occurs on the same page.
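The rule applied in step S105 can be sketched as below. The `Occurrence` type, the function name, and all page and line numbers are hypothetical; the positions are made up for illustration and are not read from FIG. 6. Distance is assumed to be the difference between document-wide line numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Occurrence:
    page: int
    line: int  # line number counted from the start of the document


def find_insufficient(
    targets: List[Occurrence],
    refiners: List[Occurrence],
    limiters: List[Occurrence],
    max_distance: int,
) -> List[Occurrence]:
    """Flag a target when a situation-limiting word shares its page and
    no refining location lies within max_distance lines of it."""
    flagged = []
    for t in targets:
        limited = any(s.page == t.page for s in limiters)
        refined = any(abs(r.line - t.line) <= max_distance for r in refiners)
        if limited and not refined:
            flagged.append(t)
    return flagged


# Hypothetical layout: four "csv" occurrences on three pages.
targets = [Occurrence(1, 3), Occurrence(1, 10), Occurrence(2, 25), Occurrence(3, 40)]
refiners = [Occurrence(1, 4), Occurrence(2, 26)]   # "character code"
limiters = [Occurrence(1, 2), Occurrence(2, 24)]   # "input" / "output"
print(find_insufficient(targets, refiners, limiters, max_distance=1))
```

With these made-up positions only the second target is flagged: it shares a page with a situation-limiting word but has no refining location within one line.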

As described above, according to the document data processing apparatus 10 of the present embodiment, when an expression relating to a predetermined subject exists in a document, it is possible to determine whether an expression relating to a predetermined matter to be described in relation to that subject is described within an appropriate range. The reason is that the document data processing apparatus 10 determines the appropriate co-occurrence range on the basis of the distribution of the minimum co-occurrence distance between locations to be refined and refining locations, and determines whether an insufficient-refinement location exists with respect to that appropriate co-occurrence range. Here, the expression relating to the predetermined subject can be rephrased as the expression to be refined. The expression relating to the predetermined matter to be described in relation to the predetermined subject can be rephrased as the refining expression. Furthermore, whether these are described within an appropriate range can be rephrased as the presence or absence of an insufficient-refinement location.

In general, when the input document is a form, a software program, or the like composed of a large number of character strings and symbol strings, it is difficult to detect insufficient refinement using a single reasonable criterion common to the entire input document. For such a large input document, the document data processing apparatus 10 according to the present embodiment sets the appropriate co-occurrence range on the basis of the distribution of the minimum co-occurrence distance between locations to be refined and refining locations. If a location to be refined has no refining location within the appropriate co-occurrence range, the apparatus determines that an insufficient-refinement location exists. For this reason, the document data processing apparatus 10 of this embodiment has the effect that the presence or absence of an insufficient-refinement location can be determined using an appropriate co-occurrence range that differs for each expression to be refined.
(Second Embodiment)
Next, a second embodiment based on the above-described first embodiment will be described. In the following description, the same components as those in the first embodiment are denoted by the same reference numerals, and description thereof is omitted as appropriate.

The configuration in this embodiment will be described.

The configuration of the document data processing apparatus in the present embodiment is the same as the configuration of the document data processing apparatus 10 in the first embodiment.

Next, the operation in this embodiment will be described.

In the present embodiment, when a character string registered as an expression to be refined or a refining expression appears as part of a compound word in the input document, the document data processing apparatus 10 regards the entire compound word as an occurrence of that expression to be refined or refining expression.

Next, a specific example of processing in this embodiment will be described.

FIG. 7 is a diagram for explaining a specific example of the refinement table T3 in the second embodiment of the present invention.

The detailed expression database 102 stores the refinement table T3 in advance. The refinement table T3 stores “ID” as the expression to be refined C7 and “unchangeable” as the refining expression C8. Since no situation-limiting word is designated in the refinement table T3, the situation-limiting word C9 is blank.

FIG. 8 is a diagram for explaining a specific example of the input document D2 in the second embodiment of the present invention. In the input document D2, the expression to be refined C7 “ID” appears at positions P5, P6, P7, P8, and P9.

The document input means 101 receives the input document D2 (step S101 in FIG. 2).

The word extraction means 103 detects, in the input document D2, the locations to be refined and the refining locations corresponding to the expression to be refined C7 “ID” and the refining expression C8 “unchangeable” stored in the refinement table T3 (step S102 in FIG. 2). For the compound words “user ID”, “product ID”, “store ID”, and “order ID” in the input document D2, each of which contains the expression to be refined C7 “ID”, the word extraction means 103 detects the entire compound word as a location to be refined. That is, in the input document D2, the word extraction means 103 detects the location to be refined P5 corresponding to “user ID”, the location to be refined P6 corresponding to “product ID”, the locations to be refined P7 and P9 corresponding to “store ID”, and the location to be refined P8 corresponding to “order ID”.
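One way to picture compound-word detection is a pattern that captures the registered expression together with the word directly in front of it. The sketch below rests on a strong simplifying assumption that does not come from the embodiment: a compound word is the registered expression optionally preceded by one alphabetic word and a space. The function name and sample lines are hypothetical, and a practical system would more likely rely on morphological analysis.

```python
import re
from typing import List, Tuple


def find_compounds(lines: List[str], base: str) -> List[Tuple[int, str]]:
    """Return (line number, matched compound word) for each occurrence of
    `base`, extended to include one preceding alphabetic word if present."""
    pattern = re.compile(r"\b(?:[A-Za-z]+ )?" + re.escape(base) + r"\b")
    hits = []
    for line_no, text in enumerate(lines, start=1):
        for match in pattern.finditer(text):
            hits.append((line_no, match.group(0)))
    return hits


# Hypothetical document lines.
sample = ["user ID: unchangeable", "see the order ID below", "ID"]
print(find_compounds(sample, "ID"))
```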

The co-occurrence presence/absence check means 104 calculates, for each pair of a location to be refined corresponding to “ID” and a refining location corresponding to “unchangeable”, the minimum co-occurrence distance between the two (step S103 in FIG. 2). When there are a plurality of refining locations corresponding to “unchangeable”, the co-occurrence presence/absence check means 104 takes, for each location to be refined, the distance to the nearest refining location as the minimum co-occurrence distance.

The co-occurrence range setting means 105 determines the appropriate co-occurrence range on the basis of the minimum co-occurrence distances, calculated by the co-occurrence presence/absence check means 104, between the locations to be refined corresponding to “ID” and the refining locations corresponding to “unchangeable” (step S104 in FIG. 2). In the input document D2, in the distribution of the minimum co-occurrence distance (in lines) between the expression to be refined C7 “ID” and the refining expression C8 “unchangeable”, the minimum co-occurrence distance of 0 lines (same line) appears 3 times, the minimum co-occurrence distance of 5 lines (from P8 to the line of P7) appears 1 time, and the minimum co-occurrence distance of 7 lines including one blank line (from P9 to the line of P7) appears 1 time. Therefore, the co-occurrence range setting means 105 determines the minimum co-occurrence distance of 0 lines as the appropriate co-occurrence range between the expression to be refined C7 “ID” and the refining expression C8 “unchangeable”.

The refinement insufficiency detection means 106 detects insufficient-refinement locations on the basis of the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and a refinement insufficiency detection rule (step S105 in FIG. 2). The following describes the case where the rule is: “For each variation of a compound word containing the expression to be refined, if a refining location co-occurs within the appropriate co-occurrence range for at least one location to be refined corresponding to that variation, the variation is determined not to be insufficiently refined.” In this case, if none of the locations to be refined corresponding to a particular compound word containing “ID” has a refining location corresponding to “unchangeable” within 0 lines (on the same line), the refinement insufficiency detection means 106 detects those locations to be refined as insufficient-refinement locations. In the input document D2, among the locations to be refined P5 to P9, the locations P5, P6, and P7 are not insufficient-refinement locations, because each has a refining location corresponding to “unchangeable” on the same line. The location to be refined P8 “order ID” is an insufficient-refinement location, because it has no refining location corresponding to “unchangeable” on the same line and there is no location to be refined corresponding to “order ID” other than P8. The location to be refined P9 has no refining location corresponding to “unchangeable” on the same line, but because the location to be refined P7, which also corresponds to “store ID”, is not an insufficient-refinement location, P9 is not an insufficient-refinement location either.
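The per-variation rule can be sketched as follows, using the distances stated in the text for the input document D2 (0 lines for P5 to P7, 5 lines for P8, 7 lines for P9). The tuple layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

# (location label, compound-word variation, minimum co-occurrence distance)
Observation = Tuple[str, str, Optional[int]]


def insufficient_by_variation(
    observations: List[Observation], max_distance: int
) -> List[str]:
    """A variation is treated as sufficiently refined when at least one of
    its occurrences has a refining location within max_distance lines;
    occurrences of every other variation are flagged."""
    satisfied = defaultdict(bool)
    for _, variation, distance in observations:
        if distance is not None and distance <= max_distance:
            satisfied[variation] = True
    return [label for label, variation, _ in observations if not satisfied[variation]]


d2 = [
    ("P5", "user ID", 0),
    ("P6", "product ID", 0),
    ("P7", "store ID", 0),
    ("P8", "order ID", 5),
    ("P9", "store ID", 7),
]
print(insufficient_by_variation(d2, max_distance=0))
```

P9 escapes being flagged only because P7, another occurrence of the same variation, satisfies the range.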

As described above, according to the document data processing apparatus of the present embodiment, simply by designating one expression to be refined, insufficient refinement can be detected for each of the multiple variations of expressions to be refined that contain it. Therefore, according to the document data processing apparatus of the present embodiment, in addition to the effects of the first embodiment, there is an effect that insufficient-refinement locations can be narrowed down appropriately without registering each variation of the expression to be refined individually.
(Third embodiment)
Next, a third embodiment based on the above-described second embodiment will be described. In the following description, the same components as those in the second embodiment are denoted by the same reference numerals, and description thereof is omitted as appropriate.

The configuration in this embodiment will be described.

The configuration of the document data processing apparatus in the present embodiment is the same as that of the document data processing apparatus 10 in the second embodiment.

Next, the operation in this embodiment will be described.

The detailed expression database 102 stores in advance the same refinement table T3 as in the second embodiment.

In this embodiment, the co-occurrence presence/absence check means 104 and the co-occurrence range setting means 105 determine the appropriate co-occurrence range while distinguishing the direction in which the refining expression lies as seen from the expression to be refined.

Next, a specific example of processing in this embodiment will be described.

FIG. 9 is a diagram for explaining a specific example of the input document D3 in the third embodiment of the present invention. The input document D3 is a document containing a table that describes, for each of “user ID”, “product ID”, “store ID”, and “order ID”, which are the values in the “item” column C11, the values of the “attribute” column C10 and the “remarks” column C12. The “attribute” values of the items “user ID”, “product ID”, and “store ID” are “non-overlapping” and “unchangeable”. The “attribute” value of the item “order ID” is “non-overlapping”. The “remarks” values of the items “user ID”, “product ID”, and “store ID” are blank. The “remarks” value of the item “order ID” is “Product ID cannot be changed even if the product is changed”.

The input document D3 contains, as values in the “item” column C11, the compound words “user ID”, “product ID”, “store ID”, and “order ID”, each of which includes the expression to be refined C7 “ID”. In the refinement table T3, the expression to be refined C7 “ID” is associated with the refining expression C8 “unchangeable”.

The word extraction means 103 detects, in the input document D3, the locations to be refined and the refining locations corresponding to the expression to be refined C7 “ID” and the refining expression C8 “unchangeable” stored in the refinement table T3 (step S102 in FIG. 2). For the compound words “user ID”, “product ID”, “store ID”, and “order ID” in the input document D3, each of which contains the expression to be refined C7 “ID”, the word extraction means 103 detects the entire compound word as a location to be refined. That is, in the input document D3, the word extraction means 103 detects the second row of the “item” column C11 as the location to be refined corresponding to “user ID”. The word extraction means 103 also detects the third row of the “item” column C11 and the fifth row of the “remarks” column C12 as the locations to be refined corresponding to “product ID”. Further, the word extraction means 103 detects the fourth row of the “item” column C11 as the location to be refined corresponding to “store ID”, and the fifth row of the “item” column C11 as the location to be refined corresponding to “order ID”. In addition, the word extraction means 103 detects the second, third, and fourth rows of the “attribute” column C10 as the refining locations corresponding to “unchangeable” in the input document D3.

The co-occurrence presence/absence check means 104 then calculates, for each pair of a location to be refined corresponding to “ID” and a refining location corresponding to “unchangeable” in the input document D3, the minimum co-occurrence distance between the two (step S103 in FIG. 2). When there are a plurality of refining locations corresponding to “unchangeable”, the co-occurrence presence/absence check means 104 takes, for each location to be refined, the distance to the nearest refining location as the minimum co-occurrence distance. In doing so, the co-occurrence presence/absence check means 104 distinguishes the direction in which the refining expression lies as seen from the expression to be refined. That is, for each of the locations to be refined corresponding to “user ID”, “product ID”, and “store ID”, the co-occurrence presence/absence check means 104 detects a refining location corresponding to “unchangeable” at a minimum co-occurrence distance of 0 rows (same row) in column C10, which is to the left of column C11 containing the location to be refined. For the location to be refined corresponding to “order ID”, the co-occurrence presence/absence check means 104 detects a refining location corresponding to “unchangeable” at a minimum co-occurrence distance of 0 rows (same row) in column C12, which is to the right of column C11 containing the location to be refined.

The co-occurrence range setting means 105 determines the appropriate co-occurrence range on the basis of the minimum co-occurrence distances, calculated by the co-occurrence presence/absence check means 104, between the locations to be refined corresponding to “ID” and the refining locations corresponding to “unchangeable” (step S104 in FIG. 2). In doing so, the co-occurrence range setting means 105 takes into account not only the minimum co-occurrence distance but also the direction in which the refining location co-occurs as seen from the location to be refined. That is, in the input document D3, in the distribution of the minimum co-occurrence distance between the expression to be refined C7 “ID” and the refining expression C8 “unchangeable”, a minimum co-occurrence distance of 0 rows (same row) on the left side appears 3 times, and a minimum co-occurrence distance of 0 rows (same row) on the right side appears 1 time. Therefore, the co-occurrence range setting means 105 determines “0 rows, on the left” as the appropriate co-occurrence range between the expression to be refined C7 “ID” and the refining expression C8 “unchangeable”.

The refinement insufficiency detection means 106 detects insufficient-refinement locations on the basis of the appropriate co-occurrence range determined by the co-occurrence range setting means 105 and a refinement insufficiency detection rule (step S105 in FIG. 2). Here, the appropriate co-occurrence range also distinguishes the direction in which the refining location co-occurs relative to the location to be refined. The following describes the case where the rule is: “For each location to be refined, if no refining location co-occurs within the appropriate co-occurrence range, the location is determined to be insufficiently refined.” In this case, the refinement insufficiency detection means 106 determines that a location to be refined corresponding to “ID” is not insufficiently refined when a refining location corresponding to “unchangeable” co-occurs on its left side within 0 rows. In other words, the refinement insufficiency detection means 106 determines that the locations to be refined corresponding to “user ID”, “product ID”, and “store ID” in the input document D3 are not insufficient-refinement locations. On the other hand, the refinement insufficiency detection means 106 determines that the location to be refined corresponding to “order ID” in the input document D3 is an insufficient-refinement location.
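A direction-aware distribution can be sketched by keying each observation on a (distance, side) pair instead of on the distance alone. The observations below follow the description of the “item” column of the input document D3; the dictionary layout and function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (minimum co-occurrence distance in rows, side on which the refining location lies)
Key = Tuple[int, str]


def appropriate_directed_range(observed: Dict[str, Key]) -> Key:
    """Most frequent (distance, side) pair in the distribution."""
    return Counter(observed.values()).most_common(1)[0][0]


def insufficient_items(observed: Dict[str, Key], allowed: Key) -> List[str]:
    """Items whose nearest refining location is not at the allowed distance and side."""
    return [name for name, key in observed.items() if key != allowed]


observed = {
    "user ID": (0, "left"),
    "product ID": (0, "left"),
    "store ID": (0, "left"),
    "order ID": (0, "right"),
}
allowed = appropriate_directed_range(observed)
print(allowed, insufficient_items(observed, allowed))
```

Although every item has a refining location at distance 0, “order ID” is flagged because its refining location lies on the less common side.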

As described above, according to the document data processing apparatus of the present embodiment, insufficient-refinement locations can be detected while distinguishing the direction in which the refining location co-occurs relative to the location to be refined. Therefore, according to the document data processing apparatus of the present embodiment, in addition to the effects of the second embodiment, there is an effect that insufficient-refinement locations can be narrowed down appropriately without increasing the number of variations of the refining expression.
(Fourth embodiment)
Next, a fourth embodiment representing a concept common to the above-described embodiments and modifications will be described.

FIG. 10 is a block diagram showing an example of the configuration of the document data processing apparatus 11 in the fourth embodiment of the present invention.

The document data processing apparatus 11 includes a co-occurrence presence/absence check means 114, a co-occurrence range setting means 115, and a refinement insufficiency detection means 116.

First, the co-occurrence presence/absence check means 114 stores, for each character string relating to a predetermined subject in the input document (expression to be refined), the distribution of the distance to the nearest character string relating to a predetermined matter to be described in relation to that subject (refining expression), that is, the distribution of the minimum co-occurrence distance.

Next, on the basis of the distribution of the minimum co-occurrence distance stored by the co-occurrence presence/absence check means 114, the co-occurrence range setting means 115 determines the appropriate co-occurrence range between the expression to be refined and the refining expression in the input document.

Then, when no refining expression exists within the appropriate co-occurrence range determined by the co-occurrence range setting means 115 in the input document, the refinement insufficiency detection means 116 detects that the predetermined matter to be described in relation to the predetermined subject is not described within an appropriate range.

As described above, according to the document data processing apparatus 11 of the present embodiment, when an expression relating to a predetermined subject (expression to be refined) exists in a document, it is possible to determine whether an expression relating to a predetermined matter to be described in relation to that subject (refining expression) is described within an appropriate range.
(Fifth embodiment)
Next, a fifth embodiment representing a concept common to the above-described embodiments and modifications will be described.

FIG. 11 is a block diagram showing an example of the configuration of the document data processing apparatus 12 in the fifth embodiment of the present invention.

On the basis of a given distribution of the minimum co-occurrence distance, the document data processing apparatus 12 determines whether an expression relating to a predetermined matter to be described in relation to a predetermined subject (refining expression) is described within an appropriate range in the input document. The distribution of the minimum co-occurrence distance is the distribution of the distance from each character string relating to the predetermined subject in a document (expression to be refined) to the refining expression nearest to it.

The distribution of the minimum co-occurrence distance is output, for example, when a document data analysis apparatus 13 having the co-occurrence presence/absence check means 114 of the fourth embodiment analyzes a reference document. The reference document is a document that shares the expression to be refined and the refining expression with the input document.

The document data processing apparatus 12 includes a co-occurrence range setting means 125 and a refinement insufficiency detection means 126.

First, the co-occurrence range setting means 125 determines the appropriate co-occurrence range between the expression to be refined and the refining expression in the input document on the basis of the given distribution of the minimum co-occurrence distance. The appropriate co-occurrence range is, for example, the range in which the distance from the location to be refined is at least 0 and at most the distance having the highest appearance frequency in the distribution of the minimum co-occurrence distance. Alternatively, the appropriate co-occurrence range is, for example, the range in which the distance from the location to be refined is equal to the distance having the highest appearance frequency in the distribution of the minimum co-occurrence distance.
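The two example definitions of the appropriate co-occurrence range differ only in whether distances shorter than the most frequent one are accepted. A minimal sketch, in which the function and parameter names are hypothetical:

```python
def in_appropriate_range(distance: int, mode_distance: int, exact: bool = False) -> bool:
    """Check a distance against the appropriate co-occurrence range.

    exact=False: the range is 0 <= distance <= mode_distance.
    exact=True:  the range is distance == mode_distance only.
    """
    if exact:
        return distance == mode_distance
    return 0 <= distance <= mode_distance


# With a most frequent distance of 2 lines, a refining expression found
# 1 line away is inside the first kind of range but not the second.
print(in_appropriate_range(1, 2), in_appropriate_range(1, 2, exact=True))
```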

Next, when no refining expression exists within the appropriate co-occurrence range determined by the co-occurrence range setting means 125 in the input document, the refinement insufficiency detection means 126 detects that the predetermined matter to be described in relation to the predetermined subject is not described within an appropriate range.

As described above, according to the document data processing apparatus 12 of the present embodiment, when an expression relating to a predetermined subject (expression to be refined) exists in a document, it is possible to determine whether an expression relating to a predetermined matter to be described in relation to that subject (refining expression) is described within an appropriate range.

In addition, the document data processing apparatus 12 of the present embodiment can use a minimum co-occurrence distance distribution obtained by analyzing a reference document different from the input document. Of course, the reference document may be the same document as the input document. Therefore, according to the document data processing apparatus 12 of the present embodiment, the minimum co-occurrence distance distribution of a reference document that is better suited than the input document can be used to determine whether the refining expression is described within an appropriate range. Note that once the minimum co-occurrence distance distribution of the reference document has been created, it can be reused any number of times. This eliminates the need to calculate the minimum co-occurrence distance distribution and determine the appropriate co-occurrence range anew for each input document.
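Reusing a distribution obtained from a reference document implies that it can be stored and loaded again later. The sketch below serializes a distribution to JSON text and restores it; the storage format, the function names, and the sample distribution are assumptions, since the embodiment does not specify how the distribution is held.

```python
import json
from collections import Counter


def dump_distribution(distribution: Counter) -> str:
    """Serialize a distance -> frequency distribution as JSON text.
    JSON object keys must be strings, so distances are converted."""
    return json.dumps({str(d): n for d, n in distribution.items()}, sort_keys=True)


def load_distribution(text: str) -> Counter:
    """Restore a distribution produced by dump_distribution."""
    return Counter({int(d): n for d, n in json.loads(text).items()})


# Hypothetical distribution obtained once from a reference document.
reference = Counter([0, 0, 0, 5, 7])
stored = dump_distribution(reference)

# Later, for any number of input documents, the stored text is reloaded
# instead of recomputing the distribution.
reused = load_distribution(stored)
print(reused.most_common(1)[0][0])
```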

The document data processing apparatus in each embodiment described above may be realized by a dedicated apparatus, but can also be realized by a computer (information processing apparatus).
In this case, among the means shown in FIGS. 1, 10, and 11, at least the word extraction means 103, the co-occurrence presence/absence check means 104, the co-occurrence range setting means 105, the refinement insufficiency detection means 106, the co-occurrence presence/absence check means 114, the co-occurrence range setting means 115, the refinement insufficiency detection means 116, the co-occurrence range setting means 125, and the refinement insufficiency detection means 126 can be regarded as functional (processing) units (software modules) of a software program. An example of a hardware environment capable of realizing these functions (processing) will be described with reference to FIG. 12. However, the division into the means shown in these drawings is a configuration adopted for convenience of explanation, and various configurations can be assumed for implementation.
FIG. 12 is a diagram illustrating an exemplary configuration of an information processing apparatus 1000 (computer) that can implement the document data processing apparatus 10 (11, 12) according to the embodiments of the present invention.
The information processing apparatus 1000 illustrated in FIG. 12 is a general computer in which the following components are connected via a bus 3008 (communication line).
・ CPU (Central Processing Unit) 3001
・ ROM (Read Only Memory) 3002
・ RAM (Random Access Memory) 3003
・ Storage device 3004
・ Input/output user interface (Interface: hereinafter referred to as “I/F”) 3005
・ Communication I/F 3006 for external devices and networks
・ Drive device 3009 that reads information recorded on the recording medium 3010
In the hardware environment described above, the embodiments described above are achieved by the following procedure. That is, a computer program capable of realizing the functions of the block diagrams (FIGS. 1, 10, and 11) or the flowchart (FIG. 2) referred to in the description of the embodiments is supplied to the information processing apparatus 1000 shown in FIG. 12, for example by the drive device 3009 reading the recording medium 3010 on which the program is recorded. Downloading the computer program via the communication I/F 3006 is also included in the reading performed by the information processing apparatus 1000. Thereafter, the computer program is read, interpreted, and executed by the CPU 3001 of the hardware. The computer program supplied to the apparatus may be stored in a readable and writable volatile memory (RAM 3003) or in a nonvolatile storage device such as the storage device 3004.
In such a case, the software program (computer program) can be regarded as constituting the present invention. Furthermore, a computer-readable storage medium storing such a software program can also be understood as constituting the present invention.

As described above, the present invention has been described by way of example with the above-described embodiments and modifications thereof. However, the technical scope of the present invention is not limited to the scope described in the above-described embodiments and modifications thereof. It will be apparent to those skilled in the art that various modifications and improvements can be made to such embodiments. In such a case, new embodiments to which such changes or improvements are added can also be included in the technical scope of the present invention. This is clear from the matters described in the claims.

A part or all of the above-described embodiment can be described as in the following supplementary notes, but is not limited thereto.
(Appendix 1)
A document data processing apparatus comprising:
co-occurrence range setting means for determining, on the basis of a distribution, in a first document, of the shortest distance between an appearance position of a predetermined first expression relating to a predetermined subject and an appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range in which the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and
refinement insufficiency detection means for detecting that the predetermined matter to be described in relation to the subject is not described within an appropriate range when the second expression does not appear within the first range in the second document.
(Appendix 2)
The document data processing apparatus according to appendix 1, wherein the shortest distance is the distance to the appearance position of the second expression that is closest to the appearance position of the first expression among the appearance positions of the second expression before and after the appearance position of the first expression.
(Appendix 3)
The document data processing apparatus according to appendix 1 or appendix 2, further comprising co-occurrence presence / absence check means for recording the distribution in the first document.
(Appendix 4)
The document data processing apparatus according to appendix 1, further comprising:
word extraction means for detecting the appearance position of the first expression and the appearance position of the second expression in the second document; and
a detailed expression database that stores the first expression and the second expression in association with each other.
(Appendix 5)
The document data processing apparatus according to any one of appendix 1 to appendix 4, wherein the first range includes the shortest distance having the highest appearance frequency in the distribution or, when there are a plurality of shortest distances having the highest appearance frequency, the maximum value, the minimum value, or the average value of those shortest distances.
(Appendix 6)
The document data processing apparatus according to any one of appendix 1 to appendix 5, wherein, when a compound word including the first expression appears in the second document and the second expression does not appear in any of the first ranges corresponding to that compound word, the refinement insufficiency detection means detects that the matter to be described in relation to the subject limited by the compound word is not described within an appropriate range.
(Appendix 7)
The document data processing apparatus according to any one of appendix 1 to appendix 6, wherein the distribution further includes, in addition to information on the distance between the appearance position of the second expression and the appearance position of the first expression, information on the direction of the appearance position of the second expression as seen from the appearance position of the first expression, and
the co-occurrence range setting means determines the first range based on the distance and direction information included in the distribution.
(Appendix 8)
The document data processing apparatus according to any one of appendix 1 to appendix 7, wherein the insufficient details detection means detects that the predetermined matter to be described in relation to the subject is not described within an appropriate range when, in the second document, a predetermined third expression and the first expression appear within a predetermined second range and the second expression does not appear within the first range.
(Appendix 9)
The document data processing apparatus according to any one of appendix 1 to appendix 8, wherein the co-occurrence range setting means regards an appearance, in the second document, of a first synonym of the first expression or a second synonym of the second expression as an appearance of the first expression or the second expression, respectively.
(Appendix 10)
The document data processing apparatus according to any one of appendix 1 to appendix 9, further comprising output means for outputting each first expression in a manner that allows a user to identify whether the predetermined matter to be described in relation to the subject is described within an appropriate range for that first expression.
(Appendix 11)
A document data processing method comprising: determining, based on the distribution of the shortest distance between the appearance position, in a first document, of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range, which is the range of positions where the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and
detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
(Appendix 12)
A document data processing program that causes a computer to execute: co-occurrence range setting processing for determining, based on the distribution of the shortest distance between the appearance position, in a first document, of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range, which is the range of positions where the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and
insufficient details detection processing for detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
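The processing recited in the supplementary notes above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment. It assumes that appearance positions are integer indices (for example, sentence or token numbers), that the first range is applied as a symmetric distance threshold around the first expression, and that ties between equally frequent shortest distances are resolved by a caller-chosen policy. The function names `shortest_distances`, `first_range`, and `detect_insufficient`, and the `tie_policy` parameter, are hypothetical names introduced only for this illustration.

```python
from collections import Counter
from statistics import mean


def shortest_distances(first_positions, second_positions):
    """For each first-expression position, return the distance to the
    nearest second-expression position, looking both before and after."""
    if not second_positions:
        return []
    return [min(abs(p - q) for q in second_positions) for p in first_positions]


def first_range(distances, tie_policy="max"):
    """Derive the first range from the distribution of shortest distances.

    The most frequent shortest distance is used. If several distances share
    the highest frequency, the maximum, minimum, or average of them is
    returned, depending on tie_policy (an assumed parameter).
    """
    if not distances:
        raise ValueError("no co-occurrence was observed in the first document")
    counts = Counter(distances)
    top = max(counts.values())
    modes = [d for d, c in counts.items() if c == top]
    if tie_policy == "max":
        return max(modes)
    if tie_policy == "min":
        return min(modes)
    if tie_policy == "mean":
        return mean(modes)
    raise ValueError(f"unknown tie_policy: {tie_policy!r}")


def detect_insufficient(first_positions, second_positions, limit):
    """Return the first-expression positions in the second document for
    which no second expression appears within `limit` positions."""
    return [
        p for p in first_positions
        if not any(abs(p - q) <= limit for q in second_positions)
    ]


# First document: learn the range from observed co-occurrences.
learned = first_range(shortest_distances([3, 20, 41], [5, 22, 30, 44]))

# Second document: flag first expressions lacking a nearby second expression.
flagged = detect_insufficient([10, 50], [11, 70], learned)
print(learned, flagged)
```

In this sketch the shortest distances in the first document are 2, 2, and 3, so the learned range is 2, and only the first expression at position 50 of the second document is flagged. Treating the range as a single symmetric threshold is a simplification; a variant that also records direction would keep separate distributions for second expressions appearing before and after the first expression.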
The present invention has been described above using the above-described embodiment as an example. However, the present invention is not limited to the above-described embodiment. That is, various modes that can be understood by those skilled in the art can be applied to the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2014-124850, filed on June 18, 2014, the entire disclosure of which is incorporated herein.

DESCRIPTION OF SYMBOLS
10 Document data processing apparatus
101 Document input means
102 Detailed expression database
103 Word extraction means
104 Co-occurrence presence/absence check means
105 Co-occurrence range setting means
106 Refinement deficiency detection means
107 Output means
11 Document data processing apparatus
114 Co-occurrence presence/absence check means
115 Co-occurrence range setting means
116 Refinement deficiency detection means
12 Document data processing device
13 Document data analysis device
125 Co-occurrence range setting means
126 Detail refinement lack detection means

Claims

1. A document data processing apparatus comprising: co-occurrence range setting means for determining, based on the distribution of the shortest distance between the appearance position, in a first document, of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range, which is the range of positions where the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and
insufficient details detection means for detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
2. The document data processing apparatus according to claim 1, wherein the shortest distance is the distance between the appearance position of the first expression and the appearance position of the second expression closest to it, among the appearance positions of the second expression before and after the appearance position of the first expression.
3. The document data processing apparatus according to claim 1, further comprising co-occurrence presence/absence check means for recording the distribution in the first document.
4. The document data processing apparatus according to any one of claims 1 to 3, further comprising: word extraction means for detecting the appearance position of the first expression and the appearance position of the second expression in the second document; and
a detailed expression database that stores the first expression and the second expression in association with each other.
5. The document data processing apparatus according to claim 1, wherein the first range includes the shortest distance having the highest appearance frequency in the distribution or, when there are a plurality of shortest distances having the highest appearance frequency, the maximum value, the minimum value, or the average value of those shortest distances.
6. The document data processing apparatus according to any one of the preceding claims, wherein, when a compound word including the first expression appears in the second document and the second expression does not appear in any of the first ranges corresponding to the compound word, the insufficient details detection means detects that the matter to be described in relation to the subject limited by the compound word is not described within an appropriate range.
7. The document data processing apparatus according to claim 1, wherein the distribution further includes, in addition to information on the distance between the appearance position of the second expression and the appearance position of the first expression, information on the direction of the appearance position of the second expression as seen from the appearance position of the first expression, and
the co-occurrence range setting means determines the first range based on the distance and direction information included in the distribution.
8. The document data processing apparatus according to claim 1, wherein the insufficient details detection means detects that the predetermined matter to be described in relation to the subject is not described within an appropriate range when, in the second document, a predetermined third expression and the first expression appear within a predetermined second range and the second expression does not appear within the first range.
9. A document data processing method comprising: determining, based on the distribution of the shortest distance between the appearance position, in a first document, of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range, which is the range of positions where the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and
detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.
10. A recording medium recording a computer-readable program that causes a computer to execute: co-occurrence range setting processing for determining, based on the distribution of the shortest distance between the appearance position, in a first document, of a predetermined first expression relating to a predetermined subject and the appearance position of a predetermined second expression relating to a predetermined matter to be described in relation to the subject, a first range, which is the range of positions where the second expression should appear relative to the appearance position of the first expression in a second document that is the same document as the first document or a different document; and
insufficient details detection processing for detecting, when the second expression does not appear within the first range in the second document, that the predetermined matter to be described in relation to the subject is not described within an appropriate range.