WO2015173894A1 - Document analysis system, control method for document analysis system, and control program for document analysis system - Google Patents

Document analysis system, control method for document analysis system, and control program for document analysis system

Info

Publication number
WO2015173894A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
unit
keyword
analysis system
sentence
Prior art date
Application number
PCT/JP2014/062743
Other languages
French (fr)
Japanese (ja)
Inventor
守本 正宏
秀樹 武田
和巳 蓮子
Original Assignee
株式会社Ubic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Ubic
Priority to PCT/JP2014/062743 (WO2015173894A1)
Priority to JP2015510547A (JP5815911B1)
Publication of WO2015173894A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to a document analysis system for analyzing documents.
  • Patent Document 1 discloses a document separation system that analyzes digital documents collected for submission as evidence in a lawsuit and sorts them to facilitate their use in the lawsuit.
  • Patent Document 1: JP2013-214152A (published October 17, 2013)
  • The score is calculated based on the evaluation values of the related terms included in the extracted document and the number of those related terms.
  • The present invention has been made in view of the above problem, and its object is to provide a document analysis system and the like that can accurately calculate a score that correctly reflects the meaning of each sentence.
  • A document analysis system according to one aspect of the present invention is a document analysis system for analyzing a document, and includes a generation unit that generates, for each sentence, a keyword vector indicating whether or not a predetermined keyword is included in a sentence included in the document.
  • It further includes a multiplication unit that obtains a correlation vector for each sentence by multiplying the keyword vector generated by the generation unit by a correlation matrix that indicates the correlation between the predetermined keyword and another keyword different from the predetermined keyword.
  • It further includes a calculation unit that calculates, based on the sum of all the correlation vectors obtained by the multiplication unit, a score indicating the strength with which a classification code, which indicates the degree of relevance between the document and a predetermined case, is associated with the document.
  • The keyword vector is, for example, a vector in which each element takes a value of “0” or “1”, thereby indicating whether or not the predetermined keyword associated with that element is included in the document.
  • The correlation matrix is a square matrix. For example, when the keyword “price” appears in a sentence, each element of the matrix represents how readily another keyword (for example, “adjustment”) appears in that sentence together with that keyword (that is, their correlation).
  • Because the document analysis system generates a keyword vector for each sentence, the keyword vector has a structure (representation) that can correctly reflect the meaning of each sentence. The score can therefore be calculated accurately, so that documents of different natures differ significantly in score.
  • The calculation unit may calculate the score by calculating the inner product of the summed value and a weight vector indicating a weight for each predetermined keyword.
  • The document analysis system may further include an extraction unit that extracts the sentence whose keyword vector indicates that it contains the predetermined keyword the most in the document.
  • The document analysis system may further include a summarization unit that generates a summary of a document by enumerating the sentences whose keyword vectors indicate that the predetermined keyword is included in the document.
  • The document analysis system may further include a phase identification unit that identifies, based on the score calculated by the calculation unit, a phase that classifies a predetermined action leading to the predetermined case according to the progress of that action.
  • The document analysis system may further include a change estimation unit that estimates a change in the phase identified by the phase identification unit based on a temporal transition of the phase.
  • The document analysis system may further include a code assigning unit that assigns a classification code to the document based on the score calculated by the calculation unit.
  • A control method for a document analysis system according to one aspect of the present invention is a control method for a document analysis system that analyzes a document, and includes a generation step of generating, for each sentence, a keyword vector indicating whether or not a predetermined keyword is included in a sentence included in the document.
  • It further includes a multiplication step of obtaining a correlation vector for each sentence by multiplying the keyword vector generated in the generation step by a correlation matrix that indicates the correlation between the predetermined keyword and another keyword different from the predetermined keyword.
  • It further includes a calculation step of calculating, based on the sum of all the correlation vectors obtained in the multiplication step, a score indicating the strength with which a classification code, which indicates the degree of relevance between the document and a predetermined case, is associated with the document.
  • The control method of the document analysis system has the same effect as the document analysis system.
  • A control program for a document analysis system according to one aspect of the present invention is a control program for a document analysis system that analyzes a document, and causes a computer to realize a generation function of generating, for each sentence, a keyword vector indicating whether or not a predetermined keyword is included in a sentence included in the document.
  • It further causes the computer to realize a multiplication function of obtaining a correlation vector for each sentence by multiplying the keyword vector generated by the generation function by a correlation matrix that indicates the correlation between the predetermined keyword and another keyword different from the predetermined keyword.
  • It further causes the computer to realize a calculation function of calculating, based on the sum of all the correlation vectors, a score indicating the strength with which a classification code, which indicates the degree of relevance between the document and a predetermined case, is associated with the document.
  • The document analysis system according to each aspect of the present invention may be realized by a computer.
  • In that case, a control program that realizes the document analysis system on the computer by causing the computer to operate as each unit included in the document analysis system, and a computer-readable recording medium storing the control program, also fall within the scope of the present invention.
  • The control program of the document analysis system has the same effect as the document analysis system.
  • Because the document analysis system, the control method for the document analysis system, and the control program for the document analysis system according to one aspect of the present invention generate a keyword vector for each sentence, the keyword vector has a structure (representation) that can correctly reflect the meaning of each sentence. As a result, the score can be calculated accurately, so that two documents of different natures differ significantly in score.
  • FIG. 5 is a flowchart illustrating an example of predictive coding according to a survey type in the example of the process illustrated in FIG. 4.
  • Embodiment 1 A first embodiment (Embodiment 1) of the present invention will be described with reference to FIGS.
  • FIG. 1 is a block diagram showing a main configuration of a document analysis system 100 according to the first embodiment of the present invention.
  • The document analysis system 100 is a system for analyzing a document (document analysis system).
  • The document analysis system 100 only needs to be a device that can execute the processing described below, and can be realized using any computer.
  • The document analysis system 100 includes a reception unit 21, a control unit 10 (acquisition unit 11, generation unit 12, multiplication unit 13, calculation unit 14, extraction unit 15, summarization unit 16, phase identification unit 17, and change estimation unit 18), and a display unit 50.
  • The reception unit 21 receives the document data 1 from an external computer by communicating over a communication network according to a predetermined communication method.
  • The reception unit 21 only needs to have the functions essential for communicating with that computer; the communication line, communication method, communication medium, and the like are not limited.
  • The reception unit 21 can be configured by a device such as an Ethernet (registered trademark) adapter, for example.
  • The reception unit 21 can also use a communication method or communication medium such as IEEE 802.11 wireless communication or Bluetooth (registered trademark).
  • The control unit 10 comprehensively controls the various functions of the document analysis system 100.
  • The control unit 10 includes an acquisition unit 11, a generation unit 12, a multiplication unit 13, a calculation unit 14, an extraction unit 15, a summarization unit 16, a phase identification unit 17, and a change estimation unit 18.
  • The acquisition unit 11 acquires the document data 1 received by the reception unit 21 and outputs the document data 1 to the generation unit 12.
  • The generation unit 12 generates, for each sentence, a keyword vector 2 indicating whether or not a predetermined keyword (morpheme) is included in the sentence included in the document data (document) 1.
  • The keyword vector 2 is a vector in which each element takes a value of “0” or “1”, thereby indicating whether or not the predetermined keyword associated with that element is included in the document data 1.
  • For example, when the keyword “price” is included in a sentence, the generation unit 12 changes the element of the keyword vector 2 corresponding to “price” from “0” to “1”.
  • The generation unit 12 outputs the generated keyword vector 2 to the multiplication unit 13, the extraction unit 15, the summarization unit 16, and the phase identification unit 17.
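The per-sentence keyword vector described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the keyword list, the regex-based sentence splitter, and the function names are all hypothetical, and Japanese text would need a morphological analyzer rather than the simple English tokenization assumed here.

```python
import re

# Hypothetical keyword list; the patent only names "price" and "adjustment" as examples.
KEYWORDS = ["price", "adjustment", "meeting"]


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption;
    real documents need a proper sentence segmenter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keyword_vector(sentence, keywords=KEYWORDS):
    """Return a 0/1 list: element i is 1 if keywords[i] occurs in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return [1 if k in tokens else 0 for k in keywords]


text = "We discussed the price. The adjustment of the price was agreed."
vectors = [keyword_vector(s) for s in split_sentences(text)]
```

Each sentence yields its own vector, so two documents that mention the same keywords in different sentence groupings produce different sets of vectors.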
  • The multiplication unit 13 obtains a correlation vector 3 for each sentence by multiplying the keyword vector 2 generated by the generation unit 12 by a correlation matrix that indicates the correlation between the predetermined keyword and another keyword different from the predetermined keyword.
  • The correlation matrix is a square matrix. For example, when the keyword “price” appears in a sentence, each element of the matrix represents the likelihood (that is, the correlation) that another keyword (for example, “adjustment”) appears in that sentence together with that keyword.
  • The multiplication unit 13 outputs the correlation vector 3 to the calculation unit 14.
  • The correlation matrix is optimized in advance using a learning data set consisting of a predetermined number of predetermined document data. For example, each element holds the value obtained by normalizing, to the range 0 to 1, the number of times another keyword appears when the keyword “price” appears in a sentence (that is, a maximum likelihood estimate); the elements in each column of the correlation matrix therefore sum to 1. The document analysis system 100 can thereby calculate the correlation vector 3 optimally.
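One way to read the column-normalized correlation matrix is as co-occurrence counts divided by their column totals. The sketch below makes that assumption explicit; counting a keyword's co-occurrence with itself (the diagonal) is also an assumption, since the text does not say how the diagonal is treated. The function names and training data are hypothetical.

```python
def correlation_matrix(keyword_vectors):
    """Estimate C, where C[i][j] is the share of keyword i among all keyword
    occurrences in the sentences that contain keyword j."""
    n = len(keyword_vectors[0])
    counts = [[0.0] * n for _ in range(n)]
    for v in keyword_vectors:
        for j in range(n):
            if v[j]:                      # sentence contains keyword j
                for i in range(n):
                    counts[i][j] += v[i]  # count keyword i alongside j (i == j included: an assumption)
    for j in range(n):
        col = sum(counts[i][j] for i in range(n))
        if col:                           # leave all-zero columns untouched
            for i in range(n):
                counts[i][j] /= col       # each non-empty column now sums to 1
    return counts


def correlation_vector(C, s):
    """Multiply the matrix C by the keyword vector s (treated as a column vector)."""
    return [sum(C[i][j] * s[j] for j in range(len(s))) for i in range(len(C))]


# Hypothetical training sentences over the keywords ["price", "adjustment"].
training = [[1, 1], [1, 0], [1, 1]]
C = correlation_matrix(training)
v = correlation_vector(C, [1, 0])
```

With this reading, a sentence that mentions only “price” still receives a non-zero component for “adjustment”, which is how the correlation vector carries information beyond the literal keyword hits.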
  • Based on the sum of all the correlation vectors 3 obtained by the multiplication unit 13, the calculation unit 14 calculates, for each document data 1, a score 4 indicating the strength with which a classification code, which indicates the degree of association between the document data 1 and a predetermined case, is associated with the document data 1. More specifically, as shown in the following [Equation 1], the calculation unit 14 calculates the score 4 for each document by calculating the inner product of the sum (a column vector) and a weight vector W (a row vector) indicating the weight for each predetermined keyword.
  • C represents the correlation matrix.
  • s_s represents the s-th keyword vector 2.
  • TFnorm (the above summed value) is calculated as shown in [Equation 2] below.
  • TF_i represents the appearance frequency (term frequency) of the i-th keyword.
  • s_js represents the j-th element of the s-th keyword vector 2.
  • The calculation unit 14 calculates the above score 4 for each document by calculating the following [Equation 3].
  • w_i is the i-th element of the weight vector W.
  • The calculation unit 14 outputs the calculated score 4 to the phase identification unit 17, the change estimation unit 18, and the display unit 50.
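The equations themselves are not reproduced in this text, so the following sketch relies only on the prose: sum the per-sentence correlation vectors C·s_s over all sentences, then take the inner product with the weight vector W. Any normalization applied in [Equation 2] is not modeled, and the matrix, weights, and function name are hypothetical values chosen for illustration.

```python
def document_score(keyword_vectors, C, w):
    """Inner product of the weight vector w with the sum over sentences of C·s."""
    n = len(w)
    total = [0.0] * n
    for s in keyword_vectors:
        for i in range(n):
            # i-th component of the correlation vector C·s for this sentence
            total[i] += sum(C[i][j] * s[j] for j in range(n))
    return sum(w[i] * total[i] for i in range(n))


# Hypothetical correlation matrix (columns sum to 1) and keyword weights.
C = [[0.6, 0.5],
     [0.4, 0.5]]
w = [2.0, 1.0]
sentences = [[1, 0], [1, 1]]  # keyword vectors for a two-sentence document
score = document_score(sentences, C, w)
```

Because the sum runs over sentences rather than over the document as a whole, a keyword repeated across several sentences contributes once per sentence.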
  • The extraction unit 15 extracts the sentence (most frequent sentence 5) whose keyword vector 2 indicates that it contains the predetermined keyword the most in the document data 1. For example, in the sentence “The price of product a sold by company A is higher than the price of product b sold by company B, so we adjusted the price of both products.”, the keyword “price” appears three times.
  • When that count is the highest among the sentences in the document data 1, the extraction unit 15 outputs the sentence to the display unit 50 as the most frequent sentence 5.
  • The predetermined keyword (in the above example, the keyword “price”) may be given to the document analysis system 100 via a predetermined input device.
  • The summarization unit 16 generates a summary of the document data 1 by enumerating the sentences whose keyword vectors 2 indicate that the predetermined keyword is included in the document data 1. For example, the summarization unit 16 generates the summary by listing the sentences in the document data 1 that include the keyword “price”, and outputs summary information 6 including information on the summary to the display unit 50. As described above, the predetermined keyword may be given to the document analysis system 100 via a predetermined input device.
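The extraction and summarization steps can be sketched together. The text's example counts occurrences of “price” within a sentence, so this sketch counts occurrences directly rather than reading them from the 0/1 keyword vector; that choice, the whole-word matching, and the function names are assumptions.

```python
import re


def count_keyword(sentence, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in sentence."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return len(re.findall(pattern, sentence.lower()))


def most_frequent_sentence(sentences, keyword):
    """Return the sentence containing the keyword most often (first one on ties)."""
    return max(sentences, key=lambda s: count_keyword(s, keyword))


def summarize(sentences, keyword):
    """List every sentence that contains the keyword at least once."""
    return [s for s in sentences if count_keyword(s, keyword) > 0]


doc = [
    "The meeting is on Monday.",
    "The price of product a is higher than the price of product b, so we adjusted the price.",
    "Please confirm the price.",
]
top = most_frequent_sentence(doc, "price")
summary = summarize(doc, "price")
```

Here `top` is the second sentence and `summary` holds the second and third sentences, in document order.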
  • The phase identification unit 17 identifies, based on the score 4 calculated by the calculation unit 14, the phase that classifies a predetermined action (an action performed by an individual or by an organization composed of a plurality of persons) leading to a predetermined case (for example, a lawsuit, a fraud investigation, collusion, information leakage, or fictitious billing) according to the progress of that action.
  • The predetermined case may be given to the document analysis system 100 via a predetermined input device.
  • The phase is an index indicating each stage through which the predetermined action progresses (that is, it is classified according to the progress of the predetermined action). For example, if “collusion” is specified as the predetermined case, the phases can be assumed to include “Relationship Building” (a phase of building relationships with customers and competitors), “Preparation” (a phase of exchanging information about competition with third parties), and “Competition” (a phase in which a price is presented to a customer, feedback is obtained, and the feedback is communicated to competitors).
  • For example, when the score 4 falls within a predetermined value range, the phase identification unit 17 may identify the phase associated with that value range and output phase information 7 including information on the phase to the change estimation unit 18.
  • Alternatively, the phase identification unit 17 may identify the phase that maximizes the likelihood (the maximum likelihood phase) and output phase information 7 including information on that phase to the change estimation unit 18. Here, the likelihood is the value calculated as the score for each phase of a model (observation process, likelihood function) representing the process by which a predetermined actor (an individual or an organization composed of a plurality of persons) arrives at the predetermined action.
  • Suppose instead that the keyword vector 2 is input from the generation unit 12 and indicates that predetermined keywords (for example, “price” and “adjustment”) are included.
  • In that case, the phase identification unit 17 may identify the phase corresponding to those keywords (for example, “Competition” when the keywords “price” and “adjustment” are included) and output phase information 7 including information on the phase to the change estimation unit 18.
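Two of the identification strategies above (a score falling in a value range, and the presence of particular keywords) can be sketched as simple lookups. The phase names and the “price” plus “adjustment” rule come from the text; the numeric ranges and function names are hypothetical, and the maximum likelihood variant is not modeled.

```python
# Hypothetical score ranges [low, high) for each phase; the text gives no numbers.
PHASE_RANGES = [
    (0.0, 1.0, "Relationship Building"),
    (1.0, 3.0, "Preparation"),
    (3.0, float("inf"), "Competition"),
]

# Keyword rule taken from the text's example.
PHASE_KEYWORDS = {"Competition": {"price", "adjustment"}}


def phase_from_score(score):
    """Return the phase whose value range contains the score, or None."""
    for low, high, phase in PHASE_RANGES:
        if low <= score < high:
            return phase
    return None


def phase_from_keywords(keywords_present):
    """Return the first phase whose required keywords are all present, or None."""
    for phase, required in PHASE_KEYWORDS.items():
        if required <= set(keywords_present):
            return phase
    return None
```

For instance, `phase_from_keywords({"price", "adjustment", "meeting"})` selects “Competition”, while a sentence containing only “price” matches no rule.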
  • The change estimation unit 18 estimates the change of the phase identified by the phase identification unit 17 based on the temporal transition of the phase. For example, suppose it is known (from time-series information indicating the order) that the phases develop in a series of transitions from “Relationship Building” through “Preparation” to “Competition”. When the phase identification unit 17 identifies the current phase as “Preparation”, the change estimation unit 18 estimates that the phase will next develop into “Competition”.
  • The change estimation unit 18 outputs change information 8 including information on the change of the phase to the display unit 50.
  • The change estimation unit 18 may estimate the phase change by calculating the correlation between the moving average of the score 4 calculated by the calculation unit 14 and a predetermined pattern.
  • The predetermined pattern may be a pattern showing how a score calculated in another case, different from the predetermined case (for example, a lawsuit, a fraud investigation, collusion, information leakage, or fictitious billing), changed over time.
  • The change estimation unit 18 calculates the correlation between the moving average of the score 4 for the document data 1 analyzed this time and the predetermined pattern.
  • The change estimation unit 18 calculates the degree of coincidence (correlation) between the two while shifting the elapsed time and/or the score. If the correlation between the two becomes high, the change estimation unit 18 estimates that the current score will go on to take values that follow the predetermined pattern.
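The pattern-matching estimate can be sketched as a sliding correlation. The text does not name a correlation measure, so Pearson correlation is swapped in here as an assumption; it is insensitive to a constant offset in the score, so only the time shift is searched. The window size, threshold, function names, and data are hypothetical.

```python
from statistics import mean


def moving_average(xs, window):
    """Simple moving average over a fixed-size window."""
    return [mean(xs[i:i + window]) for i in range(len(xs) - window + 1)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def best_alignment(current, pattern):
    """Slide current along pattern; return (offset, correlation) of the best match."""
    n = len(current)
    best = (0, -1.0)
    for off in range(len(pattern) - n + 1):
        r = pearson(current, pattern[off:off + n])
        if r > best[1]:
            best = (off, r)
    return best


def forecast(current, pattern, threshold=0.9):
    """If the match is strong enough, return the pattern's next value, else None."""
    off, r = best_alignment(current, pattern)
    nxt = off + len(current)
    if r >= threshold and nxt < len(pattern):
        return pattern[nxt]
    return None


scores = [0.0, 2.0, 2.0, 6.0, 8.0]                 # hypothetical scores over time
pattern = [1.0, 1.0, 2.0, 4.0, 7.0, 9.0, 9.0]      # hypothetical pattern from another case
smoothed = moving_average(scores, 2)
offset, corr = best_alignment(smoothed, pattern)
next_value = forecast(smoothed, pattern)
```

In this example the smoothed scores line up with the pattern one step in, and the forecast is the pattern value that follows the matched segment.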
  • The display unit 50 is a display device (for example, a liquid crystal display) capable of displaying the score 4 input from the calculation unit 14, the most frequent sentence 5 input from the extraction unit 15, the summary information 6 input from the summarization unit 16, and the change information 8 input from the change estimation unit 18.
  • Although FIG. 1 shows a configuration example in which the document analysis system 100 includes the display unit 50, the display unit 50 only needs to be able to present each of the above items of information to the user. For example, it may be an external display device communicably connected to the document analysis system 100.
  • FIG. 2 is a flowchart illustrating an example of processing executed by the document analysis system 100.
  • In the following description, a parenthesized “... step” denotes a step included in the control method of the document analysis system 100 (control method of the document analysis system).
  • The acquisition unit 11 acquires the document data 1 (step 1; hereinafter “step” is abbreviated as “S”).
  • The generation unit 12 generates, for each sentence, a keyword vector 2 indicating whether or not a predetermined keyword is included in the sentence included in the document data 1 (S2, generation step).
  • The multiplication unit 13 obtains a correlation vector 3 for each sentence by multiplying the keyword vector 2 generated in S2 by a correlation matrix indicating the correlation between the predetermined keyword and another keyword different from the predetermined keyword (S3, multiplication step).
  • Based on the sum of all the correlation vectors 3 obtained in S3, the calculation unit 14 calculates the score 4 indicating the strength with which the classification code, which indicates the degree of association between the document data 1 and the predetermined case, is associated with the document (S4, calculation step).
  • In addition to the processing described above with reference to FIG. 2, the control method may optionally include the processing executed by the acquisition unit 11, the extraction unit 15, the summarization unit 16, the phase identification unit 17, and/or the change estimation unit 18.
  • Embodiment 2 A second embodiment (Embodiment 2) of the present invention will be described with reference to FIGS. In the present embodiment, only the configuration added to the first embodiment and the configuration different from the configuration of the first embodiment will be described. That is, all the configurations described in the first embodiment can be included in the second embodiment. Moreover, the definition of the term described in Embodiment 1 is the same also in Embodiment 2.
  • FIG. 3 is a block diagram showing a main configuration of the document analysis system 101 according to Embodiment 2 of the present invention.
  • The document analysis system 101 is a system that acquires information recorded in a predetermined computer or server and analyzes document information, consisting of a plurality of documents, included in the acquired information.
  • In addition to the control unit 10 described in the first embodiment (acquisition unit 11, generation unit 12, multiplication unit 13, calculation unit 14, extraction unit 15, summarization unit 16, phase identification unit 17, and change estimation unit 18), the document analysis system 101 further includes a data storage unit 108 (digital information storage area 102, survey basic database 103, keyword database 104, related term database 105, score calculation database 106, and report creation database 107), a database management unit 109, an information extraction unit 24, a search unit 30, a document analysis unit 118, a survey category input reception unit 20, a survey type determination unit 22, a presentation unit 130, a category selection unit 26, a first automatic sorting unit 201, a second automatic sorting unit 301, a classification code reception/assignment unit 131, and a third automatic sorting unit 401.
  • The document analysis system 101 may further include a trend information generation unit 124, a quality inspection unit 501, a learning unit 601, a report creation unit 701, a lawyer review reception unit 133, a language determination unit 120, and a translation unit 122.
  • The survey category input reception unit 20 receives a category input by the user. When a category is input, the survey category input reception unit 20 outputs the category to the survey type determination unit 22 and the category selection unit 26.
  • The category is an index by which each of a plurality of documents can be classified.
  • The category may represent the type of lawsuit or fraud investigation, that is, the nature of the case to which the lawsuit or fraud investigation relates (for example, antitrust, patents, the Foreign Corrupt Practices Act (FCPA), product liability (PL), information leakage, or fictitious billing).
  • The category may be an attribute of the document information, that is, the nature of the information included in the document information (for example, competitor information, price, estimate sheet, price list, or product).
  • The category may be a phase classified according to the progress of a predetermined action that leads to a lawsuit or fraud investigation.
  • The survey type determination unit 22 determines the category to be surveyed based on the category received by the survey category input reception unit 20, and extracts the necessary information types from the survey basic database 103. For example, when the document information consists of e-mails, presentation materials, spreadsheets, meeting materials, contracts, organization charts, or business plans, the survey type determination unit 22 treats each of these as a necessary information type and outputs it to the information extraction unit 24. The document analysis system 101 can thereby extract the necessary information types.
  • The information extraction unit 24 extracts a plurality of documents from the document information. Specifically, using the information input from the survey type determination unit 22 (for example, e-mails, presentation materials, spreadsheets, meeting materials, contracts, organization charts, or business plans), the information extraction unit 24 extracts the keywords and/or sentences included in that information as information related to the lawsuit or fraud investigation, and stores the extracted results in the survey basic database 103. The information extraction unit 24 also outputs the extracted results to the control unit 10 as document data 1. The document analysis system 101 can thereby identify information related to the lawsuit or fraud investigation and hold it in the database.
  • The category selection unit 26 selects the category and outputs the selected category to the control unit 10. When a plurality of categories are assumed, the category selection unit 26 can select the categories one at a time in sequence.
  • When a category is input via the survey category input reception unit 20, the category selection unit 26 can select the input category. The document analysis system 101 can thereby reliably select the category input by the user.
  • The presentation unit 130 presents the score 4 calculated by the control unit 10 (calculation unit 14) to the user in a form the user can grasp.
  • The presentation unit 130 can present the score 4 to the user by displaying it on the display unit 50 (not shown in FIG. 3).
  • The document analysis system 101 thereby enables the user to grasp the score 4.
  • The search unit 30 searches a plurality of documents for keywords and/or sentences included in the document information (document data 1). The document analysis system 101 can thereby extract keywords and/or sentences included in the document information.
  • When the search unit 30 searches for a keyword stored in the keyword database 104 and the information extraction unit 24 extracts a document including that keyword from the document information, the first automatic sorting unit 201 automatically assigns a specific classification code to the extracted document based on keyword correspondence information.
  • The second automatic sorting unit 301 extracts from the document information a document including related terms stored in the related term database 105, and calculates a score based on the evaluation values of the related terms included in the extracted document and the number of those related terms.
  • It then automatically assigns a predetermined classification code, based on the score and related term correspondence information, to a document that includes the related terms and whose score exceeds a certain value.
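The related-term scoring used by the second automatic sorting unit can be sketched as a weighted count with a threshold. The text says only that the score depends on the evaluation values and the number of related terms, so summing evaluation value times occurrence count is an assumption, as are the term list, the threshold, the code label, and the function names.

```python
import re

# Hypothetical related terms with evaluation values, and a hypothetical threshold.
RELATED_TERMS = {"rebate": 2.0, "competitor": 1.5}
THRESHOLD = 3.0


def related_term_score(text, terms=RELATED_TERMS):
    """Sum of evaluation value x occurrence count over the related terms (an assumption)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(value * tokens.count(term) for term, value in terms.items())


def assign_code(text, code="RELEVANT", threshold=THRESHOLD):
    """Return the classification code if the score exceeds the threshold, else None."""
    return code if related_term_score(text) > threshold else None


coded = assign_code("The competitor offered a rebate.")
uncoded = assign_code("Lunch is on Friday.")
```

Unlike the per-sentence keyword vectors of Embodiment 1, this score treats the document as a single bag of terms, which is the limitation the per-sentence approach is meant to address.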
  • For a plurality of documents that are extracted from the document information and to which no classification code has been assigned, the classification code reception/assignment unit 131 accepts a classification code given by the user based on relevance to the lawsuit, and assigns that classification code to the documents.
  • The document analysis unit 118 analyzes the documents to which classification codes have been assigned by the classification code reception/assignment unit 131. In addition to the documents for which classification codes were received from the user based on relevance to the lawsuit, the document analysis unit 118 may also analyze the documents to which the first automatic sorting unit 201 and the second automatic sorting unit 301 automatically assigned classification codes based on keywords, related terms, and scores, and may obtain a comprehensive analysis result that integrates the analysis of the user-coded documents with that of the automatically coded documents. In this case, the third automatic sorting unit 401 can automatically assign classification codes based on the comprehensive analysis result.
  • Classification and investigation work can proceed in various ways, such as automatic classification and investigation by word search, acceptance of classification and investigation by users, automatic classification and investigation using scores, automatic classification and investigation through the learning process, and automatic classification and investigation through quality assurance.
  • The document analysis unit 118 may analyze the plurality of documents assigned classification codes together with a progress history indicating in what order and in what combination the various classification and investigation operations proceeded, and the report creation unit 701 (described later) may report the analysis result.
  • The third automatic sorting unit 401 automatically assigns classification codes to a plurality of documents extracted from the document information, based on the result of the document analysis unit 118 analyzing the documents to which classification codes were assigned by the classification code reception/assignment unit 131.
  • For the analysis by the document analysis unit 118, the trend information generation unit 124 generates trend information indicating the degree to which each document is similar to documents assigned a classification code, based on the type, number of occurrences, and evaluation value of the words included in each document.
  • The quality inspection unit 501 compares the classification code received by the classification code reception/assignment unit 131 with the classification code that the document analysis unit 118 assigns according to the trend information, and verifies the validity of the classification code received by the classification code reception/assignment unit 131.
  • the learning unit 601 learns the weighting of each keyword or related term based on the result of sorting the document.
  • the learning unit 601 learns the weighting of each keyword or related term using Expression (3) based on the first to fourth processing results (described later).
  • the learning unit 601 may reflect the learning result on the keyword database 104, the related term database 105, or the score calculation database 106.
  • the report creation unit 701 outputs an optimal investigation report according to the type of litigation or the investigation type of the fraud investigation based on the result of separating the documents.
  • the lawsuit includes, for example, antitrust, patent, foreign bribery prohibition (FCPA), product liability (PL), and the like.
  • the fraud investigation includes, for example, information leakage and fictitious billing.
  • the lawyer review reception unit 133 receives reviews of the chief attorney or the lead patent attorney in order to improve the quality of the classification survey and the report and clarify the responsibility of the classification survey and the report.
  • the language determination unit 120 determines the language type of the extracted document.
  • The translation unit 122 translates the extracted document, either upon designation by the user or automatically.
  • The unit of language determination in the language determination unit 120 may be smaller than one sentence, so that a single sentence that mixes several languages can also be handled.
  • one or both of predictive coding and character coding may be used for language determination.
  • a process of excluding an HTML (Hyper Text Markup Language) header or the like from translation targets may be performed.
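  • As a minimal sketch of language determination on units smaller than one sentence, the following Python splits a sentence into single-script runs and labels each run. The Unicode ranges, the two-language restriction (Japanese and English), and the function name `language_segments` are assumptions made for illustration; the text states only that character coding may be used for language determination.

```python
import re

# Assumption: Japanese is recognised by hiragana, katakana and CJK ideograph
# ranges, and English by runs of Latin letters. A real system would need more
# languages and a more careful notion of "segment".
_JA = r"[\u3040-\u30ff\u4e00-\u9fff]+"
_EN = r"[A-Za-z][A-Za-z' -]*"
_SEGMENT = re.compile(f"(?P<ja>{_JA})|(?P<en>{_EN})")


def language_segments(sentence):
    """Return (language, text) pairs for each single-script run in a sentence."""
    return [(m.lastgroup, m.group().strip()) for m in _SEGMENT.finditer(sentence)]


if __name__ == "__main__":
    # A single sentence that mixes Japanese and English.
    for language, text in language_segments("契約書を review してください"):
        print(language, text)
```

  Each labelled run could then be passed to the translation unit separately, so that a mixed-language sentence is not treated as belonging to one language only.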
  • the data storage unit 108 stores digital information acquired from a plurality of computers or servers in the digital information storage area 102 for use in analysis of lawsuits or fraud investigations.
  • the data storage unit 108 includes a survey basic database 103, a keyword database 104, a related term database 105, a score calculation database 106, and a report creation database 107.
  • The data storage unit 108 may be a recording medium included in the document analysis system 101, or an external recording medium communicably connected to the document analysis system 101.
  • The survey basic database 103 holds, for example, a case attribute indicating which category a case belongs to, among litigation matters including antitrust, patents, foreign bribery prohibition (Foreign Corrupt Practices Act, FCPA), and product liability (Products Liability, PL), and fraud investigations including information leakage and fictitious claims, together with the company name, the person in charge, the custodian, and the structure of the investigation or classification input screen.
  • The keyword database 104 holds a specific classification code of a document included in the acquired digital information, a keyword closely related to that classification code, and keyword correspondence information indicating the correspondence between the specific classification code and the keyword.
  • The related term database 105 holds a predetermined classification code, related terms consisting of words that appear with high frequency in documents to which the predetermined classification code is assigned, and related term correspondence information indicating the correspondence between the predetermined classification code and the related terms.
  • the score calculation database 106 holds weights of words included in the document in order to calculate a score indicating the strength of connection between the document and the classification code.
  • the report creation database 107 holds a report format determined according to the category, custodian, and contents of the classification work.
  • the database management unit 109 manages the update of data contents of the survey basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107.
  • The database management unit 109 may be connected to the information storage device 902 via a dedicated connection line or the Internet line 901. In this case, the database management unit 109 may update the data contents of the survey basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107 based on the contents of data stored in the information storage device 902.
  • the “classification code” is an identifier used for classifying documents, and is an identifier indicating the degree of relevance with the lawsuit so that the document can be easily used in the lawsuit. For example, when document information is used as evidence in a lawsuit, it may be given according to the type of evidence.
  • “Document” is data including one or more words, and may be, for example, e-mail, presentation materials, spreadsheet materials, meeting materials, contracts, organization charts, business plans, and the like.
  • “Word” is the smallest group of character strings having meaning. For example, the sentence “document means data including one or more words” includes the words “document”, “one”, “more”, “word”, “include”, and “data”.
  • “Keyword” is a group of character strings having a certain meaning in a certain language. For example, if a keyword is selected from the sentence “classify a document”, it can be “document” or “classify”. In the present embodiment, keywords such as “infringement”, “lawsuit”, or “patent publication XX” are selected with priority.
  • the “keyword” may include a morpheme.
  • “Keyword correspondence information” is information representing the correspondence between a keyword and a specific classification code. For example, when the classification code “important”, which represents an important document in a lawsuit, has a close relationship with the keyword “infringer”, the “keyword correspondence information” may be information that manages the classification code “important” and the keyword “infringer” in association with each other.
  • the “related term” is a term having an evaluation value of a certain value or more among words having a high appearance frequency in common with a document to which a predetermined classification code is assigned.
  • the appearance frequency may be, for example, a ratio of related terms appearing in the total number of words appearing in one document.
  • “Evaluation value” is a value indicating the amount of information that each word carries in a document.
  • the “evaluation value” may be calculated based on the amount of transmitted information.
  • the “related term” may refer to the name of the technical field to which the product belongs, the country where the product is sold, the name of a similar product of the product, and the like.
  • When the product name of an apparatus that performs image encoding is assigned as a classification code, the “related terms” include “encoding process”, “Japan”, “encoder”, and the like.
  • “Related term correspondence information” refers to information indicating the correspondence between related terms and classification codes. For example, when the classification code “product A”, which is the product name related to the lawsuit, has a related term “image encoding”, which is a function of the product A, the “related term correspondence information” is the classification code “product A”. And the related term “image coding” may be managed in association with each other.
  • “Score” refers, as described above, to a value that quantitatively evaluates the strength of a document's association with a specific classification code. In each embodiment of the present invention, for example, the score is calculated according to the above-described [Equation 1] to [Equation 3].
  • The document analysis system 101 may extract words that frequently appear in documents having a common classification code assigned by the user. It may then analyze, for each document, trend information consisting of the types of the extracted words, the evaluation value of each word, and the number of appearances in the document, and may assign the common classification code to those documents, among the documents for which the classification code reception / giving unit 131 has not accepted a classification code, that have the same tendency as the analyzed trend information.
  • The “trend information” represents the degree of similarity between each document and a document to which a classification code has been assigned. It is expressed as the degree of association with a predetermined classification code and is based on the types of words, the number of occurrences, and the evaluation values of the words included in each document. For example, when a document is similar to a document given a predetermined classification code in its degree of relevance to that classification code, the two documents are said to have the same trend information.
  • documents having the same evaluation value and the same number of occurrences may be documents having the same tendency.
  • FIG. 4 is a flowchart illustrating an example of processing executed by the document analysis system 101.
  • the flow shown in FIG. 2 may be executed as a process independent of the flow shown in FIG. 4 or may be executed as a process included in any part of the flow shown in FIG. .
  • A category corresponding to a lawsuit (including antitrust, patent, FCPA, and PL) or to a fraud investigation (including information leakage and fictitious claims) can be specified (S11).
  • a use database such as a survey basic database and a document analysis database can be identified (S12).
  • the information storage device 902 that stores the latest database can be accessed.
  • the information storage device 902 may be installed inside an organization that performs sorting or may be installed outside the organization. As a case where the information storage device 902 is installed outside the organization, for example, it may be installed in a partner law firm or patent office.
  • authentication by ID and password can be performed to maintain security (S13).
  • the usage database such as the survey basic database and the document analysis database can be updated to the guideline database (S14).
  • The updated survey basic database is searched (S15), and the names of the company, the person in charge, and the custodian can be presented on the screen of the display device (S16). If the displayed names of the person in charge and the custodian differ from the actual ones, the user corrects them on the screen of the display device.
  • the document analysis system can accept the user's correction input and specify the names of the actual person in charge and the custodian (S17).
  • digital document information can be extracted in order to perform document analysis work (S18).
  • In the updated document analysis database, the updated keyword database, related term database, and score calculation database are searched (S19), and a classification code can be assigned to the extracted document information (S20).
  • the classification code by the reviewer can be received and the classification code can be given to the extracted document information (S21).
  • the database is searched using the classification result as teacher data, and a classification code can be assigned to the extracted document information (S22).
  • a review by the chief attorney or patent attorney can be accepted (S23). This can improve the quality of the survey.
  • the category is specified by the user's argument designation (S24), and the report creation database can be specified according to the specified category (S25).
  • the format of the report can be determined by the identified report creation database, and the report can be automatically output (S26).
  • FIG. 5 is a flowchart showing an example of the investigation and classification process according to the investigation type in the example of the process shown in FIG.
  • the survey type can be input (S31).
  • The user enters the category corresponding to the investigation and classification work to be carried out, chosen from litigation cases including antitrust, patents, overseas bribery prohibition (FCPA), and product liability (PL), or from fraud investigations including information leakage and fictitious claims.
  • the document analysis system can accept a user category input and specify a category to be investigated.
  • the type of survey and document analysis processing and the type of database to be used can be determined (S32).
  • a stock of information stored in a usage database such as a survey basic database or a document analysis database may be accessed (S33).
  • the survey basic database is accessed according to the specified category, and each keyword input screen corresponding to the specified category can be displayed (S34).
  • the survey basic database is accessed according to the identified category, and each text input screen can be displayed according to the identified category (S35).
  • The survey basic database is accessed according to the identified category, and a keyword or document can be extracted according to the identified category (S36).
  • weighting can be added to the teacher data for automatic classification code assignment (predictive coding) (S37).
  • the extracted documents and information can be narrowed down by performing a keyword search in the document analysis database (S38).
  • FIG. 6 is a flowchart showing an example of predictive coding according to the investigation type in the example of the process shown in FIG.
  • The document analysis system can ask the user for input according to the type of survey and can accept the user's input in response. For example, for cartels under antitrust law, user input can be requested and accepted for the target products, the parties (name and e-mail address), the related organizations (name and department), and the time period. For the related organizations, user input regarding competitor companies and customer companies can also be requested and accepted (S51).
  • the keyword and related terms are updated and registered in advance using the result of the past classification process (S100).
  • the keyword and the related term are updated and registered together with the keyword correspondence information and the related term correspondence information which are correspondence information between the classification code and the keyword or the related term.
  • In the second stage, documents including the keyword updated and registered in the first stage are extracted from all document information. The updated keyword correspondence information recorded in the first stage is then referred to, and a first classification process that assigns the classification code corresponding to the keyword is performed (S200).
  • In the third stage, documents including the related term updated and registered in the first stage are extracted from the document information that was not given a classification code in the second stage, the score of each document including the related term is calculated, and a second classification process that assigns a classification code based on the calculated score is performed (S300).
  • In the fourth stage, the classification code given by the user is accepted for document information that has not been given a classification code by the third stage, and the classification code accepted from the user is given to that document information. Next, the document information given the classification code received from the user is analyzed, documents without a classification code are extracted based on the analysis result, and a third classification process that assigns a classification code to the extracted documents is performed. For example, words that frequently appear in documents with a common classification code assigned by the user are extracted; the types of the extracted words, the evaluation value of each word, and the trend information on the number of appearances are analyzed for each document; and the common classification code is assigned to documents having the same tendency as the trend information (S400).
  • In the fifth stage, for the documents to which the user gave a classification code in the fourth stage, the classification code to be given is determined based on the analyzed trend information, and the validity of the classification process is verified by comparing the determined classification code with the classification code given by the user (S500). A learning process may also be performed as needed, based on the result of the document analysis process.
  • The trend information used in the fourth and fifth stage processing represents the degree of similarity between each document and a document to which a classification code is assigned, and is based on the types of words included in each document, the number of occurrences, and the evaluation values of the words. For example, when a document is similar to a document assigned a predetermined classification code in its degree of relevance to that classification code, the two documents have the same trend information. Even if the types of words included are different, documents having the same evaluation values and the same numbers of occurrences may be treated as documents having the same tendency.
  • the keyword database 104 creates a management table for each classification code based on the result of classifying documents in past lawsuits, and specifies keywords corresponding to each classification code (S111).
  • To specify the keywords, a method that analyzes the documents to which each classification code is assigned and uses the number of occurrences and the evaluation value of each keyword in those documents, a method of manual selection by the user, or the like may be used.
  • Keyword correspondence information indicating that the keyword has a close relationship with the classification code is created (S112). The identified keyword is then registered in the keyword database 104. At this time, the identified keyword is associated with the keyword correspondence information and recorded in the management table of the classification code “important” in the keyword database 104 (S113).
  • the related term database 105 creates a management table for each classification code based on the result of classifying documents in past lawsuits, and registers the related terms corresponding to each classification code (S121).
  • For example, in S121, “encoding process” and “product a” are registered as related terms of the classification code “product A”, and “decoding” and “product b” are registered as related terms of the classification code “product B”.
  • Related term correspondence information indicating which classification code each registered related term corresponds to is created (S122) and recorded in each management table (S123). At this time, the related term correspondence information also records the evaluation value of each related term and the score threshold required to determine the classification code.
  • the keyword and the keyword correspondence information, and the related term and the related term correspondence information are updated and registered (S113, S123).
  • the first automatic sorting unit 201 extracts documents including the keywords “infringement” and “patent attorney” registered in the keyword database 104 in the first step (S100) from the document information (S211).
  • For each extracted document, the management table in which the keyword is recorded is referred to based on the keyword correspondence information (S212), and the classification code “important” is given (S213).
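  • The keyword-based first classification (S211 to S213) can be sketched as a table lookup. In the Python below, the dictionary `keyword_correspondence` stands in for the keyword correspondence information, and the function name, the case-insensitive substring match, and the sample documents are assumptions made for illustration only.

```python
# Hypothetical stand-in for the keyword correspondence information:
# each registered keyword maps to the classification code it is tied to.
keyword_correspondence = {
    "infringement": "important",
    "patent attorney": "important",
}


def first_classification(documents, correspondence):
    """Give each document the code of the first registered keyword it contains.

    documents: mapping of document id -> text.
    Documents containing no registered keyword are left without a code.
    """
    codes = {}
    for doc_id, text in documents.items():
        lowered = text.lower()  # assumption: matching ignores letter case
        for keyword, code in correspondence.items():
            if keyword in lowered:
                codes[doc_id] = code
                break
    return codes


if __name__ == "__main__":
    docs = {
        "d1": "Possible infringement of our patent",
        "d2": "Lunch menu for Friday",
    }
    print(first_classification(docs, keyword_correspondence))
```

  Documents left without a code by this step are the input to the score-based classification of the next stage.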
  • The second automatic classification unit 301 performs a process of assigning the classification codes “product A” and “product B” to the document information that was not assigned a classification code in the second stage (S200).
  • The second automatic classification unit 301 extracts documents including the related terms “encoding process”, “product a”, “decoding”, and “product b” recorded in the related term database 105 in the first stage (S311). For each extracted document, the score calculation unit 116 calculates a score using Expression (1), based on the appearance frequency and evaluation value of the four recorded related terms (S312). The score represents the degree of association between each document and the classification codes “product A” and “product B”.
  • When the appearance frequency of the related terms “encoding process” and “product a” and the evaluation value of the related term “encoding process” are high, and the score indicating the degree of association with the classification code “product A” exceeds the threshold value, the document is given the classification code “product A”.
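  • The threshold decision described above can be sketched as follows. Expression (1) is not reproduced in this part of the text, so the score here is assumed, purely for illustration, to be the sum of each related term's number of occurrences multiplied by its evaluation value; the evaluation values, thresholds, and function names are likewise hypothetical.

```python
# Hypothetical related term tables: classification code -> {related term: evaluation value}.
related_terms = {
    "product A": {"encoding process": 2.0, "product a": 1.0},
    "product B": {"decoding": 2.0, "product b": 1.0},
}
# Hypothetical score thresholds recorded with the related term correspondence information.
thresholds = {"product A": 3.0, "product B": 3.0}


def score(text, terms):
    """Assumed score: sum of (occurrences of term) x (evaluation value of term)."""
    lowered = text.lower()
    return sum(lowered.count(term) * value for term, value in terms.items())


def second_classification(text):
    """Return the best-scoring code if its score exceeds the threshold, else None."""
    scores = {code: score(text, terms) for code, terms in related_terms.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > thresholds[best] else None


if __name__ == "__main__":
    sample = "The encoding process of product a uses a new encoding process."
    print(second_classification(sample))
    print(second_classification("Meeting notes"))
```

  Keeping the threshold per classification code mirrors the statement that the threshold is recorded together with the related term correspondence information.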
  • The second automatic sorting unit 301 recalculates the evaluation value of the related term by the following [Equation 4] using the score calculated in S432 in the fourth stage, and weights the evaluation value (S315).
  • Here, w_{i,L} represents the weight of the i-th selected keyword after the L-th learning, γ_L represents the learning parameter in the L-th learning, and θ represents the learning effect threshold value. For example, if there are more than a certain number of documents in which the appearance frequency of “decoding” is very high but the score is lower than a certain value, the evaluation value of the related term “decoding” is lowered and the related term correspondence information is recorded again.
  • For a certain percentage of document information extracted from the document information to which no classification code has been given, the classification code given by the reviewer is accepted, and the accepted classification code is assigned to that document information.
  • the document information given the classification code received from the reviewer is analyzed, and based on the analysis result, the classification code is given to the document information to which the classification code is not given.
  • a process of assigning classification codes of “important”, “product A”, and “product B” is performed on the document information. The fourth stage is further described below.
  • the information extraction unit 24 samples a document at random from the document information to be processed in the fourth stage and displays it on the display unit 50.
  • 20% of the document information to be processed is extracted at random and set as a classification target by the reviewer.
  • Sampling may be an extraction method in which documents are arranged in order of document creation date and time or in order of name, and 30% of documents are selected from the top.
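  • The two sampling strategies mentioned above (a random 20% sample, or the first 30% after ordering by creation date) can be sketched as below. The function names, the fixed random seed, and the `created` field are assumptions for illustration.

```python
import random


def sample_random(doc_ids, fraction=0.2, seed=0):
    """Pick a random fraction of the documents for review by the reviewer."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    pool = list(doc_ids)
    return rng.sample(pool, round(len(pool) * fraction))


def sample_from_top(docs, fraction=0.3, key="created"):
    """Order documents by a field (e.g. creation date) and take the top fraction."""
    ordered = sorted(docs, key=lambda d: d[key])
    return ordered[: round(len(ordered) * fraction)]


if __name__ == "__main__":
    ids = [f"doc{i}" for i in range(10)]
    print(sample_random(ids))
    dated = [{"id": i, "created": f"2014-05-{10 - i:02d}"} for i in range(10)]
    print([d["id"] for d in sample_from_top(dated)])
```

  Random sampling avoids bias toward any period or author, whereas the ordered variant is deterministic and easier to reproduce; which is preferable depends on the investigation.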
  • the user browses the document display screen shown in FIG. 18 displayed on the display unit 50, and selects a classification code to be assigned to each document.
  • The classification code reception / giving unit 131 receives the classification code selected by the user (S411) and classifies the documents based on the given classification code (S412).
  • the document analysis unit 118 extracts words that frequently appear in the documents classified by classification code by the classification code reception and grant unit 131 (S421).
  • the evaluation value of the extracted common word is analyzed by equation (2) (S422), and the appearance frequency of the common word in the document is analyzed (S423).
  • FIG. 14 is a graph showing a result of analyzing words frequently appearing in the document to which the classification code “important” is assigned in S424.
  • The vertical axis R_hot shows, among all documents to which the user assigned the classification code “important”, the ratio of documents that include a word selected as being associated with the classification code “important”.
  • the horizontal axis indicates the ratio of documents including the word extracted in S421 by the classification code receiving and assigning unit 131 among all documents subjected to the classification process by the user.
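  • The two ratios plotted in FIG. 14 can be computed per word as in the sketch below. `r_hot` follows the description of the vertical axis; the name `r_all` for the horizontal-axis ratio, the substring test for "includes the word", and the sample data are assumptions for illustration.

```python
def document_ratios(word, reviewed, code="important"):
    """Compute (r_hot, r_all) for one word.

    reviewed: list of (text, assigned_code) pairs classified by the user.
    r_hot: share of documents given `code` that contain the word.
    r_all: share of all reviewed documents that contain the word.
    """
    hot = [text for text, assigned in reviewed if assigned == code]
    r_hot = sum(word in text for text in hot) / len(hot) if hot else 0.0
    r_all = sum(word in text for text, _ in reviewed) / len(reviewed) if reviewed else 0.0
    return r_hot, r_all


if __name__ == "__main__":
    reviewed = [
        ("patent infringement claim", "important"),
        ("infringement notice", "important"),
        ("lunch schedule", "other"),
        ("budget infringement memo", "other"),
    ]
    print(document_ratios("infringement", reviewed))
```

  A word whose `r_hot` is clearly higher than its `r_all` appears disproportionately in “important” documents, which is one plausible way to read the graph.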
  • the processing from S421 to S424 is also executed for the documents to which the classification codes “product A” and “product B” are assigned, and the trend information of the documents is analyzed.
  • the third automatic classification unit 401 performs processing on the document that has not been given the classification code by the classification code reception / giving unit 131 in step S411 out of the document information to be processed in the fourth stage.
  • From such documents, documents having the same trend information as the trend information, analyzed in S424, of the documents assigned the classification codes “important”, “product A”, and “product B” are extracted (S431), and a score is calculated for each extracted document using equation (1) based on the trend information (S432). Further, an appropriate classification code is assigned to the documents extracted in S431 based on the trend information (S433).
  • the third automatic sorting unit 401 further reflects the sorting result in each database using the score calculated in S432 (S434). Specifically, a process of lowering the evaluation values of keywords and related terms included in a document having a low score and increasing the evaluation values of keywords and related terms included in a document having a high score may be performed.
  • The third automatic classification unit 401 may perform a classification process on the document information, among the document information to be processed in the fourth stage, for which no classification code was accepted by the classification code reception / giving unit 131 in step S411.
  • From such documents, documents having the same trend information as the trend information, analyzed in S424, of the documents to which the classification code “important” is assigned are extracted (S442), and the score of each extracted document is calculated using equation (1) based on the trend information (S443). Further, an appropriate classification code is assigned to the documents extracted in S442 based on the trend information (S444).
  • the third automatic sorting unit 401 further reflects the sorting result in each database using the score calculated in S443 (S445). Specifically, the evaluation value of the keyword and the related term included in the document with a low score is lowered, while the evaluation value of the keyword and the related term included in the document with a high score is increased.
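  • The feedback step (S434, S445) raises or lowers evaluation values according to document scores. The text does not state by how much, so the sketch below assumes a simple multiplicative step and two hypothetical score cut-offs (`low`, `high`); it is an illustration of the direction of the update, not of the actual update rule.

```python
def reflect_scores(evaluation, scored_docs, low=1.0, high=3.0, step=0.1):
    """Return updated evaluation values after reflecting document scores.

    evaluation: mapping of term -> evaluation value.
    scored_docs: list of (terms contained in the document, document score).
    low, high, step: hypothetical parameters, not taken from the source text.
    """
    updated = dict(evaluation)
    for terms, doc_score in scored_docs:
        if doc_score >= high:
            factor = 1 + step  # high-scoring document: raise its terms
        elif doc_score <= low:
            factor = 1 - step  # low-scoring document: lower its terms
        else:
            continue
        for term in terms:
            if term in updated:
                updated[term] *= factor
    return updated


if __name__ == "__main__":
    values = {"decoding": 2.0, "encoding process": 2.0}
    docs = [({"decoding"}, 0.5), ({"encoding process"}, 4.0)]
    print(reflect_scores(values, docs))
```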
  • The data for score calculation may be collectively stored in the score calculation database 106.
  • <Fifth stage (S500)> A detailed processing flow of the quality inspection unit 501 in the fifth stage will be described with reference to FIG.
  • For the documents whose classification codes the classification code reception / giving unit 131 received in S411, the classification code to be given is determined based on the trend information analyzed by the document analysis unit 118 in S424 (S511).
  • the classification code received by the classification code reception / giving unit 131 is compared with the classification code determined in S511 (S512), and the validity of the classification code received in S411 is verified (S513).
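  • The comparison in S512 and S513 can be sketched as an agreement check between the codes accepted from the user and the codes determined from trend information. The function name, the agreement-rate measure, and the sample data are assumptions for illustration; the text does not specify how validity is quantified.

```python
def verify_codes(user_codes, predicted_codes):
    """Compare user-assigned codes with codes determined from trend information.

    Returns (agreement rate, {doc_id: (user code, predicted code)} for mismatches).
    """
    mismatches = {
        doc_id: (code, predicted_codes.get(doc_id))
        for doc_id, code in user_codes.items()
        if code != predicted_codes.get(doc_id)
    }
    agreement = 1 - len(mismatches) / len(user_codes) if user_codes else 1.0
    return agreement, mismatches


if __name__ == "__main__":
    user = {"d1": "important", "d2": "product A", "d3": "important", "d4": "product B"}
    predicted = {"d1": "important", "d2": "product B", "d3": "important", "d4": "product B"}
    print(verify_codes(user, predicted))
```

  Documents listed as mismatches are natural candidates for a second look by the reviewer or the supervising attorney.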
  • the document analysis system 101 may include a learning unit 601.
  • the learning unit 601 learns the weighting of each keyword or related term based on the first to fourth processing results using Expression (2).
  • the learning result may be reflected in the keyword database 104, the related term database 105, or the score calculation database 106.
  • The document analysis system 101 may include a report creation unit 701 that, based on the result of the document analysis process, outputs an optimum survey report according to the survey type (e.g., fictitious billing).
  • The contents of the survey vary depending on the survey type. For example, the points of a cartel investigation are (1) when and how the personnel of competing companies communicated regarding the cartel (price adjustment), and (2) which organizations were involved.
  • a document survey report system, a document survey report method, and a document survey report program according to another example of the embodiment of the present invention will be described below.
  • A document that has already been given a classification code is analyzed in correspondence with similar search information, and the range to which the classification code is assigned is adjusted based on the analysis result. Then, based on the adjusted range, the classification work and the survey work are performed, and a report is created based on the results of the classification work and the survey work.
  • Methods of adjusting the range to which the classification code is assigned in correspondence with similar search information include a method of clustering the similar search information and a method of performing predictive classification by learning.
  • a common classification code may be given to the reply document of the reply document of the original document.
  • the same or similar classification codes are given to similar search information by learning to integrate similar search information for the classification results.
  • the reliability of the analysis result varies depending on the number of documents to be analyzed.
  • A statistical method may be applied to the total number of documents to be classified in order to determine at what point, that is, after what percentage of all documents has been processed, the range to which the classification code is assigned should be adjusted based on the analysis results.
  • The range of documents to which the classification code is assigned may also be adjusted by executing both the method of adjusting the range by clustering the search information in correspondence with the similar search information and the method of performing predictive classification by learning the classification results.
  • a report is created based on the results of these sorting operations and surveys.
  • a display screen control unit that controls a display screen that presents the type of information extracted by the survey type determination unit to the user may be provided.
  • an input receiving unit that receives a keyword and / or sentence input by a user corresponding to the type of information presented on the display screen control unit may be provided.
  • the embodiment of the present invention automatically updates the database according to a category by accepting a user input for a category of litigation case or fraud investigation case.
  • the burden of office work for inputting the names of persons in charge, custodians, etc. is reduced.
  • the search word is adjusted by the database automatically updated according to the category, and a classification code is automatically assigned to the document information using the adjusted search word. This reduces the burden of sorting the document information used for litigation or fraud investigation cases. That is, according to the present invention, it becomes easy to analyze document information used in a lawsuit.
  • The control blocks (particularly the control unit 10) of the document analysis system 100 and the document analysis system 101 may be realized by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or may be realized by software using a CPU (Central Processing Unit).
  • In the latter case, the document analysis systems 100 and 101 include a CPU that executes the instructions of a program (the control program for the document analysis systems 100 and 101) that is software realizing each function, a ROM (Read Only Memory) or storage device (referred to as a “recording medium”) in which the program and various data are recorded so as to be readable by a computer (or CPU), and a RAM (Random Access Memory) into which the program is expanded.
  • The computer (or CPU) reads the program from the recording medium and executes it.
  • As the recording medium, a “non-transitory tangible medium” such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used.
  • the program may be supplied to the computer via an arbitrary transmission medium (such as a communication network or a broadcast wave) that can transmit the program.
  • the present invention can also be realized in the form of a data signal embedded in a carrier wave in which the program is embodied by electronic transmission.
  • a control program for a document analysis system is a control program for a document analysis system that analyzes a document, and includes a computer, (document analysis system 100), a generation function, and multiplication. Functions and calculation functions are realized.
  • the generation function, multiplication function, and calculation function can be realized by the generation unit 12, the multiplication unit 13, and the calculation unit 14, respectively. Details are as described above.
  • The present invention can be widely applied to arbitrary computers such as personal computers, workstations, and mainframes.
  • 1: document data (document), 2: keyword vector, 3: correlation vector, 4: score, 5: most frequent sentence (the sentence corresponding to the keyword vector indicating that a predetermined keyword is contained the most), 6: summary information (summary), 7: phase information (phase), 8: change information (change in phase), 12: generation unit, 13: multiplication unit, 14: calculation unit, 15: extraction unit, 16: summary unit, 17: phase identification unit, 18: change estimation unit, 100: document analysis system, 101: document analysis system

Abstract

The present invention accurately calculates a score that correctly reflects the meaning of a sentence. A document analysis system that is provided with: a generating unit that generates for each sentence included in a document a keyword vector that indicates whether prescribed keywords are included in the sentence; a multiplying unit that obtains a correlation vector for each sentence by multiplying the generated keyword vectors by a correlation matrix that indicates the correlation between the prescribed keywords and other keywords that are different from the prescribed keywords; and a calculating unit that, on the basis of a value that aggregates all of the correlation vectors, calculates a score that indicates the strength of the connection between the document and a sorting code that indicates the degree of relatedness between a document and a prescribed event.

Description

Document analysis system, control method for document analysis system, and control program for document analysis system
The present invention relates to a document analysis system and the like for analyzing documents.
In civil litigation in the United States, under the e-discovery system, both the plaintiff and the defendant are responsible for submitting digital information related to the lawsuit as evidence. In recent years, with enormous volumes of documents stored as digital information, this system places a heavy burden on the litigating parties.
To reduce this burden, document analysis systems called "forensic systems" have been proposed. For example, Patent Document 1 below discloses a document classification system that analyzes digital documents collected for submission as evidence in a lawsuit and classifies them so that they can be used easily in the lawsuit.
Patent Document 1: JP 2013-214152 A (published October 17, 2013)
According to the document classification system disclosed in Patent Document 1, a score is calculated based on the evaluation values of the related terms contained in an extracted document and on the number of those related terms.
However, the above document classification system yields no significant difference in score between two documents of quite different character, for example (a) a document in which every sentence contains the keywords "price" and "adjustment", and (b) a document in which each sentence contains only "price" or only "adjustment" in a fragmentary manner. This is because the keyword vector indicating whether predetermined keywords are included is generated per document, and so cannot have a structure that correctly reflects the meaning of a "sentence", which is a unit smaller than a "document".
The present invention has been made in view of the above problem, and its object is to provide a document analysis system and the like capable of accurately calculating a score that correctly reflects the meaning of each sentence.
To solve the above problem, a document analysis system according to one aspect of the present invention is a document analysis system that analyzes a document, and includes: a generation unit that generates, for each sentence included in the document, a keyword vector indicating whether a predetermined keyword is included in that sentence; a multiplication unit that obtains a correlation vector for each sentence by multiplying each keyword vector generated by the generation unit by a correlation matrix indicating the correlation between the predetermined keyword and other keywords different from the predetermined keyword; and a calculation unit that calculates, based on a value obtained by summing all the correlation vectors obtained by the multiplication unit, a score indicating how strongly the document is linked to a classification code that indicates the degree of relevance between the document and a predetermined case.
Here, the keyword vector is, for example, a vector in which each element takes the value "0" or "1" to indicate whether the predetermined keyword associated with that element is included in the document. The correlation matrix is a square matrix whose elements represent, for example, when the keyword "price" appears in a sentence, how likely another keyword (for example, "adjustment") is to appear in that sentence together with it (that is, their correlation).
With the above configuration, the document analysis system generates a keyword vector for each sentence, so the keyword vector has a structure (representation) that can correctly reflect the meaning of a "sentence". The score can therefore be calculated accurately, such that two documents of different character receive significantly different scores.
In the document analysis system according to one aspect of the present invention, the calculation unit may calculate the score by calculating the inner product of the summed value and a weight vector indicating the weight of each predetermined keyword.
The document analysis system according to one aspect of the present invention may further include an extraction unit that extracts, from the document, the sentence corresponding to the keyword vector indicating that the predetermined keyword is contained the most.
The document analysis system according to one aspect of the present invention may further include a summary unit that generates a summary of the document by listing the sentences corresponding to keyword vectors indicating that the predetermined keyword is included.
The document analysis system according to one aspect of the present invention may further include a phase identification unit that identifies, based on the score calculated by the calculation unit, a phase that classifies a predetermined action causing the predetermined case according to the progress of that action.
The document analysis system according to one aspect of the present invention may further include a change estimation unit that estimates a change in the phase identified by the phase identification unit, based on the temporal transition of the phases.
The document analysis system according to one aspect of the present invention may further include a code assigning unit that assigns a classification code to the document based on the score calculated by the calculation unit.
To solve the above problem, a control method for a document analysis system according to one aspect of the present invention is a control method for a document analysis system that analyzes a document, and includes: a generation step of generating, for each sentence included in the document, a keyword vector indicating whether a predetermined keyword is included in that sentence; a multiplication step of obtaining a correlation vector for each sentence by multiplying each keyword vector generated in the generation step by a correlation matrix indicating the correlation between the predetermined keyword and other keywords different from the predetermined keyword; and a calculation step of calculating, based on a value obtained by summing all the correlation vectors obtained in the multiplication step, a score indicating how strongly the document is linked to a classification code that indicates the degree of relevance between the document and a predetermined case.
Accordingly, the control method for the document analysis system provides the same effects as the document analysis system.
To solve the above problem, a control program for a document analysis system according to one aspect of the present invention is a control program for a document analysis system that analyzes a document, and causes a computer to realize: a generation function of generating, for each sentence included in the document, a keyword vector indicating whether a predetermined keyword is included in that sentence; a multiplication function of obtaining a correlation vector for each sentence by multiplying each keyword vector generated by the generation function by a correlation matrix indicating the correlation between the predetermined keyword and other keywords different from the predetermined keyword; and a calculation function of calculating, based on a value obtained by summing all the correlation vectors obtained by the multiplication function, a score indicating how strongly the document is linked to a classification code that indicates the degree of relevance between the document and a predetermined case.
That is, the document analysis system according to each aspect of the present invention may be realized by a computer. In this case, a control program that realizes the document analysis system on a computer by causing the computer to operate as each unit of the document analysis system, and a computer-readable recording medium on which that control program is recorded, also fall within the scope of the present invention.
Accordingly, the control program for the document analysis system provides the same effects as the document analysis system.
The document analysis system, the control method for a document analysis system, and the control program for a document analysis system according to one aspect of the present invention generate a keyword vector for each sentence, so the keyword vector has a structure (representation) that can correctly reflect the meaning of a "sentence". As a result, the score can be calculated accurately, such that two documents of different character receive significantly different scores.
A block diagram showing the main configuration of the document analysis system according to the first embodiment of the present invention. A flowchart showing an example of the processing executed by the document analysis system shown in FIG. 1. A block diagram showing the main configuration of the document analysis system according to the second embodiment of the present invention. A flowchart showing an example of the processing executed by the document analysis system shown in FIG. 3. A flowchart showing an example of the investigation and classification processing according to the investigation type, within the example of the processing shown in FIG. 4. A flowchart showing an example of predictive coding according to the investigation type, within the example of the processing shown in FIG. 4. A flowchart showing an example of the processing at each stage in the second embodiment. A flowchart showing an example of the processing of the keyword database in the second embodiment. A flowchart showing an example of the processing of the related term database in the second embodiment. A flowchart showing an example of the processing of the first automatic classification unit in the second embodiment. A flowchart showing an example of the processing of the second automatic classification unit in the second embodiment.
A flowchart showing an example of the processing of the classification code reception and assignment unit in the second embodiment. A flowchart showing an example of the processing of the document analysis unit in the second embodiment. A graph showing the analysis results of the document analysis unit in the second embodiment. A flowchart showing an example of the processing of the third automatic classification unit in the second embodiment. A flowchart showing another example of the processing of the third automatic classification unit in the second embodiment. A flowchart showing an example of the processing of the quality inspection unit in the second embodiment. A schematic diagram showing an example of the document display screen in the second embodiment.
[Embodiment 1]
A first embodiment (Embodiment 1) of the present invention will be described with reference to FIGS. 1 and 2.
(Configuration of document analysis system 100)
FIG. 1 is a block diagram showing the main configuration of a document analysis system 100 according to the first embodiment of the present invention. The document analysis system 100 is a system for analyzing documents. The document analysis system 100 need only be a device capable of executing the processing described below, and can be realized using any computer.
As shown in FIG. 1, the document analysis system 100 includes a receiving unit 21, a control unit 10 (an acquisition unit 11, a generation unit 12, a multiplication unit 13, a calculation unit 14, an extraction unit 15, a summary unit 16, a phase identification unit 17, and a change estimation unit 18), and a display unit 50.
The receiving unit 21 receives document data 1 from an external computer by communicating with the outside over a communication network in accordance with a predetermined communication method. The receiving unit 21 need only have the essential function of communicating with that computer; the communication line, communication method, and communication medium are not limited. The receiving unit 21 can be configured with a device such as an Ethernet (registered trademark) adapter. The receiving unit 21 can also use communication methods and media such as IEEE 802.11 wireless communication or Bluetooth (registered trademark).
The control unit 10 controls the various functions of the document analysis system 100 in an integrated manner. The control unit 10 includes the acquisition unit 11, the generation unit 12, the multiplication unit 13, the calculation unit 14, the extraction unit 15, the summary unit 16, the phase identification unit 17, and the change estimation unit 18.
The acquisition unit 11 acquires the document data 1 received by the receiving unit 21 and outputs it to the generation unit 12.
The generation unit 12 generates, for each sentence included in the document data (document) 1, a keyword vector 2 indicating whether predetermined keywords (morphemes) are included in that sentence. Each element of the keyword vector 2 takes the value "0" or "1" to indicate whether the predetermined keyword associated with that element is included in the document data 1.
For example, when the second sentence in the document data 1 contains the keyword "price", the generation unit 12 changes the element corresponding to "price" in the keyword vector 2 for that sentence from "0" to "1". The generation unit 12 outputs the generated keyword vectors 2 to the multiplication unit 13, the extraction unit 15, the summary unit 16, and the phase identification unit 17.
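The per-sentence generation described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the keyword list, the period-based sentence splitter, and the substring matching (in place of morphological analysis) are all assumptions made for the example, as are the function names.

```python
# Hypothetical keyword list; the order fixes the meaning of each vector element.
KEYWORDS = ["price", "adjustment"]


def split_sentences(document):
    """Naive splitter on '.', used only for this illustration."""
    return [s.strip() for s in document.split(".") if s.strip()]


def keyword_vector(text, keywords=KEYWORDS):
    """Return a 0/1 vector: element i is 1 if keywords[i] occurs in text.

    Substring matching is an assumption; the source speaks of morphemes.
    """
    lowered = text.lower()
    return [1 if kw in lowered else 0 for kw in keywords]


def sentence_keyword_vectors(document, keywords=KEYWORDS):
    """One keyword vector per sentence, in document order."""
    return [keyword_vector(s, keywords) for s in split_sentences(document)]


# Document (a): every sentence contains both keywords.
doc_a = "We discussed the price adjustment. The price adjustment was agreed."
# Document (b): each sentence contains only one of the keywords.
doc_b = "The price is high. The seat needs adjustment."

vectors_a = sentence_keyword_vectors(doc_a)
vectors_b = sentence_keyword_vectors(doc_b)

# A single vector per document cannot tell (a) and (b) apart,
# whereas the per-sentence vectors differ.
same_at_document_level = keyword_vector(doc_a) == keyword_vector(doc_b)
```

In this toy case the document-level vectors of (a) and (b) coincide, while the sentence-level vectors do not, which is the distinction the description relies on.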
The multiplication unit 13 obtains a correlation vector 3 for each sentence by multiplying each keyword vector 2 generated by the generation unit 12 by a correlation matrix indicating the correlation between the predetermined keyword and other keywords different from the predetermined keyword. The correlation matrix is a square matrix whose elements represent, for example, when the keyword "price" appears in a sentence, how likely another keyword (for example, "adjustment") is to appear in that sentence together with it (that is, their correlation). The multiplication unit 13 outputs the correlation vectors 3 to the calculation unit 14.
The correlation matrix is optimized in advance using a training data set containing a predetermined number of predetermined documents. For example, when the keyword "price" appears in a sentence, the number of times each other keyword appears together with it, normalized to a value between 0 and 1 (that is, a maximum likelihood estimate), is stored in the corresponding element of the correlation matrix (the elements in each column of the correlation matrix therefore sum to 1). This allows the document analysis system 100 to calculate the correlation vector 3 optimally.
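One way to picture the column-normalized correlation matrix and the multiplication step is the sketch below. It assumes that co-occurrence is counted per sentence from 0/1 keyword vectors and that the keyword's own occurrences are counted on the diagonal; the source does not spell out either detail, and the function names and training vectors are hypothetical.

```python
def estimate_correlation_matrix(training_vectors, n_keywords):
    """Estimate a correlation matrix from 0/1 sentence keyword vectors.

    matrix[i][j] is the share of keyword i among all keyword occurrences in
    sentences that contain keyword j, so every non-empty column sums to 1.
    Counting the diagonal (i == j) is an assumption of this sketch.
    """
    counts = [[0.0] * n_keywords for _ in range(n_keywords)]
    for vec in training_vectors:
        for j in range(n_keywords):
            if vec[j]:
                for i in range(n_keywords):
                    counts[i][j] += vec[i]
    for j in range(n_keywords):
        column_total = sum(counts[i][j] for i in range(n_keywords))
        if column_total > 0:
            for i in range(n_keywords):
                counts[i][j] /= column_total
    return counts


def correlation_vector(matrix, keyword_vec):
    """Multiply the correlation matrix by one sentence's keyword vector."""
    return [sum(row[j] * keyword_vec[j] for j in range(len(keyword_vec)))
            for row in matrix]


# Hypothetical training sentences over ["price", "adjustment", "meeting"].
training = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]]
C = estimate_correlation_matrix(training, 3)

# A sentence containing only "price" is mapped to the "price" column of C,
# which also carries weight for the keywords that tend to accompany it.
c_vec = correlation_vector(C, [1, 0, 0])
```

The effect is that a sentence mentioning only "price" still contributes some weight to "adjustment" if the two co-occurred often in the training data.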
As shown in [Equation 1] below, the calculation unit 14 calculates, for each document data 1, a score 4 based on the value obtained by summing all the correlation vectors 3 obtained by the multiplication unit 13. The score 4 indicates how strongly the document data 1 is linked to a classification code that indicates the degree of relevance between the document data 1 and a predetermined case. More specifically, the calculation unit 14 calculates the score 4 for each document by calculating the inner product of the summed value (expressed as a column vector) and a weight vector W (expressed as a row vector) indicating the weight of each predetermined keyword.
[Equation 1] (image not reproduced: Figure JPOXMLDOC01-appb-M000001)
Here, in [Equation 1], C represents the correlation matrix and s_s represents the s-th keyword vector 2. TFnorm (the summed value) is calculated as shown in [Equation 2] below.
[Equation 2] (image not reproduced: Figure JPOXMLDOC01-appb-M000002)
Here, in [Equation 2], TF_i represents the term frequency of the i-th keyword, and s_js represents the j-th element of the s-th keyword vector 2.
Combining [Equation 1] and [Equation 2], the calculation unit 14 calculates the score 4 for each document by evaluating [Equation 3] below.
[Equation 3] (image not reproduced: Figure JPOXMLDOC01-appb-M000003)
Here, in [Equation 3], w_i is the i-th element of the weight vector W. The calculation unit 14 outputs the calculated score 4 to the phase identification unit 17, the change estimation unit 18, and the display unit 50.
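The score calculation described in prose above (sum the per-sentence correlation vectors, then take the inner product with the weight vector) can be sketched as follows. Because the published equations are available only as images, this sketch follows the prose alone and omits any normalization that TFnorm may apply; the matrix and weight values are made up for illustration.

```python
def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]


def document_score(sentence_vectors, matrix, weights):
    """Inner product of the weight vector with the summed correlation vectors.

    Any normalization used in the published equations is not modeled here.
    """
    summed = [0.0] * len(weights)
    for vec in sentence_vectors:
        corr = mat_vec(matrix, vec)
        summed = [a + b for a, b in zip(summed, corr)]
    return sum(w * x for w, x in zip(weights, summed))


# Hypothetical correlation matrix and weights for ["price", "adjustment"].
C = [[0.6, 0.4],
     [0.4, 0.6]]
W = [1.0, 2.0]

# Document (a): both keywords in every sentence.
score_a = document_score([[1, 1], [1, 1]], C, W)
# Document (b): one keyword per sentence.
score_b = document_score([[1, 0], [0, 1]], C, W)
```

With these made-up values the two documents receive different scores, even though a single 0/1 vector per document would be identical for both.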
The extraction unit 15 extracts, from the document data 1, the sentence (most frequent sentence 5) corresponding to the keyword vector 2 indicating that the predetermined keyword is contained the most. For example, the keyword "price" appears three times in the sentence "The price of product a sold by company A is higher than the price of product b sold by company B, so we adjusted the price of both products." If that sentence contains the keyword "price" more often than any other sentence, the extraction unit 15 outputs it to the display unit 50 as the most frequent sentence 5. The predetermined keyword ("price" in this example) may be given to the document analysis system 100 via a predetermined input device.
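A rough sketch of this extraction is given below. It assumes that occurrences are counted by case-insensitive substring matching and that ties go to the earliest sentence; neither rule is stated in the source, and the function name and the second and third sample sentences are hypothetical.

```python
def most_frequent_sentence(sentences, keyword):
    """Return the sentence with the most occurrences of keyword, or None.

    Ties are resolved in favor of the earliest sentence (an assumption).
    """
    best, best_count = None, 0
    needle = keyword.lower()
    for sentence in sentences:
        count = sentence.lower().count(needle)
        if count > best_count:
            best, best_count = sentence, count
    return best


sentences = [
    "The price of product a sold by company A is higher than the price of "
    "product b sold by company B, so we adjusted the price of both products.",
    "The price list is attached.",
    "See you at the meeting.",
]
top = most_frequent_sentence(sentences, "price")
```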
The summary unit 16 generates a summary of the document data 1 by listing the sentences in the document data 1 that correspond to keyword vectors 2 indicating that the predetermined keyword is included. For example, the summary unit 16 generates the summary by listing the sentences in the document data 1 that contain the keyword "price", and outputs summary information 6, which includes information on that summary, to the display unit 50. As above, the predetermined keyword may be given to the document analysis system 100 via a predetermined input device.
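The listing-based summary can be illustrated with the short sketch below, which filters sentences by the corresponding element of their keyword vectors. The function name, sample sentences, and hand-written vectors are assumptions for the example only.

```python
def summarize(sentences, keyword_vectors, keyword_index):
    """List, in document order, the sentences whose vector flags the keyword."""
    return [sentence
            for sentence, vec in zip(sentences, keyword_vectors)
            if vec[keyword_index] == 1]


# Hypothetical keyword order and hand-written per-sentence vectors.
KEYWORDS = ["price", "adjustment"]
sentences = [
    "We adjusted the price.",
    "Lunch is at noon.",
    "The price rose again.",
]
keyword_vectors = [[1, 0], [0, 0], [1, 0]]

summary = summarize(sentences, keyword_vectors, KEYWORDS.index("price"))
```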
The phase identification unit 17 identifies, based on the score 4 calculated by the calculation unit 14, a phase that classifies a predetermined action (an action performed by an individual or by an organization made up of several people) causing a predetermined case (for example, litigation, fraud investigation, collusion, information leakage, or fictitious billing) according to the progress of that action. The predetermined case may be given to the document analysis system 100 via a predetermined input device.
Here, a phase is an index indicating each stage through which the predetermined action progresses (that is, it classifies the action according to its progress). For example, when "collusion" is specified as the predetermined case, phases such as "Relationship Building" (building relationships with customers and competitors), "Preparation" (exchanging information about competitors with third parties), and "Competition" (presenting prices to customers, obtaining feedback, and communicating with competitors about that feedback) can be assumed to exist.
For example, when the score 4 falls within a predetermined range, the phase identification unit 17 may identify the phase associated with that range and output phase information 7, which includes information on that phase, to the change estimation unit 18. Alternatively, the phase identification unit 17 may identify the phase (maximum likelihood phase) that maximizes the likelihood (a value calculated as the score for each phase) of a model (observation process, likelihood function) representing the process by which a predetermined actor (an individual or an organization made up of several people) arrives at the predetermined action, and output phase information 7 including information on that phase to the change estimation unit 18.
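Both identification strategies mentioned above can be sketched in a few lines. The score ranges and the per-phase likelihood values below are invented for illustration; the source does not give thresholds or a concrete likelihood model, so this shows only the lookup and the argmax, not how the values would be obtained.

```python
# Hypothetical score ranges (low inclusive, high exclusive) per phase.
PHASE_RANGES = [
    (0.0, 10.0, "Relationship Building"),
    (10.0, 20.0, "Preparation"),
    (20.0, float("inf"), "Competition"),
]


def identify_phase(score, ranges=PHASE_RANGES):
    """Return the phase whose range contains score, or None if none does."""
    for low, high, phase in ranges:
        if low <= score < high:
            return phase
    return None


def most_likely_phase(phase_scores):
    """Return the phase with the largest per-phase score (likelihood)."""
    if not phase_scores:
        return None
    return max(phase_scores, key=phase_scores.get)


by_range = identify_phase(12.5)
by_likelihood = most_likely_phase(
    {"Relationship Building": 0.2, "Preparation": 0.7, "Competition": 0.1}
)
```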
Alternatively, when the keyword vector 2 is input from the generation unit 12 and indicates that predetermined keywords (for example, "price" and "adjustment") are included, the phase identification unit 17 may identify the phase corresponding to those keywords (for example, when the keywords "price" and "adjustment" are both included, it identifies the phase as "Competition") and output phase information 7 including information on that phase to the change estimation unit 18.
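The keyword-based alternative can be pictured as a lookup from sets of required keywords to phases, as in the sketch below. Only the mapping from "price" and "adjustment" to "Competition" comes from the text; the dictionary structure, the rule that all listed keywords must be present, and the function name are assumptions.

```python
# Mapping from phase to the keywords that must all be present.
# Only this single entry is taken from the example in the text.
PHASE_KEYWORDS = {"Competition": {"price", "adjustment"}}


def phase_from_keywords(keyword_vec, keywords, phase_keywords=PHASE_KEYWORDS):
    """Return the first phase whose required keywords are all flagged."""
    present = {kw for kw, flag in zip(keywords, keyword_vec) if flag}
    for phase, required in phase_keywords.items():
        if required <= present:
            return phase
    return None


KEYWORDS = ["price", "adjustment", "meeting"]
phase = phase_from_keywords([1, 1, 0], KEYWORDS)
```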
The change estimation unit 18 estimates a change in the phase identified by the phase identification unit 17, based on the temporal transition of the phases. For example, suppose it is known (for instance, from time-series information indicating the temporal order of the phases) that the "Relationship Building" phase passes through the "Preparation" phase and develops into the "Competition" phase. If the phase identification unit 17 then identifies the current phase as "Preparation", the change estimation unit 18 estimates that the next phase will be "Competition". The change estimation unit 18 outputs change information 8, which includes information on the change in phase, to the display unit 50.
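When the order of the phases is known, the estimation described above reduces to looking up the successor of the current phase. The sketch below assumes a simple linear order taken from the example in the text; the function name and the handling of the final or unknown phase are assumptions.

```python
# Known temporal order of phases, taken from the example in the text.
PHASE_ORDER = ["Relationship Building", "Preparation", "Competition"]


def estimate_next_phase(current, order=PHASE_ORDER):
    """Return the phase that follows current, or None if there is none."""
    if current not in order:
        return None
    idx = order.index(current)
    return order[idx + 1] if idx + 1 < len(order) else None


next_phase = estimate_next_phase("Preparation")
```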
 または、変化推定部18は、算出部14によって算出されたスコア4の移動平均と、所定のパターンとの相関を計算することによって、フェーズの変化を推定してもよい。ここで、上記所定のパターンは、上記所定の事件(例えば、訴訟、不正調査、談合、情報漏洩、架空請求など)とは異なる他の事件において算出されたスコアが、時間の経過とともに変化するパターンであってよい。 Alternatively, the change estimation unit 18 may estimate the phase change by calculating the correlation between the moving average of the score 4 calculated by the calculation unit 14 and a predetermined pattern. Here, the predetermined pattern is a pattern in which a score calculated in another case different from the predetermined case (for example, lawsuit, fraud investigation, collusion, information leakage, fictitious request, etc.) changes with the passage of time. It may be.
For example, if, in a lawsuit filed in the past, an analysis related to that lawsuit was performed in order to submit evidence and a moving average of the score was calculated, the change estimation unit 18 uses that moving average as the predetermined pattern and calculates the correlation between it and the moving average of the score 4 for the document data 1 analyzed this time. In other words, the change estimation unit 18 calculates the degree of agreement (correlation) between the two while shifting the elapsed time and/or the score. When the correlation between the two is high, the change estimation unit 18 estimates that the current score will, in the future, take similar values that track the predetermined pattern.
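The pattern matching described above can be sketched as follows. The specification does not fix the correlation measure; this example assumes the Pearson correlation coefficient, which is unaffected by a constant offset in the score, and slides the current series along the past pattern to cover the shift in elapsed time. All function names and sample values are hypothetical.

```python
from math import sqrt


def moving_average(values, window):
    """Simple moving average over a fixed window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_alignment(current, pattern):
    """Slide `current` along `pattern`; return (offset, correlation) of the best match."""
    best_offset, best_r = 0, -1.0
    for offset in range(len(pattern) - len(current) + 1):
        r = pearson(current, pattern[offset:offset + len(current)])
        if r > best_r:
            best_offset, best_r = offset, r
    return best_offset, best_r


def forecast(current, pattern, horizon):
    """Read the next `horizon` pattern values after the best-matching window."""
    offset, r = best_alignment(current, pattern)
    start = offset + len(current)
    return pattern[start:start + horizon], r


# Assumed moving average from a past case, and the current (offset) series.
past_pattern = [1, 1, 2, 4, 3, 1, 0, 2]
current_scores = [2.5, 4.5, 3.5]
offset, r = best_alignment(current_scores, past_pattern)
future, _ = forecast(current_scores, past_pattern, horizon=2)
```

In practice a threshold on the correlation would decide whether the match is strong enough to be used for estimation; that threshold is not specified in the text.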
The display unit 50 is a display device (for example, a liquid crystal display) capable of displaying the score 4 input from the calculation unit 14, the most frequent sentence 5 input from the extraction unit 15, the summary information 6 input from the summarization unit 16, and the change information 8 input from the change estimation unit 18. Although FIG. 1 shows a configuration example in which the document analysis system 100 includes the display unit 50, the display unit 50 only needs to be able to present each of the above items of information to the user; for example, it may be an external display device communicably connected to the document analysis system 100.
(Processing executed by the document analysis system 100)
FIG. 2 is a flowchart illustrating an example of processing executed by the document analysis system 100. In the following description, a parenthesized "... step" denotes a step included in the control method of the document analysis system 100 (the control method for the document analysis system).
First, the acquisition unit 11 acquires the document data 1 (step 1; hereinafter, "step" is abbreviated as "S"). Next, the generation unit 12 generates, for each sentence included in the document data 1, a keyword vector 2 indicating whether or not the sentence contains a predetermined keyword (S2, generation step).
Next, the multiplication unit 13 obtains a correlation vector 3 for each sentence by multiplying each keyword vector 2 generated in S2 by a correlation matrix indicating the correlation between the predetermined keyword and other keywords different from the predetermined keyword (S3, multiplication step).
Finally, based on a value obtained by summing all of the correlation vectors 3 obtained in S3, the calculation unit 14 calculates a score 4 indicating the strength with which a classification code, which indicates the degree of relevance between the document data 1 and a predetermined case, is linked to the document (S4, calculation step).
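Steps S2 to S4 can be sketched as below. This is an illustration under stated assumptions rather than the claimed calculation: keyword detection is a plain substring match, the correlation matrix values are invented, and the summed correlation vector is reduced to a scalar by adding its components, which the text above does not specify.

```python
def keyword_vector(sentence, keywords):
    """S2: 1 if the sentence contains the keyword, else 0 (simple substring match)."""
    return [1 if kw in sentence else 0 for kw in keywords]


def correlation_vector(vector, matrix):
    """S3: multiply the keyword (row) vector by the correlation matrix."""
    columns = len(matrix[0])
    return [sum(v * matrix[i][j] for i, v in enumerate(vector))
            for j in range(columns)]


def score(sentences, keywords, matrix):
    """S4: sum the correlation vectors of all sentences, then reduce to a scalar.

    Reducing by summing the components is an assumption made for this sketch.
    """
    total = [0.0] * len(matrix[0])
    for sentence in sentences:
        vec = correlation_vector(keyword_vector(sentence, keywords), matrix)
        total = [t + c for t, c in zip(total, vec)]
    return sum(total)


# Hypothetical keywords, correlation matrix, and sentences.
KEYWORDS = ["price", "adjustment"]
CORRELATION = [[1.0, 0.5],
               [0.5, 1.0]]
SENTENCES = ["price adjustment agreed", "lunch at noon", "the price rose"]
example_score = score(SENTENCES, KEYWORDS, CORRELATION)
```

The off-diagonal entries of the matrix let a sentence that mentions only "price" still contribute weight toward "adjustment", which is the point of multiplying by a correlation matrix rather than counting keywords directly.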
Note that, in addition to the processing described above with reference to FIG. 2, the control method may optionally include the processing executed by the acquisition unit 11, the extraction unit 15, the summarization unit 16, the phase specifying unit 17, and/or the change estimation unit 18.
[Embodiment 2]
A second embodiment (Embodiment 2) of the present invention will be described with reference to FIGS. 3 to 18. In this embodiment, only the configurations added to Embodiment 1 and the configurations that differ from those of Embodiment 1 are described. That is, all of the configurations described in Embodiment 1 can also be included in Embodiment 2. The definitions of the terms given in Embodiment 1 also apply in Embodiment 2.
(Configuration of the document analysis system 101)
FIG. 3 is a block diagram showing the main configuration of the document analysis system 101 according to Embodiment 2 of the present invention. The document analysis system 101 is a system that acquires information recorded on a predetermined computer or server and analyzes document information, consisting of a plurality of documents, contained in the acquired information.
As shown in FIG. 3, in addition to the control unit 10 described in Embodiment 1 (acquisition unit 11, generation unit 12, multiplication unit 13, calculation unit 14, extraction unit 15, summarization unit 16, phase specifying unit 17, and change estimation unit 18), the document analysis system 101 further includes a data storage unit 108 (digital information storage area 102, survey basic database 103, keyword database 104, related term database 105, score calculation database 106, and report creation database 107), a database management unit 109, an information extraction unit 24, a search unit 30, a document analysis unit 118, a survey category input receiving unit 20, a survey type determination unit 22, a presentation unit 130, a category selection unit 26, a first automatic classification unit 201, a second automatic classification unit 301, a classification code receiving and assigning unit 131, and a third automatic classification unit 401. The document analysis system 101 may further include a trend information generation unit 124, a quality inspection unit 501, a learning unit 601, a report creation unit 701, a lawyer review receiving unit 133, a language determination unit 120, and a translation unit 122.
The survey category input receiving unit 20 receives a category input by the user. When a category is input, the survey category input receiving unit 20 outputs the category to the survey type determination unit 22 and the category selection unit 26. Here, the category is an index by which each document included in the plurality of documents can be classified.
For example, the category is the type of lawsuit or fraud investigation, which represents the nature of the case underlying the lawsuit or fraud investigation and includes, for example, antitrust, patent, prohibition of foreign bribery (FCPA), product liability (PL), information leakage, and fictitious billing. Alternatively, the category may be an attribute of the document information, which represents the nature of the information contained in the document information, such as information on a competitor, prices, quotation sheets, price lists, or products. Alternatively, the category may be a phase into which documents are classified according to the progress of a predetermined act that gives rise to a lawsuit or fraud investigation.
Based on the category received by the survey category input receiving unit 20, the survey type determination unit 22 determines the category to be investigated and extracts the required types of information from the survey basic database 103. For example, when the document information consists of e-mails, presentation materials, spreadsheets, meeting materials, contracts, organization charts, or business plans, the survey type determination unit 22 outputs each of these to the information extraction unit 24 as a required type of information. The document analysis system 101 can thus extract the required types of information.
The information extraction unit 24 extracts a plurality of documents from the document information. Specifically, from the information input from the survey type determination unit 22 (for example, e-mails, presentation materials, spreadsheets, meeting materials, contracts, organization charts, and business plans), the information extraction unit 24 extracts the keywords and/or sentences contained in that information as information related to the lawsuit or fraud investigation, and stores the extraction result in the survey basic database 103. The information extraction unit 24 also outputs the extraction result to the control unit 10 as the document data 1. The document analysis system 101 can thus identify information related to the lawsuit or fraud investigation and hold it in a database.
The category selection unit 26 selects a category and outputs the selected category to the control unit 10. When a plurality of categories are assumed, the category selection unit 26 can select the categories one at a time in sequence.
When a category is input from the survey category input receiving unit 20, the category selection unit 26 can select the input category. This allows the document analysis system 101 to reliably select the category input by the user.
The presentation unit 130 presents the score 4 calculated by the control unit 10 (calculation unit 14) in a form the user can understand. For example, the presentation unit 130 can present the score 4 to the user by displaying it on the display unit 50 (not shown in FIG. 3). This allows the document analysis system 101 to let the user see which category the target document fits.
The search unit 30 searches the plurality of documents for keywords and/or sentences contained in the document information (document data 1). This allows the document analysis system 101 to extract the keywords and/or sentences contained in the document information.
When the search unit 30 searches for a keyword stored in the keyword database 104 and the information extraction unit 24 extracts a document containing that keyword from the document information, the first automatic classification unit 201 automatically assigns a specific classification code to the extracted document based on the keyword correspondence information.
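A minimal sketch of this keyword-driven assignment, assuming the keyword correspondence information is a mapping from keyword to classification code and that a document takes the code of the first matching keyword (both are assumptions for illustration):

```python
# Hypothetical keyword correspondence information: keyword -> classification code.
KEYWORD_TO_CODE = {"infringer": "important"}


def first_auto_classify(documents, keyword_to_code=KEYWORD_TO_CODE):
    """Assign a code to each document that contains a registered keyword."""
    codes = {}
    for doc_id, text in documents.items():
        for keyword, code in keyword_to_code.items():
            if keyword in text:
                codes[doc_id] = code
                break  # first matching keyword decides the code in this sketch
    return codes


DOCUMENTS = {
    "mail-1": "the alleged infringer was notified",
    "mail-2": "lunch at noon",
}
assigned = first_auto_classify(DOCUMENTS)
```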
When documents containing a related term stored in the related term database 105 are extracted from the document information and a score is calculated based on the evaluation values of the related terms contained in each extracted document and the number of those related terms, the second automatic classification unit 301 automatically assigns a predetermined classification code, based on the score and the related term correspondence information, to those documents containing the related term whose score exceeds a certain value.
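The threshold-based assignment can be sketched as follows. The score formula used here, the sum of each related term's evaluation value multiplied by its number of occurrences, is an assumption for illustration; the evaluation values, threshold, and names are likewise hypothetical.

```python
def related_term_score(words, term_values):
    """Sum of (evaluation value x occurrence count) over the related terms."""
    return sum(value * words.count(term) for term, value in term_values.items())


def second_auto_classify(documents, term_values, code, threshold):
    """Assign `code` to documents that contain a related term and exceed the threshold."""
    result = {}
    for doc_id, words in documents.items():
        if not any(term in words for term in term_values):
            continue  # only documents containing a related term are candidates
        doc_score = related_term_score(words, term_values)
        if doc_score > threshold:
            result[doc_id] = (code, doc_score)
    return result


# Hypothetical related terms for the classification code "product A".
TERM_VALUES = {"encoder": 2.0, "encoding": 1.5}
TOKENIZED = {
    "d1": ["encoder", "encoding", "encoder"],
    "d2": ["encoder"],
    "d3": ["lunch"],
}
classified = second_auto_classify(TOKENIZED, TERM_VALUES, "product A", threshold=3.0)
```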
For a plurality of documents extracted from the document information to which no classification code has been assigned, the classification code receiving and assigning unit 131 receives the classification codes that the user has chosen based on relevance to the lawsuit, and assigns those classification codes to the documents.
The document analysis unit 118 analyzes the documents to which classification codes have been assigned by the classification code receiving and assigning unit 131. In addition to the documents whose classification codes were received from the user based on relevance to the lawsuit, the document analysis unit 118 may also analyze the documents to which the first automatic classification unit 201 and the second automatic classification unit 301 automatically assigned classification codes based on keywords, related terms, and scores, and may integrate the two sets of documents to obtain a comprehensive analysis result. In this case, the third automatic classification unit 401 can automatically assign classification codes based on the comprehensive analysis result.
Note that classification and investigation work can proceed in various ways, such as automatic classification by word search, receiving classification and investigation input from the user, automatic classification and investigation using scores, automatic classification and investigation involving a learning process, and automatic classification and investigation involving quality assurance. The document analysis unit 118 may analyze the plurality of documents to which classification codes have been assigned together with a progress history indicating in what order and in what combination these various kinds of classification and investigation work were carried out, and the report creation unit 701 described later may report the result of that analysis.
Based on the result of the document analysis unit 118 analyzing the documents to which classification codes were assigned by the classification code receiving and assigning unit 131, the third automatic classification unit 401 automatically assigns classification codes to the plurality of documents extracted from the document information.
For use in the analysis by the document analysis unit 118, the trend information generation unit 124 generates trend information, which represents the degree to which each document is similar to the documents to which classification codes have been assigned, based on the types of words each document contains, their numbers of occurrences, and the evaluation values of the words.
The quality inspection unit 501 compares the classification code received by the classification code receiving and assigning unit 131 with the classification code assigned by the document analysis unit 118 based on the trend information, and verifies the validity of the classification code received by the classification code receiving and assigning unit 131.
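One way to picture this comparison, assuming both sets of classification codes are available as mappings from document ID to code (the function name and the agreement-rate output are assumptions for illustration):

```python
def inspect_quality(user_codes, predicted_codes):
    """Return the IDs whose user-assigned code differs from the predicted one,
    together with the agreement rate over documents present in both mappings."""
    shared = [doc_id for doc_id in user_codes if doc_id in predicted_codes]
    mismatches = [doc_id for doc_id in shared
                  if user_codes[doc_id] != predicted_codes[doc_id]]
    agreement = 1.0 - len(mismatches) / len(shared) if shared else 1.0
    return mismatches, agreement


USER_CODES = {"d1": "important", "d2": "unrelated", "d3": "important", "d4": "unrelated"}
PREDICTED_CODES = {"d1": "important", "d2": "important", "d3": "important", "d4": "unrelated"}
to_review, agreement_rate = inspect_quality(USER_CODES, PREDICTED_CODES)
```

Documents listed in `to_review` would be candidates for a second look by the reviewer.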
The learning unit 601 learns the weighting of each keyword or related term based on the results of classifying the documents. Specifically, the learning unit 601 learns the weighting of each keyword or related term according to Expression (3), based on the first to fourth processing results (described later). The learning unit 601 may reflect the learning results in the keyword database 104, the related term database 105, or the score calculation database 106.
Based on the results of classifying the documents, the report creation unit 701 outputs the investigation report best suited to the type of investigation, that is, the lawsuit or fraud investigation concerned. As described above, lawsuits include, for example, antitrust, patent, prohibition of foreign bribery (FCPA), and product liability (PL) cases. Fraud investigations include, for example, information leakage and fictitious billing.
The lawyer review receiving unit 133 receives a review by a chief attorney or chief patent attorney in order to improve the quality of the classification investigation and the report and to clarify responsibility for them.
The language determination unit 120 determines the language of each extracted document.
The translation unit 122 translates the extracted documents, either upon receiving a designation from the user or automatically. In this case, it is desirable that the unit into which the language determination unit 120 segments text be smaller than one sentence, so that text mixing multiple languages within a single sentence can also be handled. Either or both of predictive coding and character coding may be used for language determination. Furthermore, processing may be performed to exclude items such as HTML (HyperText Markup Language) headers from translation.
The data storage unit 108 stores digital information acquired from a plurality of computers or servers in the digital information storage area 102 for use in analyzing the lawsuit or fraud investigation. The data storage unit 108 also includes the survey basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107. As shown in FIG. 3, the data storage unit 108 may be a recording medium contained in the document analysis system 101, or it may be an external recording medium communicably connected to the document analysis system 101.
The survey basic database 103 holds, for example, a case attribute indicating whether the case is a lawsuit, such as antitrust, patent, prohibition of foreign bribery (Foreign Corrupt Practices Act; FCPA), or product liability (Products Liability; PL), and/or a fraud investigation, such as information leakage or fictitious billing; the company name; the person in charge; the custodian; and the layout of the investigation or classification input screen.
The keyword database 104 holds a specific classification code for documents contained in the acquired digital information, keywords closely related to that specific classification code, and keyword correspondence information indicating the correspondence between the specific classification code and the keywords.
The related term database 105 holds a predetermined classification code, related terms consisting of words that appear frequently in documents to which the predetermined classification code has been assigned, and related term correspondence information indicating the correspondence between the predetermined classification code and the related terms.
The score calculation database 106 holds the weights of the words contained in a document, which are used to calculate a score indicating the strength of the link between the document and a classification code.
The report creation database 107 holds report formats determined according to the category, the custodian, and the content of the classification work.
The database management unit 109 manages updates to the data in the survey basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107. The database management unit 109 may be connected to an information storage device 902 via a dedicated connection line or an Internet line 901. In this case, the database management unit 109 may update the data in the survey basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107 based on the data stored in the information storage device 902.
(Explanation of terms)
A "classification code" is an identifier used to classify documents; it indicates the degree of relevance to a lawsuit so that the documents can easily be used in that lawsuit. For example, when document information is used as evidence in a lawsuit, the classification code may be assigned according to the type of evidence.
A "document" is data containing one or more words, and may be, for example, an e-mail, presentation material, a spreadsheet, meeting material, a contract, an organization chart, or a business plan.
A "word" is the smallest unit of a character string that has meaning. For example, the sentence "A document means data including one or more words." contains the words "document", "one", "or more", "word", "include", "data", and "mean".
A "keyword" is a character string that has a certain meaning in a given language. For example, the keywords selected from the sentence "classify a document" can be "document" and "classify". In this embodiment, keywords such as "infringement", "lawsuit", and "Patent Publication No. XX" are selected preferentially. A "keyword" may include a morpheme.
"Keyword correspondence information" is information representing the correspondence between a keyword and a specific classification code. For example, when the classification code "important", which denotes a document that is important in a lawsuit, is closely related to the keyword "infringer", the keyword correspondence information may be information that associates the classification code "important" with the keyword "infringer" and manages them together.
A "related term" is a word that, among the words appearing frequently across the documents to which a predetermined classification code has been assigned, has an evaluation value of at least a certain value. Here, the appearance frequency may be, for example, the proportion of the total number of words in a single document accounted for by occurrences of the related term.
An "evaluation value" is a value indicating the amount of information each word conveys in a given document. The evaluation value may be calculated on the basis of the amount of transmitted information. For example, when a given product name is assigned as the classification code, the related terms may be the name of the technical field to which the product belongs, the countries where the product is sold, the names of similar products, and so on. Specifically, when the product name of a device that performs image encoding is assigned as the classification code, the related terms may include "encoding", "Japan", and "encoder".
"Related term correspondence information" is information representing the correspondence between a related term and a classification code. For example, when the classification code "product A", which is the name of the product at issue in a lawsuit, has the related term "image encoding", which is a function of product A, the related term correspondence information may be information that associates the classification code "product A" with the related term "image encoding" and manages them together.
As described above, a "score" is a value that quantitatively evaluates, for a given document, the strength of its link to a specific classification code. In each embodiment of the present invention, the score is calculated according to, for example, [Equation 1] to [Equation 3] described above.
The document analysis system 101 may extract words that appear frequently in documents sharing the same user-assigned classification code. It may then analyze, for each document, trend information consisting of the types of the extracted words that the document contains, the evaluation value of each word, and the numbers of occurrences, and assign the common classification code to those documents for which no classification code has been received by the classification code receiving and assigning unit 131 and which show the same tendency as the analyzed trend information.
Here, "trend information" is information representing the degree to which each document is similar to the documents to which a classification code has been assigned; it is expressed as a degree of relevance to the predetermined classification code, based on the types of words each document contains, their numbers of occurrences, and the evaluation values of the words. For example, when a document is similar to a document assigned a predetermined classification code in terms of its degree of relevance to that classification code, the two documents are said to have the same trend information. Documents that contain words with the same evaluation values in the same numbers of occurrences may also be treated as having the same tendency, even if the words themselves differ.
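The last case above, in which documents share a tendency when they contain words with the same evaluation values in the same numbers of occurrences, can be sketched as follows. Representing trend information as a sorted list of (evaluation value, occurrence count) pairs and requiring an exact match are simplifying assumptions for illustration; the evaluation values and names are hypothetical.

```python
from collections import Counter


def trend_profile(words, evaluation_values):
    """Sorted (evaluation value, count) pairs; the words themselves are discarded."""
    counts = Counter(w for w in words if w in evaluation_values)
    return tuple(sorted((evaluation_values[w], n) for w, n in counts.items()))


def propagate_codes(coded_docs, uncoded_docs, evaluation_values):
    """Give each uncoded document the code of a coded document with the same profile."""
    profile_to_code = {
        trend_profile(words, evaluation_values): code
        for words, code in coded_docs
    }
    assigned = {}
    for doc_id, words in uncoded_docs.items():
        profile = trend_profile(words, evaluation_values)
        if profile and profile in profile_to_code:
            assigned[doc_id] = profile_to_code[profile]
    return assigned


# Hypothetical evaluation values; "infringement" and "lawsuit" share a value.
EVALUATION_VALUES = {"infringement": 3.0, "lawsuit": 3.0, "price": 1.0}
CODED = [(["infringement", "infringement", "price"], "important")]
UNCODED = {
    "a": ["lawsuit", "lawsuit", "price", "hello"],
    "b": ["price"],
}
propagated = propagate_codes(CODED, UNCODED, EVALUATION_VALUES)
```

Document "a" uses different words from the coded document but has the same profile, so it receives the same classification code; document "b" does not. A practical system would more likely compare profiles with a similarity measure and a tolerance rather than exact equality.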
(Processing executed in the document analysis system 101)
FIG. 4 is a flowchart illustrating an example of processing executed by the document analysis system 101. The flow shown in FIG. 2 may be executed as processing independent of the flow shown in FIG. 4, or it may be executed as processing embedded at any point in the flow shown in FIG. 4.
In response to what is shown on the display screen of the display unit 50, a designation of an argument is received from the user, and the corresponding category can be specified from among lawsuits (including, for example, antitrust, patent, FCPA, and PL cases) and fraud investigations (including information leakage and fictitious billing) (S11). The databases to be used, such as the survey basic database and the document analysis database, can be specified according to the specified category (S12). To check whether the databases to be used are up to date, the information storage device 902, which stores the latest databases, can be accessed. The information storage device 902 may be installed inside or outside the organization that performs the classification. When installed outside the organization, it may be installed, for example, at an affiliated law firm or patent office.
When the information storage device 902 is accessed, authentication by ID and password can be performed to maintain security (S13). After authentication, access to the information storage device is permitted, and the databases in use, such as the survey basic database and the document analysis database, can be updated to the latest databases (S14). The updated survey basic database is searched (S15), and the company name and the names of the person in charge and the custodian can be presented on the screen of the display device (S16). If the displayed names of the person in charge and the custodian differ from the actual ones, the user corrects them on the screen of the display device. The document analysis system accepts the user's corrections and can thereby identify the names of the actual person in charge and custodian (S17).
Next, digital document information can be extracted in order to carry out the document analysis work (S18). The updated document analysis database, namely the updated keyword database, related term database, and score calculation database, is searched (S19), and classification codes can be assigned to the extracted document information (S20). Classification codes given by a reviewer can also be accepted and assigned to the extracted document information (S21). Using these classification results as teacher data, the database is searched and classification codes can be assigned to the extracted document information (S22). A review by a lead attorney or patent attorney can then be accepted (S23), which can improve the quality of the investigation. A category is identified from an argument specified by the user (S24), and the report creation database can be identified according to that category (S25). The identified report creation database determines the report format, and the report can be output automatically (S26).
FIG. 5 is a flowchart showing an example of the investigation and classification processing, according to investigation type, within the example processing shown in FIG. 4.
First, the investigation type can be input (S31). That is, in response to the display screen, the user inputs the category corresponding to the investigation and classification work to be carried out, chosen for example from litigation matters including antitrust, patent, foreign bribery prohibition (FCPA), and product liability (PL), or from fraud investigations including information leakage and fictitious billing. The document analysis system accepts the user's category input and can identify the category to be investigated.
Depending on the identified category, the type of investigation and document analysis processing and the type of database to be used can be determined (S32). Depending on the identified category, the stock of information stored in the databases in use, such as the survey basic database and the document analysis database, may be accessed (S33). The survey basic database is accessed according to the identified category, and the keyword input screens for that category can be displayed (S34). The survey basic database is likewise accessed to display the sentence input screens for that category (S35) and to extract keywords or documents according to the category (S36).
By executing the above processing, additional weighting can be applied to the teacher data for automatic classification code assignment (predictive coding) (S37). The extracted documents and information can be narrowed down by a keyword search of the document analysis database (S38).
FIG. 6 is a flowchart showing an example of predictive coding according to investigation type within the example processing shown in FIG. 4.
In the document analysis method according to the embodiment of the present invention, the document analysis system can first request input from the user according to the type of investigation and accept the user's input. For example, for a cartel in connection with antitrust law, the system can request and accept user input on the target products, the persons involved (names and e-mail addresses), the organizations involved (names and departments), and the time period. For the organizations involved, the system can additionally request and accept user input on competitor companies and customer companies (S51).
Next, the assignment of classification codes can be weighted by the input keywords (S52), and predictive coding can then be performed (S53). In the embodiment of the present invention, as an example, registration processing, classification processing, and inspection processing are performed in the first through fifth stages according to the flowchart shown in FIG. 7.
In the first stage, keywords and related terms are updated and registered in advance using the results of past classification processing (S100). The keywords and related terms are registered together with keyword correspondence information and related term correspondence information, which record the correspondence between classification codes and keywords or related terms.
In the second stage, documents containing the keywords registered in the first stage are extracted from the entire document information. When such a document is found, the updated keyword correspondence information recorded in the first stage is consulted and the classification code corresponding to the keyword is assigned. This is the first classification process (S200).
In the third stage, documents containing the related terms registered in the first stage are extracted from the document information that was not assigned a classification code in the second stage, and a score is calculated for each such document. The second classification process then assigns classification codes by referring to the calculated score and the related term correspondence information registered in the first stage (S300).
In the fourth stage, for document information not assigned a classification code by the end of the third stage, classification codes given by the user are accepted and assigned to that document information. Next, the third classification process is performed: the document information assigned the user-provided classification codes is analyzed, documents without a classification code are extracted based on the analysis result, and classification codes are assigned to the extracted documents. For example, words that appear frequently in documents sharing the same user-assigned classification code are extracted. For each document, trend information consisting of the types of extracted words it contains, the evaluation value of each word, and the occurrence counts is analyzed, and the common classification code is assigned to documents having the same tendency as that trend information (S400).
In the fifth stage, for the documents to which the user assigned classification codes in the fourth stage, the classification code that should be assigned is determined from the analyzed trend information, and this code is compared with the code the user assigned in order to verify the validity of the classification process (S500). If necessary, learning processing may also be performed based on the results of the document analysis processing.
The trend information used in the fourth and fifth stages represents the degree of similarity of each document to documents to which a classification code has been assigned, and is based on the types of words the document contains, their occurrence counts, and their evaluation values. For example, when a document is similar, in its degree of relevance to a given classification code, to a document assigned that classification code, the two documents are said to have the same trend information. Documents that contain different types of words may also be treated as having the same tendency if they contain words with the same evaluation values at the same occurrence counts.
The detailed processing flow of each of the first through fifth stages is described below.
<First stage (S100)>
The detailed processing flow for the keyword database 104 in the first stage is described with reference to FIG. 8.
Based on the results of classifying documents in past lawsuits, the keyword database 104 creates a management table for each classification code and identifies the keywords corresponding to each classification code (S111). In the embodiment of the present invention, this identification is made by analyzing the documents assigned each classification code and using the occurrence count and evaluation value of each keyword in those documents. However, a method using the amount of transmitted information carried by a keyword, manual selection by the user, or another method may be used instead.
In the embodiment of the present invention, for example, when "infringement" and "patent attorney" are identified as keywords for the classification code "important", keyword correspondence information is created indicating that "infringement" and "patent attorney" are closely related to the classification code "important" (S112). The identified keywords are then registered in the keyword database 104. At this time, each identified keyword is associated with its keyword correspondence information and recorded in the management table for the classification code "important" in the keyword database 104 (S113).
Next, the detailed processing flow for the related term database 105 is described with reference to FIG. 9. Based on the results of classifying documents in past lawsuits, the related term database 105 creates a management table for each classification code and registers the related terms corresponding to each classification code (S121). In the embodiment of the present invention, for example, "encoding process" and "product a" are registered as related terms of "product A", and "decoding" and "product b" are registered as related terms of "product B".
Related term correspondence information indicating which classification code each registered related term corresponds to is created (S122) and recorded in each management table (S123). The related term correspondence information also records the evaluation value of each related term and the threshold score required for a classification code to be assigned.
Before the actual classification work is performed, the keywords and keyword correspondence information, and the related terms and related term correspondence information, are updated to their latest versions and registered (S113, S123).
<Second stage (S200)>
The detailed processing flow of the first automatic classification unit 201 in the second stage is described with reference to FIG. 10. In the embodiment of the present invention, the first automatic classification unit 201 assigns the classification code "important" to documents in the second stage.
The first automatic classification unit 201 extracts, from the document information, documents containing the keywords "infringement" and "patent attorney" registered in the keyword database 104 in the first stage (S100) (S211). For each extracted document, it refers, via the keyword correspondence information, to the management table in which the keyword is recorded (S212) and assigns the classification code "important" (S213).
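The keyword-based first classification process (S211 to S213) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `KEYWORD_TO_CODE` and `first_classification` are hypothetical, the keyword correspondence information is modeled as a plain dictionary, and a document is treated as matching when it contains any one registered keyword, which the text does not specify.

```python
from typing import Dict

# Hypothetical keyword correspondence information: keyword -> classification code.
KEYWORD_TO_CODE: Dict[str, str] = {
    "infringement": "important",
    "patent attorney": "important",
}

def first_classification(documents: Dict[str, str]) -> Dict[str, str]:
    """Return {doc_id: code} for documents containing a registered keyword."""
    assigned: Dict[str, str] = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for keyword, code in KEYWORD_TO_CODE.items():
            if keyword in lowered:          # assumption: any one keyword suffices
                assigned[doc_id] = code     # look up the code tied to the keyword
                break
    return assigned

docs = {
    "d1": "The patent attorney reviewed the claim.",
    "d2": "Lunch menu for Friday.",
}
result = first_classification(docs)
```

Documents left out of `result` (here `d2`) are the ones that would be passed on to the second classification process.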
<Third stage (S300)>
The detailed processing flow of the second automatic classification unit 301 in the third stage is described with reference to FIG. 11.
In the embodiment of the present invention, the second automatic classification unit 301 assigns the classification codes "product A" and "product B" to the document information that was not assigned a classification code in the second stage (S200).
From that document information, the second automatic classification unit 301 extracts documents containing the related terms "encoding process", "product a", "decoding", and "product b" recorded in the related term database 105 in the first stage (S311). For each extracted document, the score calculation unit 116 calculates a score using Equation (1), based on the occurrence frequencies and evaluation values of the four recorded related terms (S312). The score represents the degree of relevance between each document and the classification codes "product A" and "product B".
When the score exceeds the threshold, the related term correspondence information is consulted (S313) and the appropriate classification code is assigned (S314).
For example, if in some document the related terms "encoding process" and "product a" occur frequently and the related term "encoding process" has a high evaluation value, so that the score indicating relevance to the classification code "product A" exceeds the threshold, the document is assigned the classification code "product A".
If, in addition, the related term "product b" also occurs frequently in the document and the score indicating relevance to the classification code "product B" exceeds the threshold, the document is assigned "product B" together with "product A". Conversely, if "product b" occurs infrequently in the document and the score indicating relevance to "product B" does not exceed the threshold, the document is assigned only the classification code "product A".
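The threshold-based assignment described above, in which one document can receive both "product A" and "product B", can be illustrated with the following sketch. Equation (1) is not reproduced in this section, so the score here is an assumed stand-in (the sum of occurrence count times evaluation value). The table `RELATED_TERMS`, its thresholds and evaluation values, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical related term correspondence information:
# classification code -> (threshold, {related term: evaluation value}).
RELATED_TERMS: Dict[str, Tuple[float, Dict[str, float]]] = {
    "product A": (3.0, {"encoding process": 2.0, "product a": 1.0}),
    "product B": (3.0, {"decoding": 2.0, "product b": 1.0}),
}

def score(text: str, weights: Dict[str, float]) -> float:
    """Assumed stand-in for Equation (1): sum of count * evaluation value."""
    return sum(text.count(term) * value for term, value in weights.items())

def second_classification(text: str) -> List[str]:
    """Return every code whose score exceeds its threshold (possibly several)."""
    codes = []
    for code, (threshold, weights) in RELATED_TERMS.items():
        if score(text, weights) > threshold:
            codes.append(code)
    return codes

only_a = second_classification(
    "The encoding process of product a uses the encoding process twice. "
    "product b is mentioned once."
)
both = second_classification(
    "encoding process encoding process decoding decoding product b"
)
```

In the first example the "product B" score stays below its threshold, so only "product A" is assigned. In the second, both scores exceed their thresholds and both codes are assigned.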
The second automatic classification unit 301 recalculates the evaluation values of the related terms by [Equation 4] below, using the score calculated in S432 of the fourth stage, and thereby weights the evaluation values (S315).
[Equation 4: equation image JPOXMLDOC01-appb-M000004, not reproduced in the text]
Here, w_{i,L} denotes the weight of the i-th selected keyword after the L-th round of learning, γ_L denotes the learning parameter in the L-th round of learning, and θ denotes the learning-effect threshold. For example, when more than a certain number of documents occur in which "decoding" appears very frequently but the score is lower by a certain amount or more, the evaluation value of the related term "decoding" is lowered and recorded again in the related term correspondence information.
<Fourth stage (S400)>
In the fourth stage, as shown in FIG. 12, a fixed proportion of document information is extracted from the document information that was not assigned a classification code in the processing up to the third stage, classification codes for it are accepted from a reviewer, and the accepted codes are assigned to that document information. Next, as shown in FIG. 13, the document information assigned the reviewer's classification codes is analyzed, and based on the analysis result, classification codes are assigned to document information that has none. In the embodiment of the present invention, the fourth stage assigns, for example, the classification codes "important", "product A", and "product B" to the document information. The fourth stage is described further below.
The detailed processing flow of the classification code receiving and assigning unit 131 in the fourth stage is described with reference to FIG. 12. First, the information extraction unit 24 randomly samples documents from the document information to be processed in the fourth stage and displays them on the display unit 50. In the embodiment of the present invention, 20% of the documents to be processed are extracted at random and presented to the reviewer for classification. Alternatively, sampling may be done by ordering the documents by creation date and time or by name and selecting the top 30%.
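The two sampling approaches mentioned above can be sketched as follows. The function names `sample_for_review` and `top_fraction` are hypothetical, documents are represented only by identifier strings, and the sample size is rounded down, which is an assumption.

```python
import random
from typing import List, Optional

def sample_for_review(doc_ids: List[str], fraction: float = 0.2,
                      seed: Optional[int] = None) -> List[str]:
    """Randomly pick a fraction of the documents for reviewer classification."""
    rng = random.Random(seed)
    k = int(len(doc_ids) * fraction)   # assumption: round down
    return rng.sample(doc_ids, k)

def top_fraction(doc_ids: List[str], fraction: float = 0.3) -> List[str]:
    """Alternative: order the documents (here by name) and take the top fraction."""
    ordered = sorted(doc_ids)
    k = int(len(ordered) * fraction)
    return ordered[:k]

ids = [f"doc{i:02d}" for i in range(10)]
random_sample = sample_for_review(ids, seed=0)
ordered_sample = top_fraction(ids)
```

Sorting by creation date and time instead of by name would only require a different sort key.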
The user views the document display screen shown in FIG. 18 on the display unit 50 and selects the classification code to assign to each document. The classification code receiving and assigning unit 131 accepts the classification codes selected by the user (S411) and classifies the documents according to the assigned codes (S412).
Next, the detailed processing flow of the document analysis unit 118 is described with reference to FIG. 13. The document analysis unit 118 extracts words that appear frequently and in common in the documents grouped by classification code by the classification code receiving and assigning unit 131 (S421). It analyzes the evaluation values of the extracted common words using Equation (2) (S422) and analyzes the occurrence frequency of those words in the documents (S423).
Then, based on the results of S422 and S423, the trend information of the documents assigned the classification code "important" is analyzed (S424).
FIG. 14 is a graph of the result of analyzing, in S424, the words that appear frequently and in common in the documents assigned the classification code "important".
In FIG. 14, the vertical axis R_hot indicates, among all documents to which the user assigned the classification code "important", the proportion of documents that contain a word selected as being tied to the classification code "important". The horizontal axis, R_all, indicates, among all documents the user classified, the proportion of documents containing the word extracted in S421 by the classification code receiving and assigning unit 131.
In the embodiment of the present invention, the classification code receiving and assigning unit 131 extracts the words plotted above the straight line R_hot = R_all as the common words for the classification code "important".
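The selection of words lying above the line R_hot = R_all can be illustrated with the sketch below. It assumes one reading of the definitions above: R_hot is the share of "important" documents containing the word, and R_all is the share of all reviewed documents containing it. The function name `select_common_words` and the representation of each document as a set of words are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def select_common_words(
    reviewed: List[Tuple[Set[str], str]], code: str = "important"
) -> Dict[str, Tuple[float, float]]:
    """Return {word: (R_hot, R_all)} for words with R_hot > R_all."""
    hot_docs = [words for words, c in reviewed if c == code]
    all_docs = [words for words, _ in reviewed]
    if not hot_docs:
        return {}
    vocabulary = set().union(*all_docs)
    selected = {}
    for word in vocabulary:
        r_hot = sum(word in d for d in hot_docs) / len(hot_docs)
        r_all = sum(word in d for d in all_docs) / len(all_docs)
        if r_hot > r_all:              # plotted above the line R_hot = R_all
            selected[word] = (r_hot, r_all)
    return selected

reviewed_docs = [
    ({"infringement", "claim"}, "important"),
    ({"infringement", "meeting"}, "important"),
    ({"meeting", "lunch"}, "other"),
    ({"claim", "lunch"}, "other"),
]
common = select_common_words(reviewed_docs)
```

A word that is equally common in "important" documents and in the reviewed set as a whole lies on the line and is not selected.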
The processing of S421 through S424 is also executed for the documents assigned the classification codes "product A" and "product B", and their trend information is analyzed.
Next, the detailed processing flow of the third automatic classification unit 401 is described with reference to FIG. 15. The third automatic classification unit 401 processes those documents, among the document information to be processed in the fourth stage, for which no classification code was accepted by the classification code receiving and assigning unit 131 in S411. From these documents, it extracts the documents that have the same trend information as the trend information, analyzed in S424, of the documents assigned the classification codes "important", "product A", and "product B" (S431), and calculates a score for each extracted document using Equation (1) based on the trend information (S432). It also assigns the appropriate classification code to each document extracted in S431 based on the trend information (S433).
The third automatic classification unit 401 further reflects the classification result in each database using the score calculated in S432 (S434). Specifically, it may lower the evaluation values of keywords and related terms contained in low-scoring documents and raise the evaluation values of keywords and related terms contained in high-scoring documents.
Another example of the detailed processing flow of the third automatic classification unit 401 is described with reference to FIG. 16. The third automatic classification unit 401 may perform classification on those documents, among the document information to be processed in the fourth stage, for which no classification code was accepted by the classification code receiving and assigning unit 131 in S411. When no argument is given (S441: none), it extracts from these documents those having the same trend information as the trend information, analyzed in S424, of the documents assigned the classification code "important" (S442), and calculates a score for each extracted document using Equation (1) based on the trend information (S443). It also assigns the appropriate classification code to each document extracted in S442 based on the trend information (S444).
The third automatic classification unit 401 further reflects the classification result in each database using the score calculated in S443 (S445). Specifically, it lowers the evaluation values of keywords and related terms contained in low-scoring documents and raises the evaluation values of keywords and related terms contained in high-scoring documents.
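Reflecting scores back into the evaluation values (S434, S445) might look like the following sketch. The text states only the direction of the adjustment, so the multiplicative step, the `low` and `high` score bounds, and the function name `reflect_score` are all assumptions made for illustration. This is not the update rule of [Equation 4].

```python
from typing import Dict, Iterable

def reflect_score(weights: Dict[str, float], terms_in_doc: Iterable[str],
                  doc_score: float, low: float = 1.0, high: float = 5.0,
                  step: float = 0.1) -> Dict[str, float]:
    """Return a copy of weights, raised for high-scoring documents and
    lowered for low-scoring ones; scores in between leave weights unchanged."""
    factor = 1.0
    if doc_score >= high:
        factor = 1.0 + step
    elif doc_score <= low:
        factor = 1.0 - step
    updated = dict(weights)
    for term in terms_in_doc:
        if term in updated:
            updated[term] = updated[term] * factor
    return updated

weights = {"decoding": 2.0, "product b": 1.0}
raised = reflect_score(weights, ["decoding"], doc_score=6.0)
lowered = reflect_score(weights, ["decoding"], doc_score=0.5)
```

Returning a copy rather than modifying the input keeps the previous evaluation values available, for example for comparison before the databases are updated.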
As described above, scores are calculated in both the second automatic classification unit 301 and the third automatic classification unit 401. When the number of score calculations becomes large, the data for score calculation may be stored together in the score calculation database 106.
<Fifth stage (S500)>
The detailed processing flow of the quality inspection unit 501 in the fifth stage is described with reference to FIG. 17. For the documents whose classification codes the classification code receiving and assigning unit 131 accepted in S411, the quality inspection unit 501 determines the classification code that should be assigned, based on the trend information analyzed by the document analysis unit 118 in S424 (S511).
The classification code accepted by the classification code receiving and assigning unit 131 is compared with the classification code determined in S511 (S512), and the validity of the classification code accepted in S411 is verified (S513).
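The comparison in S512 and S513 can be sketched as a simple agreement check. The function name `verify_classification` and the use of an agreement rate as the measure of validity are assumptions. The text does not state how validity is quantified.

```python
from typing import Dict, List, Tuple

def verify_classification(reviewer_codes: Dict[str, str],
                          predicted_codes: Dict[str, str]
                          ) -> Tuple[float, List[str]]:
    """Compare reviewer-assigned codes with codes derived from trend information.

    Returns the agreement rate and the ids of documents whose codes differ.
    """
    mismatches = [doc_id for doc_id, code in reviewer_codes.items()
                  if predicted_codes.get(doc_id) != code]
    if not reviewer_codes:
        return 0.0, mismatches
    agreement = 1 - len(mismatches) / len(reviewer_codes)
    return agreement, mismatches

reviewer = {"d1": "important", "d2": "product A", "d3": "product B", "d4": "important"}
predicted = {"d1": "important", "d2": "product A", "d3": "product A", "d4": "important"}
agreement, mismatches = verify_classification(reviewer, predicted)
```

The mismatched documents (here `d3`) are the natural candidates for a second look by the reviewer.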
The document analysis system 101 according to the embodiment of the present invention may include a learning unit 601. The learning unit 601 learns the weighting of each keyword or related term by Equation (2), based on the results of the first through fourth processes. The learning result may be reflected in the keyword database 104, the related term database 105, or the score calculation database 106.
The document analysis system 101 according to the embodiment of the present invention can include a report creation unit 701 that, based on the results of the document analysis processing, outputs an investigation report best suited to the investigation type, whether a litigation matter (for example, cartel, patent, FCPA, or PL litigation) or a fraud investigation (for example, information leakage or fictitious billing).
The content to be investigated differs depending on the investigation type.
For example, in a cartel matter, the key points are:
1. When and how did the competitors' representatives communicate about the cartel (price coordination)?
2. Who are the persons involved, and which organizations do they belong to?
In a patent infringement matter, the key points are:
1. Is the content the same as the technology that is the subject of the infringement?
2. Who infringed or did not infringe, when, and with (or without) what intent?
A document investigation report system, a document investigation report method, and a document investigation report program according to another example of the embodiment of the present invention are described below.
In the document investigation report system according to this other example, documents that have already been assigned classification codes are analyzed with respect to similar search information, and the range over which classification codes are assigned is adjusted based on the analysis result. Classification work and investigation work are then carried out based on the adjusted range, and a report is created from the results of that work.
 There are two methods of adjusting the range to which classification codes are assigned in relation to similar search information: clustering the similar search information, and learning from classification results to perform predictive classification. In the clustering method, for example, attention is paid to the commonality of metadata, and a common classification code may be assigned to an original document, a reply to the original document, and a reply to that reply. In the predictive classification method, the system learns to integrate similar search information in the classification results, so that the same or similar classification codes are assigned to similar search information.
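The metadata-based case described above (original document, reply, reply to the reply) can be illustrated with a minimal Python sketch. The `Document` class, the `in_reply_to` field, and the function names are hypothetical and chosen only for illustration; the source does not specify which metadata is used or how threads are represented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Document:
    doc_id: str
    in_reply_to: Optional[str] = None  # hypothetical metadata field linking a reply to its parent
    code: Optional[str] = None         # classification code, if one has already been assigned


def thread_root(doc_id: str, docs: Dict[str, Document]) -> str:
    """Follow in_reply_to links upward until the original document is reached."""
    seen = set()
    while docs[doc_id].in_reply_to in docs and doc_id not in seen:
        seen.add(doc_id)  # guard against malformed, cyclic metadata
        doc_id = docs[doc_id].in_reply_to
    return doc_id


def propagate_codes(documents: List[Document]) -> None:
    """Give uncoded documents the code already assigned within the same thread."""
    docs = {d.doc_id: d for d in documents}
    code_by_root: Dict[str, str] = {}
    # First pass: remember the first code seen in each thread.
    for d in documents:
        if d.code is not None:
            code_by_root.setdefault(thread_root(d.doc_id, docs), d.code)
    # Second pass: copy that code to uncoded documents of the same thread.
    for d in documents:
        if d.code is None:
            d.code = code_by_root.get(thread_root(d.doc_id, docs))
```

In this sketch a document outside any coded thread keeps `code = None`, so it would still go through ordinary classification.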
 In another example of the embodiment of the present invention, the reliability of the analysis result varies with the number of documents analyzed. A statistical method may be applied to the total number of documents to be classified in order to determine at what point, and for what proportion of all documents, the range to which classification codes are assigned is adjusted based on the analysis result.
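The source does not name the statistical method. As one possible illustration only, the sketch below uses the standard sample-size formula for estimating a proportion with a finite population correction to decide how many documents to analyze before adjusting the range. The function name and the default values (95% confidence, 5% margin of error, worst-case proportion 0.5) are assumptions, not values taken from the source.

```python
import math


def sample_size(total_docs: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Number of documents to analyze before adjusting the coding range.

    Uses n0 = z^2 * p * (1 - p) / margin^2, corrected for a finite
    population of total_docs documents. All defaults are illustrative.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / total_docs)
    return math.ceil(n)


total = 10_000
needed = sample_size(total)   # 370 documents under the default assumptions
proportion = needed / total   # share of all documents to analyze first
```

Under these assumptions, roughly 3.7% of a 10,000-document set would be analyzed before the range is adjusted; a tighter margin of error raises that share.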
 In another example of the embodiment of the present invention, both methods of adjusting the range to which classification codes are assigned in relation to similar search information, namely clustering the search information and learning from classification results to perform predictive classification, may be executed together to adjust the range of documents to which classification codes are assigned.
 In the document investigation report system, the document investigation report method, and the document investigation report program according to another example of the embodiment of the present invention, a report is created based on the results of this classification work and investigation.
 As a result, the document investigation report system, the document investigation report method, and the document investigation report program according to another example of the embodiment of the present invention make it possible to create an accurate investigation report quickly and reduce the burden of the classification work and the report creation work.
 Another example of the embodiment of the present invention may include a display screen control unit that controls a display screen presenting to the user the type of information extracted by the investigation type determination unit.
 Another example of the embodiment of the present invention may include an input receiving unit that accepts keywords and/or sentences entered by the user for the type of information presented by the display screen control unit.
 The embodiment of the present invention accepts user input specifying the category of a litigation case or fraud investigation case and automatically updates the database according to that category. This reduces the clerical burden of entering the names of the persons in charge, custodians, and so on. In addition, search words are adjusted using the database that was automatically updated according to the category, and classification codes are automatically assigned to the document information using the adjusted search words. This reduces the burden of classifying the document information used in litigation or fraud investigation cases. In other words, the present invention makes it easier to analyze document information used in litigation.
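The flow described above (category input, database update, search word adjustment, automatic code assignment) might look like the following minimal Python sketch. The category names, the search words, the dictionary-based "database", and the function names are all hypothetical; the source does not disclose the actual database contents or matching rules, and plain substring matching is used here only for simplicity.

```python
from typing import Dict, List, Optional

# Hypothetical per-category search words; the source does not list actual entries.
CATEGORY_TERMS: Dict[str, List[str]] = {
    "cartel": ["price adjustment", "competitor"],
    "information leakage": ["confidential", "personal data"],
}


def update_database(database: Dict[str, List[str]], category: str) -> None:
    """Replace the search words in the database with those for the chosen category."""
    database["search_words"] = list(CATEGORY_TERMS.get(category, []))


def assign_code(text: str, database: Dict[str, List[str]], code: str = "relevant") -> Optional[str]:
    """Return the classification code if any adjusted search word occurs in the text."""
    lowered = text.lower()
    if any(word in lowered for word in database.get("search_words", [])):
        return code
    return None


db: Dict[str, List[str]] = {}
update_database(db, "cartel")  # the user selected the "cartel" category
label = assign_code("Call the competitor tomorrow.", db)
```

Keeping the category-to-words mapping separate from the matching step mirrors the description: changing the category changes the search words without touching the code assignment logic.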
 [Example of software implementation]
 The control blocks (in particular, the control unit 10) of the document analysis system 100 and the document analysis system 101 may be realized by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or by software using a CPU (Central Processing Unit). In the latter case, the document analysis systems 100 and 101 include a CPU that executes the instructions of a program (the control program of the document analysis systems 100 and 101), which is software realizing each function; a ROM (Read Only Memory) or storage device (referred to as a "recording medium") in which the program and various data are recorded so as to be readable by a computer (or CPU); a RAM (Random Access Memory) into which the program is loaded; and so on. The object of the present invention is achieved when the computer (or CPU) reads the program from the recording medium and executes it. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may also be supplied to the computer via any transmission medium capable of transmitting it (such as a communication network or a broadcast wave). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.
 Specifically, the control program of the document analysis system according to the embodiment of the present invention is a control program of a document analysis system that analyzes a document, and causes a computer (the document analysis system 100) to implement a generation function, a multiplication function, and a calculation function.
 The generation function, the multiplication function, and the calculation function can be realized by the generation unit 12, the multiplication unit 13, and the calculation unit 14, respectively. The details of each are as described above.
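The three functions can be sketched in Python as follows, based on how the claims describe them: a keyword vector per sentence, multiplication by a correlation matrix, summation of the resulting correlation vectors, and an inner product with a weight vector to obtain the score. This is a minimal sketch under stated assumptions, not the system's actual implementation: the keywords, matrix values, weights, and function names are invented for illustration, and keyword detection is simplified to case-insensitive substring matching.

```python
from typing import List


def keyword_vector(sentence: str, keywords: List[str]) -> List[int]:
    """Generation function: 1 if the keyword occurs in the sentence, else 0."""
    lowered = sentence.lower()
    return [1 if kw in lowered else 0 for kw in keywords]


def correlate(vec: List[int], matrix: List[List[float]]) -> List[float]:
    """Multiplication function: row vector times correlation matrix."""
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(vec)))
        for j in range(len(matrix[0]))
    ]


def score_document(
    sentences: List[str],
    keywords: List[str],
    correlation: List[List[float]],
    weights: List[float],
) -> float:
    """Calculation function: sum the correlation vectors, then take the
    inner product of that sum with the weight vector."""
    total = [0.0] * len(correlation[0])
    for sentence in sentences:
        corr_vec = correlate(keyword_vector(sentence, keywords), correlation)
        total = [t + c for t, c in zip(total, corr_vec)]
    return sum(t * w for t, w in zip(total, weights))


# Illustrative values only; none of these come from the source.
keywords = ["price", "adjust"]
correlation = [[1.0, 0.5], [0.5, 1.0]]
weights = [2.0, 1.0]
sentences = ["We should adjust the price.", "See you at lunch."]
score = score_document(sentences, keywords, correlation, weights)  # 4.5
```

Here the first sentence yields the keyword vector [1, 1] and the correlation vector [1.5, 1.5], while the second contributes nothing, so the score is 1.5 × 2.0 + 1.5 × 1.0 = 4.5.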
 [Additional notes]
 The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.
 The present invention can be widely applied to any computer, such as a personal computer, a workstation, or a mainframe.
 Reference signs: 1: document data (document), 2: keyword vector, 3: correlation vector, 4: score, 5: most-keyword sentence (the sentence corresponding to the keyword vector indicating that the predetermined keyword is contained most), 6: summary information (summary), 7: phase information (phase), 8: change information (change in phase), 12: generation unit, 13: multiplication unit, 14: calculation unit, 15: extraction unit, 16: summarizing unit, 17: phase identification unit, 18: change estimation unit, 100: document analysis system, 101: document analysis system

Claims (9)

  1.  A document analysis system for analyzing a document, comprising:
     a generation unit that generates, for each sentence included in the document, a keyword vector indicating whether or not a predetermined keyword is included in the sentence;
     a multiplication unit that obtains a correlation vector for each sentence by multiplying each keyword vector generated by the generation unit by a correlation matrix indicating the correlation between the predetermined keyword and another keyword different from the predetermined keyword; and
     a calculation unit that calculates, based on a value obtained by summing all the correlation vectors obtained by the multiplication unit, a score indicating the strength with which a classification code, which indicates the degree of relevance between the document and a predetermined case, is associated with the document.
  2.  The document analysis system according to claim 1, wherein the calculation unit calculates the score by calculating the inner product of the summed value and a weight vector indicating weights for the predetermined keyword.
  3.  The document analysis system according to claim 1 or 2, further comprising an extraction unit that extracts, from the document, the sentence corresponding to the keyword vector indicating that the predetermined keyword is contained most.
  4.  The document analysis system according to any one of claims 1 to 3, further comprising a summarizing unit that generates a summary of the document by listing the sentences in the document that correspond to keyword vectors indicating that the predetermined keyword is included.
  5.  The document analysis system according to any one of claims 1 to 4, further comprising an identification unit that identifies, based on the score calculated by the calculation unit, a phase that classifies a predetermined act causing the predetermined case according to the progress of the predetermined act.
  6.  The document analysis system according to claim 5, further comprising a change estimation unit that estimates a change in the phase identified by the phase identification unit, based on the temporal transition of the phase.
  7.  The document analysis system according to any one of claims 1 to 6, further comprising a code assigning unit that assigns a classification code to the document based on the score calculated by the calculation unit.
  8.  A method for controlling a document analysis system that analyzes a document, comprising:
     a generation step of generating, for each sentence included in the document, a keyword vector indicating whether or not a predetermined keyword is included in the sentence;
     a multiplication step of obtaining a correlation vector for each sentence by multiplying each keyword vector generated in the generation step by a correlation matrix indicating the correlation between the predetermined keyword and another keyword different from the predetermined keyword; and
     a calculation step of calculating, based on a value obtained by summing all the correlation vectors obtained in the multiplication step, a score indicating the strength with which a classification code, which indicates the degree of relevance between the document and a predetermined case, is associated with the document.
  9.  A control program for a document analysis system that analyzes a document, the program causing a computer to implement:
     a generation function of generating, for each sentence included in the document, a keyword vector indicating whether or not a predetermined keyword is included in the sentence;
     a multiplication function of obtaining a correlation vector for each sentence by multiplying each keyword vector generated by the generation function by a correlation matrix indicating the correlation between the predetermined keyword and another keyword different from the predetermined keyword; and
     a calculation function of calculating, based on a value obtained by summing all the correlation vectors obtained by the multiplication function, a score indicating the strength with which a classification code, which indicates the degree of relevance between the document and a predetermined case, is associated with the document.
PCT/JP2014/062743 2014-05-13 2014-05-13 Document analysis system, control method for document analysis system, and control program for document analysis system WO2015173894A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2014/062743 WO2015173894A1 (en) 2014-05-13 2014-05-13 Document analysis system, control method for document analysis system, and control program for document analysis system
JP2015510547A JP5815911B1 (en) 2014-05-13 2014-05-13 Document analysis system, document analysis system control method, and document analysis system control program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/062743 WO2015173894A1 (en) 2014-05-13 2014-05-13 Document analysis system, control method for document analysis system, and control program for document analysis system

Publications (1)

Publication Number Publication Date
WO2015173894A1 true WO2015173894A1 (en) 2015-11-19

Family

ID=54479466

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/062743 WO2015173894A1 (en) 2014-05-13 2014-05-13 Document analysis system, control method for document analysis system, and control program for document analysis system

Country Status (2)

Country Link
JP (1) JP5815911B1 (en)
WO (1) WO2015173894A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102615420B1 (en) * 2022-11-16 2023-12-19 에이치엠컴퍼니 주식회사 Automatic analysis device for legal documents based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003016106A (en) * 2001-06-29 2003-01-17 Fuji Xerox Co Ltd Device for calculating degree of association value
JP2009098811A (en) * 2007-10-15 2009-05-07 Toshiba Corp Document sorting apparatus and program
WO2013129548A1 (en) * 2012-02-29 2013-09-06 株式会社Ubic Document classification system, document classification method, and document classification program
WO2014057965A1 (en) * 2012-10-09 2014-04-17 株式会社Ubic Forensic system, forensic method, and forensic program

Also Published As

Publication number Publication date
JPWO2015173894A1 (en) 2017-04-20
JP5815911B1 (en) 2015-11-17

Legal Events

Date Code Title Description
ENP Entry into the national phase (Ref document number: 2015510547; Country of ref document: JP; Kind code of ref document: A)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14891798; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14891798; Country of ref document: EP; Kind code of ref document: A1)