WO2015118619A1 - Document analysis system, document analysis method, and document analysis program - Google Patents

Document analysis system, document analysis method, and document analysis program

Info

Publication number
WO2015118619A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
information
lawsuit
investigation
classification code
Prior art date
Application number
PCT/JP2014/052581
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
守本 正宏
秀樹 武田
和巳 蓮子
彰晃 花谷
菜々子 吉田
Original Assignee
株式会社Ubic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Ubic
Priority to PCT/JP2014/052581 (published as WO2015118619A1)
Priority to TW104103850A (published as TW201539217A)
Publication of WO2015118619A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]

Definitions

  • The present invention relates to a document analysis system, a document analysis method, and a document analysis program.
  • Patent Document 1 discloses a forensic system that designates a specific person from among at least one user included in user information, extracts only the digital document information accessed by that person based on access history information for the person, sets accompanying information indicating whether each document file in the extracted digital document information is related to a lawsuit, and outputs the document files related to the lawsuit based on the accompanying information.
  • Patent Document 2 discloses a forensic system that displays recorded digital information, sets for each of a plurality of document files user identification information indicating which of the users included in user information the file relates to, records the set user identification information in a storage unit, designates at least one user, searches for the document files in which the user identification information corresponding to the designated user is set, sets accompanying information indicating whether each retrieved document file is related to a lawsuit, and outputs the document files related to the lawsuit based on the accompanying information.
  • Patent Document 3 discloses a forensic system that accepts designation of at least one document file included in digital document information, accepts designation of the language into which the designated document file is to be translated, translates the designated document file into that language, extracts from the digital document information recorded in a recording unit a common document file showing the same content as the designated document file, generates translation-related information indicating that the extracted common document file has been translated by reusing the translation of the designated document file, and outputs the document files related to a lawsuit based on the translation-related information.
  • In a forensic system such as that of Patent Document 1, a huge amount of document information is collected from users of a plurality of computers and servers.
  • An object of the present invention is to provide a document analysis system, a document analysis method, and a document analysis program that facilitate analysis of document information used in a lawsuit.
  • The document analysis system of the present invention acquires information recorded in a predetermined computer or server and analyzes document information that is included in the acquired information and is composed of a plurality of documents. The system includes: an investigation basic database that stores, for each phase classified according to the progress of a predetermined action that causes a lawsuit or fraud investigation, a generation process model of the process by which the predetermined action occurs, that further stores information related to the lawsuit or fraud investigation for each category to which the lawsuit or fraud investigation belongs and for each generation process model, and that further stores time-series information indicating the temporal order of the phases; an analysis unit that analyzes the document information based on the information related to the lawsuit or fraud investigation, the generation process model, and the time-series information; and a calculation unit that calculates, from the result of the analysis, an index indicating the possibility that the predetermined action will occur.
  • The document analysis system may further include an investigation category input reception unit that receives input of the category of the lawsuit or fraud investigation, and an investigation type determination unit that determines the category to be investigated based on the category received by the investigation category input reception unit and extracts the type of necessary information from the investigation basic database.
  • The document analysis system may further include an information extraction unit that extracts keywords and/or sentences included in the document information from the document information as information related to the lawsuit or fraud investigation.
  • The document analysis system may further include a search unit that searches the plurality of documents for the keywords and/or sentences.
  • The document analysis system may further include an automatic classification code assigning unit that automatically assigns a classification code to each of the plurality of documents, and the keywords and/or sentences may be used for assigning the classification code.
  • The document analysis method of the present invention acquires information recorded in a predetermined computer or server and analyzes document information composed of a plurality of documents included in the acquired information. The method includes: an analysis step of analyzing the document information based on information related to a lawsuit or fraud investigation, a generation process model, and time-series information, by referring to an investigation basic database that stores, for each phase classified according to the progress of a predetermined action that causes the lawsuit or fraud investigation, the generation process model of the process by which the predetermined action occurs, that further stores the information related to the lawsuit or fraud investigation for each category to which the lawsuit or fraud investigation belongs and for each generation process model, and that further stores the time-series information indicating the temporal order of the phases; and a calculation step of calculating, from the result of the analysis, an index indicating the possibility that the predetermined action will occur.
  • The document analysis program of the present invention is a program for acquiring information recorded in a predetermined computer or server and analyzing document information composed of a plurality of documents included in the acquired information. The program causes a computer to realize: an analysis function of analyzing the document information based on information related to a lawsuit or fraud investigation, a generation process model, and time-series information, by referring to an investigation basic database that stores, for each phase classified according to the progress of a predetermined action that causes the lawsuit or fraud investigation, the generation process model of the process by which the predetermined action occurs, that further stores the information related to the lawsuit or fraud investigation for each category to which the lawsuit or fraud investigation belongs and for each generation process model, and that further stores the time-series information indicating the temporal order of the phases; and a calculation function of calculating, from the result of the analysis, an index indicating the possibility that the predetermined action will occur.
  • According to the document analysis system, the document analysis method, and the document analysis program of the present invention, it is possible to facilitate the analysis of document information used in a lawsuit.
  • FIG. 1 is a block diagram showing a main configuration of a document analysis system according to an embodiment of the present invention.
  • FIG. 2 is a table showing a list of possible phases in the present embodiment.
  • FIG. 3(a) is a schematic diagram showing that the process in which the predetermined action occurs is modeled as the generation process model for each phase, and FIG. 3(b) is a schematic diagram showing that information related to the lawsuit or fraud investigation is stored for each category to which the lawsuit or fraud investigation belongs and for each generation process model.
  • FIG. 4 is a detailed configuration diagram of the document analysis system according to the embodiment of the present invention.
  • FIG. 5 is a chart showing the flow of processing of the document analysis method according to the embodiment of the present invention.
  • FIG. 6 is a chart showing the flow of detailed processing in the document analysis method according to the embodiment of the present invention.
  • A chart showing the flow of investigation and classification processing according to the investigation type in the document analysis method according to the embodiment of the present invention.
  • A chart showing the flow of predictive coding according to the investigation type in the document analysis method according to the embodiment of the present invention.
  • A chart showing the flow of processing for each stage in the embodiment.
  • A chart showing the processing flow of the keyword database in the embodiment.
  • A chart showing the processing flow of the related term database in the embodiment.
  • A chart showing the processing flow of the first automatic classification unit in the embodiment.
  • A graph showing the analysis results of the document analysis unit in the embodiment.
  • A chart showing the processing flow of the third automatic classification unit in one example of the embodiment.
  • FIG. 1 is a block diagram showing a main configuration of a document analysis system 1 according to an embodiment of the present invention.
  • The document analysis system 1 acquires information recorded in a predetermined computer or server and analyzes document information that is included in the acquired information and is composed of a plurality of documents.
  • The document analysis system 1 includes an investigation category input reception unit 20, an investigation type determination unit 22, an information extraction unit 24, an investigation basic database 103, an analysis unit 26, a calculation unit 28, a search unit 30, and an automatic classification code assigning unit 32.
  • The investigation category input reception unit 20 receives a user's input of the category of a lawsuit or fraud investigation.
  • The category of the lawsuit or fraud investigation represents the nature of the case. It may be, for example, antitrust, patent, foreign bribery prohibition (FCPA), product liability (PL), information leakage, or fictitious billing.
  • The investigation category input reception unit 20 outputs the category to the investigation type determination unit 22.
  • The investigation type determination unit 22 determines the category to be investigated based on the category received by the investigation category input reception unit 20, and extracts the necessary type of information from the investigation basic database 103. For example, when the document information is any of an e-mail, presentation material, a spreadsheet, meeting material, a contract, an organization chart, or a business plan, the investigation type determination unit 22 outputs that type (e-mail and so on) to the information extraction unit 24 as the necessary type of information.
  • The information extraction unit 24 extracts a plurality of documents from the document information. Specifically, from the information of the type input from the investigation type determination unit 22 (for example, e-mail, presentation materials, spreadsheet materials, meeting materials, contracts, organization charts, and business plans), the information extraction unit 24 extracts the keywords and/or sentences included in that information as information related to the lawsuit or fraud investigation, and stores the extracted results in the investigation basic database 103.
  • The investigation basic database 103 stores, for each phase classified according to the progress of a predetermined action that causes a lawsuit or fraud investigation, a generation process model of the process by which the predetermined action occurs.
  • The lawsuit may concern, for example, antitrust, patents, foreign bribery prohibition (FCPA), or product liability (PL).
  • The fraud investigation may concern, for example, information leakage or fictitious billing.
  • The predetermined action may be an action related to misconduct in areas such as antitrust, patents, foreign bribery prohibition, product liability, information leakage, or fictitious billing (for example, participating in a price adjustment meeting with competitors).
  • FIG. 2 is a table showing a list of possible phases in the present embodiment.
  • A phase is an index indicating a stage in the progress of the predetermined action; phases are classified according to that progress.
  • The “Relationship Building” phase is a precondition for the “Competition” phase and is the stage of building relationships with customers and competitors.
  • The “Preparation” phase is the stage in which information about competition is exchanged with competitors (who may be third parties).
  • The “Competition” phase is the stage of presenting a price to a customer, obtaining feedback, and communicating with competitors about that feedback.
  • The generation process model is a model, based on information related to the lawsuit or fraud investigation (for example, keywords extracted from document information), of the process by which the actor (an individual or an organization made up of multiple persons) comes to perform the predetermined action.
  • Examples of the generation process model include a personality pattern model, an action pattern model, and a group pattern model.
  • FIG. 3(a) is a schematic diagram showing that the process in which the predetermined action occurs is modeled as the generation process model for each phase.
  • The investigation basic database 103 stores the generation process model for each phase.
  • For example, one generation process model is associated with the “Relationship Building” phase, and another generation process model is associated with the “Preparation” phase. That is, the process in which the predetermined action occurs is modeled as a generation process model for each phase.
  • The investigation basic database 103 further stores information related to the lawsuit or fraud investigation for each category to which the lawsuit or fraud investigation belongs and for each generation process model.
  • The information related to the lawsuit or fraud investigation may be a keyword, a combination of keywords, or meta information extracted from the document information by the information extraction unit 24.
  • The meta information is information indicating a predetermined attribute of the document information. For example, when the document information is an e-mail, the meta information may be the date and time at which the e-mail was sent or received.
  • FIG. 3(b) is a schematic diagram showing that information related to the lawsuit or fraud investigation is stored for each category to which the lawsuit or fraud investigation belongs and for each generation process model.
  • The investigation basic database 103 stores information related to the lawsuit or fraud investigation for each category to which the lawsuit or fraud investigation belongs and for each generation process model. For example, information related to the lawsuit or fraud investigation is stored for the combination of the category “antitrust” and one generation process model.
  • The investigation basic database 103 further stores time-series information.
  • The time-series information is information indicating the temporal order of the phases.
  • For example, the time-series information may indicate a series of transitions in which the “Relationship Building” phase progresses through the “Preparation” phase to the “Competition” phase.
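  • The storage layout described above can be pictured with a minimal Python sketch. It is an illustration only, not the applicant's schema: the class names, field names, and the helper `follows_time_series` are assumptions, and only the three phase names and their order come from the description.

```python
from dataclasses import dataclass, field

# Phase names come from the description; their order is the time-series information.
PHASE_ORDER = ["Relationship Building", "Preparation", "Competition"]


@dataclass
class GenerationProcessModel:
    name: str   # hypothetical label, e.g. "action pattern model"
    phase: str  # the phase this model is stored under


@dataclass
class InvestigationBasicDatabase:
    # phase -> generation process models stored for that phase
    models_by_phase: dict = field(default_factory=dict)
    # (category, model name) -> related information such as keywords
    related_info: dict = field(default_factory=dict)

    def add_model(self, model):
        self.models_by_phase.setdefault(model.phase, []).append(model)

    def add_related_info(self, category, model_name, keywords):
        self.related_info.setdefault((category, model_name), set()).update(keywords)


def follows_time_series(observed_phases):
    """Return True if the observed phases never move backwards in PHASE_ORDER."""
    ranks = [PHASE_ORDER.index(p) for p in observed_phases]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))
```

  • Keying the related information by the pair (category, generation process model) mirrors FIG. 3(b) as described; a real system would presumably use a persistent database rather than in-memory dictionaries.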
  • The analysis unit 26 analyzes the document information based on the information related to the lawsuit or fraud investigation, the generation process model, and the time-series information. Specifically, the analysis unit 26 reads the information related to the lawsuit or fraud investigation, the generation process model, and the time-series information from the investigation basic database 103, and performs morphological analysis and keyword analysis of the data under investigation to extract actions corresponding to the predetermined action. The analysis unit 26 outputs the analysis result (the extracted predetermined actions) to the calculation unit 28.
  • The calculation unit 28 calculates, from the result of the analysis, an index (case index) indicating the possibility that the predetermined action will occur. Specifically, an increment of the index is set arbitrarily for each predetermined action that causes a lawsuit or fraud investigation, and the calculation unit 28 increases the index corresponding to an extracted predetermined action by that increment. For example, when a predetermined action belonging to the “Relationship Building” phase is extracted, the calculation unit 28 may increase the index corresponding to that action by one. In the example shown in FIG. 2, the increment for every action is set to “1”, but the increment can be set arbitrarily. The upper limit of the index may be set to 10, for example.
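  • The index calculation can be sketched as follows. This is a minimal illustration that assumes a single running index per case and a per-phase increment table; the function name `calculate_case_index`, the table `INCREMENTS`, and the input shape are hypothetical. The increment of 1 and the upper limit of 10 are the example values given in the description.

```python
MAX_INDEX = 10  # upper limit mentioned in the description as an example

# Hypothetical per-phase increments; the description uses 1 for every action.
INCREMENTS = {"Relationship Building": 1, "Preparation": 1, "Competition": 1}


def calculate_case_index(extracted_actions, increments=INCREMENTS, max_index=MAX_INDEX):
    """Sum the increment of each extracted action, capped at max_index.

    extracted_actions is a list of (action description, phase) pairs,
    standing in for the output of the analysis unit.
    """
    index = 0
    for _action, phase in extracted_actions:
        index = min(max_index, index + increments.get(phase, 0))
    return index
```

  • Under these assumptions, two extracted actions in different phases give an index of 2, and any number of actions beyond ten leaves the index at the cap.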
  • The search unit 30 searches the document information for keywords or related terms recorded in the databases. That is, the search unit 30 searches the plurality of documents for keywords (for example, words such as “infringement” and “lawsuit”) and/or sentences.
  • The automatic classification code assigning unit 32 automatically assigns a classification code to each of the plurality of documents, using the keywords and/or sentences to do so.
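  • A minimal sketch of keyword search followed by classification code assignment is given below. It assumes case-insensitive substring matching in place of the morphological analysis the description mentions, and the mapping `KEYWORD_TO_CODE` and both function names are hypothetical.

```python
# Hypothetical keyword correspondence information: keyword -> classification code.
KEYWORD_TO_CODE = {"infringement": "important", "lawsuit": "important"}


def search_keywords(documents, keywords):
    """Return {document id: set of keywords found} for documents with a hit."""
    hits = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        found = {kw for kw in keywords if kw.lower() in lowered}
        if found:
            hits[doc_id] = found
    return hits


def assign_classification_codes(documents, keyword_to_code=KEYWORD_TO_CODE):
    """Assign to each matching document the codes linked to its keywords."""
    hits = search_keywords(documents, keyword_to_code)
    return {doc_id: {keyword_to_code[kw] for kw in found} for doc_id, found in hits.items()}
```

  • Documents without any keyword hit are left out of the result, so they remain available for later classification steps.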
  • By expressing as an index the possibility that a predetermined action (for example, misconduct) causing a lawsuit or fraud investigation will occur, the document analysis system 1 makes it possible to grasp the risk level of the predetermined action objectively.
  • The predetermined action can also be monitored by reporting in response to movements of the index. The document analysis system 1 can therefore facilitate analysis of document information used in a lawsuit.
  • FIG. 4 shows a detailed configuration example of the document analysis system 1 according to the embodiment of the present invention.
  • The document analysis system 1 can include a data storage unit 100 that stores information and data.
  • The data storage unit 100 stores digital information acquired from a plurality of computers or servers in a digital information storage area 101 for use in analysis for lawsuits or fraud investigations.
  • The data storage unit 100 stores: an investigation basic database 103 that stores, for example, a category attribute indicating which category applies (a lawsuit concerning antitrust, patents, FCPA, or PL, or a fraud investigation concerning information leakage or fictitious billing), a company name, a person in charge, a custodian, and the configuration of the investigation or classification input screen; a keyword database 104 that registers a specific classification code for documents included in the acquired digital information, keywords closely related to the specific classification code, and keyword correspondence information indicating the correspondence between the specific classification code and the keywords; a related term database 105 that registers a predetermined classification code and related terms consisting of words that appear frequently in documents to which the predetermined classification code is assigned; and a score calculation database 106 that registers the weighting of words contained in a document, used to calculate a score indicating the strength of the connection between the document and a classification code.
  • The investigation basic database 103 stores, for each phase classified according to the progress of a predetermined action that causes a lawsuit or fraud investigation, a generation process model of the process by which the predetermined action occurs.
  • The investigation basic database 103 also stores time-series information indicating the temporal order of the phases.
  • The data storage unit 100 also stores a report creation database 107 that registers report formats determined according to the category, the custodian, and the content of the classification work. The data storage unit 100 may be installed in the document analysis system 1 as shown in FIG. 4, or may be installed outside the document analysis system 1 as a separate storage device.
  • The document analysis system 1 includes a database management unit 109 that manages updates to the data contents of the investigation basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107.
  • The database management unit 109 can be connected to an information storage device 902 via a dedicated connection line or an Internet line 901. Based on the data contents stored in the information storage device 902, the database management unit 109 can then update the data contents of the investigation basic database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107.
  • The document analysis system 1 includes the investigation category input reception unit 20, the investigation type determination unit 22, the information extraction unit 24, the analysis unit 26, the calculation unit 28, and the search unit 30.
  • The automatic classification code assigning unit 32 is realized as a first automatic classification unit 201, a second automatic classification unit 301, and a third automatic classification unit 401.
  • The document analysis system 1 can include: a score calculation unit 116 that calculates a score indicating the strength of association between a document and a classification code; the first automatic classification unit 201, which uses the search unit 30 to search for keywords recorded in the keyword database 104, extracts documents including those keywords from the document information, and automatically assigns a specific classification code to the extracted documents based on the keyword correspondence information; and the second automatic classification unit 301, which extracts from the document information documents including the related terms recorded in the related term database 105, calculates a score based on the evaluation values and the number of the related terms included in each extracted document, and automatically assigns a predetermined classification code, based on the score and the related term correspondence information, to those documents including related terms whose score exceeds a certain value.
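  • The second automatic classification step (score from related terms, then a threshold) can be sketched as below. The scoring rule used here, evaluation value multiplied by occurrence count and summed over related terms, is an assumption chosen for illustration and is not the equation (1) referred to in this description; the table `RELATED_TERMS`, the tokenizer, and the function names are likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical related terms with evaluation values, for one classification code.
RELATED_TERMS = {"encoder": 2.0, "encoding": 1.5, "japan": 0.5}


def score_document(text, related_terms=RELATED_TERMS):
    """Assumed score: sum of (evaluation value x occurrences) over related terms."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(value * counts[term] for term, value in related_terms.items())


def classify_by_related_terms(documents, code, threshold, related_terms=RELATED_TERMS):
    """Assign `code` to each document whose score exceeds `threshold`."""
    return {
        doc_id: code
        for doc_id, text in documents.items()
        if score_document(text, related_terms) > threshold
    }
```

  • The simple ASCII tokenizer only handles single-word English terms; Japanese text would need a morphological analyzer, which is outside this sketch.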
  • The document analysis system 1 can include: a document display unit 130 that displays on a screen a plurality of documents extracted from the document information; a classification code reception/assignment unit 131 that, for a plurality of documents extracted from the document information to which no classification code has been assigned, receives classification codes assigned by the user based on relevance to the lawsuit and assigns those classification codes; a document analysis unit 118 that analyzes the documents to which classification codes have been assigned by the classification code reception/assignment unit 131; and a third automatic classification unit 401 that automatically assigns classification codes based on the result of that analysis.
  • The document analysis system 1 may include a language determination unit 120 that determines the language of an extracted document, and a translation unit 122 that translates the extracted document upon the user's designation or automatically.
  • The unit of language determination in the language determination unit 120 is set smaller than one sentence so that a single sentence mixing multiple languages can be handled. In addition, HTML headers and the like may be removed from the translation target.
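  • Determining language in units smaller than a sentence can be illustrated with a crude script-based splitter. This is a sketch under strong assumptions: it guesses the language from Unicode character names using the standard `unicodedata` module, labels every CJK or kana run as “ja” and every Latin run as “en”, and is not the determination method of the language determination unit 120. The function names are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Crude script guess from the Unicode character name (an assumption)."""
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
        return "ja"
    if name.startswith("LATIN"):
        return "en"
    return None  # digits, spaces, punctuation: no script of their own


def split_by_language(sentence):
    """Split one sentence into (language, text) runs smaller than the sentence."""
    runs = []
    for ch in sentence:
        lang = script_of(ch)
        if runs and (lang is None or lang == runs[-1][0]):
            runs[-1][1] += ch  # neutral characters join the current run
        else:
            runs.append([lang, ch])
    return [(lang, text.strip()) for lang, text in runs if text.strip()]
```

  • Each run can then be passed to translation separately, which is the practical benefit of a determination unit smaller than one sentence.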
  • To perform the analysis by the document analysis unit 118, the document analysis system 1 may include a trend information generation unit 124 that generates trend information, representing the degree of relevance each document has to a classification code, based on the types of words included in each document, their numbers of occurrences, and their evaluation values.
  • The document analysis system 1 may include a quality inspection unit 501 that compares the classification code received by the classification code reception/assignment unit 131 with the classification code assigned by the document analysis unit 118 based on the trend information, and verifies the validity of the classification code received by the classification code reception/assignment unit 131.
  • The document analysis system 1 may include a learning unit 601 that learns the weighting of each keyword or related term based on the result of the document analysis processing.
  • The document analysis system 1 includes a report creation unit 701 that outputs, based on the result of the document analysis processing, an investigation report best suited to the type of lawsuit case or fraud investigation.
  • Litigation cases include, for example, antitrust (cartel), patents, foreign bribery prohibition (FCPA), or product liability (PL).
  • Fraud investigations include, for example, information leakage and fictitious billing.
  • The document analysis system 1 can include, for example, a lawyer review reception unit 133 that receives reviews by a lead attorney or lead patent attorney in order to improve the quality of the classification investigation and the report.
  • “Classification code” refers to an identifier used when classifying documents; it indicates the degree of relevance to a lawsuit so that the documents can be used easily in the lawsuit. For example, when document information is used as evidence in a lawsuit, a classification code may be assigned according to the type of evidence.
  • “Document” refers to data containing one or more words. Examples of documents include e-mail, presentation materials, spreadsheet materials, meeting materials, contracts, organization charts, and business plans.
  • “Word” refers to the smallest meaningful unit of a character string. For example, the sentence “document means data including one or more words” contains the words “document”, “one”, “more”, “word”, “include”, “data”, and “means”.
  • “Keyword” refers to a unit of character string having a certain meaning in a given language. For example, keywords selected from the sentence “classify a document” can be “document” and “classify”. In the embodiment, keywords such as “infringement”, “lawsuit”, and “patent publication XX” are selected with priority.
  • A keyword includes a morpheme.
  • “Keyword correspondence information” refers to information indicating the correspondence between a keyword and a specific classification code. For example, if the classification code “important”, representing an important document in a lawsuit, is closely related to the keyword “infringer”, the keyword correspondence information may be information that links the classification code “important” with the keyword “infringer” and manages them together.
  • “Related term” refers to a word whose evaluation value is at or above a certain value, among the words that appear frequently in common across documents to which a predetermined classification code is assigned.
  • “Appearance frequency” refers to the proportion of related-term occurrences among the total number of words appearing in one document.
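  • The definition of appearance frequency translates directly into a ratio. The sketch below assumes single-word English related terms and a simple ASCII tokenizer; the function name `appearance_frequency` is hypothetical.

```python
import re


def appearance_frequency(text, related_terms):
    """Share of the document's words that are related terms (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    related = {t.lower() for t in related_terms}
    return sum(1 for w in words if w in related) / len(words)
```

  • For the four-word text “encoder sold in Japan” with the related terms “encoder” and “Japan”, two of four words match, so the frequency is 0.5.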
  • “Evaluation value” refers to the amount of information that each word carries in a document.
  • The evaluation value may be calculated based on the amount of transmitted information.
  • A related term may be, for example, the name of the technical field to which a product belongs, the country where the product is sold, or the name of a similar product.
  • Specifically, when the product name of an apparatus that performs image encoding is assigned as a classification code, related terms include “encoding process”, “Japan”, and “encoder”.
  • “Related term correspondence information” refers to information indicating the correspondence between related terms and classification codes. For example, when the classification code “product A”, the name of a product involved in the lawsuit, has the related term “image encoding”, a function of product A, the related term correspondence information may be information that associates the classification code “product A” with the related term “image encoding” and manages them together.
  • “Score” refers to a quantitative evaluation of the strength of the connection between a document and a specific classification code. In each embodiment of the present invention, the score is calculated, for example, from the words appearing in the document and the evaluation value of each word using equation (1).
  • the document analysis system 1 may extract words that frequently appear in documents having a common classification code assigned by the user.
  • The types of the extracted words, the evaluation value of each word, and trend information on the number of appearances in each document are analyzed for each document; among the documents for which the classification code reception / giving unit 131 has not accepted a classification code, a common classification code may be assigned to documents having the same tendency as the analyzed trend information.
  • Trend information refers to the degree of similarity between each document and the documents to which a classification code is assigned. It is based on the types of words included in each document, their number of occurrences, and their evaluation values, and reflects the degree of relevance to a predetermined classification code. For example, when a document is similar, in its degree of relevance to a predetermined classification code, to a document assigned that classification code, the two documents have the same trend information.
  • Even if the types of words they contain differ, documents containing words with the same evaluation values and the same number of occurrences may be treated as documents having the same tendency.
  • FIG. 5 is a chart showing the flow of processing of the document analysis method (document analysis system control method) according to the embodiment of the present invention.
  • The analysis unit 26 reads the information related to the lawsuit or fraud investigation, the generation process model, and the time series information from the investigation basic database 103 (step 41; hereinafter “step” is abbreviated as “S”).
  • the analysis part 26 extracts the action applicable to the said predetermined
  • the calculation part 28 calculates the parameter
  • FIG. 6 is a detailed flowchart of the document analysis method according to the embodiment of the present invention. Note that the flow shown in FIG. 5 may be executed as a process independent of the flow shown in FIG. 6, or may be executed as a process included in any part of the flow shown in FIG. 6.
  • the use database such as the survey basic database and the document analysis database can be specified (S12).
  • the information storage device may be installed inside an organization that performs sorting or may be installed outside the organization. As a case where the information storage device is installed outside the organization, for example, there is a case where the information storage device is installed in an affiliated law firm or patent office.
  • the usage database such as the survey basic database and the document analysis database can be updated to the guideline database (S14).
  • the updated survey basic database is searched (S15), and the name of the company, the person in charge, and the custodian can be presented on the screen of the display device (S16).
  • the document analysis system can accept the user's correction input and specify the names of the actual person in charge and the custodian (S17).
  • digital document information can be extracted in order to perform document analysis work (S18).
  • Within the updated document analysis database, the updated keyword database, related term database, and score calculation database can be searched (S19), and a classification code can be assigned to the extracted document information (S20).
  • the classification code by the reviewer can be received and the classification code can be given to the extracted document information (S21).
  • the database can be searched using the classification result as teacher data, and a classification code can be assigned to the extracted document information (S22).
  • the category is specified by the user's argument designation (S24), and the report creation database can be specified according to the specified category (S25).
  • the format of the report can be determined by the identified report creation database, and the report can be automatically output (S26).
  • FIG. 7 is a chart showing a flow of investigation and classification processing according to the investigation type in the document analysis method according to the embodiment of the present invention.
  • the survey type can be input (S31).
  • The user enters the category corresponding to the investigation and classification work to be carried out, chosen from litigation cases, including antitrust (cartel), patent, overseas bribery prohibition (FCPA), and product liability (PL) cases, or fraud investigations, including information leakage and fictitious billing.
  • the document analysis system can accept a user category input and specify a category to be investigated.
  • the type of survey and document analysis processing and the type of database to be used can be determined (S32).
  • information stock stored in a usage database such as a survey basic database or a document analysis database may be accessed (S33).
  • the survey basic database is accessed according to the specified category, and each keyword input screen corresponding to the specified category can be displayed (S34).
  • the survey basic database is accessed according to the specified category, and keywords or documents can be extracted according to the specified category (S36).
  • the extracted documents and information can be narrowed down by performing a keyword search in the document analysis database (S38).
  • FIG. 8 is a chart showing the flow of predictive coding according to the investigation type in the document analysis method according to the embodiment of the present invention.
  • The document analysis system can ask the user for input according to the type of survey and can accept the user's input in response. For example, for a cartel case under antitrust law, the system can request and accept user input for the target products, the parties (name and email address), the related organizations (name and department), and the time period. For related organizations, it can also request and accept user input regarding competitor companies and customer companies (S51).
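A minimal sketch of how the requested input items could be tied to the survey type is shown below. Only the cartel entry comes from the description above; the table name `SURVEY_INPUT_FIELDS`, the helper names, and the short field labels are hypothetical, and the input items for other survey types are not given in the source, so they are left out.

```python
# Hypothetical table: survey type -> items the user is asked to enter.
# Only the cartel example is taken from the description.
SURVEY_INPUT_FIELDS = {
    "cartel": ["target products", "parties", "related organizations", "time"],
}


def required_inputs(survey_type):
    """Return the input items requested for the given survey type."""
    try:
        return SURVEY_INPUT_FIELDS[survey_type]
    except KeyError:
        raise ValueError(f"no input items registered for {survey_type!r}") from None


def missing_inputs(survey_type, provided):
    """Return the requested items the user has not filled in yet."""
    return [field for field in required_inputs(survey_type) if not provided.get(field)]


todo = missing_inputs("cartel", {"target products": "product A", "time": "2013"})
```

With the hypothetical input above, `todo` lists the two items still to be requested from the user: the parties and the related organizations.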
  • the registration process, the classification process, and the inspection process are performed in the first to fifth stages.
  • keywords and related terms are updated and registered in advance using the results of past classification processing (STEP 100).
  • the keyword and the related term are updated and registered together with the keyword correspondence information and the related term correspondence information which are correspondence information between the classification code and the keyword or the related term.
  • In the second stage, documents including the keywords updated and registered in the first stage are extracted from all document information. When such a document is found, the updated keyword correspondence information recorded in the first stage is referred to, and a first classification process that assigns the classification code corresponding to the keyword is performed (STEP 200).
  • In the third stage, documents including the related terms updated and registered in the first stage are extracted from the document information that was not given a classification code in the second stage, the score of each document including a related term is calculated, and a second classification process that assigns a classification code based on the score is performed (STEP 300).
  • In the fourth stage, a classification code given by the user is accepted for the document information that has not been given a classification code by the end of the third stage, and the accepted classification code is assigned to that document information.
  • The document information given the classification code received from the user is then analyzed, documents without a classification code are extracted based on the analysis result, and a third classification process that assigns a classification code to the extracted documents is performed. For example, words that frequently appear in documents given a common classification code by the user are extracted; the types of the extracted words, the evaluation value of each word, and trend information on the number of appearances are analyzed for each document; and the common classification code is assigned to documents having the same tendency as the trend information (STEP 400).
  • In the fifth stage, for each document to which the user gave a classification code in the fourth stage, the classification code to be given is determined based on the analyzed trend information, and the validity of the classification process is verified by comparing the determined classification code with the classification code given by the user (STEP 500). A learning process based on the result of the document analysis process may also be performed as needed.
  • The trend information used in the fourth and fifth stage processing refers to the degree of similarity between each document and the documents to which a classification code is assigned, and is based on the types of words included in each document, their number of occurrences, and their evaluation values. For example, when a document is similar, in its degree of relevance to a predetermined classification code, to a document assigned that classification code, the two documents have the same trend information. In addition, even if the types of words they contain differ, documents containing words with the same evaluation values and the same number of occurrences may be treated as documents having the same tendency.
  • the keyword database 104 creates a management table for each classification code based on the result of classifying documents in past lawsuits, and specifies keywords corresponding to each classification code (STEP 111).
  • To specify the keywords, a method that analyzes the documents to which each classification code is assigned and uses the number of occurrences and the evaluation value of each keyword in those documents, a method of manual selection by the user, or the like may be used.
  • the keyword correspondence information indicating that the keyword has a special relationship is created (STEP 112). Then, the identified keyword is registered in the keyword database 104. At this time, the identified keyword is associated with the keyword correspondence information and recorded in the management table of the classification code “important” in the keyword database 104 (STEP 113).
  • the related term database 105 creates a management table for each classification code based on the results of document classification in past lawsuits, and registers related terms corresponding to each classification code (STEP 121).
  • For example, “encoding process” and “product a” are registered as related terms of “product A”, and “decoding” and “product b” are registered as related terms of “product B”.
  • The related term correspondence information indicating which classification code each registered related term corresponds to is created (STEP 122) and recorded in each management table (STEP 123). At this time, the related term correspondence information also records the evaluation value of each related term and the threshold score required to decide on a classification code.
  • the keyword and the keyword correspondence information, and the related term and the related term correspondence information are updated and registered (STEP 113, STEP 123).
  • <Second stage (STEP 200)> A detailed processing flow of the first automatic sorting unit 201 in the second stage will be described with reference to FIG.
  • the first automatic classification unit 201 performs a process of assigning the classification code “important” to the document.
  • the first automatic sorting unit 201 extracts documents including the keywords “infringement” and “patent attorney” registered in the keyword database 104 in the first stage (STEP 100) from the document information (STEP 211).
  • For each extracted document, the management table in which the keyword is recorded is consulted to look up the keyword correspondence information (STEP 212), and the classification code “important” is assigned (STEP 213).
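The keyword-based first classification described in STEP 211 to STEP 213 can be sketched as follows. The dictionary `KEYWORD_CORRESPONDENCE`, the function name, and the sample documents are hypothetical; plain case-insensitive substring matching is an assumption, since the source does not say how keywords are matched.

```python
# Hypothetical keyword correspondence information: keyword -> classification code.
KEYWORD_CORRESPONDENCE = {
    "infringement": "important",
    "patent attorney": "important",
}


def first_classification(documents, correspondence):
    """Assign a code to each document that contains a registered keyword.

    `documents` maps a document ID to its text. Documents without any
    registered keyword are left out of the result (no code assigned).
    """
    assigned = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for keyword, code in correspondence.items():
            if keyword in lowered:  # assumed matching rule: substring match
                assigned[doc_id] = code
                break
    return assigned


docs = {
    "d1": "Our patent attorney reviewed the claim.",
    "d2": "Lunch menu for Friday.",
    "d3": "Possible infringement by a competitor.",
}
result = first_classification(docs, KEYWORD_CORRESPONDENCE)
```

In this sketch `d1` and `d3` receive the code “important”, while `d2` stays unclassified and would be passed on to the later stages.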
  • The second automatic classification unit 301 performs a process of assigning the classification codes “product A” and “product B” to the document information that was not assigned a classification code in the second stage (STEP 200).
  • The second automatic classification unit 301 extracts documents including the related terms “encoding process”, “product a”, “decoding”, and “product b” recorded in the related term database 105 in the first stage (STEP 311). Based on the recorded appearance frequency and evaluation value of the four related terms, the score is calculated by the score calculation unit 116 using equation (1) (STEP 312). The score represents the degree of association between each document and the classification codes “product A” and “product B”.
  • When the appearance frequency of the related terms “encoding process” and “product a” and the evaluation value of the related term “encoding process” are high, and the score indicating the degree of association with the classification code “product A” exceeds the threshold value, the document is given the classification code “product A”.
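The threshold decision described above can be sketched as follows. Equation (1) is not reproduced in this text, so the score used here, a sum of occurrence counts weighted by evaluation values, is a stand-in assumption and not the patent's formula. The table `RELATED_TERMS`, its evaluation values, and its thresholds are likewise hypothetical.

```python
# Hypothetical related term correspondence information. For each classification
# code: evaluation value per related term, plus the threshold score.
RELATED_TERMS = {
    "product A": {"terms": {"encoding process": 2.0, "product a": 1.0}, "threshold": 3.0},
    "product B": {"terms": {"decoding": 2.0, "product b": 1.0}, "threshold": 3.0},
}


def score(text, terms):
    """Stand-in score: occurrences of each related term times its evaluation value.

    Uses naive substring counting, so "product a" would also match inside
    "product analysis"; a real system would need proper tokenization.
    """
    lowered = text.lower()
    return sum(lowered.count(term) * value for term, value in terms.items())


def second_classification(text, table):
    """Return the classification codes whose score exceeds the threshold."""
    return [
        code
        for code, entry in table.items()
        if score(text, entry["terms"]) > entry["threshold"]
    ]


text = "The encoding process of product A differs; the encoding process is faster."
codes = second_classification(text, RELATED_TERMS)
```

For the sample text, “encoding process” occurs twice and “product a” once, giving a stand-in score of 5.0 for “product A”, which exceeds the assumed threshold of 3.0. No related term of “product B” occurs, so only “product A” is returned.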
  • The second automatic classification unit 301 recalculates the evaluation value of the related term using the score calculated in STEP 432 in the fourth stage according to equation (2), and weights the evaluation value (STEP 315).
  • Classification codes from the reviewer are accepted for a certain proportion of document information extracted from the document information to which no classification code has been given, and the accepted classification codes are assigned to that document information.
  • The document information given the classification codes received from the reviewer is analyzed, and based on the analysis result, classification codes are given to the document information that has not yet been given one.
  • A process of assigning the classification codes “important”, “product A”, and “product B” is performed on the document information. The fourth stage is further described below.
  • the information extraction unit 24 first samples a document at random and displays it on the document display unit 130.
  • 20% of the document information to be processed is extracted at random and set as a classification target by the reviewer.
  • Sampling may be an extraction method in which documents are arranged in order of document creation date and time or in order of name, and 30% of documents are selected from the top.
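The two sampling approaches mentioned above, random extraction of 20% and selection of the top 30% after ordering, can be sketched as follows. The function names, the `created` field, and the rounding-down of the sample size are assumptions for illustration.

```python
import random


def sample_random(doc_ids, ratio=0.2, seed=None):
    """Pick a random share of the documents for manual review."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    count = int(len(doc_ids) * ratio)  # rounds down (assumed)
    return rng.sample(list(doc_ids), count)


def sample_from_top(docs, ratio=0.3, key="created"):
    """Order documents by a field (e.g. creation date or name) and take the top share."""
    ordered = sorted(docs, key=lambda d: d[key])
    count = int(len(ordered) * ratio)
    return ordered[:count]


# Ten hypothetical documents with ISO-formatted creation dates.
docs = [{"name": f"doc{i:02d}", "created": f"2014-01-{i + 1:02d}"} for i in range(10)]
random_pick = sample_random([d["name"] for d in docs], seed=0)
top_pick = sample_from_top(docs)
```

With ten documents, `random_pick` holds two document names and `top_pick` holds the three oldest documents. Passing `key="name"` would order by name instead.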
  • the user browses the display screen 11 shown in FIG. 20 displayed on the document display unit 130, and selects a classification code to be assigned to each document.
  • the classification code reception / giving unit 131 receives the classification code selected by the user (STEP 411), and sorts based on the given classification code (STEP 412).
  • the document analysis unit 118 extracts words that frequently appear in the documents classified by classification code by the classification code reception / giving unit 131 (STEP 421).
  • the evaluation value of the extracted common word is analyzed by Expression (2) (STEP 422), and the appearance frequency of the common word in the document is analyzed (STEP 423).
  • FIG. 16 is a graph showing the result of analyzing, in STEP 424, the words that frequently appear in the documents to which the classification code “important” is assigned.
  • The vertical axis R_hot shows, among all documents to which the user assigned the classification code “important”, the proportion of documents that contain a word selected as being associated with the classification code “important”.
  • The horizontal axis shows, among all documents that the user classified, the proportion of documents that contain the words extracted in STEP 421 by the classification code reception / giving unit 131.
  • The processing of STEP 421 to STEP 424 is also executed for the documents to which the classification codes “product A” and “product B” are assigned, and the trend information of those documents is analyzed.
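The two proportions plotted in FIG. 16 can be computed along the following lines. This is a sketch under the reading given above: `r_hot` is taken over the documents the user tagged with a given code, and the second value, here called `r_all` (a hypothetical name), over all documents the user reviewed. The data layout is assumed.

```python
def word_ratios(word, reviewed, code="important"):
    """Return (r_hot, r_all) for one word.

    `reviewed` is a list of (set_of_words, assigned_code) pairs covering every
    document the user classified; assigned_code is None when no code was given.
    """
    tagged = [words for words, assigned in reviewed if assigned == code]
    r_hot = sum(word in words for words in tagged) / len(tagged) if tagged else 0.0
    r_all = sum(word in words for words, _ in reviewed) / len(reviewed) if reviewed else 0.0
    return r_hot, r_all


# Four hypothetical reviewed documents.
reviewed = [
    ({"infringement", "claim"}, "important"),
    ({"infringement"}, "important"),
    ({"lunch"}, None),
    ({"infringement", "lunch"}, None),
]
r_hot, r_all = word_ratios("infringement", reviewed)
```

In this example the word appears in both documents tagged “important” and in three of the four reviewed documents. A word with a high `r_hot` relative to `r_all` appears disproportionately often in the tagged documents.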
  • The third automatic classification unit 401 processes the documents, among the document information to be processed in the fourth stage, for which the classification code reception / giving unit 131 did not accept a classification code in STEP 411.
  • From those documents, documents having the same trend information as that of the documents assigned the classification codes “important”, “product A”, and “product B”, as analyzed in STEP 424, are extracted (STEP 431), and the score of each extracted document is calculated using equation (1) based on the trend information (STEP 432).
  • an appropriate classification code is assigned to the document extracted in STEP 431 based on the trend information (STEP 433).
  • the third automatic sorting unit 401 further reflects the sorting result in each database using the score calculated in STEP 432 (STEP 434). Specifically, a process of lowering the evaluation values of keywords and related terms included in a document having a low score and increasing the evaluation values of keywords and related terms included in a document having a high score may be performed.
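The feedback step in STEP 434 can be sketched as follows. Equation (2) is not reproduced in this text, so the fixed multiplicative adjustment used here is an assumption, as are the `low` and `high` score bounds, the `step` size, and the function name.

```python
def reflect_scores(evaluation_values, scored_docs, low=0.3, high=0.7, step=0.1):
    """Raise or lower term evaluation values according to document scores.

    `evaluation_values` maps a keyword or related term to its evaluation value.
    `scored_docs` is a list of (terms_in_document, score) pairs.
    Returns a new dictionary; the input is not modified.
    """
    updated = dict(evaluation_values)
    for terms, doc_score in scored_docs:
        if doc_score >= high:
            factor = 1 + step  # high-scoring document: raise its terms
        elif doc_score <= low:
            factor = 1 - step  # low-scoring document: lower its terms
        else:
            continue  # mid-range scores leave the values unchanged
        for term in terms:
            if term in updated:
                updated[term] *= factor
    return updated


values = {"decoding": 2.0, "product b": 1.0}
scored = [({"decoding"}, 0.9), ({"product b"}, 0.1)]
new_values = reflect_scores(values, scored)
```

In this sketch the value of “decoding” rises by 10% and that of “product b” falls by 10%. The updated values would then be written back to the keyword, related term, or score calculation database.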
  • The third automatic classification unit 401 may perform a classification process on the documents, among the document information to be processed in the fourth stage, to which the classification code reception / giving unit 131 did not give a classification code in STEP 411.
  • When no argument is given (STEP 441: None), the third automatic sorting unit 401 extracts, from those documents, the documents having the same trend information as that of the documents assigned the classification code “important”, as analyzed in STEP 424 (STEP 442), and calculates the score of each extracted document using equation (1) based on the trend information (STEP 443). Further, an appropriate classification code is assigned to the documents extracted in STEP 442 based on the trend information (STEP 444).
  • the third automatic sorting unit 401 further reflects the sorting result in each database using the score calculated in STEP 443 (STEP 445). Specifically, the evaluation value of the keyword and the related term included in the document with a low score is lowered, while the evaluation value of the keyword and the related term included in the document with a high score is increased.
  • The data for score calculation may be stored collectively in the score calculation database 106.
  • <Fifth stage (STEP 500)> A detailed processing flow of the quality inspection unit 501 in the fifth stage will be described with reference to FIG.
  • The classification code reception / giving unit 131 determines the classification code to be given to the document received in STEP 411 based on the trend information analyzed by the document analysis unit 118 in STEP 424 (STEP 511).
  • the classification code received by the classification code reception / giving unit 131 is compared with the classification code determined in STEP 511 (STEP 512), and the validity of the classification code received in STEP 411 is verified (STEP 513).
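The comparison in STEP 512 and STEP 513 can be sketched as a simple agreement check between the codes received from the reviewer and the codes determined from trend information. The function name and the use of an agreement rate as the validity measure are assumptions; the source does not state how validity is quantified.

```python
def verify_classification(reviewer_codes, determined_codes):
    """Compare reviewer-assigned codes with codes determined from trend information.

    Both arguments map a document ID to a classification code.
    Returns the agreement rate and the mismatching documents as
    {doc_id: (reviewer_code, determined_code)}.
    """
    mismatches = {
        doc_id: (code, determined_codes.get(doc_id))
        for doc_id, code in reviewer_codes.items()
        if determined_codes.get(doc_id) != code
    }
    if not reviewer_codes:
        return 1.0, mismatches
    agreement = 1 - len(mismatches) / len(reviewer_codes)
    return agreement, mismatches


reviewer = {"d1": "important", "d2": "product A", "d3": "product B", "d4": "important"}
determined = {"d1": "important", "d2": "product A", "d3": "product A", "d4": "important"}
agreement, mismatches = verify_classification(reviewer, determined)
```

Here three of the four codes agree, and `d3` is flagged for a second look because the reviewer chose “product B” while the trend-based determination was “product A”.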
  • the document analysis system 1 may include a learning unit 601.
  • the learning unit 601 learns the weighting of each keyword or related term based on the first to fourth processing results using Expression (2).
  • the learning result may be reflected in the keyword database 104, the related term database 105, or the score calculation database 106.
  • The document analysis system 1 may include a report creation unit 701 that, based on the result of the document analysis processing, outputs a survey report suited to the survey type: a litigation case (for example, cartel, patent, FCPA, or PL) or a fraud investigation (for example, information leakage or fictitious billing).
  • the contents of the survey vary depending on the survey type. For example,
  • a method of analyzing a document that has already been given a classification code corresponding to similar search information and adjusting a range to which the classification code is assigned based on the analysis result is used.
  • Other available methods include adjusting the range to which the classification code is assigned by clustering the search information corresponding to similar search information, and performing predictive classification by learning the classification results.
  • a common classification code may be given to the reply document of the reply document of the original document.
  • the same or similar classification codes are given to similar search information by learning to integrate similar search information for the classification results.
  • the reliability of the analysis result varies depending on the number of documents to be analyzed.
  • A statistical method may be applied to the total number of documents to be classified in order to determine at what point, that is, after what percentage of all documents has been classified, the range to which the classification code is assigned should be adjusted based on the analysis results.
  • The range of documents to which the classification code is assigned may be adjusted by executing both the method of adjusting the range by clustering the search information corresponding to similar search information and the method of performing predictive classification by learning the classification results. Accordingly, in another example of the embodiment of the present invention, it is possible to assign classification codes quickly and accurately and to reduce the burden associated with the classification work.
  • a display screen control unit that controls a display screen that presents the type of information extracted by the survey type determination unit to the user may be provided.
  • an input receiving unit that receives a keyword and / or sentence input by a user corresponding to the type of information presented on the display screen control unit may be provided.
  • The document analysis program of the present invention is a document analysis program for acquiring information recorded in a predetermined computer or server and analyzing document information composed of a plurality of documents included in the acquired information.
  • The program causes a computer to realize an analysis function and a calculation function. The analysis function refers to a survey basic database and analyzes the document information based on the information related to the lawsuit or fraud investigation, the generation process model, and the time series information. The survey basic database stores, for each category to which the lawsuit or fraud investigation belongs, the information related to the lawsuit or fraud investigation together with a generation process model that describes, for each phase classified according to the progress of a predetermined action, how the predetermined action causing the lawsuit or fraud investigation arises; it further stores time series information indicating the temporal order of the phases.
  • The calculation function calculates, from the result of the analysis, an index indicating the likelihood that the predetermined action will occur.
  • the calculation function can be realized by the calculation unit. Details are as described above.
  • The embodiment of the present invention automatically updates the database according to the category by accepting a user input for the category of the litigation case or fraud investigation case. As a result, the burden of office work for inputting the names of persons in charge, custodians, etc. is reduced.
  • the search word is adjusted by the database automatically updated according to the category, and a classification code is automatically assigned to the document information using the adjusted search word. This reduces the burden of sorting the document information used for litigation or fraud investigation cases.
  • The control block of the document analysis system 1 may be realized by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or may be realized by software using a CPU (Central Processing Unit).
  • In the latter case, the document analysis system 1 includes a CPU that executes the instructions of a program (control program), which is software that implements each function; a ROM (Read Only Memory) or storage device (referred to as a “recording medium”) in which the program and various data are recorded so as to be readable by a computer (or CPU); a RAM (Random Access Memory) into which the program is loaded; and the like.
  • the objective of this invention is achieved when a computer (or CPU) reads the said program from the said recording medium and runs it.
  • As the recording medium, a “non-transitory tangible medium” such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used.
  • the program may be supplied to the computer via an arbitrary transmission medium (such as a communication network or a broadcast wave) that can transmit the program.
  • the present invention can also be realized in the form of a data signal embedded in a carrier wave in which the program is embodied by electronic transmission.
  • a document analysis system comprising: a survey category determination unit that determines a survey category to be surveyed based on a category and extracts a necessary type of information from the survey basic database.
  • the document analysis system further includes a display screen control unit that controls a display screen for presenting a type of information extracted by the survey type determination unit to the user.
  • the document analysis system further includes an input reception unit that receives an input of a keyword and / or a sentence by a user corresponding to the type of information presented on the display screen control unit.
  • the document analysis system further includes an information extraction unit that extracts keywords and / or sentences corresponding to the type of information extracted by the survey type determination unit from the survey basic database.
  • the document analysis system further includes a search unit that searches the document for the keyword and / or the sentence.
  • the document analysis system further includes an automatic classification code assigning unit that automatically assigns a classification code to the document, and the keyword and / or the sentence are used for assigning the classification code.
  • An analysis method comprising: a survey category input receiving step for receiving an input of a category of the lawsuit or fraud investigation; and a survey category to be investigated based on the category received by the survey category input receiving step;
  • a document analysis method comprising: a survey type determination step for extracting a type of necessary information from a survey basic database that stores information related to litigation or fraud investigation.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Technology Law (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/JP2014/052581 2014-02-04 2014-02-04 文書分析システム及び文書分析方法並びに文書分析プログラム WO2015118619A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2014/052581 WO2015118619A1 (ja) 2014-02-04 2014-02-04 文書分析システム及び文書分析方法並びに文書分析プログラム
TW104103850A TW201539217A (zh) 2014-02-04 2015-02-04 文件分析系統、文件分析方法、以及文件分析程式

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/052581 WO2015118619A1 (ja) 2014-02-04 2014-02-04 文書分析システム及び文書分析方法並びに文書分析プログラム

Publications (1)

Publication Number Publication Date
WO2015118619A1 true WO2015118619A1 (ja) 2015-08-13

Family

ID=53777454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/052581 WO2015118619A1 (ja) 2014-02-04 2014-02-04 文書分析システム及び文書分析方法並びに文書分析プログラム

Country Status (2)

Country Link
TW (1) TW201539217A (zh)
WO (1) WO2015118619A1 (ja)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598988A (zh) * 2019-08-14 2019-12-20 中国平安财产保险股份有限公司 统计数据处理方法、装置及存储介质
WO2022184034A1 (zh) * 2021-03-01 2022-09-09 北京字跳网络技术有限公司 一种文档处理方法、装置、设备和介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI742549B (zh) * 2020-03-02 2021-10-11 如如研創股份有限公司 多維度模板之報告書產出方法與系統

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011081491A (ja) * 2009-10-05 2011-04-21 Nec Biglobe Ltd 時系列分析装置、時系列分析方法、及びプログラム
JP2012038135A (ja) * 2010-08-09 2012-02-23 Hitachi Solutions Ltd トレンド推移判定装置またはその方法
JP2013109642A (ja) * 2011-11-22 2013-06-06 Nomura Research Institute Ltd 文書管理装置
JP2013182338A (ja) * 2012-02-29 2013-09-12 Ubic:Kk 文書分別システム及び文書分別方法並びに文書分別プログラム
JP2013214152A (ja) * 2012-03-30 2013-10-17 Ubic:Kk 文書分別システム及び文書分別方法並びに文書分別プログラム

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GIRGENTI, RICHARD H., MANAGING THE RISK OF FRAUD AND MISCONDUCT, 13 June 2012 (2012-06-13), pages 260 - 262 , 305 to 308 *

Also Published As

Publication number Publication date
TW201539217A (zh) 2015-10-16

Similar Documents

Publication Publication Date Title
JP5627820B1 (ja) 文書分析システム及び文書分析方法並びに文書分析プログラム
JP5530476B2 (ja) 文書分別システム及び文書分別方法並びに文書分別プログラム
JP5596213B1 (ja) 文書分析システム及び文書分析方法並びに文書分析プログラム
JP5627750B1 (ja) 文書分析システム及び文書分析方法並びに文書分析プログラム
JP5723067B1 (ja) データ分析システム、データ分析方法、および、データ分析プログラム
JP5683749B1 (ja) 文書分析システム、文書分析方法、および、文書分析プログラム
JP5622969B1 (ja) 文書分析システム、文書分析方法、および、文書分析プログラム
WO2015118619A1 (ja) 文書分析システム及び文書分析方法並びに文書分析プログラム
JP6124936B2 (ja) データ分析システム、データ分析方法、および、データ分析プログラム
JP5669904B1 (ja) 事前情報を提供する文書調査システム、文書調査方法、及び文書調査プログラム
WO2015025978A1 (ja) 文書分別システム及び文書分別方法並びに文書分別プログラム
JP5815911B1 (ja) 文書分析システム、文書分析システムの制御方法、および、文書分析システムの制御プログラム
JP5851007B2 (ja) 文書分析システム及び文書分析方法並びに文書分析プログラム
JP5829768B2 (ja) 電子メール分析システム、電子メール分析方法、および、電子メール分析プログラム
JP2015056185A (ja) 文書分析システム及び文書分析方法並びに文書分析プログラム
JP5990562B2 (ja) 事前情報を提供する文書調査システム、文書調査方法、及び文書調査プログラム
WO2015145524A1 (ja) 文書分析システム、文書分析方法、および、文書分析プログラム
JP5850973B2 (ja) 文書分別システム及び文書分別方法並びに文書分別プログラム
JP5745676B1 (ja) 文書分析システム、文書分析方法、および、文書分析プログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14881448

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14881448

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP