WO2016189605A1 - Data analysis system, control method, control program, and recording medium - Google Patents

Info

Publication number
WO2016189605A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
component
target data
evaluation
analysis system
Prior art date
Application number
PCT/JP2015/064832
Other languages
French (fr)
Japanese (ja)
Inventor
秀樹 武田
和巳 蓮子
Original Assignee
株式会社Ubic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Ubic
Priority to PCT/JP2015/064832 (WO2016189605A1)
Priority to JP2017520082A (JPWO2016189605A1)
Publication of WO2016189605A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to a data analysis system that analyzes data, and can be applied to, for example, an artificial intelligence system that analyzes big data.
  • the present invention has been made in view of such problems, and its purpose is to provide a data analysis system capable of accurately analyzing a data group regardless of the amount of partial data available for that data group, and related technology.
  • a data analysis system for analyzing data comprises a memory, an input control device, and a controller. The system generates an index for ranking a plurality of target data; the index corresponds to the relevance between each target data and a predetermined case, and changes based on an input given by the user via the input control device.
  • the memory stores the plurality of target data at least temporarily.
  • the input control device presents sample data for the target data to a user and receives input of classification information from the user. The classification information is associated with the sample data based on the input in order to classify the sample data, and the combination of the sample data and the classification information received from the user is provided to the controller as reference data.
  • the controller obtains a plurality of the reference data, extracts from them a first component that constitutes at least a part of the reference data, and evaluates the degree to which the first component contributes to the combination.
  • the controller further extracts, from at least one of the plurality of target data, a second component that has relevance to the evaluated first component and constitutes at least a part of the target data. It evaluates the second component, and thereby evaluates the relevance between the plurality of target data and the predetermined case and generates the index based on the evaluation result of the second component.
  • a control method of a data analysis system for analyzing data includes a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data and a predetermined case and changing based on an input from a user; a second step of storing the plurality of target data at least temporarily; and a third step of presenting sample data for the target data to the user and receiving an input of classification information from the user.
  • the classification information is associated with the sample data based on the input in order to classify the sample data, and the combination of the sample data and the classification information received from the user is provided as reference data.
  • the third invention is a data analysis system control program for causing a computer to execute each step included in the control method of the data analysis system.
  • the invention is characterized in that it is a computer-readable recording medium on which a control program of the data analysis system is recorded.
  • the data analysis system, the control method, the control program, and the recording medium according to one aspect of the present invention have the effect that a data group can be accurately analyzed regardless of the amount of partial data for that data group.
  • FIG. 1 is a block diagram illustrating an example of a hardware configuration of a data analysis system (hereinafter, simply referred to as “system”) according to the present embodiment.
  • the system includes, for example, an arbitrary recording medium (eg, memory, hard disk, etc.) capable of storing data (including digital data and analog data), and a controller capable of executing a control program stored in the recording medium.
  • “data” may be any data expressed in a format that can be processed by the computer.
  • the data may be, for example, unstructured data whose structure definition is at least partly incomplete, and broadly includes document data (for example, e-mails (including attached files and header information), technical documents (a wide range of documents explaining technical matters, such as academic papers, patent publications, product specifications, and design drawings), presentation materials, spreadsheets, financial statements, meeting materials, reports, sales documents, contracts, organization charts, business plans, company analysis information, electronic medical records, web pages, blogs, and comments posted on social network services), audio data (for example, recorded conversations or music), image data (for example, data composed of a plurality of pixels or vector information), and video data (for example, data composed of a plurality of frame images).
  • “reference data” may be, for example, data associated with classification information by a user (data that has been classified, which is a combination of data and classification information).
  • the “target data” may be data not associated with the classification information (unclassified data that has not been presented to the user as reference data and has not been classified by the user).
  • the “classification information” may be an identification label used for classifying reference data. It may be, for example, information that classifies the reference data into three categories, such as a “Related” label indicating that the reference data and a predetermined case are related, a “High” label indicating that they are particularly related, and a “Non-Related” label indicating that they are not related; or information that classifies the reference data into five categories, such as “good”, “slightly good”, “normal”, “slightly bad”, and “bad”.
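  • as a minimal sketch of how reference data (a combination of data and classification information) might be represented, the following Python fragment models the three-way labels as an enumeration. The names `Label`, `ReferenceData`, and `make_reference` are illustrative assumptions and do not appear in the description.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """Three-way classification information (label names taken from the text)."""
    RELATED = "Related"
    HIGH = "High"
    NON_RELATED = "Non-Related"


@dataclass(frozen=True)
class ReferenceData:
    """Hypothetical container: one sample plus the label the user gave it."""
    data: str
    label: Label


def make_reference(sample: str, user_input: str) -> ReferenceData:
    # Label(user_input) raises ValueError for an unknown label string,
    # so malformed user input is rejected rather than stored.
    return ReferenceData(data=sample, label=Label(user_input))
```

  • for example, `make_reference("adjust the price", "Related")` yields a reference data item whose label is `Label.RELATED`; unlabeled items simply remain plain target data.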
  • the “predetermined case” includes a wide range of targets for which the system is evaluated for relevance to data, and the scope thereof is not limited.
  • the predetermined case may be, for example: a lawsuit for which the discovery procedure is required, when the system is realized as a discovery support system; a crime that is the subject of an investigation, when realized as a criminal investigation support system; fraudulent activity (for example, information leakage or collusion), when realized as an email monitoring system; a case or example related to medicine, when realized as a medical application system (for example, a pharmacovigilance support system, clinical trial efficiency system, medical risk hedging system, fall prediction (fall prevention) system, prognosis prediction system, or diagnosis support system); a case or example related to the Internet, when realized as an Internet application system (for example, a smart mail system, information aggregation (curation) system, user monitoring system, or social media management system); a project performed in the past, when realized as a project evaluation system; a product or service targeted for marketing, when realized as a marketing support system; intellectual property subject to evaluation, when realized as an intellectual property evaluation system; a fraudulent financial transaction, when realized as an unauthorized transaction monitoring system; a past response case, when realized as a call center escalation system; the subject of a credit check, when realized as a credit check system; a matter related to driving, when realized as a driving support system; or sales results, when realized as a sales support system.
  • the data analysis system 1 may include, for example, a server device 2 that can execute the main processing of data analysis; one or more client devices 3 that can execute related processing of data analysis; a storage system 5 including a database 4 for recording data and evaluation results for the data; and a management computer 6 that provides a management function for data analysis to the client devices 3 and the server device 2.
  • the client device 3 can present a part of a plurality of target data to the user as sample data before classification. As a result, the user can input for evaluation / classification of the sample data via the client device 3.
  • the server device 2 can randomly sample a plurality of target data, extract a predetermined number of sample data, and provide this to a predetermined client device.
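  • the random sampling step can be sketched in Python as follows. This is only an illustration: the function name `draw_sample`, the optional `seed` parameter, and the choice to cap the sample at the size of the data set are assumptions, not details given in the description.

```python
import random


def draw_sample(target_data, sample_size, seed=None):
    """Randomly pick up to `sample_size` distinct items from the target data.

    The picked items would then be shown to the user for classification.
    """
    rng = random.Random(seed)           # private generator; seed only for repeatability
    items = list(target_data)
    k = min(sample_size, len(items))    # never ask for more items than exist
    return rng.sample(items, k)         # sampling without replacement
```

  • sampling without replacement ensures the user is never asked to classify the same item twice.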
  • the sample data may be data belonging to a data group that is not included in the target data to be analyzed but has a predetermined case that is the same as or similar to the target data.
  • the client device 3 includes, as hardware resources, for example, a memory, a controller, a bus, an input/output interface (for example, a keyboard and a display), and a communication interface (which communicably connects the client device 3 to the server device 2 and the management computer 6 through communication means using a predetermined network).
  • based on the sample data to which the classification information is attached, that is, the combination of the sample data and the classification information (referred to as “reference data”), the server device 2 learns patterns included in the data (broadly referring to abstract rules, meanings, concepts, styles, distributions, samples, and the like, and not limited to so-called “specific patterns”), and evaluates the relevance between the target data and a predetermined case based on these patterns. That is, based on the learned pattern, the server device 2 can evaluate the relevance between the target data and a lawsuit, between the target data and a criminal investigation, between the target data and the user's preferences, or between the target data and any other event. Similarly to the client device 3, the server device 2 may include, for example, a memory, a controller, a bus, an input/output interface, and a communication interface as hardware resources.
  • the management computer 6 executes predetermined management processing for the client device 3, the server device 2, and the storage system 5.
  • the management computer 6 may include, for example, a memory, a controller, a bus, an input / output interface, and a communication interface as hardware resources.
  • application programs that can control each device are stored in the memory provided in each of the client device 3, the server device 2, and the management computer 6. Each controller executes its application program, so that the program (software resource) and the hardware resources cooperate to operate each device.
  • the storage system 5 may be composed of, for example, a disk array system, and may include a database 4 that records data and results of evaluation / classification of the data.
  • the server apparatus 2 and the storage system 5 are connected by a DAS (Direct Attached Storage) method or a SAN (Storage Area Network).
  • the hardware configuration shown in FIG. 1 is merely an example, and the above system can be realized by other hardware configurations.
  • a part or all of the processing executed in the server device 2 may be executed in the client device 3, or a part or all of the processing executed in the client device 3 may be executed in the server device 2.
  • the storage system 5 may be built in the server device 2. It is understood by those skilled in the art that there can be various hardware configurations capable of realizing the system, and the present invention is not limited to one specific configuration (for example, the configuration illustrated in FIG. 1).
  • FIG. 2 is a functional block diagram showing an example of the predictive coding function realized by the data analysis system according to the present embodiment.
  • the system can include a predictive coding unit 10.
  • the predictive coding unit 10 evaluates a large number of data (target data not associated with classification information; for example, big data) based on a small number of manually classified data (the reference data described above), so that significant information can be extracted from the target data.
  • the predictive coding unit 10 can include, for example, a data acquisition unit 11, a classification information acquisition unit 12, a data classification unit 13, a component extraction unit 14, a component evaluation unit 15, a component storage unit 16, and a data evaluation unit 17.
  • the data acquisition unit 11 acquires data from an arbitrary storage resource (for example, the database 4, a web server on the Internet, a mail server on the intranet, etc.).
  • the data acquisition unit 11 provides all data to be subjected to data analysis to the component extraction unit 14 as target data. It also randomly samples the target data to acquire a predetermined number of sample data and provides them to the data classification unit 13.
  • the classification information acquisition unit 12 acquires the classification information input by the user for each sample data from an arbitrary input device (for example, the client device 3), and outputs the classification information to the data classification unit 13.
  • the data classification unit 13 combines the plurality of sample data sent from the data acquisition unit 11 with the classification information input for each sample data from the classification information acquisition unit 12, and outputs the combinations to the component extraction unit 14 as a plurality of reference data.
  • the component extraction unit 14 extracts the components constituting the reference data from the plurality of reference data received from the data classification unit 13.
  • the “component” may be partial data constituting at least a part of the data, for example: a morpheme, keyword, sentence, paragraph, and/or metadata (for example, email header information) constituting a document; partial audio, volume (gain) information, and/or timbre information constituting audio; a partial image, partial pixels, and/or luminance information constituting an image; or a frame image, motion information, and/or 3D information constituting video.
  • the component extraction unit 14 outputs each extracted component and the classification information corresponding to it to the component evaluation unit 15. Further, the component extraction unit 14 extracts the components constituting the target data from the target data input from the data acquisition unit 11 and outputs them to the data evaluation unit 17.
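  • component extraction for document data can be sketched as below. The description mentions morphemes, keywords, sentences, and metadata as possible components; this sketch assumes plain English text and uses a simple regular-expression word split as a stand-in for real morphological analysis. The function name `extract_components` is hypothetical.

```python
import re


def extract_components(document: str) -> list[str]:
    """Return the word-level components of a document, in order of appearance.

    Lower-casing makes "Price" and "price" count as the same component.
    A production system would use a morphological analyzer instead.
    """
    return re.findall(r"[a-z0-9]+", document.lower())
```

  • the same function would be applied both to reference data (learning) and to target data (evaluation), so that the components are comparable.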
  • the component evaluation unit 15 evaluates the components input from the component extraction unit 14. For example, it evaluates, for each of the plurality of components constituting at least part of the reference data, the degree to which the component contributes to the combination (in other words, how the appearances of the component are distributed across the classification information). More specifically, the component evaluation unit 15 calculates an evaluation value for each component using, for example, a transmitted information amount (for example, an information amount calculated from a predetermined definition formula using the appearance probability of the component and the appearance probability of the classification information). In this way, the component evaluation unit 15 can learn the patterns contained in the reference data. It outputs each component and its evaluation value to the component storage unit 16.
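  • the description does not give the definition formula for the transmitted information amount. As one plausible reading, the sketch below assumes it is the mutual information between "the component appears in a reference item" and "the item carries a given label", computed from empirical probabilities. The function name and the input layout (a list of `(set_of_components, label)` pairs) are assumptions.

```python
import math
from collections import Counter


def transmitted_information(reference_data, component):
    """Mutual information (in bits) between component presence and label.

    reference_data: list of (set_of_components, label) pairs.
    This is an assumed definition, not necessarily the formula used in the patent.
    """
    n = len(reference_data)
    # Joint counts over (component present?, label).
    joint = Counter((component in comps, label) for comps, label in reference_data)
    presence = Counter()
    labels = Counter()
    for (present, label), count in joint.items():
        presence[present] += count
        labels[label] += count
    info = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = presence[present] / n
        p_y = labels[label] / n
        info += p_xy * math.log2(p_xy / (p_x * p_y))
    return info
```

  • under this assumption, a component that appears only in “Related” items receives a high value, while a component that appears in every item regardless of label receives zero.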
  • the component storage unit 16 associates the component and the evaluation value input from the component evaluation unit 15, and stores both in an arbitrary memory (for example, the storage system 5).
  • the data evaluation unit 17 reads the evaluation value associated with each component input from the component extraction unit 14 from an arbitrary memory (for example, the database 4 of the storage system 5), and evaluates the target data based on the evaluation values. More specifically, the data evaluation unit 17 can derive an index for ranking the target data (for example, numerical values, letters, and/or symbols) by, for example, adding up the evaluation values associated with the components constituting at least a part of the target data. A suitable form of the index is a score obtained by adding the evaluation values. The data evaluation unit 17 associates the target data with the index, and stores both in an arbitrary memory (for example, the storage system 5).
  • the component evaluation unit 15 can correct the evaluation values by repeatedly selecting components and re-evaluating them until the evaluation of the data labeled “Related” or “High” becomes higher than the evaluation of the data without such a label. As a result, the component evaluation unit 15 can find components that appear in a plurality of reference data labeled “Related” or “High” and that influence the combination of the reference data and the label.
  • the component evaluation unit 15 calculates the evaluation value wgt of the component using, for example, the following formula.
  • wgt_{i,0} indicates the initial value of the evaluation value of the i-th component before evaluation.
  • wgt_{i,L} indicates the evaluation value of the i-th component after the L-th evaluation.
  • γ_L means an evaluation parameter in the L-th evaluation, and θ means a threshold value in the evaluation.
  • the component evaluation unit 15 can evaluate, for example, that the larger the calculated transmitted information amount, the more strongly a component represents the characteristics of the predetermined classification information.
  • the component evaluation unit 15 can use, as a threshold (a predetermined reference value for automatically determining whether “Related” is set for target data), an intermediate value between the lowest index value of the reference data labeled “Related” and the highest index value of the reference data labeled “Non-Related”.
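  • the threshold rule can be illustrated with the short Python sketch below, which takes the midpoint as the intermediate value. The function names are hypothetical, and treating a score equal to the threshold as “Related” is an assumption made here, since the description does not say how ties are handled.

```python
def related_threshold(scored_reference):
    """Midpoint between the lowest "Related" score and the highest "Non-Related" score.

    scored_reference: iterable of (score, label) pairs for the reference data.
    """
    pairs = list(scored_reference)
    related = [s for s, label in pairs if label == "Related"]
    non_related = [s for s, label in pairs if label == "Non-Related"]
    if not related or not non_related:
        raise ValueError("need at least one item of each label")
    return (min(related) + max(non_related)) / 2


def auto_label(score, threshold):
    # Assumption: a score exactly on the threshold counts as "Related".
    return "Related" if score >= threshold else "Non-Related"
```

  • with reference scores 9 and 8 for “Related” and 4 and 1 for “Non-Related”, the threshold is (8 + 4) / 2 = 6, so a target item scoring 7 would be labeled “Related” automatically.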
  • the data evaluation unit 17 calculates a score for each of the plurality of target data and each of the plurality of reference data from the following formula.
  • the score is an index that quantitatively evaluates the strength of the connection of these data to the classification code.
  • m_i: frequency of occurrence of the i-th component
  • wgt_i: evaluation value of the i-th component
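  • the score formula itself is not available here. The sketch below therefore assumes the simplest form consistent with the symbols above and with the statement that the score is obtained by adding evaluation values: the sum of m_i × wgt_i over the components of an item. The actual formula may include further terms (for example, normalization); the function name `score` is hypothetical.

```python
from collections import Counter


def score(components, weights):
    """Assumed score: sum over components of (frequency m_i) * (evaluation value wgt_i).

    components: list of components extracted from one data item (with repeats).
    weights:    dict mapping component -> evaluation value; unknown components count as 0.
    """
    frequency = Counter(components)   # m_i for each distinct component
    return sum(m * weights.get(component, 0.0) for component, m in frequency.items())
```

  • for instance, with weights `{"price": 2.0, "adjust": 0.5}`, an item containing "price" twice and "adjust" once scores 2 × 2.0 + 1 × 0.5 = 4.5.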
  • each “... unit” described above is a functional configuration realized by a controller included in the data analysis system executing a program (data analysis program), and may be paraphrased as “... processing” or a “... function”.
  • since each “... unit” can also be replaced by hardware resources, those skilled in the art will understand that these functional blocks can be realized in various forms by hardware only, software only, or a combination thereof, and are not limited to any one of these.
  • FIG. 3 is a flowchart showing an example of processing executed by the predictive coding unit 10 included in the data analysis system according to the present embodiment.
  • the data acquisition unit 11 acquires sample data from an arbitrary memory (step 10, hereinafter “step” is abbreviated as “S”).
  • the classification information acquisition unit 12 acquires the classification information input by the user from an arbitrary input device (S11).
  • the data classification unit 13 classifies the data by combining the data and the classification information to configure reference data (S12), and the component extraction unit 14 extracts the components constituting the reference data from the reference data (S13).
  • the component evaluation unit 15 evaluates the component (S14), and the component storage unit 16 associates the component with the evaluation value and stores both in an arbitrary memory (S15).
  • the processing of S10 to S15 is referred to as a “learning phase” (a phase in which the system learns a pattern).
  • the data acquisition unit 11 acquires target data from an arbitrary memory (S16).
  • the component extraction unit 14 extracts the components constituting the target data from the target data (S17).
  • the data evaluation unit 17 reads an evaluation value associated with the constituent element from an arbitrary memory, and evaluates the target data based on the evaluation value (S18).
  • the processing of S16 to S18 is referred to as an “evaluation phase” (a phase in which the system evaluates target data based on the pattern).
  • the processes included in the learning phase are not essential to the system.
  • for example, a memory that stores components in association with their evaluation values may be given in advance, and the predictive coding unit 10 can evaluate the target data based on the components and evaluation values stored in that memory.
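  • the two phases can be sketched end to end as follows. To keep the example short, the learning phase uses a deliberately simple stand-in weight (the share of “Related” reference items containing a component minus the share of other items containing it) rather than the transmitted information amount; the function names and this weighting are assumptions made for illustration only.

```python
import re
from collections import Counter


def extract(document):
    """Word-level components (stand-in for morphological analysis)."""
    return re.findall(r"[a-z]+", document.lower())


def learn(reference_data):
    """Learning phase (S10 to S15): derive component -> evaluation value."""
    related = [set(extract(d)) for d, label in reference_data if label == "Related"]
    other = [set(extract(d)) for d, label in reference_data if label != "Related"]
    vocabulary = set().union(*related, *other)
    weights = {}
    for component in vocabulary:
        p_related = sum(component in s for s in related) / max(len(related), 1)
        p_other = sum(component in s for s in other) / max(len(other), 1)
        weights[component] = p_related - p_other   # simplified stand-in weight
    return weights


def evaluate(target_data, weights):
    """Evaluation phase (S16 to S18): score each target item from stored weights."""
    scores = {}
    for document in target_data:
        frequency = Counter(extract(document))
        scores[document] = sum(m * weights.get(c, 0.0) for c, m in frequency.items())
    return scores
```

  • because `evaluate` only needs a component-to-value mapping, it can also be called with a mapping supplied in advance, which corresponds to skipping the learning phase.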
  • the data analysis system accurately evaluates a large number of target data regardless of the ratio of the number of reference data to the number of target data, or even when the number of reference data is insufficient.
  • to this end, the data analysis system sets new components based on the components extracted from the reference data, which makes it possible to evaluate the target data accurately by supplementing the new components while maintaining the classification policy applied to the reference data.
  • in order to distinguish between the components extracted from the reference data and the new components, the former are referred to as reference components and the latter as related components for convenience.
  • a related component is a component that is not included in the reference data but is included in the target data and has attributes related to a reference component. It is effective in evaluating the relevance of the target data to a predetermined case (that is, in determining the score of the target data).
  • the “related attribute” is a characteristic of the related component with respect to the reference component: for example, that the related component (morpheme) co-occurs with the reference component (morpheme), that the two are synonyms, or that their meta information is common.
  • for example, if “price” is a reference component, then “adjustment”, “determination”, and “consultation”, which appear in the same context as “price” in phrases such as “adjust price”, “determine price”, and “consult price”, are co-occurrence words for “price”, that is, related components.
  • “adjustment”, “determination”, and “consultation” may be determined as synonyms for “price” based on a database or the like.
  • FIG. 4 shows an example of a flowchart of the addition process.
  • the component evaluation unit 15 determines whether additional processing is necessary (S40).
  • the data analysis program of the server device 2 allows an operation mode (operation policy) for the additional processing to be selected when the data analysis operation manager sets the operating environment for data search.
  • the operation modes include, for example, (1) performing the additional processing, (2) not performing the additional processing, and (3) performing the additional processing depending on the situation.
  • the “situation” is, for example, a state in which the number of reference data tends to be insufficient compared with the number of target data. The component evaluation unit 15 can determine that this situation has occurred when, for example, the score distribution of the plurality of target data is biased or the number of reference components used for the evaluation is relatively small.
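  • the necessity determination (S40) under the three operation modes can be sketched as below. The enumeration name, the function name, and both numeric limits (`min_ratio`, `min_components`) are hypothetical values chosen for illustration; the description gives no concrete thresholds, and the check for a biased score distribution is omitted here.

```python
from enum import Enum


class AdditionMode(Enum):
    PERFORM = 1        # (1) perform the additional processing
    SKIP = 2           # (2) do not perform it
    SITUATIONAL = 3    # (3) perform it depending on the situation


def needs_addition(mode, n_reference, n_target, n_reference_components,
                   min_ratio=0.01, min_components=50):
    """Decide whether the additional processing should run.

    min_ratio and min_components are assumed limits, not values from the text.
    """
    if mode is AdditionMode.PERFORM:
        return True
    if mode is AdditionMode.SKIP:
        return False
    # Situational mode: reference data scarce relative to target data,
    # or too few reference components available for the evaluation.
    scarce_reference = n_target > 0 and (n_reference / n_target) < min_ratio
    few_components = n_reference_components < min_components
    return scarce_reference or few_components
```

  • in practice the limits would be part of the operating environment configured by the operation manager.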
  • the additional process control program may always perform the additional process at the time of data analysis without the necessity determination step of the additional process.
  • the data analysis program may be configured to select whether to notify the operator that the additional processing is performed.
  • when the component evaluation unit 15 denies the necessity of the additional process, the flow ends without performing it; when the necessity is affirmed, the flow proceeds to setting related components based on the evaluation result of the reference data. To set the related components, the component evaluation unit 15 determines, from among the components extracted from the reference data (reference components), a specific reference component that serves as the basis for determining the related components (S41).
  • the specific reference component may be one, several, or all of the reference components extracted from the reference data. How it is determined may be based on configuration information set for the data analysis operating environment. In a preferred aspect, a predetermined number of reference components are selected as specific reference components in descending order of evaluation value (for example, only the reference component with the highest evaluation value). This is because the higher the evaluation value of a reference component, the higher the relevance to the predetermined case of the related components derived from it. If the number of specific reference components is greater than the optimal value, there is a concern that the evaluation of the target data will tend to be inconsistent with the classification in the reference data; if it is smaller than the optimal value, there is a concern that the target data cannot be evaluated sufficiently. The “predetermined number” may therefore be determined appropriately by the system.
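  • selecting a predetermined number of reference components in descending order of evaluation value (S41) can be sketched as follows. The function name and the default of one component are illustrative assumptions.

```python
def select_specific_components(weights, predetermined_number=1):
    """Return the top-N reference components by evaluation value, highest first.

    weights: dict mapping reference component -> evaluation value.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [component for component, _ in ranked[:predetermined_number]]
```

  • for example, with evaluation values `{"price": 0.9, "meeting": 0.4, "lunch": 0.1}` and a predetermined number of 2, the specific reference components are "price" and "meeting".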
  • the component evaluation unit 15 determines a related component based on the specific reference component (S42).
  • the component evaluation unit 15 extracts, from all or part of the target data, components that do not exist in the reference data and that co-occur with the specific reference component, and sets the extracted components as related components. For example, if “price” is a specific reference component, then “adjustment”, “determination”, and “consultation”, which co-occur with “price” in phrases such as “adjust price”, “determine price”, and “consult price”, are extracted and set as related components.
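  • the co-occurrence search (S42) can be sketched in Python as below, assuming English text, sentence-level co-occurrence, and a simple word split. These choices and the function name are assumptions; a real system would use morphological analysis and would typically filter out function words.

```python
import re


def find_related_components(target_data, reference_components, specific_component):
    """Collect words that share a sentence with the specific reference component
    and that do not occur among the reference components."""
    known = set(reference_components)
    related = set()
    for document in target_data:
        for sentence in re.split(r"[.!?]+", document.lower()):
            words = set(re.findall(r"[a-z]+", sentence))
            if specific_component in words:
                related |= words - known   # keep only components new to the reference data
    return related
```

  • applied to the target sentences "We must adjust price." and "They determine price today.", with "price" as the specific reference component and the remaining ordinary words already known from the reference data, the sketch returns "adjust" and "determine" as related components.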
  • the related component is information that the system adds automatically, without user input and without increasing the reference data, for use in evaluating (scoring) the target data. If no related component exists in the target data, the component evaluation unit 15 may add new specific reference components until a related component can be extracted from the target data.
  • the component evaluation unit 15 evaluates the related component according to its attributes (for example, the transmitted information amount) (S43).
  • the component evaluation unit 15 detects the target data in which related components such as “adjustment”, “determination”, and “consultation” co-occur with a specific reference component (“price”), and specifies the number (n) of the detected target data.
  • the additional processing control program regards the detected target data as relevant to the predetermined case (that is, as corresponding to the classification information “Related” or “High”), calculates the transmitted information amount from a predetermined definition formula based on the appearance probability of the related component and the appearance probability of the classification information in all the target data, and estimates the evaluation value corresponding to each related component.
  • the component evaluation unit 15 can evaluate the evaluation value (weight) of the related component according to the following formula.
  • the component evaluation unit 15 may also evaluate the evaluation value of the related component according to the following formula.
  • CF represents the frequency with which the j0-th reference component m_{j0} and the j1-th related component m_{j1} co-occur in the same sentence (collocation frequency).
  • DF represents the frequency with which both co-occur in the same data.
  • w represents the weight (evaluation result) of the reference component m_{j0}.
  • F represents an arbitrary function.
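  • the exact formula and the form of F are not given here, so the following Python sketch assumes, purely for illustration, that the related component's weight is w × F(CF, DF) with F(CF, DF) = CF / DF. Both this combination and the helper names are hypothetical; only the meanings of CF, DF, and w come from the description.

```python
import re


def cooccurrence_counts(target_data, reference_component, related_component):
    """Return (CF, DF): co-occurrences in the same sentence and in the same data item."""
    cf = 0
    df = 0
    for document in target_data:
        sentences = [set(re.findall(r"[a-z]+", s))
                     for s in re.split(r"[.!?]+", document.lower())]
        cf += sum(1 for words in sentences
                  if reference_component in words and related_component in words)
        all_words = set().union(*sentences)
        if reference_component in all_words and related_component in all_words:
            df += 1
    return cf, df


def related_weight(w_reference, cf, df, f=None):
    """Assumed form: weight of related component = w * F(CF, DF)."""
    if f is None:
        # Hypothetical default F: sentence-level share of document-level co-occurrence.
        f = lambda c, d: c / d if d else 0.0
    return w_reference * f(cf, df)
```

  • any other function can be passed as `f`, reflecting the statement that F is arbitrary.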
  • the component evaluation unit 15 may evaluate the related component based on whether the target data including the related component is “Related” or “Non-Related” according to the evaluation of the reference components, and on the evaluation result (score value) of the target data obtained in S18.
  • the data evaluation unit 17 re-evaluates all target data (recalculates the scores) based on the evaluation values of the reference components and the evaluation values of the related components (S44). Furthermore, the data evaluation unit 17 ranks all the target data according to the evaluation results and creates ranking information for all the target data. The data evaluation unit 17 compares the evaluation result of each target data with predetermined threshold information and sets classification information for each target data. The data evaluation unit 17 can output the above-described ranking information, including the classification information, to the client device 3.
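  • the re-evaluation and ranking step (S44) can be sketched as follows. The score is again assumed to be a frequency-weighted sum of evaluation values, scores at or above the threshold are assumed to be “Related”, and all names are hypothetical.

```python
from collections import Counter


def rank_targets(target_components, reference_weights, related_weights, threshold):
    """Re-score every target item with both kinds of weights, then rank and label.

    target_components: dict mapping item id -> list of its components.
    Returns a list of (item id, score, label), highest score first.
    """
    # Reference weights take precedence if a component appears in both mappings.
    weights = {**related_weights, **reference_weights}
    rows = []
    for item_id, components in target_components.items():
        frequency = Counter(components)
        item_score = sum(m * weights.get(c, 0.0) for c, m in frequency.items())
        label = "Related" if item_score >= threshold else "Non-Related"
        rows.append((item_id, item_score, label))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

  • an item containing only the related component "adjust" would score zero on reference weights alone; with the related weight included it receives a non-zero score and a meaningful rank.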
  • in this way, a component related to a component included in the reference data can be added to the evaluation of the target data as a new component. As a result, regardless of the ratio of the number of reference data to the number of target data, and even when the number of reference data is insufficient, a large number of target data can be accurately evaluated.
  • the component evaluation unit 15 may determine the related component from components that are synonyms of the specific reference component and that are not included in the reference data but are included in the target data. At this time, the component evaluation unit 15 may use the search table of the database 4 to select synonyms of the specific reference component.
  • here, two different morphemes are synonyms when they are related by, for example, both falling under a morpheme of a higher-level concept.
  • the component evaluation unit 15 may set the related components from both the synonyms and the above-described morphemes having a co-occurrence relationship with the specific reference component. The component evaluation unit 15 may also set related components from morphemes having a co-occurrence relationship with the synonyms. Furthermore, when no synonym of the specific reference component exists in the target data, the component evaluation unit 15 may use another word whose meaning is similar to that of the synonym as a candidate for the related component.
  • in the above, the target data is evaluated first based on the reference components alone and then based on both the related components and the reference components, so that the difference between the two evaluations can be presented to the user and the former evaluation result can be applied to the determination and evaluation of related components. However, the target data may instead be evaluated based on the related components and the reference components without first performing the evaluation based on the reference components alone.
  • when the operating environment is set so as to notify the user of the additional processing, the server device 12 sends the specific reference component and the candidate related components to the client device 3. The candidates can be displayed in order of evaluation value, and the user can select whether or not to adopt each of them for data analysis.
  • the predictive coding unit 10 can optimize the evaluation values of components based on given reference data and/or newly obtained reference data, for example, as described in (1) to (3) below.
  • the component evaluation unit 15 can calculate the recall rate or the precision rate based on the result of evaluating the target data, and can update the learned pattern by repeatedly re-evaluating the degree to which each component contributes to the combination of data and classification information so that the recall rate or the precision rate increases.
  • the “recall rate” is an index indicating the proportion (coverage) of the data to be discovered that is contained in a predetermined number of data. For example, “the recall rate is 80% for 30% of all data” means that 80% of the data to be discovered is contained in the top 30% of the data when ranked by index. (If the data is reviewed by brute force (linear review) without using a data analysis system, the amount of data to be discovered that is found is proportional to the amount reviewed, so the greater the deviation from that proportion, the better the performance of the system.)
  • the “precision rate” is an index indicating the proportion (accuracy) of data that truly should be discovered among the data discovered by the system. For example, “the precision rate is 80% when 30% of all data is processed” means that 80% of the top 30% of the data, ranked by index, is data to be discovered.
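  • the two rates can be computed from a ranked list as in the following sketch. The function name and the sample ranking are hypothetical; the input is assumed to be a list of booleans ordered from the highest index to the lowest, where True marks data that should be discovered.

```python
def recall_and_precision_at(ranked_is_relevant, top_fraction):
    """Recall rate and precision rate when only the top `top_fraction`
    of the ranked data is reviewed."""
    n = len(ranked_is_relevant)
    k = max(1, round(n * top_fraction))            # number of items reviewed
    found = sum(ranked_is_relevant[:k])            # data to be discovered among them
    total_relevant = sum(ranked_is_relevant)
    recall = found / total_relevant if total_relevant else 0.0
    precision = found / k
    return recall, precision

# Hypothetical ranking of ten target data items, best index first.
ranking = [True, True, True, False, True, False, False, False, False, False]
recall, precision = recall_and_precision_at(ranking, 0.3)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

  • under a linear review the recall for the top 30% would be expected to be about 30%, so the gap between that figure and the computed recall is the deviation that the text uses as a measure of system performance.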
  • the component extraction unit 14 calculates the recall rate or the precision rate based on the result evaluated by the data evaluation unit 17, and when the recall rate or the precision rate is lower than the target value, re-extracts components from the data until the recall rate or the precision rate exceeds the target value. At this time, the component extraction unit 14 may extract components other than those extracted last time, or may replace a part of the components extracted last time with new components.
  • when the data evaluation unit 17 derives the index of the target data using the re-extracted components, it derives an index (second index) for each data from the re-extracted components and their evaluation values.
  • the recall rate or the precision rate may then be derived again from the second index and the first index, that is, the index obtained before the components were re-extracted. Thereby, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
  • after evaluating a component included in the reference data, the component evaluation unit 15 can re-evaluate that component by convolving the evaluation values of the other components into it, so that the evaluation values of the other components are reflected in its evaluation value.
  • because the relevance between the component and the other components is thereby reflected in the evaluation value of the component, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
  • the component evaluation unit 15 can update a pattern (for example, a combination of a component and an evaluation value of the component) at an arbitrary timing. That is, for example, the component evaluation unit 15 (a) at a timing when an update request is received from an administrative user who manages the system, (b) at a timing when a preset date and time arrives, and / or (c) The pattern can be updated at a timing when an input regarding the additional review is received from the user.
  • the user can confirm (confirmation review) the content of the target data from which the index is derived by the data evaluation unit 17, and can newly input classification information for the target data.
  • the classification information acquisition unit 12 may acquire newly input classification information, and the data classification unit 13 may combine the target data and the classification information and use the combination as new reference data.
  • the new reference data is stored in an arbitrary memory, and is fed back to the system, for example, at the timings (a) to (c).
  • the component extraction unit 14 extracts the component from the new reference data, and the component evaluation unit 15 evaluates the component.
  • if the component is already stored in the memory, the component storage unit 16 replaces its evaluation value with the new evaluation result (evaluation value) and stores it. If not, the component and the evaluation value are associated with each other and newly stored in the memory.
  • in this way, at an arbitrary timing (for example, timings (a) to (c) described above), the predictive coding unit 10 can update the learned pattern by re-evaluating the degree to which each of the plurality of components constituting at least a part of the data contributes to the combination of that data with the classification information.
  • the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
  • the predictive coding unit 10 can further include a management unit 18 (for example, the management unit 18 has the following functions (1) to (5)).
  • as an example, consider the case where the data evaluation unit 17 derives an index for each of a plurality of target data, and the user confirms each target data (for example, in descending order of the relevance to the predetermined case that the index indicates) and gives it classification information (confirmation review). At this time, the management unit 18 can display, in a visible manner, the distribution of the ratio that the target data associated with the classification information occupies among all the target data over the results of evaluating the plurality of target data, using a gradation corresponding to that ratio.
  • for example, when the data evaluation unit 17 derives a numerical value in the range of 0 to 10000 as the index, the management unit 18 classifies the target data into ranges obtained by dividing the index every 1000 (that is, 0 to 1000 as the first section, 1001 to 2000 as the second section, 2001 to 3000 as the third section, and so on; for example, target data whose index is 2500 is classified into the third section). For a given range, the management unit 18 can display the range with a gradation corresponding to the ratio of the target data associated with the classification information in that range (for example, the higher the ratio, the closer to a warm color, and the lower the ratio, the closer to a cold color).
  • the management unit 18 displays the other ranges in the same manner.
  • because the management unit 18 displays the distribution of the ratio across the ranges using gradation, it can suggest that the confirmation review by the user may be wrong when, for example, a range whose index indicates high relevance between the target data and the predetermined case (for example, the ninth section, where the index is 8001 to 9000) is nevertheless shown in a cold color tone. That is, the data analysis system further provides an additional effect that allows the user to grasp the distribution at a glance.
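  • a minimal sketch of this sectioning and of the per-section ratio follows. The function names are hypothetical, and the mapping from ratio to a hue value is an assumed stand-in for the warm-to-cold gradation; the text does not specify a color scale.

```python
import math
from collections import defaultdict

def section_of(index, width=1000):
    """Section number for an index: 0-1000 -> 1, 1001-2000 -> 2, and so on."""
    return max(1, math.ceil(index / width))

def ratio_per_section(indices, has_classification):
    """For each section, the ratio of target data that carries classification information."""
    totals = defaultdict(int)
    tagged = defaultdict(int)
    for index, flag in zip(indices, has_classification):
        section = section_of(index)
        totals[section] += 1
        tagged[section] += int(flag)
    return {section: tagged[section] / totals[section] for section in sorted(totals)}

def hue_for(ratio):
    """Assumed gradation: hue 240 (blue, cold) at ratio 0 down to hue 0 (red, warm) at ratio 1."""
    return 240.0 * (1.0 - ratio)

# Hypothetical indices and review results.
indices = [2500, 2900, 8500, 8700, 500]
tagged = [True, False, False, False, True]
for section, ratio in ratio_per_section(indices, tagged).items():
    print(section, ratio, hue_for(ratio))
```

  • in this toy data the ninth section has a ratio of zero and therefore a cold hue, which is the kind of mismatch that would prompt a second look at the confirmation review.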
  • the management unit 18 can visualize interrelationships (for example, hierarchical relationships, series relationships, data transmission and reception, and the like) between a plurality of subjects (for example, people, organizations, computers, and the like). For example, when an e-mail is transmitted from the first computer to the second computer, the management unit 18 can display, on a predetermined display device (for example, a display provided in the client device 10), a diagram in which a first circle representing the first computer and a second circle representing the second computer are connected by an arrow from the first circle to the second circle (for example, an arrow with a thickness corresponding to the size of the e-mail).
  • the management unit 18 can visualize the interrelationship according to the result evaluated by the data evaluation unit 17. For example, when the data evaluation unit 17 derives a numerical value in the range of 0 to 10000 as the index, the management unit 18 can display the diagram on the predetermined display device based only on the target data associated with an index belonging to a specified section (for example, an e-mail transmitted from the first computer to the second computer). Thereby, the data analysis system further exhibits an additional effect that allows the user to grasp the mutual relationship between a plurality of subjects at a glance.
  • the management unit 18 determines whether or not a first component representing a predetermined operation is included in the target data. When it determines that the first component is included, the management unit 18 can identify a second component representing the target of the predetermined operation.
  • the management unit 18 then associates meta information (attribute information), which indicates an attribute (property or feature) of the target data including the first component and the second component, with those two components.
  • the meta information is information indicating a predetermined attribute of data. For example, when the target data is an e-mail, the meta information may be the name of the person who sent the e-mail, the name of the person who received the e-mail, the e-mail address, the date and time of transmission or reception, and the like.
  • the management unit 18 associates the two components with the meta information and displays them on a predetermined display device (for example, a display provided in the client device 3).
  • for example, the management unit 18 can display on the display device a diagram in which a first circle representing the first component and a second circle representing the second component are connected by an arrow from the first circle to the second circle.
  • the data analysis system further exhibits an additional effect that the user can grasp the predetermined operation and the target at a glance.
  • the management unit 18 can extract, from a plurality of target data, data including components corresponding to subordinate concepts of a preselected concept, and can output content (for example, sentences, graphs, tables, and the like) that summarizes the plurality of target data.
  • the user selects several concepts according to the topic to be detected from the target data and registers the selected concepts in the management unit 18 in advance. For example, if the topic to be detected is “illegal” or “dissatisfied”, the concepts are divided into the five categories “behavior”, “emotion”, “nature / state”, “risk”, and “money”, and the user registers in the management unit 18 concepts such as “behavior” and “despise” for “behavior”, “feeling” and “being angry” for “emotion”, “dullness” and “bad attitude” for “nature / state”, “risk” and “danger” for “risk”, and “money paid for human labor” for “money”.
  • for each registered concept, the management unit 18 searches the reference data for components corresponding to subordinate concepts of the concept, associates the retrieved components with the concept, and stores them in an arbitrary memory (for example, the storage system 18). Then, the management unit 18 extracts the stored components from the target data, specifies the concepts associated with those components, and outputs a summary using the concepts.
  • for example, the management unit 18 extracts the concepts “system”, “sell”, and “do” from the text “monitoring system order” included in a certain e-mail, extracts the same concepts “system”, “sell”, and “do” from the text “accounting system introduction” included in another e-mail, and outputs “sell system” as a summary of these e-mails.
  • the management unit 18 can show, for example, a graph (for example, a pie chart) indicating the ratio of target data including the concept of “sell system” to all target data.
  • the data analysis system further exhibits an additional effect of allowing the user to grasp the entire image of the target data.
  • Topic clustering
  • the management unit 18 can cluster the plurality of target data according to topics (subjects) included in the plurality of target data.
  • the management unit 18 can cluster a plurality of target data using an arbitrary classification model (for example, K-means, support vector machine, spherical clustering, etc.).
  • the predictive coding unit 10 may further include a phase analysis unit 19 (not shown in FIG. 2).
  • the phase analysis unit 19 has the following functions (1) to (3), for example.
  • the phase analysis unit 19 can analyze phases, each of which indicates a stage in the progress of a predetermined case.
  • the flow by which the phase analysis unit 19 analyzes a phase is described below, using an example in which the above system is realized as a criminal investigation support system and the predetermined case is “collusion”.
  • collusion is known to progress in the order of a relationship building phase (the stage of building relationships with competitors), a preparation phase (the stage of exchanging information about competitors with competitors), and a competition phase (the stage of presenting prices to customers, obtaining feedback, and communicating). Therefore, the system administrator sets these three phases in the phase analysis unit 19.
  • the system learns a plurality of patterns, one for each phase, from a plurality of types of reference data prepared for the respective preset phases, and analyzes the target data based on each of the patterns. In this way it can specify, for example, which phase the organization under analysis is currently in.
  • the component evaluation unit 15 refers to the plurality of types of reference data prepared for the respective preset phases, evaluates the components included in each type of reference data, and stores each component in the memory in association with the result (evaluation value) of evaluating it, separately for each phase (that is, it learns a plurality of patterns corresponding to the respective phases).
  • the data evaluation unit 17 derives an index for each of a plurality of phases by analyzing the target data based on the pattern learned for each phase.
  • the phase analysis unit 19 determines, for each phase, whether or not the index satisfies a predetermined determination criterion (for example, a threshold value) set in advance for that phase (for example, whether or not the index exceeds the threshold value), and when it determines that the criterion is satisfied, increases the count value corresponding to that phase. Finally, the phase analysis unit 19 specifies the current phase based on the count values (for example, the phase having the maximum count value is set as the current phase).
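  • the counting step can be sketched as follows, assuming that the indices have already been derived per phase by the learned patterns. The function name, the phase labels used as dictionary keys, and the numbers are hypothetical.

```python
def current_phase(indices_by_phase, thresholds):
    """Count, per phase, the target data whose index exceeds that phase's
    threshold, and return the phase with the largest count plus all counts."""
    counts = {
        phase: sum(1 for index in indices if index > thresholds[phase])
        for phase, indices in indices_by_phase.items()
    }
    return max(counts, key=counts.get), counts

# Hypothetical indices of three target data items, derived once per phase.
indices_by_phase = {
    "relationship building": [0.9, 0.8, 0.2],
    "preparation": [0.7, 0.1, 0.3],
    "competition": [0.1, 0.2, 0.1],
}
thresholds = {"relationship building": 0.5, "preparation": 0.5, "competition": 0.5}
phase, counts = current_phase(indices_by_phase, thresholds)
print(phase, counts)
```

  • if two phases tie on the count, this sketch simply returns the one listed first; how ties should be resolved is not stated in the text.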
  • Phase progress prediction based on a prediction model
  • based on a model that can predict the progress of a predetermined action related to a predetermined case, the phase analysis unit 19 can predict and present the next action from the indices derived by evaluating a plurality of target data.
  • for example, the phase analysis unit 19 assumes a regression model (a model capable of predicting the progress) whose variables are the index derived for the first phase (for example, the relationship building phase) and the index derived for the second phase (for example, the preparation phase), and can predict the possibility (for example, the probability) of proceeding to the third phase (for example, the competition phase) based on regression coefficients optimized in advance. Thereby, the data analysis system further exhibits an additional effect that the result of predicting the progress of the predetermined action related to the predetermined case can be suggested to the user.
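  • as one possible instance of such a regression model, the sketch below uses logistic regression so that the output can be read as a probability. The choice of logistic regression, the function name, and the coefficient values are assumptions; the text only says that a regression model with coefficients optimized in advance is used.

```python
import math

def probability_of_next_phase(index_phase1, index_phase2, coefficients, intercept):
    """Logistic regression on the indices of the first and second phases.
    `coefficients` and `intercept` are assumed to have been fitted beforehand."""
    z = intercept + coefficients[0] * index_phase1 + coefficients[1] * index_phase2
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, pre-optimized regression coefficients.
coefficients = (0.8, 1.2)
intercept = -3.0
print(probability_of_next_phase(1.0, 1.0, coefficients, intercept))
print(probability_of_next_phase(2.0, 3.0, coefficients, intercept))
```

  • with positive coefficients, higher indices for the relationship building and preparation phases raise the predicted probability of reaching the competition phase.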
  • the phase analysis unit 19 can optimize the above-mentioned determination criterion (the predetermined criterion, for example a threshold, that is set in advance for each phase and used to specify the phase from the index derived by the data evaluation unit 17) according to given data.
  • the management unit 18 performs regression analysis on the relationship between the index derived for each of the plurality of target data and the ranking of the index (that is, the rank when the indices are arranged in ascending order), and the regression Based on the result of the analysis, the determination criterion can be reset (for example, the threshold value is changed).
  • the administrator of the system previously sets a ranking threshold for the ranking.
  • the phase analysis unit 19 fits the relationship between the index derived by the data evaluation unit 17 and the ranking of the index with a function y = e^(αx + β) (where e is the base of the natural logarithm, and α and β are parameters that take real values), for example by determining the parameters of the function with the method of least squares, and newly sets the index that the function gives for the ranking threshold as the determination criterion.
  • the data analysis system can optimize the determination criterion according to given data, and thus has the additional effect of improving the accuracy of data analysis.
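  • the fit of an exponential function to the index-versus-ranking relationship can be sketched as follows. It assumes that x is the ranking and y is the index, that all indices are positive, and that least squares is applied to the linearized form ln y = αx + β; the function names and sample values are hypothetical.

```python
import math

def fit_exponential(ranks, indices):
    """Least-squares fit of ln(y) = alpha * x + beta, i.e. y = e^(alpha * x + beta).
    Requires every index to be positive."""
    n = len(ranks)
    logs = [math.log(y) for y in indices]
    mean_x = sum(ranks) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(ranks, logs))
    alpha = sxy / sxx
    beta = mean_log - alpha * mean_x
    return alpha, beta

def index_at_rank(rank, alpha, beta):
    """Index that the fitted function gives for a ranking (for example, the ranking threshold)."""
    return math.exp(alpha * rank + beta)

# Hypothetical indices listed against their ranks.
ranks = [1, 2, 3, 4, 5]
indices = [4.5, 2.7, 1.6, 1.0, 0.6]
alpha, beta = fit_exponential(ranks, indices)
new_threshold = index_at_rank(3, alpha, beta)   # ranking threshold = 3
print(alpha, beta, new_threshold)
```

  • fitting in log space is a common simplification; a direct nonlinear least-squares fit on y would weight the points differently.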
  • Each unit included in the predictive coding unit 10 can have, for example, the following auxiliary functions (1) to (6).
  • the data evaluation unit 17 can evaluate target data with high resolution. That is, the data evaluation unit 17 not only derives an index for the target data but also divides the target data into a plurality of parts (for example, sentences or paragraphs (partial target data) included in the target data). Based on the learned pattern, each of the plurality of partial target data can be evaluated (an index is derived for the partial target data).
  • the data evaluation unit 17 can also integrate the plurality of indices derived for the respective partial target data and use the integrated index as the evaluation result of the target data (for example, when each index is derived as a numerical value, the maximum of the indices may be used as the integrated index for the target data, the average of the indices may be used as the integrated index, or a predetermined number of the indices, taken in descending order, may be summed and used as the integrated index).
  • the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
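  • the three ways of integrating partial indices mentioned above can be sketched as follows. The function name, the method labels, and the sample values are hypothetical.

```python
def integrate_indices(partial_indices, method="max", top_n=3):
    """Combine the indices of the partial target data (sentences, paragraphs)
    into one integrated index for the whole target data."""
    if method == "max":
        return max(partial_indices)
    if method == "mean":
        return sum(partial_indices) / len(partial_indices)
    if method == "top_n_sum":
        # Sum a predetermined number of indices, taken in descending order.
        return sum(sorted(partial_indices, reverse=True)[:top_n])
    raise ValueError(f"unknown method: {method}")

# Hypothetical indices derived for four paragraphs of one document.
paragraph_indices = [10, 40, 25, 5]
print(integrate_indices(paragraph_indices, "max"))
print(integrate_indices(paragraph_indices, "mean"))
print(integrate_indices(paragraph_indices, "top_n_sum", top_n=2))
```

  • taking the maximum highlights a document that contains a single strongly relevant passage, whereas the average favors documents that are relevant throughout.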
  • the component evaluation unit 15 learns a pattern from each set of reference data obtained by delimiting the reference data at predetermined time intervals (for example, the reference data of the first section, the reference data of the second section, and so on); that is, it obtains the components and the results of evaluating the components for each predetermined period.
  • the data evaluation unit 17 can evaluate the target data based on each of the patterns. That is, the data evaluation unit 17 can derive an index for the target data along the time series. Thereby, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
  • the data evaluation unit 17 can predict a future index based on the temporal change of the index. For example, the data evaluation unit 17 sets a model for time series analysis (for example, an autoregressive model, a moving average model, or the like) and, based on the indices derived within a predetermined period (for example, the past month) before new target data is obtained, can predict the next index that will be obtained when the new target data is evaluated. Thereby, the data analysis system can further exhibit an additional effect that an event that can occur in the future (for example, a risk that an undesirable situation occurs) can be presented to the user.
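  • as a minimal sketch of such a prediction, the following fits a first-order autoregressive model, x[t+1] = c + φ·x[t], to past indices by least squares and forecasts one step ahead. The model order, the function name, and the sample history are assumptions made for illustration.

```python
def predict_next_ar1(history):
    """One-step forecast with an AR(1) model x[t+1] = c + phi * x[t],
    with c and phi fitted by least squares on the past indices."""
    if len(history) < 3:
        return history[-1]              # too little data to fit; repeat the last index
    previous, following = history[:-1], history[1:]
    n = len(previous)
    mean_p = sum(previous) / n
    mean_f = sum(following) / n
    spread = sum((p - mean_p) ** 2 for p in previous)
    if spread == 0:
        return history[-1]              # constant history: nothing to fit
    phi = sum((p - mean_p) * (f - mean_f) for p, f in zip(previous, following)) / spread
    c = mean_f - phi * mean_p
    return c + phi * history[-1]

# Hypothetical indices derived over the past period, oldest first.
past_indices = [1200, 1350, 1500, 1700, 1950]
print(predict_next_ar1(past_indices))
```

  • a forecast that crosses a determination threshold before the actual index does is one way the risk of an undesirable situation could be flagged early.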
  • Case-by-case evaluation
  • when analyzing data whose nature changes depending on the type of case (for example, litigation-related documents whose contents change according to the type of lawsuit, such as violation of antitrust law, information leakage, or patent infringement), the component evaluation unit 15 learns a pattern from the reference data prepared for each case (for example, reference data related to violation of the Antimonopoly Act, reference data related to information leakage, and so on); that is, it obtains the components and the results of evaluating the components for each case. The data evaluation unit 17 can then evaluate the target data based on each of the patterns.
  • the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
  • the data evaluation unit 17 can analyze the structure of the target data and reflect the analysis result in the evaluation of the target data. For example, when the target data at least partially includes text, the data evaluation unit 17 can analyze the expression form of each sentence included in the text (for example, whether the sentence is in a positive form, a negative form, or the like) and reflect the result of the analysis in the index derived for the target data.
  • the positive form is an expression that affirms the subject (for example, “the dish is delicious”).
  • the negative form is an expression that denies the subject (for example, “the dish is not delicious”).
  • the depolarized form is an expression that neither affirms nor denies the subject (for example, “it was not that the food was not delicious”).
  • the data evaluation unit 17 can adjust the index according to the expression form. For example, when the data evaluation unit 17 derives a numerical value in a predetermined range as the index, it can adjust the index by adding, for example, “+α” for the positive form, “-β” for the negative form, and “+θ” for the depolarized form (α, β, and θ may each be an arbitrary numerical value). Further, when the data evaluation unit 17 detects that a sentence included in the target data is in the negative form, it can, for example, cancel the sentence so that the components included in the sentence are not used as a basis for deriving the index (the components are not considered).
  • the component evaluation unit 15 can increase or decrease the evaluation value of a component depending on, for example, whether a certain morpheme (component) is the subject, the object, or the predicate of the sentence. Thereby, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
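  • the adjustment by expression form can be sketched as follows, assuming that each sentence has already been labeled with its form by some upstream analysis. The function name, the form labels, the parameter names, and the default values are hypothetical, and the sign convention (add for positive, subtract for negative) is an assumption.

```python
def adjust_index(base_index, sentence_forms, alpha=1.0, beta=1.0, theta=0.0):
    """Adjust an index by the expression form of each sentence in the target data:
    +alpha per positive sentence, -beta per negative one, +theta per depolarized one."""
    adjustment = {"positive": +alpha, "negative": -beta, "depolarized": +theta}
    return base_index + sum(adjustment[form] for form in sentence_forms)

# Hypothetical document with a base index of 50 and four labeled sentences.
forms = ["positive", "positive", "negative", "depolarized"]
print(adjust_index(50.0, forms, alpha=2.0, beta=3.0, theta=0.5))
```

  • setting beta to zero and skipping negative sentences when scoring would correspond to the alternative in which a negative sentence is simply cancelled rather than penalized.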
  • the data evaluation unit 17 can derive the index for the target data in consideration of the correlation between a first component included in the target data and a second component included in the target data (co-occurrence; for example, the frequency with which both appear together).
  • for example, when a first keyword (first component) appears in the target data, the data evaluation unit 17 can derive the index based on the number of times a second keyword (second component) appears at a second position in the vicinity of the first position at which the first keyword appears (for example, a position included in a predetermined range including the first position).
  • the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
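  • counting appearances of a second keyword near a first keyword can be sketched as follows. The "predetermined range" is assumed here to be a fixed number of tokens on either side of the first keyword; the function name, window size, and sample text are hypothetical.

```python
def cooccurrence_count(tokens, first, second, window=3):
    """Number of times `second` appears within `window` tokens of an
    occurrence of `first` (counted once per occurrence of `first`)."""
    count = 0
    for i, token in enumerate(tokens):
        if token != first:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        count += sum(1 for j in range(start, end) if j != i and tokens[j] == second)
    return count

# Hypothetical tokenized sentence.
tokens = "we agreed the price before the meeting and the price stayed".split()
print(cooccurrence_count(tokens, "price", "meeting", window=3))
```

  • the resulting count could then be used to scale the contribution of the keyword pair to the index; how the count enters the index is not specified in the text.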
  • the data evaluation unit 17 can extract, from the target data, the emotion that the user who generated the target data has toward the predetermined case (that is, it can evaluate the emotion included in the target data).
  • for example, when data included in a website introducing products or services (for example, an online shopping site or a restaurant guide) is to be analyzed, the data evaluation unit 17 can evaluate the target data (for example, data included in other websites) based on the components included in comments (reviews) on the product or service (for example, keywords such as “good”, “fun”, “bad”, and “boring”) and on the evaluation given to the product or service (for example, “very good” or “good”).
  • the data evaluation unit 17 can increase or decrease the evaluation result according to, for example, emphasizing expressions (for example, “very”).
  • the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
  • Example of data analysis system processing data other than document data
  • the above description mainly assumes the case where the data analysis system analyzes document data, and the examples given are based on that assumption. However, the system can also analyze data other than document data (for example, audio data, image data, video data, and the like).
  • when analyzing audio data, the system may analyze the audio data itself, or may convert the audio data into document data by speech recognition and analyze the converted document data.
  • in the former case, the system can analyze the audio data by, for example, dividing the audio data into partial audio segments of a predetermined length to form components and identifying the partial audio segments using an arbitrary audio analysis method (for example, a hidden Markov model, a Kalman filter, or the like).
  • in the latter case, the system can recognize the speech using an arbitrary speech recognition algorithm (for example, a recognition method using a hidden Markov model) and analyze the recognized data by a procedure similar to the procedure described in the embodiment.
  • when analyzing image data, the system can analyze the image data by, for example, dividing the image data into partial images of a predetermined size to form components and identifying the partial images using an arbitrary image recognition method (for example, pattern matching, a support vector machine, a neural network, or the like).
  • when analyzing video data, the system can analyze the video data by, for example, dividing each of the plurality of frame images included in the video data into partial images of a predetermined size to form components and identifying the partial images using an arbitrary image recognition method (for example, pattern matching, a support vector machine, a neural network, or the like).
  • the control block of the data analysis system may be realized by a logic circuit (hardware) formed on an integrated circuit (IC chip) or the like, or may be realized by software using a CPU.
  • in the latter case, the system includes a CPU that executes the instructions of a program (the control program for the data analysis system), which is software that implements each function, a ROM (Read Only Memory) or a storage device (these are referred to as “recording media”) in which the program and various data are recorded so as to be readable by a computer (or the CPU), a RAM (Random Access Memory) into which the program is loaded, and the like.
  • the computer (or the CPU) reads the program from the recording medium and executes it.
  • as the recording medium, a “non-transitory tangible medium” such as a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, or the like can be used.
  • the program may be supplied to the computer via an arbitrary transmission medium (such as a communication network or a broadcast wave) that can transmit the program.
  • the present invention can also be realized in the form of a data signal embedded in a carrier wave in which the program is embodied by electronic transmission.
  • the above program can be implemented in any programming language, for example, a script language such as Python, ActionScript, or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5. Also, any recording medium that records the above program falls within the scope of the present invention.
  • because the system uses related components, which are related to the components included in the reference data, for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small.
  • the system can be implemented as an arbitrary system, for example, a discovery support system, a forensic system, an e-mail monitoring system, a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, a diagnosis support system, or the like), an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, a social media management system, or the like), an information leakage detection system, a project evaluation system, a marketing support system, an intellectual property assessment system, an unauthorized transaction monitoring system, a call center escalation system, or a credit investigation system.
  • the data analysis system of the present invention evaluates target data (for example, documents, e-mails, spreadsheet data, etc.) based on a predetermined evaluation standard (for example, whether or not the data should be submitted in the discovery procedure of the lawsuit in question); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small, and only the documents related to the case can be efficiently and reliably submitted to the court.
  • the data analysis system of the present invention evaluates target data (for example, documents, e-mails, spreadsheet data, etc.) based on a predetermined evaluation standard (for example, whether or not the data is evidence that can prove a criminal act); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small, and evidence that proves the criminal act can be efficiently and reliably extracted.
  • when the data analysis system of the present invention is realized as an e-mail monitoring system, the data analysis system evaluates target data (for example, e-mails, attached files, etc.) based on a predetermined evaluation standard (for example, whether or not the user who sent or received the e-mail has attempted fraud); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small, and signs of fraud such as information leakage and collusion can be efficiently and reliably detected.
  • when the data analysis system of the present invention is realized as a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, a diagnosis support system, etc.), the data analysis system evaluates target data (for example, electronic medical records, nursing records, patient diaries, etc.) based on predetermined evaluation criteria (for example, whether or not the patient will take a specific risky action, or whether or not a drug is effective); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small, and, for example, it is possible to efficiently, reliably, and objectively predict that a patient will fall into a dangerous state (for example, a fall) and evaluate the efficacy of drugs.
  • when the data analysis system of the present invention is realized as an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, a social media management system, etc.), the data analysis system evaluates target data (for example, messages posted by users to an SNS, recommended information posted on websites, profiles of users or groups, etc.) based on a predetermined evaluation standard (for example, the preference of the user and the preferences of other users); even if the number of reference data is small, the target data can be accurately evaluated, so that the system can efficiently and reliably display a list of other users who are likely to get along with the user, present restaurant information that suits the user's preferences, or issue a warning about a group that may cause harm to the user.
  • the data analysis system of the present invention evaluates target data (for example, e-mails, database access log information) based on a predetermined evaluation criterion (for example, whether or not the user who sent and received the e-mails is trying to commit fraud); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small, and a sign of information leakage can be efficiently and reliably found.
  • when the data analysis system of the present invention is realized as an information asset utilization system (project evaluation system), the data analysis system dynamically extracts, from the information assets (target data) held by companies and experts, information that is useful for a project according to the situation of the project; by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small. For example, the system can efficiently and reliably (1) reuse information on products developed in the past according to the requirements of a new development, which improves efficiency at development sites where a shorter development period is desired, and (2) identify useful information assets based on the expertise of skilled engineers.
  • the data analysis system of the present invention evaluates target data (for example, company or individual profiles, product information, etc.) based on a predetermined evaluation standard (for example, whether the product is aimed at men or women, or whether consumers have a favorable impression of the product), and uses related components that are related to the components included in the reference data for the evaluation of the target data.
  • the data analysis system of the present invention evaluates target data (for example, patent publications, documents summarizing inventions, academic papers, etc.) based on a predetermined evaluation standard (for example, whether a patent publication can be used as evidence to reject or invalidate a given patent).
  • the data analysis system, for example, acquires as reference data (i) combinations of each claim of a patent to be invalidated with a “Related” label (classification information) and (ii) combinations of each claim of an unrelated patent with a “Non-Related” label (classification information), learns a pattern from the reference data, and calculates an index for a large number of documents (target data). For example, the system calculates an index for each paragraph of a patent publication and adds a predetermined number of the highest paragraph indices to obtain the index of the patent publication, and thereby evaluates the target data.
  • the data analysis system of the present invention evaluates target data (for example, e-mails, financial transaction information, bid information, etc.) based on a predetermined evaluation criterion (for example, whether the user who sent and received the e-mails is going to conduct fraudulent transactions); by using related components that are related to the components included in the reference data for the evaluation of the target data, the system can find a sign of fraud such as cartels and collusion.
  • the data analysis system of the present invention evaluates target data (for example, telephone call history, recorded voice, etc.) based on a predetermined evaluation criterion (for example, past history).
  • when the data analysis system of the present invention is implemented as a credit check system, the data analysis system evaluates target data (for example, company profiles, information about company performance, information about stock prices, press releases, etc.) based on predetermined evaluation criteria (for example, whether the company will go bankrupt, whether the company will grow, etc.); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated even if the number of reference data is small, and, for example, corporate growth or bankruptcy can be predicted efficiently and reliably.
  • the data analysis system of the present invention evaluates target data (for example, data acquired from in-vehicle sensors, cameras, microphones, etc.) based on a predetermined evaluation standard (for example, whether or not the data is information that an expert driver pays attention to while driving); by using related components that are related to the components included in the reference data for the evaluation of the target data, the target data can be accurately evaluated, and, for example, useful information that can make driving safe and comfortable can be extracted automatically, efficiently, and reliably.
  • the data analysis system of the present invention uses target data (for example, the market price of the stock price) as a predetermined evaluation standard (for example, a stock price).
  • preprocessing (for example, extracting an important part from the data and treating only that part as the analysis target) may be applied, or the mode of displaying the data analysis result may be changed. Those skilled in the art will understand that a variety of such variations can exist, and all such variations fall within the scope of the present invention.
  • a data analysis system includes a memory, an input control device, and a controller. The controller generates an index that ranks a plurality of target data; the index corresponds to the relevance between each target data and a predetermined case and changes based on an input given by a user via the input control device. The memory stores the plurality of target data at least temporarily.
  • the input control device presents sample data for the target data to the user, accepts input of classification information from the user (the classification information being associated with the sample data based on the input in order to classify the sample data), and provides the combination of the sample data and the classification information received from the user to the controller as reference data.
  • the controller obtains a plurality of the reference data; extracts a first component from the plurality of reference data, the first component constituting at least a part of the reference data; evaluates the degree of contribution of the first component to the combination; extracts, from at least one of the plurality of target data, a second component having relevance to the evaluated first component, the second component constituting at least a part of the target data; evaluates the second component; and evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
  • therefore, according to the data analysis system, because the second component can supplement the evaluation of the target data, the data group can be analyzed accurately regardless of the amount of the partial data with respect to the data group.
  • the controller determines the first component from among the plurality of components included in the reference data, based on how highly each of those components is evaluated.
  • the controller determines, as the first component, the component having the highest evaluation among the plurality of components included in the target data. An additional effect that a component having a high evaluation value can be selected as the second component is achieved.
  • the controller sets the association between the second component and the first component based on a predetermined criterion, and sets the relationship to the criterion.
  • the controller is configured to extract the second component based on at least one of the following relationships between the second component and the first component: a co-occurrence relationship, a similar-word relationship, and a relationship of having common meta information. The second component related to the first component can thereby be analyzed.
  • the controller evaluates the second component according to the frequency with which the second component appears, in the predetermined relationship, in the plurality of target data, and evaluates the plurality of target data based on the evaluation result of the second component. This achieves the effect that the second component can be evaluated accurately and reliably.
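  • a control method for a data analysis system that analyzes data includes: a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data and a predetermined case and changing based on an input from a user; a second step of storing the plurality of target data at least temporarily; a third step of presenting sample data for the target data to the user; a fourth step of accepting input of classification information from the user, the classification information being associated with the sample data based on the input in order to classify the sample data; a fifth step of providing the combination of the sample data and the classification information received from the user as reference data; a sixth step of acquiring a plurality of the reference data; a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least a part of the reference data; an eighth step of evaluating the degree to which the first component contributes to the combination; a ninth step of extracting, from at least one of the plurality of target data, a second component having relevance to the evaluated first component, the second component constituting at least a part of the target data; a tenth step of evaluating the second component; and an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.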
  • a data analysis system control program is a program that causes a computer to execute each step included in the control method of the data analysis system.
  • a recording medium is a recording medium that records the control program of the data analysis system. According to the control program and the recording medium, the second component can supplement the evaluation of the target data, so the data group can be analyzed accurately regardless of the amount of the partial data with respect to the data group.
  • a data analysis system is, for example, a data analysis system that evaluates target data, and includes a memory, an input control device, and a controller.
  • the controller evaluates a plurality of target data; the evaluation corresponds, for example, to the relevance between each target data and a predetermined case, and generates an index that enables the plurality of target data to be ranked. The index can change based on an input given by a user via the input control device.
  • the memory stores, for example, at least temporarily the plurality of target data evaluated by the controller.
  • the input control device, for example, allows the user to give an input that causes the controller to rank the plurality of target data; the order of the plurality of target data changes, for example, according to the index, which changes based on the input.
  • the input includes, for example, a classification of the relationship between reference data, which is different from the plurality of target data, and the predetermined case. The classification is, for example, divided into a plurality of classification information according to the content of the reference data, and at least one of the plurality of classification information is given to the reference data by the input.
  • the reference data is presented to the user, and the combination of the reference data and the at least one classification information given to it by the user's input is provided to the controller.
  • the controller, for example, evaluates the degree to which each of a plurality of components included in the reference data contributes to the combination provided by the input control device, and thereby extracts from the reference data a pattern that characterizes the reference data according to the classification information given by the input. Based on the extracted pattern, the controller determines the index by evaluating the relevance between the target data and the predetermined case, sets the determined index in the target data, ranks the plurality of target data according to the index, and notifies the user of the plurality of target data arranged in order.
  • the present invention can be widely applied to arbitrary computers such as a personal computer, a server device, a workstation, and a mainframe, and is particularly applicable to an artificial intelligence system.
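
The patent-assessment example in the list above describes computing an index for each paragraph of a patent publication and adding a predetermined number of the highest values to obtain the index of the whole publication. A minimal sketch of that aggregation step follows. The function name `publication_index`, the parameter `top_n`, and the sample scores are hypothetical; how the per-paragraph indices themselves are computed is not shown here.

```python
from typing import Dict, List, Sequence


def publication_index(paragraph_scores: Sequence[float], top_n: int = 3) -> float:
    """Sum the top_n highest paragraph indices (hypothetical aggregation helper)."""
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    # Sort in descending order and keep only the best top_n paragraphs.
    return sum(sorted(paragraph_scores, reverse=True)[:top_n])


def rank_publications(scores: Dict[str, List[float]], top_n: int = 3) -> List[str]:
    """Return publication ids ordered from highest to lowest aggregated index."""
    return sorted(scores, key=lambda pub: publication_index(scores[pub], top_n), reverse=True)


# Illustrative per-paragraph indices (made-up values, not from the source).
example_scores = {
    "pub_A": [0.25, 1.0, 0.5, 0.75],
    "pub_B": [0.5, 0.25, 0.125],
}
ranking = rank_publications(example_scores, top_n=2)
```

Summing only the top few paragraphs keeps a long publication with many weakly related paragraphs from outranking a short one with a few strongly related paragraphs; the choice of `top_n` is a tuning assumption.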

Abstract

A controller for extracting a first constituent element from a plurality of reference data, evaluating the first constituent element, and on the basis of this evaluation, extracting a second constituent element from one or more of a plurality of target data, evaluating the second constituent element, generating an indicator on the basis of the evaluation results for the first constituent element and the evaluation results for the second constituent element, and on the basis of the generated indicator, evaluating the relationship between the plurality of target data and a prescribed matter.

Description

Data analysis system, control method, control program, and recording medium therefor
The present invention relates to a data analysis system and the like that analyze data, and is applicable to, for example, an artificial intelligence system that analyzes big data.
As a result of the computerization of society driven by the rapid development of computers, an enormous amount of information (big data) has become widely and closely related to the activities of companies and individuals. Recently, therefore, particular importance has been placed on the need to accurately sort out desired information from big data.
As an approach for extracting desired information from big data, a system is known that applies data analysis by a reviewer to part of the data sampled from a data group and uses the analysis result to automatically analyze the remaining data (for example, JP 2013-182338 A).
JP 2013-182338 A
In the above data analysis system, if the number of data sampled from the data group is small, the information needed to analyze the large remainder of the data is insufficient, and the data group cannot be analyzed accurately. Conversely, increasing the number of sampled data places an excessive burden on the reviewer, and adverse effects such as increases in the time and cost required for data analysis cannot be ignored.
The present invention has been made in view of this problem, and its object is to provide a data analysis system, and related technology, capable of accurately analyzing a data group regardless of the amount of the partial data with respect to the data group.
In order to achieve the above object, a data analysis system according to a first invention is a data analysis system for analyzing data, comprising a memory, an input control device, and a controller. The controller generates an index for ranking a plurality of target data; the index corresponds to the relevance between each target data and a predetermined case and changes based on an input given by a user via the input control device. The memory stores the plurality of target data at least temporarily. The input control device presents sample data for the target data to the user, accepts input of classification information from the user, the classification information being associated with the sample data based on the input in order to classify the sample data, and provides the combination of the sample data and the classification information received from the user to the controller as reference data. The controller acquires a plurality of the reference data; extracts a first component from the plurality of reference data, the first component constituting at least a part of the reference data; evaluates the degree to which the first component contributes to the combination; extracts, from at least one of the plurality of target data, a second component having relevance to the evaluated first component, the second component constituting at least a part of the target data; evaluates the second component; and evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
In order to achieve the above object, a control method according to a second invention, for a data analysis system that analyzes data, includes: a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data and a predetermined case and changing based on an input from a user; a second step of storing the plurality of target data at least temporarily; a third step of presenting sample data for the target data to the user; a fourth step of accepting input of classification information from the user, the classification information being associated with the sample data based on the input in order to classify the sample data; a fifth step of providing the combination of the sample data and the classification information received from the user as reference data; a sixth step of acquiring a plurality of the reference data; a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least a part of the reference data; an eighth step of evaluating the degree to which the first component contributes to the combination; a ninth step of extracting, from at least one of the plurality of target data, a second component having relevance to the evaluated first component, the second component constituting at least a part of the target data; a tenth step of evaluating the second component; and an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
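
The core of the first and second inventions (extract and evaluate a first component from the reference data, extract and evaluate a related second component from the target data, then generate a ranking index) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method defined in the specification: components are taken to be whitespace-separated words, the contribution of a first component is a simple count difference between “Related” and “Non-Related” reference data, relevance is approximated by co-occurrence within the same target data, and the `decay` factor is a hypothetical parameter. All function names are invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: a "component" is a lowercase whitespace-separated word.
    return text.lower().split()


def evaluate_first_components(reference_data: List[Tuple[str, str]]) -> Dict[str, int]:
    """Weight each word by (#Related documents) - (#Non-Related documents); keep positives."""
    weights: Counter = Counter()
    for text, label in reference_data:
        sign = 1 if label == "Related" else -1
        for token in set(tokenize(text)):
            weights[token] += sign
    return {token: w for token, w in weights.items() if w > 0}


def evaluate_second_components(first: Dict[str, int], target_data: List[str],
                               decay: float = 0.5) -> Dict[str, float]:
    """Score words that co-occur with a first component in the target data."""
    first_set = set(first)
    cooccurrence: Dict[str, Counter] = defaultdict(Counter)
    for text in target_data:
        tokens = set(tokenize(text))
        for f in tokens & first_set:
            for s in tokens - first_set:
                cooccurrence[f][s] += 1
    second: Dict[str, float] = {}
    for f, counter in cooccurrence.items():
        for s, n in counter.items():
            # Assumed scoring: discounted first-component weight times co-occurrence frequency.
            score = decay * first[f] * n / len(target_data)
            second[s] = max(second.get(s, 0.0), score)
    return second


def rank_targets(reference_data: List[Tuple[str, str]],
                 target_data: List[str]) -> List[Tuple[float, str]]:
    """Return (index, text) pairs sorted from most to least relevant."""
    first = evaluate_first_components(reference_data)
    second = evaluate_second_components(first, target_data)

    def index(text: str) -> float:
        return sum(first.get(t, 0.0) + second.get(t, 0.0) for t in set(tokenize(text)))

    return sorted(((index(text), text) for text in target_data), reverse=True)


# Made-up example data for illustration only.
reference_data = [
    ("price fixing agreement with competitor", "Related"),
    ("lunch menu for friday", "Non-Related"),
]
target_data = [
    "competitor meeting at hotel",
    "hotel booking confirmed",
    "friday lunch menu",
]
for score, text in rank_targets(reference_data, target_data):
    print(f"{score:.3f}  {text}")
```

In this toy data, the second target contains no word from the reference data, yet it still receives a non-zero index because “hotel” co-occurs with “competitor” elsewhere in the target data. That is the supplementing role the text attributes to the second component when reference data is scarce.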
Furthermore, in order to achieve the above object, a third invention is a control program for a data analysis system that causes a computer to execute each step included in the control method of the data analysis system, and a fourth invention is a computer-readable recording medium on which the control program of the data analysis system is recorded.
The data analysis system, the control method, the control program, and the recording medium according to one aspect of the present invention have the effect that a data group can be analyzed accurately regardless of the amount of the partial data with respect to the data group.
A block diagram showing an example of the hardware configuration of a data analysis system according to one aspect of the present invention. A functional block diagram showing an example of the predictive coding function of the data analysis system. A flowchart showing an example of the processing executed by the predictive coding unit of the data analysis system. A flowchart showing an example of the additional processing executed by the predictive coding unit of the data analysis system.
Next, embodiments of the present invention will be described with reference to the accompanying drawings.
[Data analysis system configuration]
FIG. 1 is a block diagram illustrating an example of the hardware configuration of a data analysis system (hereinafter sometimes abbreviated simply as the “system”) according to the present embodiment. The system includes, for example, an arbitrary recording medium (for example, a memory or a hard disk) capable of storing data (including digital data and analog data) and a controller (for example, a CPU: Central Processing Unit) capable of executing a control program stored in the recording medium. The system can be realized as a computer (for example, a personal computer, a server device, a client device, a workstation, or a mainframe) that analyzes data stored at least temporarily in the recording medium, or as a computer system in which a plurality of computers operate in an integrated manner to perform data analysis (for example, a server device that executes the main processing for data analysis, a client device used by the user, and a file server that stores the data to be analyzed). The present embodiment mainly describes an example in which the system is realized by the latter (FIG. 1).
In the present embodiment, “data” may be any data expressed in a format that the computer can process. The data may be, for example, unstructured data whose structure definition is at least partly incomplete, and broadly includes document data that at least partly contains text written in natural language (e.g., e-mail (including attachments and header information), technical documents (broadly including documents that explain technical matters, such as academic papers, patent publications, product specifications, and design drawings), presentation materials, spreadsheets, financial statements, meeting materials, reports, sales materials, contracts, organization charts, business plans, company analysis information, electronic medical records, web pages, blogs, and comments posted to social network services), audio data (e.g., recordings of conversation or music), image data (e.g., data composed of a plurality of pixels or vector information), and video data (e.g., data composed of a plurality of frame images).
In the present embodiment, “reference data” may be, for example, data with which a user has associated classification information (classified data, that is, a combination of data and classification information). “Target data”, by contrast, may be data with which no classification information is associated (unclassified data that has not been presented to the user as reference data and that the user has not classified). The “classification information” may be an identification label used to classify reference data. For example, it may sort the reference data into three classes, such as a “Related” label indicating that the reference data is related to a predetermined case, a “High” label indicating that the two are particularly related, and a “Non-Related” label indicating that the two are not related. It may also sort the reference data into five classes, such as “good”, “somewhat good”, “normal”, “somewhat bad”, and “bad”.
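The distinction between reference data and target data can be sketched as a small data model. This is only an illustrative sketch: the names `Label`, `Datum`, and `split_reference_and_target` are hypothetical and do not appear in the specification, and the three-class labeling is just one of the labelings mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Label(Enum):
    """Hypothetical three-class classification information."""
    HIGH = "High"
    RELATED = "Related"
    NON_RELATED = "Non-Related"


@dataclass
class Datum:
    doc_id: str
    content: str
    label: Optional[Label] = None  # None means no classification information

    @property
    def is_reference(self) -> bool:
        # Reference data = data combined with classification information.
        return self.label is not None


def split_reference_and_target(data: List[Datum]) -> Tuple[List[Datum], List[Datum]]:
    """Separate classified (reference) data from unclassified (target) data."""
    reference = [d for d in data if d.is_reference]
    target = [d for d in data if not d.is_reference]
    return reference, target
```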
The “predetermined case” broadly includes any subject whose relevance to data is evaluated by the system, and its scope is not limited. For example, the predetermined case may be: the lawsuit for which discovery procedures are required, when the system is realized as a discovery support system; a crime under investigation, when realized as a criminal investigation support system; misconduct (e.g., information leakage or collusion), when realized as an e-mail monitoring system; a case or matter relating to medicine, when realized as a medical application system (e.g., a pharmacovigilance support system, clinical trial streamlining system, medical risk hedging system, fall prediction (fall prevention) system, prognosis prediction system, or diagnosis support system); a case or matter relating to the Internet, when realized as an Internet application system (e.g., a smart mail system, information aggregation (curation) system, user monitoring system, or social media management system); a project carried out in the past, when realized as a project evaluation system; a product or service to be marketed, when realized as a marketing support system; intellectual property to be evaluated, when realized as an intellectual property evaluation system; a fraudulent financial transaction, when realized as a fraudulent transaction monitoring system; a past response case, when realized as a call center escalation system; the subject of a credit check, when realized as a credit check system; matters relating to driving a vehicle, when realized as a driving support system; or sales results, when realized as a sales support system.
As illustrated in FIG. 1, the data analysis system 1 according to the present embodiment may include, for example, a server device 2 capable of executing the main processing of data analysis, one or more client devices 3 capable of executing processing related to the data analysis, a storage system 5 including a database 4 that records data and the evaluation results for the data, and a management computer 6 that provides the client devices 3 and the server device 2 with management functions for data analysis.
The client device 3 can present part of the plurality of target data to the user as sample data prior to classification. The user can thereby enter input for evaluating and classifying the sample data via the client device 3. The server device 2 can randomly sample the plurality of target data, extract a predetermined number of sample data, and provide them to a given client device. The sample data may also be data that is not included in the target data under analysis but belongs to a data group whose predetermined case is the same as or similar to that of the target data. As hardware resources, the client device 3 may include, for example, a memory, a controller, a bus, an input/output interface (e.g., a keyboard and a display), and a communication interface (which communicably connects the client device 3 to the server device 2 and the management computer 6 by communication means using a predetermined network).
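The random sampling step described above can be sketched as follows. This is a minimal illustration, assuming target data are addressed by identifiers; the function name `draw_sample` and the `seed` parameter are hypothetical and are not taken from the specification.

```python
import random
from typing import Iterable, List, Optional


def draw_sample(target_ids: Iterable[str], sample_size: int,
                seed: Optional[int] = None) -> List[str]:
    """Randomly pick up to `sample_size` target data for user review."""
    pool = list(target_ids)
    rng = random.Random(seed)          # a seed makes the draw reproducible
    k = min(sample_size, len(pool))    # never ask for more than exists
    return rng.sample(pool, k)         # sampling without replacement
```

The sample drawn this way would then be presented on a client device so that the user can attach classification information to each item.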
Based on sample data to which classification information has been attached, that is, a combination of sample data and classification information (referred to as “reference data”), the server device 2 learns a pattern from the reference data and, based on that pattern, evaluates the relevance between target data and a predetermined case. Here, “pattern” broadly refers to, for example, abstract rules, meanings, concepts, styles, distributions, and samples contained in the data, and is not limited to a so-called “specific pattern”. That is, based on the learned pattern, the server device 2 can evaluate the relevance between target data and a lawsuit, between target data and a criminal investigation, between target data and a user's preferences, or between target data and any other event. Like the client device 3, the server device 2 may include, as hardware resources, for example, a memory, a controller, a bus, an input/output interface, and a communication interface.
The management computer 6 executes predetermined management processing for the client device 3, the server device 2, and the storage system 5. Like the client device 3, the management computer 6 may include, as hardware resources, for example, a memory, a controller, a bus, an input/output interface, and a communication interface. The memories of the client device 3, the server device 2, and the management computer 6 each store an application program capable of controlling the respective device. When each controller executes its application program, the application program (software resource) and the hardware resources cooperate, and each device operates.
The storage system 5 may be composed of, for example, a disk array system, and may include the database 4, which records data and the results of evaluating and classifying the data. The server device 2 and the storage system 5 are connected by a DAS (Direct Attached Storage) scheme or by a SAN (Storage Area Network).
The hardware configuration shown in FIG. 1 is merely an example, and the system can also be realized by other hardware configurations. For example, part or all of the processing executed in the server device 2 may be executed in the client device 3, part or all of that processing may be executed in the server device 2, or the storage system 5 may be built into the server device 2. Those skilled in the art will understand that many hardware configurations can realize the system, and the system is not limited to one specific configuration (e.g., the configuration illustrated in FIG. 1).
[Predictive coding function of data analysis system 1]
FIG. 2 is a functional block diagram showing an example of the predictive coding function realized by the data analysis system according to the present embodiment.
(Basic configuration of the predictive coding function)
As illustrated in FIG. 2, the system can include a predictive coding unit 10. Based on a small amount of manually classified data (the reference data described above), the predictive coding unit 10 evaluates a large amount of data (target data with which no classification information is associated, e.g., big data) so that significant information can be extracted from it.
The predictive coding unit 10 can include, for example, a data acquisition unit 11, a classification information acquisition unit 12, a data classification unit 13, a component extraction unit 14, a component evaluation unit 15, a component storage unit 16, and a data evaluation unit 17.
The data acquisition unit 11 acquires data from an arbitrary storage resource (e.g., the database 4, a web server on the Internet, or a mail server on an intranet). The data acquisition unit 11 provides all data subject to data analysis to the component extraction unit 14 as target data. It also randomly samples the target data to obtain a predetermined number of sample data and provides them to the data classification unit 13.
The classification information acquisition unit 12 acquires, from an arbitrary input device (e.g., the client device 3), the classification information that the user has entered for each sample data, and outputs the classification information to the data classification unit 13.
The data classification unit 13 combines the plurality of sample data sent from the data acquisition unit 11 with the classification information entered for each sample data and received from the classification information acquisition unit 12, and outputs the combinations to the component extraction unit 14 as a plurality of reference data.
The component extraction unit 14 extracts, from the plurality of reference data received from the data classification unit 13, the components that constitute the reference data. Here, a “component” may be partial data constituting at least part of the data. For example, it may be a morpheme, keyword, sentence, paragraph, and/or metadata (e.g., e-mail header information) constituting a document; partial audio, volume (gain) information, and/or timbre information constituting audio; a partial image, partial pixels, and/or luminance information constituting an image; or a frame image, motion information, and/or three-dimensional information constituting video. The component extraction unit 14 outputs the extracted components and the classification information corresponding to them to the component evaluation unit 15. The component extraction unit 14 also extracts, from the target data input from the data acquisition unit 11, the components that constitute the target data, and outputs them to the data evaluation unit 17.
The component evaluation unit 15 evaluates the components input from the component extraction unit 14. For example, for each of the plurality of components constituting at least part of the reference data, it evaluates the degree to which the component contributes to the above combination (in other words, the distribution with which the component appears depending on the classification information). More specifically, the component evaluation unit 15 calculates an evaluation value for each component by evaluating the component using, for example, the transmitted information amount (e.g., an amount of information calculated from a predetermined defining formula using the appearance probability of the component and the appearance probability of the classification information). The component evaluation unit 15 can thereby learn the pattern contained in the reference data. The component evaluation unit 15 outputs each component and its evaluation value to the component storage unit 16.
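As a rough illustration of evaluating a component from appearance probabilities, the sketch below assumes that the transmitted information amount corresponds to the mutual information between two binary variables: whether a component appears in a reference datum, and whether that datum carries a “Related” or “High” label. The specification does not give the defining formula, so this choice, the function name, and the input layout are all assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Set, Tuple


def transmitted_information(
    reference_data: Sequence[Tuple[Set[str], str]],
    component: str,
    positive_labels: Iterable[str] = ("Related", "High"),
) -> float:
    """Mutual information (in bits) between component presence and label.

    `reference_data` is a sequence of (components, label) pairs.
    """
    positive = set(positive_labels)
    n = len(reference_data)
    joint = Counter()
    for components, label in reference_data:
        joint[(component in components, label in positive)] += 1

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Marginal probabilities recovered from the joint counts.
        p_x = sum(c for (a, _), c in joint.items() if a == x) / n
        p_y = sum(c for (_, b), c in joint.items() if b == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi
```

Under this assumption, a component that appears only in related data yields a high value, while a component that appears everywhere yields zero, which matches the idea that the value reflects how the component's appearance depends on the classification information.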
The component storage unit 16 associates each component input from the component evaluation unit 15 with its evaluation value and stores both in an arbitrary memory (e.g., the storage system 5).
The data evaluation unit 17 reads, from an arbitrary memory (e.g., the database 4 of the storage system 5), the evaluation values associated with the components input from the component extraction unit 14, and evaluates the target data based on those evaluation values. More specifically, the data evaluation unit 17 can derive an index for the target data (which may be, for example, a numerical value, character, and/or symbol that allows the target data to be ranked) by, for example, summing the evaluation values associated with the components constituting at least part of the target data. A preferred form of the index is a score obtained by summing the evaluation values. The data evaluation unit 17 associates the target data with the index and stores both in an arbitrary memory (e.g., the storage system 5).
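The summation described above can be sketched as a lookup and a sum. In this sketch the stored evaluation values are represented as a plain dictionary, components without a stored value contribute zero, and each occurrence of a component is counted; these are assumptions for illustration, as are the names `score` and `rank`.

```python
from typing import Dict, List, Mapping, Sequence, Tuple


def score(components: Sequence[str], weights: Mapping[str, float]) -> float:
    """Sum the stored evaluation values of a datum's components."""
    # Components with no stored evaluation value contribute nothing.
    return sum(weights.get(c, 0.0) for c in components)


def rank(targets: Dict[str, Sequence[str]],
         weights: Mapping[str, float]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs ordered from highest to lowest score."""
    scored = [(doc_id, score(comps, weights)) for doc_id, comps in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A target datum that shares no component with the stored values receives a score of zero in this sketch, which illustrates why such data cannot be meaningfully ranked.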
The component evaluation unit 15 can select components, evaluate them repeatedly, and revise their evaluation values until the evaluation of data labeled “Related” or “High” exceeds the evaluation of data without these labels. The component evaluation unit 15 can thereby find components that appear in the plurality of reference data carrying “Related” or “High” classification information and that influence the combination of reference data and label. The component evaluation unit 15 calculates the evaluation value wgt of a component using, for example, the following formula.
[Equation 1: image JPOXMLDOC01-appb-M000001]
Here, wgt_{i,0} denotes the initial evaluation value of the i-th component before evaluation, and wgt_{i,L} denotes the evaluation value of the i-th component after the L-th evaluation. γ_L denotes the evaluation parameter in the L-th evaluation, and θ denotes the threshold used in the evaluation. The component evaluation unit 15 can thereby evaluate, for example, that the larger the calculated transmitted information amount, the more strongly the component represents the characteristics of the predetermined classification information. The component evaluation unit 15 can use the midpoint between the lowest index value among reference data labeled “Related” and the highest index value among reference data labeled “Non-Related” as the threshold (predetermined reference value) for automatically determining whether to assign “Related” to target data.
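The automatic “Related” threshold described in the preceding paragraph can be sketched as the midpoint of two extreme scores. The function name and the (score, label) input layout are hypothetical; only the midpoint rule itself comes from the text.

```python
from typing import Iterable, Tuple


def related_threshold(scored_reference: Iterable[Tuple[float, str]]) -> float:
    """Midpoint between the lowest 'Related' score and the highest
    'Non-Related' score among the reference data."""
    pairs = list(scored_reference)
    related = [s for s, label in pairs if label == "Related"]
    non_related = [s for s, label in pairs if label == "Non-Related"]
    if not related or not non_related:
        raise ValueError("need at least one datum of each label")
    return (min(related) + max(non_related)) / 2
```

Target data whose score exceeds the returned value could then be treated as candidates for the “Related” label without user input.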
The data evaluation unit 17 then uses the evaluation values of the components to calculate a score for each of the plurality of target data and each of the plurality of reference data, for example from the following formula. The score is an index that quantitatively evaluates how strongly the data is tied to a classification code.
[Equation 2: image JPOXMLDOC01-appb-M000002]
m_i: appearance frequency of the i-th component
wgt_i: evaluation value of the i-th component
Each configuration written above as “*** unit” is a functional configuration realized when the controller of the data analysis system executes a program (the data analysis program), so “*** unit” may also be read as “*** processing” or “*** function”. A “*** unit” can also be replaced by hardware resources. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware only, by software only, or by a combination of the two, and they are not limited to any one of these.
[Processing executed by the predictive coding unit 10]
FIG. 3 is a flowchart showing an example of processing executed by the predictive coding unit 10 of the data analysis system according to the present embodiment.
First, the data acquisition unit 11 acquires sample data from an arbitrary memory (step 10; hereinafter, “step” is abbreviated as “S”). Next, the classification information acquisition unit 12 acquires the classification information entered by the user from an arbitrary input device (S11). Next, the data classification unit 13 classifies the data by combining the data with the classification information, thereby forming reference data (S12), and the component extraction unit 14 extracts from the reference data the components that constitute it (S13). The component evaluation unit 15 then evaluates the components (S14), and the component storage unit 16 associates each component with its evaluation value and stores both in an arbitrary memory (S15). The processing of S10 to S15 is called the “learning phase” (the phase in which the system learns a pattern).
The data acquisition unit 11 acquires target data from an arbitrary memory (S16). The component extraction unit 14 extracts from the target data the components that constitute it (S17). The data evaluation unit 17 reads the evaluation values associated with those components from an arbitrary memory and evaluates the target data based on the evaluation values (S18). The processing of S16 to S18 is called the “evaluation phase” (the phase in which the system evaluates target data based on the pattern).
Note that none of the processes in the learning phase is essential to the system. For example, a memory that stores components in association with their evaluation values may be provided in advance, and the predictive coding unit 10 may evaluate target data based on the components and evaluation values stored in that memory.
[Special processing executed by the predictive coding unit 10 (component addition processing)]
Because the predictive coding function described above evaluates target data based on the evaluation results of components extracted from the reference data, target data cannot be evaluated unless it contains those components. This tends to occur when the number (amount) of reference data is small relative to the number (amount) of target data, so that the components extracted from the reference data are correspondingly insufficient. When too few components are extracted from the reference data, accurately evaluating a large number of target data becomes difficult even if scores can be assigned; for example, the distribution of the scored target data becomes skewed.
One remedy is to increase the number of reference data and thereby increase the variety of components extracted from it. However, it is hard to judge in the first place whether the number of reference data is sufficient relative to the number of target data. Moreover, increasing the number of reference data indiscriminately overloads the reviewers and adds time, effort, and cost, which may undermine the very advantage of predictive coding.
For this reason, the data analysis system according to the present embodiment can, depending on the evaluation results for the reference data, supplement the evaluation of target data with new components distinct from the original components extracted from the reference data. A large number of target data can thus be evaluated accurately regardless of the ratio of reference data to target data, and even when the number of reference data is insufficient.
The data analysis system sets the new components based on the components extracted from the reference data. It can thereby supplement new components and evaluate target data accurately while preserving the classification policy applied to the reference data. In the following description, to distinguish the components extracted from the reference data from the new components, the former are called reference components and the latter related components, for convenience.
A related component is a component that is not contained in the reference data, is contained in the target data, and has an attribute related to a reference component. As a result, it is effective for evaluating target data in relation to the predetermined case (that is, for determining the score of the target data). A “related attribute” is a characteristic that the related component has with respect to the reference component: for example, the related component (morpheme) co-occurs with the reference component (morpheme), the two are synonyms, or the meta-information of the two is shared. For example, if “price” is a reference component, then in phrases such as “adjust the price”, “decide the price”, and “consult on the price”, the words “adjust”, “decide”, and “consult” appear in the same context as “price”. Each is therefore a co-occurring word of “price”, that is, a related component. “Adjust”, “decide”, and “consult” may also be determined to be synonyms of “price” based on a database or the like.
Next, an example of processing for adding related components and evaluating target data (hereinafter called the additional processing) will be described. The server device 2 executes the control program for the additional processing following step S18 of the data analysis program of FIG. 3. FIG. 4 shows an example of a flowchart of the additional processing.
First, the component evaluation unit 15 determines whether the additional processing is necessary (S40). The data analysis program of the server device 2 can let the person responsible for data analysis operations select an operation mode (operation policy) for the additional processing when configuring the operating environment for data search. The operation modes include, for example, (1) perform the additional processing, (2) do not perform the additional processing, and (3) perform the additional processing depending on the situation. The “situation” is, for example, a state in which the number of reference data tends to be insufficient compared with the number of target data. The component evaluation unit 15 can determine that this situation has arisen when the distribution of the scores of the plurality of target data is skewed, or when the number of reference components used to evaluate the target data is relatively small. The control program for the additional processing may omit the necessity determination step and always perform the additional processing during data analysis. The data analysis program may also be configured so that whether to notify the operator that the additional processing is performed can be selected.
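The necessity determination in S40 can be sketched as follows. The three policies mirror the operation modes listed above. How a skewed score distribution or a relatively small number of used reference components is measured is not specified, so the spread and coverage thresholds below (`min_spread`, `min_coverage`) and all names are hypothetical stand-ins.

```python
from enum import Enum
from statistics import pstdev
from typing import Sequence


class AdditionPolicy(Enum):
    ALWAYS = 1       # (1) perform the additional processing
    NEVER = 2        # (2) do not perform the additional processing
    CONDITIONAL = 3  # (3) perform it depending on the situation


def needs_addition(policy: AdditionPolicy,
                   target_scores: Sequence[float],
                   used_components: int,
                   total_components: int,
                   min_spread: float = 0.05,
                   min_coverage: float = 0.2) -> bool:
    """Decide whether related components should be added (S40)."""
    if policy is AdditionPolicy.ALWAYS:
        return True
    if policy is AdditionPolicy.NEVER:
        return False
    # Assumption: a very narrow score spread stands in for a skewed distribution.
    skewed = len(target_scores) < 2 or pstdev(target_scores) < min_spread
    # Assumption: low coverage stands in for "relatively few" used components.
    coverage = used_components / total_components if total_components else 0.0
    return skewed or coverage < min_coverage
```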
If the component evaluation unit 15 determines that the additional processing is unnecessary, it ends the flow without performing the additional processing. If it determines that the additional processing is necessary, it proceeds to the processing that sets related components based on the evaluation results of the reference data. To set the related components, the component evaluation unit 15 determines, from among the components extracted from the reference data (the reference components), the specific reference components that serve as the basis for determining the related components (S41).
The specific reference components may be one, several, or all of the reference components extracted from the reference data. How the specific reference components are determined may be based on configuration information set for the operating environment of the data analysis. In a preferred aspect, the specific reference components are selected according to their evaluation: in particular, a predetermined number of reference components are selected in descending order of evaluation value (for example, the single reference component with the highest evaluation value). This is because the higher the evaluation value of a reference component, the higher the relevance to the aforementioned predetermined case of the related components derived from that reference component. If the number of specific reference components is greater than the optimum, there is a concern that the evaluation of the target data will tend to depart from the classification in the reference data. If the number is smaller than the optimum, there is a concern that the target data will not be evaluated sufficiently. The "predetermined number" may therefore be determined by the system as appropriate.
Once the specific reference components are determined, the component evaluation unit 15 determines related components based on them (S42). The component evaluation unit 15 extracts, from all or part of the target data, components that do not exist in the reference data and that co-occur with a specific reference component, and sets the extracted components as related components (related morphemes). For example, if "price" is a specific reference component, then "adjust", "determine", and "consult", which co-occur with "price" in phrases such as "adjust the price", "determine the price", and "consult on the price", are extracted and set as related components. The related components are thus information that is added to the system automatically, without user input, for the evaluation (scoring) of the target data, and without increasing the reference data. If no related component exists in the target data, the component evaluation unit 15 may add further specific reference components until a related component can be extracted from the target data.
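The extraction in step S42 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the target data have already been split into sentences and tokens, it treats "co-occurrence" as appearing in the same sentence, and the function name `find_related_components` and the sample data are hypothetical.

```python
def find_related_components(target_docs, reference_vocab, specific_components):
    """Collect tokens that share a sentence with a specific reference
    component and that do not appear in the reference data.

    Assumption: co-occurrence is judged at sentence level.
    """
    related = set()
    for doc in target_docs:          # doc: list of sentences
        for sentence in doc:         # sentence: list of tokens
            tokens = set(sentence)
            if tokens & specific_components:
                related |= tokens - reference_vocab - specific_components
    return related


# Hypothetical, pre-tokenized target data.
target_docs = [
    [["adjust", "price"], ["hello"]],
    [["determine", "price"]],
    [["consult", "price", "meeting"]],
]
reference_vocab = {"price", "meeting"}   # components already in the reference data
related = find_related_components(target_docs, reference_vocab, {"price"})
print(sorted(related))
```

In this example "hello" is excluded because it never shares a sentence with "price", and "meeting" is excluded because it already exists in the reference data.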
Subsequently, the component evaluation unit 15 evaluates each related component according to its attributes (for example, its amount of transmitted information) (S43). For example, the component evaluation unit 15 detects the target data in which a related component such as "adjust", "determine", or "consult" is present while co-occurring with the specific reference component ("price"), and identifies the number (n) of target data so detected. The additional processing control program regards the detected target data as relevant to the predetermined case (that is, as data to which the classification information "Relative" or "High" applies for the predetermined case). It then calculates the amount of transmitted information from a predetermined defining formula based on the appearance probability of the related component and the appearance probability of the classification information across all target data, and thereby estimates the evaluation value corresponding to each related component. For example, if the number (n) of target data in which the first element ("adjust"), the second element ("determine"), the third element ("consult"), and so on are present in a co-occurring state decreases in this order, the evaluation values satisfy first element ("adjust") > second element ("determine") > third element ("consult") > .... Specifically, the component evaluation unit 15 can evaluate the evaluation value (weight) of a related component according to the following formula.
Figure JPOXMLDOC01-appb-M000003
Alternatively, the component evaluation unit 15 may evaluate the evaluation value of a related component according to the following formula.
Figure JPOXMLDOC01-appb-M000004
Here, CF denotes the frequency with which the j0-th reference component m_j0 and the j1-th related component m_j1 co-occur in the same sentence (collocation frequency), DF denotes the frequency with which the two co-occur in the same data, and w denotes the weight (evaluation result) of the reference component m_j0. Further, f denotes an arbitrary function and may be, for example,
Figure JPOXMLDOC01-appb-M000005
or
Figure JPOXMLDOC01-appb-M000006
The component evaluation unit 15 may also evaluate a related component based on whether the target data containing the related component is "Relative" or "Non-Relative" according to the evaluation of the reference components, and on the evaluation result (score value) of the target data (S18).
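The two frequencies CF and DF defined above can be counted as in the following sketch. The formulas that combine them with the weight w are given only as figures and are not reproduced here, so the code stops at the counts. It assumes pre-tokenized data, counts CF as the number of sentences containing both components, and counts DF as the number of data items containing both; the function name `cooccurrence_counts` and the sample data are hypothetical.

```python
def cooccurrence_counts(target_docs, reference_component, related_component):
    """Return (CF, DF) for one reference/related component pair.

    CF: number of sentences in which both components appear (assumption).
    DF: number of data items in which both components appear (assumption).
    """
    cf = 0
    df = 0
    for doc in target_docs:                      # doc: list of token lists
        cf += sum(
            1
            for sentence in doc
            if reference_component in sentence and related_component in sentence
        )
        doc_tokens = {tok for sentence in doc for tok in sentence}
        if reference_component in doc_tokens and related_component in doc_tokens:
            df += 1
    return cf, df


# Hypothetical target data: three items, each a list of tokenized sentences.
docs = [
    [["adjust", "price"], ["adjust", "price", "again"]],
    [["price"], ["adjust", "schedule"]],
    [["determine", "price"]],
]
cf_adjust, df_adjust = cooccurrence_counts(docs, "price", "adjust")
cf_determine, df_determine = cooccurrence_counts(docs, "price", "determine")
print(cf_adjust, df_adjust, cf_determine, df_determine)
```

The second item shows why the two counts can differ: "price" and "adjust" appear in the same data item but in different sentences, so the item contributes to DF and not to CF.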
Next, the data evaluation unit 17 re-evaluates all target data (recalculates their scores) based on the evaluation values of the reference components and the evaluation values of the related components (S44). The data evaluation unit 17 then ranks all target data according to their evaluation results to create ranking information for all target data, compares the evaluation of each target data item with predetermined threshold information, and sets classification information for each target data item. The data evaluation unit 17 can output this ranking information, including the classification information, to the client device 3.
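Step S44 can be illustrated with the sketch below. The scoring rule of the embodiment is not given in this section, so the sketch assumes the simplest possible one: the score of a data item is the sum of the evaluation values of the distinct components it contains. The function name `rescore_and_rank`, the weights, and the threshold are all hypothetical.

```python
def rescore_and_rank(target_docs, weights, threshold):
    """Score each data item, attach classification information, and rank.

    Assumption: score = sum of the weights of the distinct components present.
    """
    results = []
    for doc_id, tokens in target_docs.items():
        score = sum(weights.get(tok, 0.0) for tok in set(tokens))
        label = "Relative" if score >= threshold else "Non-Relative"
        results.append((doc_id, score, label))
    # Highest score first, which yields the ranking information.
    results.sort(key=lambda item: item[1], reverse=True)
    return results


reference_weights = {"price": 2.0}        # hypothetical reference component weight
related_weights = {"adjust": 1.0}         # hypothetical related component weight
weights = {**reference_weights, **related_weights}

target_docs = {
    "d1": ["adjust", "price"],
    "d2": ["price"],
    "d3": ["lunch"],
}
ranking = rescore_and_rank(target_docs, weights, threshold=2.5)
print(ranking)
```

With the related component included, "d1" scores above the threshold while "d2" does not, which is the kind of difference the additional processing is meant to surface.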
As described above, in the data analysis system according to this embodiment, components related to the components contained in the reference data can be added as new components to the evaluation of the target data. A large number of target data can therefore be evaluated accurately regardless of the ratio of the number of reference data to the number of target data, even when the number of reference data is insufficient. The component evaluation unit 15 may also determine related components from components that are synonyms of a specific reference component, are not included in the reference data, and are included in the target data. In doing so, the component evaluation unit 15 may use a search table in the database 4 to select synonyms of the specific reference component. Two different morphemes are synonyms when, for example, they coincide through a morpheme of a superordinate concept. The component evaluation unit 15 may also set related components from both such synonyms and the aforementioned morphemes that co-occur with the specific reference component. It may further set related components from morphemes that co-occur with the synonyms. Moreover, when no synonym of the specific reference component exists in the target data, the component evaluation unit 15 may take other synonyms having a meaning similar to that synonym as candidates for related components.
In the embodiment described above, the target data are first evaluated (scored) based on the reference components and then evaluated based on both the related components and the reference components. This makes it possible to present the difference between the two evaluations to the user and to apply the former evaluation result to the determination and evaluation of related components. However, the target data may instead be evaluated based on the related components and the reference components without first being evaluated based on the reference components alone.
Furthermore, when the operating environment of the data analysis system described above is set to notify the user of the additional processing, the server device 12 can display the candidate specific reference components and the candidate related components on the client device 3, each in order of evaluation value, and let the user select whether or not to adopt each component for the data analysis.
(Pattern update function)
The predictive coding unit 10 can optimize the evaluation values of components based on given reference data and/or newly obtained reference data, for example as described in (1) to (3) below.
(1) Optimization of evaluation values
The component evaluation unit 15 can update the learned pattern by calculating the recall or precision from the result of evaluating the target data and repeatedly evaluating the degree to which each component contributes to the combination of data and classification information, so that the recall or precision increases.
Here, the "recall" (Recall Rate) is an index indicating the proportion (coverage) of the data to be found that is contained in a given number of data. For example, "the recall is 80% for 30% of all data" means that 80% of the data to be found are contained in the top 30% of the data when ranked by the index. (When the data are reviewed exhaustively without a data analysis system (linear review), the amount of data found is proportional to the amount reviewed, so the larger the deviation from this proportionality, the better the performance of the system.) The "precision" (Precision Rate) is an index indicating the proportion (accuracy) of the data found by the system that are truly data to be found. For example, "the precision is 80% when 30% of all data have been processed" means that the data to be found account for 80% of the top 30% of the data ranked by the index.
The component extraction unit 14 calculates the recall or precision from the result of the evaluation by the data evaluation unit 17 and, if the recall or precision falls below a target value, re-extracts components from the data until the recall or precision exceeds the target value. In doing so, the component extraction unit 14 may extract components other than those extracted previously, or may replace some of the previously extracted components with new components. When the data evaluation unit 17 derives the index of the target data with the re-extracted components, it may derive an index (second index) for each data item using the re-extracted components and their evaluation values, and re-derive the recall or precision from the second index and the first index obtained before the components were re-extracted. The data analysis system thereby achieves the additional effect of improving the accuracy of data analysis.
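The recall and precision described in this subsection can be computed for the top-ranked portion of the data as in the following sketch. It assumes that the data are already sorted in descending order of the index and that a ground-truth flag is available for each item; the function name `recall_precision_at` and the sample flags are hypothetical.

```python
def recall_precision_at(ranked_flags, top_n):
    """Compute recall and precision over the top_n items of a ranking.

    ranked_flags: list of bools in descending index order,
                  True meaning "data that should be found".
    """
    top = ranked_flags[:top_n]
    found = sum(top)
    total_to_find = sum(ranked_flags)
    recall = found / total_to_find if total_to_find else 0.0
    precision = found / len(top) if top else 0.0
    return recall, precision


# Hypothetical ranking of 10 items, 4 of which should be found.
ranked_flags = [True, True, False, True, False, False, False, True, False, False]
recall, precision = recall_precision_at(ranked_flags, top_n=3)   # top 30% of the data
print(recall, precision)
```

Here the top 30% of the data contain 2 of the 4 items to be found, so the recall for 30% of the data is 50%, and 2 of the 3 reviewed items are correct, so the precision is about 67%.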
(2) Evaluation of components based on a convolution technique
After evaluating a component contained in the reference data, the component evaluation unit 15 can re-evaluate that component by convolving the evaluation values of components other than that component, so that the evaluation values of the other components are reflected in its evaluation value. Because the relationship between the component and the other components is thereby evaluated as the evaluation value of the component, the data analysis system achieves the additional effect of improving the accuracy of data analysis.
(3) Timing of optimization
The component evaluation unit 15 can update a pattern (for example, a combination of a component and the evaluation value of that component) at any timing. For example, the component evaluation unit 15 can update the pattern (a) when an update request is received from an administrative user who manages the system, (b) when a preset date and time arrives, and/or (c) when input concerning an additional review is received from a user.
A user can check the content of target data for which the data evaluation unit 17 has derived an index (confirmation review) and newly input classification information for that target data. The classification information acquisition unit 12 then acquires the newly input classification information, and the data classification unit 13 may combine the target data with the classification information and treat the combination as new reference data. The new reference data are accumulated in an arbitrary memory and fed back to the system, for example at the timings (a) to (c) above.
The component extraction unit 14 then extracts components from the new reference data, and the component evaluation unit 15 evaluates those components. If a component has been evaluated before and the component and its evaluation value are stored in the memory, the component storage unit 16 replaces the stored evaluation value with the new evaluation result (evaluation value). If they are not stored, the component storage unit 16 associates the component with its evaluation value and newly stores them in the memory.
That is, at any timing (for example, the timings (a) to (b) above), the predictive coding unit 10 can update the learned pattern by re-evaluating the degree to which the plurality of components constituting at least part of the data corresponding to the classification information contribute to the combination of that data and that classification information. The data analysis system thereby achieves the additional effect of improving the accuracy of data analysis.
(Management function)
The predictive coding unit 10 can further include a management unit 18. The management unit 18 has, for example, the functions (1) to (5) below.
(1) Review heat map (Review Heat Map)
Consider, as an example, a case in which the data evaluation unit 17 derives an index for each of a plurality of target data and a user checks each of the target data and assigns classification information (performs a confirmation review), for example in descending order of the relevance to the predetermined case indicated by the index. In this case, the management unit 18 can visibly display, using a gradation that corresponds to the proportion of all target data occupied by the target data associated with classification information, the distribution of that proportion over the results of evaluating the plurality of target data.
For example, when the data evaluation unit 17 derives a numerical value in the range 0 to 10000 as the index, the management unit 18 classifies each target data item into ranges obtained by dividing the index into intervals of 1000 (that is, 0 to 1000 is the first section, 1001 to 2000 the second section, 2001 to 3000 the third section, and so on). For example, a target data item whose index is 2500 is classified into the third section. The management unit 18 can then display each range with, for example, a varying color tone (closer to warm colors the higher the proportion, closer to cool colors the lower the proportion), so that the proportion of the target data classified into that range that have been given predetermined classification information (for example, "Related") is visible. The management unit 18 displays the other ranges in the same way.
Because the management unit 18 can thus display the distribution of the proportion across the ranges as a gradation, it can suggest, for example, that the user's confirmation review may be mistaken when a range whose index indicates high relevance between the target data and the predetermined case (for example, the ninth section, where the index is 8001 to 9000) is nevertheless shown in a cool color tone. In other words, the data analysis system achieves the additional effect of letting the user grasp the distribution at a glance.
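The binning behind the review heat map can be sketched as follows. The section boundaries follow the example in the text (0 to 1000 is the first section, 1001 to 2000 the second, and so on); the mapping from a proportion to a color is left out, and the function names `section_of` and `tagged_ratio_by_section` as well as the sample data are hypothetical. Integer index values are assumed.

```python
from collections import defaultdict


def section_of(index_value, width=1000):
    """Map an integer index to its 1-based section (0-1000 -> 1, 1001-2000 -> 2, ...)."""
    return max(1, (index_value - 1) // width + 1)


def tagged_ratio_by_section(items, label="Related"):
    """For each section, return the share of items carrying the given label.

    items: iterable of (index_value, classification_or_None) pairs.
    """
    totals = defaultdict(int)
    tagged = defaultdict(int)
    for index_value, item_label in items:
        section = section_of(index_value)
        totals[section] += 1
        if item_label == label:
            tagged[section] += 1
    return {section: tagged[section] / totals[section] for section in sorted(totals)}


# Hypothetical reviewed data: (index, classification information).
items = [
    (2500, "Related"), (2100, None),
    (8500, None), (8900, None), (8001, "Related"), (9000, None),
]
ratios = tagged_ratio_by_section(items)
print(ratios)
```

A low proportion in a high section, such as 0.25 in the ninth section here, is the kind of pattern the text describes as a hint that the confirmation review may need a second look.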
(2) Central linkage (Central Linkage)
The management unit 18 can visualize interrelationships (for example, hierarchical relationships, affiliation relationships, and the volume of data exchanged) among a plurality of entities (for example, people, organizations, and computers). For example, when an e-mail is sent from a first computer to a second computer, the management unit 18 can display, on a predetermined display device (for example, a display of the client device 10), a diagram in which a first circle representing the first computer and a second circle representing the second computer are connected by an arrow pointing from the first circle to the second circle (the arrow may, for example, have a thickness corresponding to the volume of e-mail).
The management unit 18 can also visualize the interrelationships according to the result of the evaluation by the data evaluation unit 17. For example, when the data evaluation unit 17 derives a numerical value in the range 0 to 10000 as the index, the management unit 18 can display the diagram on the predetermined display device based only on the target data (for example, e-mail sent from the first computer to the second computer) whose index belongs to a specified section. The data analysis system thereby achieves the additional effect of letting the user grasp the interrelationships among a plurality of entities at a glance.
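The data behind such a diagram can be prepared as in the sketch below, which counts directed sender-to-receiver edges using only the e-mails whose index lies in a specified section. Drawing the circles and arrows is outside the sketch. The function name `linkage_edges`, the tuple layout, and the sample e-mails are hypothetical.

```python
from collections import Counter


def linkage_edges(emails, low, high):
    """Count directed edges, keeping only e-mails whose index is in [low, high].

    emails: iterable of (sender, receiver, index_value) tuples.
    The resulting count could drive the arrow thickness in the diagram.
    """
    edges = Counter()
    for sender, receiver, index_value in emails:
        if low <= index_value <= high:
            edges[(sender, receiver)] += 1
    return edges


# Hypothetical e-mails between three computers.
emails = [
    ("pc1", "pc2", 8500),
    ("pc1", "pc2", 8700),
    ("pc2", "pc1", 8200),
    ("pc1", "pc3", 1200),   # outside the specified section, so ignored
]
edges = linkage_edges(emails, low=8001, high=9000)
print(dict(edges))
```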
(3) Behavior extraction (Behavior Extractor)
The management unit 18 determines whether a first component representing a predetermined action is included in the target data and, if it determines that the first component is included, can identify a second component representing the object of that action.
For example, when the sentence "finalize the specification" is contained in the target data, the management unit 18 extracts the components "specification" and "finalize" from the sentence and identifies the other component "specification" (the object) as the target of the component "finalize" (the verb), which represents a predetermined action. Next, the management unit 18 associates meta-information (attribute information), which indicates attributes (properties or characteristics) of the target data containing the component and the other component, with those two components. Here, the meta-information is information indicating a predetermined attribute of data. When the target data is an e-mail, for example, it may be the name of the person who sent the e-mail, the name of the person who received it, an e-mail address, the date and time of sending or receiving, and so on.
The management unit 18 then associates the two components with the meta-information and displays them on a predetermined display device (for example, a display of the client device 3). For example, the management unit 18 can display on the display device a diagram in which a circle representing the first component and a circle representing the second component are connected by an arrow pointing from the first circle to the second circle. The data analysis system thereby achieves the additional effect of letting the user grasp the predetermined action and its object at a glance.
(4) Automatic summarization based on generative concept extraction
The management unit 18 can extract, from each of a plurality of target data, data containing components that correspond to subordinate concepts of preselected concepts, and generate content (for example, text, graphs, or tables) that summarizes the plurality of target data.
First, the user selects several concepts according to the topic to be detected in the target data and registers the selected concepts in the management unit 18 in advance. For example, when the topic to be detected is "fraud" or "dissatisfaction", the concepts are divided into five categories: "behavior", "emotion", "nature/state", "risk", and "money". The user registers concepts in the management unit 18 for each category: for "behavior", concepts such as "take revenge" and "despise"; for "emotion", "suffering" and "getting angry"; for "nature/state", "being sluggish" and "having a bad attitude"; for "risk", "threaten" and "deceive"; and for "money", "money paid for a person's labor".
For each registered concept, the management unit 18 searches the reference data for components corresponding to subordinate concepts of that concept, associates the components found with the concept, and stores them in an arbitrary memory (for example, the storage system 18). The management unit 18 then extracts the stored components from the target data, identifies the concepts associated with those components, and outputs a summary using those concepts.
For example, the management unit 18 extracts the concepts "system", "sale", and "do" from the text "monitoring system order received" contained in one e-mail, extracts the concepts "system", "sale", and "do" from the text "accounting system introduction" contained in another e-mail, and outputs "sell a system" as a summary of these e-mails. The management unit 18 can also present, for example, a graph (such as a pie chart) showing the proportion of all target data occupied by the target data containing the concept "sell a system". The data analysis system thereby achieves the additional effect of letting the user grasp the overall picture of the target data.
(5) Topic clustering (Topic Clustering)
The management unit 18 can cluster a plurality of target data according to the topics (subjects) contained in them. For example, the management unit 18 can cluster the target data using an arbitrary classification model (for example, k-means, a support vector machine, or spherical clustering). The data analysis system thereby achieves the additional effect of letting the user grasp the overall picture of the target data.
(Phase analysis function)
The predictive coding unit 10 can further include a phase analysis unit 19 (not shown in FIG. 2). The phase analysis unit 19 has, for example, the functions (1) to (3) below.
(1) Phase analysis
The phase analysis unit 19 can analyze phases, each of which indicates a stage through which a predetermined case progresses. The flow by which the phase analysis unit 19 analyzes phases is described below using an example in which the system is implemented as a criminal investigation support system and the predetermined case is "collusion".
Collusion is known to progress in the order of a relationship-building phase (building relationships with competitors), a preparation phase (exchanging competition-related information with competitors), and a competition phase (presenting prices to customers, obtaining feedback, and communicating with competitors). The administrator of the system therefore sets these three phases in the phase analysis unit 19. The system learns a plurality of patterns corresponding to the phases from a plurality of types of reference data prepared for the respective preset phases, and analyzes the target data based on each of the phases. It can thereby identify, for example, "which phase the organization under analysis is currently in".
That is, the component evaluation unit 15 refers to the plurality of types of reference data prepared for the respective preset phases, evaluates the components included in each type of reference data, associates each component with the result of evaluating it (an evaluation value), and stores them in the memory for each phase (that is, it learns a plurality of patterns corresponding to the plurality of phases). Next, the data evaluation unit 17 analyzes the target data based on the pattern learned for each phase, thereby deriving an index for each of the phases.
The phase analysis unit 19 then determines whether each index satisfies the predetermined determination criterion (for example, a threshold) set in advance for the corresponding phase (for example, whether the index exceeds the threshold) and, when the criterion is satisfied, increments the count value corresponding to that phase. Finally, the phase analysis unit 19 identifies the current phase based on the count values (for example, it takes the phase with the largest count value as the current phase). Alternatively, when the phase analysis unit 19 determines that the index derived for a phase satisfies the determination criterion set for that phase, it can identify that phase as the current phase. The data analysis system thereby provides the additional effect of being able to suggest to the user the phase, that is, the stage to which the predetermined case has progressed.
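The counting procedure described above can be sketched as follows. This is only an illustrative sketch: the function name `identify_current_phase`, the phase keys, the index values, and the thresholds are all hypothetical and do not come from the specification, which leaves the concrete data structures open.

```python
def identify_current_phase(indices_per_document, thresholds):
    """Count, per phase, the documents whose index exceeds that phase's
    threshold, and return the phase with the largest count.

    Returns (phase or None, counts). None means no index exceeded a threshold.
    """
    counts = {phase: 0 for phase in thresholds}
    for indices in indices_per_document:
        for phase, index in indices.items():
            if index > thresholds[phase]:
                counts[phase] += 1
    best_phase = max(counts, key=counts.get)
    return (best_phase if counts[best_phase] > 0 else None), counts


# Hypothetical indices derived for three target documents, one per phase.
docs = [
    {"relationship": 0.9, "preparation": 0.6, "competition": 0.1},
    {"relationship": 0.7, "preparation": 0.8, "competition": 0.2},
    {"relationship": 0.2, "preparation": 0.9, "competition": 0.3},
]
# Hypothetical per-phase thresholds set by the administrator.
thresholds = {"relationship": 0.6, "preparation": 0.5, "competition": 0.5}

current_phase, counts = identify_current_phase(docs, thresholds)
print(current_phase, counts)
```

With these made-up values the preparation phase collects the most counts, so it would be reported as the current phase. How ties between phases should be resolved is not stated in the text; this sketch simply keeps the first phase that reaches the maximum.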
(2) Phase progress prediction based on a prediction model
Based on a model capable of predicting the progress of a predetermined action related to a predetermined case, the phase analysis unit 19 can predict and present the next action from the indices derived by evaluating a plurality of target data.
That is, the phase analysis unit 19 can, for example, assume a regression model (the model capable of predicting the progress) whose variables are the index derived for the first phase (for example, the relationship building phase) and the index derived for the second phase (for example, the preparation phase), and predict the possibility (for example, the probability) of proceeding to the third phase (for example, the competition phase) based on regression coefficients optimized in advance. The data analysis system thereby provides the additional effect of being able to suggest to the user the result of predicting the progress of a predetermined action related to a predetermined case.
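One possible form of such a regression model is a logistic regression, sketched below. The text does not say which regression model is used, so the logistic form is an assumption made here because it yields a probability directly; the function name and the coefficient values are likewise hypothetical.

```python
import math


def probability_of_next_phase(index_phase1, index_phase2, coefficients):
    """Logistic regression on two phase indices (an assumed model form).

    coefficients = (intercept, weight for phase 1, weight for phase 2),
    taken here as already optimized in advance.
    """
    b0, b1, b2 = coefficients
    z = b0 + b1 * index_phase1 + b2 * index_phase2
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-optimized regression coefficients.
coefficients = (-4.0, 3.0, 5.0)

p_low = probability_of_next_phase(0.2, 0.1, coefficients)   # weak signals
p_high = probability_of_next_phase(0.9, 0.8, coefficients)  # strong signals
print(round(p_low, 3), round(p_high, 3))
```

In practice the coefficients would be estimated from past cases whose progression is known; that fitting step is outside this sketch.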
(3) Optimization of the determination criterion
The phase analysis unit 19 can optimize, according to given data, the determination criterion used to identify a phase from the index derived by the data evaluation unit 17 (the predetermined determination criterion set in advance for each phase, for example, a threshold). For example, the management unit 18 can perform a regression analysis on the relationship between the indices derived for a plurality of target data and the rankings of those indices (that is, the rank of each index when the indices are sorted in ascending order), and reset the determination criterion (for example, change the threshold) based on the result of the regression analysis.
First, the administrator of the system sets a ranking threshold for the ranking in advance. The phase analysis unit 19 performs a regression analysis on the relationship between the indices derived by the data evaluation unit 17 and their rankings using, for example, a function belonging to the exponential family, $y = e^{\alpha x + \beta}$ (where $e$ is the base of the natural logarithm and $\alpha$ and $\beta$ are real-valued parameters), determining the parameters by, for example, the method of least squares. It then sets the index that the fitted function associates with the ranking threshold as the new determination criterion (the changed threshold). Because the data analysis system can thus optimize the determination criterion according to given data, it provides the additional effect of improving the accuracy of data analysis.
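The fitting step can be sketched as below, under two assumptions that the text does not state: that x is the rank and y is the index, and that the least-squares fit is carried out on ln(y) = αx + β (a linearized fit, which requires positive indices and is not identical to least squares on y itself). The function names and the synthetic data are hypothetical.

```python
import math


def fit_exponential(ranks, indices):
    """Fit y = exp(alpha * x + beta) by ordinary least squares on ln(y).

    Assumes x is the rank and y is the (strictly positive) index.
    """
    n = len(ranks)
    logs = [math.log(y) for y in indices]
    mean_x = sum(ranks) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (v - mean_log) for x, v in zip(ranks, logs))
    alpha = sxy / sxx
    beta = mean_log - alpha * mean_x
    return alpha, beta


def threshold_for_rank(ranking_threshold, alpha, beta):
    """Index value that the fitted curve assigns to the ranking threshold."""
    return math.exp(alpha * ranking_threshold + beta)


# Synthetic indices lying exactly on y = exp(0.5 * x - 3), ranks in ascending order.
ranks = [1, 2, 3, 4, 5, 6]
indices = [math.exp(0.5 * r - 3.0) for r in ranks]

alpha, beta = fit_exponential(ranks, indices)
new_threshold = threshold_for_rank(4, alpha, beta)  # hypothetical ranking threshold of 4
print(alpha, beta, new_threshold)
```

Because the synthetic data lie exactly on the curve, the fit recovers α = 0.5 and β = -3 up to rounding, and the new threshold is the curve's value at rank 4.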
(Auxiliary functions)
The units included in the predictive coding unit 10 can have, for example, the following auxiliary functions (1) to (6).
(1) High-resolution evaluation
The data evaluation unit 17 can evaluate the target data at high resolution. That is, in addition to deriving an index for the target data as a whole, the data evaluation unit 17 can, for example, divide the target data into a plurality of parts (partial target data, for example, the sentences or paragraphs contained in the target data) and evaluate each part based on the learned pattern (that is, derive an index for each partial target data).
The data evaluation unit 17 can also integrate the indices derived for the plurality of partial target data and use the integrated index as the evaluation result of the target data. For example, when each index is derived as a numerical value, the integrated index for the target data can be the maximum of the indices, the average of the indices, or the sum of a predetermined number of the indices taken in descending order. The data analysis system thereby provides the additional effect of improving the accuracy of data analysis.
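The three integration options named above (maximum, average, and sum of the largest values) can be sketched as follows. The function name, the `method` strings, and the sample values are hypothetical; the specification does not define an interface.

```python
def integrate_indices(part_indices, method="max", top_n=3):
    """Combine the indices of the parts of one document into one index."""
    if not part_indices:
        raise ValueError("at least one partial index is required")
    if method == "max":
        return max(part_indices)
    if method == "mean":
        return sum(part_indices) / len(part_indices)
    if method == "top_n_sum":
        # Sum the top_n largest indices (all of them if fewer exist).
        return sum(sorted(part_indices, reverse=True)[:top_n])
    raise ValueError(f"unknown method: {method}")


# Hypothetical indices derived for five paragraphs of one document.
paragraph_indices = [1, 7, 4, 9, 2]

print(integrate_indices(paragraph_indices, "max"))
print(integrate_indices(paragraph_indices, "mean"))
print(integrate_indices(paragraph_indices, "top_n_sum", top_n=2))
```

Taking the maximum favors documents with a single highly relevant passage, while the average favors documents that are relevant throughout; which is preferable depends on the case being analyzed.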
(2) Time-series evaluation
When analyzing data whose nature changes over time (for example, electronic medical records that record a condition progressing over time), the component evaluation unit 15 can learn a pattern from each set of reference data divided at predetermined time intervals (for example, reference data for a first interval, reference data for a second interval, and so on), that is, obtain the components and the results of evaluating them for each interval, and the data evaluation unit 17 can evaluate the target data based on each of those patterns. In other words, the data evaluation unit 17 can derive indices for the target data along the time series. The data analysis system thereby provides the additional effect of improving the accuracy of data analysis.
In this case, the data evaluation unit 17 can predict a future index based on the change of the index over time. For example, before new target data is obtained, the data evaluation unit 17 can predict the next index, that is, the index that would be obtained by evaluating the new target data, based on a model for time-series analysis (for example, an autoregressive model or a moving average model) and the indices derived within a predetermined period (for example, the past month). The data analysis system thereby provides the additional effect of being able to present to the user events that may occur in the future (for example, the risk of an undesirable situation arising).
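As a rough illustration of predicting the next index from recent ones, the sketch below uses a simple moving average of the most recent values. This is deliberately simpler than the autoregressive and moving average (MA) models named in the text, and it is swapped in here only to show the idea; the function name, window length, and sample values are hypothetical.

```python
def forecast_next_index(history, window=3):
    """Predict the next index as the mean of the last `window` indices.

    A simple smoothing forecast, not a fitted AR or MA model.
    """
    if window <= 0 or len(history) < window:
        raise ValueError("history must contain at least `window` values")
    recent = history[-window:]
    return sum(recent) / window


# Hypothetical indices derived on five consecutive days.
daily_indices = [2.0, 3.0, 4.0, 5.0, 6.0]

predicted = forecast_next_index(daily_indices, window=3)
print(predicted)
```

A moving average lags behind a steadily rising series, as this example shows (the forecast is below the latest value), so a fitted autoregressive model would be the more natural choice when a trend is expected.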
(3) Case-by-case evaluation
When analyzing data whose nature changes with the type of matter (for example, litigation-related documents whose content varies with the type of lawsuit, such as antitrust violation, information leakage, or patent infringement), the component evaluation unit 15 can learn a pattern from each set of reference data prepared for each type of matter (for example, reference data on antitrust violations, reference data on information leakage, and so on), that is, obtain the components and the results of evaluating them for each type of matter, and the data evaluation unit 17 can evaluate the target data based on each of those patterns. The data analysis system thereby provides the additional effect of improving the accuracy of data analysis.
(4) Syntax analysis
The data evaluation unit 17 can analyze the structure of the target data and reflect the result of the analysis in the evaluation of the target data. For example, when the target data includes a sentence (text) at least in part, the data evaluation unit 17 can analyze the form of expression of each sentence in the text (for example, whether the sentence is in an affirmative form, a negative form, or a reserved form) and reflect the result in the index derived for the target data. Here, the affirmative form is an expression that affirms the subject (for example, "the food is delicious"), the negative form is an expression that negates the subject (for example, "the food tastes bad" or "the food is not delicious"), and the reserved form is an expression that affirms or negates the subject indirectly (for example, "the food could not be called delicious" or "the food could not be called bad").
The data evaluation unit 17 can adjust the index according to the form of expression. For example, when the data evaluation unit 17 derives, as the index, a numerical value within a predetermined range, it can adjust the index by adding "+α" for the affirmative form, "-β" for the negative form, and "+θ" for the reserved form (α, β, and θ may each be an arbitrary numerical value). In addition, when the data evaluation unit 17 detects that a sentence contained in the target data is in the negative form, it can, for example, cancel that sentence so that the components contained in the sentence are not used as a basis for deriving the index (that is, those components are not taken into account).
Furthermore, the component evaluation unit 15 can, for example, increase or decrease the evaluation value of a morpheme (component) depending on whether it is the subject, the object, or the predicate of the sentence. The data analysis system thereby provides the additional effect of improving the accuracy of data analysis.
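The index adjustment by form of expression, including the option of cancelling negative sentences, might look like the sketch below. It assumes each sentence has already been labeled with its form by some parser, which is not shown; the values chosen for α, β, and θ, the labels, and the function name are hypothetical.

```python
ALPHA, BETA, THETA = 2, 3, 1  # hypothetical adjustment values for α, β, θ


def score_document(sentences, cancel_negative=False):
    """Sum per-sentence scores, adjusted by each sentence's form.

    sentences: list of (form, component_score) pairs, where form is
    "affirmative", "negative", or "reserved" (labels assumed to be given).
    """
    total = 0
    for form, component_score in sentences:
        if form == "negative" and cancel_negative:
            continue  # cancelled: its components are not considered at all
        total += component_score
        if form == "affirmative":
            total += ALPHA
        elif form == "negative":
            total -= BETA
        elif form == "reserved":
            total += THETA
    return total


# Hypothetical labeled sentences with the scores of their components.
sentences = [("affirmative", 5), ("negative", 4), ("reserved", 2)]

print(score_document(sentences))
print(score_document(sentences, cancel_negative=True))
```

Whether the adjustment is applied per sentence, as here, or once per document is not specified in the text; the per-sentence form is one reading.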
(5) Evaluation considering the correlation (co-occurrence) between components
The data evaluation unit 17 can derive the index for the target data in consideration of the correlation (co-occurrence, for example, the frequency with which both appear together) between a first component and a second component contained in the target data.
For example, when the target data includes a sentence (text) at least in part and a first keyword (first component) such as "price" appears in the text, the data evaluation unit 17 can derive the index based on the number of times a second keyword (second component) appears at a second position near the first position at which the first keyword appeared (for example, a position within a predetermined range that includes the first position). The data analysis system thereby provides the additional effect of improving the accuracy of data analysis.
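Counting nearby occurrences of a second keyword can be sketched as follows. The sketch assumes the text has already been split into tokens (whitespace splitting is used here for English; Japanese text would need a morphological analyzer) and measures the "predetermined range" as a token distance. The function name, window size, and sample sentence are hypothetical.

```python
def count_cooccurrences(tokens, first_keyword, second_keyword, window=3):
    """Count occurrences of second_keyword lying within `window` tokens
    of at least one occurrence of first_keyword."""
    first_positions = [i for i, t in enumerate(tokens) if t == first_keyword]
    count = 0
    for j, token in enumerate(tokens):
        if token != second_keyword:
            continue
        if any(abs(j - i) <= window for i in first_positions):
            count += 1
    return count


# Hypothetical sentence from an e-mail under review.
tokens = "please keep the price above the agreed level until the next adjustment".split()

print(count_cooccurrences(tokens, "price", "agreed", window=3))
print(count_cooccurrences(tokens, "price", "adjustment", window=3))
```

Here "agreed" falls within three tokens of "price" and is counted, while "adjustment" is too far away. How the count is turned into an index is left open by the text.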
(6) Sentiment analysis
When the target data includes a user's evaluation information about a predetermined case, the data evaluation unit 17 can extract from the target data the emotion of the user who generated it, namely the emotion toward the predetermined case that arose based on the evaluation information (that is, it can evaluate the emotion contained in the target data).
For example, when data contained in a website that introduces goods or services (for example, an online shopping site or a restaurant guide) is to be analyzed, the data evaluation unit 17 can evaluate target data (for example, data contained in other websites) based on combinations (reference data) of the components contained in comments (reviews) on the goods or services (for example, keywords such as "good", "fun", "bad", and "boring") and the ratings given to the goods or services (for example, a five-level rating of "very good", "good", "average", "bad", and "very bad"). In doing so, the data evaluation unit 17 can increase or decrease the evaluation result according to, for example, intensifying expressions (for example, "very" or "extremely"). The data analysis system thereby provides the additional effect of improving the accuracy of data analysis.
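A toy version of keyword-based sentiment scoring with intensifiers is sketched below. The keyword scores are fixed by hand here, whereas in the system described they would be learned from the review and rating pairs in the reference data; the multiplier, the word lists, and the function name are hypothetical.

```python
# Hypothetical keyword scores; in the described system these would be
# learned from review text paired with ratings.
KEYWORD_SCORES = {"good": 1, "fun": 1, "bad": -1, "boring": -1}
INTENSIFIERS = {"very", "extremely"}


def sentiment_score(tokens, boost=2):
    """Sum keyword scores, multiplying a keyword's score by `boost`
    when the token immediately before it is an intensifier."""
    score = 0
    for i, token in enumerate(tokens):
        value = KEYWORD_SCORES.get(token)
        if value is None:
            continue
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            value *= boost
        score += value
    return score


review = "the food was very good but the service was bad".split()
print(sentiment_score(review))
```

In this example "very good" contributes twice the score of "good", and "bad" subtracts one, so the review ends up mildly positive.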
[Example of the data analysis system processing data other than document data]
The present embodiment mainly assumes that the data analysis system analyzes document data, and an example based on that assumption has been described. However, the system can also analyze data other than document data (for example, audio data, image data, or video data).
For example, when analyzing audio data, the system may analyze the audio data itself, or may convert the audio data into document data by speech recognition and analyze the converted document data. In the former case, the system can analyze the audio data by, for example, dividing it into partial audio segments of a predetermined length, treating those segments as components, and identifying them using an arbitrary audio analysis technique (for example, a hidden Markov model or a Kalman filter). In the latter case, the system recognizes the speech using an arbitrary speech recognition algorithm (for example, a recognition method using a hidden Markov model) and can analyze the recognized data by the same procedure as that described in the embodiment.
When analyzing image data, the system can analyze the image data by, for example, dividing it into partial images of a predetermined size, treating those partial images as components, and identifying them using an arbitrary image recognition technique (for example, pattern matching, a support vector machine, or a neural network).
Furthermore, when analyzing video data, the system can analyze the video data by, for example, dividing each of the frame images contained in the video data into partial images of a predetermined size, treating those partial images as components, and identifying them using an arbitrary image recognition technique (for example, pattern matching, a support vector machine, or a neural network).
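The splitting step described for image data and for the frame images of video data can be sketched as below, with an image represented as a plain list of pixel rows. Only the division into partial images is shown; the recognition step is not. The function name and the 4x4 sample image are hypothetical.

```python
def split_into_tiles(image, tile_height, tile_width):
    """Split a 2-D image (list of rows) into tiles, row by row.

    Tiles on the right and bottom edges are smaller when the image size
    is not a multiple of the tile size.
    """
    tiles = []
    for top in range(0, len(image), tile_height):
        for left in range(0, len(image[0]), tile_width):
            tile = [row[left:left + tile_width]
                    for row in image[top:top + tile_height]]
            tiles.append(tile)
    return tiles


# Hypothetical 4x4 grayscale image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

tiles = split_into_tiles(image, 2, 2)
print(len(tiles), tiles[0], tiles[3])
```

Each resulting tile would then be passed as a component to whichever recognition technique is used; for video, the same function would be applied to every frame image.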
[Example of implementation using software and hardware]
The control blocks of the data analysis system may be realized by logic circuits (hardware) formed on an integrated circuit (IC chip) or the like, or may be realized by software using a CPU. In the latter case, the system includes a CPU that executes the instructions of a program (the control program of the data analysis system), which is software that realizes each function; a ROM (Read Only Memory) or a storage device (these are referred to as the "recording medium") on which the program and various data are recorded so as to be readable by a computer (or the CPU); a RAM (Random Access Memory) into which the program is loaded; and so on. The object of the present invention is achieved when the computer (or the CPU) reads the program from the recording medium and executes it. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may also be supplied to the computer via an arbitrary transmission medium capable of transmitting it (such as a communication network or a broadcast wave). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission. The program can be implemented in any programming language, for example, a scripting language such as Python, ActionScript, or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5. Any recording medium on which the program is recorded also falls within the scope of the present invention.
[Other application examples]
The embodiment above described an example in which the system uses, in the evaluation of target data, related components that are related to the components included in the reference data, and is thereby realized as a system that can evaluate the target data accurately even when the amount of reference data is small. The system can also be realized as any other system, for example, a discovery support system, a forensic system, an e-mail monitoring system, a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, or a diagnosis support system), an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, or a social media management system), an information leakage detection system, a project evaluation system, a marketing support system, an intellectual property evaluation system, a fraudulent transaction monitoring system, a call center escalation system, or a credit investigation system.
For example, when the data analysis system of the present invention is realized as a discovery support system, the system evaluates target data (for example, documents, e-mails, or spreadsheet data) against a predetermined evaluation criterion (for example, whether the data should be produced in the discovery procedure of the lawsuit in question). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably submit to the court only the documents relevant to the lawsuit.
When the data analysis system of the present invention is realized as a forensic system, the system evaluates target data (for example, documents, e-mails, or spreadsheet data) against a predetermined evaluation criterion (for example, whether the data is evidence capable of proving a criminal act). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably extract evidence that proves the criminal act.
When the data analysis system of the present invention is realized as an e-mail monitoring system, the system evaluates target data (for example, e-mails or attached files) against a predetermined evaluation criterion (for example, whether the user who sent or received the e-mail is attempting misconduct). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably detect signs of misconduct such as information leakage or collusion.
When the data analysis system of the present invention is realized as a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, or a diagnosis support system), the system evaluates target data (for example, electronic medical records, nursing records, or patient diaries) against a predetermined evaluation criterion (for example, whether a patient will take a specific risky action, or whether a certain drug was effective against a disease). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently, reliably, and objectively predict, for example, that a patient will fall into a dangerous state (for example, a fall) or assess the efficacy of a drug.
When the data analysis system of the present invention is realized as an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, or a social media management system), the system evaluates target data (for example, messages posted by a user to a social networking service, recommendations published on a website, or the profile of a user or an organization) against a predetermined evaluation criterion (for example, whether the user's preferences are similar to those of another user, or whether the user's preferences match the attributes of a restaurant). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably, for example, list other users likely to get along with the user, present information on restaurants that suit the user's preferences, or warn the user about organizations that may cause the user harm.
When the data analysis system of the present invention is realized as an information leakage detection system, the system evaluates target data (for example, e-mails or database access logs) against a predetermined evaluation criterion (for example, whether the user who sent or received the e-mail is attempting misconduct). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably detect signs of information leakage.
When the data analysis system of the present invention is realized as an information asset utilization system (project evaluation system), the system dynamically extracts information useful for a project, according to the status of the project, from the information assets (target data) held by companies and experts. By using, in the evaluation of the target data, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably, for example, (1) reuse information on previously developed products according to the requirements of a new development, in order to streamline development sites where a shorter development period is desired, or (2) identify useful information assets based on the specialized knowledge of skilled engineers.
When the data analysis system of the present invention is realized as a marketing support system, the system evaluates target data (for example, company or individual profiles, or product information) against a predetermined evaluation criterion (for example, whether an individual is male or female, or whether consumers have a favorable impression of a product). By using, in that evaluation, related components that are related to the components included in the reference data, the system can evaluate the target data accurately even when the amount of reference data is small, and can efficiently and reliably, for example, extract the market's evaluation of a product.
When the data analysis system of the present invention is implemented as an intellectual property evaluation system, it evaluates target data (for example, patent publications, documents summarizing inventions, academic papers) against a predetermined evaluation criterion (for example, whether a patent publication can serve as evidence for rejecting or invalidating a given patent). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably extract invalidating material from a large number of documents (for example, patent publications, academic papers, text published on the Internet). In this case, the data analysis system acquires as reference data, for example, combinations of each claim of the patent to be invalidated with a "Related" label (classification information) and combinations of each claim of unrelated patents with a "Non-Related" label (classification information), learns patterns from the reference data, and calculates an index for each of the documents (target data). For example, it calculates an index for each paragraph of a patent publication and sums a predetermined number of the highest paragraph indices to obtain the index of that publication. The target data can then be evaluated using this index.
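The paragraph-level aggregation described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the specification: the function name `publication_index`, the parameter `top_n`, and the sample scores are all assumptions introduced here, and the specification does not state how many paragraph indices are summed.

```python
from typing import Sequence


def publication_index(paragraph_scores: Sequence[float], top_n: int = 3) -> float:
    """Sum the top_n highest paragraph-level indices of one publication.

    top_n stands in for the "predetermined number" in the text; its value
    here is a hypothetical default.
    """
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    # Sort descending and keep only the strongest paragraphs.
    return sum(sorted(paragraph_scores, reverse=True)[:top_n])


# Hypothetical per-paragraph indices for one patent publication.
scores = [0.5, 2.0, 1.0, 0.25]
print(publication_index(scores, top_n=2))  # 2.0 + 1.0 = 3.0
```

Summing only the strongest paragraphs keeps a long publication with one highly relevant passage from being diluted by many unrelated paragraphs, which is one plausible reason for aggregating this way.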
When the data analysis system of the present invention is implemented as a fraudulent transaction monitoring system, it evaluates target data (for example, e-mail, financial transaction information, bid information) against a predetermined evaluation criterion (for example, whether the user who sent or received an e-mail is attempting a fraudulent transaction). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably detect signs of misconduct such as cartels and bid rigging.
When the data analysis system of the present invention is implemented as a call center escalation system, it evaluates target data (for example, telephone call histories, recorded audio) against a predetermined evaluation criterion (for example, whether the data resembles a past response case). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably extract, for example, the response method best suited to the current situation from past response cases.
When the data analysis system of the present invention is implemented as a credit investigation system, it evaluates target data (for example, company profiles, information on company performance, information on stock prices, press releases) against a predetermined evaluation criterion (for example, whether a company will go bankrupt or whether it will grow). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably predict, for example, the growth or bankruptcy of a company.
When the data analysis system of the present invention is implemented as a driving support system, it evaluates target data (for example, data acquired from in-vehicle sensors, cameras, and microphones) against a predetermined evaluation criterion (for example, whether the information is something a skilled driver paid attention to while driving). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably extract, automatically, useful information that can make driving safer and more comfortable.
Furthermore, when the data analysis system of the present invention is implemented as a sales support system, it evaluates target data (for example, company or individual profiles, product information) against a predetermined evaluation criterion (for example, whether an individual is male or female, or whether consumers have a favorable impression of a product). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably extract, for example, the market's evaluation of a given product.
Furthermore, when the data analysis system of the present invention is implemented as a financial system (for example, a stock price prediction system), it evaluates target data (for example, current stock prices) against a predetermined evaluation criterion (for example, whether a stock price will rise). In doing so, it uses related components, which are related to the components contained in the reference data, to evaluate the target data, so the target data can be evaluated accurately even when the amount of reference data is small. The system can thereby efficiently and reliably predict, for example, future stock prices.
Depending on the field to which the data analysis system of the present invention is applied, circumstances specific to that field may be taken into account. For example, the data may be preprocessed (for example, by extracting the important portions of the data and analyzing only those portions), or the manner in which the analysis results are displayed may be changed. Those skilled in the art will understand that many such variations are possible, and all of them fall within the scope of the present invention.
[Summary]
A data analysis system according to one aspect of the present invention comprises a memory, an input control device, and a controller. The controller generates an index for ranking a plurality of target data; the index corresponds to the relevance between each target data item and a predetermined case, and changes based on input given by a user via the input control device. The memory stores the plurality of target data at least temporarily. The input control device presents sample data for the target data to the user and accepts input of classification information from the user; the classification information is associated with the sample data based on the input in order to classify the sample data. The input control device provides the combination of the sample data and the classification information received from the user to the controller as reference data. The controller acquires a plurality of the reference data; extracts a first component from the plurality of reference data, the first component constituting at least part of the reference data; evaluates the degree to which the first component contributes to the combination; extracts, from at least one of the plurality of target data, a second component related to the evaluated first component, the second component constituting at least part of the target data; evaluates the second component; and evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component. According to this data analysis system, the second component can be supplemented for the evaluation of the target data, so the data group can be analyzed accurately regardless of how much of the partial data is available for the data group.
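The flow summarized above (score first components from labeled reference data, find related second components in the target data, then build a ranking index) can be sketched as follows. This is a toy illustration under stated assumptions, not the method disclosed in the specification: the label-balance weight, the document-level co-occurrence rule, the `damping` factor, and all function names and sample data are hypothetical choices made here. The sketch also adds first-component weights to the index, in the manner of claim 8.

```python
from collections import Counter


def first_component_weights(reference_data):
    """Score each reference token by how strongly it leans toward 'Related'.

    Assumed scoring rule: (related docs - other docs) / (docs containing token).
    """
    related, other = Counter(), Counter()
    for tokens, label in reference_data:
        (related if label == "Related" else other).update(set(tokens))
    vocab = set(related) | set(other)
    return {t: (related[t] - other[t]) / (related[t] + other[t]) for t in vocab}


def second_component_weights(target_docs, first_weights, damping=0.5):
    """Score tokens absent from the reference data by how often they share a
    target document with a positively weighted first component."""
    firsts = {t for t, w in first_weights.items() if w > 0}
    seen, together = Counter(), Counter()
    for tokens in target_docs:
        toks = set(tokens)
        has_first = bool(toks & firsts)
        for t in toks - set(first_weights):
            seen[t] += 1
            together[t] += has_first  # True counts as 1
    return {t: damping * together[t] / seen[t] for t in seen if together[t]}


def rank(target_docs, reference_data):
    """Return (index, document position) pairs, highest index first."""
    w1 = first_component_weights(reference_data)
    w2 = second_component_weights(target_docs, w1)
    weights = {**w1, **w2}
    scored = [(sum(weights.get(t, 0.0) for t in set(doc)), i)
              for i, doc in enumerate(target_docs)]
    return sorted(scored, key=lambda s: -s[0])


reference = [(["price", "agreement"], "Related"),
             (["lunch", "menu"], "Non-Related")]
targets = [["price", "meeting"], ["meeting", "rival"], ["lunch", "menu"]]
print(rank(targets, reference))  # [(1.25, 0), (0.25, 1), (-2.0, 2)]
```

In this toy run the second document contains no token from the reference data, yet it still receives a non-zero index because "meeting" co-occurs with "price" elsewhere in the target data. That is the kind of supplementation the summary attributes to the second component.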
In a data analysis system according to another aspect of the present invention, the controller determines the first component from among a plurality of components included in the reference data, based on the relative merit of the evaluation of each of those components. This achieves the additional effect that the second component is set easily and quickly in association with the first component.
In a data analysis system according to another aspect of the present invention, the controller determines the first component from among a plurality of components included in the reference data, based on the relative merit of the evaluation of each of those components. This achieves the additional effect that the second component is set appropriately, easily, and quickly in association with the evaluation result of the first component.
In a data analysis system according to another aspect of the present invention, the controller determines, as the first component, the component with the highest evaluation among a plurality of components included in the target data. This achieves the additional effect that a component with a high evaluation value can be selected as the second component.
In a data analysis system according to another aspect of the present invention, the controller sets the relationship between the second component and the first component based on a predetermined criterion, and extracts the second component from the plurality of target data based on that criterion. A second component related to the first component can thus be supplemented automatically as one that is effective for analyzing the target data, which achieves the additional effect that the target data can be evaluated accurately.
In a data analysis system according to another aspect of the present invention, the controller extracts the second component based on at least one of the following: the second component co-occurs with the first component, the second component is a similar word to the first component, or the second component shares meta-information with the first component. A second component related to the first component can thus be supplemented automatically as one that is effective for analyzing the target data, which achieves the additional effect that the target data can be evaluated accurately.
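The three relationships named above (co-occurrence, similar words, shared meta-information) can be sketched as simple predicates. This is an illustrative sketch only: the function `related_components`, the document-level definition of co-occurrence, the synonym table, the meta-information tags, and the sample data are assumptions introduced here and do not come from the specification.

```python
from typing import Dict, List, Set


def related_components(
    first: str,
    target_docs: List[Set[str]],
    synonyms: Dict[str, Set[str]],
    meta: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each candidate second component to the relations linking it to `first`."""
    candidates = set().union(*target_docs) - {first}
    result: Dict[str, List[str]] = {}
    for cand in sorted(candidates):
        reasons = []
        # Co-occurrence: both tokens appear in the same target document (assumed definition).
        if any(first in doc and cand in doc for doc in target_docs):
            reasons.append("co-occurrence")
        # Similar word: listed as a synonym in either direction (hypothetical table).
        if cand in synonyms.get(first, set()) or first in synonyms.get(cand, set()):
            reasons.append("similar word")
        # Shared meta-information: the two tokens carry at least one common tag.
        if meta.get(first, set()) & meta.get(cand, set()):
            reasons.append("shared meta-information")
        if reasons:
            result[cand] = reasons
    return result


docs = [{"price", "meeting"}, {"tariff", "memo"}, {"invoice"}]
synonyms = {"price": {"tariff"}}
meta = {"price": {"sender:sales"}, "invoice": {"sender:sales"}}
print(related_components("price", docs, synonyms, meta))
# {'invoice': ['shared meta-information'], 'meeting': ['co-occurrence'], 'tariff': ['similar word']}
```

Here "memo" is not returned because it satisfies none of the three relations with "price". A candidate needs only one relation to qualify, matching the "at least one of" wording.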
In a data analysis system according to another aspect of the present invention, the controller evaluates the second component according to the frequency with which it appears in the plurality of target data while having the predetermined relationship, and evaluates the plurality of target data based on the evaluation result of the second component. This achieves the additional effect that the second component can be evaluated accurately and reliably.
A control method for a data analysis system according to another aspect of the present invention is a method for controlling a data analysis system that analyzes data, comprising: a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data item and a predetermined case and changing based on input from a user; a second step of storing the plurality of target data at least temporarily; a third step of presenting sample data for the target data to the user; a fourth step of accepting input of classification information from the user, the classification information being associated with the sample data based on the input in order to classify the sample data; a fifth step of providing the combination of the sample data and the classification information received from the user as reference data; a sixth step of acquiring a plurality of the reference data; a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least part of the reference data; an eighth step of evaluating the degree to which the first component contributes to the combination; a ninth step of extracting, from at least one of the plurality of target data, a second component related to the evaluated first component, the second component constituting at least part of the target data; a tenth step of evaluating the second component; and an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component. According to this control method, the second component can be supplemented for the evaluation of the target data, so the data group can be analyzed accurately regardless of how much of the partial data is available for the data group.
Furthermore, a control program for a data analysis system according to another aspect of the present invention causes a computer to execute each step included in the above control method, and a recording medium according to yet another aspect of the present invention is a recording medium on which that control program is recorded. According to this control program and recording medium, the second component can be supplemented for the evaluation of the target data, so the data group can be analyzed accurately regardless of how much of the partial data is available for the data group.
A data analysis system according to a further aspect of the present invention is, for example, a data analysis system that evaluates target data and comprises a memory, an input control device, and a controller. The controller evaluates a plurality of target data, the evaluation corresponding, for example, to the relevance between each target data item and a predetermined case; generates through the evaluation an index that allows the plurality of target data to be ranked; and can change the index based on input given by a user via the input control device. The memory, for example, stores at least temporarily the plurality of target data evaluated by the controller. The input control device, for example, allows the user to provide input that the controller uses to rank the plurality of target data. The ranking of the plurality of target data changes, for example, according to the index, which in turn changes based on the input. The input, for example, classifies reference data, which differs from the plurality of target data, based on the relevance between the reference data and the predetermined case. The classification is, for example, divided into a plurality of items of classification information according to the content of the reference data, and at least one of those items is assigned to the reference data by the input. The input control device presents the reference data to the user and provides the controller with the combination of the reference data and the at least one item of classification information that the user's input assigned to the presented reference data. The controller, for example, evaluates the degree to which each of a plurality of components included in the reference data contributes to the combination provided by the input control device, and thereby extracts from the reference data a pattern that characterizes the reference data according to the classification information assigned by the input. Based on the extracted pattern, the controller evaluates the relevance between the target data and the predetermined case to determine the index, sets the determined index for the target data, ranks the plurality of target data according to the index, and notifies the user of the ranked target data.
[Additional Notes]
The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.
The present invention can be widely applied to any computer, such as a personal computer, a server device, a workstation, or a mainframe, and is particularly applicable to artificial intelligence systems.
DESCRIPTION OF SYMBOLS 1 ... data analysis system, 2 ... server device, 3 ... client device, 4 ... database, 5 ... storage system, 6 ... management computer, 10 ... predictive coding unit, 11 ... data acquisition unit, 12 ... classification information acquisition unit, 13 ... data classification unit, 14 ... component extraction unit, 15 ... component evaluation unit, 16 ... component storage unit, 17 ... data evaluation unit, 18 ... management unit, 19 ... phase analysis unit.

Claims (12)

  1.  A data analysis system for analyzing data, comprising:
     a memory, an input control device, and a controller, wherein
     the controller generates an index for ranking a plurality of target data, the index corresponding to the relevance between each target data item and a predetermined case and changing based on input given by a user via the input control device,
     the memory stores the plurality of target data at least temporarily,
     the input control device:
     presents sample data for the target data to the user;
     accepts input of classification information from the user, the classification information being associated with the sample data based on the input in order to classify the sample data; and
     provides the combination of the sample data and the classification information received from the user to the controller as reference data, and
     the controller:
     acquires a plurality of the reference data;
     extracts a first component from the plurality of reference data, the first component constituting at least part of the reference data;
     evaluates the degree to which the first component contributes to the combination;
     extracts, from at least one of the plurality of target data, a second component related to the evaluated first component, the second component constituting at least part of the target data;
     evaluates the second component; and
     evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
  2.  The data analysis system according to claim 1, wherein the controller sets the second component from among components not included in the reference data.
  3.  The data analysis system according to claim 2, wherein the controller determines the first component from among a plurality of components included in the reference data, based on the relative merit of the evaluation of each of those components.
  4.  The data analysis system according to claim 3, wherein the controller determines, as the first component, the component with the highest evaluation among a plurality of components included in the target data.
  5.  The data analysis system according to claim 1, wherein the controller sets the relationship between the second component and the first component based on a predetermined criterion, and extracts the second component from the plurality of target data based on that criterion.
  6.  The data analysis system according to claim 5, wherein the controller extracts the second component based on at least one of: the second component co-occurring with the first component, the second component being a similar word to the first component, and the second component sharing meta-information with the first component.
  7.  The data analysis system according to claim 5, wherein the controller evaluates the second component according to the frequency with which the second component appears in the plurality of target data while having the predetermined relationship, and evaluates the plurality of target data based on the evaluation result of the second component.
  8.  The data analysis system according to claim 1, wherein the controller generates the index further based on the evaluation result of the first component.
  9.  The data analysis system according to claim 8, wherein the controller generates the index based on the evaluation result of the first component and evaluates the plurality of target data, and thereafter generates the index based on the evaluation result of the first component and the evaluation result of the second component and evaluates the plurality of target data.
  10.  A method for controlling a data analysis system that analyzes data, the method comprising:
     a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data item and a predetermined case and changing based on input from a user;
     a second step of storing the plurality of target data at least temporarily;
     a third step of presenting sample data for the target data to the user;
     a fourth step of accepting input of classification information from the user, the classification information being associated with the sample data based on the input in order to classify the sample data;
     a fifth step of providing the combination of the sample data and the classification information received from the user as reference data;
     a sixth step of acquiring a plurality of the reference data;
     a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least part of the reference data;
     an eighth step of evaluating the degree to which the first component contributes to the combination;
     a ninth step of extracting, from at least one of the plurality of target data, a second component related to the evaluated first component, the second component constituting at least part of the target data;
     a tenth step of evaluating the second component; and
     an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
  11.  請求項10記載のデータ分析システムの制御方法に含まれる各ステップを、コンピュータに実行させるデータ分析システムの制御プログラム。 A data analysis system control program for causing a computer to execute each step included in the data analysis system control method according to claim 10.
  12.  A computer-readable recording medium on which the control program for a data analysis system according to claim 11 is recorded.
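
The flow recited in the method claim (labelled samples become reference data, components of the reference data are weighted, matching components in the target data are scored, and the scores form a ranking index) can be illustrated with a minimal sketch. The sketch rests on assumptions not stated in the claims: components are taken to be whitespace-separated words, the classification information is a boolean "relevant" flag, and the contribution of a component is measured with a smoothed log-odds ratio. The function names `tokenize`, `evaluate_components`, `score`, and `rank` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (an assumed stand-in for component extraction)."""
    return text.lower().split()


def evaluate_components(reference_data):
    """Weight each first component by how strongly it is tied to the 'relevant' label.

    reference_data is a list of (sample_text, is_relevant) pairs, i.e. sample data
    combined with the user's classification. The smoothed log-odds ratio used here
    is an assumed measure, not one taken from the claims.
    """
    relevant, other = Counter(), Counter()
    n_relevant = n_other = 0
    for text, is_relevant in reference_data:
        tokens = set(tokenize(text))
        if is_relevant:
            relevant.update(tokens)
            n_relevant += 1
        else:
            other.update(tokens)
            n_other += 1
    weights = {}
    for token in set(relevant) | set(other):
        # Add-one smoothing keeps both ratios finite for unseen tokens.
        p_relevant = (relevant[token] + 1) / (n_relevant + 2)
        p_other = (other[token] + 1) / (n_other + 2)
        weights[token] = math.log(p_relevant / p_other)
    return weights


def score(text, weights):
    """Sum the weights of components in a target item that match evaluated components."""
    return sum(weights.get(token, 0.0) for token in set(tokenize(text)))


def rank(targets, weights):
    """Order target data by descending index value."""
    return sorted(targets, key=lambda t: score(t, weights), reverse=True)


if __name__ == "__main__":
    reference = [
        ("price fixing meeting", True),
        ("secret price agreement", True),
        ("lunch menu today", False),
        ("team lunch schedule", False),
    ]
    weights = evaluate_components(reference)
    for item in rank(["lunch plans for friday",
                      "confirm the price before the meeting"], weights):
        print(round(score(item, weights), 3), item)
```

In this sketch, the index changes with user input simply because each newly labelled sample is appended to the reference data and the weights are recomputed before the target data is ranked again.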
PCT/JP2015/064832 2015-05-22 2015-05-22 Data analysis system, control method, control program, and recording medium WO2016189605A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2015/064832 WO2016189605A1 (en) 2015-05-22 2015-05-22 Data analysis system, control method, control program, and recording medium
JP2017520082A JPWO2016189605A1 (en) 2015-05-22 2015-05-22 Data analysis system, control method, control program, and recording medium therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/064832 WO2016189605A1 (en) 2015-05-22 2015-05-22 Data analysis system, control method, control program, and recording medium

Publications (1)

Publication Number Publication Date
WO2016189605A1 (en) 2016-12-01

Family

ID=57394061

Family Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/064832 WO2016189605A1 (en) 2015-05-22 2015-05-22 Data analysis system, control method, control program, and recording medium

Country Status (2)

Country Link
JP (1) JPWO2016189605A1 (en)
WO (1) WO2016189605A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003242170A (en) * 2002-02-15 2003-08-29 Ricoh Co Ltd Document search device, document search method, and recording medium
JP2013182338A (en) * 2012-02-29 2013-09-12 Ubic:Kk Document classification system and document classification method and document classification program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020502712A (en) * 2016-12-11 2020-01-23 ディープ バイオ インク Disease diagnosis system and method using neural network
US11074686B2 (en) 2016-12-11 2021-07-27 Deep Bio, Inc. System for diagnosing disease using neural network and method therefor
JP2019133478A (en) * 2018-01-31 2019-08-08 株式会社Fronteo Computing system
CN113065065A (en) * 2021-03-30 2021-07-02 广联达科技股份有限公司 Method, device and equipment for evaluating search performance and readable storage medium

Also Published As

Publication number Publication date
JPWO2016189605A1 (en) 2018-02-15

Similar Documents

Publication Title
JP6182279B2 (en) Data analysis system, data analysis method, data analysis program, and recording medium
Li et al. Tourism companies' risk exposures on text disclosure
JP5885875B1 (en) Data analysis system, data analysis method, program, and recording medium
Liu et al. A two-phase sentiment analysis approach for judgement prediction
WO2017067153A1 (en) Credit risk assessment method and device based on text analysis, and storage medium
JP6748710B2 (en) Data analysis system, control method thereof, program, and recording medium
WO2016125310A1 (en) Data analysis system, data analysis method, and data analysis program
Abrahams et al. Audience targeting by B-to-B advertisement classification: A neural network approach
Afsana et al. Automatically assessing quality of online health articles
WO2016203652A1 (en) System related to data analysis, control method, control program, and recording medium therefor
Li et al. Evaluating Online Review Helpfulness Based on Elaboration Likelihood Model: the Moderating Role of Readability.
WO2016189605A1 (en) Data analysis system, control method, control program, and recording medium
JP5933863B1 (en) Data analysis system, control method, control program, and recording medium
JP2017201543A (en) Data analysis system, data analysis method, data analysis program, and recording media
WO2016121127A1 (en) Data evaluation system, data evaluation method, and data evaluation program
JP6178480B1 (en) DATA ANALYSIS SYSTEM, ITS CONTROL METHOD, PROGRAM, AND RECORDING MEDIUM
JP6026036B1 (en) DATA ANALYSIS SYSTEM, ITS CONTROL METHOD, PROGRAM, AND RECORDING MEDIUM
Tanaltay et al. Can Social Media Predict Soccer Clubs’ Stock Prices? The Case of Turkish Teams and Twitter
WO2016111007A1 (en) Data analysis system, data analysis system control method, and data analysis system control program
Dey et al. Applying Text Mining to Understand Customer Perception of Mobile Banking App
Kim et al. Analyzing Dissatisfaction Factors of Weather Service Users Using Twitter and News Headlines
Afsana et al. Automatically Assessing Quality of Online Health
Li et al. Empirical study of factors that influence the perceived usefulness of online mental health community members
Mouri Dey et al. Applying Text Mining to Understand Customer Perception of Mobile Banking App
JP2023047661A (en) Web advertisement determination device, system, program, and method that can present reason for determination

Legal Events

Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15893241; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2017520082; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15893241; Country of ref document: EP; Kind code of ref document: A1)