WO2016189605A1

WO2016189605A1 - Data analysis system, control method, control program, and recording medium

Info

Publication number: WO2016189605A1
Application number: PCT/JP2015/064832
Authority: WO
Inventors: 秀樹武田; 和巳蓮子
Original assignee: 株式会社Ｕｂｉｃ
Priority date: 2015-05-22
Filing date: 2015-05-22
Publication date: 2016-12-01
Also published as: JPWO2016189605A1

Abstract

A controller for extracting a first constituent element from a plurality of reference data, evaluating the first constituent element, and on the basis of this evaluation, extracting a second constituent element from one or more of a plurality of target data, evaluating the second constituent element, generating an indicator on the basis of the evaluation results for the first constituent element and the evaluation results for the second constituent element, and on the basis of the generated indicator, evaluating the relationship between the plurality of target data and a prescribed matter.

Description

Data analysis system, control method, control program, and recording medium therefor

The present invention relates to a data analysis system that analyzes data, and can be applied to, for example, an artificial intelligence system that analyzes big data.

As a result of computerization of society due to the rapid development of computers, an enormous amount of information (big data) has become widely and closely related to the activities of companies and individuals. Therefore, recently, the necessity of accurately separating desired information from big data has been emphasized.

As an approach for extracting desired information from big data, a system is known in which reviewers analyze some data sampled from a data group and the results of that analysis are used to automatically analyze the remaining data (for example, JP 2013-182338 A).

JP 2013-182338 A

In the above data analysis system, if only a small amount of data is sampled from the data group, the information necessary for analyzing the large amount of remaining data is insufficient, and the data group cannot be analyzed accurately. If an attempt is made to increase the amount of sampled data, the burden on the reviewer becomes excessive, and adverse effects such as an increase in the time and cost required for data analysis cannot be ignored.

The present invention has been made in view of such problems, and its purpose is to provide a data analysis system capable of accurately analyzing a data group regardless of the amount of the partial data sampled from the data group, and related technology.

In order to achieve the above object, a data analysis system according to a first invention is a data analysis system for analyzing data, comprising a memory, an input control device, and a controller. The controller generates an index for ranking a plurality of target data; the index corresponds to the relevance between each target data and a predetermined case, and changes based on an input given by the user via the input control device. The memory stores the plurality of target data at least temporarily. The input control device presents sample data for the target data to a user, receives input of classification information from the user, associates the classification information with the sample data based on the input to classify the sample data, and provides a combination of the sample data and the classification information received from the user to the controller as reference data. The controller acquires a plurality of the reference data; extracts a first component from the plurality of reference data, the first component constituting at least a part of the reference data; evaluates the degree to which the first component contributes to the combination; extracts a second component having relevance to the evaluated first component from at least one of the plurality of target data, the second component constituting at least a part of the target data; evaluates the second component; and evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.

In order to achieve the above object, a control method of a data analysis system for analyzing data according to a second invention comprises: a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data and a predetermined case and changing based on an input from a user; a second step of storing the plurality of target data at least temporarily; a third step of presenting sample data for the target data to the user; a fourth step of receiving input of classification information from the user and associating the classification information with the sample data based on the input to classify the sample data; a fifth step of providing a combination of the sample data and the classification information received from the user as reference data; a sixth step of acquiring a plurality of the reference data; a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least a part of the reference data; an eighth step of evaluating the degree to which the first component contributes to the combination; a ninth step of extracting a second component related to the evaluated first component from at least one of the plurality of target data, the second component constituting at least a part of the target data; a tenth step of evaluating the second component; and an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.

Furthermore, in order to achieve the object, a third invention is a control program of a data analysis system for causing a computer to execute each step included in the control method of the data analysis system. A further invention is a computer-readable recording medium on which the control program of the data analysis system is recorded.

The data analysis system, the control method, the control program, and the recording medium according to one aspect of the present invention have the effect that a data group can be analyzed accurately regardless of the amount of the partial data sampled from the data group.

FIG. 1 is a block diagram showing an example of the hardware configuration of the data analysis system according to one aspect of the present invention.
FIG. 2 is a functional block diagram showing an example of the predictive coding function provided in the data analysis system.
FIG. 3 is a flowchart showing an example of the processing executed by the predictive coding unit provided in the data analysis system.
FIG. 4 is a flowchart showing an example of the additional processing executed by the predictive coding unit provided in the data analysis system.

Next, embodiments of the present invention will be described with reference to the accompanying drawings.
[Data analysis system configuration]
FIG. 1 is a block diagram illustrating an example of a hardware configuration of a data analysis system (hereinafter simply referred to as the “system”) according to the present embodiment. The system includes, for example, an arbitrary recording medium (e.g., memory, hard disk) capable of storing data (including digital data and analog data) and a controller (e.g., a CPU: Central Processing Unit) capable of executing a control program stored in the recording medium. It may be implemented as a single computer (e.g., personal computer, server device, client device, workstation, mainframe) that analyzes data stored at least temporarily in the recording medium, or as a computer system in which several computers cooperate (for example, a server device that executes main processing for data analysis, a client device used by a user, and a file server that stores data to be analyzed). In the present embodiment, an example (FIG. 1) in which the system is realized by the latter will be mainly described.

In the present embodiment, “data” may be any data expressed in a format that can be processed by the computer. The data may be, for example, unstructured data whose structure definition is incomplete at least in part, and broadly includes document data (for example, e-mail (including attached files and header information), technical documents (a wide range of documents explaining technical matters, such as academic papers, patent publications, product specifications, and design drawings), presentation materials, spreadsheets, financial statements, meeting materials, reports, sales documents, contracts, organization charts, business plans, company analysis information, electronic medical records, web pages, blogs, and comments posted on social network services), audio data (e.g., recorded conversation or music), image data (e.g., data composed of a plurality of pixels or vector information), and video data (e.g., data composed of a plurality of frame images).

Further, in the present embodiment, “reference data” may be, for example, data associated with classification information by a user (classified data, that is, a combination of data and classification information). On the other hand, “target data” may be data not associated with classification information (unclassified data that has not been presented to the user as reference data and has not been classified by the user). Here, the “classification information” may be an identification label used for classifying reference data. It may be, for example, information that classifies the reference data into three categories, such as a “Related” label indicating that the reference data and a predetermined case are related, a “High” label indicating that they are more strongly related, and a “Non-Related” label indicating that they are not related, or information that classifies the reference data into five categories, such as “good”, “slightly good”, “normal”, “slightly bad”, and “bad”.

In addition, the “predetermined case” broadly includes any target whose relevance to data is evaluated by the system, and its scope is not limited. For example, when the system is realized as a discovery support system, the predetermined case may be a lawsuit in which discovery procedures are required; when realized as a criminal investigation support system, a crime that is the subject of an investigation; when realized as an e-mail monitoring system, a fraudulent activity (e.g., information leakage, collusion); when realized as a medical application system (e.g., a pharmacovigilance support system, clinical trial efficiency system, medical risk hedging system, fall prediction (fall prevention) system, prognosis prediction system, or diagnosis support system), a case or example related to medicine; when realized as an Internet application system (e.g., a smart mail system, information aggregation (curation) system, user monitoring system, or social media management system), a case or example related to the Internet; when realized as a project evaluation system, a project performed in the past; when realized as a marketing support system, a product or service targeted for marketing; when realized as an intellectual property evaluation system, an intellectual property subject to evaluation; when realized as an unauthorized transaction monitoring system, a fraudulent financial transaction; when realized as a call center escalation system, a past response case; when realized as a credit check system, a subject of the credit check; when realized as a driving support system, a matter related to driving a vehicle; and when realized as a sales support system, sales results.

As illustrated in FIG. 1, the data analysis system 1 according to the present embodiment may include, for example, a server device 2 that can execute main processing of data analysis, one or more client devices 3 that can execute related processing of data analysis, a storage system 5 including a database 4 that records data and evaluation results for the data, and a management computer 6 that provides a management function for data analysis to the client device 3 and the server device 2.

The client device 3 can present a part of a plurality of target data to the user as sample data before classification. The user can thereby enter input for evaluating and classifying the sample data via the client device 3. The server device 2 can randomly sample the plurality of target data, extract a predetermined number of sample data, and provide them to a predetermined client device. The sample data may instead be data belonging to a data group that is not included in the target data to be analyzed but concerns a predetermined case that is the same as or similar to that of the target data. The client device 3 includes, as hardware resources, for example, a memory, a controller, a bus, an input/output interface (for example, a keyboard and a display), and a communication interface (which communicably connects the client device 3 to the server device 2 and the management computer 6 by communication means using a predetermined network).

Based on the sample data to which the classification information is attached, that is, the combination of the sample data and the classification information (referred to as “reference data”), the server device 2 learns a pattern included in the data (the term broadly refers to abstract rules, meanings, concepts, styles, distributions, samples, and the like, and is not limited to a so-called “specific pattern”), and evaluates the relevance between the target data and a given case based on that pattern. That is, based on the learned pattern, the server device 2 can evaluate the relevance between the target data and a lawsuit, between the target data and a criminal investigation, between the target data and a user's preference, or between the target data and any other event. Similarly to the client device 3, the server device 2 may include, for example, a memory, a controller, a bus, an input/output interface, and a communication interface as hardware resources.

The management computer 6 executes predetermined management processing for the client device 3, the server device 2, and the storage system 5. Similarly to the client device 3, the management computer 6 may include, for example, a memory, a controller, a bus, an input/output interface, and a communication interface as hardware resources. Note that application programs that can control each device are stored in the memory provided in each of the client device 3, the server device 2, and the management computer 6. When each controller executes its application program, the program (software resource) and the hardware resources cooperate to operate each device.

The storage system 5 may be composed of, for example, a disk array system, and may include a database 4 that records data and results of evaluation / classification of the data. The server apparatus 2 and the storage system 5 are connected by a DAS (Direct Attached Storage) method or a SAN (Storage Area Network).

Note that the hardware configuration shown in FIG. 1 is merely an example, and the above system can be realized by other hardware configurations. For example, a part or all of the processing executed in the server device 2 may be executed in the client device 3, or a part or all of the processing executed in the client device 3 may be executed in the server device 2. Alternatively, the storage system 5 may be built into the server device 2. Those skilled in the art will understand that various hardware configurations can realize the system, and the present invention is not limited to one specific configuration (for example, the configuration illustrated in FIG. 1).

[Predictive coding function of data analysis system 1]
FIG. 2 is a functional block diagram showing an example of the predictive coding function realized by the data analysis system according to the present embodiment.

(Basic configuration of predictive coding function)
As illustrated in FIG. 2, the system can include a predictive coding unit 10. The predictive coding unit 10 evaluates a large number of data (target data not associated with classification information; for example, big data) based on a small number of manually classified data (the reference data described above), so that significant information can be extracted from the target data.

The predictive coding unit 10 can include, for example, a data acquisition unit 11, a classification information acquisition unit 12, a data classification unit 13, a component extraction unit 14, a component evaluation unit 15, a component storage unit 16, and a data evaluation unit 17.

The data acquisition unit 11 acquires data from an arbitrary storage resource (for example, the database 4, a web server on the Internet, or a mail server on an intranet). The data acquisition unit 11 provides all data to be subjected to data analysis to the component extraction unit 14 as target data. It also randomly samples the target data to obtain a predetermined number of sample data and provides the sample data to the data classification unit 13.

The classification information acquisition unit 12 acquires the classification information input by the user for each sample data from an arbitrary input device (for example, the client device 3), and outputs the classification information to the data classification unit 13.

The data classification unit 13 combines each of the plurality of sample data sent from the data acquisition unit 11 with the classification information input for that sample data from the classification information acquisition unit 12, and outputs the combinations to the component extraction unit 14 as a plurality of reference data.

The component extraction unit 14 extracts the components constituting the reference data from the plurality of reference data received from the data classification unit 13. Here, a “component” may be partial data constituting at least a part of the data: for example, a morpheme, keyword, sentence, paragraph, and/or metadata (for example, e-mail header information) constituting a document; partial audio, volume (gain) information, and/or timbre information constituting audio; a partial image, partial pixels, and/or luminance information constituting an image; or a frame image, motion information, and/or three-dimensional information constituting video. The component extraction unit 14 outputs each extracted component and the classification information corresponding to it to the component evaluation unit 15. The component extraction unit 14 also extracts the components constituting the target data from the target data input from the data acquisition unit 11 and outputs them to the data evaluation unit 17.
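The extraction step for document data can be illustrated with a minimal sketch. It assumes English text and treats lower-cased word tokens as a stand-in for morphemes; Japanese text would instead need a morphological analyzer. The names `ReferenceData`, `extract_components`, and `components_with_labels` are hypothetical and do not appear in the source.

```python
import re
from typing import NamedTuple


class ReferenceData(NamedTuple):
    """A sample document combined with the classification information given by the user."""
    text: str
    label: str  # e.g. "Related", "High", "Non-Related"


def extract_components(text: str) -> set[str]:
    """Return the distinct lower-cased word tokens in `text` (assumed stand-in for morphemes)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def components_with_labels(reference: list[ReferenceData]) -> list[tuple[str, str]]:
    """Pair every component of each reference item with that item's classification label."""
    return [
        (component, item.label)
        for item in reference
        for component in sorted(extract_components(item.text))
    ]
```

The pairs returned by `components_with_labels` correspond to what the component extraction unit 14 passes to the component evaluation unit 15: a component together with the classification information of the reference data it came from.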

The component evaluation unit 15 evaluates the components input from the component extraction unit 14. For example, the component evaluation unit 15 evaluates, for each of the plurality of components constituting at least part of the reference data, the degree to which the component contributes to the combination (in other words, how the appearance of the component is distributed across the classification information). More specifically, the component evaluation unit 15 calculates an evaluation value for each component by evaluating it using, for example, a transmitted information amount (for example, an amount of information calculated from a predetermined definition formula using the appearance probability of the component and the appearance probability of the classification information). The component evaluation unit 15 can thereby learn the pattern contained in the reference data. The component evaluation unit 15 outputs each component and its evaluation value to the component storage unit 16.
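As one possible reading of the "predetermined definition formula", the sketch below computes the mutual information between the presence of a component and the classification label. This is an assumption for illustration: the source states only that the appearance probabilities of the component and of the classification information are used. The function name `mutual_information` and the input layout are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, component):
    """Mutual information in bits between 'component present' and the label.

    docs: list of (set_of_components, label) pairs built from the reference data.
    """
    n = len(docs)
    # Joint counts over (component present?, label).
    joint = Counter((component in comps, label) for comps, label in docs)
    # Marginal counts for presence and for each label.
    presence = Counter()
    labels = Counter()
    for (present, label), count in joint.items():
        presence[present] += count
        labels[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = presence[present] / n
        p_y = labels[label] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi
```

A component that appears only in "Related" documents receives a high value, while a component that appears equally often under every label receives zero, which matches the idea that the evaluation value reflects how the component is distributed across the classification information.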

The component storage unit 16 associates the component and the evaluation value input from the component evaluation unit 15, and stores both in an arbitrary memory (for example, the storage system 5).

The data evaluation unit 17 reads the evaluation value associated with each component input from the component extraction unit 14 from an arbitrary memory (for example, the database 4 of the storage system 5), and evaluates the target data based on the evaluation values. More specifically, the data evaluation unit 17 can derive an index that allows the target data to be ranked (for example, a numerical value, letter, and/or symbol) by, for example, adding up the evaluation values associated with the components constituting at least a part of the target data. A suitable form of the index is a score obtained by adding up the evaluation values. The data evaluation unit 17 associates the target data with the index and stores both in an arbitrary memory (for example, the storage system 5).

The component evaluation unit 15 can repeatedly select and evaluate components, correcting their evaluation values, until the evaluation of data to which the “Related” or “High” label is set becomes higher than the evaluation of data to which no label is set. As a result, the component evaluation unit 15 can find components that appear in a plurality of reference data to which the classification information “Related” or “High” is attached and that influence the combination of the reference data and the label. The component evaluation unit 15 calculates the evaluation value wgt of a component using, for example, the following formula.

Here, wgt_{i,0} indicates the initial value of the evaluation value of the i-th component before evaluation, and wgt_{i,L} indicates the evaluation value of the i-th component after the L-th evaluation. γ_L means an evaluation parameter in the L-th evaluation, and θ means a threshold value in the evaluation. The component evaluation unit 15 can thereby evaluate a component as representing the characteristic of given classification information more strongly the larger the calculated transmitted information amount is. Note that the component evaluation unit 15 can use, as a threshold (predetermined reference value) for automatically determining whether “Related” is to be set for target data, an intermediate value between the lowest index value among the reference data to which “Related” is set and the highest index value among the reference data to which “Non-Related” is set.
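The threshold described above can be sketched as follows. "Intermediate value" is taken here to mean the arithmetic midpoint, which is an assumption, and the function names `related_threshold` and `auto_label` are hypothetical.

```python
def related_threshold(scored_reference):
    """Midpoint between the lowest "Related" index and the highest "Non-Related" index.

    scored_reference: iterable of (index_value, label) pairs for the reference data.
    """
    pairs = list(scored_reference)
    related = [value for value, label in pairs if label == "Related"]
    non_related = [value for value, label in pairs if label == "Non-Related"]
    if not related or not non_related:
        raise ValueError("need at least one Related and one Non-Related reference item")
    return (min(related) + max(non_related)) / 2


def auto_label(index_value, threshold):
    """Set "Related" automatically when the index reaches the threshold (assumed rule)."""
    return "Related" if index_value >= threshold else "Non-Related"
```

For reference data scored 90 and 60 ("Related") and 40 and 20 ("Non-Related"), the threshold is 50, so a target data item with an index of 55 would be labeled "Related" automatically.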

The data evaluation unit 17 then calculates a score for each of the plurality of target data and each of the plurality of reference data from the evaluation values of the components, using, for example, the following formula. The score is an index that quantitatively evaluates the strength of the connection between the data and the classification information.

m_i: frequency of occurrence of the i-th component
wgt_i: evaluation value of the i-th component
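The score formula itself is not reproduced in this text. As an illustration only, the sketch below assumes the simplest form consistent with the description, a sum of the evaluation values wgt_i weighted by the occurrence frequencies m_i; the actual formula may include normalization or other terms. The function names `score` and `rank` are hypothetical.

```python
from collections import Counter


def score(components, weights):
    """Assumed score: sum over components of m_i * wgt_i.

    components: list of components in one document, with repeats.
    weights: dict mapping component -> evaluation value; unknown components count as 0.
    """
    frequencies = Counter(components)  # m_i for each distinct component
    return sum(m * weights.get(component, 0.0) for component, m in frequencies.items())


def rank(documents, weights):
    """Return document ids ordered from highest to lowest score."""
    return sorted(documents, key=lambda doc_id: score(documents[doc_id], weights), reverse=True)
```

With weights `{"price": 2.0, "adjust": 0.5}`, a document containing "price" twice and "adjust" once scores 4.5, while a document containing neither scores 0 and falls to the bottom of the ranking.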

In the above description, each configuration described as a “*** unit” is a functional configuration realized when a controller included in the data analysis system executes a program (data analysis program), and may therefore be paraphrased as “*** processing” or a “*** function”. In addition, since each “*** unit” can be replaced by hardware resources, those skilled in the art will understand that these functional blocks can be realized in various forms by hardware only, by software only, or by a combination thereof, and are not limited to any one of these.

[Processing performed by the predictive coding unit 10]
FIG. 3 is a flowchart showing an example of processing executed by the predictive coding unit 10 included in the data analysis system according to the present embodiment.

First, the data acquisition unit 11 acquires sample data from an arbitrary memory (step 10; hereinafter “step” is abbreviated as “S”). Next, the classification information acquisition unit 12 acquires the classification information input by the user from an arbitrary input device (S11). Next, the data classification unit 13 classifies the data by combining the data and the classification information to form reference data (S12), and the component extraction unit 14 extracts the components constituting the reference data from the reference data (S13). Then, the component evaluation unit 15 evaluates the components (S14), and the component storage unit 16 associates each component with its evaluation value and stores both in an arbitrary memory (S15). The processing of S10 to S15 is referred to as the “learning phase” (a phase in which the system learns a pattern).

Next, the data acquisition unit 11 acquires target data from an arbitrary memory (S16). The component extraction unit 14 extracts the components constituting the target data from the target data (S17). The data evaluation unit 17 reads the evaluation value associated with each component from an arbitrary memory, and evaluates the target data based on the evaluation values (S18). The processing of S16 to S18 is referred to as the “evaluation phase” (a phase in which the system evaluates target data based on the pattern).
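The two phases can be sketched end to end as follows. To keep the example short, the component weight is a deliberately simple stand-in (the share of "Related" reference items containing the component minus the share of other items containing it), not the transmitted information amount described in the source. The function names `learn` and `evaluate` are hypothetical.

```python
def learn(reference):
    """Learning phase (S10 to S15): derive a weight per component from labeled data.

    reference: list of (set_of_components, label) pairs.
    Returns a dict mapping component -> weight (stand-in weighting, see lead-in).
    """
    related = [comps for comps, label in reference if label == "Related"]
    other = [comps for comps, label in reference if label != "Related"]
    vocabulary = set().union(*(comps for comps, _ in reference))
    weights = {}
    for component in vocabulary:
        share_related = sum(component in d for d in related) / max(len(related), 1)
        share_other = sum(component in d for d in other) / max(len(other), 1)
        weights[component] = share_related - share_other
    return weights


def evaluate(targets, weights):
    """Evaluation phase (S16 to S18): score each target item and rank by score.

    targets: dict mapping document id -> set_of_components.
    Returns a list of (score, document id) sorted from highest to lowest score.
    """
    scored = [
        (sum(weights.get(component, 0.0) for component in comps), doc_id)
        for doc_id, comps in targets.items()
    ]
    return sorted(scored, reverse=True)
```

Because `evaluate` only needs the stored mapping from components to weights, it can also be run on a mapping prepared in advance, without running `learn` at all.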

Note that the processes included in the learning phase are not essential to the system. For example, a memory that stores components in association with their evaluation values may be given in advance, and the predictive coding unit 10 can evaluate the target data based on the components and evaluation values stored in that memory.

[Special processing executed by predictive coding unit 10 (component addition processing)]
Since the above-described predictive coding function evaluates the target data based on the evaluation results of the components extracted from the reference data, target data cannot be evaluated unless it includes those components. This often occurs when the number (amount) of reference data is small compared with the number (amount) of target data, so that the components extracted from the reference data are insufficient. If the number of components extracted from the reference data is insufficient, then even if the target data can be scored, the distribution of the target data over the scores becomes biased, and it becomes difficult to evaluate the target data accurately.

One remedy would be to increase the number of reference data and thereby the variety of components extracted from them. However, it is difficult in the first place to determine whether the number of reference data is sufficient relative to the number of target data. Moreover, if the number of reference data is increased, the load on the reviewer becomes excessive, and the increase in time and labor leads to an increase in cost.

For this reason, so that a large number of target data can be evaluated accurately regardless of the ratio of the number of reference data to the number of target data, and even when the number of reference data is insufficient, the data analysis system according to the present embodiment can supplement the evaluation of the target data with new components, different from the original components extracted from the reference data, in accordance with the evaluation result for the reference data.

Because the data analysis system sets the new components based on the components extracted from the reference data, it can evaluate the target data accurately by supplementing the new components while maintaining the classification policy applied to the reference data. In the following description, to distinguish the components extracted from the reference data from the new components, the former are referred to as reference components and the latter as related components for convenience.

A related component is a component that is not included in the reference data but is included in the target data and has an attribute related to a reference component. It is therefore effective in evaluating the relevance of the target data to the predetermined case (that is, in determining the score of the target data). The “related attribute” is a characteristic of the related component with respect to the reference component: for example, the related component (morpheme) co-occurs with the reference component (morpheme), the two are synonyms, or the meta information of the two is common. For example, if “price” is a reference component, then “adjust”, “determine”, and “consult”, which appear in the same context as “price” in phrases such as “adjust price”, “determine price”, and “consult price”, are co-occurring words for “price”, that is, related components. In addition, “adjust”, “determine”, and “consult” may be determined to be synonyms for “price” based on a database or the like.

Next, an example of a process (hereinafter referred to as additional processing) for adding related components and evaluating target data will be described. The server device 2 executes a control program for the additional processing following step S18 of the data analysis program of FIG. 3. FIG. 4 shows an example of a flowchart of the additional processing.

First, the component evaluation unit 15 determines whether additional processing is necessary (S40). The data analysis program of the server device 2 allows an operation mode (operation policy) for the additional processing to be selected when the data analysis operation manager sets the operating environment for data search. The operation modes are, for example, (1) additional processing is performed, (2) additional processing is not performed, and (3) additional processing is performed depending on the situation. The “situation” is, for example, a state in which the number of reference data tends to be insufficient compared with the number of target data. The component evaluation unit 15 can determine that this situation has occurred when the score distribution of the plurality of target data is biased or when the number of reference components used for the evaluation is relatively small. Note that the control program for the additional processing may omit the necessity determination step and always perform the additional processing at the time of data analysis. Further, the data analysis program may be configured so that whether to notify the operator that the additional processing is performed can be selected.
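The necessity determination (S40) can be sketched as below. The source does not define how a "biased" score distribution or a "relatively small" number of reference components is measured, so this sketch assumes two illustrative proxies: the fraction of target data with a zero score, and a minimum component count. The names `AdditionMode`, `needs_additional_processing`, `min_components`, and `max_unscored_ratio`, as well as the default values, are hypothetical.

```python
from enum import Enum


class AdditionMode(Enum):
    ALWAYS = 1       # (1) additional processing is performed
    NEVER = 2        # (2) additional processing is not performed
    SITUATIONAL = 3  # (3) performed depending on the situation


def needs_additional_processing(mode, scores, n_reference_components,
                                min_components=50, max_unscored_ratio=0.5):
    """Decide whether to run the additional processing (S40).

    scores: scores of the target data from the evaluation phase.
    n_reference_components: number of reference components used for that evaluation.
    The two thresholds are illustrative assumptions, not values from the source.
    """
    if mode is AdditionMode.ALWAYS:
        return True
    if mode is AdditionMode.NEVER:
        return False
    # Assumed proxy for a biased distribution: too many items could not be scored.
    unscored_ratio = sum(1 for s in scores if s == 0) / len(scores) if scores else 0.0
    return n_reference_components < min_components or unscored_ratio > max_unscored_ratio
```

In the situational mode, the function returns `True` either when few reference components were available or when most target data ended up with a zero score, both of which suggest that the reference data were too few for the target data.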

When the component evaluation unit 15 determines that the additional processing is not necessary, the flow ends without performing it. When the component evaluation unit 15 determines that it is necessary, the processing moves on to setting related components based on the evaluation result of the reference data. To set the related components, the component evaluation unit 15 determines, from among the components (reference components) extracted from the reference data, a specific reference component that serves as the basis for determining the related components (S41).

The specific reference component may be one, several, or all of the plurality of reference components extracted from the reference data. How the specific reference component is determined may be based on configuration information set for the operating environment for data analysis. In a preferred aspect, a predetermined number of reference components are selected as specific reference components in descending order of evaluation value (for example, only the reference component having the highest evaluation value). This is because the higher the evaluation value of a reference component, the higher the relevance to the predetermined case of the related components derived from it. If the number of specific reference components is greater than the optimal value, the evaluation of the target data may become inconsistent with the classification of the reference data; if it is less than the optimal value, the target data may not be evaluated sufficiently. The “predetermined number” may therefore be determined appropriately by the system.
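Selecting a predetermined number of reference components in descending order of evaluation value (S41) can be sketched as follows. The function name `select_specific_components` and the example weights are hypothetical.

```python
import heapq


def select_specific_components(weights, n=1):
    """Return the n reference components with the highest evaluation values.

    weights: dict mapping reference component -> evaluation value.
    """
    return heapq.nlargest(n, weights, key=weights.get)
```

With `n=1` this returns only the reference component with the highest evaluation value, which is the example given in the text; a larger `n` trades consistency with the reference classification against coverage of the target data.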

When the specific reference component is determined, the component evaluation unit 15 determines related components based on the specific reference component (S42). The component evaluation unit 15 extracts, from all or part of the target data, components that do not exist in the reference data and that have a co-occurrence relationship with the specific reference component, and sets the extracted components as related components. For example, if “price” is the specific reference component and “adjustment”, “decision”, and “consultation” co-occur with “price” in expressions such as “adjust price”, “determine price”, and “consult price”, then “adjustment”, “decision”, and “consultation” are extracted and set as related components. The related components are therefore information that the system adds automatically, without user input and without increasing the reference data, for the evaluation (scoring) of the target data. If no related component exists in the target data, the component evaluation unit 15 may add new specific reference components until a related component can be extracted from the target data.
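The extraction in step S42 can be illustrated by the following minimal sketch. It is not the implementation of the embodiment: it assumes that each target data item has already been split into components (tokens), that co-occurrence is judged at the level of a whole data item, and the function name `related_components` and the sample data are hypothetical.

```python
from collections import Counter

def related_components(target_docs, reference_vocab, specific):
    """Count components that co-occur with `specific` in a target data item
    and that do not appear in the reference data (assumed co-occurrence unit:
    one data item)."""
    counts = Counter()
    for doc in target_docs:
        tokens = set(doc)
        if specific in tokens:
            for tok in tokens:
                # Skip the specific reference component itself and anything
                # already known from the reference data.
                if tok != specific and tok not in reference_vocab:
                    counts[tok] += 1
    return counts

# Hypothetical, already tokenized target data.
target_docs = [
    ["adjust", "price"],
    ["determine", "price"],
    ["consult", "price", "adjust"],
    ["meeting", "schedule"],
]
reference_vocab = {"price", "meeting"}
candidates = related_components(target_docs, reference_vocab, "price")
print(candidates.most_common())
```

The counts returned here correspond to the number (n) of target data in which each candidate co-occurs with the specific reference component, which a later step could use when assigning evaluation values.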

Subsequently, the component evaluation unit 15 evaluates each related component according to an attribute of the related component (for example, its information transmission amount) (S43). For example, the component evaluation unit 15 detects the target data in which a related component such as “adjustment”, “decision”, or “consultation” co-occurs with the specific reference component (“price”), and specifies the number (n) of the detected target data. The additional processing control program regards the detected target data as relevant to the predetermined case (that is, as data whose classification information would be “Related” or “High”), calculates the information transmission amount from a predetermined definition formula based on the appearance probability of the related component and the appearance probability of the classification information in all the target data, and thereby estimates the evaluation value of each related component. For example, if the number (n) of target data in which each related component co-occurs with the specific reference component decreases in the order of the first element (adjustment), the second element (decision), the third element (consultation), and so on, the evaluation values satisfy first element (adjustment) > second element (decision) > third element (consultation) > .... Specifically, the component evaluation unit 15 can evaluate the evaluation value (weight) of the related component according to the following formula.

Alternatively, the component evaluation unit 15 may evaluate the evaluation value of the related component according to the following formula.

Here, CF represents the frequency with which the j0-th reference component m_j0 and the j1-th related component m_j1 co-occur in the same sentence (collocation frequency), DF represents the frequency with which the two co-occur in the same data, and w represents the weight (evaluation result) of the reference component m_j0. F represents an arbitrary function, for example,

May be,

It may be. The component evaluation unit 15 may also evaluate a related component based on whether the target data including the related component is classified as “Related” or “Non-Related” according to the evaluation of the reference components, and on the evaluation result (score value) of the target data (S18).

Next, the data evaluation unit 17 re-evaluates all target data (recalculates the scores) based on the evaluation values of the reference components and the evaluation values of the related components (S44). Furthermore, the data evaluation unit 17 ranks all the target data according to the evaluation results and creates ranking information for all the target data. The data evaluation unit 17 also compares the evaluation result of each target data with predetermined threshold information and sets classification information for each target data. The data evaluation unit 17 can output the ranking information, including the classification information, to the client device 3.
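The re-evaluation in step S44 can be sketched as follows. The scoring rule (summing the evaluation values of the components present in a data item), the label names, and all function names are assumptions made for illustration only; the embodiment does not prescribe them here.

```python
def score(doc_tokens, weights):
    """Assumed scoring rule: sum the evaluation values of the distinct
    components present in one target data item."""
    return sum(weights.get(tok, 0.0) for tok in set(doc_tokens))

def rank_and_classify(docs, reference_weights, related_weights, threshold):
    """Re-score every target data item with both weight sets, rank the items
    in descending order of score, and attach classification information."""
    # If a component appears in both sets, the reference weight takes priority.
    weights = {**related_weights, **reference_weights}
    scored = [(doc_id, score(tokens, weights)) for doc_id, tokens in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (doc_id, s, "Related" if s >= threshold else "Non-Related")
        for doc_id, s in scored
    ]

# Hypothetical data: "adjust" is a related component added by S42 and S43.
docs = {"d1": ["price", "adjust"], "d2": ["price"], "d3": ["lunch"]}
ranking = rank_and_classify(docs, {"price": 1.0}, {"adjust": 0.5}, threshold=1.2)
for row in ranking:
    print(row)
```

In this toy input, the data item containing both the reference component and the related component is the only one that reaches the threshold, which is the kind of difference the additional processing is intended to surface.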

As described above, according to the data analysis system of the present embodiment, a component related to a component included in the reference data can be added to the evaluation of the target data as a new component. Therefore, regardless of the ratio of the number of reference data to the number of target data, and even when the number of reference data is insufficient, a large number of target data can be evaluated accurately. The component evaluation unit 15 may also determine the related component from components that are synonyms of the specific reference component and that are not included in the reference data but are included in the target data. In this case, the component evaluation unit 15 may use the search table of the database 4 to select synonyms of the specific reference component. Here, synonyms are two different morphemes that are related, for example, by sharing a higher-level concept morpheme. In addition, the component evaluation unit 15 may set the related components from both the synonyms and the above-described morphemes having a co-occurrence relationship with the specific reference component. Furthermore, the component evaluation unit 15 may set related components from morphemes having a co-occurrence relationship with the synonyms. Furthermore, when no synonym of the specific reference component exists in the target data, the component evaluation unit 15 may use another word having a meaning similar to that of the synonym as a candidate for the related component.

In the above-described embodiment, the target data is first evaluated (scored) based on the reference components and is then evaluated based on both the related components and the reference components. This makes it possible to present the difference between the two evaluations to the user and to apply the former evaluation result to the determination and evaluation of the related components. However, the target data may instead be evaluated based on the related components and the reference components without first being evaluated based on the reference components alone.

Furthermore, in the above-described data analysis system, when the operating environment is set so as to notify the user of the additional processing, the server device 12 can cause the client device 3 to display the specific reference components and the related component candidates in order of evaluation value, so that the user can select whether to adopt each of them for data analysis.

(Pattern update function)
The predictive coding unit 10 can optimize the evaluation values of components based on given reference data and/or newly obtained reference data, for example, as described in (1) to (3) below.

(1) Optimization of evaluation values
The component evaluation unit 15 can update the learned pattern by calculating the recall rate or the precision rate from the result of evaluating the target data and repeatedly re-evaluating the degree to which each component contributes to the combination of the data and the classification information, so that the recall rate or the precision rate increases.

Here, the “recall rate” is an index indicating the ratio (coverage) of the data to be discovered that is contained in a given amount of data. For example, the statement “the recall rate is 80% for 30% of all data” means that 80% of the data to be discovered is contained in the top 30% of the data ranked by index. (If the data is reviewed exhaustively (linear review) without using the data analysis system, the amount of data to be discovered that is found is proportional to the amount reviewed, so the greater the deviation from this proportion, the better the performance of the system.) The “precision rate” is an index indicating the ratio (accuracy) of the data that truly should be discovered to the data discovered by the system. For example, the statement “the precision rate is 80% when 30% of all data has been processed” means that 80% of the top 30% of the data ranked by index is data to be discovered.
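The two rates defined above can be computed as in the following minimal sketch. It assumes the target data is already sorted in descending order of index and that a ground-truth flag (True for data to be discovered) is available for every item; the function name and sample flags are hypothetical.

```python
def recall_precision_at(ranked_labels, fraction):
    """ranked_labels: booleans sorted by descending index, True meaning
    'data to be discovered'. Returns (recall, precision) for the top
    `fraction` of the ranking."""
    cutoff = round(len(ranked_labels) * fraction)
    found = sum(ranked_labels[:cutoff])
    total_relevant = sum(ranked_labels)
    recall = found / total_relevant if total_relevant else 0.0
    precision = found / cutoff if cutoff else 0.0
    return recall, precision

# Hypothetical ranking of ten items, four of which should be discovered.
ranked = [True, True, False, True, False, False, False, False, False, True]
recall, precision = recall_precision_at(ranked, 0.3)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

With a linear review, the recall at 30% of the data would be expected to be about 30%, so a recall well above that figure at the same cutoff indicates that the ranking is doing useful work.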

The component extraction unit 14 calculates the recall rate or the precision rate based on the result evaluated by the data evaluation unit 17 and, when the recall rate or the precision rate is lower than a target value, re-extracts components from the data until the recall rate or the precision rate exceeds the target value. At this time, the component extraction unit 14 may extract components other than those extracted previously, or may replace some of the previously extracted components with new components. When the data evaluation unit 17 derives the index of the target data using the re-extracted components, it may derive an index (second index) for each data using the re-extracted components and their evaluation values, and may derive the recall rate or the precision rate again from the second index and the first index obtained before the components were re-extracted. Thereby, the data analysis system further exhibits the additional effect that the accuracy of data analysis can be improved.

(2) Evaluation of components based on a convolution method
After evaluating a component included in the reference data, the component evaluation unit 15 can re-evaluate the component by convolving the evaluation values of other components into its evaluation value, so that the evaluation values of the other components are reflected in it. As a result, the relevance between the component and the other components is reflected in the evaluation value of the component, so the data analysis system exhibits the additional effect of further improving the accuracy of data analysis.

(3) Optimization timing
The component evaluation unit 15 can update a pattern (for example, a combination of a component and the evaluation value of the component) at an arbitrary timing. That is, the component evaluation unit 15 can update the pattern, for example, (a) when an update request is received from an administrative user who manages the system, (b) when a preset date and time arrives, and/or (c) when an input regarding an additional review is received from the user.

The user can confirm (confirmation review) the content of the target data from which the index is derived by the data evaluation unit 17, and can newly input classification information for the target data. At this time, the classification information acquisition unit 12 may acquire newly input classification information, and the data classification unit 13 may combine the target data and the classification information and use the combination as new reference data. The new reference data is stored in an arbitrary memory, and is fed back to the system, for example, at the timings (a) to (c).

Thereby, the component extraction unit 14 extracts the component from the new reference data, and the component evaluation unit 15 evaluates the component. When the constituent element has been evaluated before and the constituent element and its evaluation value are stored in the memory, the constituent element storage unit 16 replaces the evaluation value with a new evaluation result (evaluation value) and stores it. If not, the component and the evaluation value are associated with each other and newly stored in the memory.

That is, the predictive coding unit 10 can update the learned pattern by re-evaluating, at an arbitrary timing (for example, the timings (a) to (c) described above), the degree to which a plurality of components constituting at least part of the data contribute to the combination of the data and the classification information. Thereby, the data analysis system further exhibits the additional effect that the accuracy of data analysis can be improved.

(Management function)
The predictive coding unit 10 can further include a management unit 18 (for example, the management unit 18 has the following functions (1) to (5)).

(1) Review Heat Map
As an example, consider the case where the data evaluation unit 17 derives an index for each of a plurality of target data, and the user checks each target data and assigns classification information to it (confirmation review), for example in descending order of the relevance to the predetermined case indicated by the index. In this case, the management unit 18 can visibly display how the ratio of target data to which the classification information has been assigned, among all the target data, is distributed over the evaluation results of the plurality of target data, using a gradation corresponding to that ratio.

For example, when the data evaluation unit 17 derives a numerical value in the range of 0 to 10000 as the index, the management unit 18 classifies each target data into one of the ranges obtained by dividing the index every 1000 (that is, 0 to 1000 as the first section, 1001 to 2000 as the second section, 2001 to 3000 as the third section, and so on; for example, target data whose index is 2500 is classified into the third section). For a given range, the management unit 18 can then display the range with a color tone that reflects the ratio of target data to which predetermined classification information (for example, “Related”) has been assigned to the total number of target data classified into that range (for example, the higher the ratio, the warmer the color, and the lower the ratio, the cooler the color). The management unit 18 displays the other ranges in the same manner.

Since the management unit 18 can thus display the distribution of the ratio over the ranges using gradation, it can suggest, for example, that the user's confirmation review may be wrong when a range whose index indicates high relevance between the target data and the predetermined case (for example, the ninth section, where the index is 8001 to 9000) is nevertheless shown in a cool color tone. That is, the data analysis system further provides the additional effect of allowing the user to grasp the distribution at a glance.
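The sectioning and ratio calculation behind the review heat map can be sketched as follows. The section boundaries follow the example in the text; the function name, the input layout, and the use of None for sections without data are assumptions, and the mapping from a ratio to a warm or cool color is left out.

```python
def review_heatmap(items, bin_width=1000, max_index=10000):
    """items: (index, label) pairs, where label is 'Related', another label,
    or None for data not yet reviewed. Returns, per section, the ratio of
    'Related' data to all data classified into that section."""
    n_bins = max_index // bin_width
    totals = [0] * n_bins
    related = [0] * n_bins
    for index, label in items:
        # Sections as in the text: 0-1000 first, 1001-2000 second, and so on.
        b = min(max(index - 1, 0) // bin_width, n_bins - 1)
        totals[b] += 1
        if label == "Related":
            related[b] += 1
    # None marks a section that contains no target data.
    return [r / t if t else None for r, t in zip(related, totals)]

# Hypothetical review state.
items = [(2500, "Related"), (2700, None), (8500, None), (8600, "Non-Related")]
ratios = review_heatmap(items)
print(ratios[2], ratios[8])
```

A low ratio in a high-index section, such as the ninth section in this toy input, is the pattern that would be rendered in a cool tone and prompt a second look at the confirmation review.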

(2) Central Linkage
The management unit 18 can visualize interrelationships (for example, hierarchical relationships, affiliation relationships, data transmission and reception, etc.) between a plurality of subjects (for example, people, organizations, computers, etc.). For example, when an e-mail is transmitted from a first computer to a second computer, the management unit 18 can display, on a predetermined display device (for example, a display provided in the client device 3), a diagram in which a first circle representing the first computer and a second circle representing the second computer are connected by an arrow from the first circle to the second circle (for example, an arrow whose thickness corresponds to the size of the e-mail).

Further, the management unit 18 can visualize the interrelationships according to the result evaluated by the data evaluation unit 17. For example, when the data evaluation unit 17 derives a numerical value in the range of 0 to 10000 as the index, the management unit 18 can display the diagram on the predetermined display device based only on the target data associated with an index belonging to a specified section (for example, an e-mail transmitted from the first computer to the second computer). Thereby, the data analysis system further exhibits the additional effect of allowing the user to grasp the interrelationships between a plurality of subjects at a glance.

(3) Behavior Extractor
The management unit 18 determines whether a first component representing a predetermined operation is included in the target data and, when it determines that the first component is included, can identify a second component representing the target of the predetermined operation.

For example, when the sentence “determine the specification” is included in the target data, the management unit 18 extracts the components “specification” and “determine” from the sentence, and identifies the other component (object) “specification” as the target of the component “determine” (a verb representing the predetermined operation). Next, the management unit 18 associates meta information (attribute information) indicating an attribute (property or feature) of the target data that includes these two components with the first component and the second component. Here, the meta information is information indicating a predetermined attribute of data. For example, when the target data is an e-mail, the meta information may be the name of the person who sent the e-mail, the name of the person who received it, their e-mail addresses, the date and time of transmission or reception, and the like.

Then, the management unit 18 associates the two components with the meta information and displays them on a predetermined display device (for example, a display provided in the client device 3). For example, the management unit 18 can display, on the display device, a diagram in which a first circle representing the first component and a second circle representing the second component are connected by an arrow from the first circle to the second circle. Thereby, the data analysis system further exhibits the additional effect that the user can grasp the predetermined operation and its target at a glance.

(4) Automatic summarization based on generative concept extraction
The management unit 18 can extract, from a plurality of target data, data including components that correspond to subordinate concepts of a preselected concept, and can generate content (for example, sentences, graphs, tables, etc.) that summarizes the plurality of target data.

First, the user selects several concepts according to the topic to be detected from the target data, and registers the selected concepts in the management unit 18 in advance. For example, if the topic to be detected is “illegal” or “dissatisfied”, the concepts are divided into five categories, “behavior”, “emotion”, “nature/state”, “risk”, and “money”, and the user registers in the management unit 18 concepts such as “behavior” and “despise” for the “behavior” category, “feeling” and “being angry” for the “emotion” category, “dullness” and “bad attitude” for the “nature/state” category, “risk” and “danger” for the “risk” category, and “money paid for human labor” for the “money” category.

For each registered concept, the management unit 18 searches the reference data for components corresponding to subordinate concepts of the concept, associates the found components with the concept, and stores them in an arbitrary memory (for example, the storage system 18). Then, the management unit 18 extracts the stored components from the target data, specifies the concepts associated with those components, and outputs a summary using the concepts.

For example, the management unit 18 extracts the concepts “system”, “sale”, and “do” from the text “monitoring system order” included in one e-mail, extracts the same concepts “system”, “sale”, and “do” from the text “accounting system introduction” included in another e-mail, and outputs “sell system” as a summary of these e-mails. At this time, the management unit 18 can show, for example, a graph (for example, a pie chart) indicating the ratio of target data including the concept “sell system” to all target data. Thereby, the data analysis system further exhibits the additional effect of allowing the user to grasp the overall picture of the target data.

(5) Topic clustering
The management unit 18 can cluster the plurality of target data according to topics (subjects) included in the plurality of target data. For example, the management unit 18 can cluster a plurality of target data using an arbitrary classification model (for example, K-means, support vector machine, spherical clustering, etc.). Thereby, the data analysis system further exhibits an additional effect of allowing the user to grasp the entire image of the target data.

(Phase analysis function)
The predictive coding unit 10 may further include a phase analysis unit 19 (not shown in FIG. 2). The phase analysis unit 19 has the following functions (1) to (3), for example.

(1) Phase analysis
The phase analysis unit 19 can analyze phases, each of which indicates a stage in the progress of a predetermined case. Here, the flow in which the phase analysis unit 19 analyzes phases is described using an example in which the above system is realized as a criminal investigation support system and the predetermined case is “collusion”.

Collusion is known to progress in the order of a relationship building phase (the stage of building relationships with competitors), a preparation phase (the stage of exchanging information about competitors with competitors), and a competition phase (the stage of presenting prices to customers, obtaining feedback, and communicating with competitors). Therefore, the system administrator sets these three phases in the phase analysis unit 19. The system learns a plurality of patterns corresponding to the phases from a plurality of types of reference data prepared for the respective preset phases, and analyzes the target data based on each of the patterns, which makes it possible to specify, for example, “which phase the organization under analysis is currently in”.

That is, the component evaluation unit 15 refers to the plurality of types of reference data prepared for the respective preset phases, evaluates the components included in each type of reference data, and stores each component in the memory in association with the result of evaluating it (evaluation value) for each phase (that is, it learns a plurality of patterns corresponding to the plurality of phases). Next, the data evaluation unit 17 derives an index for each of the phases by analyzing the target data based on the pattern learned for that phase.

Then, the phase analysis unit 19 determines whether the index satisfies a predetermined determination criterion (for example, a threshold) set in advance for each phase (for example, whether the index exceeds the threshold) and, when it does, increments the count value corresponding to that phase. Finally, the phase analysis unit 19 specifies the current phase based on the count values (for example, the phase with the maximum count value is taken as the current phase). Alternatively, when the phase analysis unit 19 determines that the index derived for a phase satisfies the determination criterion set for that phase, it can specify that phase as the current phase. Thereby, the data analysis system further exhibits the additional effect that the phase indicating the stage to which a predetermined case has progressed can be suggested to the user.
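The count-based determination described above can be sketched as follows, assuming that an index has already been derived for every target data item under each phase's pattern. The function name, the data layout, and the sample values are hypothetical.

```python
def current_phase(indices_by_phase, thresholds):
    """indices_by_phase maps a phase name to the indices derived for the
    target data under that phase's learned pattern. A phase's count is the
    number of indices exceeding the threshold set for that phase."""
    counts = {
        phase: sum(1 for value in values if value > thresholds[phase])
        for phase, values in indices_by_phase.items()
    }
    # The phase with the maximum count is taken as the current phase.
    return max(counts, key=counts.get), counts

# Hypothetical indices for three target data items per phase.
indices = {
    "relationship building": [0.2, 0.9, 0.4],
    "preparation": [0.8, 0.7, 0.9],
    "competition": [0.1, 0.6, 0.2],
}
thresholds = {phase: 0.5 for phase in indices}
phase, counts = current_phase(indices, thresholds)
print(phase, counts)
```

If two phases tie on the count, this sketch returns whichever appears first in the input; the text does not specify a tie-breaking rule.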

(2) Phase progress prediction based on a prediction model
Based on a model that can predict the progress of a predetermined action related to a predetermined case, the phase analysis unit 19 can predict and present the next action from the indices derived by evaluating a plurality of target data.

That is, for example, the phase analysis unit 19 uses the index derived for the first phase (for example, the relationship building phase) and the index derived for the second phase (for example, the preparation phase) as variables. Assuming a regression model (a model in which the progress can be predicted), the possibility (for example, the probability) of proceeding to the third phase (for example, the competitive phase) can be predicted based on the regression coefficient optimized in advance. Thereby, the data analysis system further exhibits an additional effect that the result of predicting the progress of the predetermined action related to the predetermined case can be suggested to the user.
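As one possible reading of the regression model above, the following sketch uses a logistic regression whose coefficients are assumed to have been optimized in advance. The text only says “regression model”, so the logistic form, the function name, and the coefficient values are all assumptions made for illustration.

```python
import math

def probability_of_third_phase(index_phase1, index_phase2,
                               coef1, coef2, intercept):
    """Logistic model (an assumed form): the indices derived for the first
    and second phases are the explanatory variables, and the output is the
    estimated probability of proceeding to the third phase."""
    z = intercept + coef1 * index_phase1 + coef2 * index_phase2
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-optimized coefficients and hypothetical indices.
p = probability_of_third_phase(0.7, 0.9, coef1=2.0, coef2=3.0, intercept=-3.0)
print(f"probability of proceeding to the competition phase: {p:.3f}")
```

A logistic form is convenient here because its output always lies between 0 and 1 and can be presented to the user directly as a probability.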

(3) Optimization of determination criteria
The phase analysis unit 19 can optimize, according to given data, the above-described determination criterion (the predetermined criterion, for example a threshold, that is set in advance for each phase and used to specify the phase based on the index derived by the data evaluation unit 17). For example, the management unit 18 performs regression analysis on the relationship between the index derived for each of the plurality of target data and the ranking of the index (that is, the rank obtained when the indices are arranged in ascending order), and can reset the determination criterion (for example, change the threshold) based on the result of the regression analysis.

First, the administrator of the system sets a ranking threshold in advance. The phase analysis unit 19 then fits a function (for example, y = e^(αx + β), where e is the base of the natural logarithm and α and β are real-valued parameters) to the relationship between the index derived by the data evaluation unit 17 and the ranking of the index (for example, the parameters of the function are determined by the method of least squares), and sets the index that corresponds to the ranking threshold on the fitted function as the new determination criterion (the changed threshold). As a result, the data analysis system can optimize the determination criterion according to given data, and thus has the additional effect of improving the accuracy of data analysis.
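The fitting step can be sketched as follows. The sketch assumes that x is the ranking and y is the index, that all indices are positive, and that least squares is applied to ln y = αx + β (a log-linear fit, which is one common way to apply least squares to this function). The function names and sample values are hypothetical.

```python
import math

def fit_exponential(ranks, indices):
    """Least-squares fit of ln(y) = alpha * x + beta.
    Requires every index to be positive."""
    n = len(ranks)
    logs = [math.log(y) for y in indices]
    mean_x = sum(ranks) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(ranks, logs))
    alpha = sxy / sxx
    beta = mean_l - alpha * mean_x
    return alpha, beta

def index_at_rank(alpha, beta, rank):
    """Value of the fitted function y = e^(alpha * x + beta) at a ranking."""
    return math.exp(alpha * rank + beta)

# Hypothetical indices for rankings 1 to 5 (ascending order of index).
ranks = [1, 2, 3, 4, 5]
indices = [110.0, 190.0, 320.0, 560.0, 940.0]
alpha, beta = fit_exponential(ranks, indices)
new_threshold = index_at_rank(alpha, beta, rank=4)  # ranking threshold = 4
print(f"alpha={alpha:.3f} beta={beta:.3f} new threshold={new_threshold:.1f}")
```

Reading the threshold off the fitted curve rather than off the raw data makes the new criterion less sensitive to a single unusually high or low index near the ranking threshold.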

(Auxiliary function)
Each unit included in the predictive coding unit 10 can have, for example, the following auxiliary functions (1) to (6).

(1) High-resolution evaluation
The data evaluation unit 17 can evaluate target data with high resolution. That is, the data evaluation unit 17 can not only derive an index for the target data as a whole but also divide the target data into a plurality of parts (partial target data, for example, the sentences or paragraphs included in the target data) and evaluate each of the partial target data based on the learned pattern (that is, derive an index for each partial target data).

The data evaluation unit 17 can also integrate the indices derived for the respective partial target data and use the integrated index as the evaluation result of the target data. For example, when each index is derived as a numerical value, the maximum of the indices may be used as the integrated index for the target data, the average of the indices may be used as the integrated index, or a predetermined number of indices taken in descending order may be summed and used as the integrated index. Thereby, the data analysis system further exhibits the additional effect that the accuracy of data analysis can be improved.
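The three integration options named above can be sketched as follows. The function name, the method labels, and the default value of top_n are hypothetical.

```python
def integrate(partial_indices, method="max", top_n=2):
    """Combine the indices derived for partial target data (for example,
    sentences or paragraphs) into one index for the whole target data."""
    if method == "max":
        return max(partial_indices)
    if method == "mean":
        return sum(partial_indices) / len(partial_indices)
    if method == "top_sum":
        # Sum a predetermined number of indices taken in descending order.
        return sum(sorted(partial_indices, reverse=True)[:top_n])
    raise ValueError(f"unknown method: {method}")

# Hypothetical indices for three paragraphs of one document.
paragraph_indices = [0.2, 0.9, 0.4]
for method in ("max", "mean", "top_sum"):
    print(method, integrate(paragraph_indices, method))
```

Taking the maximum highlights a document in which even one paragraph is strongly relevant, whereas the average rewards documents that are relevant throughout; which behavior is preferable depends on the review objective.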

(2) Time-series evaluation
When analyzing data whose properties change with the passage of time (for example, an electronic medical record that records a medical condition progressing over time), the component evaluation unit 15 learns a pattern from each set of reference data obtained by dividing the reference data at predetermined intervals (for example, the reference data of the first section, the reference data of the second section, and so on); that is, it obtains the components and the results of evaluating them for each predetermined period. The data evaluation unit 17 can then evaluate the target data based on each of the patterns. In other words, the data evaluation unit 17 can derive indices for the target data along the time series. Thereby, the data analysis system further exhibits the additional effect that the accuracy of data analysis can be improved.

At this time, the data evaluation unit 17 can predict a future index based on the temporal change of the index. For example, the data evaluation unit 17 sets a model for time-series analysis (for example, an autoregressive model, a moving average model, etc.) and, based on the indices derived within a predetermined period (for example, the past month) before new target data is obtained, can predict the index that will be obtained when the new target data is evaluated. Thereby, the data analysis system can further exhibit the additional effect that an event that may occur in the future (for example, the risk of an undesirable situation occurring) can be presented to the user.
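The idea of predicting the next index from recent indices can be illustrated with the following sketch. It uses a simple trailing average as a stand-in; this is not a fitted autoregressive model or a moving average (MA) model in the time-series sense, and the function name, window size, and sample values are hypothetical.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next index as the mean of the most recent `window`
    indices (a simple trailing average used here as a stand-in)."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical indices derived over the past period, oldest first.
history = [10.0, 20.0, 30.0, 40.0]
print(moving_average_forecast(history, window=3))
```

A forecast that rises steadily over successive evaluations could be used as the trigger for presenting a risk to the user before the index itself crosses a threshold.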

(3) Case-by-case evaluation
When analyzing data whose properties change depending on the type of case (for example, litigation-related documents whose contents change according to the type of lawsuit, such as violation of antitrust law, information leakage, or patent infringement), the component evaluation unit 15 learns a pattern from the reference data prepared for each case (for example, reference data related to violation of the Antimonopoly Act, reference data related to information leakage, and so on); that is, it obtains the components and the results of evaluating them for each case. The data evaluation unit 17 can then evaluate the target data based on each of the patterns. Thereby, the data analysis system further exhibits the additional effect that the accuracy of data analysis can be improved.

(4) Syntax analysis
The data evaluation unit 17 can analyze the structure of the target data and reflect the analysis result in the evaluation of the target data. For example, when the target data at least partially includes text, the data evaluation unit 17 can analyze the expression form of each sentence included in the text (for example, whether the sentence is in a positive form or a negative form) and reflect the result of the analysis in the index derived for the target data. Here, the positive form is an expression that affirms the subject (for example, “the dish is delicious”), and the negative form is an expression that denies the subject (for example, “the dish is not delicious”). The expression form may also be a depolarized form, an expression that neither simply affirms nor simply denies the subject (for example, “it was not that the food was not delicious”).

The data evaluation unit 17 can adjust the index according to the expression form. For example, when the data evaluation unit 17 derives a numerical value in a predetermined range as the index, it can adjust the index by adding “+α” for the positive form, “−β” for the negative form, and “+θ” for the depolarized form (α, β, and θ may each be an arbitrary numerical value). Further, when the data evaluation unit 17 detects that a sentence included in the target data is in the negative form, it may, for example, cancel the sentence so that the components included in the sentence are not used as a basis for deriving the index (that is, the components are not considered).
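The adjustment by expression form can be sketched as follows, assuming the form of each sentence has already been determined by some separate analysis. The function name, the form labels, and the default parameter values are hypothetical.

```python
def adjust_index(base_index, sentence_forms, alpha=1.0, beta=1.0, theta=0.0):
    """Adjust an index by the expression form of each sentence: +alpha for a
    positive form, -beta for a negative form, +theta for a depolarized form."""
    adjustment = {"positive": alpha, "negative": -beta, "depolarized": theta}
    return base_index + sum(adjustment[form] for form in sentence_forms)

# Hypothetical document with one positive and two negative sentences.
forms = ["positive", "negative", "negative"]
print(adjust_index(10.0, forms, alpha=2.0, beta=3.0, theta=1.0))
```

Because α, β, and θ are free parameters, the same function covers both a mild correction and the stricter policy of heavily penalizing negated sentences.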

Furthermore, the constituent element evaluation unit 15 can increase or decrease the evaluation value of the constituent element depending on, for example, whether a certain morpheme (constituent element) is a subject, an object, or a predicate of the sentence. Thereby, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.

(5) Evaluation considering correlation (co-occurrence) between components
The data evaluation unit 17 can derive the index for the target data in consideration of the correlation between a first component and a second component included in the target data (co-occurrence, for example, the frequency with which both appear together).

For example, when the target data includes text at least in part and the first keyword (first component) “price” appears in the text, the data evaluation unit 17 can derive the index based on the number of times a second keyword (second component) appears at a second position near the first position at which the first keyword appears (for example, a position within a predetermined range that includes the first position). Thereby, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
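
The counting step described above could be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and a symmetric window measured in tokens; the function name `cooccurrence_count` and the `window` parameter are hypothetical, and the embodiment does not specify how the count is folded into the index.

```python
def cooccurrence_count(tokens, first, second, window=5):
    """Count occurrences of `second` within `window` tokens of each `first`."""
    count = 0
    for i, token in enumerate(tokens):
        if token != first:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        # Look at every position in the window except the first keyword itself.
        count += sum(1 for j in range(start, stop) if j != i and tokens[j] == second)
    return count
```

For instance, `cooccurrence_count(text.split(), "price", "adjustment", window=2)` counts how often “adjustment” appears within two tokens of “price”; a second keyword that lies near several occurrences of the first keyword is counted once per occurrence.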

(6) Emotion Analysis When the target data includes a user's evaluation information for a predetermined case, the data evaluation unit 17 can extract from the target data the emotion of the user who generated the target data, that is, the emotion toward the predetermined case that arose on the basis of the evaluation information (in other words, it can evaluate the emotions included in the target data).

For example, when data included in a website introducing a product/service (for example, an online shopping site or a restaurant guide) is to be analyzed, the data evaluation unit 17 can evaluate the target data (for example, data included in other websites) based on combinations (reference data) of components included in a comment (review) on the product/service (for example, keywords such as “good”, “fun”, “bad”, and “boring”) and the evaluation of the product/service (for example, “very good”, “good”, “normal”, “bad”, and “very bad”). At this time, the data evaluation unit 17 can increase or decrease the evaluation result according to, for example, emphatic expressions (for example, “very”). Thereby, the data analysis system further exhibits an additional effect that the accuracy of data analysis can be improved.
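
One possible reading of this procedure is sketched below. The numeric values assigned to the ratings, the averaging used to score each keyword, the `boost` factor, and the rule that an emphatic word amplifies only the word immediately after it are assumptions made for illustration; none of them is specified in the embodiment.

```python
from collections import defaultdict

# Assumed numeric values for the five ratings named in the text.
RATING_VALUE = {"very good": 2, "good": 1, "normal": 0, "bad": -1, "very bad": -2}
# Assumed set of emphatic expressions.
INTENSIFIERS = {"very"}


def learn_keyword_scores(reference):
    """reference: list of (review_text, rating). Score = mean rating per keyword."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, rating in reference:
        for word in set(text.lower().split()):
            totals[word] += RATING_VALUE[rating]
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def score_text(text, keyword_scores, boost=1.5):
    """Sum keyword scores, amplifying a keyword preceded by an emphatic word."""
    words = text.lower().split()
    score = 0.0
    for i, word in enumerate(words):
        value = keyword_scores.get(word, 0.0)
        if i > 0 and words[i - 1] in INTENSIFIERS:
            value *= boost
        score += value
    return score
```

Here the reviews with their ratings play the role of the reference data, and `score_text` is applied to target data from other websites.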

[Example of data analysis system processing data other than document data]
In the present embodiment, the case where the data analysis system analyzes document data has mainly been assumed, and an example based on that assumption has been described. However, the system can also analyze data other than document data (for example, audio data, image data, video data, etc.).

For example, when analyzing speech data, the system may analyze the speech data itself, or may convert the speech data into document data by speech recognition and use the converted document data as the analysis target. In the former case, for example, the system divides the speech data into partial speech segments of a predetermined length to form components, and identifies the partial speech segments using an arbitrary speech analysis method (for example, a hidden Markov model, a Kalman filter, etc.), whereby the speech data can be analyzed. In the latter case, the speech is recognized using an arbitrary speech recognition algorithm (for example, a recognition method using a hidden Markov model), and the recognized data can be analyzed by a procedure similar to the procedure described in the embodiment.

When analyzing image data, the system, for example, divides the image data into partial images of a predetermined size to form components, and identifies the partial images using an arbitrary image recognition method (for example, pattern matching, a support vector machine, a neural network, etc.), whereby the image data can be analyzed.

Further, when analyzing video data, the system, for example, divides each of a plurality of frame images included in the video data into partial images of a predetermined size to form components, and identifies the partial images using an arbitrary image recognition method (for example, pattern matching, a support vector machine, a neural network, etc.), whereby the video data can be analyzed.
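
The division into components described for speech, image, and video data could be sketched as follows. The sketch assumes that speech is a flat list of samples and that an image is a list of pixel rows; the function names are hypothetical, and the subsequent identification step (hidden Markov model, support vector machine, etc.) is outside the scope of the sketch.

```python
def split_audio(samples, segment_len):
    """Divide a sample sequence into partial segments of a fixed length."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]


def split_image(image, tile_h, tile_w):
    """Divide an image (list of pixel rows) into partial images of a fixed size."""
    if tile_h <= 0 or tile_w <= 0:
        raise ValueError("tile size must be positive")
    width = len(image[0]) if image else 0
    tiles = []
    for top in range(0, len(image), tile_h):
        for left in range(0, width, tile_w):
            tiles.append([row[left:left + tile_w] for row in image[top:top + tile_h]])
    return tiles


def split_video(frames, tile_h, tile_w):
    """Divide every frame image of a video into partial images."""
    return [tile for frame in frames for tile in split_image(frame, tile_h, tile_w)]
```

Segments or tiles at the trailing edge may be smaller than the predetermined size; whether to pad or discard them is a design choice that the embodiment leaves open.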

[Example of implementation using software and hardware]
The control block of the data analysis system may be realized by a logic circuit (hardware) formed on an integrated circuit (IC chip) or the like, or may be realized by software using a CPU. In the latter case, the system includes a CPU that executes the instructions of a program (the control program for the data analysis system), which is software that implements each function; a ROM (Read Only Memory) or a storage device (these are referred to as “recording media”) in which the program and various data are recorded so as to be readable by a computer (or the CPU); a RAM (Random Access Memory) into which the program is loaded; and the like. The object of the present invention is achieved when the computer (or CPU) reads the program from the recording medium and executes it. As the recording medium, a “non-transitory tangible medium” such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may be supplied to the computer via an arbitrary transmission medium (such as a communication network or a broadcast wave) that can transmit the program. The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission. The program can be implemented in any programming language, for example, a script language such as Python, ActionScript, or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5. Any recording medium on which the program is recorded also falls within the scope of the present invention.

[Other application examples]
In the above-described embodiment, an example has been described in which the system uses, for the evaluation of the target data, a related component that has a relationship with a component included in the reference data, and is thereby realized as a system that can accurately evaluate the target data even if the number of reference data is small. However, the system can also be implemented as any other system, for example, a discovery support system, a forensic system, an e-mail monitoring system, a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, a diagnosis support system, etc.), an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, a social media management system, etc.), an information leakage detection system, a project evaluation system, a marketing support system, an intellectual property evaluation system, an unauthorized transaction monitoring system, a call center escalation system, or a credit check system.

For example, when the data analysis system of the present invention is realized as a discovery support system, the data analysis system evaluates target data (for example, documents, e-mails, spreadsheet data, etc.) based on a predetermined evaluation standard (for example, whether or not the data should be submitted in the discovery procedure of the lawsuit in question). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can efficiently and reliably submit to the court only the documents related to the lawsuit.

When the data analysis system of the present invention is realized as a forensic system, the data analysis system evaluates target data (for example, documents, e-mails, spreadsheet data, etc.) based on a predetermined evaluation standard (for example, whether or not the data is evidence that can prove a criminal act). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can efficiently and reliably extract evidence that proves the criminal act.

Further, when the data analysis system of the present invention is realized as an e-mail monitoring system, the data analysis system evaluates target data (for example, e-mails, attached files, etc.) based on a predetermined evaluation standard (for example, whether or not the user who sent or received the e-mail is attempting fraud). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can efficiently and reliably detect signs of fraud such as information leakage and collusion.

In addition, when the data analysis system of the present invention is realized as a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, a diagnosis support system, etc.), the data analysis system evaluates target data (for example, electronic medical records, nursing records, patient diaries, etc.) based on a predetermined evaluation standard (for example, whether or not the patient takes a specific risky action, whether or not a drug is effective, etc.). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably predict that a patient will fall into a dangerous state (for example, a fall) and objectively evaluate the efficacy of drugs.

When the data analysis system of the present invention is realized as an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, a social media management system, etc.), the data analysis system evaluates target data (for example, a message posted by the user to an SNS, recommended information posted on a website, the profile of a user or group, etc.) based on a predetermined evaluation standard (for example, whether or not the user's preferences match those of other users, whether or not the user's preferences match the attributes of a restaurant, etc.). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably display a list of other users who are likely to get along with the user, present restaurant information that suits the user's preferences, or issue a warning about a group that may cause harm to the user.

In addition, when the data analysis system of the present invention is realized as an information leakage detection system, the data analysis system evaluates target data (for example, e-mails, database access log information, etc.) based on a predetermined evaluation standard (for example, whether or not the user who sent or received the e-mail is attempting fraud). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can efficiently and reliably find signs of information leakage.

In addition, when the data analysis system of the present invention is realized as an information asset utilization system (project evaluation system), the data analysis system dynamically extracts, according to the situation of a project, information that is useful for the project from the information assets (target data) held by companies and experts. By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably (1) reuse information on products developed in the past in accordance with the requirements of a new development, in order to improve efficiency at development sites where a shorter development period is desired, and (2) identify useful information assets based on the expertise of skilled engineers.

When the data analysis system of the present invention is realized as a marketing support system, the data analysis system evaluates target data (for example, company/individual profiles, product information, etc.) based on a predetermined evaluation standard (for example, whether the product is intended for men or women, whether or not consumers have a favorable impression of the product, etc.). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably extract the market's evaluation of a certain product.

Further, when the data analysis system of the present invention is realized as an intellectual property evaluation system, the data analysis system evaluates target data (for example, patent publications, documents summarizing an invention, academic papers, etc.) based on a predetermined evaluation standard (for example, whether or not the patent publication can be used as evidence to reject or invalidate a given patent). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably extract invalidation materials from a large number of documents (for example, patent publications, academic papers, text published on the Internet). At this time, the data analysis system can, for example, acquire as reference data combinations of each claim of the patent to be invalidated with a “Related” label (classification information) and combinations of each claim of an unrelated patent, different from that patent, with a “Non-Related” label (classification information), learn a pattern from the reference data, and calculate an index for a large number of documents (target data). For example, the system can evaluate the target data by calculating an index for each paragraph of a patent publication and summing a predetermined number of the highest paragraph indices to obtain the index of the patent publication.
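
The aggregation of paragraph indices into a publication index described above could be sketched as follows. The function names and the default of three paragraphs are assumptions for illustration; how each paragraph index is computed from the learned pattern is not shown.

```python
def publication_index(paragraph_indices, top_n=3):
    """Sum the top_n highest paragraph indices of one patent publication."""
    return sum(sorted(paragraph_indices, reverse=True)[:top_n])


def rank_publications(publications, top_n=3):
    """publications: dict mapping a publication name to its paragraph indices.

    Returns publication names ordered from highest to lowest index.
    """
    scored = {name: publication_index(p, top_n) for name, p in publications.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

Summing only the highest paragraph indices lets a publication with a few strongly related paragraphs rank above a long publication whose paragraphs are all weakly related.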

Further, when the data analysis system of the present invention is realized as an unauthorized transaction monitoring system, the data analysis system evaluates target data (for example, e-mails, financial transaction information, bid information, etc.) based on a predetermined evaluation standard (for example, whether or not the user who sent or received the e-mail is attempting a fraudulent transaction). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can efficiently and reliably detect signs of fraud such as cartels and collusion.

In addition, when the data analysis system of the present invention is realized as a call center escalation system, the data analysis system evaluates target data (for example, telephone call history, recorded voice, etc.) based on a predetermined evaluation standard (for example, whether or not the data is similar to a past response case). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably extract from past response cases the response method best suited to the current situation.

When the data analysis system of the present invention is realized as a credit check system, the data analysis system evaluates target data (for example, company profiles, information about company performance, information about stock prices, press releases, etc.) based on a predetermined evaluation standard (for example, whether or not the company will go bankrupt, whether or not the company will grow, etc.). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably predict corporate growth or bankruptcy.

Further, when the data analysis system of the present invention is realized as a driving support system, the data analysis system evaluates target data (for example, data acquired from an in-vehicle sensor, a camera, a microphone, etc.) based on a predetermined evaluation standard (for example, whether or not the data is information that an expert driver pays attention to while driving). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably extract, automatically, useful information that makes driving safe and comfortable.

Furthermore, when the data analysis system of the present invention is realized as a sales support system, the data analysis system evaluates target data (for example, company/individual profiles, product information, etc.) based on a predetermined evaluation standard (for example, whether the product is intended for men or women, whether or not consumers have a favorable impression of the product, etc.). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably extract the market's evaluation of a certain product.

Furthermore, when the data analysis system of the present invention is realized as a financial system (for example, a stock price prediction system), the data analysis system evaluates target data (for example, stock market quotations) based on a predetermined evaluation standard (for example, a standard concerning a stock price). By using, for the evaluation of the target data, related components that are related to the components included in the reference data, the system can accurately evaluate the target data even if the number of reference data is small, and can, for example, efficiently and reliably predict future stock prices.

Depending on the field to which the data analysis system of the present invention is applied, and in consideration of circumstances peculiar to that field, preprocessing may be applied to the data (for example, extracting an important part from the data and making only that part the analysis target), or the mode of displaying the data analysis result may be changed. It will be understood by those skilled in the art that a variety of such variations can exist, and all such variations fall within the scope of the present invention.

[Summary]
A data analysis system according to an aspect of the present invention includes a memory, an input control device, and a controller. The controller generates an index for ranking a plurality of target data; the index corresponds to the relevance between each target data and a predetermined case and changes based on an input given by a user via the input control device. The memory stores the plurality of target data at least temporarily. The input control device presents sample data for the target data to the user; accepts input of classification information from the user, the classification information being associated with the sample data, based on the input, in order to classify the sample data; and provides a combination of the sample data and the classification information received from the user to the controller as reference data. The controller obtains a plurality of the reference data; extracts a first component from the plurality of reference data, the first component constituting at least a part of the reference data; evaluates the degree to which the first component contributes to the combination; extracts a second component having relevance to the evaluated first component from at least one of the plurality of target data, the second component constituting at least a part of the target data; evaluates the second component; and evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component. Therefore, according to the data analysis system, the second component can be supplemented for the evaluation of the target data, so that the data group can be analyzed accurately regardless of the amount of partial data relative to the data group.
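
The flow summarized above could be sketched end to end as follows. The sketch is illustrative only and rests on several assumptions that are not taken from the embodiment: components are whitespace-separated words, the contribution of a first component is the difference between the numbers of related and unrelated reference data that contain it, relevance between components is co-occurrence within the same target data, and a second component is weighted by how often it accompanies the first component. All function names are hypothetical.

```python
from collections import Counter


def evaluate_first_components(reference):
    """reference: list of (text, is_related). Returns {component: weight}."""
    related, unrelated = Counter(), Counter()
    for text, is_related in reference:
        (related if is_related else unrelated).update(set(text.lower().split()))
    return {w: related[w] - unrelated[w] for w in set(related) | set(unrelated)}


def evaluate_second_components(targets, first_weights, top_k=1):
    """Find components absent from the reference data that co-occur with the
    top_k highest-weighted first components, weighted by co-occurrence rate."""
    ranked = sorted(first_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    docs = [set(t.lower().split()) for t in targets]
    second = {}
    for first, weight in ranked:
        hosts = [d for d in docs if first in d]
        if not hosts:
            continue
        co = Counter(w for d in hosts for w in d if w not in first_weights)
        for word, n in co.items():
            second[word] = max(second.get(word, 0.0), weight * n / len(hosts))
    return second


def rank_targets(targets, first_weights, second_weights):
    """Order target data by an index summing first and second component weights."""
    def index(text):
        words = set(text.lower().split())
        return sum(first_weights.get(w, 0) + second_weights.get(w, 0.0) for w in words)
    return sorted(targets, key=index, reverse=True)
```

In such a sketch, target data that shares no component with the reference data can still receive a non-zero index through the second components, which is the supplementing effect the summary attributes to the system.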

In the data analysis system according to another aspect of the present invention, the controller determines the first component from among the plurality of components included in the reference data based on the superiority or inferiority of the evaluation of each of those components. This achieves an additional effect that the second component is set easily and quickly in association with the first component.

In the data analysis system according to another aspect of the present invention, the controller determines the first component from among the plurality of components included in the reference data based on the superiority or inferiority of the evaluation of each of those components. This achieves an additional effect that the second component is set appropriately, easily, and quickly in relation to the evaluation result of the first component.

In the data analysis system according to another aspect of the present invention, the controller determines, as the first component, the component having the highest evaluation among the plurality of components included in the target data. This achieves an additional effect that a component having a high evaluation value can be selected as the second component.

In the data analysis system according to another aspect of the present invention, the controller sets the relevance between the second component and the first component based on a predetermined criterion, and extracts the second component from the plurality of target data based on the criterion. The second component related to the first component can thereby be supplemented automatically as a component effective for the analysis of the target data, which achieves an additional effect that the target data can be accurately evaluated.

In the data analysis system according to another aspect of the present invention, the controller extracts the second component based on at least one of the following relationships: the second component is in a co-occurrence relationship with the first component, is in a similar-word relationship with the first component, or has meta information in common with the first component. Because the second component related to the first component is used in the analysis, an additional effect that the target data can be accurately evaluated is achieved.

In the data analysis system according to another aspect of the present invention, the controller evaluates the second component according to the frequency with which the second component exists in the plurality of target data with the predetermined relevance, and evaluates the plurality of target data based on the evaluation result of the second component. This achieves an additional effect that the second component can be evaluated accurately and reliably.

A control method for a data analysis system according to another aspect of the present invention is a control method for a data analysis system that analyzes data, and includes: a first step of generating an index for ranking a plurality of target data, the index corresponding to the relevance between each target data and a predetermined case and changing based on an input from a user; a second step of storing the plurality of target data at least temporarily; a third step of presenting sample data for the target data to the user; a fourth step of accepting input of classification information from the user, the classification information being associated with the sample data, based on the input, in order to classify the sample data; a fifth step of providing a combination of the sample data and the classification information received from the user as reference data; a sixth step of obtaining a plurality of the reference data; a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least a part of the reference data; an eighth step of evaluating the degree to which the first component contributes to the combination; a ninth step of extracting a second component having relevance to the evaluated first component from at least one of the plurality of target data, the second component constituting at least a part of the target data; a tenth step of evaluating the second component; and an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
Therefore, according to the control method of the data analysis system, the second component can be supplemented for the evaluation of the target data, so that the data group can be analyzed accurately regardless of the amount of partial data relative to the data group.

Furthermore, a control program for a data analysis system according to another aspect of the present invention causes a computer to execute each step included in the above control method for the data analysis system. A recording medium according to still another aspect of the present invention is a recording medium on which the control program for the data analysis system is recorded. Therefore, according to the control program and the recording medium, the second component can be supplemented for the evaluation of the target data, so that the data group can be analyzed accurately regardless of the amount of partial data relative to the data group.

A data analysis system according to another aspect of the present invention is, for example, a data analysis system that evaluates target data, and includes a memory, an input control device, and a controller. The controller evaluates a plurality of target data; the evaluation corresponds, for example, to the relevance between each target data and a predetermined case, and generates an index that enables the plurality of target data to be ranked. The index can change based on an input given by a user via the input control device. The memory stores, for example, at least temporarily the plurality of target data evaluated by the controller. The input control device allows the user, for example, to provide an input that the controller uses to rank the plurality of target data, and the order of the plurality of target data changes, for example, according to the index that changes based on the input. The input concerns, for example, reference data different from the plurality of target data and a classification of the relevance between the reference data and the predetermined case. The classification is, for example, divided into a plurality of classification information according to the content of the reference data, and at least one of the plurality of classification information is given to the reference data by the input. The reference data is presented to the user, and a combination of the reference data and the at least one classification information given to the presented reference data by the user's input is provided to the controller. The controller, for example, evaluates the degree to which each of a plurality of components included in the reference data contributes to the combination provided by the input control device, and thereby extracts from the reference data a pattern that characterizes the reference data according to the classification information given by the input; determines the index by evaluating, based on the extracted pattern, the relevance between the target data and the predetermined case; sets the determined index in the target data; ranks the plurality of target data according to the index; and notifies the user of the plurality of target data arranged in that order.

[Additional Notes]
The present invention is not limited to the above-described embodiments, and various modifications can be made within the scope of the claims. Embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, a new technical feature can be formed by combining the technical means disclosed in each embodiment.

The present invention can be widely applied to arbitrary computers such as a personal computer, a server device, a workstation, and a mainframe, and is particularly applicable to an artificial intelligence system.

DESCRIPTION OF SYMBOLS 1 ... Data analysis system, 2 ... Server apparatus, 3 ... Client apparatus, 4 ... Database, 5 ... Storage system, 6 ... Management computer, 10 ... Predictive coding unit, 11 ... Data acquisition unit, 12 ... Classification information acquisition unit, 13 ... Data classification unit, 14 ... Component extraction unit, 15 ... Component evaluation unit, 16 ... Component storage unit, 17 ... Data evaluation unit, 18 ... Management, 19 ... Phase analysis unit.

Claims

A data analysis system for analyzing data, comprising:
A memory, an input control device, and a controller, wherein
The controller generates an index for ranking a plurality of target data, the index corresponding to the relevance between each target data and a predetermined case and changing based on an input given by a user via the input control device,
The memory stores the plurality of target data at least temporarily,
The input control device
Presents sample data for the target data to the user,
Accepts input of classification information from the user, the classification information being associated with the sample data, based on the input, in order to classify the sample data, and
Provides a combination of the sample data and the classification information received from the user to the controller as reference data, and
The controller
Obtains a plurality of the reference data,
Extracts a first component from the plurality of reference data, the first component constituting at least a part of the reference data,
Evaluates the degree to which the first component contributes to the combination,
Extracts a second component having relevance to the evaluated first component from at least one of the plurality of target data, the second component constituting at least a part of the target data,
Evaluates the second component, and
Evaluates the relevance between the plurality of target data and the predetermined case by generating the index based on the evaluation result of the second component.
The data analysis system according to claim 1, wherein the controller selects the second component from among components not included in the reference data.
The data analysis system according to claim 2, wherein the controller determines the first component from among a plurality of components included in the reference data based on the relative merits of the evaluations of those components.
The data analysis system according to claim 3, wherein the controller determines, as the first component, a component having the highest evaluation among a plurality of components included in the target data.
The data analysis system according to claim 1, wherein the controller defines the relationship between the second component and the first component based on a predetermined criterion, and extracts the second component from the plurality of target data based on the criterion.
The data analysis system according to claim 5, wherein the controller extracts the second component based on at least one of the following relationships between the second component and the first component: a co-occurrence relationship, a similar-word relationship, and a relationship of having common meta information.
The data analysis system according to claim 5, wherein the controller evaluates the second component according to the frequency with which the second component appears in the plurality of target data with the predetermined relationship, and evaluates the plurality of target data based on the evaluation result of the second component.
The data analysis system according to claim 1, wherein the controller further generates the index based on an evaluation result of the first component.
The data analysis system according to claim 8, wherein the controller generates the index based on the evaluation result of the first component and evaluates the plurality of target data, and then generates the index based on the evaluation result of the first component and the evaluation result of the second component and evaluates the plurality of target data.
A method for controlling a data analysis system for analyzing data, the method comprising:
a first step of generating an index for ranking a plurality of target data, the index corresponding to the relationship between each target data and a predetermined case and changing based on input from a user;
a second step of storing the plurality of target data at least temporarily;
a third step of presenting sample data for the target data to the user;
a fourth step of receiving input of classification information from the user and classifying the sample data by associating the classification information with the sample data based on the input;
a fifth step of providing, as reference data, a combination of the sample data and the classification information received from the user;
a sixth step of obtaining a plurality of the reference data;
a seventh step of extracting a first component from the plurality of reference data, the first component constituting at least a part of the reference data;
an eighth step of assessing the degree to which the first component contributes to the combination;
a ninth step of extracting, from at least one of the plurality of target data, a second component having a relationship with the evaluated first component, the second component constituting at least a part of the target data;
a tenth step of evaluating the second component; and
an eleventh step of evaluating the relevance between the plurality of target data and the predetermined case by generating the index based on an evaluation result of the second component.
A data analysis system control program for causing a computer to execute each step included in the data analysis system control method according to claim 10.
A computer-readable recording medium in which the control program of the data analysis system according to claim 11 is recorded.
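
The processing recited in the claims above can be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions and does not define or limit the claimed subject matter. The assumptions are: components are whitespace-separated words; the first component is evaluated by a smoothed log-odds weight; the relationship between the first and second components is co-occurrence within the same target data; and the index is the sum of the weights of the components present. All function names (tokenize, evaluate_first_components, evaluate_second_components, rank) and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (the 'components' in this sketch)."""
    return text.lower().split()


def evaluate_first_components(reference_data):
    """Weight each word of the reference data by how strongly it points to the
    'relevant' classification (smoothed log-odds; an assumed measure)."""
    rel, non = Counter(), Counter()
    n_rel = n_non = 0
    for text, is_relevant in reference_data:
        tokens = set(tokenize(text))
        if is_relevant:
            rel.update(tokens)
            n_rel += 1
        else:
            non.update(tokens)
            n_non += 1
    weights = {}
    for tok in set(rel) | set(non):
        p_rel = (rel[tok] + 1) / (n_rel + 2)
        p_non = (non[tok] + 1) / (n_non + 2)
        weights[tok] = math.log(p_rel / p_non)
    return weights


def evaluate_second_components(first_weights, target_data):
    """Find words that co-occur in the target data with positively weighted
    first components but are absent from the reference data, and weight them
    by co-occurrence frequency (an assumed evaluation rule)."""
    seeds = {t: w for t, w in first_weights.items() if w > 0}
    seed_df = Counter()  # number of target data containing each seed
    cooc = Counter()     # (seed, candidate) -> number of co-occurrences
    for text in target_data:
        tokens = set(tokenize(text))
        for seed in tokens.intersection(seeds):
            seed_df[seed] += 1
            for cand in tokens.difference(first_weights):
                cooc[(seed, cand)] += 1
    second = Counter()
    for (seed, cand), n in cooc.items():
        second[cand] += seeds[seed] * n / seed_df[seed]
    return dict(second)


def rank(target_data, *weight_tables):
    """Generate the index for each target data and sort in descending order."""
    scored = []
    for i, text in enumerate(target_data):
        tokens = set(tokenize(text))
        score = sum(t.get(tok, 0.0) for t in weight_tables for tok in tokens)
        scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical reference data: (sample data, classification information).
reference = [("price fixing meeting", True), ("lunch meeting today", False)]
targets = ["price cartel agreement", "cartel talks tomorrow", "lunch menu tomorrow"]

first = evaluate_first_components(reference)
second = evaluate_second_components(first, targets)
ranking = rank(targets, first, second)
print([i for i, _ in ranking])  # order of target data by index: [0, 1, 2]
```

In this sketch the word "cartel" does not appear in the reference data, but it co-occurs with the positively weighted word "price" in the target data, so the second target data receives a positive index even though the first components alone would give it an index of zero.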