WO2017072822A1

WO2017072822A1 - Relevance evaluation system and method, program, and recording medium

Info

Publication number: WO2017072822A1
Application number: PCT/JP2015/005479
Authority: WO
Inventors: 秀樹武田; 和巳蓮子
Original assignee: 株式会社Ｕｂｉｃ
Priority date: 2015-10-30
Filing date: 2015-10-30
Publication date: 2017-05-04
Also published as: JPWO2017072822A1

Abstract

This relevance evaluation system is configured to evaluate relevance of examined data to reference data, and comprises: a data acquisition unit for acquiring each of reference data and examined data; an evaluation component extraction unit for extracting, from the examined data, evaluation components representing features of the reference data among data components of the reference data in the order of appearance corresponding to an arrangement direction of data components of the examined data; and a relevance evaluation unit for calculating a feature coefficient based on the order of appearance of the evaluation components of the examined data in the arrangement direction of the examined data. Even when a plurality of examined data groups show no difference in score values representing features, it is possible to recognize an examined data group that has a high degree of relevance to a reference data group.

Description

Relevance evaluation system, method, program, and recording medium

The present invention relates to a relevance evaluation system, method, and program for evaluating relevance between data, and to a recording medium storing the program.

A data aggregate (hereinafter simply referred to as “data”) composed of many data components (for example, “words” in the case of document data) always has some characteristic in its contents. For data composed of a large number of data components, it may be necessary to evaluate the characteristics of the data objectively without comparing the contents in detail. One such method calculates, for each piece of data, a characteristic value representing similarity and compares the data on that basis.

As an example of this method, Patent Document 1 discloses a similar document search. In that search, feature words characterizing the description content are extracted in advance from a document set made up of a large number of documents, and a set of feature words is created. Then, for each document constituting the document set, a feature vector from a data component serving as a reference is calculated for the feature words and stored. Subsequently, the similarity of an input document is calculated by comparison with the feature words, and the document with the most similar score value is determined to be the closest to the input document. As this example shows, similar document search commonly determines the degree of similarity from a value (hereinafter referred to as “score value”) calculated based on reference data. In Patent Document 1, weighting is performed from a grammatical point of view in order to increase the accuracy of the similarity determination.

Patent Document 1: JP 2014-106665 A

In a general method for examining the degree of relevance of a plurality of data with respect to reference data, such as the method disclosed in Patent Document 1, when a plurality of data having the same score value are found, it cannot be determined which of those data is most relevant to the reference data. Conventionally, therefore, the accuracy of determining the degree of relevance has generally been improved by improving the calculation accuracy of the score value for the data.

However, the type of data is not limited to document data as disclosed in Patent Document 1; data having various types of morphemes as data components, such as image data and audio data, can also be considered. The problem to be solved is therefore to obtain, by a simple method, an index that produces a difference in the degree of relevance of data with respect to the reference data.

The above problem is solved by a relevance evaluation system that evaluates the relevance of test data with respect to reference data, the system including: a data acquisition unit that acquires the reference data and the test data; an evaluation component extraction unit that extracts, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and a relevance evaluation unit that calculates a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.

The problem is also solved by a relevance evaluation method for evaluating the relevance between reference data and test data by a relevance evaluation system comprising a computer, in which: the reference data and the test data are each acquired; evaluation components that represent the characteristics of the reference data among the data components of the reference data are extracted from the test data in the order of appearance according to the arrangement direction of the data components of the test data; and a feature coefficient is calculated based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.

The problem is also solved by a relevance evaluation program that is executable in a relevance evaluation system comprising a computer and that evaluates the relevance between reference data and test data, the program executing the steps of: acquiring each of the reference data and the test data; extracting, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and calculating a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.

The problem is also solved by a storage medium storing a relevance evaluation program that is executable in a relevance evaluation system including a computer and that evaluates the relevance between reference data and test data, the program executing the steps of: acquiring each of the reference data and the test data; extracting, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and calculating a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.

The present invention makes it possible to select, from two or more pieces of data, the data closest to the reference data.

FIG. 1 is a diagram of the hardware configuration of the component relevance evaluation system 1 of the present invention.
FIG. 2 is a diagram explaining the reference data R and the test data T whose relevance is compared in the present invention.
FIG. 3 is a conceptual diagram showing the reference data R.
FIG. 4 is a conceptual diagram showing the test data T.
FIG. 5 is a diagram contrasting the evaluation components of the reference data R, in their order of appearance, with the evaluation components of the test data T, in their order of appearance.
FIG. 6 is a functional block diagram of the component relevance evaluation system 1 of Embodiment 1 of the present invention.
FIG. 7 is a diagram showing the algorithm of the program of Embodiment 1 of the present invention.
FIG. 8 is a diagram showing the algorithm of the program of Embodiment 2 of the present invention.
FIG. 9 is a functional block diagram of the component relevance evaluation system 1 of Embodiment 3 of the present invention.
FIG. 10 is a diagram showing the algorithm of the program of Embodiment 3 of the present invention.

(Embodiment 1)
[Hardware configuration of component relevance evaluation system]
A component relevance evaluation system (hereinafter simply referred to as “system”) of the present invention will be described with reference to FIG. 1. FIG. 1 is an example of a hardware configuration of the system 1. The system 1 includes a server device 10 and a client terminal 11. The server device 10 includes an arithmetic device 10a that performs calculation and a storage device 10b that stores data.

The server device 10 can execute the main processing of data analysis. The client terminal 11 can execute processing related to the data analysis performed in the server device 10. The storage device 10b is any recording medium (for example, a memory or a hard disk) that can store data (including digital data and analog data). The arithmetic device 10a is a controller (for example, a central processing unit (CPU)) that can execute a control program stored in a recording medium. The arithmetic device 10a is a computer or a computer system (a system that realizes data analysis by operating a plurality of computers in an integrated manner) that analyzes data stored at least temporarily in a recording medium. The arithmetic device 10a may be configured as a management computer (not shown) external to the server device 10, and the storage device 10b may be configured as a data storage server device 13 serving as an external storage device of the server device 10.

The management computer (not shown) may include, for example, a memory, a controller, a bus, an input/output interface, and a communication interface. Application programs that can control the respective devices of the client terminal 11, the server device 10, and the management computer (not shown) are stored in the memory provided in each of the client terminal 11, the server device 10, and the management computer (not shown). As each controller executes an application program, the application program (software resource) and the hardware resources cooperate to operate each device.

The storage device 10b is composed of, for example, a disk array system, and can include a database that records data and the results of evaluation and classification of the data. The server device 10 and the storage device 10b are connected by direct-attached storage (DAS) or a storage area network (SAN).

The client terminal 11 presents to the user data that is in the middle of processing in the server device 10. As a result, the user can input, that is, provide, classification information through bidirectional exchange via the client terminal 11. The client terminal 11 includes, for example, a memory, a controller, a bus, an input/output interface (for example, a keyboard and a display), and a communication interface (for communication by communication means using a predetermined network). The client terminal 11 may be configured to include an input device 12 such as a scanner.

Note that the hardware configuration shown in FIG. 1 is merely an example, and the system 1 can be realized by other hardware configurations. For example, part or all of the processing may be executed in the server device 10, or part or all of the processing may be executed in the client terminal 11. In this embodiment, the input device 12 is connected to the client terminal 11 and transmits data to the server device 10 through it; however, the input device 12 may instead be connected directly to the server device 10 and input data to the server from there. Those skilled in the art will understand that various hardware configurations can implement the system 1 and that the configuration is not limited to the one illustrated in FIG. 1.

[Principle of relevance evaluation of component relevance evaluation system]
Next, the principle of relevance evaluation in the component relevance evaluation system according to the present invention will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating the reference data R and the test data T whose relevance is compared in the present invention. The component relevance evaluation in the present invention determines which of two or more test data T (in this embodiment, test data T1 and test data T2) is more highly related to the reference data R. A feature coefficient is calculated as an index for evaluating the relevance, and the degree of relevance is evaluated with it. The test data T1 and T2 and the reference data R are all aggregates of data components. That is, the test data T1 is composed of a plurality of data components t1, the test data T2 is composed of a plurality of data components t2, and the reference data R is composed of a plurality of data components r. The data type of the test data T and the reference data R is not particularly limited. It may be document data, or any other data aggregate such as image data or audio data, as long as it is an aggregate of unit data. Accordingly, the data components include: morphemes, keywords, sentences, paragraphs, and/or metadata (for example, header information of an e-mail) constituting a document; partial voices, volume (gain) information, and/or timbre information constituting audio; partial images, partial pixels, and/or luminance information constituting an image; and frame images, motion information, and/or three-dimensional information constituting a video.

For example, if the test data T and the reference data R are document data, a typical data component is text data such as a word or phrase constituting the document. The test data T and the reference data R are typically of the same data type, but they need not be. For example, when the reference data R is document data whose data components are words and the test data T is speech data, the degree of relevance can be evaluated by comparing the data components as characters with the word data as speech.

Subsequently, the principle of relevance evaluation in the component relevance evaluation system according to the present invention will be described with reference to FIGS. 3 to 5. FIG. 3 is a diagram showing the data components of the reference data R. First, the arrangement direction of the data components constituting the reference data R is defined; the arrangement direction is chosen as needed for evaluating the contents of the reference data R. In the example shown in FIG. 3, the arrangement direction runs from left to right, and the rightmost data component is followed by the leftmost data component in the line one level down, from which the arrangement again proceeds to the right. As a simple example, in the case of document data, the order of the character strings is the arrangement direction. In the case of image data or the like, the arrangement direction most appropriate for evaluation is determined.
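The left-to-right, line-by-line arrangement direction described above can be sketched as a row-major flattening of a grid of data components. This is only an illustration: the function name `arrange_row_major` and the sample grid are hypothetical, and other arrangement directions may suit other data types.

```python
from typing import List, Sequence


def arrange_row_major(rows: Sequence[Sequence[str]]) -> List[str]:
    """Return the components left to right within each line, then continue
    from the leftmost component of the line one level down."""
    return [component for row in rows for component in row]


# Hypothetical 2 x 3 grid of data components.
grid = [
    ["a", "b", "c"],
    ["d", "e", "f"],
]
sequence = arrange_row_major(grid)
print(sequence)  # ['a', 'b', 'c', 'd', 'e', 'f']
```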

Subsequently, among the data components constituting the reference data R, a plurality of data components that best represent the characteristics of the content of the reference data R are extracted as evaluation components, in order of appearance according to the predefined arrangement direction. In the example shown in FIG. 3, five evaluation components m1, m2, m3, m4, and m5 are selected. The evaluation components and their order of appearance are selected so as to represent the characteristics of the content of the reference data R most accurately. The evaluation components m1, m2, m3, m4, and m5 of the reference data R and their order of appearance function as predetermined criteria for evaluating the relevance of the test data T.

Next, the evaluation of the degree of relevance of the test data T with the reference data R will be described. FIG. 4 is a diagram showing the test data T. First, as in the case of the reference data R, the arrangement direction of the data components constituting the test data T is determined; the arrangement direction is chosen as needed for evaluating the contents of the test data T. In the example shown in FIG. 4, as in FIG. 3, the arrangement direction runs from left to right, and the rightmost data component is followed by the leftmost data component in the line one level down, from which the arrangement again proceeds to the right. Next, the evaluation components m1, m2, m3, m4, and m5 previously defined in the reference data R are detected in the test data T in the order of appearance. In the example of FIG. 4, the data components m1, m4, m3, and m2 of the test data T corresponding to the evaluation components are detected in this order. No data component of the test data T corresponding to the evaluation component m5 is detected. That is, among the five evaluation components m1, m2, m3, m4, and m5 previously defined in the reference data R, the data components m1, m2, m3, and m4 are extracted from the test data T, and their order of appearance is m1, m4, m3, m2.
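A minimal sketch of this detection step follows, assuming the test data T has already been arranged into a one-dimensional sequence. The function name, the filler components, and the choice to count a repeated component only at its first occurrence are assumptions made for illustration; the text does not state how repeats are handled.

```python
from typing import Iterable, List, Sequence


def extract_in_order_of_appearance(
    test_sequence: Sequence[str], evaluation_components: Iterable[str]
) -> List[str]:
    """Return the evaluation components found in the test sequence,
    in the order in which they first appear."""
    wanted = set(evaluation_components)
    seen = set()
    found = []
    for component in test_sequence:
        # Assumption: a repeated component is counted at its first occurrence only.
        if component in wanted and component not in seen:
            seen.add(component)
            found.append(component)
    return found


# Hypothetical test sequence; "x", "y", "z", "w" stand for other data components.
test_sequence = ["x", "m1", "y", "m4", "z", "m3", "m1", "m2", "w"]
detected = extract_in_order_of_appearance(
    test_sequence, ["m1", "m2", "m3", "m4", "m5"]
)
print(detected)  # ['m1', 'm4', 'm3', 'm2']  (m5 is not detected)
```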

Next, the evaluation components m1, m4, m3, and m2 detected in the test data T in order of appearance are compared with the order of appearance in the reference data R to examine the relevance of the test data T to the reference data R. FIG. 5 shows the evaluation components m1, m2, m3, m4, and m5 of the reference data R in their order of appearance (upper side in FIG. 5) in contrast with the evaluation components m1, m4, m3, and m2 of the test data T in their order of appearance (lower side in FIG. 5). Here, a characteristic coefficient (Order), which is an index indicating the degree of relevance, is defined as follows.

The characteristic coefficient (Order) is a ratio. Its denominator is the number of combinations of two evaluation components selected from the evaluation components detected in the test data T, and its numerator is the number of those combinations whose order of appearance is the same as in the reference data R. When the number of evaluation components detected in the test data T is N, the denominator is N(N-1)/2. For example, in the example of FIGS. 4 and 5, four evaluation components m1, m2, m3, and m4 are detected in the test data T, so there are six combinations: (m1, m2), (m1, m3), (m1, m4), (m2, m3), (m2, m4), and (m3, m4).

The numerator is the number of combinations, among all the combinations of two evaluation components detected in the test data T, whose order of appearance is the same as in the reference data R. Only the order of appearance is considered; whether another component appears between the two is not evaluated. In the example of FIGS. 4 and 5, three of the above combinations, (m1, m2), (m1, m3), and (m1, m4), have the same order of appearance as in the reference data R. The presence of m4 between m1 and m3 is not subject to evaluation. In this case, therefore, the characteristic coefficient (Order) is 3/6 = 0.5.

If the data is completely the same as the reference data R, the order of appearance is the same in all the combinations of two selected from the evaluation components, so that T(N)/F(N) = 1.0, that is, the characteristic coefficient (Order) is 1. The more evaluation components appear in the test data T in the same order as in the reference data R, the higher the relevance between the test data T and the reference data R and the closer the characteristic coefficient (Order) is to 1. Conversely, when the relevance between the test data T and the reference data R is low, the characteristic coefficient (Order) is close to zero. Therefore, a larger characteristic coefficient (Order) indicates higher relevance between the test data T and the reference data R, and a smaller characteristic coefficient (Order) indicates lower relevance. The feature coefficient satisfies 0 ≦ feature coefficient (Order) ≦ 1.
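The worked example above can be checked with a short sketch. The function name `order_coefficient` is hypothetical, and returning `None` when fewer than two evaluation components are detected is an assumption, since the text does not define the coefficient for that case.

```python
from itertools import combinations
from typing import Optional, Sequence


def order_coefficient(
    reference_order: Sequence[str], detected_order: Sequence[str]
) -> Optional[float]:
    """Ratio of pairs of detected evaluation components whose order of
    appearance in the test data matches their order in the reference data."""
    ref_rank = {c: i for i, c in enumerate(reference_order)}
    test_rank = {c: i for i, c in enumerate(detected_order)}
    # List the detected components in reference order, so that every pair
    # (a, b) produced below has a before b in the reference data.
    detected = sorted(test_rank, key=ref_rank.__getitem__)
    pairs = list(combinations(detected, 2))  # N(N-1)/2 pairs
    if not pairs:
        return None  # assumption: undefined for fewer than two components
    matching = sum(1 for a, b in pairs if test_rank[a] < test_rank[b])
    return matching / len(pairs)


reference = ["m1", "m2", "m3", "m4", "m5"]
print(order_coefficient(reference, ["m1", "m4", "m3", "m2"]))  # 0.5
```

Only relative order enters the calculation, so components appearing between a pair have no effect on the result.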

[Functional block configuration of component relevance evaluation system]
FIG. 6 is a diagram illustrating an example of a functional block configuration of the system 1. The system 1 includes, for example, a reference data acquisition unit 21, a test data acquisition unit 22, an arrangement direction determination unit 23, an evaluation component extraction unit 24, a component storage unit 25, and a component relevance evaluation unit 26. The route from the reference data acquisition unit 21 to the component storage unit 25 via the arrangement direction determination unit 23 and the evaluation component extraction unit 24 is the learning process for the reference data R. The route from the test data acquisition unit 22 to the component relevance evaluation unit 26 via the arrangement direction determination unit 23 and the evaluation component extraction unit 24 is the relevance evaluation process for the test data T with respect to the reference data R.

The reference data acquisition unit 21 acquires all the data components constituting the reference data R that is input from the input device 12 or the client terminal 11, or that is already stored in the storage device 10b.

When the reference data acquisition unit 21 and the test data acquisition unit 22 acquire all the data components, they output the data to the arrangement direction determination unit 23, which determines the arrangement direction of these data components and associates it with them. All the data components associated with the arrangement direction are output to the evaluation component extraction unit 24. Depending on the data, the determination of the arrangement direction may be omitted by using, as is, the arrangement direction the data had when acquired by the reference data acquisition unit 21 and the test data acquisition unit 22. In this case, the arrangement direction determination unit 23 is unnecessary. The determination of the arrangement direction may also be performed by the reference data acquisition unit 21 and the test data acquisition unit 22, or by the evaluation component extraction unit 24. The evaluation component extraction unit 24 extracts a component group that best represents the content features of the reference data R. In this process, the user can select the component group using the client terminal 11. Here, a “component group” is a group of data components. The component group selected by the evaluation component extraction unit 24 is output to the component storage unit 25, which stores it in the storage device 10b or the data storage server device 13.

The evaluation component extraction unit 24 extracts the evaluation components m1, m2, m3, m4, and m5 from the data components that constitute the reference data R for which the arrangement direction is determined. The number of evaluation components extracted by the evaluation component extraction unit 24 is arbitrarily determined according to the characteristics of the reference data R. The evaluation component extraction unit 24 outputs the extracted evaluation components m1, m2, m3, m4, and m5 to the component storage unit 25, which stores them in the storage device 10b or the data storage server device 13. The above is the learning process of relevance evaluation.

Subsequently, the relevance evaluation process for the test data T with respect to the reference data R will be described. The arrangement direction determination unit 23 and the evaluation component extraction unit 24 function in this evaluation process in the same way as described above. That is, as shown in FIG. 6, similarly to the reference data acquisition unit 21, the test data acquisition unit 22 acquires all the data components constituting the test data T that is input from the input device 12 or the client terminal 11, or that is stored in the storage device 10b.

When the test data acquisition unit 22 acquires all the data components, it outputs the data to the arrangement direction determination unit 23. The reference data acquisition unit 21 and the test data acquisition unit 22 need not be configured separately and can be a single data acquisition unit. The arrangement direction determination unit 23 determines the arrangement direction and associates it with the data components. All the data components associated with the arrangement direction are output to the evaluation component extraction unit 24. The evaluation component extraction unit 24 extracts the evaluation components stored in the storage device 10b or the data storage server device 13 from all the data components of the test data T associated with the arrangement direction. Not all evaluation components are necessarily extracted: among the data components of the test data T, those corresponding to the evaluation components selected for the reference data R in the learning process are extracted in the order of appearance. In the example of FIG. 4, the evaluation component extraction unit 24 extracts evaluation components in the order of appearance m1, m4, m3, m2 according to the arrangement direction. The extracted evaluation components m1, m4, m3, and m2 are output to the component relevance evaluation unit 26. The component relevance evaluation unit 26 calculates the characteristic coefficient (Order) described above.

In addition, the component relevance evaluation unit 26 reads an evaluation value associated with each component input from the evaluation component extraction unit 24 from an arbitrary memory (for example, the storage device 10b) and evaluates the target data based on the evaluation value. The evaluation value is a weighting value set in advance for each evaluation component selected in the reference data R according to its characteristics. More specifically, the component relevance evaluation unit 26 can derive an index of the target data (for example, a numerical value, letter, and/or symbol that makes the target data orderable) by, for example, adding up the evaluation values associated with the components that constitute at least a part of the target data. A score value, for example, can be used as this index. Here, the score value (Score) is an index for quantitatively evaluating the strength of relevance of the test data T with respect to the data components of the reference data R. As long as that strength of relevance can be expressed quantitatively, the calculation method of the score value (Score) is not limited. The score value may be calculated by a general method as long as the content of the reference data R can be appropriately evaluated. As one example, the score value can be expressed by an equation in terms of the evaluation value defined for each evaluation component extracted in the reference data R and the frequency with which that evaluation component appears in the test data T. The component relevance evaluation unit 26 can associate the test data T with the score value and store both in the storage device 10b.

In the above, each configuration described as a “unit” is a functional configuration realized by the controller included in the system 1 executing a program, so “unit” may be rephrased as “processing” or “function”. In addition, since a “unit” can be replaced by hardware resources, those skilled in the art will understand that these functional blocks can be realized in various forms by hardware only, software only, or a combination thereof, and are not limited to any one of these.
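As a sketch only: one possible form of such a score, assumed here for illustration and not taken from the source, is the sum over evaluation components of (evaluation value) × (frequency of appearance in the test data T). The function name and the weights below are hypothetical.

```python
from collections import Counter
from typing import Mapping, Sequence


def score_value(
    test_sequence: Sequence[str], evaluation_values: Mapping[str, float]
) -> float:
    """Assumed form: sum of (evaluation value x frequency in the test data)."""
    counts = Counter(test_sequence)  # a missing component counts as 0
    return sum(
        weight * counts[component]
        for component, weight in evaluation_values.items()
    )


# Hypothetical weights (evaluation values) for three evaluation components.
evaluation_values = {"m1": 3.0, "m2": 2.0, "m3": 1.0}
test_sequence = ["m1", "x", "m3", "m1", "y"]
print(score_value(test_sequence, evaluation_values))  # 7.0
```

Any other scoring method that expresses the strength of relevance quantitatively could be substituted, as the text notes.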

[Algorithm configuration of the program executed in the component relevance evaluation system]
Subsequently, the algorithm of the program executed by the system 1 to realize the above functions will be described. First, the reference data R is fetched (S101). Subsequently, the arrangement direction of the data components is determined for the fetched reference data R (S102). From the reference data R whose arrangement direction has been determined, a plurality of data components that best represent the characteristics of the content of the reference data R are extracted together with their order of appearance according to the predefined arrangement direction, and are defined as the evaluation component group for relevance evaluation (S103). The extracted evaluation component group and its appearance order data are stored in the storage device 10b (S104). The above is the learning process using the reference data R for relevance evaluation. The relevance evaluation process for the test data T with respect to the reference data R then follows. First, the test data T is fetched (S105). Subsequently, the arrangement direction of the data components constituting the test data T is determined (S106). The evaluation components for relevance evaluation determined in advance in the learning process are extracted from the test data T whose arrangement direction has been determined (S107). The evaluation components extracted from the test data T are arranged in their order of appearance, in the same manner as for the reference data R (S108). Subsequently, a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data is calculated. The feature coefficient can be calculated as the degree to which the order of appearance of each combination of two evaluation components selected from those extracted from the test data T matches the predefined order of appearance of the evaluation components of the reference data R.
That is, the degree of matching, expressed for example as a matching rate, corresponds to the feature coefficient (Order). For example, among all the combinations of two evaluation components selected from those extracted from the test data T, “1” is assigned to each combination whose order of appearance matches and “0” to each that does not. As described above, only the order of appearance is considered, and the appearance of another component between the two is not evaluated. Then the characteristic coefficient (Order) = (number of combinations assigned 1) / (total number of combinations of two) is calculated (S109).
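The two phases of steps S101 to S109 can be sketched end to end as follows. The function names `learn` and `evaluate`, the sample sequences, and the first-occurrence handling of repeated components are assumptions; in particular, the choice of evaluation components (S103) is passed in as an argument because the text leaves the selection method open.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple


def learn(reference_sequence: Sequence[str], chosen: Sequence[str]) -> List[str]:
    """S101-S104: keep the chosen evaluation components in their order of
    appearance in the (already arranged) reference data."""
    ordered: List[str] = []
    for component in reference_sequence:
        if component in chosen and component not in ordered:
            ordered.append(component)
    return ordered


def evaluate(
    test_sequence: Sequence[str], stored_order: Sequence[str]
) -> Tuple[Dict[Tuple[str, str], int], Optional[float]]:
    """S105-S109: detect the stored components in the test data, assign 1 or 0
    to each pair, and return the flags with the resulting Order."""
    position: Dict[str, int] = {}
    for index, component in enumerate(test_sequence):
        if component in stored_order and component not in position:
            position[component] = index  # assumption: first occurrence only
    detected = [c for c in stored_order if c in position]
    # Each pair (a, b) has a before b in the reference; flag 1 if the test agrees.
    flags = {
        (a, b): 1 if position[a] < position[b] else 0
        for a, b in combinations(detected, 2)
    }
    order = sum(flags.values()) / len(flags) if flags else None
    return flags, order


# Hypothetical arranged sequences; "p", "q", "r", "s", "t" are other components.
reference_sequence = ["m1", "p", "m2", "q", "m3", "m4", "r", "m5"]
test_sequence = ["m1", "s", "m4", "m3", "t", "m2"]
stored = learn(reference_sequence, ["m1", "m2", "m3", "m4", "m5"])
flags, order = evaluate(test_sequence, stored)
print(flags)
print(order)  # 0.5
```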

As described above, for example, when there are two or more test data T having the same or very similar score value (Score), the score value (Score) alone has a high relevance to the reference data R. However, in the present invention, by calculating the characteristic coefficient (Order), it can be determined that the larger the characteristic coefficient is, the higher the relevance with the reference data R is. For example, in the case of FIG. 2, when the test data T1 and the test data T2 both have a score value of 70 with respect to the reference data R, the characteristic coefficients (Order of the test data T1 and the test data T2 with respect to the reference data R) ) Are 0.6 and 0.8, respectively, it can be determined that the test data T2 is more relevant to the reference data R.
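The tie-breaking use of the characteristic coefficient in this example can be sketched as a two-key sort. The `Evaluated` record and `rank_by_relevance` function are hypothetical names introduced for illustration; sorting by score first and by Order second is one straightforward reading of the example.

```python
from typing import List, NamedTuple


class Evaluated(NamedTuple):
    name: str
    score: float   # score value (Score)
    order: float   # characteristic coefficient (Order)


def rank_by_relevance(items: List[Evaluated]) -> List[Evaluated]:
    """Sort by score value, using the characteristic coefficient to break ties."""
    return sorted(items, key=lambda e: (e.score, e.order), reverse=True)


results = [Evaluated("T1", 70, 0.6), Evaluated("T2", 70, 0.8)]
ranked = rank_by_relevance(results)
print([e.name for e in ranked])  # ['T2', 'T1']
```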

Also, when there are two or more test data T whose score values are not identical but are very close to one another, a distribution diagram in which one axis is assigned to the score value and the other axis is assigned to the feature coefficient (Order) may be output on an output means such as a display or a printer. In this way, the user can be provided with information, based on the two elements "score value" and "feature coefficient", that makes it easy to judge the relevance of the test data T to the reference data R.

(Embodiment 2)
In the first embodiment described above, the system 1 calculates the feature coefficient (Order) and the determination is made on that basis. However, by using the feature coefficient (Order) to correct the score value, the degree of relevance of the test data T can also be evaluated with the corrected score value. This is described below as a second embodiment.

In the second embodiment, the hardware configuration and the functional block diagram of the system 1 are the same as those in the first embodiment, so only the differences are described here, with reference to FIG. 6 and FIG. 8. FIG. 8 shows the algorithm of the program according to the second embodiment. In the first embodiment, the component relevance evaluation unit 26 in the functional block of FIG. 6 only calculates the feature coefficient (Order). In the second embodiment, by contrast, the component relevance evaluation unit 26 uses the calculated feature coefficient (Order) as a correction value for the score value calculated in advance for the test data T. In FIG. 8, the steps from fetching the reference data R (S201) to calculating the feature coefficient (Order) (S209) are the same as steps S101 to S109 of the first embodiment.

In the second embodiment, after calculating the feature coefficient (Order), the component relevance evaluation unit 26 corrects the score value (Score^RAW) calculated in advance for the test data T by multiplying it by the feature coefficient, that is, (corrected score value) = (Score^RAW) × (Order) (S210). As described in the first embodiment, the score value may be calculated by any general method as long as the content of the reference data R can be appropriately evaluated.
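A minimal sketch of the correction in step S210 is given below, assuming that the correction is the multiplication of the score value by the feature coefficient recited in the claims. The function name and the sample values are hypothetical and are not taken from the embodiment.

```python
def corrected_score(raw_score, order):
    """Correct a raw score value by the feature coefficient (assumed: simple product)."""
    if not 0.0 <= order <= 1.0:
        raise ValueError("order is a matching rate and must lie in [0, 1]")
    return raw_score * order


# Hypothetical (raw score, Order) values for two pieces of test data.
samples = {"A": (80.0, 0.5), "B": (78.0, 0.75)}
corrected = {name: corrected_score(raw, order) for name, (raw, order) in samples.items()}
best = max(corrected, key=corrected.get)
print(corrected, best)
```

Here the test data with the lower raw score value obtains the higher corrected score value because its evaluation components appear in an order closer to that of the reference data.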

The second embodiment is particularly useful when there are two or more test data T whose score values are very similar but not the same. In that case, even if the feature coefficient (Order) of one of them is large, the differing score values can make a direct comparison difficult. By using the score value corrected by the feature coefficient, it can be determined that the larger the corrected score value is, the higher the relevance to the reference data R is.

For example, in the case of FIG. 2, when the score values (Score^RAW) of the test data T1 and the test data T2 with respect to the reference data R are 72 and 71, respectively, and their feature coefficients (Order) with respect to the reference data R are 0.65 and 0.67, respectively, the score values corrected by the feature coefficient are 46.8 (= 72 × 0.65) and approximately 47.6 (= 71 × 0.67), respectively. As a result, although the uncorrected score value is higher for the test data T1, it can be determined that the test data T2 is more relevant to the reference data R.

(Embodiment 3)
In the system 1 according to the second embodiment, the score values (Score^RAW) of the test data T1 and the test data T2 with respect to the reference data R are calculated separately from the feature coefficient (Order). That is, the second embodiment is a form that can be used when the evaluation component group used to calculate the score value differs from the one used to calculate the feature coefficient (Order). In the third embodiment, the score value and the feature coefficient are calculated in a single series of processes using common evaluation components determined in advance from the reference data R. This is described below as a third embodiment.

FIG. 9 is a diagram illustrating an example of a functional block configuration of the system 1 according to the third embodiment. As in the first embodiment, the system 1 includes a reference data acquisition unit 21, a test data acquisition unit 22, an arrangement direction determination unit 23, an evaluation component extraction unit 24, and a component storage unit 25. Since these are the same as those in the first embodiment, only the differences will be described. In addition to these, the third embodiment includes a component relevance evaluation unit 26, a score value calculation unit 27, and a score value correction unit 28. The evaluation component extraction unit 24 extracts an evaluation component group that most appropriately represents the content of the reference data R and classifies it into N groups. The score value calculation unit 27 calculates a score value (Score(i)^RAW) for each of the N groups. The score value may be calculated by any general method as long as the content of the reference data R can be appropriately evaluated. The component relevance evaluation unit 26 calculates, for each group of evaluation components, a feature coefficient (Order), which is the proportion of two-component combinations, selected by the method of the first embodiment, whose appearance order is the same as in the reference data R. The calculation method of the feature coefficient (Order) is as described in the first embodiment. The score value correction unit 28 then multiplies the score value (Score(i)^RAW) by the feature coefficient (Order) for each group and calculates the sum of the products over the N groups, that is, (corrected score value) = Σ (i = 1 to N) Score(i)^RAW × Order(i), where Order(i) denotes the feature coefficient of the i-th group.

Next, the algorithm of the third embodiment will be described with reference to FIG. 10, which shows the algorithm in the third embodiment. First, the reference data R is fetched (S301). The arrangement direction of the data components is then determined for the fetched reference data R (S302). From the reference data R whose arrangement direction has been determined, a plurality of data components that best represent the characteristics of the content of the reference data R are extracted, together with their appearance order along the determined arrangement direction, and are defined as the evaluation components for relevance evaluation. At this time, the evaluation component group is classified into N groups (S303). The extracted evaluation components and their appearance order data are stored in the storage device 10b (S304). This completes the learning process, which uses the reference data R for relevance evaluation.

Next, the process of evaluating the relevance of the test data T to the reference data R is performed. First, the test data T is fetched (S305). The arrangement direction of the data components constituting the test data T is then determined (S306). The evaluation components for relevance evaluation that were defined in advance in the learning process are extracted from the test data T whose arrangement direction has been determined (S307). A score value (Score(i)^RAW) is calculated for each of the N groups of evaluation components (S308). Separately, for each of the N evaluation component groups, the evaluation components that appear in the same order as in the reference data R are extracted. For the evaluation component groups extracted from the test data T, the degree to which the appearance order of each combination of two selected components coincides with the appearance order of the evaluation component group of the reference data R defined in advance is then acquired.

The degree of coincidence can be expressed as the feature coefficient (Order), for example as a matching rate. Specifically, for every combination of two evaluation components selected from those extracted from the test data T, "1" is assigned if the appearance order of the pair matches that in the reference data R, and "0" is assigned if it does not (S309). As noted above, only the order of appearance is considered; whether another component appears between the two components is not evaluated. The feature coefficient is then calculated as (Order) = (number of combinations to which 1 is assigned) / (total number of two-component combinations) (S310). For each group, the score value (Score(i)^RAW) is multiplied by the feature coefficient (Order), and the sum of the products is calculated (S311).
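Steps S308 to S311 can be sketched as follows, under the assumption that each group is supplied as a raw score value together with the appearance order of its evaluation components in the reference data and in the test data. The function names, the data layout, and the sample values are hypothetical and serve only to illustrate the multiply-and-sum operation.

```python
from itertools import combinations


def order_coefficient(reference_order, test_order):
    """Fraction of component pairs in test_order that keep the reference order."""
    rank = {c: i for i, c in enumerate(reference_order)}
    found = [c for c in test_order if c in rank]
    pairs = list(combinations(found, 2))
    if not pairs:
        return 0.0  # assumption: no comparable pair gives no order evidence
    return sum(1 for a, b in pairs if rank[a] < rank[b]) / len(pairs)


def corrected_total_score(groups):
    """Sum over all groups of (raw score of the group) x (Order of the group).

    groups: iterable of (raw_score, reference_order, test_order) tuples.
    """
    return sum(
        raw * order_coefficient(ref, test)
        for raw, ref, test in groups
    )


# Two hypothetical groups (N = 2).
groups = [
    (40.0, ["a", "b", "c"], ["a", "b", "c"]),  # order fully preserved
    (30.0, ["x", "y", "z"], ["y", "x", "z"]),  # one of three pairs reversed
]
total = corrected_total_score(groups)
print(total)
```

Because both quantities are derived from the same groups of evaluation components, a single pass over the groups is sufficient in this sketch.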

As described above, since the score value (Score(i)^RAW) and the feature coefficient (Order) are calculated from the same evaluation component group, the calculation process is simplified and the corrected score value is easily obtained. The determination based on the corrected score value is the same as in the first and second embodiments.

[Example of implementation using software and hardware]
The control block of the data analysis system may be realized by a logic circuit (hardware) formed on an integrated circuit (IC chip) or the like, or may be realized by software using a CPU. In the latter case, the system includes a CPU that executes the instructions of a program (the control program of the data analysis system), which is software implementing each function; a ROM (Read Only Memory) or a storage device (these are referred to as "recording media") on which the program and various data are recorded so as to be readable by a computer (or CPU); a RAM (Random Access Memory) into which the program is loaded; and the like. The object of the present invention is achieved when the computer (or CPU) reads the program from the recording medium and executes it. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may be supplied to the computer via any transmission medium (such as a communication network or a broadcast wave) capable of transmitting the program. The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission. Note that the program can be implemented in any programming language. Any recording medium on which the program is recorded also falls within the scope of the present invention.

[Other application examples]
The system described above can be implemented as any system capable of evaluating relevance to a given case, such as an artificial intelligence system that analyzes big data. Examples include a discovery support system, a forensic system, an e-mail monitoring system, a medical application system (e.g., a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, or a diagnosis support system), an Internet application system (e.g., a smart mail system, an information aggregation (curation) system, a user monitoring system, or a social media management system), an information leakage detection system, a project evaluation system, a marketing support system, an intellectual property evaluation system, a fraud monitoring system, a call center escalation system, and a credit check system. Depending on the field to which the data analysis system of the present invention is applied, and in consideration of circumstances peculiar to that field, preprocessing may be applied to the data (for example, extracting an important part from the data and making only that part the target of data analysis), or the mode of displaying the data analysis result may be changed. It will be understood by those skilled in the art that a variety of such variations can exist, and all such variations fall within the scope of the present invention.

The present invention is not limited to the above-described embodiments, and various modifications can be made within the scope of the claims. Embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, a new technical feature can be formed by combining the technical means disclosed in each embodiment.

DESCRIPTION OF SYMBOLS
1 System
10 Server apparatus
11 Client terminal
12 Input apparatus
13 Data storage server apparatus
21 Reference data acquisition unit
22 Test data acquisition unit
23 Arrangement direction determination unit
24 Evaluation component extraction unit
25 Component storage unit
26 Component relevance evaluation unit
27 Score value calculation unit
28 Score value correction unit

Claims

A relevance evaluation system that evaluates the relevance of test data to reference data, comprising:
a data acquisition unit that acquires the reference data and the test data, respectively;
an evaluation component extraction unit that extracts, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and
a relevance evaluation unit that calculates a feature coefficient based on the appearance order of the evaluation components of the test data in the arrangement direction of the test data.
The relevance evaluation system according to claim 1, wherein
the feature coefficient is the proportion, out of the total number of combinations of two components selected from the evaluation components of the test data, of the combinations in which the two evaluation components appear in the same order as in the reference data.
The relevance evaluation system according to claim 1, wherein
the relevance evaluation unit performs an operation of multiplying the score value of the test data by the feature coefficient.
The relevance evaluation system according to claim 1, wherein
the evaluation component extraction unit classifies the evaluation components extracted from the test data into a plurality of groups,
the relevance evaluation system further comprises a score value calculation unit that calculates, for each of the plurality of groups, a score value based on the extracted evaluation components,
the relevance evaluation unit calculates the feature coefficient for each of the plurality of groups, and
the relevance evaluation system further comprises a score value correction unit that multiplies the score value by the feature coefficient for each of the plurality of groups and calculates the sum of the resulting products over all of the plurality of groups.
A relevance evaluation method for evaluating the relevance between reference data and test data using a relevance evaluation system comprising a computer, the method comprising:
acquiring the reference data and the test data, respectively;
extracting, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and
calculating a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.
A relevance evaluation program executable in a relevance evaluation system comprising a computer, the program evaluating relevance between reference data and test data and causing the computer to execute the steps of:
acquiring the reference data and the test data, respectively;
extracting, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and
calculating a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.
A storage medium storing a relevance evaluation program that is executable in a relevance evaluation system comprising a computer and that evaluates relevance between reference data and test data, the program causing the computer to execute the steps of:
acquiring the reference data and the test data, respectively;
extracting, from the test data, evaluation components that represent the characteristics of the reference data among the data components of the reference data, in the order of appearance according to the arrangement direction of the data components of the test data; and
calculating a feature coefficient based on the order of appearance of the evaluation components of the test data in the arrangement direction of the test data.