WO2013001571A1

WO2013001571A1 - Unstructured data analysis system

Info

Publication number: WO2013001571A1
Application number: PCT/JP2011/003695
Authority: WO
Inventors: 井口　慎也; 横井　一仁; 児玉　昇司; 陽介石井
Original assignee: 株式会社日立製作所
Priority date: 2011-06-29
Filing date: 2011-06-29
Publication date: 2013-01-03

Abstract

[Problem] When a data management system stores a large volume of unstructured data and an application specifies a short response time, there is insufficient time to extract and process information from the stored unstructured data, and sending back an on-time response is impossible. Also, even if an application makes a demand with a high level of secrecy, a data management system will request data analysis from an external network, and information leakage will occur. [Solution] When an application demands data from a data management system, a response time, information secrecy level and response precision level are specified. A data management system provides a means for requesting data analysis from an external network only when the secrecy level is low. Information extracted from unstructured data is managed by appending a precision-based score thereto, and is mapped to a precision level. When the response time is short, only locations with large scores are searched from managed information.

Description

Unstructured data analysis system

The present invention relates to a system for extracting semantic information by analyzing unstructured data such as text, video, images, and audio.

In recent years, in fields such as medical care, finance, corporate information, government agencies, and video surveillance, there is a growing need to analyze unstructured data that has traditionally only been stored and referenced, and to reuse it for purposes such as academic research and marketing. However, unstructured data takes a long time to process because of its large size and volume, and unlike numerical data it cannot be processed mechanically in its original form, so it is difficult to reuse.

For example, searching a large amount of image data for similar images at high speed requires basic technology suited to large-scale data processing, such as parallelizing similarity calculations across multiple computers and arranging data appropriately on the HDD. Also, in order to perform analysis such as similar case search and statistical processing on images scanned from paper medical records, a technique is required for extracting computer-processable information from the scanned images, such as numerical data (for example, examination results) and their meanings, and named entities and their attribute values, and converting it into structured data.

On the Internet, there is a growing trend to build applications that process large amounts of text data to extract and use useful information. Google has published Google Squared, which presents search results in tabular form (Non-patent Document 1). Unlike normal keyword search, Google Squared displays names and attributes that belong to subordinate concepts of the search keyword in a tabular format. For example, for the query “cat”, names such as American Shorthair and Persian are displayed together with images and descriptions. The images and descriptions are links to resources on the Internet. One of its features is an interface that lets the user extend the names and attributes to be displayed. IBM, on the other hand, built a question-answering (QA) system called "Watson", which competed on a popular quiz program in the US and won the highest prize money (Non-patent Document 2). This system instantly finds answers to a variety of complex queries expressed in natural language. What these technologies have in common is that they aggregate information from different resources to acquire and use useful knowledge. The key is “information extraction” technology, that is, how to extract useful information from a large number of documents. In particular, in the field of text analysis, technology for having computers handle synonyms, which differ in expression but share the same meaning, is being actively researched and developed in order to cope with the variation in wording used by humans (Non-patent Document 3). Research and development has also been carried out on extracting structured information from unstructured data in various formats such as video, images, and audio, for example high-speed similar image search (Non-patent Document 4) and voice search (Non-patent Document 5).

However, unstructured data analysis does not always involve processing one type of data with one type of processing method; it requires so-called multimodal processing, which captures meaning by linking the analysis results of multiple types of data to each other. This is similar to how, when a person is talking, we understand what the person wants to convey by also observing their facial expressions and gestures and processing them in the brain. Further, the processing method and the calculation time differ depending on the information to be extracted. Therefore, in order to realize a data management system that handles various unstructured data, a mechanism is necessary that flexibly manages the requirements of the applications that use the system, the selection and execution of processing methods according to the target data, and the various meanings extracted from the data.

For example, a system has been proposed that automatically recognizes natural input from a user, such as conversational sentences, extracts various information from a database, and presents it to the user (Patent Document 1). This system is equipped with two different types of artificial intelligence: one has a limited ability to recognize user input but responds quickly, and the other recognizes a wide range of user input with high accuracy but requires more processing time. When the user makes an input, the former is executed first, and if it cannot handle the input, the latter is used, which allows flexible control of the recognition rate and the processing time.

A hierarchical database that can add and delete various data analysis functions has also been proposed so that various unstructured data can be handled in an integrated manner in the existing database (Patent Document 2).

Also, if the database becomes complicated, it may be difficult to respond within the time required by the application. To address this, a system has been proposed in which a requested response time is received from an application, the processing time is estimated in the database, and an error is returned when data meaningful to the application cannot be returned within that time (Patent Document 3).

Patent Document 1: JP2005-339237
Patent Document 2: JP2003-99320
Patent Document 3: Japanese Patent Application No. 11-278598

In Patent Document 2, when the data analysis function to be called exists in an external environment such as the Internet, data may leak while the data to be analyzed and the results are transmitted to and received from the data analysis function. In addition, the provider of the data analysis function is not always trustworthy, and data may leak from the provider.

In the above-mentioned Patent Document 3, an error is simply returned when the necessary data cannot be returned within the response time requested by the application. However, when an enormous amount of data is managed and the requested response time is short, it becomes increasingly difficult to gather all the data, errors occur frequently, and the system may fail to function.

In the above Patent Documents 2 and 3, when an access request in an unknown format arrives from an application, it cannot be handled at all.

Therefore, a technique is provided in which the application notifies the data analysis system of the confidentiality level of a request when requesting data, and the system prevents information leakage by not outsourcing the analysis when the required confidentiality is high.

In addition, a technique is provided in which the application specifies a time limit for the data search when requesting data from the data analysis system, so that the best available data is acquired within the time required by the application.

In addition, a technique is provided in which the application specifies the accuracy level of the results when requesting data from the data analysis system, so that highly accurate data held by the data management system is extracted and sent back to the application.

In addition, a technique is provided that, when the application specifies a keyword in a data request to the data analysis system, extracts and returns related data using synonyms of the keyword.

In addition, a technique is provided that, when a parameter of a data request from the application has an unknown value or structure, extracts information from the parameter using the unstructured data analysis function and uses it as a key to search for and return data in the data analysis system.

The solution is as described in the scope of claims, and as an example, the data analysis system has the following configuration.

The data analysis system is connected to a storage device that stores unstructured data having metadata and a data body, and comprises:
a metadata extraction unit that acquires the metadata of the unstructured data from the storage device and creates first meta information from the metadata;
an information extraction unit that acquires the metadata and the data body of the unstructured data from the storage device, extracts semantic information representing the contents of the unstructured data from the metadata and the data body, and creates second meta information having the semantic information;
a meta information repository that stores the first meta information and the second meta information in association with each other;
a meta information management unit that extracts meta information stored in the meta information repository in response to a request from an application; and
a data output unit that creates output data from the meta information that the meta information management unit extracts by a method specified in the request, and outputs the output data to the application,
wherein, when extracting the semantic information of the unstructured data, the information extraction unit calculates a score value indicating the degree to which the semantic information represents the contents of the unstructured data, and includes the score value in the second meta information.

When accessing the data analysis system, the application specifies the accuracy of the information to be acquired; when the meta information repository is searched, information with a low score value is skipped, which reduces the overall search volume and improves the response speed. Also, because the application specifies the confidentiality level of the request, information analysis is not inadvertently handed off to an external network, and information leakage can be prevented. Furthermore, search omissions are reduced by also searching the meta information repository for synonyms of the search keywords received from the application. Even if the format of the request from the application is unknown, the meta information repository can still be searched by extracting the keywords contained in the request.
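The score-based pruning and synonym expansion described above can be illustrated with a minimal sketch. The record layout, the `SYNONYMS` table, and the `search` function are hypothetical names introduced for illustration only; the document does not specify how synonyms are stored or how the accuracy level maps to a score threshold.

```python
from dataclasses import dataclass


@dataclass
class MetaRecord:
    keyword: str   # extracted semantic information
    score: float   # degree to which the keyword represents the data
    source: str    # identifier of the unstructured data (hypothetical)


# Hypothetical synonym table; an assumption, not taken from the document.
SYNONYMS = {"cat": {"cat", "kitten", "feline"}}


def search(repository, keyword, min_score):
    """Return records matching the keyword or a synonym, skipping low scores."""
    terms = SYNONYMS.get(keyword, {keyword})
    hits = [r for r in repository if r.score >= min_score and r.keyword in terms]
    # Highest-scoring (most reliable) information first.
    return sorted(hits, key=lambda r: r.score, reverse=True)


repository = [
    MetaRecord("cat", 8.1, "img-001"),
    MetaRecord("kitten", 4.3, "img-002"),
    MetaRecord("feline", 1.2, "img-003"),
    MetaRecord("dog", 9.0, "img-004"),
]

results = search(repository, "cat", min_score=4.0)
print([r.source for r in results])  # ['img-001', 'img-002']
```

This sketch scans a list linearly; an actual repository would presumably use an index ordered by score so that low-score entries are never visited at all.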

FIG. 1 is an example of a system configuration.
FIG. 2 is an example of the structure of the meta information stored in the meta information repository.
FIG. 3 is an example of the structure of metadata.
FIG. 4 is an example of the attributes of a node constituting the meta information.
FIG. 5 is an example of the attributes of an edge constituting the meta information.
FIG. 6 is an example of the parameters specified in a data output request issued by an application to the data analysis system.
FIG. 7 is an example of an identification rule for an access request from an application.
FIG. 8 is an example of the flow from analyzing raw data to registering it in the meta information repository.
FIG. 9 is an example of the cooperative operation of an application and the data analysis system.
FIG. 10 is an example of the determination by which the access request identification distributes requests from applications.
FIG. 11 is an example of the flow that generates meta information repository access information from an access request of an application.
FIG. 12 is an example of the flow that processes meta information into an output format according to the request of an application.
FIG. 13 is an example of the flow of processing that registers meta information input by an application in the meta information repository.
FIG. 14 is an example of the flow of processing that detects that metadata in the data store has been updated and performs information extraction.
FIG. 15 is an example of the flow of the learning process of the information extraction unit using the sample data for learning.

FIG. 1 is a diagram illustrating a functional configuration example of a system.
In this figure, reference numeral 101 is a data analysis system, 102 is an external network, 103 is an external information extraction unit, 104 is a data holding system, 105 is a raw data store, 106 is an access request identification unit, 107 is a meta information management unit, 108 is an application, 110 is an output suppression pattern determination unit, 111 is a data shaping unit, 112 is a meta information repository, 113 is an output suppression rule, 114 is an information extraction unit, 115 is an outsourcing determination unit, 116 is an extraction information association unit, 117 is a data extraction system, 118 is a learning mechanism, 119 is a data store, 120 is external reference data, 121 is an extraction model, 122 is an access identification rule, 123 is a metadata extraction unit, 124 is meta information, 125 is a schema definition, 128 is an external data capturing unit, 129 is sample data for learning, 130 is a learning restriction rule, 133 is raw data, 134 is a data body, 135 is metadata, 139 is an information extraction unit usage rule, 142 is an external extraction model, 143 is a data update monitoring unit, and 144 is an authentication processing unit.

The data analysis system 101 is a system that acquires and analyzes data on the data store 119, manages the resulting meta information 124, and processes and outputs the meta information in response to a request from the application 108. Physically, the data analysis system 101 is a computer having a memory, a processing device (CPU), a network interface connected to the external network 102, an interface for accessing the storage device on which the data store 119 is implemented, and a storage device that stores the meta information repository 112, the information extraction unit usage rule 139, the extraction model 121, the learning sample data 129, and the learning restriction rule 130. The processing units and functions of the data analysis system 101 described below are realized by the CPU executing programs in the memory. The data analysis system 101 includes an access request identification unit 106, a meta information management unit 107, an output suppression pattern determination unit 110, a data shaping unit 111, a meta information repository 112, an information extraction unit 114, an outsourcing determination unit 115, an extraction information association unit 116, a metadata extraction unit 123, an external data capturing unit 128, learning sample data 129, a learning restriction rule 130, an information extraction unit usage rule 139, a data update monitoring unit 143, and an authentication processing unit 144.

The external network 102 is an external network such as the Internet, on which a server is placed that provides an information extraction function accessed by the data analysis system during information analysis. The external information extraction unit 103, the external reference data 120, and the external extraction model 142 are implemented on this server.

The external information extraction unit 103 exists in the external network 102, analyzes the metadata and the data body stored in the data store 119 in the same manner as the information extraction unit 114 of the data analysis system, and generates meta information 124. It is realized, for example, by a processing device of a server installed in a data center or the like executing an information extraction processing program stored on that server.

The data holding system 104 is a system that holds the raw data (unstructured data) to be extracted by the data extraction system, for example an in-house information sharing system or a file server. It has a raw data store 105 and is connected to the data extraction system 117.

The raw data store 105 is a data storage device that stores the raw data 133 to be extracted by the data extraction system, and is a storage device such as a file storage or a hard disk in the server.

The access request identifying unit 106 is a function that analyzes the content of a request from the application according to the access identification rule to determine the processing method, and requests the authentication processing unit, metadata extraction unit, outsourcing determination unit, and meta information management unit to perform processing. It is connected to the meta information management unit 107 and the application 108.

The meta information management unit 107 is a function that searches the meta information repository using data received from the access request identification unit or the extracted information association unit and requests the data shaping unit to process data for output. It also manages the meta information repository. It is connected to the access request identifying unit 106 and the data shaping unit 111.

The application 108 is a system that uses the data analysis system 101. Conceivable examples include a system that integrates and displays a plurality of items of medical information and a content search system that uses natural language. It is realized by a processing device on a computer executing an application program in memory. The application is connected to the access request identifying unit 106 and the output suppression pattern determining unit 110.

The output suppression pattern determination unit 110 checks the content of the shaped data created by the data shaping unit against the output suppression rule and determines whether output is permitted. It is connected to the data shaping unit 111 and the application 108.

The data shaping unit 111 has a function of processing data (meta information) acquired from the meta information repository according to the schema definition and outputting the processed data to the application, and is connected to the meta information management unit 107 and the output suppression pattern determination unit 110. Depending on the output format, the data body 134 may be acquired from the data store 119 and integrated, or the application 108 may execute this processing.

The meta information repository 112 provides functions for storing meta information 124 and for searching for and outputting necessary meta information in response to requests from various modules. As a storage format for the meta information, flexible structures such as RDF (Resource Description Framework), a graph structure, or a tree structure are conceivable. Here, RDF is a data format that represents and stores all data in the form of subject-predicate-object. Further, since the repository can accommodate various data formats, it may also store the information that the access request identifying unit 106, the output suppression pattern determining unit 110, and the data shaping unit 111 refer to when executing their processing. Alternatively, in consideration of processing performance, a separate data storage unit may be provided for each of these three functions. The meta information repository 112 includes an output suppression rule 113, an access identification rule 122, meta information 124, and a schema definition 125.
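As an illustration of the subject-predicate-object form mentioned above, the following is a minimal in-memory sketch in plain Python. It is not an RDF library and not the repository's actual implementation; the `TripleStore` class and the example identifiers such as `file:AAA` are assumptions made for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Toy store that keeps every fact as a (subject, predicate, object) triple."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = TripleStore()
store.add("file:AAA", "depicts", "sunflower")
store.add("file:AAA", "hasType", "image")
store.add("file:BBB", "depicts", "sunflower")

# Which data items depict a sunflower?
print([t.subject for t in store.match(predicate="depicts", obj="sunflower")])
```

Because every fact has the same three-part shape, rules and schema definitions could in principle be stored alongside meta information in the same structure, which is the flexibility the paragraph above refers to.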

The output suppression rule 113 holds the judgment rules used for output suppression, for example a table that manages, for each access authority, combinations of keywords that are not permitted to be output. The output suppression rule 113 is stored after being converted into a form that matches the management structure of the meta information repository 112.
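A minimal sketch of such a rule table and the corresponding check follows. The authority names, keyword combinations, and the deny-by-default behavior for unknown authorities are all assumptions for illustration; the document only states that forbidden keyword combinations are managed per access authority.

```python
# Hypothetical rule table: for each access authority, the keyword
# combinations that must not appear together in a single output.
SUPPRESSION_RULES = {
    "guest": [{"patient name", "diagnosis"}, {"salary"}],
    "doctor": [],
}


def is_output_allowed(authority, output_keywords):
    """Return False if the output contains every keyword of a forbidden combination."""
    if authority not in SUPPRESSION_RULES:
        return False  # assumption: unknown authorities are denied
    keywords = set(output_keywords)
    return not any(combo <= keywords for combo in SUPPRESSION_RULES[authority])


print(is_output_allowed("guest", ["patient name", "age"]))        # True
print(is_output_allowed("guest", ["patient name", "diagnosis"]))  # False
print(is_output_allowed("doctor", ["patient name", "diagnosis"])) # True
```

Checking combinations rather than single keywords lets an output mention a patient name or a diagnosis on its own while still blocking the pairing of the two.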

The information extraction unit 114 is a function that extracts meaning from the data body acquired via the outsourcing determination unit and creates meta information. For example, when an image of sunflowers is input, it generates information such as that the subject is a “sunflower” and that “three” of them are included in the image, and converts this information into meta information. It is composed of a learning mechanism 118 and an extraction model 121.

The outsourcing determination unit 115 determines whether the extraction of meta information from data that was extracted from the data holding system and stored in the data store, or from data attached to an access request received from the access request identification unit, is to be performed by the information extraction unit on the external network or by the information extraction unit in the data analysis system.
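The determination can be sketched as follows, assuming a two-level confidentiality scale and a flag indicating whether the internal extractor supports the data type. Both are hypothetical simplifications; the actual criteria belong to the information extraction unit usage rule 139 and are not detailed in this section.

```python
from enum import IntEnum


class Confidentiality(IntEnum):
    LOW = 0
    HIGH = 1


def choose_extractor(confidentiality, internal_supports_type):
    """Decide where information extraction runs.

    Highly confidential data never leaves the data analysis system;
    otherwise the external unit is used only when the internal unit
    cannot handle the data type (an assumed policy).
    """
    if confidentiality >= Confidentiality.HIGH:
        return "internal"
    return "internal" if internal_supports_type else "external"


print(choose_extractor(Confidentiality.HIGH, internal_supports_type=False))  # internal
print(choose_extractor(Confidentiality.LOW, internal_supports_type=False))   # external
```

In this sketch a high-confidentiality request for an unsupported data type still resolves to the internal unit; a real system might instead return an error or a partial result in that case.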

The extracted information association unit 116 associates the metadata with the meta information obtained from the outsourcing determination unit and the information extraction unit, and stores them in the meta information repository. Note that the meta information repository may be referred to when associating the metadata and the meta information.

The data extraction system 117 is a system that extracts raw data from one or more data holding systems and stores it in the data store, and is connected to the data holding system 104.

The learning mechanism 118 is a function for learning based on data provided by the external data capturing unit and creating an extraction model.

The data store 119 is a storage device that holds the data extracted by the data extraction system, and includes a file storage or the like. Depending on the specifications of the data analysis system, the metadata and the data body may be stored separately.

External reference data 120 is data existing in a storage device on an external network that is referred to by the information extraction unit as necessary.

The extraction model 121 is a group of data used as criteria when extracting meaning from the information input to the information extraction unit, and is created by the learning mechanism. Using the criteria registered in the extraction model, the information extraction unit creates meta information from the input information (data body). A plurality of extraction models 121 may also be prepared and switched between, for example according to the content of an application's request or as a backup when learning fails.

The access identification rule 122 is a group of rules used for determination by the access request identification unit.

The metadata extraction unit 123 has a function of extracting metadata from data acquired from the data store or the access request identification unit.

Meta information 124 is information (meta information) extracted by the information extraction unit.

Schema definition 125 is a group of rules that the data shaping unit refers to when processing data (meta information) into a format according to the request of the application. For example, when meta information is converted into a table format and output, the schema definition describes which attributes of the meta information are used as table columns.
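The table-format case can be sketched as below. The dictionary-based schema, the attribute keys, and the `to_table` function are hypothetical; the document does not define the concrete format of the schema definition.

```python
# Hypothetical schema definition: which node attributes become table columns.
SCHEMA = {"columns": ["extraction_info", "node_type", "score"]}


def to_table(nodes, schema):
    """Shape meta information nodes into a header row plus one row per node."""
    header = list(schema["columns"])
    # Attributes missing from a node are rendered as empty cells.
    rows = [[node.get(column, "") for column in header] for node in nodes]
    return [header] + rows


nodes = [
    {"extraction_info": "sunflower", "node_type": "keyword", "score": 4.3, "use_count": 15},
    {"extraction_info": "AAA.jpg", "node_type": "file information"},
]

table = to_table(nodes, SCHEMA)
for row in table:
    print(row)
```

Keeping the column choice in data rather than in code means a different output layout only requires a different schema definition, not a change to the data shaping unit.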

The external data capturing unit 128 has a function of capturing extraction models and data on the external network as necessary and comparing them with the learning sample data, and is connected to the learning sample data 129.

The learning sample data 129 is sample data used by the information extraction unit for learning, for example document data annotated with parts of speech, sample images, and sample audio data. It is connected to the external data capturing unit 128.

The learning restriction rule 130 is a rule that controls the learning mechanism of each information extraction unit, in particular the timing of learning from the various learning data. For example, it describes whether to learn from the results of information extraction on data from the data store, and, when data in the meta information repository is used for learning, whether to learn automatically whenever the meta information repository is updated.

Raw data 133 is the data to be extracted by the data extraction system, and consists of a data body 134 and metadata 135.

The data body 134 is an entity such as an image, sound, or document.

Metadata 135 exists for each item of analysis target data and describes what that data is, for example its storage location, last update date and time, last access date and time, access authority, and file owner.

The information extraction unit usage rule 139 is a group of criteria used by the outsourcing determination unit to determine an information extraction outsourcing destination.

The external extraction model 142 is model data used for extracting metadata created by the learning mechanism, and is arranged in a data center or the like existing on an external network.

The data update monitoring unit 143 has a function of monitoring the update state of data on the data store and notifying the metadata extraction unit and the outsourcing determination unit when an update is detected.

The authentication processing unit 144 has a function of receiving authentication information from the application via the access request identifying unit and checking the validity of the application access by collating with the authentication information stored in the meta information repository.

FIG. 2 is a diagram showing an example of the structure of the meta information 124 stored in the meta information repository 112.
The meta information is expressed by, for example, a graph structure or a tree structure composed of nodes and edges connecting the nodes. This makes it possible to express various things and the relationships between them. This figure shows an example of a node and an edge. Examples of the data held by nodes and edges are described later.
In this figure, reference numeral 201 is node 1 and reference numeral 202 is an edge.

Node 1 (201) has a plurality of items of attribute information characterizing the node itself and a list of the edges related to the node. The edge 202 represents a relationship between nodes and has an attribute list that characterizes that relationship.

FIG. 3 is a diagram showing the structure of the metadata.

Metadata 135 is a set of attribute values describing the outline of the data body (a file or the like) of the raw data. The figure lists these attributes. Not all attributes in this example are always necessary; the metadata is configured by combining them according to the situation.

“Attribute name” 301 is the general name of an attribute value included in the metadata. Line 303 describes the attribute whose “attribute name” is “date and time”, line 304 the attribute “size”, line 305 the attribute “location of acquisition source”, line 306 the attribute “type”, and line 307 the attribute “access authority”.

“Example of attribute value” 302 is an example of an attribute value. For the attribute “date and time”, the example value is “2011/7/7”, indicating the creation or update date and time of the data body and metadata. For “size”, the example value is “115 MByte”, indicating the size of the raw data. For “location of acquisition source”, the example value is “NAS1 / doc”, which is information for identifying the acquisition source (storage location) of the raw data. For “type”, the example value is “sentence, video, image, etc.”, indicating the type of the raw data. For “access authority”, the example value is a list of the IDs of users who are permitted access.
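As a rough illustration of how a metadata record with these attributes might be assembled for a file, the sketch below reads file system information with the Python standard library. The key names, the `TYPE_BY_EXTENSION` mapping, and the way access authority is passed in are assumptions for illustration, not the metadata extraction unit's actual behavior.

```python
import os
import tempfile
from datetime import datetime

# Hypothetical mapping from file extension to the "type" attribute.
TYPE_BY_EXTENSION = {".txt": "sentence", ".mp4": "video", ".jpg": "image"}


def extract_metadata(path, accessible_user_ids):
    """Build a metadata record for one raw data file."""
    stat = os.stat(path)
    extension = os.path.splitext(path)[1].lower()
    return {
        "date_and_time": datetime.fromtimestamp(stat.st_mtime).strftime("%Y/%m/%d"),
        "size": stat.st_size,  # in bytes
        "location_of_acquisition_source": os.path.abspath(path),
        "type": TYPE_BY_EXTENSION.get(extension, "unknown"),
        "access_authority": list(accessible_user_ids),
    }


# Demonstration on a temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name
try:
    metadata = extract_metadata(sample_path, ["user1", "user2"])
finally:
    os.remove(sample_path)

print(metadata["size"], metadata["type"])  # 5 sentence
```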

FIG. 4 is a diagram showing a node of the meta information.

This is an example of the attributes of the nodes that make up the meta information. Note that attributes may be added and deleted according to the purpose of use of the node.

“Attribute name” 401 is the general name of an attribute value included in the node. Line 403 describes the attribute whose “attribute name” is “date and time”, line 404 the attribute “node type”, line 405 the attribute “connection edge list”, line 406 the attribute “metadata”, line 407 the attribute “access authority”, line 408 the attribute “extraction information”, line 409 the attribute “score information”, and line 410 the attribute “number of uses”.

“Example of attribute value” 402 is an example of an attribute value. For “date and time”, the example value is “2011/7/7”, indicating the creation or update date of the meta information node. For “node type”, example values are a proper noun, a keyword, file information, and so on, indicating the type of the raw data. For “connection edge list”, the example value is “edge 1, edge 5,...”, which is identification information of the edges connected to the node. For “metadata”, the example value is “AAA metadata”, indicating the identification information of the metadata of the raw data from which the meta information was extracted and the metadata itself. For “access authority”, the example value is “accessible user ID list”, indicating the IDs or user group identification names of users who can access the node. For “extraction information”, the value is information extracted from the raw data, such as extracted keywords and values. For “score information”, the example value is a number such as “4.3”, representing the likelihood, calculated based on a predetermined standard, of the information registered as extraction information 408. As one conceivable way of calculating the score value, when the information extraction unit 114 provides a function that analyzes the structure of an image and generates keywords representing it as attribute values, the degree of similarity between the image pattern data group recorded in the extraction model 121 of the information extraction unit 114 and the analysis target data can be used as the score value. An equivalent method can be considered for the analysis of voice, sensor information, and the like. For “number of uses”, the example value is a number such as “15”, indicating the number of times the information of the node has been output to the application.

FIG. 5 is a diagram showing a list of attributes held by edges constituting the meta information. The attribute may be added or deleted according to the purpose of use of the edge.
“Attribute name” 501 is the general name of an attribute value held by the edge. Row 503 describes the attribute whose “attribute name” is “date and time”, row 504 the “edge type”, row 505 the “connection source node”, row 506 the “connection destination node”, row 507 the “access authority”, row 508 the “extraction information”, row 509 the “score information”, and row 510 the “number of uses”.

“Example of attribute value” 502 gives an example of each attribute value. For “date and time”, an example value is “2011/7/7”, which indicates the date and time when the edge was generated or updated. For “edge type”, example values are a parent-child relationship, a similarity relationship, a synonym relationship, and the like, which indicate the relationship between the connection source node and the connection destination node joined by the edge. For “connection source node”, an example value is “node 1”, which is the identification information of the connection source node of the edge. For “connection destination node”, an example value is “node 5”, which is the identification information of the connection destination node of the edge. For “access authority”, the value is a list of the IDs of users who can access the edge. For “extraction information”, the value is information extracted from the raw data, such as an extracted keyword or value. For “score information”, an example value is a number such as “7.3”, which represents, according to a predetermined standard, the likelihood of the information registered as the extraction information 508. An example of the calculation method is the same as for the “score value” of a node. For “number of uses”, an example value is a number such as “15”, which indicates how many times the information of the edge has been output to the application.
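The node and edge attributes of FIG. 4 and FIG. 5 can be pictured as two record types. The following is only a minimal sketch in Python: the class names, the field names, and the identifier fields `node_id` and `edge_id` are assumptions introduced for illustration and do not appear in the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                       # hypothetical identifier, not a row of FIG. 4
    date_time: str                     # row 403
    node_type: str                     # row 404
    connection_edges: list[str] = field(default_factory=list)   # row 405
    metadata: str = ""                 # row 406
    access_authority: list[str] = field(default_factory=list)   # row 407
    extraction_info: str = ""          # row 408
    score: float = 0.0                 # row 409
    use_count: int = 0                 # row 410


@dataclass
class Edge:
    edge_id: str                       # hypothetical identifier, not a row of FIG. 5
    date_time: str                     # row 503
    edge_type: str                     # row 504
    source_node: str                   # row 505
    destination_node: str              # row 506
    access_authority: list[str] = field(default_factory=list)   # row 507
    extraction_info: str = ""          # row 508
    score: float = 0.0                 # row 509
    use_count: int = 0                 # row 510


# Example values follow the figures where they give one; the rest are made up.
node1 = Node("node 1", "2011/7/7", "keyword", ["edge 1"], "AAA metadata",
             ["user-a"], "aspirin", 4.3, 15)
node5 = Node("node 5", "2011/7/7", "keyword", ["edge 1"])
edge1 = Edge("edge 1", "2011/7/7", "synonym relationship",
             node1.node_id, node5.node_id, ["user-a"], "", 7.3, 15)
```

Because attributes may be added or deleted according to the purpose of use, a real implementation might prefer a free-form attribute map over fixed fields.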

FIG. 6 is a diagram showing an example of the parameters specified in a data output request that the application issues to the data analysis system. Depending on the request contents and the type of application, not all of the parameters shown in FIG. 6 need to be specified.

“Parameter name” 601 is the name of the parameter. Row 603 describes the parameter whose “parameter name” is “request ID”, row 604 the “information registration permission flag”, row 605 the “data search time limit”, row 606 the “request confidentiality level”, row 607 the “accuracy level”, row 608 the “authentication information”, row 609 the “keyword”, row 610 the “synonym search permission flag”, row 611 the “meta information”, row 612 the “request query”, row 613 the “output format”, row 614 the “cache flag”, row 615 the “prefetch flag”, and row 616 the “file”.

“Example of parameter value” 602 gives an example of each parameter value. For “request ID”, an example value is “111”, which is information for uniquely identifying the request. For “information registration permission flag”, the value is “permitted” or “not permitted”. In the case of “permitted”, updating the access count of the meta information used as a result of this request, and registering any newly generated meta information in the meta information repository, are allowed. In the case of “not permitted”, this processing is not performed. For “data search time limit”, an example value is a time such as “100 ms”, which specifies the time the application allows for the data search. For “request confidentiality level”, an example value is a number such as “5”, which indicates the confidentiality of the request. For “accuracy level”, an example value is a number such as “4”, which indicates the accuracy level of the meta information that the application accepts. As the accuracy value, for example, the “score value” of the meta information may be used. For “authentication information”, the value is the authentication information of the application user. For “keyword”, an example value is “Aspirin”, which is a search keyword specified by the application. For “synonym search permission flag”, the value is “permitted” or “not permitted”. In the case of “not permitted”, the request is that synonyms of the keyword not be searched.
For “meta information”, an example value is “meta information structure data representing an aspirin component”, which is information specifying the type of meta information to be acquired. For “request query”, an example value is “SQL, SPARQL, etc. for acquiring aspirin components and prescription list”, which indicates the query processing requested by the application. For “output format”, the value indicates the output format specified by the application, such as a table format or a list format. For “cache flag”, the value is “cache required” or “cache not required”. In the case of “cache required”, the output information is cached, and the cached information is returned when an equivalent request is received. In the case of “cache not required”, every time the same request is received from the application, the meta information is accessed and the output data is generated again. For “prefetch flag”, the value is “valid” or “invalid”. In the case of “valid”, the meta information repository is searched for data whose structure is similar to the output data, list data is created from the results, and this list data is used to respond when an equivalent request arrives from the application. The list data may be stored in the meta information repository, or a separate storage unit may be provided for it.
In the case of “invalid”, this processing is not performed. For “file”, an example value is “aspirin image”, which is information specifying the file to be searched.
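The request parameters of FIG. 6 could be carried in a single record in which every parameter other than the request ID is optional. The sketch below is an illustration only: the class name `AccessRequest`, the field names, and the choice of defaults are assumptions, not part of the described system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessRequest:
    request_id: str                               # row 603
    registration_permitted: bool = False          # row 604
    search_time_limit_ms: Optional[int] = None    # row 605
    confidentiality_level: Optional[int] = None   # row 606
    accuracy_level: Optional[float] = None        # row 607
    authentication_info: Optional[str] = None     # row 608
    keyword: Optional[str] = None                 # row 609
    synonym_search_permitted: bool = False        # row 610
    meta_information: Optional[str] = None        # row 611
    request_query: Optional[str] = None           # row 612
    output_format: Optional[str] = None           # row 613
    cache_required: bool = False                  # row 614
    prefetch_valid: bool = False                  # row 615
    file: Optional[str] = None                    # row 616


# A keyword search using the example values given for FIG. 6.
request = AccessRequest(
    request_id="111",
    search_time_limit_ms=100,
    confidentiality_level=5,
    accuracy_level=4,
    keyword="Aspirin",
    synonym_search_permitted=True,
)
```

Leaving unspecified parameters as `None` mirrors the note that not all parameters need to be given for every request.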

FIG. 7 is a diagram showing an example of an access identification rule.

This is a rule list used as a criterion for the access request identifying unit to identify an access request from an application, and is stored in the meta information repository. However, when the rule has a constant value, it may be embedded in the access request identifying unit as a parameter string, or a dedicated storage unit for storing the rule may be provided.
“Attribute name” 701 is the general name of an attribute value included in the access identification rule. Row 703 describes the attribute whose “attribute name” is “outsourcing processing time threshold”, row 704 the “authentication-required confidentiality level”, and row 705 the “corresponding request query type list”.
“Example of attribute value” 702 gives an example of each attribute value. For “outsourcing processing time threshold”, an example value is a time such as “100 ms”. For “authentication-required confidentiality level”, an example value is a number such as “2”, which indicates a level of confidentiality. For “corresponding request query type list”, an example value is a list such as “CSV, SQL, SPARQL, ...”, which lists the query types that the data analysis system can process.

FIG. 8 is a diagram showing the process of extracting metadata from raw data, converting it into meta information, and registering it in the meta information repository 112. In this process, the raw data that the data extraction system acquires from the data holding system is stored in the data store; the data analysis system then extracts the metadata from it, creates meta information based on this metadata, and registers the meta information in the meta information repository.

Step 801 is a process in which the data extraction system acquires raw data from the data holding system.

Step 802 is a process in which the data holding system separates the metadata and the data body from the acquired raw data and stores them separately in the data store.
In this step, the data holding system may store the acquired raw data in the data store without separating the metadata and the data body.

Step 803 is a process in which the metadata extraction unit of the data analysis system acquires metadata from the data store. In this step, the metadata extraction unit may acquire the metadata in response to a notification from the data extraction system that the data store has been updated. Alternatively, the data update monitoring unit may monitor whether the data store has been updated, and the metadata extraction unit may acquire the metadata when an update of the data store is detected.

Step 804 is a process in which the metadata extraction unit converts the metadata into meta information.
For example, it creates a meta information node whose attribute names and attribute values hold data contained in the metadata, such as the file name, the file size, and the metadata storage destination. Alternatively, a meta information node having the metadata itself as one attribute value may be generated.

Step 805 is a process of storing the meta information generated by the metadata extraction unit in step 804 in the meta information repository.

Step 806 is a process of examining whether the data analysis system is to continue extracting information from the data body. If information extraction from the data body is to be executed, the process proceeds to step 807; otherwise, the process ends. For example, when the data body is moving image data, it may contain both video data and audio data, and information extraction needs to be performed on all of them. The determination made here is therefore whether analysis of all the analysis target data included in the data body has been completed.

Step 807 is a process in which the outsourcing determination unit reads the data body from the data store.

Step 808 is a process for checking whether or not the outsourcing determination unit can cooperate with the external information extraction unit 103 of the external network. If it is possible to cooperate with the information extraction unit of the external network, the process proceeds to step 809. If it is not possible to cooperate with the information extraction unit of the external network, the process ends. For example, the possibility of cooperation is determined based on criteria such as whether the data analysis system is connected to an external network or whether access to the external network is permitted.

Step 809 is a process in which the outsourcing determination unit checks the metadata.

Step 810 is a process in which the outsourcing determination unit examines whether outsourcing is necessary, that is, whether the external information extraction unit 103 needs to create the meta information from the data body. If outsourcing is necessary, the process proceeds to step 811. If outsourcing is not necessary, the process proceeds to step 813. Possible criteria for this determination are, for example, that the data format is not supported by the information extraction unit in the data analysis system, or that processing by the internal information extraction unit would take a long time and be inefficient.

Step 811 is a process in which the outsourcing determination unit sends the metadata and the data body acquired from the data store to the external information extraction unit and requests information extraction (meta information creation). The external information extraction unit that has received the request extracts semantic information representing the content of the raw data from the metadata and the data body, in the same manner as the processing performed by the information extraction unit 114 in step 813 described later. Further, as an index of how accurately this semantic information represents the content of the raw data, it generates a score value (score information) by a predetermined method and creates meta information having the score information and the semantic information. In this step, encrypted communication may be used to prevent leakage of information to the outside.

Step 812 is processing in which the external information extraction unit returns the meta information extracted based on the request of Step 811 to the outsourcing determination unit. In this step, encrypted communication may be used to prevent information leakage to the outside.

Step 813 is a process in which the data analysis system calls the information extraction unit 114 to create meta information from the metadata and the data body. The information extraction unit 114 extracts semantic information representing the content of the raw data from the metadata and the data body. It also calculates a score value indicating the degree to which the semantic information represents the raw data and uses it as the score information. It then creates meta information having the semantic information and the score information. For example, when the information extraction unit 114 provides a function that analyzes the structure of an image and generates a keyword describing it as an attribute value, it determines how similar the analysis target data is to each item of the image pattern data group recorded in the model 121 of the information extraction unit 114. By generating meta information that holds the keyword corresponding to the image pattern data with the highest similarity, together with that similarity as the “score value”, it can produce meta information explaining what the analysis target image means. An equivalent method can be considered for the analysis of voice, sensor information, and the like.
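The similarity-based scoring of step 813 can be illustrated with a small sketch. It assumes that each image pattern in the model 121 and the analysis target are already reduced to numeric feature vectors, and it uses cosine similarity as the similarity measure. Both choices, and all names in the code, are assumptions made for illustration; the description does not specify the measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extract_keyword(model, target):
    """Pick the keyword whose pattern is most similar to the target.

    `model` maps a keyword to its pattern vector (a stand-in for the image
    pattern data group). The similarity of the best match becomes the score.
    """
    keyword, pattern = max(
        model.items(), key=lambda item: cosine_similarity(item[1], target)
    )
    return {"extraction_info": keyword, "score": cosine_similarity(pattern, target)}


# Hypothetical two-pattern model and one analysis target.
model = {"tablet": [1.0, 0.0, 0.0], "capsule": [0.0, 1.0, 0.0]}
meta = extract_keyword(model, [0.9, 0.1, 0.0])
```

Here `meta["extraction_info"]` is `"tablet"`, and `meta["score"]` is the cosine similarity of the target to the tablet pattern. A system that uses scores such as “4.3” would need an additional scaling step, which is not shown.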

Step 814 is a process in which the extraction information association unit associates the meta information converted from the metadata with the meta information extracted by the information extraction unit and stores the result in the meta information repository.

FIG. 9 is a diagram showing the cooperative operation between the application and the data analysis system.

Step 901 is a process in which the application issues a processing request to the data analysis system. An example of the processing request issued by the application is shown in FIG. 6. Various formats can be considered for the processing request, such as a proprietary format, a Web service call, or SQL.

Step 902 is a process in which the access request identification unit determines the request content according to the access identification rule. Details will be described with reference to FIG. 10.

Step 903 is a process for checking whether the processing request cannot be processed (accessed) by the meta information management unit based on the determination in step 902. If the meta information management unit cannot process, the process proceeds to step 904. If it can be processed by the meta information management unit, the process proceeds to step 905.

Step 904 is a process in which the external information extraction unit 103 or the information extraction unit 114 generates access meta information (that is, meta information used as a search condition). Details will be described later with reference to FIG. 11.

Step 905 is a process in which the meta information management unit acquires meta information corresponding to the application request from the meta information repository. Specifically, it searches the meta information repository for meta information that matches the conditions specified in the application request, such as the keyword, the request query type, the meta information, and the file, and acquires it. If meta information was generated in step 904, the meta information management unit searches the meta information repository using the generated meta information as a search condition. When the accuracy level 607 is specified in the application request, the meta information management unit searches only the meta information whose score information 409 or 509 is equal to or higher than the accuracy level 607. In addition, when the synonym search permission flag 610 is set to “permitted” together with the keyword 609 in the application request, the meta information management unit first extracts synonyms of the keyword from a synonym dictionary included in the meta information repository (a dictionary of synonyms for each of a plurality of keywords), and then searches the meta information using each of the extracted synonyms as a search condition to acquire the meta information.

Step 906 is a process in which the data shaping unit and the output suppression pattern determination unit shape the acquired meta information and determine output suppression. Details will be described later with reference to FIG. 12.

Step 907 is a process for outputting the formatted data to the application.

Step 908 is a process for checking whether the prefetch flag 615 of the processing request from the application is valid. If the prefetch flag of the application request is valid, the process proceeds to step 909. If the prefetch flag of the application request is not valid, the process ends.

Step 909 is a process of searching the meta information repository for meta information having a structure similar to the meta information output in step 907, creating a list of the identification information of the meta information found, and storing the list in the meta information repository.
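The search of step 905, with its accuracy-level filter and optional synonym expansion, can be sketched as follows. The repository is modeled as a plain list of records and the synonym dictionary as a plain mapping. These structures, the function name `search`, and the decision to keep the original keyword among the search terms are assumptions for illustration.

```python
def search(repository, synonyms, keyword, accuracy_level=None, synonym_search=False):
    """Return the meta information records matching the keyword.

    repository:     list of dicts with "extraction_info" and "score" keys
    synonyms:       dict mapping a keyword to a list of its synonyms
    accuracy_level: if given, records scoring below it are skipped
    """
    terms = {keyword}
    if synonym_search:
        # Expand the search condition with the synonyms of the keyword.
        terms.update(synonyms.get(keyword, []))
    hits = [m for m in repository if m["extraction_info"] in terms]
    if accuracy_level is not None:
        # Keep only records whose score is at or above the accuracy level.
        hits = [m for m in hits if m["score"] >= accuracy_level]
    return hits


repository = [
    {"extraction_info": "aspirin", "score": 4.3},
    {"extraction_info": "acetylsalicylic acid", "score": 7.3},
    {"extraction_info": "aspirin", "score": 2.0},
]
synonyms = {"aspirin": ["acetylsalicylic acid"]}
results = search(repository, synonyms, "aspirin",
                 accuracy_level=4, synonym_search=True)
```

With an accuracy level of 4, the record scoring 2.0 is excluded, and the synonym record is included because synonym search is permitted.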

FIG. 10 is a diagram showing an access request identification determination.

This processing flow corresponds to the details of step 902 in FIG. 9 and is a process in which the access request identifying unit distributes requests from applications.
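The branching that steps 1001 to 1020 walk through can be summarized in a minimal sketch. The request is modeled as a dictionary in which an absent key means the parameter was not specified. The key names, the returned labels, the `query_type` key, and the default confidentiality level of 0 are all hypothetical and introduced only to make the flow concrete.

```python
def identify_access_request(request, rule, authenticate):
    """Return a label for the branch of FIG. 10 in which the request ends."""
    # Steps 1002 to 1007: authentication is needed at or above the threshold.
    if request.get("confidentiality_level", 0) >= rule["auth_required_level"]:
        if "authentication_info" not in request:
            return "error"                                     # step 1007
        if not authenticate(request["authentication_info"]):
            return "error"                                     # step 1006
    if "meta_information" in request:                          # step 1008
        return "search_by_meta_information"                    # step 1009
    if "keyword" in request:                                   # step 1010
        if request.get("synonym_search_permitted", False):     # step 1011
            return "synonym_extraction_needed"                 # step 1013
        return "search_by_keyword"                             # step 1012
    if "request_query" in request:                             # step 1014
        if request.get("query_type") in rule["query_types"]:   # step 1015
            return "query_by_meta_information_management"      # step 1017
        return "query_analysis_needed"                         # step 1016
    if "file" in request:                                      # step 1018
        return "extract_from_file_then_search"                 # steps 1019, 1020
    return "search"                                            # step 1020


# Rule values follow the examples given for FIG. 7.
rule = {"auth_required_level": 2, "query_types": ["CSV", "SQL", "SPARQL"]}
accept_all = lambda info: True   # stand-in for the authentication processing unit

no_auth = identify_access_request(
    {"confidentiality_level": 5, "keyword": "Aspirin"}, rule, accept_all)
synonym = identify_access_request(
    {"confidentiality_level": 1, "keyword": "Aspirin",
     "synonym_search_permitted": True}, rule, accept_all)
```

In this sketch `no_auth` is `"error"` because the request is above the threshold but carries no authentication information, while `synonym` is `"synonym_extraction_needed"`.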

Step 1001 is a process for acquiring an access request (processing request) from the application.

Step 1002 is a process of checking whether the request confidentiality level 606 set in the processing request is equal to or higher than the authentication-required confidentiality level 704 set in the access identification rule. If it is, the process proceeds to step 1003. If the request confidentiality level 606 is lower than the authentication-required confidentiality level 704, the process proceeds to step 1008.

Step 1003 is a process for checking whether there is authentication information 608 in the access request. If there is authentication information, the process proceeds to step 1004. If there is no authentication information, the process proceeds to step 1007.

Step 1004 is a process of requesting execution of authentication processing by passing authentication information 608 to the authentication processing unit.

Step 1005 is a process for checking whether an authentication error has occurred based on the authentication result received from the authentication processing unit. If an authentication error has occurred, go to Step 1006. If no authentication error has occurred, the process proceeds to step 1008.

Step 1006 is a process for returning an error to the application.

Step 1007 is a process for returning an error to the application.

Step 1008 is a process for checking whether or not the search meta information 611 is specified in the access request. If the search meta information is designated, the process proceeds to step 1009. If no search meta information is specified, the process proceeds to step 1010.

Step 1009 is a process of determining that the meta information management unit can search the meta information repository using the search meta information 611 as a search condition. In this case, it is determined in step 903 of FIG. 9 that the meta information management unit can process the request.

Step 1010 is a process for checking whether or not the keyword 609 is specified in the access request. If a keyword is specified, the process proceeds to step 1011. If no keyword is specified, the process proceeds to step 1014.

Step 1011 is a process for checking whether the synonym search is permitted in the access request, that is, whether the synonym search permission flag 610 is permitted. If the synonym search is permitted, the process proceeds to step 1013. If the synonym search is not permitted, the process proceeds to step 1012.

Step 1012 is a process of determining that the meta information management unit can acquire (access) the meta information from the meta information repository using the keyword. In this case, it is determined in step 903 of FIG. 9 that the meta information management unit can process the request.

Step 1013 is a process of determining that synonym extraction processing must be executed in order to access the meta information repository. In this case, it is determined in step 903 of FIG. 9 that the meta information management unit needs the synonym extraction process.

Step 1014 is a process for checking whether the request query 612 is specified in the access request. If a request query is specified, the process proceeds to step 1015. If no request query is specified, the process proceeds to step 1018.

Step 1015 is a process for checking whether or not the request query confirmed in step 1014 is registered in the corresponding request query type list 705 of the access identification rule shown in FIG. 7. If the request query is registered in the corresponding request query type list of the access identification rule, the process proceeds to step 1017. If the request query is not registered in the corresponding request query type list of the access identification rule, the process proceeds to step 1016.

Step 1016 is a process of determining that query analysis by the information extraction unit is necessary. In this case, it is determined in step 903 of FIG. 9 that the meta information management unit cannot process the request, so the process proceeds to step 904, where the specified query is executed by the external information extraction unit, the information extraction unit, or the like, and meta information is extracted. The meta information obtained by this query processing is later used as a search condition by the meta information management unit in step 905 of FIG. 9.

Step 1017 is a process of determining that the meta information management unit can acquire (access) the search meta information from the meta information repository by executing the query processing. In this case, it is determined in step 903 of FIG. 9 that the meta information management unit first needs to acquire the search meta information by the query processing and then search the meta information repository using it.

Step 1018 is a process for checking whether the file 616 is specified in the access request. If a file is specified, the process proceeds to step 1019. If no file is specified, the process proceeds to step 1020.

Step 1019 is a process of extracting metadata from the file, converting it into meta information, and storing it in the meta information repository. The calculation of the meta information and the score value is as described above. In this step, if metadata extraction fails, an error is returned.

Step 1020 is a process of determining that the meta information management unit can acquire (access) the meta information from the meta information repository using the meta information created in step 1019 as the search meta information. In this case, it is determined in step 903 of FIG. 9 that the meta information management unit can process the request.

FIG. 11 is a diagram showing access meta information generation. This process is the detailed flow of step 904 in FIG. 9 and shows how meta information for accessing the meta information repository is extracted based on the analysis target data included in the request from the application.

Step 1101 is a process in which the metadata extraction unit transmits the analysis target data to the outsourcing determination unit. The analysis target data includes, for example, the request query 612 and attached data, such as voice data and image data, attached to the request from the application.

Step 1102 is a process in which the metadata extraction unit extracts the metadata of the analysis target data.

Step 1103 is a process in which the outsourcing determination unit checks whether the analysis target data is an outsourcing target. In this step, the content of the metadata of the analysis target data is compared with the information extraction unit usage rule to determine whether the data is an outsourcing target. For example, the types of query that the external information extraction unit 103 can process are registered in the information extraction unit usage rule. By referring to this rule, it is determined whether the query processing specified in the processing request from the application can be executed by the external information extraction unit 103, and if so, the data may be set as an outsourcing target. In addition, a threshold for the request confidentiality level may be registered in advance in the information extraction unit usage rule, and if the request confidentiality level 606 included in the request from the application is lower than this threshold, the data may be set as an outsourcing target. If the data is an outsourcing target, the process proceeds to step 1104. If it is not, the process proceeds to step 1106.

Step 1104 is a process in which the outsourcing determination unit sends the analysis target data to the external information extraction unit.

Step 1105 is a process in which the outsourcing determination unit receives the analysis result of the analysis target data (that is, the extracted meta information) from the external information extraction unit.

Step 1106 is a process in which the outsourcing determination unit inputs the analysis target data to the information extraction unit, and the information extraction unit that receives the data executes information extraction. The extraction method is as exemplified above.

Step 1107 is a process of examining whether unanalyzed data remains in the analysis target data. In this step, when the analysis target data is composed of a plurality of data items such as moving images and voices, it is confirmed whether the processing has been executed for all of them. If unanalyzed data remains in the analysis target data, the process returns to step 1103. Otherwise, the process proceeds to step 1108.

Step 1108 is a process of integrating all extracted meta information in the extracted information association unit. The integrated information is stored in the meta information repository.
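The loop of steps 1103 to 1108 can be sketched as follows. The sketch treats the two criteria mentioned for step 1103 as a combined condition: an item is outsourced only when the request confidentiality level is below the registered threshold and the external unit supports the item's type. That combination, the dictionary layout of the usage rule, and all function and key names are assumptions for illustration.

```python
def is_outsourcing_target(item, confidentiality_level, usage_rule):
    """Step 1103: decide whether one analysis target may go to the external unit."""
    if confidentiality_level >= usage_rule["confidentiality_threshold"]:
        return False   # too confidential to leave the data analysis system
    return item["type"] in usage_rule["external_types"]


def generate_access_meta_information(items, confidentiality_level, usage_rule,
                                     extract_external, extract_internal):
    """Steps 1103 to 1108: analyze every item, then return the combined result."""
    extracted = []
    for item in items:                               # loop closed by step 1107
        if is_outsourcing_target(item, confidentiality_level, usage_rule):
            extracted.append(extract_external(item))     # steps 1104, 1105
        else:
            extracted.append(extract_internal(item))     # step 1106
    return extracted                                 # step 1108 integrates these


usage_rule = {"confidentiality_threshold": 3, "external_types": {"image", "audio"}}
items = [{"type": "image", "name": "aspirin image"},
         {"type": "query", "name": "SPARQL query"}]
external = lambda item: ("external", item["name"])   # stand-in for unit 103
internal = lambda item: ("internal", item["name"])   # stand-in for unit 114

low = generate_access_meta_information(items, 1, usage_rule, external, internal)
high = generate_access_meta_information(items, 5, usage_rule, external, internal)
```

With a confidentiality level of 1, only the image goes to the external stand-in. With a level of 5, every item stays internal.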

FIG. 12 is a diagram showing an overview of the data shaping and output suppression processing. This process corresponds to the details of step 906 in FIG. 9. In it, the data shaping unit 111 and the output suppression pattern determination unit 110 process the meta information into the output format requested by the application, determine whether the processing result is acceptable from the viewpoint of security, and control the output accordingly.

Step 1201 is processing in which the data shaping unit acquires the output target meta information acquired from the meta information repository by the meta information management unit in step 905.

Step 1202 is a process in which the data shaping unit acquires, from the meta information repository, the schema definition corresponding to the access request from the application. The schema definition describes the rules for converting meta information into a table format, list format, XML format, or the like, and the one acquired is the definition that matches the output format specified as the output format 613 in the access request.

Step 1203 is a process in which the data shaping unit checks whether the data body reference is specified in the schema definition or the output target meta information. If the data body reference is specified in the schema definition or the output target meta information, the process proceeds to step 1204. If the data body reference is not specified in the schema definition or the output target meta information, the process proceeds to step 1205.

Step 1204 is a process in which the data shaping unit acquires the data body whose reference is specified from the data store.

Step 1205 is a process in which the data shaping unit generates output data based on the schema definition.

Step 1206 is a process in which the output suppression pattern determination unit checks the output data format.
In this step, the output suppression pattern determination unit checks the format of the output data based on the rules described in the output suppression pattern determination rules in the meta information repository.

Step 1207 is a process in which the output suppression pattern determination unit checks whether the output data includes an output prohibition structure based on the output suppression pattern determination rule. If the output prohibition structure is included in the output data, the process proceeds to step 1208. If the output prohibition structure is not included in the output data, the process proceeds to step 1210.

Step 1208 is a process in which the output suppression pattern determination unit converts the portion of the output data corresponding to the output prohibition structure into dummy data. One conceivable way to assign the dummy data is to use a predetermined value for each type, such as a character string or a numeric value.

Step 1209 is a process in which the output suppression pattern determination unit outputs output data with a warning to the application.

Step 1210 is a process in which the output suppression pattern determination unit outputs output data to the application.
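
The flow of FIG. 12 can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment. It assumes that meta information is a flat dictionary, that a schema definition is a dictionary with a list of output field names, and that an output prohibition structure is a set of field names that must not be output together. The names `shape_output`, `suppress`, `DUMMY_BY_TYPE`, and `body_ref` are hypothetical and do not appear in the specification.

```python
from typing import Any, Callable

# Hypothetical dummy values, one per type (step 1208: a predetermined value by type).
DUMMY_BY_TYPE = {str: "***", int: 0, float: 0.0}


def shape_output(meta: dict[str, Any], schema: dict[str, Any],
                 fetch_body: Callable[[str], Any]) -> dict[str, Any]:
    """Steps 1203 to 1205: resolve a data body reference, then build output data."""
    record = dict(meta)
    # Step 1203: is a data body reference given in the schema or the meta information?
    body_ref = schema.get("body_ref") or meta.get("body_ref")
    if body_ref is not None:
        # Step 1204: acquire the referenced data body from the data store.
        record["body"] = fetch_body(body_ref)
    # Step 1205: generate output data holding only the fields named by the schema.
    return {name: record.get(name) for name in schema["fields"]}


def suppress(output: dict[str, Any],
             prohibited: list[set[str]]) -> tuple[dict[str, Any], bool]:
    """Steps 1206 to 1210: replace prohibited field combinations with dummy data."""
    result = dict(output)
    warned = False
    for fields in prohibited:
        # Step 1207: the structure is present when every one of its fields has a value.
        if all(output.get(name) is not None for name in fields):
            # Step 1208: convert the matching portion into dummy data.
            for name in fields:
                result[name] = DUMMY_BY_TYPE.get(type(output[name]))
            warned = True
    # The caller outputs with a warning (step 1209) when warned is True,
    # and without one (step 1210) otherwise.
    return result, warned
```

In this sketch the caller decides how the warning is delivered; the specification only states that the output data is accompanied by a warning in step 1209.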

FIG. 13 is a diagram showing a flow of processing for registering meta information input by the application in the meta information repository.

Step 1301 is a process in which the application issues a meta information registration request to the data analysis system. This meta information registration request includes meta information to be registered.

Step 1302 is a process in which the access request identifying unit searches the meta information repository using the meta information received from the application as a search key.

Step 1303 is a process for checking whether a structure partially matching the received meta information is found in the meta information repository. If a partially matching structure is found, go to step 1304. If a partially matching structure is not found, go to step 1305.

Step 1304 is processing for adding meta information from the application to the top node of the matching structure.

Step 1305 is a process of registering the received meta information as a new data structure.
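
The registration flow of FIG. 13 can be sketched as follows. The sketch assumes that each structure in the meta information repository is a tree whose top node holds attribute name and value pairs, and that a "partial match" means that at least one attribute pair of the received meta information equals a pair on a top node. Both points are assumptions made for illustration, as are the names `Node`, `partially_matches`, and `register_meta`. Attaching the received meta information as a child of the top node is one possible reading of step 1304.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    attrs: dict
    children: list = field(default_factory=list)


def partially_matches(top: Node, meta: dict) -> bool:
    """Assumed criterion: at least one attribute pair is shared with the top node."""
    return any(top.attrs.get(key) == value for key, value in meta.items())


def register_meta(repository: list, meta: dict) -> Node:
    """Return the top node of the structure that now holds the meta information."""
    # Step 1302: search the repository using the received meta information as the key.
    for top in repository:
        # Step 1303: is there a partially matching structure?
        if partially_matches(top, meta):
            # Step 1304: add the meta information to the top node of that structure.
            top.children.append(Node(dict(meta)))
            return top
    # Step 1305: nothing matched, so register a new data structure.
    new_top = Node(dict(meta))
    repository.append(new_top)
    return new_top
```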

FIG. 14 is a diagram showing a flow of processing for detecting that data in the data store has been updated and performing information extraction.

Step 1401 is a process in which the data update monitoring unit detects a data store update.

Step 1402 is a process of executing metadata extraction and information extraction (creation or update of meta information) for the updated data.

Step 1403 is a process of reflecting the updated meta information for that data in the meta information repository.
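
The update handling of FIG. 14 can be sketched as follows. The sketch assumes that the data store is a dictionary whose entries carry a version number, and that an update is detected by comparing that number with the last one processed. The specification does not say how the data update monitoring unit detects an update, so this mechanism and the names `detect_updates` and `refresh` are hypothetical. The two extractor functions are passed in by the caller.

```python
from typing import Any, Callable


def detect_updates(store: dict, seen_versions: dict) -> list:
    """Step 1401: return the keys whose version differs from the last one processed."""
    return [key for key, item in store.items()
            if seen_versions.get(key) != item["version"]]


def refresh(store: dict, seen_versions: dict, repository: dict,
            extract_metadata: Callable[[dict], Any],
            extract_information: Callable[[dict, Any], Any]) -> list:
    """Re-extract meta information for updated data and store it in the repository."""
    updated = detect_updates(store, seen_versions)
    for key in updated:
        item = store[key]
        # Step 1402: metadata extraction and information extraction for the updated data.
        first = extract_metadata(item["metadata"])
        second = extract_information(item["metadata"], item["body"])
        # Step 1403: reflect the new meta information in the meta information repository.
        repository[key] = {"first": first, "second": second}
        seen_versions[key] = item["version"]
    return updated
```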

FIG. 15 is a diagram showing the flow of the learning process of the information extraction unit using the learning sample data.

Step 1501 is a process in which the information extraction unit starts learning.

Step 1502 is a process for checking whether external data can be imported. If external data can be imported, the process proceeds to step 1503. If external data cannot be imported, the process proceeds to step 1509. One conceivable criterion for importing external data is that learning data on the external network that the data analysis system can access is registered in the data analysis system, and that a connection to the external network is available.

Step 1503 is a process in which the external data capturing unit captures external reference data.

Step 1504 is a process for collating the learning sample data with the captured data captured by the external data capturing unit.

Step 1505 is a process for checking whether the captured data obtained by the external data capturing unit contradicts the learning sample data. If there is a contradiction, the process proceeds to step 1506. If there is no contradiction, the process proceeds to step 1509. A contradiction arises, for example, when the internal learning data defines “dogs are animals” and “animals and plants are different” while the external data defines “dogs are plants”. A simpler case is a conflict between plain values, such as “Mt. Fuji is 3776 m high” and “Mt. Fuji is 3022 m high”.

Step 1506 is a process for replacing the inconsistent portion of the external data with the learning sample data.

Step 1507 is a process for checking whether or not the replacement is possible. If the portion cannot be replaced, the process proceeds to step 1508. If it can be replaced, the process proceeds to step 1509.

Step 1508 is a process for canceling the use of the external reference data.

Step 1509 is a process in which the learning mechanism executes a learning process using the available learning sample data in the information extraction unit to correct the extraction model.
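
The preparation of learning data in FIG. 15 can be sketched as follows. The sketch assumes that both the learning sample data and the external reference data are dictionaries mapping a subject to a value, that a contradiction is the same subject with a different value, and that a successful replacement leads to step 1509. The function name `prepare_training_data` and the `can_replace` predicate are hypothetical, and the check of step 1507 is modelled as a test made just before the replacement of step 1506 is applied.

```python
from typing import Callable, Optional


def prepare_training_data(sample: dict, external: Optional[dict],
                          can_replace: Callable[[str], bool] = lambda key: True) -> dict:
    """Return the data that the learning mechanism would use in step 1509."""
    # Step 1502: if external data cannot be imported, learn from the sample data only.
    if external is None:
        return dict(sample)
    # Step 1503: capture the external reference data.
    merged = dict(external)
    # Steps 1504 and 1505: collate with the sample data and find contradictions.
    conflicts = [key for key in merged
                 if key in sample and merged[key] != sample[key]]
    for key in conflicts:
        # Step 1507: check whether this portion can be replaced.
        if not can_replace(key):
            # Step 1508: cancel the use of the external reference data.
            return dict(sample)
        # Step 1506: the learning sample data overrides the external data.
        merged[key] = sample[key]
    # Add the sample entries that the external data did not contain.
    merged.update(sample)
    return merged
```

With the Mt. Fuji example from step 1505, an external value of 3022 would be overwritten by the sample value of 3776, and any non-conflicting external entries would be kept.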

101 ... Data analysis system
102 ... External network
103 ... External information extraction unit
104 ... Data retention system
105 ... Raw data store
106 ... Access request identification unit
107 ... Meta information management unit
108 ... Application
110 ... Output suppression pattern determination unit
111 ... Data shaping unit
112 ... Meta information repository
113 ... Output suppression rules
114 ... Information extraction unit
115 ... Outsourcing decision section
116 ... Extraction information association unit
117 ... Data extraction system
118 ... Learning mechanism
119 ... Data store
120 ... External reference data
121 ... Extraction model
122 ... Access identification rule
123 ... Metadata extraction unit
124 ... Meta information
125 ... Schema definition
128 ... External data capture unit
129 ... Sample data for learning
130 ... Learning restriction rules
133 ... Raw data
134 ... Data body
135 ... Metadata
139 ... Information extraction unit usage rules
142 ... External extraction model
143 ... Data update monitoring unit
144 ... Authentication processing unit
201 ... Node 1
202 ... Edge
301 ... Attribute name
302 ... Example of attribute value
401 ... Attribute name
402 ... Example of attribute value
501 ... Attribute name
502 ... Attribute value example
601 ... Parameter name
602 ... Parameter value example
701 ... Attribute name
702 ... Example of attribute value

Claims

A data analysis system connected to a storage device that stores unstructured data having metadata and a data body, the data analysis system comprising:
A metadata extraction unit that acquires the metadata of the unstructured data from the storage device and creates first meta information from the metadata;
An information extraction unit that acquires the metadata and the data body of the unstructured data from the storage device, extracts semantic information representing the content of the unstructured data from the metadata and the data body, and creates second meta information having the semantic information;
A meta information repository that stores the first meta information and the second meta information in association with each other;
A meta information management unit that extracts meta information stored in the meta information repository in response to a request from an application; and
A data output unit that processes the meta information extracted by the meta information management unit by a method specified by the request to create output data, and outputs the output data to the application,
Wherein, when extracting the semantic information of the unstructured data, the information extraction unit calculates a score value indicating the degree to which the semantic information represents the content of the unstructured data, and includes the score value in the second meta information.
The data analysis system according to claim 1, wherein
The request from the application specifies an accuracy level representing the accuracy of the data, and
The meta information management unit extracts only meta information that is stored in the meta information repository with a score value greater than the accuracy level.
The data analysis system according to claim 2, wherein
The request from the application specifies a time limit value indicating a time limit for the data search, and
The meta information management unit determines, based on the time limit value, a lower limit of the score value of the meta information to be searched.
The data analysis system according to claim 3, further comprising
An outsourcing determination unit that determines whether or not to allow an external information extraction unit connected to the data analysis system to process the request, wherein
The request from the application includes a confidentiality level of the request, and
The outsourcing determination unit performs control so that the external information extraction unit does not process the request when the confidentiality level of the request is higher than a predetermined confidentiality level.
The data analysis system according to claim 4, wherein
The meta information repository has a synonym dictionary in which synonyms are registered for each keyword, and
When the request received from the application includes a keyword specification and a specification to search for synonyms of the keyword, the meta information management unit obtains a synonym of the keyword using the synonym dictionary and acquires meta information by searching the meta information repository using the keyword and the synonym.
The data analysis system according to claim 5, further comprising
An access request identification unit that identifies the content of the request based on the request from the application, wherein
When the access request identification unit determines that the request from the application cannot be processed by the meta information management unit, the access request identification unit delivers the content of the request to the information extraction unit,
The information extraction unit extracts semantic information from the received request content to create meta information and stores the meta information in the meta information repository, and
The meta information management unit searches the meta information repository using the meta information extracted by the information extraction unit.
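
For illustration only, the score filtering and the outsourcing control recited in the claims above can be sketched as follows. The claims do not state how the lower limit of the score value is derived from the time limit value, so the rule used here (raise the lower limit to a fixed floor when the time limit is below a threshold) is purely an assumption, as are the names `search`, `may_outsource`, `short_limit_s`, and `fast_floor` and all numeric values.

```python
def search(repository: list, accuracy_level: float, time_limit_s: float,
           short_limit_s: float = 1.0, fast_floor: float = 0.8) -> list:
    """Return meta information whose score value exceeds the effective lower limit."""
    lower_limit = accuracy_level
    # Assumed rule: a short time limit raises the lower limit, so that only
    # meta information with large score values is searched.
    if time_limit_s < short_limit_s:
        lower_limit = max(lower_limit, fast_floor)
    return [meta for meta in repository if meta["score"] > lower_limit]


def may_outsource(confidentiality_level: int, predetermined_level: int) -> bool:
    """External processing is refused when the request is more confidential than the threshold."""
    return confidentiality_level <= predetermined_level
```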