US20150142841A1 - Data model optimization - Google Patents
- Publication number
- US20150142841A1
- Authority
- US
- United States
- Prior art keywords
- entity class
- entity
- semantic
- name
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G06F17/30386—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/289—Object oriented databases
-
- G06F17/2785—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- This description relates to the optimization of data models.
- Data models may be used to organize or structure data about one or more business systems. Over time, as businesses grow and become more sophisticated, their data models may grow as well. This growth may cause the data models to develop redundancies, deep levels of recursive hierarchies and other inefficiencies. Such inefficiencies also may be introduced when two or more data models are merged into one, for example, when one business buys or merges with another.
- One of the challenges in removing these inefficiencies from the data models is that the names and structure of a data model often are closely integrated with the semantics and linguistics of the particular business system. It would be beneficial to have a system capable of identifying semantic similarities and/or differences across one or more data models, based on the semantic meanings of the objects, in order to optimize the data models (e.g., to identify and remove these inefficiencies).
- a computer-implemented method for normalizing a data model of one or more entity classes where each entity class may include a name and one or more attributes.
- the name of each of the entity classes may be refined to conform to a naming convention.
- a semantic meaning of each of the refined names and the one or more attributes of each entity class may be determined. It may be determined that the name of a first entity class is semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class.
- a substantial similarity may be determined between the first entity class and the second entity class based on a comparison of the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class.
- the data model may be normalized based on the substantial similarity.
- a computer-implemented method for recursively refining names of one or more entity classes of a data model based on a naming convention where each of the names may include one or more terms. Invalid terms may be removed from the one or more terms of the names.
- a complex term in one or more of the names may be identified, where the complex term may include a combination of two or more simpler terms. The identified complex term in the names may be separated to reflect the two or more simpler terms.
- a semantic distance between the two or more terms of the name may be determined. From each of the names including two or more terms, one or more terms with the semantic distance being greater than a threshold semantic distance may be removed.
- a computer-implemented method for normalizing a first entity class of a data model having a plurality of entity classes where each of the entity classes may include a name and one or more attributes. It may be determined that the names of the plurality of entity classes and the one or more attributes conform to a core component technical specification (CCTS) naming convention, where each name and attribute may include an object term, a property qualifier and a property term. It may be determined that each of the one or more attributes of the first entity class includes a unique property qualifier. It may be determined that none of the plurality of entity classes includes multiple attributes that are substantially similar to the one or more attributes of the first entity class.
- a first medial distance between the attribute of the first entity class and the name of the first entity class may be determined, and a second medial distance between the attribute of the first entity class and the names of one or more entity classes, from the plurality of entity classes, associated with the first entity class, may be determined based on a semantic distance between the object terms, property qualifiers and property terms.
- the first entity class may be normalized such that the first medial distance between the attribute and the first entity class is less than the second medial distance between the attribute and any of the one or more associated entity classes.
- FIG. 1 is a block diagram of an example system for optimizing a data model.
- FIG. 2 is an example implementation of a naming convention, according to an example embodiment.
- FIG. 3 is an example implementation of a dictionary, according to an example embodiment.
- FIG. 4 is an example flowchart illustrating example operations associated with optimizing a data model, according to an example embodiment.
- FIG. 5 illustrates an example normalization process of the data model optimizer (DMO), according to an exemplary implementation.
- FIG. 6 is an example flowchart of a process illustrating example operations associated with optimizing a data model, according to an exemplary implementation.
- FIG. 7 is an example flowchart of a process illustrating example operations associated with optimizing a data model, according to an exemplary implementation.
- FIG. 8 is an example flowchart of a process illustrating example operations associated with optimizing a data model, according to an exemplary implementation.
- FIG. 9 is an example flowchart of a process illustrating example operations associated with determining a semantic meaning, according to an exemplary implementation.
- FIG. 1 is a block diagram of an example system 100 for optimizing a data model 102 .
- the system 100 may include a data model optimizer 104 (hereinafter, “DMO 104 ”) that may normalize, consolidate, harmonize and/or otherwise optimize the data model 102 .
- the DMO 104 may compare entity classes 106 A-E of the data model 102 to each other, and based on semantics used to describe the entity classes 106 A-E, attributes 108 of the entity classes 106 A-E and relationships between the entity classes 106 A-E, may optimize the data model 102 .
- Such optimization may include, for example, reducing redundancy between the entity classes 106 A-E, flattening a hierarchy of the data model 102 , normalizing the data model 102 and increasing reuse among the objects of the data model 102 .
- the data model 102 may include an organization and structure of data.
- the data model 102 may include a description or representation of entities or objects of a data structure and the relationships that exist among them.
- the data model 102 may map concepts and relationships and describe the semantics of a business system, or otherwise provide a conceptual schema of the system.
- the data model 102 may include, for example, descriptions of objects of significance to an organization about which it may collect information (e.g., entity classes 106 A-E), characteristics of the objects of significance (e.g., attributes 108 ) and associations between the objects (e.g., relationships between the entity classes 106 A-E).
- the entity classes 106 A-E may include main objects, classes and/or concepts of the data model 102 and may represent generic real-world objects.
- the entity classes 106 A-E may identify which concepts represented by the data model 102 are of the greatest importance and may provide a structure by which data may be associated with those concepts and may be organized within the data model 102 .
- a data model 102 of a business organization may include entity classes 106 A-E pertaining to customer information, sales associates, sales territories, sales leads and invoices.
- the data model 102 may represent a business system in which sales leads are generated, estimates provided, services or products are rendered and invoices are generated, among other activities.
- the entity classes 106 A-E of the data model 102 may include customer information 106 A, price estimates 106 B, sales orders 106 C, invoices 106 D and payment receipts 106 E.
- the entity classes 106 A-E may include additional and/or different entity classes as may be deemed appropriate per the represented system or data.
- the attributes 108 may include defined characteristics or properties of the entity classes 106 A-E.
- the attributes 108 may include members of or informational subsets of the broader entity class concepts.
- an entity class 106 A for storing customer information may include the attributes 108 first name, last name, prefix, address and credit score.
- Relationships between the entity classes 106 A-E may indicate explicit and/or implicit constraints and/or limitations on the data of the data model 102 .
- the relationships may be indicated, in the example of FIG. 1 , by the lines shown connecting the entity classes 106 A-E within the data model 102 .
- a relationship may indicate a dependency or inheritance between two or more of the entity classes 106 A-E, such as a parent-child relationship.
- the entity class 106 A may be associated with general vehicle identification information and the entity class 106 B may be more particularly associated with truck information. Then for example, the entity class 106 B may inherit attributes 108 of the entity class 106 A, whereby a truck is a type of vehicle and a relationship line may indicate such a dependency within the data model 102 .
- the relationships may indicate associations other than the parent-child, or other dependency or inheritance relationship, between two or more of the entity classes 106 A-E.
- the entity classes 106 A-E may each have a name 110 .
- the name 110 may indicate an identification of the entity classes 106 A-E.
- the entity class 106 D may be associated with payment(s) received from customers and thus may include the name 110 Payment_History.
- the names 110 may be associated with or otherwise indicate the concept related to the entity classes 106 A-E; however, entity classes 106 A-E with seemingly irrelevant and/or redundant names 110 also may exist in the data model 102 .
- the data model optimizer 104 may optimize the data model 102 . Such optimization may include adding, removing or otherwise modifying the entity classes 106 A-E, attributes 108 , names 110 and/or relationships of the data model 102 .
- the DMO 104 may, through the use of one or more components, refine the names 110 and/or attributes 108 to conform to a naming convention 112 , determine the semantics used in naming the entity classes 106 A-E and attributes 108 of the data model 102 , compare the semantics to the relationships between the objects of the data model 102 and determine and perform one or more optimization and/or normalization procedures on the data model.
- the DMO 104 may include a naming engine 114 , a semantics analyzer 120 , a comparator 126 , a modeler 132 , a statistics system 138 and a modification handler 140 .
- the DMO 104 may interface with a repository 122 , where the repository 122 may include the naming convention 112 .
- the repository 122 may include information that may be used and/or referenced by the DMO 104 to determine the semantics of the data model 102 .
- the naming convention 112 may include a standard or criterion associated with naming objects of the data model 102 .
- the naming convention 112 may provide a syntax-neutral and technology-independent way to improve reuse, interoperability and consistency across the entity classes 106 A-E and/or attributes 108 of the data model 102 .
- An example naming convention 112 may include the core component technical specification (CCTS).
- the naming convention 112 may include a three-part naming convention.
- the three-part naming convention, or dictionary entry name (DEN) may use data elements similar to those used in formulating sentences.
- the DEN may include three main components, an object term or object class term (OCT) 202 , a property term (PT) 204 and a representation term (RT) 206 .
- the DEN in other implementations, may include additional terms such as a property qualifier (PQ) 208 .
- each DEN component may map to a portion of a natural language sentence.
- the OCT 202 may correspond to the object “purchase order” of the sentence
- the PT 204 may correspond to the verb “supply” of the sentence
- the RT 206 may correspond to the subject “item” of the sentence
- the PQ 208 may correspond to the predicate of the sentence.
- the DEN approach may allow the names 110 of the entity classes 106 A-E to be based on natural language and implemented in a standard form.
- the DEN approach may allow for a more precise method of naming the entity classes 106 A-E and attributes 108 , where the semantics of each component (e.g., 202 , 204 , 206 and 208 ) may be processed individually and/or as a group by the DMO 104 .
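The component structure of a DEN can be pictured with a small record type. The sketch below is illustrative only: the class name `DictionaryEntryName`, the function `parse_den` and the period-separated input form are assumptions made for this example, not details given in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntryName:
    """Hypothetical container for the DEN components 202, 204, 206 and 208."""
    object_class_term: str                     # OCT 202
    property_term: str                         # PT 204
    representation_term: str                   # RT 206
    property_qualifier: Optional[str] = None   # PQ 208, not always present


def parse_den(text: str) -> DictionaryEntryName:
    """Split a period-separated name into its three main components.

    The period-separated layout is an assumption of this sketch.
    """
    parts = [part.strip() for part in text.split(".") if part.strip()]
    if len(parts) != 3:
        raise ValueError(f"expected three components, got {len(parts)}")
    return DictionaryEntryName(*parts)


den = parse_den("Purchase Order. Supply. Item")
print(den.object_class_term)  # Purchase Order
```

Holding each component in its own field is what would let the components be compared individually or as a group.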
- the naming engine 114 may be configured to refine the names 110 (and attributes 108 ) of the data model 102 to conform to the naming convention 112 .
- the naming engine 114 may add, remove or otherwise modify characters or terms of the names 110 to conform them to the naming convention 112 .
- the naming convention 112 may prohibit one or more invalid terms 116 from appearing in the names 110 .
- the invalid terms 116 may include any combination of one or more letters, symbols, whitespace, numbers or other characters that may not be included in names conforming to the naming convention 112 .
- the invalid terms 116 may include the terms “ ” (e.g., whitespace) and “_”.
- the naming engine 114 may parse the names 110 for the invalid terms 116 . If the naming engine 114 finds an invalid term 116 within the name 110 , the naming engine 114 may remove the invalid term 116 from the name 110 .
- the naming engine 114 may produce a refined name 118 .
- the refined name 118 may include the name 110 , after processing by the naming engine 114 , as it conforms to the naming convention 112 .
- the name 110 of the entity class 106 C may be “Customer Information”.
- the naming engine 114 may apply the rules or standards of the naming convention 112 (e.g., no invalid terms 116 ) and may produce the refined name 118 “CustomerInformation”. It is possible that the refined name 118 may include the name 110 without any modification, for instance, if the naming engine 114 determines the name 110 already conforms to the naming convention 112 or the recommended modifications were rejected.
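The removal of invalid terms described above might look like the following minimal sketch. The function name `refine_name` and the constant `INVALID_TERMS` are hypothetical; only the two example invalid terms (whitespace and “_”) come from the description.

```python
# Example invalid terms named in the description: whitespace and "_".
INVALID_TERMS = (" ", "_")


def refine_name(name: str, invalid_terms=INVALID_TERMS) -> str:
    """Return the name with every occurrence of an invalid term removed."""
    refined = name
    for term in invalid_terms:
        refined = refined.replace(term, "")
    return refined


print(refine_name("Customer Information"))  # CustomerInformation
print(refine_name("Payment_History"))       # PaymentHistory
```

A name that already conforms passes through unchanged, which matches the case where the refined name equals the original name.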
- the semantics analyzer 120 may be configured to compare the refined names 118 to the repository 122 to determine a semantic meaning 124 of the refined names 118 .
- the repository 122 may include a library of various components or resources that may be used to determine the semantic meaning 124 of the names 110 and/or attributes 108 .
- the repository 122 may include a dictionary 123 such as, for example, the WordNet® dictionary that may determine associations between words or terms.
- the dictionary 123 may include terms or words that may be connected through various relationships, including synonyms, homonyms and antonyms.
- the semantics analyzer 120 may compare one or more portions of the refined name 118 to the repository 122 , including the dictionary 123 , to determine the semantic meaning 124 .
- the WordNet® dictionary (e.g., 123 ) may include multiple synonym sets for determining the semantic meanings or semantic distance 128 between words or terms.
- the repository 122 including the WordNet® dictionary, is further discussed below with reference to FIG. 3 .
- the repository 122 may include business objects 125 A, the Internet 125 B, data models 125 C and/or any other combination of resources that may be used to determine the semantic meaning 124 of one or more words.
- the business objects 125 A may include entity classes (e.g., 106 A-E) or business documents used in one or more systems.
- the business objects 125 A may include a sales estimate form.
- the Internet 125 B may include any resources available via the Internet, including the WordNet® dictionary (e.g., dictionary 123 ).
- the data models 125 C may include any additional or existing data models 102 that may be stored in the repository 122 . Though discussed with reference to a single data model 102 , it should be understood that the data model 102 may refer to several data models 102 that are to be combined in some way and optimized by the DMO 104 .
- the semantic meaning 124 may include an interpretation of the linguistics or semantics used in the refined name 118 .
- the semantic meaning 124 may include, for example, a set of synonyms or other words that may be semantically relevant to one or more terms of the refined name 118 .
- the refined name 118 “Customer” may result in the semantic meaning 124 “client, consumer and purchaser”.
- Example operations related to determining the semantic meaning 124 are discussed below in greater detail in reference to FIG. 9 .
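As a rough illustration of a semantic meaning expressed as a set of synonyms, the sketch below looks a term up in a tiny in-memory table. The table `TOY_DICTIONARY` and the function `semantic_meaning` are hypothetical stand-ins; a real repository would be far larger and is not modeled here.

```python
# Toy stand-in for the dictionary 123. The "customer" entry mirrors the
# example in the text; the "invoice" entry is made up for illustration.
TOY_DICTIONARY = {
    "customer": {"client", "consumer", "purchaser"},
    "invoice": {"bill"},
}


def semantic_meaning(term: str, dictionary=TOY_DICTIONARY) -> set:
    """Return the synonym set recorded for a term, or an empty set."""
    return set(dictionary.get(term.lower(), set()))


print(sorted(semantic_meaning("Customer")))  # ['client', 'consumer', 'purchaser']
```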
- the naming engine 114 may use the semantic meaning 124 to further refine the refined name 118 per the naming convention 112 .
- the naming convention 112 may prohibit complex terms 126 , and based on the semantic meaning 124 , the naming engine 114 may identify and remove complex terms 126 from the refined names 118 .
- the complex terms 126 may include a combination of two or more terms that may be prohibited by the naming convention 112 .
- the complex terms 126 may include a combination of two or more terms with two or more distinct semantic meanings 124 that are joined together as a complex term but may retain the two or more distinct semantic meanings 124 .
- an unacceptable complex term 126 may include StreetAddress, where the combination of the term “Street” with the term “Address” is unacceptable because both terms when joined, as a complex term, retain their separate meanings.
- the naming engine 114 may compare the semantic meaning 124 of “Street” and “Address” with “StreetAddress” to determine a complex term 126 exists.
- the naming engine 114 may then, for example, split “StreetAddress” into two simpler terms “Street” and “Address”. Then for example, the naming engine 114 may determine that “Street” is part of “Address”, and thus “StreetAddress” may be simplified to the refined name 118 “Address”.
- an acceptable complex term may include PostOfficeBox, where the combination of the terms “PostOffice” (or even “Post” and “Office”) and “Box” when combined, create a new joined meaning in “PostOfficeBox”, where “PostOfficeBox” may have a singular and distinct semantic meaning 124 .
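The treatment of complex terms can be sketched as follows, assuming names are written in CamelCase. The lookup tables `PART_OF` and `ATOMIC_COMPOUNDS` are hypothetical; in the described system this knowledge would come from the semantic meaning rather than from hard-coded sets.

```python
import re

# Hypothetical knowledge for this sketch: (part, whole) pairs, and compound
# terms that carry a single distinct meaning of their own.
PART_OF = {("Street", "Address")}
ATOMIC_COMPOUNDS = {"PostOfficeBox"}


def split_camel_case(name: str) -> list:
    """Split a CamelCase name into its capitalized terms."""
    return re.findall(r"[A-Z][a-z0-9]*", name)


def simplify_complex_term(name: str) -> str:
    """Drop any term that is merely a part of another term in the same name."""
    if name in ATOMIC_COMPOUNDS:
        return name  # acceptable complex term, keep as is
    terms = split_camel_case(name)
    if not terms:
        return name  # nothing recognizable to split
    kept = [t for t in terms
            if not any((t, other) in PART_OF for other in terms)]
    return "".join(kept)


print(simplify_complex_term("StreetAddress"))  # Address
print(simplify_complex_term("PostOfficeBox"))  # PostOfficeBox
```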
- the comparator 126 may be configured to determine a semantic distance 128 between the terms within a refined name 118 and/or between two or more names 110 or attributes 108 .
- the semantic distance 128 may include an indication of the distance or variance between the semantic meaning 124 of two or more terms. For example, the semantic distance 128 may indicate that the distance between “Vehicle” and “Truck” is less than the distance between “Vehicle” and “Man”, as truck is a type of vehicle.
- the semantic distance 128 may indicate a distance from a shared or root term.
- the comparator 126 may determine a semantic distance between “Car” and “Truck” by comparing the semantic distances 128 between “Car” and “Truck” from a root term they share, such as “Vehicle”. Then for example, the semantic distance 128 between “Car” and “Truck” may be determined by adding the semantic distance between “Vehicle” and “Car” and “Vehicle” and “Truck”.
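One way to picture a distance measured through a shared root term is to count hops in a small "is a kind of" hierarchy. The hierarchy `PARENT` below is invented for illustration, and summing the hops to the nearest shared term is only one possible reading of the description.

```python
# Hypothetical hierarchy for this sketch: child term -> parent term.
PARENT = {
    "Car": "Vehicle",
    "Truck": "Vehicle",
    "Vehicle": "Object",
    "Man": "Person",
    "Person": "Object",
}


def hops_to_ancestors(term):
    """Map the term and each of its ancestors to its hop count from the term."""
    hops, depth = {}, 0
    while term is not None:
        hops[term] = depth
        term = PARENT.get(term)
        depth += 1
    return hops


def semantic_distance(a, b):
    """Sum of hops from each term to their nearest shared term, or None."""
    up_a, up_b = hops_to_ancestors(a), hops_to_ancestors(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    return min(up_a[t] + up_b[t] for t in shared)


print(semantic_distance("Car", "Truck"))      # 2
print(semantic_distance("Vehicle", "Truck"))  # 1
print(semantic_distance("Vehicle", "Man"))    # 3
```

With this toy hierarchy, "Vehicle" is closer to "Truck" than to "Man", in line with the example above.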
- the comparator 126 may compare the semantic distance 128 to a semantic threshold distance 130 to determine whether the semantic distance 128 exceeds the semantic threshold distance 130 .
- the semantic threshold distance 130 may indicate the furthest or greatest acceptable semantic distance (e.g., 128 ) allowed between two or more terms. Then, based on whether the semantic distance 128 exceeds the semantic threshold distance 130 , the comparator 126 may modify one or more terms of a refined name 118 .
- a refined name 118 may include the terms “Vehicle” and “Man”.
- the semantic distance 128 between “Vehicle” and “Man” may be of the value 3 (e.g., Vehicle → Manufacturer → Employee → Man), where the semantic threshold distance 130 may indicate a value of 2.
- the comparator 126 may then remove, or recommend removal, of “Man” from the refined name 118 .
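The threshold check in this example can be sketched as below. The distance table `DISTANCES` is hypothetical apart from the Vehicle/Man value of 3 taken from the text, and treating the first term of the name as the reference term is an assumption of this sketch.

```python
# Hypothetical pairwise semantic distances. The value 3 for Vehicle/Man
# follows the example in the text; the other value is made up.
DISTANCES = {
    frozenset(("Vehicle", "Man")): 3,
    frozenset(("Vehicle", "Truck")): 1,
}


def prune_terms(terms, threshold=2):
    """Keep the first term and every later term within `threshold` of it."""
    if not terms:
        return []
    anchor, kept = terms[0], [terms[0]]
    for term in terms[1:]:
        distance = DISTANCES.get(frozenset((anchor, term)))
        # Terms with no known distance are left in place rather than guessed at.
        if distance is None or distance <= threshold:
            kept.append(term)
    return kept


print(prune_terms(["Vehicle", "Man"]))    # ['Vehicle']
print(prune_terms(["Vehicle", "Truck"]))  # ['Vehicle', 'Truck']
```

In a semi-automatic setting the removal would be a recommendation presented to the user rather than an immediate change.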
- the semantic threshold distance 130 may indicate a first threshold for determining similarity between terms of a refined name 118 , a second threshold for determining similarity between two or more names 110 of various entity classes 106 A-E and a third threshold for determining similarity between two or more attributes 108 .
- Example operations associated with determining the semantic distance 128 are further discussed below with reference to FIG. 3 .
- the modeler 132 may be configured to determine whether a substantial similarity exists between two or more objects of the data model 102 and one or more modifications 134 that may be made to optimize the data model 102 .
- the modeler 132 may determine whether the substantial similarity exists between two or more entity classes 106 A-E.
- the substantial similarity may indicate that two or more of the entity classes 106 A-E are substantially, semantically similar.
- the modeler 132 may, for example, first compare the semantic meanings 124 of the refined names 118 of the first entity class 106 B and the second entity class 106 C. The modeler 132 may then, for example, compare the semantic meaning 124 of the attributes 108 of the entity classes 106 B and 106 C.
- the modeler 132 may determine that a substantial similarity exists between the entity class 106 B and the entity class 106 C based on the semantic meaning 124 comparisons of the names 110 (e.g., refined names 118 ) and the attributes 108 (e.g., being less than the semantic threshold distance 130 ).
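The two-stage comparison (names first, then attributes) might be sketched as follows. This sketch swaps in a simpler technique than the semantic distance described above: names and attributes are mapped to a canonical synonym, and attribute sets are compared by their overlap ratio. The table `SYNONYMS`, the `min_overlap` parameter and all function names are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical synonym table: term -> canonical term.
SYNONYMS = {"client": "customer", "purchaser": "customer", "surname": "lastname"}


def canonical(term: str) -> str:
    """Lower-case a term and map it to its canonical synonym, if any."""
    lowered = term.lower()
    return SYNONYMS.get(lowered, lowered)


@dataclass
class EntityClass:
    name: str
    attributes: list = field(default_factory=list)


def substantially_similar(a: EntityClass, b: EntityClass, min_overlap=0.5) -> bool:
    """Names must match semantically, then attributes must overlap enough."""
    if canonical(a.name) != canonical(b.name):
        return False
    attrs_a = {canonical(x) for x in a.attributes}
    attrs_b = {canonical(x) for x in b.attributes}
    if not attrs_a or not attrs_b:
        return False
    overlap = len(attrs_a & attrs_b) / len(attrs_a | attrs_b)
    return overlap >= min_overlap


customer = EntityClass("Customer", ["FirstName", "LastName", "Address"])
client = EntityClass("Client", ["FirstName", "Surname", "CreditScore"])
print(substantially_similar(customer, client))  # True
```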
- the modeler 132 may determine and/or perform one or more modifications 134 to the data model 102 .
- the modification 134 may include any modification designed to optimize, normalize or otherwise increase the efficiency of the data model 102 including, for instance, conforming the names 110 and/or attributes 108 to the naming convention 112 , as discussed above.
- the modifications 134 may include adding, removing, moving or otherwise modifying the entity classes 106 A-E, names 110 , attributes 108 and/or relationships of the data model 102 .
- the modeler 132 may create new entity classes in the data model 102 . For example, the modeler 132 may determine that multiple attributes 108 are used in a first entity class 106 C and a second entity class 106 A. Then for example, the modeler 132 may create a new entity class that includes the attributes 108 used in both the first and second entity classes 106 C and 106 A. The modeler 132 may then, for example, establish relationships in the data model 102 between the first entity class 106 C, the second entity class 106 A and the newly created entity class.
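The creation of a new entity class from attributes shared by two entity classes can be sketched as below. The dictionary-based model layout and the function name are hypothetical, and removing the shared attributes from the two original classes is an assumption, since the description only says the new class includes them.

```python
def extract_shared_attributes(model, first, second, new_name):
    """Move attributes common to two classes into a new, related class.

    `model` is {"classes": {name: [attributes]}, "relationships": [(a, b)]}.
    Returns a new model and leaves the input untouched.
    """
    shared = [a for a in model["classes"][first] if a in model["classes"][second]]
    if len(shared) < 2:  # "multiple attributes" are required
        return model
    classes = {name: list(attrs) for name, attrs in model["classes"].items()}
    classes[first] = [a for a in classes[first] if a not in shared]
    classes[second] = [a for a in classes[second] if a not in shared]
    classes[new_name] = shared
    relationships = list(model["relationships"])
    relationships += [(first, new_name), (second, new_name)]
    return {"classes": classes, "relationships": relationships}


model = {
    "classes": {
        "SalesOrder": ["Number", "Street", "City"],
        "Customer": ["Name", "Street", "City"],
    },
    "relationships": [],
}
result = extract_shared_attributes(model, "SalesOrder", "Customer", "Address")
print(result["classes"]["Address"])  # ['Street', 'City']
```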
- the modification 134 may be based, at least in part, on a usage 136 as determined by the statistics system 138 .
- the statistics system 138 may be configured to track how often an entity class 106 A-E and/or attribute 108 is used (e.g., usage 136 ) within the data model 102 . Then, for example, based on the usage 136 , the modeler 132 may determine which entity classes 106 A-E and/or attributes 108 may be removed or otherwise modified in the data model 102 . For example, the modeler 132 may determine that a substantial similarity (e.g., redundancy) exists between a first entity class 106 D and a second entity class 106 E.
- the usage 136 may indicate that the second entity class 106 E is used far more often than the first entity class 106 D.
- the modification 134 may indicate removing the first entity class 106 D or collapsing it into the second entity class 106 E, based on the usage 136 (and adjusting the relationships accordingly).
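A usage-based consolidation of two redundant entity classes might look like the following sketch. The data layout, the function name `collapse_redundant` and the tie-breaking rule (keep the first class on equal usage) are assumptions.

```python
def collapse_redundant(classes, relationships, usage, first, second):
    """Collapse the less-used of two redundant classes into the other.

    `classes` maps names to attribute lists, `relationships` is a list of
    (a, b) pairs and `usage` maps names to usage counts.
    """
    if usage.get(first, 0) >= usage.get(second, 0):
        keep, drop = first, second
    else:
        keep, drop = second, first
    merged = {name: list(attrs) for name, attrs in classes.items()}
    dropped_attrs = merged.pop(drop)
    merged[keep] += [a for a in dropped_attrs if a not in merged[keep]]
    # Redirect relationships that pointed at the removed class.
    rewired = []
    for a, b in relationships:
        a = keep if a == drop else a
        b = keep if b == drop else b
        if a != b and (a, b) not in rewired:
            rewired.append((a, b))
    return merged, rewired


classes = {
    "PaymentHistory": ["Amount", "Date"],
    "PaymentReceipt": ["Amount", "Payer"],
    "Invoice": ["Number"],
}
relationships = [("Invoice", "PaymentHistory"), ("PaymentHistory", "PaymentReceipt")]
usage = {"PaymentHistory": 3, "PaymentReceipt": 40}
merged, rewired = collapse_redundant(
    classes, relationships, usage, "PaymentHistory", "PaymentReceipt")
print(merged["PaymentReceipt"])  # ['Amount', 'Payer', 'Date']
print(rewired)                   # [('Invoice', 'PaymentReceipt')]
```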
- the modification handler 140 may be configured to generate a message 142 based on the modification 134 .
- the message 142 may include a notification to a user 144 about the modification 134 to be performed on the data model 102 . Then, for example, based on the message 142 , as provided to the user 144 via an interface 146 , the user 144 may accept or reject the modification 134 . Based upon the response of the user 144 to the message 142 , the modeler 132 may then either implement or disregard the modification 134 .
- Data models 102 may be complex structures based on detailed semantics and/or linguistics that may be specialized in different industries. Allowing the user 144 to accept or reject modifications 134 to one or more aspects of the data model 102 provides the system 100 with another level of error checking, such that the DMO 104 may be prevented from modifying aspects of the data model 102 that, while not optimal (as determined by the DMO 104 ), may have been purposely included in the data model 102 . Then, for example, based on the user's 144 acceptance/rejection of the modification 134 , the DMO 104 may recursively analyze the data model 102 and determine additional modifications 134 that may be made.
- the recursive processing of the data model 102 by the DMO 104 allows the DMO 104 to implement a semi-automatic, or recursive, optimization process with regard to the data model 102 .
- the semi-automatic process may allow the DMO 104 to automatically determine which modifications 134 are to be made to the data model 102 , but also may allow the user 144 to determine whether the semantic analysis performed by the DMO 104 is accurate with regard to the data model 102 .
- the DMO 104 may both flatten the hierarchy of the data model 102 and normalize the data model 102 .
- the DMO 104 may balance the objective of normalization, which is to increase reuse of objects of the data model 102 , against the objective of flattening the hierarchy of the data model 102 , which is to have fewer recursive levels of inheritance between the objects of the data model 102 .
- the DMO 104 may balance these two objectives and modify the data model 102 based on a determination as to which objective will lead to greater optimization in the data model 102 .
- FIG. 3 is an example implementation of a dictionary 123 , according to an exemplary implementation.
- FIG. 3 illustrates an example mapping of terms within the dictionary 123 (of FIG. 1 ) of the repository 122 .
- the mapping of FIG. 3 may be used, for example, to determine a semantic distance (e.g., 128 ) between two or more terms or a semantic meaning (e.g., 124 ) of a term.
- the semantic meaning of “market order” 302 may include “commercial document” 304 and “purchase order” 306 .
- the semantic distance between “stop-loss order” 308 and “stop order” 310 may be 1, such that each term is 1 term removed from a root or common term (e.g., “order” 312 /“purchase order” 306 ).
- the lines between the nodes of the mapping of FIG. 3 may indicate relationships between the terms of each node.
- Example relationships may include synonyms, homonyms, and antonyms.
- FIG. 4 is an example flowchart 400 illustrating example operations associated with optimizing a data model 102 , according to an exemplary implementation.
- the flowchart 400 may provide that after each step (e.g., 402 , 404 , 406 , 408 , 410 , 412 , 414 and 416 ) the message 142 may be generated at step 418 , and a user 144 may then decline or accept the modification 134 .
- the flowchart may return to step 402 . This may allow the DMO 104 to recursively optimize one or more data models 102 . If the modification 134 is rejected, the optimization may be completed at 420 and restarted at a later time.
- the dictionary entry name (DEN) of the entity class 106 may be cleaned up.
- the naming engine 114 may remove invalid terms 116 from the names 110 of entity classes 106 A-E.
- a message 142 may be generated requesting acceptance for this modification 134 prior to completion by the DMO 104 .
- empty data structures may be removed from the data model 102 .
- the DMO 104 may determine which of the entity classes 106 A-E of the data model 102 contain no attributes 108 . Then for example, the DMO 104 may remove any of the empty entity classes 106 A-E from the data model 102 . The DMO 104 also may adjust relationships between the removed entity class and the remaining existing entity classes, to reflect the removal.
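Removing empty entity classes can be sketched in a few lines. The data layout is hypothetical, and simply dropping every relationship that touches a removed class is only one way the relationships might be adjusted.

```python
def remove_empty_classes(classes, relationships):
    """Drop classes with no attributes and any relationship touching them.

    `classes` maps names to attribute lists; `relationships` is a list of
    (a, b) pairs.
    """
    kept = {name: attrs for name, attrs in classes.items() if attrs}
    remaining = [(a, b) for a, b in relationships if a in kept and b in kept]
    return kept, remaining


classes = {"Customer": ["FirstName", "LastName"], "Placeholder": [], "Invoice": ["Number"]}
relationships = [("Customer", "Placeholder"), ("Customer", "Invoice")]
print(remove_empty_classes(classes, relationships))
# ({'Customer': ['FirstName', 'LastName'], 'Invoice': ['Number']}, [('Customer', 'Invoice')])
```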
- non-valid terms may be cleaned up.
- the naming engine 114 may modify complex terms 126 in the refined names 118 to simpler terms.
- entities with non-relevant semantics may be cleaned up.
- the comparator 126 may determine whether the semantic distance 128 between two or more terms of a name 110 (including refined name 118 ) and/or attribute is less than the semantic threshold distance 130 . If the comparator 126 determines, for example, that the semantic distance 128 between two or more terms of a single name 110 is greater than the semantic threshold distance 130 , then one or more of the terms may be modified or cleaned up.
- redundancy may be cleaned up.
- the modeler 132 may consolidate, remove or otherwise clean up the redundancies.
- An example redundancy between two attributes 108 may be based on a semantic similarity between their names (e.g., the semantic distance 128 being less than the semantic threshold distance 130 ) and a similar type definition.
- the type definition may include any number of descriptions as to the type of information that may be stored by the attribute 108 , such as, integer, whole number, string and currency.
- type codes may be extracted.
- An attribute 108 may include a type definition to describe the information to be stored in the attribute.
- a first name attribute 108 may include the type definition ‘string,’ and an age attribute 108 may include the type ‘integer.’
- the type ‘code’ may indicate that a separate file or code list may be referenced to determine what the code refers to.
- an attribute error code may include a type ‘code’ which makes reference to a list of error codes.
- the list of error codes may include code 32 indicating a printer error, code 43 indicating a software error and code 55 indicating a network error.
- the modeler 132 may determine that this attribute 108 includes a type code and may extract it and modify the entity class 106 A-E and/or attribute 108 accordingly.
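The type code extraction can be sketched as follows, using the error codes from the example above. The function name `extract_type_codes`, the `ref:` marker and the dictionary layout are hypothetical; the description does not say how the extracted code list is represented, so this is only one possible reading.

```python
def extract_type_codes(entity, code_lists):
    """Split 'code'-typed attributes out of an entity into referenced code lists."""
    remaining, extracted = {}, {}
    for attr, type_def in entity.items():
        if type_def == "code":
            # Keep the code list separately and leave a reference in the entity.
            extracted[attr] = code_lists[attr]
            remaining[attr] = f"ref:{attr}CodeList"
        else:
            remaining[attr] = type_def
    return remaining, extracted


code_lists = {
    "ErrorCode": {32: "printer error", 43: "software error", 55: "network error"},
}
entity = {"DeviceName": "string", "ErrorCode": "code"}

remaining, extracted = extract_type_codes(entity, code_lists)
print(remaining)
print(extracted["ErrorCode"][43])
```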
- entities may be consolidated. For example, if the modeler 132 determines a substantial similarity between two or more entity classes 106 A-E, then the redundancy may be cleaned up. For example, one of the redundant entity classes may be removed, or for example the redundant entity classes may be consolidated into fewer entity classes.
- the data model 102 may be normalized or de-normalized, as appropriate, to improve optimization.
- the DMO 104 may compare the semantic meanings 124 of the names 110 and the attributes 108 of the entity classes 106 A-E to determine whether the attributes 108 are properly positioned within the data model 102 . If the DMO 104 determines that an attribute 108 , of a first entity class (e.g., 106 A) is more closely associated with the attributes 108 of a second entity class (e.g., 106 B), the DMO 104 may move the attribute 108 into the second entity class.
- the DMO 104 also may determine whether or not to create new entity classes within the data model 102 . For example, if multiple attributes 108 of an entity class 106 have similar values (e.g., property qualifiers 208 ), those attributes 108 may be extracted from the current entity class 106 and placed into a new entity class. An example of this normalization may be seen in FIG. 5 .
- FIG. 5 illustrates an example normalization process of the data model optimizer (DMO) 104 , according to an exemplary implementation.
- the entity class “Item” e.g., OCT 202
- Blocks 504 A and 504 B show the entity class after normalization where the entity class “Item” has been divided into the entity classes “Item” 504 A and “Shipment” 504 B.
- the attributes of “Item” 502 associated with “Shipment” 208 have been moved into the new entity class “Shipment” 504 B.
- An object class qualifier (OCQ) 506 may be used to further describe or qualify the object class term (OCT) 202 .
- OCT object class term
- each of the items 202 may be part of the purchase order line 506 .
- the columns 508 A and 508 B may indicate the minimum and maximum number of values that may be occupied in each attribute 108 .
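The split of “Item” into “Item” and “Shipment” can be sketched by grouping attributes on their property qualifier. The attribute list below is invented for illustration and is not taken from FIG. 5, and the rule "extract a group when at least two attributes share a qualifier" is an assumption about how "multiple attributes with similar values" might be tested.

```python
from collections import defaultdict


def split_by_qualifier(entity_name, attributes, min_shared=2):
    """Move attributes that share a property qualifier into a new entity class."""
    groups = defaultdict(list)
    for qualifier, prop in attributes:
        groups[qualifier].append(prop)

    model = {entity_name: []}
    for qualifier, props in groups.items():
        if qualifier and len(props) >= min_shared:
            # Enough attributes share this qualifier: create a new entity class.
            model[qualifier] = props
        else:
            # Otherwise the attributes stay in the original entity class.
            model[entity_name].extend((qualifier or "") + p for p in props)
    return model


# Hypothetical (property qualifier, property term) pairs of the "Item" entity.
item_attributes = [
    (None, "Identifier"),
    (None, "Description"),
    ("Shipment", "Date"),
    ("Shipment", "Carrier"),
    ("Tax", "Amount"),
]

model = split_by_qualifier("Item", item_attributes)
print(model)
```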
- the DMO 104 may normalize the data model 102 into a third normalization form.
- the attributes 108 of an entity class 106 A-E may each be atomic or include a single value.
- a name attribute 108 including a person's first and last name (e.g., Mark Smith)
- a first name attribute 108 (e.g., Mark)
- a last name attribute 108 (e.g., Smith)
- every non-key attribute 108 of an entity class 106 A-E may depend on the primary key of the entity class 106 A-E.
- the city field of an attribute 108 may depend on the zip code field of the attribute 108 .
- each attribute 108 of an entity class 106 A-E may depend on nothing other than the primary key.
- the zip code may serve as a primary key upon which the city and state are dependent.
- the DMO 104 may normalize the data model 102 with regards to the three normalization forms.
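The zip code example can be made concrete with a small sketch: because city and state depend on the zip code rather than directly on the customer key, they are moved into a separate table keyed by zip code. The function name, the column names and the sample rows are hypothetical.

```python
def split_transitive_dependency(rows, determinant, dependents):
    """Move columns that depend on a non-key determinant into their own table."""
    main_rows, lookup = [], {}
    for row in rows:
        # One lookup entry per determinant value (e.g., per zip code).
        lookup[row[determinant]] = {name: row[name] for name in dependents}
        # The main table keeps the determinant as a reference to the lookup.
        main_rows.append({k: v for k, v in row.items() if k not in dependents})
    return main_rows, lookup


customers = [
    {"customer_id": 1, "first_name": "Mark", "last_name": "Smith",
     "zip": "10001", "city": "New York", "state": "NY"},
    {"customer_id": 2, "first_name": "Ann", "last_name": "Lee",
     "zip": "10001", "city": "New York", "state": "NY"},
]

customer_rows, zip_rows = split_transitive_dependency(
    customers, "zip", ("city", "state")
)
print(customer_rows)
print(zip_rows)
```

After the split, the city and state are stored once per zip code instead of once per customer.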
- FIG. 6 is an example flowchart of a process 600 illustrating example operations associated with optimizing a data model 102 , according to an exemplary implementation.
- Process 600 may include refining the name of each of the entity classes to conform to a naming convention ( 610 ), determining a semantic meaning of each of the refined names and the one or more attributes of each entity class ( 620 ), determining that the name of a first entity class is semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class ( 630 ), determining a substantial similarity between the first entity class and the second entity class based on a comparison of the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class ( 640 ) and normalizing the data model based on the substantial similarity ( 650 ).
- the name of each of the entity classes may be refined to conform to a naming convention.
- the naming engine 114 may refine the names 110 of the entity classes 106 A-E to conform to the naming convention 112 .
- a semantic meaning for each of the refined names and the one or more attributes of each entity class may be determined.
- the semantics analyzer 120 may determine the semantic meaning 124 for the refined names 118 and the attributes 108 of the entity classes 106 A-E.
- the name of a first entity class may be determined to be semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class.
- the comparator 126 may determine that the semantic meaning 124 of the entity class 106 A is semantically similar to the semantic meaning 124 of the entity class 106 B based on the semantic distance 128 being less than the semantic threshold distance 130 .
- a substantial similarity between the first entity class and the second entity class may be determined based on a comparison of the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class.
- the modeler 132 may determine a substantial similarity between the entity class 106 A and the entity class 106 B based on a comparison (e.g., semantic distance 128 being less than the semantic threshold distance 130 ) of the semantic meanings 124 of the attributes 108 of the entity class 106 A and the entity class 106 B.
- the data model may be normalized based on the substantial similarity.
- the modeler 132 may apply the modification 134 to normalize the data model 102 .
- the “Item” entity class in 502 may be divided into the entity classes “Item” and “Shipment” in 504 .
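Operations 630 through 650 of process 600 can be sketched as a two-stage comparison followed by a consolidation. This is an illustrative reading only: the `SYNONYMS` table, the Jaccard-based `distance`, the `min_share` cut-off for "substantial similarity" and the merge strategy are all assumptions, and consolidation is just one of the normalizing modifications the description mentions.

```python
from itertools import combinations

# Hypothetical semantic meanings; unlisted terms only mean themselves.
SYNONYMS = {
    "Customer": {"customer", "client", "consumer", "purchaser"},
    "Client": {"client", "customer", "consumer", "purchaser"},
    "LastName": {"last name", "surname"},
    "Surname": {"surname", "last name"},
}


def meaning(term):
    return SYNONYMS.get(term, {term.lower()})


def distance(a, b):
    """Jaccard distance between the semantic meanings of two terms."""
    ma, mb = meaning(a), meaning(b)
    return 1.0 - len(ma & mb) / len(ma | mb)


def substantially_similar(attrs_a, attrs_b, threshold, min_share=0.5):
    """True if enough attributes of A have a semantically close match in B."""
    if not attrs_a:
        return False
    matched = sum(
        any(distance(a, b) < threshold for b in attrs_b) for a in attrs_a
    )
    return matched / len(attrs_a) >= min_share


def consolidate(model, threshold=0.3):
    """Merge entity classes whose names and attributes are both similar."""
    merged = {name: list(attrs) for name, attrs in model.items()}
    for first, second in combinations(model, 2):
        if first not in merged or second not in merged:
            continue  # one of the two was already merged away
        if distance(first, second) < threshold and substantially_similar(
            merged[first], merged[second], threshold
        ):
            # Keep only the attributes of the second class that add something new.
            extra = [
                b for b in merged[second]
                if all(distance(a, b) >= threshold for a in merged[first])
            ]
            merged[first].extend(extra)
            del merged[second]
    return merged


model = {
    "Customer": ["FirstName", "LastName", "Address"],
    "Client": ["FirstName", "Surname", "Address"],
    "Invoice": ["Amount"],
}
normalized = consolidate(model)
print(normalized)
```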
- FIG. 7 is an example flowchart of a process 700 illustrating example operations associated with optimizing a data model 102 , according to an exemplary implementation.
- Process 700 may include removing invalid terms from the one or more terms of the names ( 710 ), identifying a complex term in one or more terms of the names, the complex term including a combination of two or more simpler terms ( 720 ), separating the identified complex term in the names to reflect the two or more simpler terms ( 730 ), determining, for each of the names including two or more terms, a semantic distance between the two or more terms of the name ( 740 ) and removing from each of the names including two or more terms, one or more terms with the semantic distance being greater than a threshold semantic distance ( 750 ).
- invalid terms may be removed from the one or more terms of the names.
- the naming engine 114 may remove invalid terms 116 from the names 110 .
- a complex term may be identified in one or more of the names, where the complex term may include a combination of two or more simpler terms.
- the naming engine 114 may identify complex terms 126 in one or more of the refined names 118 .
- the identified complex term may be separated in the names to reflect the two or more simpler terms.
- the naming engine 114 may separate the complex terms 126 , identified in the refined name 118 . Then, for example, the naming engine 114 may separate the complex terms 126 into simpler terms in the refined name 118 , or may remove one or more of the simpler terms.
- a semantic distance between the two or more terms of the name may be determined.
- the comparator 126 may determine the semantic distance 128 between two or more terms of the refined name 118 .
- one or more terms with the semantic distance being greater than a threshold semantic distance may be removed.
- the modeler 132 may remove one or more terms with a semantic distance 128 being greater than the semantic threshold distance 130 .
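The operations of process 700 can be sketched end to end. The invalid terms (whitespace and “_”) and the accepted complex term “PostOfficeBox” come from the description; everything else is assumed for illustration, including splitting on capital letters, the `DISTANCES` table, and the choice to measure each later term against the first term of the name.

```python
import re

INVALID_TERMS = (" ", "_")          # examples of invalid terms 116
ACCEPTED_COMPLEX = {"PostOfficeBox"}  # complex term with its own meaning

# Hypothetical pairwise semantic distances between terms.
DISTANCES = {
    frozenset(("Sales", "Order")): 0.2,
    frozenset(("Sales", "Weather")): 0.9,
}


def term_distance(a, b):
    if a == b:
        return 0.0
    return DISTANCES.get(frozenset((a, b)), 1.0)


def remove_invalid_terms(name):
    for term in INVALID_TERMS:
        name = name.replace(term, "")
    return name


def split_complex_term(name):
    # Assumption: simpler terms start at capital letters (CamelCase).
    return re.findall(r"[A-Z][a-z0-9]*", name)


def refine(name, threshold):
    cleaned = remove_invalid_terms(name)
    if cleaned in ACCEPTED_COMPLEX:
        return cleaned
    terms = split_complex_term(cleaned)
    if len(terms) < 2:
        return cleaned
    anchor = terms[0]
    # Drop terms that are semantically too far from the anchor term.
    kept = [anchor] + [
        t for t in terms[1:] if term_distance(anchor, t) <= threshold
    ]
    return "".join(kept)


print(refine("Sales_Order Weather", threshold=0.5))
print(refine("Post Office_Box", threshold=0.5))
```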
- FIG. 8 is an example flowchart of a process 800 illustrating example operations associated with optimizing a data model 102 , according to an exemplary implementation.
- Process 800 may include determining that the names of the plurality of entity classes and the one or more attributes conform to a core component technical specification (CCTS) naming convention, wherein each name and attribute includes an object term, a property qualifier and a property term ( 810 ), determining that each of the one or more attributes of the first entity class includes a unique property qualifier ( 820 ), determining that none of the plurality of entity classes includes multiple attributes that are substantially similar to the one or more attributes of the first entity class ( 830 ), determining, for each of the one or more attributes, a first medial distance between the attribute of the first entity class and the name of the first entity class, and a second medial distance between the attribute of the first entity class and the names of one or more entity classes, from the plurality of entity classes, associated with the first entity class, based on a semantic distance between the object terms, property qualifiers
- the names of the plurality of entity classes and the one or more attributes may be determined to conform to a core component technical specification (CCTS) naming convention, where each name and attribute includes an object term, a property qualifier and a property term.
- CCTS core component technical specification
- the naming engine 114 may determine that the names 110 and attributes 108 of the entity classes 106 A-E conform to the CCTS naming convention 112 .
- Each name 110 and attribute 108 may include the object term 202 , the property qualifier 208 and the property term 204 .
- each of the one or more attributes of the first entity class may be determined to include a unique property qualifier.
- the modeler 132 may determine that each of the attributes 108 , of the “Item” entity class in 504 , includes a unique property qualifier 208 .
- the modeler 132 may determine a first medial distance between a first attribute 108 of the entity class 106 A and the name 110 of the entity class 106 A. The medial distance may be determined based on the semantic distance 128 between the object terms 202 , property qualifiers 208 and property terms 204 . Similarly, the modeler 132 may determine a second medial distance between a first attribute 108 of the entity class 106 A and the name 110 of a second entity class 106 B.
- the first entity class may be normalized such that the first medial distance between the attribute and the first entity class is less than the second medial distance between the attribute and any of the one or more associated entity classes.
- the modeler 132 may determine the modification 134 to normalize the data model 102 such that the first medial distance between the first attribute 108 and the name 110 of the entity class 106 A is less than the second medial distance between the first attribute 108 and the name 110 of the other associated entity classes 106 B-E.
- the modeler 132 may then perform a modification 134 to move the first attribute 108 into a different entity class 106 B-E such that the first medial distance is less than the second medial distances.
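The medial distance comparison can be sketched as follows. The description does not define how the medial distance is computed, so this sketch assumes it is the arithmetic mean of the component-wise semantic distances between object terms, property qualifiers and property terms. The `DISTANCES` table and the example names are hypothetical.

```python
from statistics import mean

# Hypothetical semantic distances between individual terms.
DISTANCES = {frozenset(("Item", "Shipment")): 0.6}


def term_distance(a, b):
    if a == b:
        return 0.0
    return DISTANCES.get(frozenset((a, b)), 1.0)


def medial_distance(attribute_den, entity_den):
    """Mean distance over (object term, property qualifier, property term)."""
    return mean(term_distance(a, b) for a, b in zip(attribute_den, entity_den))


def closest_entity(attribute_den, entity_dens):
    """Name of the entity class with the smallest medial distance."""
    return min(
        entity_dens,
        key=lambda name: medial_distance(attribute_den, entity_dens[name]),
    )


# Each name is an (object term, property qualifier, property term) triple.
entity_dens = {
    "Item": ("Item", "Purchase", "Details"),
    "Shipment": ("Shipment", "Shipment", "Details"),
}
attribute = ("Item", "Shipment", "Date")  # currently stored in "Item"

target = closest_entity(attribute, entity_dens)
print(target)
```

If the closest entity class differs from the one currently holding the attribute, the attribute is a candidate for being moved.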
- FIG. 9 is an example flowchart of a process 900 illustrating example operations associated with determining a semantic meaning 124 , according to an exemplary implementation.
- the process 900 illustrates example operations of the semantics analyzer 120 , of FIG. 1 , in determining the semantic meaning 124 of the names 110 (including refined names 118 ) and/or attributes 108 of the entity classes 106 A-E of the data model 102 .
- Steps 904 A, 904 B, 904 C, 904 D and 904 E each illustrate one of any number of various checks that may be performed to determine the semantic meaning 124 .
- 902 indicates various data sources that may be used to perform any of the checks 904 A-E.
- the relationship between the information available in the repository 122 may be compared against the DEN (e.g., name 110 , refined name 118 ) of an entity class 106 A-E. Then, for example, the closest relationship (e.g., semantic distance 128 ) between the information in the repository 122 and the DEN of the entity class may be determined in 906 , and a rating may be stored in step 908 .
- the rating may indicate a similarity between objects of the repository 122 and objects of the data model 102 .
- a rating (e.g., 908 ) may be determined for each of the checks 904 A-E. In 910 , the best or highest rating may be determined and used to determine the semantic meaning 124 of the term.
- a context or business context of the entity may be determined and the relationship between the repository 122 and the business context may be determined.
- various industries may include industry-specific terminology; then, for example, the industry-specific terminology may be used to determine the semantic meaning 124 of the terms.
- the DENs of the attributes 108 of the entity may be compared to the repository 122 and the closest relationship 906 may be determined.
- descriptions of the entity class and repository 122 may be determined.
- the descriptions may include, for example, natural language words, phrases and/or sentences used to describe entities 106 A-E and items (e.g., 125 A, 125 B and 125 C) of the repository 122 .
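The rating scheme of process 900 can be sketched as running several independent checks and keeping the result with the highest rating. The two check functions, their lookup tables and the rating values are hypothetical placeholders for the checks 904 A-E. The synonym set for “Customer” follows the example given elsewhere in the description.

```python
def dictionary_check(term):
    # Hypothetical dictionary lookup returning (meaning, rating).
    table = {"Customer": ({"client", "consumer", "purchaser"}, 0.9)}
    return table.get(term)


def business_context_check(term):
    # Hypothetical industry-specific lookup returning (meaning, rating).
    table = {"Customer": ({"account holder"}, 0.6)}
    return table.get(term)


def best_semantic_meaning(term, checks):
    """Run every check and keep the meaning with the highest rating."""
    results = [check(term) for check in checks]
    results = [r for r in results if r is not None]
    if not results:
        return None
    return max(results, key=lambda result: result[1])


checks = [dictionary_check, business_context_check]
meaning, rating = best_semantic_meaning("Customer", checks)
print(meaning, rating)
```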
Abstract
A name of one or more entity classes of the data model may be refined to conform to a naming convention. A semantic meaning of each of the names and one or more attributes of each entity class may be determined. It may be determined that the name of a first entity class is semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the names, where a substantial similarity may be determined between the first entity class and the second entity class by comparing the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class. The data model may be normalized based on the substantial similarity.
Description
- This application is a divisional of, and claims priority to, U.S. patent application Ser. No. 13/596,532, filed on Aug. 28, 2012, entitled “DATA MODEL OPTIMIZATION”, which is a continuation of, and claims priority to, U.S. patent application Ser. No. 12/269,324, filed on Nov. 12, 2008, entitled “DATA MODEL OPTIMIZATION”, now issued as U.S. Pat. No. 8,290,989. All of the foregoing patents and patent applications are incorporated by reference herein in their entireties.
- This description relates to the optimization of data models.
- Data models may be used to organize or structure data about one or more business systems. Over time, as businesses grow and become more sophisticated, so too may their data models grow as well. This growth of the data models may cause the data models to develop redundancies, deep levels of recursive hierarchies and other inefficiencies. Such inefficiencies also may be introduced when two or more data models are merged into one such as, for example, when one business buys or merges with another.
- One of the challenges in removing these inefficiencies from the data models is that oftentimes the names and structure of the data model may be closely integrated with the semantics and linguistics of the particular business system. It would be beneficial to have a system capable of identifying semantic similarities and/or differences across one or more data models based on the semantic meanings of the objects, to optimize (e.g., identify and remove these inefficiencies) one or more data models.
- According to one general aspect, a computer-implemented method for normalizing a data model of one or more entity classes is provided, where each entity class may include a name and one or more attributes. The name of each of the entity classes may be refined to conform to a naming convention. A semantic meaning of each of the refined names and the one or more attributes of each entity class may be determined. It may be determined that the name of a first entity class is semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class. A substantial similarity may be determined between the first entity class and the second entity class based on a comparison of the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class. The data model may be normalized based on the substantial similarity.
- According to another general aspect, a computer-implemented method for recursively refining names of one or more entity classes of a data model based on a naming convention is provided, where each of the names may include one or more terms. Invalid terms may be removed from the one or more terms of the names. A complex term in one or more of the names may be identified, where the complex term may include a combination of two or more simpler terms. The identified complex term in the names may be separated to reflect the two or more simpler terms. For each of the names including two or more terms, a semantic distance between the two or more terms of the name may be determined. From each of the names including two or more terms, one or more terms with the semantic distance being greater than a threshold semantic distance may be removed.
- According to another general aspect, a computer-implemented method for normalizing a first entity class of a data model having a plurality of entity classes is provided, where each of the entity classes may include a name and one or more attributes. It may be determined that the names of the plurality of entity classes and the one or more attributes conform to a core component technical specification (CCTS) naming convention, where each name and attribute may include an object term, a property qualifier and a property term. It may be determined that each of the one or more attributes of the first entity class includes a unique property qualifier. It may be determined that none of the plurality of entity classes includes multiple attributes that are substantially similar to the one or more attributes of the first entity class. For each of the one or more attributes of the first entity class, a first medial distance between the attribute of the first entity class and the name of the first entity class may be determined, and a second medial distance between the attribute of the first entity class and the names of one or more entity classes, from the plurality of entity classes, associated with the first entity class, may be determined based on a semantic distance between the object terms, property qualifiers and property terms. For each of the one or more attributes of the first entity class, the first entity class may be normalized such that the first medial distance between the attribute and the first entity class is less than the second medial distance between the attribute and any of the one or more associated entity classes.
- The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
-
FIG. 1 is a block diagram of an example system for optimizing a data model. -
FIG. 2 is an example implementation of a naming convention, according to an example embodiment. -
FIG. 3 is an example implementation of a dictionary, according to an example embodiment. -
FIG. 4 is an example flowchart illustrating example operations associated with optimizing a data model, according to an example embodiment. -
FIG. 5 illustrates an example normalization process of the data model optimizer (DMO), according to an exemplary implementation. -
FIG. 6 is an example flowchart of a process illustrating example operations associated with optimizing a data model, according to an exemplary implementation. -
FIG. 7 is an example flowchart of a process illustrating example operations associated with optimizing a data model, according to an exemplary implementation. -
FIG. 8 is an example flowchart of a process illustrating example operations associated with optimizing a data model, according to an exemplary implementation. -
FIG. 9 is an example flowchart of a process illustrating example operations associated with determining a semantic meaning, according to an exemplary implementation. -
FIG. 1 is a block diagram of anexample system 100 for optimizing adata model 102. Thesystem 100 may include a data model optimizer 104 (hereinafter, “DMO 104”) that may normalize, consolidate, harmonize and/or otherwise optimize thedata model 102. TheDMO 104 may compareentity classes 106A-E of thedata model 102 to each other, and based on semantics used to describe theentity classes 106A-E,attributes 108 of theentity classes 106A-E and relationships between theentity classes 106A-E, may optimize thedata model 102. Such optimization may include, for example, reducing redundancy between theentity classes 106A-E, flattening a hierarchy of thedata model 102, normalizing thedata model 102 and increasing reuse among the objects of thedata model 102. - The
data model 102 may include an organization and structure of data. Thedata model 102 may include a description or representation of entities or objects of a data structure and the relationships that exist among them. Thedata model 102 may map concepts and relationships and describe the semantics of a business system, or otherwise provide a conceptual schema of the system. Thedata model 102 may include, for example, descriptions of objects of significance to an organization about which it may collect information (e.g.,entity classes 106A-E), characteristics of the objects of significance (e.g., attributes 108) and associations between the objects (e.g., relationships between theentity classes 106A-E). - The
entity classes 106A-E may include main objects, classes and/or concepts of thedata model 102 and may represent generic real-world objects. Theentity classes 106A-E may identify which concepts represented by thedata model 102 are of the greatest importance and may provide a structure by which data may be associated with those concepts and may be organized within thedata model 102. For example, adata model 102 of a business organization may includeentity classes 106A-E pertaining to customer information, sales associates, sales territories, sales leads and invoices. - For example, as just referenced, the
data model 102 may represent a business system in which sales leads are generated, estimates provided, services or products are rendered and invoices are generated, among other activities. Then, for example, theentity classes 106A-E of thedata model 102 may includecustomer information 106A,price estimates 106B,sales orders 106C,invoices 106D andpayment receipts 106E. In other exemplary implementations, theentity classes 106A-E may include additional and/or different entity classes as may be deemed appropriate per the represented system or data. - The
attributes 108 may include defined characteristics or properties of theentity classes 106A-E. Theattributes 108 may include members of or informational subsets of the broader entity class concepts. For example, anentity class 106A for storing customer information may include theattributes 108 first name, last name, prefix, address and credit score. - Relationships between the
entity classes 106A-E may indicate explicit and/or implicit constraints and/or limitations on the data of thedata model 102. The relationships may be indicated, in the example ofFIG. 1 , by the lines shown connecting theentity classes 106A-E within thedata model 102. A relationship may indicate a dependency or inheritance between two or more of theentity classes 106A-E, such as a parent-child relationship. For example, theentity class 106A may be associated with general vehicle identification information and theentity class 106B may be more particularly associated with truck information. Then for example, theentity class 106B may inheritattributes 108 of theentity class 106A, whereby a truck is a type of vehicle and a relationship line may indicate such a dependency within thedata model 102. In other exemplary implementations, the relationships may indicate associations other than the parent-child, or other dependency or inheritance relationship, between two or more of theentity classes 106A-E. - The
entity classes 106A-E may each have aname 110. Thename 110 may indicate an identification of theentity classes 106A-E. For example, theentity class 106D may be associated with payment(s) received from customers and thus may include thename 110 Payment_History. Generally, to improve clarity of thedata model 102, thenames 110 may be associated with or otherwise indicate the concept related to theentity classes 106A-E; however,entity classes 106A-E with seemingly irrelevant and/orredundant names 110 also may exist in thedata model 102. - The data model optimizer 104 (DMO 104), as referenced above, may optimize the
data model 102. Such optimization may include adding, removing or otherwise modifying theentity classes 106A-E, attributes 108,names 110 and/or relationships of thedata model 102. TheDMO 104 may, through the use of one or more components, refine thenames 110 and/or attributes 108 to conform to a namingconvention 112, determine the semantics used in naming theentity classes 106A-E and attributes 108 of thedata model 102, compare the semantics to the relationships between the objects of thedata model 102 and determine and perform one or more optimization and/or normalization procedures on the data model. - The
DMO 104 may include a namingengine 114, asemantics analyzer 120, acomparator 126, amodeler 132, astatistics system 138 and amodification handler 140. TheDMO 104 may interface with arepository 122, where therepository 122 may include the namingconvention 112. Therepository 122, among other things, may include information that may be used and/or referenced by theDMO 104 to determine the semantics of thedata model 102. - The naming
convention 112 may include a standard or criterion associated with naming objects of thedata model 102. The namingconvention 112 may provide a syntax-neutral and technology-independent way to improve reuse, interoperability and consistency across theentity classes 106A-E and/or attributes 108 of thedata model 102. Anexample naming convention 112 may include the core component technical specification (CCTS). - According to an example embodiment, the naming
convention 112 may include a three-part naming convention. As shown inFIG. 2 , the three-part naming convention, or dictionary entry name (DEN) may use data elements similar to those used in formulating sentences. The DEN may include three main components, an object term or object class term (OCT) 202, a property term (PT) 204 and a representation term (RT) 206. The DEN, in other implementations, may include additional terms such as a property qualifier (PQ) 208. As shown inFIG. 2 , each DEN component may map to a portion of a natural language sentence. TheOCT 202 may correspond to the object “purchase order” of the sentence, thePT 204 may correspond to the verb “supply” of the sentence, theRT 206 may correspond to the subject “item” of the sentence and thePQ 208 may correspond to the predicate of the sentence. - Using the DEN approach (e.g., naming convention 112) may allow the
names 110 of theentity classes 106A-E to be based on natural language and implemented in a standard form. The DEN approach may allow for a more precise method of naming theentity classes 106A-E and attributes 108, where the semantics of each component (e.g., 202, 204, 206 and 208) may be processed individually and/or as a group by theDMO 104. - In
FIG. 1 , the namingengine 114 may be configured to refine the names 110 (and attributes 108) of thedata model 102 to conform to the namingconvention 112. The namingengine 114 may add, remove or otherwise modify characters or terms of thenames 110 to conform them to the namingconvention 112. For example, the namingconvention 112 may prohibit one or moreinvalid terms 116 from appearing in thenames 110. - The
invalid terms 116 may include any combination of one or more letters, symbols, whitespace, numbers or other characters that may not to be included in names conforming to the namingconvention 112. For example, theinvalid terms 120 may include the terms “ ” (e.g., whitespace) and “_”. Then, for example, the namingengine 114 may parse thenames 110 for theinvalid terms 116. If the namingengine 114 finds aninvalid term 116 within thename 110, the namingengine 114 may remove theinvalid term 116 from thename 110. - The naming
engine 114 may produce arefined name 118. Therefined name 118 may include thename 110, after processing by the namingengine 114, as it conforms to the namingconvention 112. For example, thename 112 of theentity class 106C may be “Customer Information”. Then, for example, the namingengine 114 may apply the rules or standards of the naming convention 112 (e.g., no invalid terms 116) and may produce therefined name 118 “CustomerInformation”. It is possible that therefined name 118 may include thename 110 without any modification, for instance, if the namingengine 114 determines thename 110 already conforms to the namingconvention 112 or the recommended modifications were rejected. - The semantics analyzer 120 may be configured to compare the
refined names 118 to therepository 122 to determine asemantic meaning 124 of therefined names 118. Therepository 122 may include a library of various components or resources that may be used to determine thesemantic meaning 124 of thenames 110 and/or attributes 108. Therepository 122 may include adictionary 123 such as, for example, the WordNet® dictionary that may determine associations between words or terms. For example, thedictionary 123 may include terms of words that may be connected through various relationships, including synonyms, homonyms and antonyms. Then, for example, thesemantic analyzer 120 may compare one or more portions of therefined name 118 to therepository 122, including thedictionary 123, to determine thesemantic meaning 124. The WordNet® dictionary (e.g., 123) may include multiple synonym sets for determining the semantic meanings orsemantic distance 128 between words or terms. Therepository 122, including the WordNet® dictionary, is further discussed below with reference toFIG. 3 . - In addition to including the
dictionary 123, therepository 122 may include business objects 125A, theInternet 125B,data models 125C and/or any other combination or resources that may be used to determine thesemantic meaning 124 of one or more words. The business objects 125A may include entity classes (e.g., 106A-E) or business documents used in one or systems. For example, the business objects 125A may include a sales estimate form. TheInternet 125B may include any resources available via the Internet, including the WordNet® dictionary (e.g., dictionary 123). Thedata models 125C may include any additional or existingdata models 102 that may be stored in therepository 122. Though discussed with reference to asingle data model 102, it should be understood that thedata model 102 may refer toseveral data models 102 that are to be combined in some way and optimized by theDMO 104. - The
semantic meaning 124 may include an interpretation of the linguistics or semantics used in the refined name 118. The semantic meaning 124 may include, for example, a set of synonyms or other words that may be semantically relevant to one or more terms of the refined name 118. For example, the refined name 118 “Customer” may result in the semantic meaning 124 “client, consumer and purchaser”. Example operations related to determining the semantic meaning 124 are discussed below in greater detail with reference to FIG. 9. - The naming
engine 114 may use the semantic meaning 124 to further refine the refined name 118 per the naming convention 112. The naming convention 112 may prohibit complex terms 126; based on the semantic meaning 124, the naming engine 114 may identify and remove complex terms 126 from the refined names 118. - The
complex terms 126 may include a combination of two or more terms that may be prohibited by the naming convention 112. The complex terms 126 may include a combination of two or more terms with two or more distinct semantic meanings 124 that are joined together as a complex term but retain the two or more distinct semantic meanings 124. For example, an unacceptable complex term 126 may include StreetAddress, where the combination of the term “Street” with the term “Address” is unacceptable because both terms, when joined as a complex term, retain their separate meanings. The naming engine 114 may compare the semantic meaning 124 of “Street” and “Address” with “StreetAddress” to determine that a complex term 126 exists. The naming engine 114 may then, for example, split “StreetAddress” into two simpler terms, “Street” and “Address”. Then, for example, the naming engine 114 may determine that “Street” is part of “Address”, and thus “StreetAddress” may be simplified to the refined name 118 “Address”. In another example, an acceptable complex term may include PostOfficeBox, where the terms “PostOffice” (or even “Post” and “Office”) and “Box”, when combined, create a new joined meaning in “PostOfficeBox”, where “PostOfficeBox” may have a singular and distinct semantic meaning 124. - The
comparator 126 may be configured to determine a semantic distance 128 between the terms within a refined name 118 and/or between two or more names 110 or attributes 108. The semantic distance 128 may include an indication of the distance or variance between the semantic meanings 124 of two or more terms. For example, the semantic distance 128 may indicate that the distance between “Vehicle” and “Truck” is less than the distance between “Vehicle” and “Man”, as a truck is a type of vehicle. - According to another exemplary implementation, the
semantic distance 128 may indicate a distance from a shared or root term. For example, the comparator 126 may determine a semantic distance between “Car” and “Truck” by comparing the semantic distances 128 of “Car” and “Truck” from a root term they share, such as “Vehicle”. Then, for example, the semantic distance 128 between “Car” and “Truck” may be determined by adding the semantic distance between “Vehicle” and “Car” to the semantic distance between “Vehicle” and “Truck”. - The
comparator 126 may compare the semantic distance 128 to a semantic threshold distance 130 to determine whether the semantic distance 128 exceeds the semantic threshold distance 130. The semantic threshold distance 130 may indicate the furthest or greatest acceptable semantic distance (e.g., 128) allowed between two or more terms. Then, based on whether the semantic distance 128 exceeds the semantic threshold distance 130, the comparator 126 may modify one or more terms of a refined name 118. For example, a refined name 118 may include the terms “Vehicle” and “Man”. Then, for example, the semantic distance 128 between “Vehicle” and “Man” may be of the value 3 (e.g., Vehicle→Manufacturer→Employee→Man), where the semantic threshold distance 130 may indicate a value of 2. The comparator 126 may then remove, or recommend removal of, “Man” from the refined name 118. The semantic threshold distance 130 may indicate a first threshold for determining similarity between terms of a refined name 118, a second threshold for determining similarity between two or more names 110 of various entity classes 106A-E and a third threshold for determining similarity between two or more attributes 108. Example operations associated with determining the semantic distance 128 are further discussed below with reference to FIG. 3. - The
modeler 132 may be configured to determine whether a substantial similarity exists between two or more objects of the data model 102 and to determine one or more modifications 134 that may be made to optimize the data model 102. The modeler 132 may determine whether the substantial similarity exists between two or more entity classes 106A-E. The substantial similarity may indicate that two or more of the entity classes 106A-E are substantially, semantically similar. The modeler 132 may, for example, first compare the semantic meanings 124 of the refined names 118 of the first entity class 106B and the second entity class 106C. The modeler 132 may then, for example, compare the semantic meaning 124 of the attributes 108 of the entity classes 106B and 106C. The modeler 132 may determine that a substantial similarity exists between the entity class 106B and the entity class 106C based on the semantic meaning 124 comparisons of the names 110 (e.g., refined names 118) and the attributes 108 (e.g., being less than the semantic threshold distance 130). - Based on whether a substantial similarity exists, the
modeler 132 may determine and/or perform one or more modifications 134 to the data model 102. The modification 134 may include any modification designed to optimize, normalize or otherwise increase the efficiency of the data model 102 including, for instance, conforming the names 110 and/or attributes 108 to the naming convention 112, as discussed above. The modifications 134 may include adding, removing, moving or otherwise modifying the entity classes 106A-E, names 110, attributes 108 and/or relationships of the data model 102. - As just referenced, the
modeler 132 may create new entity classes in the data model 102. For example, the modeler 132 may determine that multiple attributes 108 are used in a first entity class 106C and a second entity class 106A. Then, for example, the modeler 132 may create a new entity class that includes the attributes 108 used in both the first and second entity classes 106C and 106A. The modeler 132 may then, for example, establish relationships in the data model 102 between the first entity class 106C, the second entity class 106A and the newly created entity class. - The
modification 134 may be based, at least in part, on a usage 136 as determined by the statistics system 138. The statistics system 138 may be configured to track how often an entity class 106A-E and/or attribute 108 is used (e.g., usage 136) within the data model 102. Then, for example, based on the usage 136, the modeler 132 may determine which entity classes 106A-E and/or attributes 108 may be removed or otherwise modified in the data model 102. For example, the modeler 132 may determine that a substantial similarity (e.g., redundancy) exists between a first entity class 106D and a second entity class 106E. Then, for example, the usage 136 may indicate that the second entity class 106E is used far more often than the first entity class 106D. Then, for example, the modification 134 may indicate removing the first entity class 106D or collapsing it into the second entity class 106E, based on the usage 136 (and adjusting the relationships accordingly). - The
modification handler 140 may be configured to generate a message 142 based on the modification 134. The message 142 may include a notification to a user 144 about the modification 134 to be performed on the data model 102. Then, for example, based on the message 142, as provided to the user 144 via an interface 146, the user 144 may accept or reject the modification 134. Based upon the response of the user 144 to the message 142, the modeler 132 may then either implement or disregard the modification 134. -
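The usage-based collapse and the accept/reject step described above may be sketched, under stated assumptions, as the following Python fragment. The `EntityClass` type, the `propose_collapse` and `apply_if_accepted` helpers, the dictionary-based model and the `approve` callback are hypothetical names introduced for illustration only; the callback merely stands in for the message 142 and the response of the user 144.

```python
from dataclasses import dataclass, field


@dataclass
class EntityClass:
    """Hypothetical stand-in for an entity class 106 with a usage count 136."""
    name: str
    attributes: set = field(default_factory=set)
    usage: int = 0


def propose_collapse(first, second):
    """Return (source, target): the less-used class is collapsed into the other."""
    if first.usage <= second.usage:
        return first, second
    return second, first


def apply_if_accepted(model, source, target, approve):
    """Ask for approval; merge source into target only if the user accepts.

    `model` is assumed to be a dict mapping class names to EntityClass objects.
    `approve` is a callable that receives the message text and returns a bool.
    """
    message = f"Collapse '{source.name}' into '{target.name}'?"
    if not approve(message):
        return False  # modification disregarded, model left unchanged
    target.attributes |= source.attributes
    target.usage += source.usage
    del model[source.name]
    return True
```

A caller might pass `approve=lambda message: input(message + " [y/n] ") == "y"` to approximate the interface 146; relationship adjustments are omitted from this sketch.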
Data models 102 may be complex structures based on detailed semantics and/or linguistics that may be specialized in different industries. Allowing the user 144 to accept or reject modifications 134 that optimize or modify one or more aspects of the data model 102 provides the system 100 with another level of error checking, such that the DMO 104 may be prevented from performing modifications 134 to aspects of the data model 102 that, while perhaps not optimal (as determined by the DMO 104), may have been purposely included in the data model 102. Then, for example, based on the user's 144 acceptance/rejection of the modification 134, the DMO 104 may recursively analyze the data model 102 and determine additional modifications 134 that may be made. - The recursive processing of the
data model 102 by the DMO 104, where the user 144 may accept/reject modifications 134, allows the DMO 104 to implement a semi-automatic, or recursive, optimization process with regard to the data model 102. The semi-automatic process may allow the DMO 104 to automatically determine which modifications 134 are to be made to the data model 102, but also may allow the user 144 to determine whether the semantic analysis performed by the DMO 104 is accurate with regard to the data model 102. - As referenced above, the
DMO 104 may both flatten the hierarchy of the data model 102 and normalize the data model 102. In achieving these two goals, the DMO 104 may balance the objective of normalization, which is to increase reuse of objects of the data model 102, against the objective of flattening the hierarchy of the data model 102, which is to have fewer recursive levels of inheritance between the objects of the data model 102. The DMO 104 may balance these two objectives and modify the data model 102 based on a determination as to which objective will lead to greater optimization of the data model 102. -
FIG. 3 is an example implementation of a dictionary 123, according to an exemplary implementation. FIG. 3 illustrates an example mapping of terms within the dictionary 123 (of FIG. 1) of the repository 122. The mapping of FIG. 3 may be used, for example, to determine a semantic distance (e.g., 128) between two or more terms or a semantic meaning (e.g., 124) of a term. For example, the semantic meaning of “market order” 302 may include “commercial document” 304 and “purchase order” 306. Or, for example, the semantic distance between “stop-loss order” 308 and “stop order” 310 may be 1, such that each term is 1 term removed from a root or common term (e.g., “order” 312/“purchase order” 306). The lines between the nodes of the mapping of FIG. 3 may indicate relationships between the terms of each node. Example relationships may include synonyms, homonyms, and antonyms. -
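One way to picture the semantic distance 128 over a mapping such as that of FIG. 3 is as a shortest-path length in a graph of terms. The following Python sketch assumes that each relationship counts as one step, which the description does not mandate; the `EDGES` list, the function names and the breadth-first search are illustrative assumptions, with the terms taken from the “Vehicle”, “Car”, “Truck” and “Man” examples.

```python
from collections import deque

# Hypothetical term graph; each edge stands in for one dictionary relationship.
EDGES = [
    ("Vehicle", "Car"),
    ("Vehicle", "Truck"),
    ("Vehicle", "Manufacturer"),
    ("Manufacturer", "Employee"),
    ("Employee", "Man"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (term, term) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def semantic_distance(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        term, dist = queue.popleft()
        if term == goal:
            return dist
        for nxt in graph[term]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_via_root(graph, a, b, root):
    """Add the distances of both terms from a shared root term."""
    da = semantic_distance(graph, root, a)
    db = semantic_distance(graph, root, b)
    if da is None or db is None:
        return None
    return da + db


def exceeds_threshold(distance, threshold):
    """Treat unreachable terms (None) as exceeding any threshold."""
    return distance is None or distance > threshold


GRAPH = build_graph(EDGES)
```

With this toy graph, “Vehicle” and “Man” are three steps apart, which exceeds a threshold of 2, while “Car” and “Truck” are each one step from the shared root “Vehicle”.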
FIG. 4 is an example flowchart 400 illustrating example operations associated with optimizing a data model 102, according to an exemplary implementation. The flowchart 400 may provide that after each step (e.g., 402, 404, 406, 408, 410, 412, 414 and 416) the message 142 may be generated at step 418, and a user 144 may then decline or accept the modification 134. Upon acceptance of the modification 134, the flowchart may return to step 402. This may allow the DMO 104 to recursively optimize one or more data models 102. If the modification 134 is rejected, the optimization may be completed at 420 and restarted at a later time. - At 402, the dictionary entry name (DEN) of the
entity class 106 may be cleaned up. For example, the naming engine 114 may remove invalid terms 116 from the names 110 of entity classes 106A-E. As referenced above, if a name 110 is found to contain an invalid term 116, then a message 142 may be generated requesting acceptance of this modification 134 prior to completion by the DMO 104. - At 404, empty data structures may be removed from the
data model 102. For example, the DMO 104 may determine which of the entity classes 106A-E of the data model 102 contain no attributes 108. Then, for example, the DMO 104 may remove any of the empty entity classes 106A-E from the data model 102. The DMO 104 also may adjust relationships between the removed entity class and the remaining entity classes to reflect the removal. - At 406, non-valid terms may be cleaned up. For example, the naming
engine 114 may modify complex terms 126 in the refined names 118 to simpler terms. - At 408, entities with non-relevant semantics may be cleaned up. For example, the
comparator 126 may determine whether the semantic distance 128 between two or more terms of a name 110 (including refined name 118) and/or attribute is less than the semantic threshold distance 130. If the comparator 126 determines, for example, that the semantic distance 128 between two or more terms of a single name 110 is greater than the semantic threshold distance 130, then one or more of the terms may be modified or cleaned up. - At 410, redundancy may be cleaned up. For example, if the
entity class 106 contains multiple redundant attributes 108, the modeler 132 may consolidate, remove or otherwise clean up the redundancies. An example redundancy between two attributes 108 may be based on a semantic similarity between their names (e.g., the semantic distance 128 being less than the semantic threshold distance 130) and a similar type definition. The type definition may include any number of descriptions as to the type of information that may be stored by the attribute 108, such as integer, whole number, string and currency. - At 412, type codes may be extracted. An
attribute 108 may include a type definition to describe the information to be stored in the attribute. For example, a first name attribute 108 may include the type definition ‘string,’ and an age attribute 108 may include the type ‘integer.’ Then, for example, the type ‘code’ may indicate that a separate file or code list may be referenced to determine what the code refers to. For example, an attribute error code may include a type ‘code’ which makes reference to a list of error codes. Then, for example, the list of error codes may include code 32 indicating a printer error, code 43 indicating a software error and code 55 indicating a network error. The modeler 132 may determine that this attribute 108 includes a type code and may extract it and modify the entity class 106A-E and/or attribute 108 accordingly. - At 414, entities may be consolidated. For example, if the
modeler 132 determines a substantial similarity between two or more entity classes 106A-E, then the redundancy may be cleaned up. For example, one of the redundant entity classes may be removed, or, for example, the redundant entity classes may be consolidated into fewer entity classes. - At 416, the
data model 102 may be normalized or de-normalized, as appropriate, to improve optimization. The DMO 104 may compare the semantic meanings 124 of the names 110 and the attributes 108 of the entity classes 106A-E to determine whether the attributes 108 are properly positioned within the data model 102. If the DMO 104 determines that an attribute 108 of a first entity class (e.g., 106A) is more closely associated with the attributes 108 of a second entity class (e.g., 106B), the DMO 104 may move the attribute 108 into the second entity class. - The
DMO 104 also may determine whether or not to create new entity classes within the data model 102. For example, if multiple attributes 108 of an entity class 106 have similar values (e.g., property qualifiers 208), those attributes 108 may be extracted from the current entity class 106 and placed into a new entity class. An example of this normalization may be seen in FIG. 5. -
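The extraction of attributes that share a property qualifier into a new entity class may be sketched as follows. This is a minimal Python illustration that assumes each attribute is represented as a `(property_qualifier, property_term)` pair and that two or more attributes with the same non-empty qualifier justify a new class; both the representation and the `min_shared` parameter are assumptions made for the example, and exact matching is used in place of semantic similarity.

```python
from collections import defaultdict


def extract_by_qualifier(attributes, min_shared=2):
    """Split attributes into those that stay and those that form new classes.

    attributes: list of (property_qualifier, property_term) tuples.
    Returns (remaining, new_classes), where new_classes maps a qualifier
    to the property terms moved into a new entity class of that name.
    """
    groups = defaultdict(list)
    for qualifier, term in attributes:
        groups[qualifier].append(term)

    remaining, new_classes = [], {}
    for qualifier, terms in groups.items():
        if qualifier and len(terms) >= min_shared:
            new_classes[qualifier] = terms  # candidates for a new entity class
        else:
            remaining.extend((qualifier, term) for term in terms)
    return remaining, new_classes
```

For instance, an “Item” class holding two attributes qualified by “Shipment” would yield a candidate “Shipment” class, in the spirit of the division of “Item” into “Item” and “Shipment”.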
FIG. 5 illustrates an example normalization process of the data model optimizer (DMO) 104, according to an exemplary implementation. In FIG. 5, the entity class “Item” (e.g., OCT 202) may be shown as block 502 prior to normalization. Blocks - An object class qualifier (OCQ) 506 may be used to further describe or qualify the object class term (OCT) 202. For example, each of the
items 202 may be part of the purchase order line 506. The columns attribute 108. - The
DMO 104 may normalize the data model 102 into a third normalization form. In a first normalization form, the attributes 108 of an entity class 106A-E may each be atomic or include a single value. For example, a name attribute 108 including a person's first and last name (e.g., Mark Smith) may be separated into a first name attribute 108 (e.g., Mark) and a last name attribute 108 (e.g., Smith). - In a second normalization form, every
non-key attribute 108 of an entity class 106A-E may depend on the primary key of the entity class 106A-E. For example, the city field of an attribute 108 may depend on the zip code field of the attribute 108. - In the third normalization form, each
attribute 108 of an entity class 106A-E may depend on nothing other than the primary key. For example, in a customer address entity class 106, the zip code may serve as a primary key upon which the city and state are dependent. The DMO 104 may normalize the data model 102 with regard to the three normalization forms. -
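The two examples given for the normalization forms, splitting a composite name and keying city and state on the zip code, may be illustrated with a short Python sketch. Records are modeled here as plain dictionaries, and the field names (`name`, `zip`, `city`, `state`) and helper names are assumptions chosen for the example; the sketch also follows the description in assuming that a zip code determines a single city and state.

```python
def split_name(record):
    """First-form example: replace a composite 'name' with atomic attributes."""
    first, _, last = record["name"].partition(" ")
    out = {key: value for key, value in record.items() if key != "name"}
    out["first_name"] = first
    out["last_name"] = last
    return out


def extract_zip_table(records):
    """Third-form example: move city/state, which depend on zip, to a lookup.

    Returns (slim_records, zip_table), where zip_table is keyed by zip code.
    """
    zip_table = {}
    slim_records = []
    for record in records:
        zip_table[record["zip"]] = {"city": record["city"], "state": record["state"]}
        slim_records.append(
            {key: value for key, value in record.items() if key not in ("city", "state")}
        )
    return slim_records, zip_table
```

After `extract_zip_table`, each customer address record keeps only its zip code, and the city and state are stored once per zip code rather than once per record.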
FIG. 6 is an example flowchart of a process 600 illustrating example operations associated with optimizing a data model 102, according to an exemplary implementation. Process 600 may include refining the name of each of the entity classes to conform to a naming convention (610), determining a semantic meaning of each of the refined names and the one or more attributes of each entity class (620), determining that the name of a first entity class is semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class (630), determining a substantial similarity between the first entity class and the second entity class based on a comparison of the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class (640) and normalizing the data model based on the substantial similarity (650). - At 610, in
FIG. 6 , the name of each of the entity classes may be refined to conform to a naming convention. For example, inFIG. 1 , the namingengine 114 may refine thenames 110 of theentity classes 106A-E to conform to the namingconvention 112. - At 620, a semantic meaning for each of the refined names and the one or more attributes of each entity class may be determined. For example, the
semantics analyzer 120 may determine the semantic meaning 124 for the refined names 118 and the attributes 108 of the entity classes 106A-E. - At 630, the name of a first entity class may be determined to be semantically similar to the name of a second entity class based on a semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class. For example, the
comparator 126 may determine that the semantic meaning 124 of the entity class 106A is semantically similar to the semantic meaning 124 of the entity class 106B based on the semantic distance 128 being less than the semantic threshold distance 130. - At 640, a substantial similarity between the first entity class and the second entity class may be determined based on a comparison of the semantic meaning of the one or more attributes of the first entity class to the semantic meaning of the one or more attributes of the second entity class. For example, the
modeler 132 may determine a substantial similarity between the entity class 106A and the entity class 106B based on a comparison (e.g., semantic distance 128 being less than the semantic threshold distance 130) of the semantic meanings 124 of the attributes 108 of the entity class 106A and the entity class 106B. - At 650, the data model may be normalized based on the substantial similarity. For example, the
modeler 132 may apply the modification 134 to normalize the data model 102. In FIG. 5, for example, the “Item” entity class in 502 may be divided into the entity classes “Item” and “Shipment” in 504. -
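The operations 610 through 640 of process 600 may be sketched end to end in Python as shown below. The sketch substitutes shared membership in a synonym set for the semantic distance comparison, which is a simplification and not the method of the description; the `SYNSETS` data, the CamelCase-style `refine_name` rule and the `min_overlap` parameter are likewise hypothetical.

```python
# Hypothetical synonym sets standing in for the dictionary 123.
SYNSETS = [
    {"customer", "client", "consumer", "purchaser"},
    {"name", "title"},
    {"address", "location"},
]


def refine_name(name):
    """610 (assumed convention): drop whitespace and capitalize each word."""
    return "".join(word[:1].upper() + word[1:] for word in name.split())


def meaning(term):
    """620: return the synonym set containing the term, or the term alone."""
    lowered = term.lower()
    for synset in SYNSETS:
        if lowered in synset:
            return frozenset(synset)
    return frozenset({lowered})


def similar(a, b):
    """630: treat two terms as similar when their meanings overlap."""
    return bool(meaning(a) & meaning(b))


def substantially_similar(attrs_a, attrs_b, min_overlap=0.5):
    """640: fraction of attrs_a with a similar attribute in attrs_b."""
    if not attrs_a:
        return False
    matched = sum(any(similar(a, b) for b in attrs_b) for a in attrs_a)
    return matched / len(attrs_a) >= min_overlap


def should_normalize(class_a, class_b):
    """Decide whether 650 applies; each class is a (name, attributes) pair."""
    name_a, attrs_a = class_a
    name_b, attrs_b = class_b
    names_match = similar(refine_name(name_a), refine_name(name_b))
    return names_match and substantially_similar(attrs_a, attrs_b)
```

Under these assumptions, a “Customer” class with attributes “Name” and “Address” would be flagged against a “Client” class with attributes “Title” and “Location”; the actual modification 134 at 650 is left out of the sketch.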
FIG. 7 is an example flowchart of a process 700 illustrating example operations associated with optimizing a data model 102, according to an exemplary implementation. Process 700 may include removing invalid terms from the one or more terms of the names (710), identifying a complex term in one or more terms of the names, the complex term including a combination of two or more simpler terms (720), separating the identified complex term in the names to reflect the two or more simpler terms (730), determining, for each of the names including two or more terms, a semantic distance between the two or more terms of the name (740) and removing from each of the names including two or more terms, one or more terms with the semantic distance being greater than a threshold semantic distance (750). - At 710, in
FIG. 7 , invalid terms may be removed from the one or more terms of the names. For example, inFIG. 1 , the namingengine 114 may removeinvalid terms 116 from thenames 110. - At 720, a complex term may be identified in one or more of the names, where the complex term may include a combination of two or more simpler terms. For example, the naming
engine 114 may identify complex terms 126 in one or more of the refined names 118. - At 730, the identified complex term may be separated in the names to reflect the two or more simpler terms. For example, the naming
engine 114 may separate the complex terms 126 identified in the refined name 118. Then, for example, the naming engine 114 may separate the complex terms 126 into simpler terms in the refined name 118, or may remove one or more of the simpler terms. - At 740, for each of the names including two or more terms, a semantic distance between the two or more terms of the name may be determined. For example, the
comparator 126 may determine the semantic distance 128 between two or more terms of the refined name 118. - At 750, from each of the names including two or more terms, one or more terms with the semantic distance being greater than a threshold semantic distance may be removed. For example, the
modeler 132 may remove one or more terms with a semantic distance 128 being greater than the semantic threshold distance 130. -
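A compact Python sketch of process 700 is given below under several assumptions: names are CamelCase strings, invalid terms come from a fixed hypothetical list, and semantic distances are read from a small hypothetical lookup table rather than from the dictionary 123. The sketch splits a name into terms before filtering invalid terms, since it needs the individual terms to do so, and it keeps any term whose distance is unknown. Acceptable complex terms such as “PostOfficeBox” are not handled.

```python
import re

# Hypothetical data; a real system would consult the repository 122.
INVALID_TERMS = {"Misc", "Tmp"}
DISTANCES = {
    frozenset({"Vehicle", "Man"}): 3,
    frozenset({"Vehicle", "Truck"}): 1,
}


def split_terms(name):
    """720/730: split a CamelCase name into its simpler terms."""
    return re.findall(r"[A-Z][a-z0-9]*", name)


def remove_invalid(terms):
    """710: drop terms that appear in the invalid-term list."""
    return [term for term in terms if term not in INVALID_TERMS]


def prune_distant(terms, threshold):
    """740/750: keep the first term; drop later terms too far from it."""
    if not terms:
        return []
    head = terms[0]
    kept = [head]
    for term in terms[1:]:
        dist = DISTANCES.get(frozenset({head, term}))
        if dist is None or dist <= threshold:  # unknown distances are kept
            kept.append(term)
    return kept


def refine(name, threshold=2):
    """Run the sketched steps and rejoin the surviving terms."""
    terms = remove_invalid(split_terms(name))
    return "".join(prune_distant(terms, threshold))
```

With the table above and a threshold of 2, `refine("VehicleMan")` drops “Man”, mirroring the “Vehicle” and “Man” example, while `refine("VehicleTruck")` is left unchanged.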
FIG. 8 is an example flowchart of a process 800 illustrating example operations associated with optimizing a data model 102, according to an exemplary implementation. Process 800 may include determining that the names of the plurality of entity classes and the one or more attributes conform to a core component technical specification (CCTS) naming convention, wherein each name and attribute includes an object term, a property qualifier and a property term (810), determining that each of the one or more attributes of the first entity class includes a unique property qualifier (820), determining that none of the plurality of entity classes includes multiple attributes that are substantially similar to the one or more attributes of the first entity class (830), determining, for each of the one or more attributes, a first medial distance between the attribute of the first entity class and the name of the first entity class, and a second medial distance between the attribute of the first entity class and the names of one or more entity classes, from the plurality of entity classes, associated with the first entity class, based on a semantic distance between the object terms, property qualifiers and property terms (840) and normalizing, for each of the one or more attributes, the first entity class such that the first medial distance between the attribute and the first entity class is less than the second medial distance between the attribute and any of the one or more associated entity classes (850). - In 810, in
FIG. 8 , the names of the plurality of entity classes and the one or more attributes may be determined to conform to a core component technical specification (CCTS) naming convention, where each name and attribute includes an object term, a property qualifier and a property term. For example, inFIG. 1 , the namingengine 114 may determine that thenames 110 and attributes 108 of theentity classes 106A-E conform to theCCTS naming convention 112. Eachname 110 andattribute 108, as shown inFIG. 2 , may include theobject term 202, theproperty qualifier 208 and theproperty term 204. - In 820, each of the one or more attributes of the first entity class may be determined to include a unique property qualifier. For example, in
FIG. 5 , themodeler 132 may determine that each of theattributes 108, of the “Item” entity class in 504, includes aunique property qualifier 208. - In 830, it may be determined that none of the plurality of entity classes includes multiple attributes that are substantially similar to the one or more attributes of the first entity class. For example, in
FIG. 5 , themodeler 132 may determine that none ofattributes 108 of the “Shipment” entity class in 504 are substantially similar to theattributes 108 of the “Item” entity class in 504. - In 840, for each of the one or more attributes of the first entity class, it may be determined that a first medial distance between the attribute of the first entity class and the name of the first entity class, and a second medial distance between the attribute of the first entity class and the names of one or more entity classes, from the plurality of entity classes, associated with the first entity class, based on a semantic distance between the object terms, property qualifiers and property terms. For example, in
FIG. 1 , themodeler 132 may determine a first medial distance between afirst attribute 108 of theentity class 106A and thename 110 of theentity class 106A. The medial distance may be determined based on thesemantic distance 128 between theobject terms 202,property qualifiers 208 andproperty terms 204. Similarly, themodeler 132 may determine a second medial distance between afirst attribute 108 of theentity class 106A and thename 110 of asecond entity class 106B. - In 850, for each of the one or more attributes of the first entity class, the first entity class may be normalized such that the first medial distance between the attribute and the first entity class is less than the second medial distance between the attribute and any of the one or more associated entity classes. For example, in
FIG. 1 , themodeler 132 may determine themodification 134 to normalize thedata model 102 such that the first medial distance between thefirst attribute 108 of thename 110 of theentity class 106A is less than the second medial distance between thefirst attribute 108 of thename 110 of the other associatedentity classes 106B-E. If for example, themodeler 132 determines that the first medial distance is greater than any of the second medial distances, themodeler 132 may then perform amodification 134 to move thefirst attribute 108 into adifferent entity class 106B-E such that the first medial distance is less than the second medial distances. -
FIG. 9 is an example flowchart of a process 900 illustrating example operations associated with determining a semantic meaning 124, according to an exemplary implementation. The process 900 illustrates example operations of the semantics analyzer 120, of FIG. 1, in determining the semantic meaning 124 of the names 110 (including refined names 118) and/or attributes 108 of the entity classes 106A-E of the data model 102. - At 901, processing begins.
Steps semantic meaning 124. 902 indicates various data sources that may be used to perform any of the checks 904A-E. - At 904A, the relationship between the information available in the
repository 122 may be compared against the DEN (e.g., name 110, refined name 118) of an entity class 106A-E. Then, for example, the closest relationship (e.g., semantic distance 128) between the information in the repository 122 and the DEN of the entity class may be determined in 906, and a rating may be stored in step 908. The rating may indicate a similarity between objects of the repository 122 and objects of the data model 102. A rating (e.g., 908) may be determined for each of the checks 904A-E. In 910, the best or highest rating may be determined and used to determine the semantic meaning 124 of the term. - At 904B, a context or business context of the entity may be determined and the relationship between the
repository 122 and the business context may be determined. For example, various industries may include industry-specific terminology; then, for example, the industry-specific terminology may be used to determine the semantic meaning 124 of the terms. - At 904C, the relationship of the DENs of the
attributes 108 of the entity may be compared to the repository 122 and the closest relationship 906 may be determined. - At 904D and 904E, descriptions of the entity class and
repository 122, respectively, may be determined. The descriptions may include, for example, natural language words, phrases and/or sentences used to describe entities 106A-E and items (e.g., 125A, 125B and 125C) of the repository 122. - While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the various embodiments.
Claims (20)
1. A computer-implemented method for refining a data model having a plurality of
entity classes, the computer-implemented method comprising:
comparing a semantic meaning of a name of a first entity class to a semantic meaning of a second entity class;
comparing a semantic meaning of an attribute of the first entity class to a semantic meaning of an attribute of the second entity class;
determining that a substantial similarity exists between the first entity class and the second entity class based on the comparison of the semantic meaning of the names of the first and second entity classes and the comparison of the semantic meaning of the attributes of the first and second entity classes; and
collapsing the first entity class into the second entity class based on the substantial similarity between the first entity class and the second entity class.
2. The computer-implemented method of claim 1 further comprising refining the names of each of the first and second entity classes to conform to a core component technical specification (CCTS) naming convention.
3. The computer-implemented method of claim 1 further comprising identifying, for the names of the first and second entity classes, a dictionary entry name (DEN) including an object term, a property term, and a representation term.
4. The computer-implemented method of claim 1 wherein the comparing the semantic meanings of the names and attributes comprises determining the semantic meanings of the names and attributes based on a single dictionary including synonym sets.
5. The computer-implemented method of claim 1 wherein the comparing the semantic meaning of the name of the first entity class to the name of the second entity class comprises:
determining a first semantic distance between the semantic meaning of the name of the first entity class and a root entity class from the first and second entity classes;
determining a second semantic distance between the semantic meaning of the name of the second entity class and the root entity class; and
comparing the first semantic distance to the second semantic distance to determine a third semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class.
6. The computer-implemented method of claim 1 wherein the determining that substantial similarity exists between the name of the first entity class and the name of the second entity class comprises determining that semantic distances between the names and attributes are less than a semantic threshold distance.
7. The computer-implemented method of claim 1 wherein the determining that the substantial similarity exists between the first entity class and the second entity class comprises:
determining a first semantic meaning of the attribute of the first entity class;
determining a second semantic meaning of the attribute of the second entity class; and
comparing the first semantic meaning with the second semantic meaning to determine whether the substantial similarity exists.
8. The computer-implemented method of claim 7 wherein the comparing the first semantic meaning with the second semantic meaning comprises:
determining an attribute semantic distance between the first semantic meaning and the second semantic meaning; and
determining that the attribute semantic distance is less than a semantic threshold distance.
9. The computer-implemented method of claim 1 further comprising creating a new entity class that includes the attribute of the first entity class and the attribute of the second entity class.
10. The computer-implemented method of claim 1 wherein the collapsing comprises:
adding at least a portion of the first entity class to the second entity class based on the substantial similarity of the attribute of the first entity class and the attribute of the second entity class; and
removing the first entity class based on the substantial similarity between the first entity class and the second entity class.
11. The computer-implemented method of claim 1 further comprising:
providing a message to a user requesting authorization to perform refining;
receiving the requested authorization; and
refining the names of each of the first entity class and the second entity class to conform to a naming convention based on receiving the requested authorization.
12. The computer-implemented method of claim 1 further comprising:
providing a message to a user requesting authorization to perform the collapsing;
receiving the requested authorization; and
performing the collapsing based on receiving the requested authorization.
13. A system for refining a data model having a plurality of entity classes, the system comprising:
a semantics analyzer configured to determine semantic meanings of names and attributes of the plurality of entity classes of the data model; and
a comparator configured to:
compare a semantic meaning of a name of a first entity class to a semantic meaning of a name of a second entity class;
compare a semantic meaning of an attribute of the first entity class to a semantic meaning of an attribute of the second entity class;
determine that a substantial similarity exists between the first entity class and the second entity class based on the comparison of the semantic meaning of the names of the first and second entity classes and the comparison of the semantic meaning of the attributes of the first and second entity classes; and
collapse the first entity class into the second entity class based on the substantial similarity between the first entity class and the second entity class.
14. The system of claim 13, wherein the comparator is configured to compare the semantic meaning of the name of the first entity class to the name of the second entity class by
determining a first semantic distance between the semantic meaning of the name of the first entity class and a root entity class from the first and second entity classes,
determining a second semantic distance between the semantic meaning of the name of the second entity class and the root entity class, and
comparing the first semantic distance to the second semantic distance to determine a third semantic distance between the semantic meaning of the name of the first entity class and the semantic meaning of the name of the second entity class.
15. The system of claim 13, wherein the comparator is configured to collapse the first entity class into the second entity class by
adding at least a portion of the first entity class to the second entity class based on the substantial similarity of the attribute of the first entity class and the attribute of the second entity class, and
removing the first entity class based on the substantial similarity between the first entity class and the second entity class.
16. A computer-implemented method for normalizing a first entity class of a data model having the first entity class and a second entity class, wherein each of the first and second entity classes includes a name and first and second attributes, the computer-implemented method comprising:
determining that the first and second attributes of the second entity class are not substantially similar to either the first or second attributes of the first entity class;
for each of the first and second attributes of the first entity class:
determining a first medial distance between the attribute of the first entity class and the name of the first entity class; and
determining a second medial distance between the attribute of the first entity class and the name of the second entity class, and
modifying the first entity class by moving one of the first and second attributes of the first entity class to the second entity class so that, for the attribute remaining in the first entity class, the first medial distance between the remaining attribute and the name of the first entity class is less than the second medial distance between the remaining attribute and the name of the second entity class.
17. The computer-implemented method of claim 16 further comprising:
determining that the first and second attributes of the first entity class each include a non-unique property qualifier; and
creating a new entity class based on the non-unique property qualifier and including the first and second attributes of the first entity class.
18. The computer-implemented method of claim 16 wherein:
determining the first medial distance comprises determining a first semantic distance between the attribute of the first entity class and the name of the first entity class; and
determining the second medial distance comprises determining a second semantic distance between the attribute of the first entity class and the name of the second entity class.
19. The computer-implemented method of claim 16 wherein a dependent relationship exists between the first entity class and the second entity class.
20. The computer-implemented method of claim 16 further comprising recursively performing the computer-implemented method for the second entity class and a third entity class of the data model, the third entity class including a name and first and second attributes.
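By way of illustration only, the two operations recited above, collapsing substantially similar entity classes (claims 1, 6 and 10) and normalizing a class by moving an attribute toward the class name it is semantically closer to (claims 16 and 18), might be sketched as follows. Everything in the sketch is an assumption made for the example: the `PARENT` taxonomy, the `EntityClass` type, the function names, and the threshold of 2 do not come from the claims. The distance used is a simple edge count through the nearest shared ancestor in a toy taxonomy. It stands in for the dictionary-based semantic distance and is not the root-based comparison of claim 5.

```python
from dataclasses import dataclass, field

# Hypothetical toy taxonomy (child -> parent), rooted at "entity".
# A real system would consult a dictionary with synonym sets instead.
PARENT = {
    "party": "entity", "location": "entity",
    "customer": "party", "client": "customer",
    "name": "party", "phone": "party",
    "address": "location", "street": "address",
    "city": "address", "postcode": "address",
}


@dataclass
class EntityClass:
    name: str
    attributes: list = field(default_factory=list)


def path_to_root(term):
    """Return [term, parent, ..., root] by following PARENT links."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def semantic_distance(a, b):
    """Count taxonomy edges between a and b through their nearest shared ancestor."""
    steps_from_b = {term: i for i, term in enumerate(path_to_root(b))}
    for i, term in enumerate(path_to_root(a)):
        if term in steps_from_b:
            return i + steps_from_b[term]
    raise ValueError(f"no shared ancestor for {a!r} and {b!r}")


def similar(a, b, threshold=2):
    """Treat two terms as substantially similar when their distance is below the threshold."""
    return semantic_distance(a, b) < threshold


def any_attributes_similar(first, second, threshold=2):
    """True if at least one attribute pair across the two classes is similar."""
    return any(similar(x, y, threshold)
               for x in first.attributes for y in second.attributes)


def collapse(first, second, threshold=2):
    """Fold `first` into `second` if names and attributes are close; return True on merge."""
    if not (similar(first.name, second.name, threshold)
            and any_attributes_similar(first, second, threshold)):
        return False
    for attr in first.attributes:
        if attr not in second.attributes:
            second.attributes.append(attr)
    first.attributes.clear()  # the caller would then drop `first` from the model
    return True


def normalize(first, second, threshold=2):
    """Move attributes of `first` that sit closer to the name of `second`; return them."""
    if any_attributes_similar(first, second, threshold):
        return []  # overlapping classes are candidates for collapsing instead
    moved = [a for a in first.attributes
             if semantic_distance(a, second.name) < semantic_distance(a, first.name)]
    for attr in moved:
        first.attributes.remove(attr)
        second.attributes.append(attr)
    return moved


client = EntityClass("client", ["name", "phone"])
customer = EntityClass("customer", ["name", "street"])
address = EntityClass("address", ["city", "postcode"])

merged = collapse(client, customer)      # client is folded into customer
moved = normalize(customer, address)     # street is closer to "address"
print(merged, moved, customer.attributes, address.attributes)
```

With this toy data, `client` is folded into `customer`, and `street` then moves to `address` because it lies one edge from "address" but five edges from "customer". Claim 16 recites moving one of two attributes, whereas the sketch moves every attribute that is strictly closer to the second class name. That is a simplification and could leave the first class empty on other inputs.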
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/609,052 US20150142841A1 (en) | 2008-11-12 | 2015-01-29 | Data model optimization |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/269,324 US8290989B2 (en) | 2008-11-12 | 2008-11-12 | Data model optimization |
US13/596,532 US8954378B2 (en) | 2008-11-12 | 2012-08-28 | Data model optimization |
US14/609,052 US20150142841A1 (en) | 2008-11-12 | 2015-01-29 | Data model optimization |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/596,532 Division US8954378B2 (en) | 2008-11-12 | 2012-08-28 | Data model optimization |
Publications (1)
Publication Number | Publication Date |
---|---|
US20150142841A1 (en) | 2015-05-21 |
Family ID: 42166150
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/269,324 Active 2029-10-24 US8290989B2 (en) | 2008-11-12 | 2008-11-12 | Data model optimization |
US13/596,532 Active US8954378B2 (en) | 2008-11-12 | 2012-08-28 | Data model optimization |
US14/609,052 Abandoned US20150142841A1 (en) | 2008-11-12 | 2015-01-29 | Data model optimization |
Family Applications Before (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/269,324 Active 2029-10-24 US8290989B2 (en) | 2008-11-12 | 2008-11-12 | Data model optimization |
US13/596,532 Active US8954378B2 (en) | 2008-11-12 | 2012-08-28 | Data model optimization |
Country Status (1)
Country | Link |
---|---|
US (3) | US8290989B2 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866625A (en) * | 2015-06-15 | 2015-08-26 | 苏州大学张家港工业技术研究院 | Method and system for entities matching |
CN105045863A (en) * | 2015-07-13 | 2015-11-11 | 苏州大学张家港工业技术研究院 | Method and system used for entity matching |
CN106354787A (en) * | 2016-08-24 | 2017-01-25 | 齐鲁工业大学 | Entity coreference resolution method based on similarity |
CN112750434A (en) * | 2020-12-16 | 2021-05-04 | 马上消费金融股份有限公司 | Method and device for optimizing voice recognition system and electronic equipment |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8290989B2 (en) | 2008-11-12 | 2012-10-16 | Sap Ag | Data model optimization |
US9852188B2 (en) * | 2014-06-23 | 2017-12-26 | Google Llc | Contextual search on multimedia content |
US10505873B2 (en) | 2014-12-30 | 2019-12-10 | Sap Se | Streamlining end-to-end flow of business-to-business integration processes |
US10192202B2 (en) | 2014-12-31 | 2019-01-29 | Sap Se | Mapping for collaborative contribution |
JP6310150B2 (en) * | 2015-03-20 | 2018-04-11 | 株式会社東芝 | Intent understanding device, method and program |
CN105446957B (en) * | 2015-12-03 | 2018-07-20 | 小米科技有限责任公司 | Similitude determines method, apparatus and terminal |
US10878473B1 (en) * | 2017-11-16 | 2020-12-29 | Amazon Technologies, Inc. | Content modification |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2029601A (en) * | 1999-12-17 | 2001-06-25 | Si Han Kim | Information coding and retrieval system and method thereof |
US7171652B2 (en) * | 2002-12-06 | 2007-01-30 | Ricoh Company, Ltd. | Software development environment with design specification verification tool |
US20060080419A1 (en) * | 2004-05-21 | 2006-04-13 | Bea Systems, Inc. | Reliable updating for a service oriented architecture |
US7769762B2 (en) | 2006-12-19 | 2010-08-03 | Sap Ag | Method and system for consolidating data type repositories |
US8290989B2 (en) | 2008-11-12 | 2012-10-16 | Sap Ag | Data model optimization |
- 2008-11-12: US12/269,324, patent US8290989B2 (en), Active
- 2012-08-28: US13/596,532, patent US8954378B2 (en), Active
- 2015-01-29: US14/609,052, patent US20150142841A1 (en), Abandoned
Non-Patent Citations (1)
Title |
---|
Goldman, "Refactoring Tools for Extreme Programming An Overview", 2002 * |
Also Published As
Publication number | Publication date |
---|---|
US8954378B2 (en) | 2015-02-10 |
US20100121864A1 (en) | 2010-05-13 |
US20120324588A1 (en) | 2012-12-20 |
US8290989B2 (en) | 2012-10-16 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US8290989B2 (en) | Data model optimization | |
JP5306359B2 (en) | Method and system for associating data records in multiple languages | |
Deb Nath et al. | Towards a programmable semantic extract-transform-load framework for semantic data warehouses | |
US7562088B2 (en) | Structure extraction from unstructured documents | |
Chakaravarthy et al. | Efficiently linking text documents with relevant structured information | |
US7707023B2 (en) | Method of finding answers to questions | |
US20080162455A1 (en) | Determination of document similarity | |
RU2635257C1 (en) | Sentiment analysis at level of aspects and creation of reports using machine learning methods | |
US20090063470A1 (en) | Document management using business objects | |
US10095766B2 (en) | Automated refinement and validation of data warehouse star schemas | |
JP2010541079A5 (en) | ||
US20180181646A1 (en) | System and method for determining identity relationships among enterprise data entities | |
CN107562919B (en) | Multi-index integrated software component retrieval method and system based on information retrieval | |
JP2008033931A (en) | Method for enrichment of text, method for acquiring text in response to query, and system | |
Abramowicz et al. | Filtering the Web to feed data warehouses | |
Goncalves et al. | Fuzzy XQuery | |
Pamungkas et al. | B-BabelNet: business-specific lexical database for improving semantic analysis of business process models | |
Talburt et al. | A practical guide to entity resolution with OYSTER | |
CN117076599A (en) | Knowledge graph-based data searching method and device and electronic equipment | |
RU2691836C1 (en) | Method of transforming a structured data array comprising main linguistic-logic entities | |
Wahid et al. | XML semantic constraint validation for XML updates: A survey | |
Alkhateeb et al. | Answering SPARQL queries modulo RDF Schema with paths | |
Naeem et al. | Generating OLAP queries from natural language specification | |
RU2572367C1 (en) | Method of searching for information in pre-transformed structured data array | |
US20220398271A1 (en) | Computing system for extracting facts for a knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: SAP SE, GERMANY. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: STUHEC, GUNTHER; GESSNER, FLORIAN; LEMCKE, JENS; SIGNING DATES FROM 20150120 TO 20150128; REEL/FRAME: 036335/0465 |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |