AU785458B2 - Method and apparatus for analyzing the quality of the content of a database - Google Patents

Publication number: AU785458B2
Authority: AU (Australia)
Prior art keywords: field; selected field; values; score; assigning
Legal status: Ceased (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: AU43918/01A
Other versions: AU4391801A (en
Inventors: Gregg Menin; Michael Renn Neal
Current Assignee: Requisite Technology Inc (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Requisite Technology Inc

Legal events:
Application filed by Requisite Technology Inc
Priority to AU43918/01A
Publication of AU4391801A
Application granted
Publication of AU785458B2
Assigned to CLICK COMMERCE, INC. (Alteration of Name(s) in Register under S187; assignor: REQUISITE TECHNOLOGY, INC.)
Assigned to REQUISITE TECHNOLOGY INC. (Request to Amend Deed and Register; assignor: CLICK COMMERCE, INC.)
Ceased

Description

AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION
NAME OF APPLICANT(S): Requisite Technology, Inc.
ADDRESS FOR SERVICE: DAVIES COLLISON CAVE, Patent Attorneys, 1 Little Collins Street, Melbourne, 3000.
INVENTION TITLE: Method and apparatus for analyzing the quality of the content of a database
The following statement is a full description of this invention, including the best method of performing it known to me/us:

FIELD OF THE INVENTION
The present invention relates to a method for scoring a database. The present invention also relates to a machine-readable medium having stored thereon data representing sequences of instructions. For example, this invention relates to electronic databases in general, and more specifically to a method and apparatus for analyzing the content of a database for various qualities such as comprehensibility, completeness and consistency which bear on the usefulness of the database in comparison to other databases.

BACKGROUND OF THE INVENTION
Searchable electronic catalogs are commonly used in support of electronic commerce and purchasing functions. These electronic catalogs can be created from printed catalogs, spreadsheets, text documents, databases or lists and typically are rendered into databases, HTML page collections and other electronic means. Individual purchaser or marketplace system installations frequently contain several catalogs from several sources.
For example, an office supply installation may contain office supply catalogs from several different office supply vendors or manufacturers. Some of the catalogs may describe identical items such as a blue pen while each catalog will likely describe similar but different items, such as different makes of blue pens. These catalogs may vary in their quality and usability as measured by the ability of users to find and purchase items. An objective measurement of the qualities of each catalog allows one to compare catalogs and identify catalog deficiencies quickly. With sufficient support, such analyses can quickly localize the source of the deficiency.
Three critical aspects of catalog usage are purchasing, item identification and validation, and finding. Sufficient information must be present in the catalog for describing an item so that a user or a prospective buyer can find the item. A catalog supplier strives to present a catalog that maximizes the likelihood that items will be found, identified and then purchased. The information needed for a purchase may be only a part number or include very detailed item descriptions with images and interactive applications. Catalogs that support a greater amount of specific information generate greater sales so they are scored higher in evaluating the catalog's usefulness and in evaluating the key attribute of how easy it is for a purchaser to find the item that is sought.
It is generally desirable to overcome or ameliorate one or more of the above described difficulties, or to at least provide a useful alternative.

SUMMARY OF THE INVENTION
In accordance with one aspect of the present invention, there is provided a method, performed by a computer system, for scoring a database for a quality comprising: selecting at least one field of the database for analysis; fetching each value for each record of the database from the selected fields that are to be analyzed; comparing each of the fetched values for a selected field to a field standard; and assigning a field score for each selected field based on the comparison.
In accordance with another aspect of the present invention, there is provided a machine-readable medium having stored thereon data representing sequences of instructions which, when executed by a processor, cause the processor to perform the steps of: selecting at least one field of the database for analysis; fetching each value for each record of the database from the selected fields that are to be analyzed; comparing each of the fetched values for a selected field to a field standard; and assigning a field score for each selected field based on the comparison.
In a preferred embodiment, the present invention provides a method for scoring a database for a quality, for example, completeness, consistency or comprehensibility. The method includes selecting fields of the database that are to be analyzed, fetching values for each record of the database from the fields that are to be analyzed and comparing the fetched values to a standard. Preferably, after the comparison, a score is assigned for each field based on the comparison. The fields are ranked in order of pertinence to the quality that is to be measured and the scores are weighted for each field based on the rank of each field. The weighted scores are finally combined to obtain a score for the database.
Where the quality to be analyzed is completeness, the invention includes comparing fetched values for a field to other fetched values for the same field. Assigning a score comprises assigning points for each null value so that the score for a field corresponds to the number of null values for all records in that field.
Where the quality to be analyzed is consistency, the invention includes comparing the fetched values for a field to a dictionary of possible values. Assigning a score comprises assigning points for each fetched value that does not match a dictionary value so that the score for a field corresponds to the number of non-matching values for all records for that field.
Where the quality to be analyzed is comprehensibility, the present invention includes comparing the fetched values for a field to a dictionary of possible values. Assigning a score comprises assigning points for each fetched value that does not match a dictionary value so that the score for a field corresponds to the number of non-matching values for all records for that field.

BRIEF DESCRIPTION OF THE DRAWINGS
Preferred embodiments of the present invention are hereafter described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
Figure 1 is an example of a typical computer system upon which one embodiment of the present invention may be implemented;
Figure 2 is a flow diagram showing one embodiment of the present invention;
Figure 3 is a flow diagram showing an application of a preferred embodiment of the present invention for measuring completeness;
Figure 4 is a flow diagram showing an application of a preferred embodiment of the present invention for measuring consistency; and
Figure 5 is a flow diagram showing an application of a preferred embodiment of the present invention for measuring comprehensibility.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE PRESENT INVENTION
The present invention preferably includes various steps, which will be described below. The steps of the present invention may preferably be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor or logic circuits programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware and software.
The present invention may be provided as a computer program product which may include a machine-readable medium having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process according to a preferred embodiment of the present invention. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
Importantly, while preferred embodiments of the present invention will be described with reference to analyzing the quality of a catalog for finding and identifying items of particular interest to users such as potential customers, the method and apparatus described herein are equally applicable to the analysis of any sort of database for which particular qualities are to be measured. For example, the techniques described herein are thought to be useful in connection with databases for client or customer management, for inventory management and for transportation management and scheduling.
The present invention is preferably implemented in Java software instructions although any other computer programming language can be used. The Java code can be run on a wide variety of computer systems. An example of such a computer system upon which the present invention may be implemented will now be described with reference to Figure 1. The computer system comprises a bus or other communication means 1 for communicating information, and a processing means such as a processor 2 coupled with the bus 1 for processing information. The computer system further comprises a random access memory (RAM) or other dynamic storage device 4 (referred to as main memory), coupled to the bus 1 for storing information and instructions to be executed by the processor 2. The main memory 4 also may be used for storing temporary variables or other intermediate information during execution of instructions by the processor 2. The computer system may also include a read only memory (ROM) or other static storage device 6 coupled to the bus 1 for storing static information and instructions for the processor 2.
A data storage device 7 such as a magnetic disk or optical disc and its corresponding drive may also be coupled to the computer system for storing information and instructions. The computer system can also be coupled via the bus 1 to a display device 21, such as a cathode ray tube (CRT) or Liquid Crystal Display (LCD), for displaying information to an end user. For example, graphical and textual indications of installation status, time remaining in the trial period, and other information may be presented to the prospective purchaser on the display device 21. Typically, an alphanumeric input device 22, including alphanumeric and other keys, may be coupled to the bus 1 for communicating information and command selections to the processor 2.
Another type of user input device is a cursor control 23, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to the processor 2 and for controlling cursor movement on the display 21.
A communication device 25 is also coupled to the bus 1. The communication device 25 may include a modem, a network interface card, or other well known interface devices, such as those used for coupling to Ethernet, token ring, or other types of physical attachments for purposes of providing a communication link to support a local or wide area network, for example. In any event, in this manner, the computer system may be coupled to a number of clients or servers via a conventional network infrastructure, such as a company's Intranet or the Internet, for example.
It may be appreciated that a lesser or more equipped computer system than the example described above may be desirable for certain implementations. Therefore, the configuration of the computer system will vary from implementation to implementation depending upon numerous factors, such as price constraints, performance requirements, technological improvements, and other circumstances.
It should be noted that, while the steps described herein may be performed under the control of a programmed processor, such as the processor 2, in alternative embodiments, the steps may be fully or partially implemented by any programmable or hard coded logic, such as Field Programmable Gate Arrays (FPGAs), TTL logic, or Application Specific Integrated Circuits (ASICs), for example. Additionally, the method of the present invention may be performed by any combination of programmed general purpose computer components or custom hardware components. Therefore, nothing disclosed herein should be construed as limiting the present invention to a particular embodiment wherein the recited steps are performed by a specific combination of hardware components.
The present invention is preferably directed toward analyzing lists of data, and in a preferred embodiment, to analyzing electronic catalogs. The catalog can exist as a database or in any other electronic format, such as a spreadsheet or text. Where there is no electronic format, paper catalogs or text documents can be scanned into electronic form and then processed to a standardized list of items with their descriptions. The present application will preferably describe the invention in terms of a database. In the context of a preferred embodiment of the present invention, the term database should not be construed as limited to any particular type of structure but rather in a broader sense as a list or a sequence in which items are accompanied by descriptions. Such a database can be viewed, for example, as a collection of two-dimensional tables in which each row represents a different record and each column represents a different field. Each record corresponds to a particular item. In the case of a catalog of office supplies, a record provides the catalog information for a particular office supply such as a particular pen. Different pens each have a different record. For each record, there are several fields. Each field describes an attribute of the item that corresponds to the record, such as price, color, weight, size etc. The present invention preferably analyzes the values that are entered into the fields of the database.
Figure 2 shows an application of a preferred embodiment of the present invention, in general, to analyzing a quality of a database. In Figure 2 the process begins with selecting the fields that are to be analyzed 30. Typically, not all fields are given the same importance, as will be appreciated in the examples that follow. After the fields are selected, they are ranked in order of importance 32. The present invention preferably looks at deficiencies, excesses, and variability in the values of the fields of the database and, in order to provide a meaningful score, different fields must be accorded differing levels of importance in the scoring.
Each field is given a weight, based on its ranking, and this weight is used in determining the final score. After the fields are selected and ranked, the values in the database for each of the selected fields are fetched 34 and then analyzed through a process of comparison 36. The particular type of comparison will depend upon the particular quality that is being analyzed. After the comparison, a score is assigned 38 based on the comparison. This score is the basic input into the overall score for the database. As mentioned above, the ranking of the fields is used to assign weighting factors to each of the fields 40. These weighting factors are preferably recorded in a table which is used to apply weights to each of the scores 42. It is presently preferred that the weights all constitute a multiplication factor between zero and one; however, the numerical scaling can be done in a variety of different ways. Finally, the weighted scores are combined 44 to produce an overall score for the database for the particular quality being analyzed.
Scores for multiple qualities can be combined to provide a more comprehensive score of the database. The results can also be normalized to facilitate comparisons between different databases or electronic sources.
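The flow of Figure 2 (select, fetch, compare, score, weight, combine) can be sketched in a few lines. The sketch below is illustrative only and uses Python for brevity, although the specification prefers Java. The names `field_score` and `database_score`, the representation of records as dictionaries, and the choice to express each field score as a fraction between zero and one are all assumptions, not details taken from the specification.

```python
from typing import Callable, Dict, List, Optional

# Assumed representation: one dictionary per record, mapping field name to value.
Record = Dict[str, Optional[str]]


def field_score(records: List[Record], field: str,
                meets_standard: Callable[[Optional[str]], bool]) -> float:
    """Fetch every value of one field and compare it to a field standard."""
    values = [record.get(field) for record in records]  # fetch step
    if not values:
        return 0.0
    passed = sum(1 for value in values if meets_standard(value))  # compare step
    return passed / len(values)  # fraction of passing values (assumed scale)


def database_score(records: List[Record], weights: Dict[str, float],
                   meets_standard: Callable[[Optional[str]], bool]) -> float:
    """Weight each selected field's score and combine into one score."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(weight * field_score(records, field, meets_standard)
                   for field, weight in weights.items())
    return weighted / total_weight  # normalized so databases can be compared


if __name__ == "__main__":
    catalog = [{"SKU": "A1", "Price": None}, {"SKU": "B2", "Price": "3.50"}]
    print(database_score(catalog, {"SKU": 1.0, "Price": 0.75},
                         lambda value: value is not None))
```

Passing the comparison in as a function keeps the same skeleton usable for each quality: a null check for completeness, a thesaurus lookup for consistency, or a dictionary lookup for comprehension.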
In a preferred embodiment, the invention can be used to measure the ease with which items in a catalog can be found. Preferably three components are analyzed.
Completeness looks to see if attributes and field values for catalog items exist in the catalog, or, in other words, whether important fields for each record contain data entries.
Emphasis is placed on attributes critical to finding and purchasing such as SKU (Stock Keeping Unit), Price, Supplier Name, and Description. A catalog that is missing these items (that is, one that contains null field values) will be more difficult to use. Consistency looks for the consistent use of common abbreviations and units of measure. Comprehension looks at how the product is described by evaluating word usage. Words, including units of measure and common abbreviations, in the description fields are examined using a dictionary and parts of speech are analyzed for appropriateness and count.
Figure 3 shows an example of a flow chart for analyzing completeness.
Preferably, in the example of analyzing an electronic catalog, the completeness analysis is a check for the existence of all attributes of products that are required to make a purchase, as well as the existence of field values that enhance the ability to find a product.
In Figure 3 the process of analyzing a database for the quality of completeness begins with selecting the fields that are to be analyzed 50. Typically, for the example of an electronic catalog, the fields of SKU, Price, Supplier Name, and Description would be selected. However, the particular selected fields will depend upon the particular database to be analyzed and the fields which are considered to be most important. After the fields are selected they are ranked in order of importance 52. Typically, the ranking would be SKU, Price, Supplier Name, and Description. The particular database, domain of the database content, and the ranking of the fields will depend on the particular database and the purpose of the analysis. Weights are next assigned 54 based on the rankings.
Examples of weights to apply would be SKU: 1.0, Price: 0.75, Supplier Name: 0.5, and Description: 0.25.
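The specification gives example weights but does not state a rule for deriving them from the ranking. As a hedged illustration, the Python sketch below uses a simple linear rule, which is an assumption, under which the field at rank position p (counting from zero) among n fields receives the weight (n - p) / n. The function name `weights_from_ranking` is hypothetical. For four ranked fields this rule happens to reproduce the example weights above.

```python
from typing import Dict, List


def weights_from_ranking(ranked_fields: List[str]) -> Dict[str, float]:
    """Map fields, listed from most to least important, to weights in (0, 1].

    Assumed rule: linear spacing, so the top field gets 1.0 and the last
    field gets 1/n. The specification does not prescribe this rule.
    """
    count = len(ranked_fields)
    return {field: (count - position) / count
            for position, field in enumerate(ranked_fields)}


if __name__ == "__main__":
    print(weights_from_ranking(["SKU", "Price", "Supplier Name", "Description"]))
```

Any other monotonic mapping would serve equally well, provided each weight stays between zero and one as the specification prefers.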
After the fields are selected and ranked, the values in the database for each of the selected fields are fetched 56 and then analyzed through a process of comparison 58.
Specifically, the value of the field is compared to a null value; that is, a determination is made as to whether there is any data entered into the field for the particular record. Then a count is made of all of the null values for each field 60. A score is assigned 62 based on the comparison. Preferably, the score is simply the number of values that are not null for each field. Weighting factors are preferably applied to each of the scores 64. Finally, the weighted scores are combined 66 to produce an overall completeness score for the database being analyzed.
A mathematical example of determining a completeness score where three different fields are being analyzed follows.
The completeness score is given by

$$\text{completeness score} = \frac{w_1 f_1(n) + w_2 f_2(n) + w_3 f_3(n)}{w_1 + w_2 + w_3}$$

where

$$f_1(n) = \frac{\sum\,[\text{all first group fields}] \times [\text{count of first group fields with non-null values per record}] \times [\text{count of records being evaluated}]}{[\text{all first group fields}] \times [\text{count of records being evaluated}]}$$

$$f_2(n) = \frac{\sum\,[\text{all second group fields}] \times [\text{count of second group fields with non-null values per product}] \times [\text{count of products being evaluated}]}{[\text{all second group fields}] \times [\text{count of products being evaluated}]}$$

$$f_3(n) = \frac{\sum\,[\text{all third group fields}] \times [\text{count of third group fields with non-null values per product}] \times [\text{count of products being evaluated}]}{[\text{all third group fields}] \times [\text{count of products being evaluated}]}$$

Here the records being evaluated are, for example, the products in a catalog, and w1, w2 and w3 are the corresponding weights for the first to the third fields respectively.
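One way to read the formula is that each f term is the fraction of non-null cells among all cells belonging to a group of fields, and the overall score is the weighted average of those fractions. The Python sketch below follows that reading, which is an interpretation rather than something the specification states outright. The function names, the record layout, and the decision to treat an empty string as null are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def group_completeness(records: List[Record], fields: List[str]) -> float:
    """Fraction of non-null cells over every field in the group and every record."""
    total_cells = len(fields) * len(records)
    if total_cells == 0:
        return 0.0
    # Assumption: both a missing value and an empty string count as null.
    filled = sum(1 for record in records for field in fields
                 if record.get(field) not in (None, ""))
    return filled / total_cells


def completeness_score(records: List[Record],
                       groups: List[Tuple[float, List[str]]]) -> float:
    """Weighted average of the group fractions: sum(w_k * f_k) / sum(w_k)."""
    total_weight = sum(weight for weight, _ in groups)
    if total_weight == 0:
        return 0.0
    weighted = sum(weight * group_completeness(records, fields)
                   for weight, fields in groups)
    return weighted / total_weight


if __name__ == "__main__":
    catalog = [{"SKU": "A1", "Price": "2.00", "Description": None},
               {"SKU": "B2", "Price": None, "Description": None}]
    groups = [(1.0, ["SKU"]), (0.75, ["Price"]), (0.25, ["Description"])]
    print(completeness_score(catalog, groups))
```

Dividing by the sum of the weights keeps the result between zero and one regardless of how many groups are scored.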
A detailed report of completeness would typically show the percent completion (values not null) for all selected fields, a list of the selected fields, and the percent completion of all fields by category. In addition, the number of items missing key attributes in a field, the number of items with rich content (e.g., pictures) and the number of items without categories may be shown. Finally, the percent completion of all fields by score can be provided. This can be used to focus data value improvement efforts on those areas that most need it.
It may also be desired to produce scores on the basis of domains, categories or attributes. For a catalog that spans several domains, it may be useful to understand which domains have the greatest level of completion and which domains require the most improvement. Within a particular domain, a catalog user or creator may benefit by understanding which categories of goods or services may benefit most from remediation.
Attributes (descriptors or specifications) which relate to groups of fields present another useful basis for reporting to a catalog user or creator. If the incomplete fields belong to attributes that are common across the catalog, such as SKU and price, a different remedial effort may be required than if the incomplete fields relate to category-specific attributes such as color or power.
Figure 4 presents an application of the present invention to preferably analyze consistency.
Catalog users generally prefer consistency in the manner in which items are described.
This promotes confidence that when a user searches for a product description, all items like the desired product are found and displayed. The first element of consistency is in the usage of words, units of measure, and abbreviations, for example using ft., FT. or foot. Unnecessary or inconsistent uses of synonyms, that is, using synonyms that do not convey differences in the products, are distracting and interfere with efficient use of the catalog. The use of abbreviations with multiple possible meanings (such as CT for Carton, Crate or Connecticut) can create ambiguities that also interfere with efficient use of the catalog. The present invention, preferably using a thesaurus defined by the user as a database of synonyms, can score synonym usage.
Consistency in abbreviation usage is desirable, both for catalog consistency and for avoiding ambiguity. A table of abbreviations may be created with the preferred abbreviations noted. Scoring of this component of consistency may be based on the ratio of preferred abbreviations to the total abbreviations. The frequency with which unique pairings occur is a second method. Combining several methods allows for a weighted score for the entire consistency component.
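The ratio of preferred abbreviations to total abbreviations can be illustrated with a small Python sketch. The table layout (a mapping from every known variant to its preferred form), the function name `preferred_abbreviation_ratio`, and the choice to return 1.0 when a field contains no known abbreviations are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def preferred_abbreviation_ratio(values: List[Optional[str]],
                                 preferred_for: Dict[str, str]) -> float:
    """Ratio of preferred abbreviations to all known abbreviations in a field.

    preferred_for maps each known abbreviation variant to its preferred form;
    a variant is preferred when it maps to itself.
    """
    known = [value for value in values if value in preferred_for]
    if not known:
        return 1.0  # assumption: nothing to penalize when no abbreviations occur
    preferred = sum(1 for value in known if preferred_for[value] == value)
    return preferred / len(known)


if __name__ == "__main__":
    # Hypothetical table in which "ft" is noted as the preferred abbreviation.
    table = {"ft": "ft", "FT.": "ft", "foot": "ft"}
    print(preferred_abbreviation_ratio(["ft", "FT.", "ft", "foot", "blue"], table))
```

Values that are not in the table, such as a color name, are ignored rather than counted against the field.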
In Figure 4, the process of analyzing a database for the quality of consistency begins with selecting the fields that are to be analyzed 70. Typically, for the example of an electronic catalog, the fields which contain units of measure and abbreviations would be selected. For a catalog, fields for dimensions, colors, types and shipping data may be selected. After the fields are selected, they are ranked in order of importance 72. The particular fields selected and the ranking of the fields will depend on the particular database and the purpose of the analysis. Weights are next assigned 74 based on the rankings. Examples of weights to apply would be Size: 1.0, Weight: 0.75, Color: 0.5, and Shipping Orders: 0.25.
After the fields are selected and ranked, the values in the database for each of the selected fields are fetched 76 and then analyzed through a process of comparison 80. Specifically, the value of each field is compared to values in a thesaurus 78. The thesaurus is specifically designed for the type of catalog being analyzed. It may be provided by the catalog's creator or it may be based on the needs of a particular user of the catalog. Preferably the thesaurus contains a complete listing of synonyms that are well understood in the field for units of measure and abbreviations. A different thesaurus may be required for different categories or domains.
In the comparison, a determination is made as to whether a unit of measure or abbreviation value from each record matches an entry in the thesaurus. Then a count is made of all of the different matching values for each field 82. A score is assigned 84 based on the number of matches. Preferably, the score is simply the number of values that find a match in the thesaurus for each field divided by the total number of non-null values. Weighting factors are then applied to each of the scores 86. Another score that can be developed from the comparison 80 is a count 88 of all of the unique values in each field. For example, the unit of measure value and "pound" are added together to form a count of four no matter how many times each of these values occurs in the weight field. This total number of unique values is assigned a score 90, so that a larger number of synonyms generates a lower score. A preferred score is an aggregate of the number of synonym groups divided by the count of synonyms found for each of the synonym groups. A synonym group is, for example, weights in pounds, and the synonyms are the various ways of expressing pounds above (e.g., Lb., pound, etc.). The score is then weighted 92 in the same manner as the total number of matches. Finally, the weighted scores are combined 94 to produce an overall consistency score for the database being analyzed. The overall consistency score preferably reflects a ratio that is (count of redundant abbreviations and units of measure) / (count of unique abbreviations and units of measure). A complete mathematical analysis would be very similar to that presented above for completeness.
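The two consistency scores described above can be sketched as follows, again in Python and only as an illustration. The thesaurus is assumed to be a mapping from each known variant to the name of its synonym group. The function names are hypothetical, and reading the "preferred score" as the number of synonym groups divided by the total number of distinct synonyms found is one interpretation of the text, not a confirmed formula.

```python
from typing import Dict, List, Optional, Set


def thesaurus_match_ratio(values: List[Optional[str]],
                          thesaurus: Dict[str, str]) -> float:
    """Values matching a thesaurus entry divided by the number of non-null values."""
    non_null = [value for value in values if value is not None]
    if not non_null:
        return 0.0
    matches = sum(1 for value in non_null if value in thesaurus)
    return matches / len(non_null)


def synonym_spread_score(values: List[Optional[str]],
                         thesaurus: Dict[str, str]) -> float:
    """Synonym groups found divided by distinct synonyms found (assumed reading).

    The score is 1.0 when every group is written one way and falls as more
    variants of the same unit or abbreviation appear, however often each occurs.
    """
    variants_by_group: Dict[str, Set[str]] = {}
    for value in values:
        group = thesaurus.get(value) if value is not None else None
        if group is not None:
            variants_by_group.setdefault(group, set()).add(value)
    if not variants_by_group:
        return 0.0
    distinct_variants = sum(len(variants) for variants in variants_by_group.values())
    return len(variants_by_group) / distinct_variants


if __name__ == "__main__":
    # Hypothetical thesaurus entries for a weight field.
    thesaurus = {"Lb.": "pound", "pound": "pound", "oz": "ounce"}
    weight_units = ["Lb.", "pound", "pound", "oz", None]
    print(thesaurus_match_ratio(weight_units, thesaurus))
    print(synonym_spread_score(weight_units, thesaurus))
```

Because the spread score works on sets of variants, repeating the same spelling many times does not change it, which matches the statement that the count ignores how often each value occurs.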
A detailed consistency report for a catalog preferably shows per category and per attribute:
Number of abbreviations
Number of unique abbreviations
Number of redundant abbreviations
Number of units of measure
Number of unique units of measure
Number of redundant units of measure
A third example quality to analyze is comprehension. In one method to analyze comprehension, the present invention looks to see if the item descriptions in the catalog use native language words, and if the variety of words is consistent with the size of the catalog. Numbers and alphanumeric strings are excluded from the analysis as are known units of measure and abbreviations. Numbers are assumed to be either part numbers or values associated with descriptors. Alphanumeric strings are assumed to be part numbers. Units of measure and abbreviations are dealt with in the consistency evaluation discussed above.
Additional analysis can be generated to look at the usage of nouns and adjectives in describing items. The present invention can also analyze optimal value ranges for describing items in a given domain and the relationship between the number of unique nouns and the number of categories. In this case a grade can be associated with the percent of unique words that are found in the dictionary. Each recurrence of a word is not counted. Other factors to include are the number of words in the catalog, the number of unique words in the catalog, the number of nouns used per record or item as distinguished by having a unique SKU and the number of adjectives per record. This last measure can also be considered by measuring the percentage of records that are described with at least one word. For catalogs with which users prefer written descriptions, a statistical count of the extent of the descriptions is valuable. All of these measures are preferably sorted by category and by attribute to provide the most useful measure to the user and creator of the catalog.
In Figure 5, the process of analyzing a database for the quality of comprehension begins with selecting the fields that are to be analyzed 100. Typically, for the example of an electronic catalog, the fields which contain text descriptions would be selected. In a database, this information may be spread over several fields associated with the product so all the fields can be examined in their entirety. Users can select which fields are appropriate for the particular situation. Furthermore, parts of speech across the entire catalog can be analyzed as an indication of a catalog's ability to differentiate between similar items. Text components of product descriptions can be evaluated for sufficiency as well as consistency. Sufficiency is providing enough description to effectively describe a product, as well as to effectively differentiate items within a catalog.
Examining the number and variance of each part of speech (noun, adjective, etc.) on a per item basis provides some indication of the degree of information conveyed about that item. After the fields are selected, they are ranked in order of importance 102. Weights are next assigned 104 based on the rankings. After the fields are selected and ranked, the values in the database for each of the selected fields are fetched 106 and then analyzed through a process of comparison 108 to a dictionary 110. The dictionary is specifically designed for the type of catalog being analyzed.
In the comparison, a determination is made as to whether each word for each record matches an entry in the dictionary 112. Then a count is made of all of the different matching values for each field 114. A score is assigned 114 based on the number of matches. Preferably, the score is simply the number of values that find a match in the dictionary for each field. Weighting factors are then applied to each of the scores 116.
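The matching and weighting steps just described might be sketched in Python as follows. The record layout (a list of dicts), the whitespace tokenization, the function names and the sample weights are all assumptions for illustration, not details from the specification.

```python
def field_match_scores(records, fields, dictionary):
    """Count, per field, the distinct words that match a dictionary entry."""
    scores = {}
    for field in fields:
        matched = set()
        for record in records:
            # Missing or null values contribute no words.
            for word in str(record.get(field) or "").lower().split():
                if word in dictionary:
                    matched.add(word)
        scores[field] = len(matched)  # the score is simply the match count
    return scores


def apply_weights(scores, weights):
    """Multiply each field score by that field's weighting factor."""
    return {field: scores[field] * weights[field] for field in scores}


records = [
    {"name": "Steel Bolt", "description": "zinc plated steel bolt"},
    {"name": "Nut", "description": "hex nut xq7"},
]
dictionary = {"steel", "bolt", "nut", "zinc", "plated", "hex"}
scores = field_match_scores(records, ["name", "description"], dictionary)
weighted = apply_weights(scores, {"name": 1.0, "description": 2.0})
```

The token "xq7" stands in for a part number that finds no dictionary match and so adds nothing to the description score.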
Another score that can be developed from the comparison 108 is a count 118 of all of the nouns in each field. The dictionary comparison can be used to determine the word's part of speech. The noun counts are assigned scores 120, preferably simply the count, and then weighted 122 in the same manner as the total number of matches. In addition, the adjectives are counted 124, assigned a score for each field 126 and properly weighted 128. Finally, all of the weighted scores are combined 130 to produce an overall comprehension score for the database being analyzed. Scoring may be based on the ratio of found words to total words, found unique instances of words to total unique words, and ratios after filtering for non-language text such as part numbers and non-descriptive (or otherwise uninteresting) text such as conjunctions and prepositions.
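A rough Python sketch of the noun and adjective counting follows. It assumes the dictionary is a mapping from word to part-of-speech label, handles a single field, counts every occurrence of a matched word, and combines the weighted scores by simple addition. These choices, along with the names `part_of_speech_counts` and `comprehension_score`, are illustrative assumptions rather than the specified method.

```python
from collections import Counter


def part_of_speech_counts(values, pos_dictionary):
    """Count dictionary-matched words in a field by part of speech."""
    counts = Counter()
    for value in values:
        for word in value.lower().split():
            pos = pos_dictionary.get(word)  # None when the word is unknown
            if pos is not None:
                counts[pos] += 1
    return counts


def comprehension_score(values, pos_dictionary, weights):
    """Combine weighted match, noun and adjective counts for one field."""
    counts = part_of_speech_counts(values, pos_dictionary)
    matches = sum(counts.values())
    return (weights["matches"] * matches
            + weights["nouns"] * counts["noun"]
            + weights["adjectives"] * counts["adjective"])


pos_dictionary = {
    "bolt": "noun", "nut": "noun",
    "steel": "adjective", "hex": "adjective",
    "with": "preposition",
}
values = ["steel bolt with hex nut", "bolt"]
counts = part_of_speech_counts(values, pos_dictionary)
score = comprehension_score(
    values, pos_dictionary,
    {"matches": 1.0, "nouns": 2.0, "adjectives": 0.5},
)
```

Filtering out prepositions and conjunctions before scoring, as the text suggests, would amount to skipping entries whose label is not of interest.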
Scoring is based on a value driven methodology in which the score for each component is normalized. As scoring components are aggregated into larger representations, each aggregated score is renormalized. Any component consisting of multiple elements has weighting applied to reflect the relative value of that element in relation to other catalog elements. Weights are applied at all levels of scoring aggregation. Users are permitted to configure relative weights of the scoring (value weighting and normalization). There are additional methods for evaluating the sufficiency of item descriptions. Among these are examinations of description length, by both character and word count, and comparing this to an expected value (range) or an existing, calculated distribution. Such evaluations may be performed over one or more fields, and by category, catalog, catalog set, or other grouping defined by the user.
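One way to picture normalization, renormalization on aggregation, and the description-length check is the Python sketch below. The 0 to 1 normalized range, the use of a weighted mean for renormalization, the word-count limits and all function names are assumptions chosen for illustration.

```python
def normalize(score, maximum):
    """Scale a raw score into the range 0.0 to 1.0 (assumed range)."""
    if maximum <= 0:
        return 0.0
    return max(0.0, min(1.0, score / maximum))


def aggregate(components, weights):
    """Weighted mean of normalized scores.

    Dividing by the total weight keeps the aggregate in the same
    0.0 to 1.0 range, which is the renormalization step.
    """
    total_weight = sum(weights[name] for name in components)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(components[name] * weights[name] for name in components)
    return weighted_sum / total_weight


def length_sufficiency(descriptions, min_words, max_words):
    """Fraction of descriptions whose word count lies in the expected range."""
    if not descriptions:
        return 0.0
    in_range = sum(
        1 for text in descriptions
        if min_words <= len(text.split()) <= max_words
    )
    return in_range / len(descriptions)


field_scores = {"name": normalize(3, 4), "description": normalize(6, 12)}
combined = aggregate(field_scores, {"name": 1, "description": 3})
sufficiency = length_sufficiency(
    ["steel bolt", "hex nut with washer", ""], min_words=2, max_words=5
)
```

The same `aggregate` function can be applied again at the category, catalog or catalog-set level, with user-configured weights at each level.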
Different applications and domains have different requirements for finding and purchasing. The scoring system is preferably configurable to reflect the values of each particular environment. Domain and application experts apply their own evaluation of the relative importance of the components of catalog scoring.
In the description above, three basic quality attributes are scored. The same method can be used to evaluate many other qualities of a database. The invention is not limited to the quality measures discussed above.
After all of the desired qualities are scored, a report of the results is configured. A basic report is a summary description of the catalog. This includes the total number of items (SKUs) in the catalog, the number of unique items (total SKUs minus duplicates), the number of categories, the number of base and local attributes, and the number of unique local attributes.
The Catalog Grade is a weighted average of all of the individual quality scores mentioned above. Preferably, all grades use a 0-10 scale, with 10 being the best possible score. The user may define the weights assigned to each component, though standardized weighting values are preferred in order to facilitate catalog comparisons.
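As a worked illustration of the Catalog Grade, the following Python sketch computes a weighted average of quality scores that are already on the 0-10 scale. The function name `catalog_grade`, the quality names and the sample scores and weights are hypothetical.

```python
def catalog_grade(quality_scores, weights):
    """Weighted average of quality scores, each already on the 0-10 scale."""
    for name, score in quality_scores.items():
        if not 0 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside the 0-10 scale")
    total_weight = sum(weights[name] for name in quality_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        quality_scores[name] * weights[name] for name in quality_scores
    )
    return weighted_sum / total_weight


# Hypothetical scores and user-defined weights for three qualities.
grade = catalog_grade(
    {"completeness": 8.0, "consistency": 6.0, "comprehension": 9.0},
    {"completeness": 2.0, "consistency": 1.0, "comprehension": 1.0},
)
```

Here the grade is (8 x 2 + 6 x 1 + 9 x 1) / 4 = 7.75. Using the same standardized weights for every catalog keeps grades comparable from one catalog to another.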
While this invention has been particularly shown and described with references to a preferred embodiment thereof, it will be understood by those skilled in the art that variations, adaptations and modifications may be made therein without departing from the spirit and scope of the invention as defined by the following claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement or any form of suggestion that that prior art forms part of the common general knowledge in Australia.
Docket No.: 003919.P004 Express Mail No.: EL328715941US 21

Claims (24)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A method, performed by a computer system, for scoring a database for a quality comprising: selecting at least one field of the database for analysis; fetching each value for each record of the database from the selected fields that are to be analyzed; comparing each of the fetched values for a selected field to a field standard; and assigning a field score for each selected field based on the comparison.
2. The method of Claim 1 further comprising: ranking the selected fields in order of pertinence to the quality; and weighting the field scores for each selected field based on the rank of each selected field.
3. The method of Claim 1 wherein the quality comprises completeness, wherein comparing each of the fetched values comprises comparing fetched values for a selected field to other fetched values for the same selected field and wherein assigning a field score comprises assigning points for each non-null value so that the field score for a selected field corresponds to the number of non-null values for all records in that selected field.
4. The method of Claim 1 wherein the quality comprises consistency, wherein comparing each of the fetched values comprises comparing fetched values for a selected field to a dictionary of possible values and wherein assigning a field score comprises assigning points for each fetched value that does not match a dictionary value so that the field score for a selected field corresponds to the number of non-matching values for all records for that selected field.
5. The method of Claim 4 wherein at least one selected field to be analyzed corresponds to units of measure, wherein the dictionary for the units of measure field contains alternative expressions for the same units of measure and wherein assigning a field score includes assigning points for each use of an alternate expression for the same unit of measure.
6. The method of Claim 4 wherein at least one selected field to be analyzed contains values that are abbreviations, wherein the dictionary for the at least one selected field contains alternative abbreviations with the same meaning and wherein assigning a field score includes assigning points for each use of an alternate abbreviation for the same meaning.
7. The method of Claim 2 wherein weighting the field score for each selected field based on the rank of the selected field comprises assigning a weight to each selected field based on the rank of the selected field and multiplying the total points assigned to the selected field by the weight.
8. The method of Claim 1 wherein the quality comprises comprehensibility, wherein comparing each of the fetched values comprises comparing fetched values for a selected field to a dictionary of possible values and wherein assigning a field score comprises assigning points for each fetched value that does not match a dictionary value so that the field score for a selected field corresponds to the number of non-matching values for all records for that selected field.
9. The method of Claim 8 further comprising classifying fetched values into types, counting the number of each value type for each selected field and assigning a field score based on the number of each value type for each selected field.
10. The method of Claim 9 wherein the value types include one or more of nouns and adjectives.
11. The method of Claim 9 wherein assigning a field score based on the number of each value type includes forming a ratio of value types in a selected field to other value types in the same selected field and comparing the ratio to a desired ratio.
12. A machine-readable medium having stored thereon data representing sequences of instructions which, when executed by a processor, cause the processor to perform the steps of: selecting at least one field of the database for analysis; fetching each value for each record of the database from the selected fields that are to be analyzed; comparing each of the fetched values for a selected field to a field standard; and assigning a field score for each selected field based on the comparison.
13. The medium of Claim 12 further comprising: ranking the selected fields in order of pertinence to the quality; weighting the field scores for each selected field based on the rank of each selected field; and combining the weighted field scores to obtain a combined score for the database.
14. The medium of Claim 12 wherein the quality comprises completeness, wherein comparing each of the fetched values comprises comparing fetched values for a selected field to other fetched values for the same selected field and wherein assigning a field score comprises assigning points for each null value so that the field score for a selected field corresponds to the number of null values for all records in that selected field.
15. The medium of Claim 12 wherein the quality comprises consistency, wherein comparing each of the fetched values comprises comparing fetched values for a field to a dictionary of possible values and wherein assigning a field score comprises assigning points for each fetched value that does not match a dictionary value so that the field score for a selected field corresponds to the number of non-matching values for all records for that selected field.
16. The medium of Claim 15 wherein at least one selected field to be analyzed corresponds to units of measure, wherein the dictionary for the units of measure contains alternative expressions for the same units of measure and wherein assigning a field score includes assigning points for each use of an alternate expression for the same unit of measure.
17. The medium of Claim 15 wherein at least one selected field to be analyzed contains values that are abbreviations, wherein the dictionary for the at least one selected field contains alternative abbreviations with the same meaning and wherein assigning a field score includes assigning points for each use of an alternate abbreviation for the same meaning.
18. The medium of Claim 13 wherein weighting the field score for each selected field based on the rank of the selected field comprises assigning a weight to each selected field based on the rank of the selected field and multiplying the total points assigned to the selected field by the weight.
19. The medium of Claim 12 wherein the quality comprises comprehensibility, wherein comparing each of the fetched values comprises comparing fetched values for a selected field to a dictionary of possible values and wherein assigning a field score comprises assigning points for each fetched value that does not match a dictionary value so that the field score for a selected field corresponds to the number of non-matching values for all records for that selected field.
20. The medium of Claim 19 further comprising classifying fetched values into value types, counting the number of each value type for each selected field and assigning a field score based on the number of each value type for each selected field.
21. The medium of Claim 20 wherein the value types include one or more of nouns and adjectives.
22. The medium of Claim 20 wherein assigning a field score based on the number of each value type includes forming a ratio of value types in a selected field to other value types in the same selected field and comparing the ratio to a desired ratio.
23. A method for scoring a database substantially as hereinbefore described with reference to the accompanying drawings.
24. A machine-readable medium substantially as hereinbefore described with reference to the accompanying drawings.
AU43918/01A 2001-05-16 2001-05-16 Method and apparatus for analyzing the quality of the content of a database Ceased AU785458B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU43918/01A AU785458B2 (en) 2001-05-16 2001-05-16 Method and apparatus for analyzing the quality of the content of a database

Publications (2)

Publication Number Publication Date
AU4391801A AU4391801A (en) 2002-11-21
AU785458B2 AU785458B2 (en) 2007-07-12

Family

ID=3731214

Country Status (1)

Country Link
AU (1) AU785458B2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998049632A1 (en) * 1997-04-25 1998-11-05 Price Waterhouse, Llp System and method for entity-based data retrieval
US5983220A (en) * 1995-11-15 1999-11-09 Bizrate.Com Supporting intuitive decision in complex multi-attributive domains using fuzzy, hierarchical expert models

Also Published As

Publication number Publication date
AU4391801A (en) 2002-11-21

Similar Documents

Publication Publication Date Title
US6631365B1 (en) Method and apparatus for analyzing the quality of the content of a database
US9613373B2 (en) System and method for retrieving and normalizing product information
US6377937B1 (en) Method and system for more effective communication of characteristics data for products and services
US6836777B2 (en) System and method for constructing generic analytical database applications
US9710457B2 (en) Computer-implemented patent portfolio analysis method and apparatus
US6236985B1 (en) System and method for searching databases with applications such as peer groups, collaborative filtering, and e-commerce
US8271476B2 (en) Method of searching text to find user community changes of interest and drug side effect upsurges, and presenting advertisements to users
US20060036510A1 (en) System and method for directing a customer to additional purchasing opportunities
Biega et al. Overview of the trec 2019 fair ranking track
Kennedy et al. Descriptive and predictive analyses of industrial buyers' use of online information for purchasing
US20080021892A1 (en) Process and system for matching product and markets
US20090037342A1 (en) Systems & methods for determining a value of an intellectual asset portfolio
Yeung Functional characteristics of commercial web sites: a longitudinal study in Hong Kong
EP1258814A1 (en) Method and apparatus for analyzing the quality of the content of a database
Niemir et al. Product Data Quality in e-Commerce: Key Success Factors and Challenges
Sadeddin et al. Online shopping bots for electronic commerce: the comparison of functionality and performance
AU785458B2 (en) Method and apparatus for analyzing the quality of the content of a database
Lu et al. Clustering e-commerce search engines based on their search interface pages using WISE-Cluster
US20080208883A1 (en) Method And System For A User-Customizable Interactive Physician Recall Message Database
JP2002351875A (en) Database contents quality analyzing method and device
Stolze et al. Towards scalable scoring for preference-based item recommendation
Sashi Structural differences between business and consumer markets
US7110990B2 (en) Decision analysis system and method
Mathieson et al. Improving the effectiveness and efficiency of appraisal reviews: an information systems approach
KR20040076692A (en) Method for trademark watching services and apparatus therefor