WO2020193985A1 - Improved system and method for data classification

Info

Publication number
WO2020193985A1
Authority
WO
WIPO (PCT)
Prior art keywords
destination
taxonomy
node
data items
data item
Prior art date
Application number
PCT/GB2020/050820
Other languages
French (fr)
Inventor
Geoffrey Paul AINSWORTH
Keir Joseph MURPHY
Drew Anthony Peter SMITH
Original Assignee
Upp Technologies Group Ltd
Priority date
Filing date
Publication date
Application filed by Upp Technologies Group Ltd
Publication of WO2020193985A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]

Definitions

  • the present invention relates to systems and methods for the classification of data items within a hierarchical taxonomy.
  • the present invention has particular applicability in the area of electronic commerce, where each data item relates to a product listing.
  • Each channel establishes its own taxonomy which is typically organised hierarchically to help consumers locate relevant products offered by a variety of product sellers.
  • Such hierarchical taxonomies comprise a collection of nodes (e.g. categories) that have a parent or child relationship with adjacent nodes in the taxonomy.
  • the node or category "Clothing" may be the parent of the sub-category/node "Men", as well as other sub-categories/nodes such as "Women" and "Children".
  • a single seller will typically seek to offer its product range via various channels, but as different channels operate different taxonomies, a seller must determine an appropriate location within each taxonomy to list each and every product. Whilst a seller may operate a source taxonomy for their own product inventory, there is a need to map from this source taxonomy to potentially several destination taxonomies operated by various channels. Furthermore, it would be typically desirable for the seller to list products within a category in which other very similar products are already listed (e.g. by a seller's competitor).
  • a computer-implemented method of classifying a plurality of data items within a destination taxonomy is provided. Ideally, this is achieved by assigning each data item with a destination classifier.
  • the data item can thus be classified in accordance with that taxonomy.
  • classifiers represent nodes in a hierarchical taxonomy. Therefore, as every node has at least one parent or child relationship with one or more adjacent nodes, every classifier similarly has a corresponding parent and/or child relationship with other classifiers. Additionally, whilst classifiers may be represented as a classifier code, each classifier also has corresponding text components, such as a category title, allowing the structural arrangement of a taxonomy to be determined by a user or operator.
  • the computer-implemented method comprises determining the structure of the hierarchical destination taxonomy. This may include registering the parent-child relationships of each node with other nodes of the destination taxonomy.
  • the method comprises parsing text components of the destination taxonomy.
  • the method comprises registering from which nodal location within the destination taxonomy the text component originates.
  • this subsequently allows each text component to be treated differently for each node.
  • the method comprises generating a text component set for each node that includes those originating from the node.
  • the method comprises generating a text component set for each node that includes those originating from nearby nodes such as at least one of a parent, a child and a sibling node.
  • text components that are from nearby nodes can be used to assess the context of a particular node, and so aids in subsequent classification of data items into the destination taxonomy.
  • the method comprises assigning a weight to each text component of each set.
  • the weight depends, at least in part, on the relative difference in nodal location between the node of the set, and that from which the text component originates. Again, this aids in the subsequent classification of data items, as nearby nodes are useful for determining the contextual significance of a given node in the taxonomy.
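  • The per-node text component sets described above can be sketched as follows. This is an illustrative sketch only: the toy taxonomy, the function names and the numeric relation weights are assumptions, since the text says only that the weight depends on relative nodal location.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: node -> parent (None marks the root).
PARENT = {
    "Clothing": None,
    "Men": "Clothing",
    "Women": "Clothing",
    "Shoes": "Men",
}

# Assumed relation weights; the source does not give numeric values.
RELATION_WEIGHT = {"self": 1.0, "parent": 0.8, "child": 0.4, "sibling": 0.3}


def tokens(title):
    """Split a node title into lower-case text components."""
    return title.lower().split()


def related_nodes(node):
    """Yield (relation, other_node) pairs for the node and its neighbours."""
    yield "self", node
    parent = PARENT[node]
    if parent is not None:
        yield "parent", parent
    for other, other_parent in PARENT.items():
        if other_parent == node:
            yield "child", other
        elif other != node and parent is not None and other_parent == parent:
            yield "sibling", other


def weighted_component_set(node):
    """Build {text component: weight} for one node, keeping the largest weight."""
    weights = defaultdict(float)
    for relation, other in related_nodes(node):
        for token in tokens(other):
            weights[token] = max(weights[token], RELATION_WEIGHT[relation])
    return dict(weights)
```

    Calling `weighted_component_set("Men")` gathers components from the node itself, its parent "Clothing", its child "Shoes" and its sibling "Women", each carrying the weight of its relation.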
  • the method comprises parsing text components from a first set of the plurality of data items.
  • the method comprises assigning a weight to each text component for a respective data item.
  • the method comprises calculating the level of correlation between nodes of the destination taxonomy and data items, such as data items of the first set.
  • the level of correlation is calculated on the basis of their respective weighted text components.
  • the level of correlation may be represented by a confidence score.
  • the method comprises classifying data items of the first set of data items by assigning a destination classification code to them if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold.
  • the method comprises applying a validation process to other data items of the first set if the level of correlation, as represented by an applicable confidence score, is below the predetermined threshold.
  • the validation process results in the assignment of a destination classification code to the other data items.
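  • One way to picture the confidence score and threshold test is the sketch below. The scoring formula (a weighted overlap normalised by the item's total weight), the threshold value and all names are assumptions; the text does not specify how the level of correlation is computed.

```python
THRESHOLD = 0.5  # assumed value; the source only says "predetermined"


def confidence(item_weights, node_weights):
    """Weighted overlap of shared text components.

    Stays within 0..1 provided node weights do not exceed 1.
    """
    total = sum(item_weights.values())
    if total == 0:
        return 0.0
    shared = sum(
        weight * node_weights[token]
        for token, weight in item_weights.items()
        if token in node_weights
    )
    return shared / total


def classify(item_weights, node_sets, threshold=THRESHOLD):
    """Return (best_node, score, needs_validation) for one data item."""
    scores = {node: confidence(item_weights, ws) for node, ws in node_sets.items()}
    best = max(scores, key=scores.get)
    # Scores below the threshold are routed to the validation process.
    return best, scores[best], scores[best] < threshold
```

    An item whose best score falls below the threshold is not classified directly; the third return value flags it for the validation process.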
  • the step of parsing text components of the destination taxonomy and/or from the first set of the plurality of data items further comprises pre-processing to remove predetermined low-semantic-value text components.
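  • A minimal sketch of this pre-processing, assuming the low-semantic-value components are a fixed stop list (the list contents and the function name are illustrative, not taken from the source):

```python
import re

# Assumed stop list; the source only calls these components "predetermined".
LOW_VALUE = {"a", "an", "and", "for", "in", "of", "the", "with"}


def parse_components(text):
    """Lower-case the text, split it into words and drop low-value words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in LOW_VALUE]
```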
  • the step of assigning a weight to each text component of each node set comprises assigning a higher weight to text components originating from the same or parent nodal locations within the destination taxonomy than text components originating from child or sibling nodal locations.
  • the step of assigning a weight to each text component comprises assigning a higher weight to text components derived from a title attribute than a description attribute.
  • the weight assigned to text components derived from a title attribute may be at least 10 times greater than the weight assigned to text components derived from the description attribute.
  • the weight assigned to text components derived from a title attribute may be at least 20 times greater than the weight assigned to text components derived from the description attribute.
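  • The title-over-description weighting can be illustrated as below. The 10:1 ratio matches the lower bound mentioned above; the attribute names and the choice to keep the larger weight when a word appears in both fields are assumptions.

```python
TITLE_WEIGHT = 10.0        # assumed: title components weigh 10x description ones
DESCRIPTION_WEIGHT = 1.0


def item_component_weights(title, description):
    """Return {text component: weight}, keeping the larger weight per word."""
    weights = {}
    for text, weight in ((description, DESCRIPTION_WEIGHT), (title, TITLE_WEIGHT)):
        for token in text.lower().split():
            weights[token] = max(weights.get(token, 0.0), weight)
    return weights
```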
  • the method further comprises monitoring the destination taxonomy over time for changes.
  • the method may comprise taking predetermined actions in response to detecting a change in the destination taxonomy.
  • the method may further comprise reclassifying data items that belong to nodes affected by the detected change.
  • it can be advantageous to reclassify data items that are classified under a node that has changed, or one of the nearby nodes - especially those that are descendant from a node that has changed.
  • the method may further comprise reclassifying data items that belong to nodes descendant from a parent node itself having a descendant node (direct, or indirect) that has been detected to have changed.
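  • A sketch of change detection and descendant-based reclassification, under the assumption that each taxonomy snapshot is held as a mapping from a node to its list of children (the snapshot format and function names are hypothetical):

```python
def changed_nodes(old_children, new_children):
    """Return nodes whose set of direct children differs between snapshots."""
    nodes = set(old_children) | set(new_children)
    return {
        node
        for node in nodes
        if sorted(old_children.get(node, [])) != sorted(new_children.get(node, []))
    }


def descendants(children, node):
    """Return every direct or indirect descendant of a node."""
    found, stack = [], list(children.get(node, []))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children.get(current, []))
    return found


def items_to_reclassify(children, items_by_node, changed_node):
    """Collect items listed under the changed node or any of its descendants."""
    affected = [changed_node] + descendants(children, changed_node)
    return sorted(item for node in affected for item in items_by_node.get(node, []))
```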
  • the method comprises applying a performance-based classification strategy comprising classifying a data item under a first trial classification code within the destination taxonomy, monitoring performance characteristics of that data item whilst classified under the first trial classification code within the destination taxonomy, comparing those performance characteristics with other performance characteristics resulting from classifying the data item and/or similar data items under a second trial classification code, and reclassifying the data item under the classification code that has the most optimal performance characteristics, as determined by the performance characteristic comparison.
  • the step of monitoring the performance characteristic may comprise determining one or more performance metrics relating to operations performed in respect of that data item.
  • a performance metric is increased with a greater number of operations predetermined as positive (such as the number of times that data item is viewed, located in a search and/or is subject to a sale transaction whilst listed under that trial classification code within the destination taxonomy).
  • the performance metric may be reduced with a greater number of operations predetermined as negative (such as the return or negative review left for a product associated with a respective data item).
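  • The performance-based strategy can be sketched as a simple score per trial classification code. The event names and their weights are assumptions chosen for illustration; the source names views, search hits and sales as positive operations and returns and negative reviews as negative ones, but gives no weighting.

```python
# Assumed event weights for illustration only.
POSITIVE = {"view": 1, "search_hit": 1, "sale": 10}
NEGATIVE = {"return": 5, "negative_review": 5}


def performance_metric(events):
    """Add weights for positive operations and subtract them for negative ones."""
    score = 0
    for event in events:
        score += POSITIVE.get(event, 0)
        score -= NEGATIVE.get(event, 0)
    return score


def best_trial_code(events_by_code):
    """Pick the trial classification code with the highest metric."""
    return max(events_by_code, key=lambda code: performance_metric(events_by_code[code]))
```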
  • the method comprises applying a validation process.
  • the validation process may be applied to a subset of the plurality of data items - for example data items not in the first set of data items, or other data items of the first set of data items where the calculated level of correlation, as represented by an applicable confidence score, is below a predetermined threshold.
  • the validation process may include applying a manual validation process.
  • the manual validation process may comprise (a) presenting identifiers of each data item, via an operator interface, to an operator with a set of operator-selectable options each describing a respective proposed destination classifier, ideally selected from those calculated to have the highest level of correlation to that data item, and (b) receiving, via the operator interface, a selection of one of the operator-selectable options, and in response assigning the corresponding destination classification code to that data item.
  • the validation process may comprise processing data items differently or more intensively to determine a better confidence score and/or match with an appropriate destination classification code.
  • the validation process may utilise image recognition to generate text components for use in determining the level of correlation between a data item and a node of the destination taxonomy.
  • the validation process may comprise: determining an image that is associated with a corresponding data item (e.g. one of the other data items of the first set); performing image recognition of that image to generate additional text components associated with that data item; assigning a weight to each additional text component; calculating the level of correlation between that data item and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; and classifying that data item by assigning a destination classification code to it if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold.
  • Image recognition may comprise: comparing and matching the image with at least one destination image provided on a database from which the destination taxonomy is derived, destination images being provided together with a corresponding destination text component; and processing the destination text components to generate additional text components for association with the data item.
  • the method may further comprise detecting an incomplete data item of the plurality of data items, and in response generating restorative data.
  • a data item may be detected as incomplete when it is found to lack a set of information, such as one or more values, attributes and attribute-value pairs.
  • the generated restorative data can be stored and used to recomplete the otherwise incomplete data item.
  • the recompleted data item can be added to a set of data items for subsequent classification, such as the first set of data items.
  • the method may comprise generating restorative data for a data item by processing data that already exists for that data item to generate text components for use as restorative data.
  • the method may comprise processing image data that already exists for that data item by performing image recognition on said image data to generate text components for use as restorative data to recomplete that data item.
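  • A minimal sketch of incomplete-item detection and recompletion, assuming a fixed list of required attributes and restoring only a missing description by reusing the title text. The attribute names are hypothetical, and the image recognition route is not modelled here.

```python
REQUIRED_ATTRIBUTES = ("title", "description", "size")  # assumed requirement list


def missing_attributes(item):
    """List required attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not item.get(name)]


def restorative_data(item):
    """Derive stand-in values from data that already exists for the item.

    Only the description is restored in this sketch, from the title.
    """
    restored = {}
    if "description" in missing_attributes(item) and item.get("title"):
        restored["description"] = item["title"]
    return restored


def recomplete(item):
    """Return a copy of the item with restorative data filled in."""
    return {**item, **restorative_data(item)}
```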
  • the method comprises determining a second set of the plurality of data items that each have a source classifier that represents a node within a source taxonomy, and applying a mapping between the source taxonomy and the destination taxonomy to thereby classify the second set of the plurality of data items according to the destination taxonomy.
  • the method comprises determining a second set of the plurality of data items that each have a source classifier that represents a node within a source taxonomy and then processing the source taxonomy in the same way as the destination taxonomy to determine sets of weighted text components.
  • this allows the same computer-implemented processing steps to be efficiently repurposed.
  • the method may comprise determining the structure of the source taxonomy, parsing its text components, generating a text component set for each node of the source taxonomy, and assigning a weight to each text component of each node set.
  • the method may further comprise calculating the level of correlation between nodes of the source and destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score.
  • the method may comprise generating a source-to-destination map comprising node mappings from each node of the source taxonomy to one or more nodes of the destination taxonomy - in particular, those calculated to be the most correlated. Ideally, each mapping is stored together with an associated confidence score.
  • the method may comprise classifying data items of the second set by assigning a destination classification code to them if, according to the generated source-to-destination map, the source-to-destination confidence score of an applicable source-to-destination mapping is above a predetermined threshold. In other words, if the source classification code of a data item of the second set confidently maps on to a node of the destination taxonomy, it can be immediately classified under that destination taxonomy.
  • the method may comprise applying a validation process to other data items of the second set if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is below the predetermined threshold.
  • the validation process ideally results in the assignment of a destination classification code to the other data items.
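  • Applying a source-to-destination map with a confidence threshold might look like the sketch below. The map contents, the destination codes, the threshold value and the function name are all hypothetical.

```python
# Hypothetical map: source code -> list of (destination code, confidence score).
SOURCE_TO_DEST = {
    "CL/ME/SH": [("dest-101", 0.92), ("dest-205", 0.41)],
    "CL/WO/KN": [("dest-330", 0.38)],
}
THRESHOLD = 0.7  # assumed value; the source only says "predetermined"


def map_item(source_code, mapping=SOURCE_TO_DEST, threshold=THRESHOLD):
    """Return (destination_code, needs_validation) for one source code."""
    candidates = mapping.get(source_code, [])
    if not candidates:
        return None, True
    destination, score = max(candidates, key=lambda pair: pair[1])
    if score >= threshold:
        return destination, False
    # No confident mapping: leave unclassified and hand over to validation.
    return None, True
```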
  • each data item comprises at least one of:
  • a source classifier relating to the location of the data item within a source taxonomy, (for example expressed as a classification key)
  • At least one text field such as:
  • a description text field the value of which is a description of a product to which the data item relates
  • data items need not be pre-categorised i.e. the invention is able to classify previously unclassified data items, and reclassify those that have been previously classified.
  • the method may comprise source-destination taxonomy analysis/mapping, such as destination taxonomy monitoring, with, optionally, remapping occurring in response to a monitored change.
  • the method may comprise determining mappings between source and destination taxonomy by category matching with a confidence score, e.g. via a weighted bag-of-words counting technique.
  • the system may comprise an automated classifier generator for generating at least one proposed classifier for notional assignment to a data item, and a confidence score associated with each proposed classifier.
  • the proposed classifier may be generated by text analysis and/or image analysis.
  • the system may comprise a classification validator, optionally carrying out the steps of: determining if the confidence score associated with a proposed classifier is below a predetermined threshold value, and:
  • According to a second aspect of the present invention there is provided a system for classifying a plurality of data items within a destination taxonomy.
  • the second aspect of the invention may reside in a computer-implemented classification system for classifying data items according to a hierarchical destination taxonomy by assigning each data item with a destination classifier that is representative of a node in the destination taxonomy.
  • the system may comprise means for carrying out the steps of the method according to the first aspect of the present invention.
  • the system may comprise at least one of a database configured to store the destination hierarchy, and computing resources configured to carry out one or more method steps according to the first aspect of the invention.
  • the computing resources may comprise a processor and a memory.
  • the system may further comprise an operator interface.
  • the system may also comprise at least one of: a first interface for reading a taxonomy, such as a channel taxonomy of a channel database, the destination taxonomy being determined from the reading via the first interface of that taxonomy; and a second interface for reading and/or updating a taxonomy, such as a product taxonomy of a seller product database, the source taxonomy being determined from the reading via the second interface of that taxonomy.
  • the system may be configured to: determine the structure of the hierarchical destination taxonomy including parent-child relationships of each node with other nodes of the taxonomy; parse text components of the destination taxonomy, each text component being registered as originating from a respective nodal location within the destination taxonomy; generate a text component set for each node that includes those originating from the node, and those originating from at least one of a parent, a child and a sibling node; assign a weight to each text component of each set depending, at least in part, on the relative difference in nodal location between the node of the set, and that from which the text component originates; parse text components from a first set of the plurality of data items, each text component being assigned a weight for a respective data item; calculate the level of correlation between each data item of the first set and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; and/or classify data items of
  • the system may also apply a validation process to other data items of the first set if the level of correlation, as represented by an applicable confidence score, is below the predetermined threshold, the validation process resulting in the assignment of a destination classification code to the other data items.
  • the system may be configured to process a second set of data items that are preassigned with a source classification code that denotes a location of a respective data item within a source taxonomy, the processing comprising: loading the source taxonomy and the destination taxonomy into the database; comparing the source and destination taxonomies to generate a source-to-destination taxonomy map that includes a plurality of source-to-destination mappings each having a corresponding confidence score; classifying data items of the second set by assigning a destination classification code to them if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is above a predetermined threshold; and applying a validation process to other data items of the first set if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is below the predetermined threshold.
  • an aspect of the invention extends to a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method of the first aspect of the invention.
  • Figure 1 is a top-level schematic view of a classification system according to a first embodiment of the present invention;
  • Figure 2 represents an example extract from a product database for use by the classification system of Figure 1;
  • Figure 3 shows an example source taxonomy to destination taxonomy map for use by the classification system of Figure 1;
  • Figure 4 shows a top-level flow diagram of the method used by the classification system of Figure 1 to perform classification;
  • Figure 5 is a flow diagram of the steps of an example method of generating a source-to-destination taxonomy map, as carried out by the classification system of Figure 1;
  • Figure 6 shows an example graphical representation of part of a product taxonomy for use by the classification system of Figure 1.

Specific description of the preferred embodiments

  • Figure 1 is a top-level schematic view of a classification system 1 according to a first embodiment of the present invention.
  • the classification system 1 is configured to implement a method of classification which also accords to a first embodiment of the present invention, and as will be described in greater detail below.
  • classification system 1 of the present embodiment is a product classification system 1 for classifying/categorising data items that relate to products that are to be hierarchically listed on an electronic commerce platform.
  • Figure 1 also shows the other principal systems with which the classification system 1 is configured to interface, namely, a channel system 2 and a seller system 3.
  • a channel system may be provided by an electronic commerce platform, such as Amazon® or eBay®.
  • a seller system may be that of a retailer or reseller of a range of products.
  • the product classification system 1 comprises system computing resources 10, such as a processor and memory, that facilitate the computerised execution of the classification process described below.
  • the product classification system 1 also comprises a database 11 in which data associated with the classification process is stored, and a user interface 15 via which an operator of the product classification system 1 can input data into and receive information from the system 1.
  • the product classification system 1 comprises a seller interface 13 for interfacing with a seller system 3, and specifically for reading information from, or writing information to a seller product database 30.
  • the seller product database 30 may be read by transferring a file (e.g. a product spreadsheet file) from the seller system 3 to the product classification system 1, processed, and then transmitted back to the seller system 3.
  • a set of entries of the product database 30 may be updated via the product classification system 1 sending update requests to the seller system 3.
  • the seller product database 30 has associated with it a hierarchical product taxonomy 30a, with each item of the seller product database 30 being classified according to (or having a location within) the taxonomy 30a.
  • This can be represented by assigning each product data item with a classification code.
  • Figure 2 represents an example extract from a product database, showing an example product record relating to a pair of shoes.
  • the location of a data item within the seller product taxonomy 30a is notionally codified by the value of a seller classification code attribute.
  • this is "CL/ME/SH", which relates to the category/node "shoes", which is a child node of the category/node "Mens", which itself is a child node of the category/node "Clothing".
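  • Reading such a code can be sketched as below, on the assumption that the code is a "/"-separated path from the root of the taxonomy and that a lookup table of segment titles is available (both are assumptions drawn from the single example given):

```python
# Assumed lookup of code segments to category titles.
CODE_TITLES = {"CL": "Clothing", "ME": "Mens", "SH": "Shoes"}


def code_to_path(code):
    """Translate a classification code into its category path, root first."""
    return [CODE_TITLES[part] for part in code.split("/")]


def parent_code(code):
    """Return the code of the parent node, or None for a root-level code."""
    head, separator, _ = code.rpartition("/")
    return head if separator else None
```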
  • nodes of a taxonomy 30a may not necessarily be restricted to categories, but may include other attributes such as fields or values.
  • attributes such as the size of a shoe product may be defined as a node within a taxonomy, and the possible values such an attribute can take as further associated nodes, especially if those values are restricted to a range of values, or a limited number of predetermined options - as opposed to free text, for example.
  • Figure 6 shows an example graphical representation of part of a product taxonomy, with a category node "Clothing" being shown in expanded form to display many direct child category nodes associated with it (e.g. Child, Men, Baby, Girls, Boys, Novelty & Special Use, and Women) as well as associated attributes (e.g. colour, season, collection, etc).
  • the child node "Women" is also expanded to show its further sub-category child nodes (e.g. Knitwear, Swimwear, Jeans, Maternity, etc) as well as associated attributes.
  • the attribute “season” may be restricted to four predetermined options: “spring”, “summer”, “autumn”, and “winter”, and these value restrictions are specifically codified in the taxonomy (although not shown in Figure 6).
  • a GB shoe size attribute may be restricted to whole numbers, or 0.5 increments between those whole numbers, within the range 3 to 13.5.
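  • The two value restrictions just described can be expressed as simple checks. The function names are illustrative; the permitted values come from the examples above.

```python
SEASONS = {"spring", "summer", "autumn", "winter"}


def valid_season(value):
    """Accept only the four predetermined season options."""
    return value in SEASONS


def valid_gb_shoe_size(size):
    """Accept whole or half sizes within the range 3 to 13.5 inclusive."""
    return 3 <= size <= 13.5 and float(size * 2).is_integer()
```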
  • Figure 4 shows a top-level flow diagram of the method used by the classification system 1 to perform classification.
  • a first step of the method 41 comprises loading the source and destination taxonomies into the database 11 of the system 1:
  • the seller product database and its associated taxonomy 30a can be ingested by the product classification system 1, and stored within the database 11 as a source taxonomy ST for subsequent computing operations.
  • Taxonomies can be stored on the database 11 of the product classification system 1 in a variety of forms - for example as a relational database or a graph database. It should be noted however that the latter is preferred, due to its speed, scalability and suitability for representing hierarchical taxonomies.
  • the seller system 3 may provide a product database 30 without an accompanying predefined taxonomy 30a.
  • a taxonomy 30a may be automatically generated by the product classification system 1 as will be discussed later.
  • a source taxonomy ST derived from the seller product database 30 can be established on the database 11 of the product classification system 1.
  • the product classification system 1 also comprises a channel interface 12 for interfacing with a channel system 2, and for reading from it the structure of the channel database 20 to derive a destination taxonomy DT, which can also be obtained and stored in a similar way on the database 11 of the product classification system 1.
  • the channel interface 12 of the product classification system 1 is also configured to list or update a set of products on the channel database.
  • the product classification system 1 therefore acts as an intermediary between a seller and a channel. It should be noted that the product classification system 1 will normally only have authorisation to list or update a set of products relating to the seller's products. It will not be able to change product listings that have been uploaded by other users of the channel - although it may be able to read data items relating to those products, providing they are publicly listed and accessible (e.g. via a web interface). Also, it should be noted that whilst the product classification system 1 is able to read the hierarchical structure of the channel database 20, it cannot change that structure.
  • the second step 42 of the classification process is to compare the source and destination taxonomies. Assuming there are differences between a source taxonomy ST and destination taxonomy DT, the classification system 1 will need to perform a mapping operation that generates a source-to-destination map between the source taxonomy ST and the destination taxonomy DT.
  • product data items classified under the source taxonomy ST can be more efficiently remapped to the destination taxonomy DT, and so be automatically listed within an appropriate location within a channel system 2.
  • the product classification system 1 is configured to monitor the channel system 2 to determine when and how an update is made to restructure the channel database 20, and in response adjust the destination taxonomy DT, and consequently the source-to-destination map.
  • the source-to-destination map generally comprises a list of nodes (or classifier codes representing those nodes) of the source taxonomy associated with a corresponding list of nodes of the destination taxonomy together with a normalised confidence score (from 0 to 1) that represents the determined strength of association between the two linked nodes. It should also be noted that, for each source taxonomy node, there may be many possible destination taxonomy nodes, each with its own confidence score. This can be codified in a variety of forms, but for ease of understanding could be represented in a table.
  • Figure 3 shows an example source taxonomy to destination taxonomy map in the form of a table, with one node of a source (seller) taxonomy mapped to two possible nodes of a destination (channel) taxonomy, a confidence score associated with each source to destination node mapping. It will be understood that more than two maps may be provided in practice.
  • Each node is represented here in Figure 3 as a classification code.
  • mapping itself is performed progressively and hierarchically, where more ancestral nodes of the source taxonomy - i.e. closer to the "root" of the hierarchy - are mapped at a higher priority than the nodes that are closer to the leaves of the hierarchy.
  • Source and destination taxonomies differ in terms of hierarchical category depth. Some split and some have very fine sub-categories while others have more coarse-grained sub-categories.
  • the order of the nodes or categories may be different across different taxonomies. For example, you may have Clothing>Mens>Sports or Clothing>Sports>Mens. Some categories present in one taxonomy (e.g. the source ST) may not be represented in another (e.g. the DT). This necessitates registering this, and then establishing a map to a generic category in the destination taxonomy.
  • a set of textual components is established from both the source and destination taxonomies and compared with one another using a matching algorithm in order to determine the contextual similarity between a node in the source taxonomy and that of the destination taxonomy, and thereby a confidence score can be generated.
  • a "bag of words" model can be used as the basis for this comparison, using a "distance-based" matching algorithm (e.g. Levenshtein distance) to determine a similarity between each word in the set, and also the overall difference between the set of words. Whilst this can generate category matches with a confidence score, there are certain drawbacks to this approach. Notably, confidence scores will be low (and/or there would be incorrect classification) where: attribute data is sparse in either the source or destination taxonomy;
  • one category on a channel database such as eBay® has more than
  • a pre-processing step is first performed by the product classification system 1 before matching, the pre-processing reducing noise levels and providing domain synonyms.
  • pre-processing can include: removing common words like "a", "the", "and" which would otherwise falsely positively skew the match between two nodes; and
  • FIG. 5 is a flow diagram of the steps of an example method 50 of generating a source-to-destination taxonomy map.
  • a first step 51 comprises obtaining the text components from the source and destination taxonomies. This is achieved by traversing each taxonomy structure and parsing text components from each node.
  • a prime example of a suitable text component is the category title (c.f. the value of the "title field" of Figure 2).
  • each node will be assigned a set of parsed text components.
  • a second step 52 comprises removing low-semantic-value text components, especially those that are unlikely to form the basis of distinguishing one node from another.
  • Examples are commonly-used words such as determiners ("a", "an", "the", "this", "that", etc.). This is done to minimise the computational burden of processing such words during a comparison.
  • a third step 53 comprises generating a filtered text component set for each node. This is achieved as a by-product of the second step 52, but additional filtering may be employed to further filter out (or in) additional text component terms.
  • a fourth step 54 comprises assigning a weight to each text component which represents its relative importance to the node to which it has been assigned. Notably, a higher weight is applied to the text component derived from the title of the respective node than to those derived from the titles of adjacent nodes, which in turn are given greater weight than text components derived from fields or attributes.
  • a fifth step 55 comprises calculating a level of correlation between each node of the source taxonomy, and a number of candidate nodes of the destination taxonomy. This is achieved by comparing all the text components (+ weights) associated with a node in the source taxonomy with all the text components (+ weights) associated with candidate nodes of the destination taxonomy. As mentioned, the basis of this comparison can be a matching algorithm which generates, as an output, a confidence score of the link between the source and destination nodes.
  • a sixth step 56 comprises populating a map with this information.
  • Each node from the source taxonomy is stored in the map together with the 'X' most correlated nodes from the destination taxonomy, as indicated by a corresponding confidence score.
  • 'X' is between 2 and 10, and more ideally 5, so as to achieve a balance between storing superfluous data, and providing a viable set of alternative nodes to choose from in the event that the node rated with the highest confidence score is later determined to be inappropriate via a validation process (e.g. user validation).
  • a single node of the source taxonomy will be mapped on to up to five nodes of the destination taxonomy, each with a confidence score. It should be noted that there may not be a direct 1-to-1 map between a node of the source taxonomy and that of the destination taxonomy and so generalised, default or "catch-all" categories may be selected where a specific mapping is not possible. For example, the node-path Clothing>Mens>Sports>Bowling on a source taxonomy may get mapped to
  • this situation can be used as a prompt to the product classification system 1 to conduct a more sophisticated comparison between the taxonomies. Moreover, this situation can be used by the system 1 to handle the reclassification of product data item into an appropriate node in the destination taxonomy in a more nuanced way. This situation may occur where the source taxonomy has an only relatively generalised category node, whereas the destination taxonomy has more fine-grained category nodes, for example.
  • the system 1 optionally performs the fourth step 44 of the process, which is a manual validation process.
  • each source node is presented via an operator interface 15 to an operator of the system 1 alongside multiple destination nodes. These are presented as operator-selectable options which, when selected, update the confidence score to significantly increase it (e.g. increase to 1) so that an unambiguous link between a source node and a destination node can be determined. From this it is possible to place products originally classified under a source classifier under a destination classifier.
  • the system 1 so far described attempts matching between the taxonomic structures of source and destination databases.
  • This relies on the database 30 imported into the classification system 1 having an existing taxonomy - such as the product taxonomy 30a of the seller product database 30.
  • a natural extension of the system 1 is to be able to process databases (or parts of them) that do not have an already-established taxonomy.
  • This also includes individual database records, in the form of data items. Moreover, this can enable the combination of two or more sets of data items or records, one of which may not be part of a taxonomically structured database.
  • the classification system 1 and associated method is able to classify previously unclassified data items.
  • the seller interface 13 of the system 1 reads in a seller product database 30 that does not have an accompanying predefined taxonomy 30a at all.
  • the seller product database 30 may not have a seller classification code attribute (i.e. the first column in Figure 2 is entirely missing).
  • the seller product database 30 may have a seller classification code attribute, but certain data items in the seller product database 30 have a null or zero value under that attribute.
  • the classification system is configured to implement an automated classifier generator for generating at least one proposed classification code for assignment to a corresponding data item.
  • This may simply be a classification code that is part of the destination taxonomy DT.
  • the classification code may be part of an existing source taxonomy ST for subsequent mapping to a destination taxonomy DT.
  • the automated classifier generator is configured to estimate the proposed classifier code for a data item by processing other attribute-value pairs of that data item. Specifically, the automated classifier generator applies a process that is very similar to that already described in relation to Figure 5 in that attribute-value pairs of a data item are parsed to extract text components. Each text component is then attributed a weight for subsequent comparison against those of a destination taxonomy DT. As before, a higher weight is applied to text components derived from a title field of a data item (vs. text components of any other field), as this is more representative of the likely category and so classifier code of a data item. Ideally, the weighting of a text component derived from the title attribute is between 10 and 50 times greater compared to other text components, especially those from a description attribute.
  • the automated classifier generator is configured to generate a confidence score associated with each proposed classifier. This represents the likelihood that the proposed classifier is appropriate for the data item. If the confidence score is below a threshold score, then the automated classifier generator is configured to take further actions, such as initiating a validation process. For example, a manual validation process similar to that described in relation to step 44 of Figure 4 can be initiated.
  • identifiers of a data item are presented via an operator interface 15 to an operator of the system 1.
  • one or more values, or attribute-value pairs of the data item are presented as descriptive identifiers of that data item. Therefore, identifiers are presented in a form and manner that allows the operator to decide what the subject matter of a data item concerns. For example, where the data item corresponds to a particular product record, title and description values may be provided (e.g. "Shoe: A stylish suede loafer..."). If a data item includes an image, the image may also be displayed as an identifier.
  • Such identifiers are presented via the operator interface 15 to the operator at the same time and alongside a set of proposed classifiers.
  • the proposed classifiers are ideally presented with a respective classification description (e.g. "Clothing / Mens / Shoes") - again to help a human operator understand which category is being proposed by the system 1.
  • the proposed classifiers are presented as operator-selectable options which, when selected, increase the confidence score of the proposed classifier to a data item.
  • previously unclassified data items can be categorised in a newly-generated and operator-validated product taxonomy.
  • This may be that of the seller system 3, a corresponding source taxonomy ST, or a destination taxonomy DT associated with a channel system 2.
  • This is achieved via the automatic or system-assisted assignment of an appropriate classification code to a data item.
  • the automated classifier generator generates an otherwise missing classification code value for a data item under the classification code attribute, allowing a previously unclassified data item to be classified.
  • data items that are incomplete in other ways can be processed in a similar manner to restore information that would otherwise be missing.
  • the above-described process can be generalised to encompass other attributes or values that may be missing. This may be as a result of the information being initially missing from the seller product database 30, or determined subsequently as being information that is useful or appropriate to include when mapping data items from the source taxonomy to the destination taxonomy.
  • a particular jacket may be classed within a source taxonomy ST under "Clothing / Mens / Jackets", but may be determined to be classifiable within a destination taxonomy DT under "Sports & Outdoor / Cycling Jackets", in which case attributes such as whether or not the jacket is waterproof, includes reflective details, etc. become useful and appropriate to include as attributes.
  • the product classification system 1 is additionally configured to detect the absence of a respective attribute or value in the seller product database 30, and in response, and utilising other information that is available, generate restorative data (i.e. values, or attribute-value pairs) to take the place of the otherwise missing data.
  • the restorative data can thus be added to recomplete otherwise incomplete data items.
  • the restorative data can be added by the product classification system 1 to the seller product database 30 via an update issued by the seller interface 13, and similarly transferred to the source taxonomy ST and destination taxonomy DT. Additionally, classification mappings can be determined on the basis of the restorative data as described above.
  • the restorative data generated for a particular data item is generated by processing the data that already exists for that particular data item.
  • many data items comprise one or more images that can be processed to generate restorative data - for example, turning image data into text data.
  • This can be particularly effective as the one or more images typically depict the subject (or part of the subject) of the data item.
  • the data item is a product record relating to a pair of shoes
  • images included as part of that product record may depict the pair of shoes, a single shoe, parts of a shoe, etc. as separate images.
  • both the subject (i.e. a pair of shoes) and many other characteristics of the subject (colour, style, adornments, material, etc.) can be identified by the system computing resources 10 applying image recognition techniques to each of the images to extract and identify features of the images.
  • data items that comprise images may do so directly, such as storing the image data in a suitable data format as part of the data item.
  • the data items may comprise images indirectly: for example via a link such as a Uniform Resource Locator to the image.
  • linked-to images are held in another part of the database 11 of the system 1, or otherwise accessible to the system 1 - for example as part of the seller system 3.
  • the image recognition performed by the product classification system 1 is assisted by its connection to the channel system 2.
  • the channel system 2 typically has a complete database and taxonomy 20 that includes both images, classifiers and other attributes.
  • image recognition involves comparing an image of an incomplete data item to corresponding images accessible via the database of the channel system 2, determining a suitable match, and generating the restorative data - such as descriptions, attributes and a classification code - on the basis of the data in the channel database 20 corresponding to the matched image.
  • this can also provide a very high confidence mapping to a destination classification code.
  • an image of an incomplete data item may be passed by the product classification system 1 to a generic image recognition service (typically trained on a large generic image dataset), and receive in response, in text form, a generic estimate of the subject of an image (e.g. "shoe").
  • the generic estimate can then be used to query a specific subset of the database of the channel system 2 to determine a subset of images against which the original image of the incomplete data item can be compared - thereby to determine better restorative data - i.e. a more comprehensive set of attributes/values with which to recomplete otherwise incomplete data items.
  • a further optional step is to include a manual validation process as already described above - i.e. whereby an operator interface 15 presents operator-selectable options to enable manual validation of which of the automatically-determined attributes/values should be used to complete an otherwise incomplete data item.
  • the operator or user interface 15 may be in the form of a graphical user interface that is hosted by the product classification system 1, but is accessed and controlled remotely - for example via a device of the seller system 3 using a web or mobile interface.
  • the system 1 described above enables a user (e.g. a seller) to action the ingestion by the product classification system 1 of a set of product records (i.e. data items) of a seller product database 30, with the images from those product records being used by the product classification system 1 to recognise the type of each one of those products.
  • the user/seller can, taking the role of the operator, be provided with suggestions of attributes/values which are relevant
  • characteristics of a product can be based on inferences from information accessible via the channel system 2 (e.g. text, further images of other products).
  • the user interacts with the suggestions, and selects an appropriate one of the automatically-determined suggestions.
  • the product classification system 1 is configured to apply iterative classification strategies to data items, such as product records.
  • a first classification strategy may comprise the automated generation of restorative data. This can be followed by a second classification strategy of manually validating that the restorative data is appropriate to a data item, and then by a third classification strategy that involves updating the classification code (and/or the associated weight) of a data item in response to the manually-validated restorative data.
  • the product classification system 1 is configured to apply an additional performance-based classification strategy that may be used to complement the others mentioned above.
  • the performance-based classification strategy is particularly apt for the classification of products that are to be sold via a channel system 2, but can be applied in other contexts as well.
  • the performance-based classification strategy comprises: classifying a data item under a first trial classification code within the destination taxonomy; monitoring performance characteristics of that data item whilst classified under the first trial classification code within the destination taxonomy (e.g. the number of times that data item is viewed, located in a search and/or an associated product sold whilst listed under that trial classification code); comparing those performance characteristics with other performance
  • a product listed under "Sports & Outdoor / Cycling Jackets" may sell better than the same product listed under "Clothing / Mens / Jackets", in which case the former classification is chosen by the product classification system 1 as the most appropriate for that product.
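
The map-generation steps listed above (parse text components, drop low-semantic-value words, weight, correlate, keep the 'X' best candidates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the weights of 3.0 and 1.0, the weighted-overlap score and all function names are hypothetical. Only the idea of storing up to X = 5 destination candidates per source node, each with a confidence score from 0 to 1, comes from the text.

```python
from collections import Counter

# Assumed stop-word list; the text only names a few determiners as examples.
STOP_WORDS = {"a", "an", "the", "this", "that", "and"}


def components(path: str) -> Counter:
    """Parse a node path such as 'Clothing>Mens>Sports' into weighted words.

    Assumption: words from the node's own title (the last segment) get
    weight 3.0 and words from ancestor titles get weight 1.0.
    """
    segments = [s.strip().lower() for s in path.split(">")]
    weights = Counter()
    for i, segment in enumerate(segments):
        weight = 3.0 if i == len(segments) - 1 else 1.0
        for word in segment.replace("&", " ").split():
            if word not in STOP_WORDS:
                weights[word] += weight
    return weights


def confidence(src: Counter, dst: Counter) -> float:
    """Weighted overlap normalised to 0..1 (an assumed similarity measure)."""
    shared = sum(min(src[w], dst[w]) for w in src.keys() & dst.keys())
    total = max(sum(src.values()), sum(dst.values()))
    return shared / total if total else 0.0


def build_map(source_paths, destination_paths, x=5):
    """Map each source node to its x most correlated destination nodes."""
    dst = {p: components(p) for p in destination_paths}
    result = {}
    for source_path in source_paths:
        src = components(source_path)
        scored = sorted(
            ((confidence(src, dc), dp) for dp, dc in dst.items()),
            reverse=True,
        )
        result[source_path] = [(dp, round(score, 3)) for score, dp in scored[:x]]
    return result
```

Because the score here depends on which segment is the node's own title, Clothing>Mens>Sports and Clothing>Sports>Mens correlate strongly but not perfectly, which is one simple way to reflect the ordering differences between taxonomies noted above.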

Abstract

A computer-implemented method and system of classifying a plurality of data items within a hierarchical destination taxonomy is described. Data items are assigned with a destination classifier that is representative of a node in the destination taxonomy. Text component sets for each node are generated that have a weight assigned to them. Weighted text component sets for each data item are also generated. A level of correlation between data items and nodes of the destination taxonomy is calculated on the basis of their respective weighted text components, a high level of correlation leading to the assignment of a destination classification code.

Description

Improved system and method for data classification
Field of the invention
The present invention relates to systems and methods for the classification of data items within a hierarchical taxonomy. The present invention has particular applicability in the area of electronic commerce, where each data item relates to a product listing.
Background to the invention
In the area of electronic commerce, various channels exist through which sellers can list a range of their products for sale. Well-known channels are provided by electronic commerce platforms such as Amazon® and eBay®, for example.
Each channel establishes its own taxonomy which is typically organised hierarchically to help consumers locate relevant products offered by a variety of product sellers. Such hierarchical taxonomies comprise a collection of nodes (e.g. categories) that have a parent or child relationship with adjacent nodes in the taxonomy. For example, the node or category "Clothing" may be the parent of the sub-category/node "Men", as well as other sub-categories/nodes like "Women" and "Children".
A single seller will typically seek to offer its product range via various channels, but as different channels operate different taxonomies, a seller must determine an appropriate location within each taxonomy to list each and every product. Whilst a seller may operate a source taxonomy for their own product inventory, there is a need to map from this source taxonomy to potentially several destination taxonomies operated by various channels. Furthermore, it would be typically desirable for the seller to list products within a category in which other very similar products are already listed (e.g. by a seller's competitor).
Manual classification is possible. A domain expert can manually choose the right classification for each product listing, and so the appropriate location within a taxonomy. However, doing this for each and every product is time-consuming, expensive and laborious. This must also be repeated for every different destination taxonomy.
This problem is further compounded by the dynamic nature of the taxonomies operated by each channel. Categories within the taxonomy are frequently added, removed, renamed, or moved, leading to restructuring of the taxonomy. This necessitates iteratively remapping each product listing to the appropriate location within the destination taxonomy. Therefore, there is a need to efficiently automate the process of accurately classifying data items within a hierarchical taxonomy, at least in part.
It is against this background that the present invention has been conceived.
Summary of the invention
According to a first aspect of the present invention there is provided a computer-implemented method of classifying a plurality of data items within a destination taxonomy. Ideally, this is achieved by assigning each data item with a destination classifier.
As a destination classifier can represent a node in the destination taxonomy, the data item can thus be classified in accordance with that taxonomy.
The following recited method steps that refer to a destination taxonomy can, in whole or in part, apply to other taxonomies - such as a source taxonomy.
In general, classifiers represent nodes in a hierarchical taxonomy. Therefore, as every node has at least one parent or child relationship with one or more adjacent nodes, every classifier similarly has a corresponding parent and/or child relationship with other classifiers. Additionally, whilst classifiers may be represented as a classifier code, each classifier also has corresponding text components, such as a category title, allowing the structural arrangement of a taxonomy to be determined by a user or operator.
Preferably, the computer-implemented method comprises determining the structure of the hierarchical destination taxonomy. This may include registering the parent-child relationships of each node with other nodes of the destination taxonomy.
Preferably, the method comprises parsing text components of the destination taxonomy. Ideally, the method comprises registering from which nodal location within the destination taxonomy the text component originates. Advantageously, this subsequently allows each text component to be treated differently for each node.
Preferably, the method comprises generating a text component set for each node that includes those originating from the node. Ideally, the method comprises generating a text component set for each node that includes those originating from nearby nodes such as at least one of a parent, a child and a sibling node. Advantageously, text components that are from nearby nodes can be used to assess the context of a particular node, and so aids in subsequent classification of data items into the destination taxonomy.
Preferably, the method comprises assigning a weight to each text component of each set. Preferably, the weight depends, at least in part, on the relative difference in nodal location between the node of the set, and that from which the text component originates. Again, this aids in the subsequent classification of data items as nearby nodes are useful for determining the contextual significance of a given node in the taxonomy.
Preferably, the method comprises parsing text components from a first set of the plurality of data items. Preferably, the method comprises assigning a weight to each text component for a respective data item.
Preferably, the method comprises calculating the level of correlation between nodes of the destination taxonomy and data items, such as data items of the first set. Preferably, the level of correlation is calculated on the basis of their respective weighted text components. The level of correlation may be represented by a confidence score.
Preferably, the method comprises classifying data items of the first set of data items by assigning a destination classification code to them if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold.
Preferably, the method comprises applying a validation process to other data items of the first set if the level of correlation, as represented by an applicable confidence score, is below the predetermined threshold. Ideally, the validation process results in the assignment of a destination classification code to the other data items.
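The thresholding described in the preceding paragraphs can be illustrated with a short sketch. It assumes that weighted text components are already available as word-to-weight dictionaries; the overlap-based `correlation` function, the default threshold of 0.6 and all names are hypothetical choices for illustration and are not taken from the source.

```python
def correlation(item_components: dict, node_components: dict) -> float:
    """Weighted overlap normalised to 0..1.

    This is an assumed stand-in for the correlation measure; the result is
    used directly as the confidence score.
    """
    shared = set(item_components) & set(node_components)
    overlap = sum(min(item_components[w], node_components[w]) for w in shared)
    total = sum(item_components.values())
    return overlap / total if total else 0.0


def classify(items: dict, nodes: dict, threshold: float = 0.6):
    """Assign the best-matching destination code, or defer to validation.

    items: data item id -> weighted text components
    nodes: destination classification code -> weighted text components
    Returns (classified, needs_validation).
    """
    classified, needs_validation = {}, []
    for item_id, item_components in items.items():
        best_code, best_score = None, 0.0
        for code, node_components in nodes.items():
            score = correlation(item_components, node_components)
            if score > best_score:
                best_code, best_score = code, score
        if best_code is not None and best_score > threshold:
            classified[item_id] = (best_code, best_score)
        else:
            # Below the threshold: hand over to a validation process.
            needs_validation.append(item_id)
    return classified, needs_validation
```

Items returned in `needs_validation` correspond to the "other data items" that the method routes to a validation process.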
Preferably, the step of parsing text components of the destination taxonomy and/or from the first set of the plurality of data items further comprises pre-processing to remove predetermined low-semantic-value text components.
Preferably, for a node of the destination taxonomy, the step of assigning a weight to each text component of each node set comprises assigning a higher weight to text components originating from the same or parent nodal locations within the destination taxonomy than text components originating from child or sibling nodal locations.
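One way to picture the node text component sets is sketched below. The only constraint taken from the text is that components from the same or parent node outweigh those from child or sibling nodes; the specific weights, the dictionary-based taxonomy representation and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed weights: the text only requires self/parent > child/sibling.
RELATION_WEIGHTS = {"self": 1.0, "parent": 0.8, "child": 0.4, "sibling": 0.3}


def node_component_sets(titles: dict, parent_of: dict) -> dict:
    """Build a weighted text component set for every node.

    titles: node id -> category title
    parent_of: node id -> parent node id (None for a root node)
    """
    children = defaultdict(list)
    for node, parent in parent_of.items():
        if parent is not None:
            children[parent].append(node)

    def add(target: dict, source_node: str, relation: str) -> None:
        weight = RELATION_WEIGHTS[relation]
        for word in titles[source_node].lower().split():
            # Keep the highest weight if a word arrives via several relations.
            target[word] = max(target.get(word, 0.0), weight)

    sets = {}
    for node in titles:
        weighted = {}
        add(weighted, node, "self")
        parent = parent_of.get(node)
        if parent is not None:
            add(weighted, parent, "parent")
            for sibling in children[parent]:
                if sibling != node:
                    add(weighted, sibling, "sibling")
        for child in children[node]:
            add(weighted, child, "child")
        sets[node] = weighted
    return sets
```

Recording the relation under which each word was found is what allows the same word to carry a different weight in the sets of different nodes.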
Preferably, the step of assigning a weight to each text component comprises assigning a higher weight to text components derived from a title attribute than a description attribute. The weight assigned to text components derived from a title attribute may be at least 10 times greater than the weight assigned to text components derived from the description attribute. Preferably, the weight assigned to text components derived from a title attribute may be at least 20 times greater than the weight assigned to text components derived from the description attribute.
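As a minimal sketch of this attribute-based weighting, the function below weights words from a data item's title 20 times more heavily than words from its description. The ratio of 20 matches the preferred lower bound stated above; the attribute names `title` and `description`, the tokenisation and the function name are assumptions.

```python
import re

TITLE_WEIGHT = 20.0        # assumed value; the text says at least 10, preferably at least 20 times
DESCRIPTION_WEIGHT = 1.0   # assumed baseline weight


def item_components(item: dict) -> dict:
    """Return word -> weight for a data item, favouring the title attribute."""
    weights = {}
    for attribute, weight in (
        ("title", TITLE_WEIGHT),
        ("description", DESCRIPTION_WEIGHT),
    ):
        text = str(item.get(attribute, "")).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            # A word found in both attributes keeps the higher (title) weight.
            weights[word] = max(weights.get(word, 0.0), weight)
    return weights
```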
Preferably, the method further comprises monitoring the destination taxonomy over time for changes. The method may comprise taking predetermined actions in response to detecting a change in the destination taxonomy. For example, upon detection of a change in the nodal structure of the destination taxonomy, the method may further comprise reclassifying data items that belong to nodes affected by the detected change. Moreover, it can be advantageous to reclassify data items that are classified under a node that has changed, or one of the nearby nodes - especially those that are descendant from a node that has changed. Additionally, as a change may correspond to the adding of nodes to the destination taxonomy, it may be necessary to reclassify a node to a recently-added sibling node (in which it is a better fit). Accordingly, the method may further comprise reclassifying data items that belong to nodes descendant from a parent node itself having a descendant node (direct, or indirect) that has been detected to have changed.
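The change-handling logic can be sketched by comparing two snapshots of the destination taxonomy. The snapshot format (node id mapped to a title and parent id), the function names and the exact rule for which items to revisit are assumptions; the sketch selects items under a changed node, under its descendants, and under the other descendants of its parent, in the spirit of the paragraph above.

```python
def changed_nodes(old: dict, new: dict) -> set:
    """Ids of nodes that were added, removed, renamed or moved.

    old/new: node id -> (title, parent id)
    """
    return {n for n in old.keys() | new.keys() if old.get(n) != new.get(n)}


def descendants(taxonomy: dict, root) -> set:
    """All nodes below `root` in a node id -> (title, parent id) mapping."""
    found, frontier = set(), [root]
    while frontier:
        current = frontier.pop()
        for node, (_, parent) in taxonomy.items():
            if parent == current and node not in found:
                found.add(node)
                frontier.append(node)
    return found


def items_to_reclassify(old: dict, new: dict, item_codes: dict) -> list:
    """Data items whose current node is affected by a detected change.

    item_codes: data item id -> node id it is currently classified under
    """
    affected = set()
    for node in changed_nodes(old, new):
        affected.add(node)
        _, parent = new.get(node) or old[node]
        if parent is not None and parent in new:
            # Siblings of the changed node, and everything beneath them.
            affected |= descendants(new, parent)
        if node in new:
            affected |= descendants(new, node)
    return sorted(i for i, code in item_codes.items() if code in affected)
```

For instance, adding a "Cycling Jackets" node beneath "Clothing" would flag items currently filed under the sibling "Jackets" node, since the new node may be a better fit for some of them.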
Preferably, the method comprises applying a performance-based classification strategy comprising classifying a data item under a first trial classification code within the destination taxonomy, monitoring performance characteristics of that data item whilst classified under the first trial classification code within the destination taxonomy, comparing those performance characteristics with other performance characteristics resulting from classifying the data item and/or similar data items under a second trial classification code, and reclassifying the data item under the classification code that has the most optimal performance characteristics, as determined by the performance characteristic comparison.
The step of monitoring the performance characteristic may comprise determining one or more performance metrics relating to operations performed in respect of that data item.
This is particularly relevant where a saleable product is associated with each data item, and the destination taxonomy is defined by a channel of an electronic commerce platform.
Generally, a performance metric is increased with a greater number of operations predetermined as positive (such as the number of times that data item is viewed, located in a search and/or is subject to a sale transaction whilst listed under that trial classification code within the destination taxonomy). The performance metric may be reduced with a greater number of operations predetermined as negative (such as the return or negative review left for a product associated with a respective data item).
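A rough sketch of the performance-based strategy follows. The event names and their weights are hypothetical; the text only states that operations predetermined as positive (views, search hits, sales) raise the metric and operations predetermined as negative (returns, negative reviews) lower it.

```python
# Assumed event weights, chosen only for illustration.
POSITIVE = {"view": 1.0, "search_hit": 0.5, "sale": 10.0}
NEGATIVE = {"return": 8.0, "negative_review": 5.0}


def performance_metric(events: dict) -> float:
    """Score the event counts observed under one trial classification code."""
    score = 0.0
    for kind, count in events.items():
        score += POSITIVE.get(kind, 0.0) * count
        score -= NEGATIVE.get(kind, 0.0) * count
    return score


def best_trial_code(trials: dict) -> str:
    """Pick the trial code with the highest metric.

    trials: classification code -> event counts observed while the data
    item was listed under that code.
    """
    return max(trials, key=lambda code: performance_metric(trials[code]))
```

The data item would then be reclassified under the code returned by `best_trial_code`, once each trial has run long enough for the comparison to be meaningful.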
Preferably, the method comprises applying a validation process. The validation process may be applied to a subset of the plurality of data items - for example data items not in the first set of data items, or other data items of the first set of data items where the calculated level of correlation, as represented by an applicable confidence score, is below a predetermined threshold. The validation process may include applying a manual validation process. The manual validation process may comprise (a) presenting identifiers of each data item, via an operator interface, to an operator with a set of operator-selectable options each describing a respective proposed destination classifier, ideally selected from those calculated to have the highest level of correlation to that data item, and (b) receiving, via the operator interface, a selection of one of the operator-selectable options, and in response assigning the corresponding destination classification code to that data item.
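Steps (a) and (b) of the manual validation process can be sketched as two small functions: one that selects the options to present, and one that records the operator's choice. The function names, the limit of five options and the choice to store a confidence of 1.0 for an operator-confirmed classifier are assumptions for illustration.

```python
def options_for(item_id: str, candidates: dict, limit: int = 5) -> list:
    """Return the highest-scoring proposed classifiers for one data item.

    candidates: data item id -> {destination code: confidence score}
    """
    ranked = sorted(candidates[item_id].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


def apply_selection(assignments: dict, item_id: str, options: list, selected_code: str) -> dict:
    """Record the operator's choice, which must be one of the presented options."""
    if selected_code not in {code for code, _ in options}:
        raise ValueError("selection must be one of the presented options")
    # Assumption: an operator-confirmed classifier is stored with confidence 1.0.
    assignments[item_id] = (selected_code, 1.0)
    return assignments
```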
The validation process may comprise processing data items differently or more intensively to determine a better confidence score and/or match with an appropriate destination classification code. For example, the validation process may utilise image recognition to generate text components for use in determining the level of correlation between a data item and a node of the destination taxonomy.
Specifically, the validation process may comprise: determining an image that is associated with a corresponding data item (e.g. one of the other data items of the first set); performing image recognition of that image to generate additional text components associated with that data item; assigning a weight to each additional text component; calculating the level of correlation between that data item and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; and classifying that data item by assigning a destination classification code to it if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold.
Image recognition may comprise: comparing and matching the image with at least one destination image provided on a database from which the destination taxonomy is derived, destination images being provided together with a corresponding destination text component; and processing the destination text components to generate additional text components for association with the data item.
The method may further comprise detecting an incomplete data item of the plurality of data items, and in response generating restorative data. For example, a data item may be detected as being incomplete because it lacks a set of information, such as one or more values, attributes and attribute-value pairs. The generated restorative data can be stored and used to recomplete the otherwise incomplete data item. Thus, the recompleted data item can be added to a set of data items for subsequent classification, such as the first set of data items.
The method may comprise generating restorative data for a data item by processing data that already exists for that data item to generate text components for use as restorative data. For example, the method may comprise processing image data that already exists for that data item by performing image recognition on said image data to generate text components for use as restorative data to recomplete that data item.
Preferably, the method comprises determining a second set of the plurality of data items that each have a source classifier that represents a node within a source taxonomy, and applying a mapping between the source taxonomy and the destination taxonomy to thereby classify the second set of the plurality of data items according to the destination taxonomy.
Advantageously, this allows these data items to be classified without significant computational burden: they need not necessarily be processed individually data item by data item, but rather depending on their existing categorisation within the source taxonomy.
Preferably, the method comprises determining a second set of the plurality of data items that each have a source classifier that represents a node within a source taxonomy and then processing the source taxonomy in the same way as the destination taxonomy to determine sets of weighted text components. Advantageously, this allows the same computer-implemented processing steps to be efficiently repurposed.
Specifically, the method may comprise determining the structure of the source taxonomy, parsing its text components, generating a text component set for each node of the source taxonomy, and assigning a weight to each text component of each node set.
Thus, the method may further comprise calculating the level of correlation between nodes of the source and destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score.
The method may comprise generating a source-to-destination map comprising node mappings from each node of the source taxonomy to one or more nodes of the destination taxonomy - in particular, those calculated to be the most correlated. Ideally, each mapping is stored together with an associated confidence score. The method may comprise classifying data items of the second set by assigning a destination classification code to them if, according to the generated source-to-destination map, the source-to-destination confidence score of an applicable source-to-destination mapping is above a predetermined threshold. In other words, if the source classification code of a data item of the second set confidently maps on to a node of the destination taxonomy, it can be immediately classified under that destination taxonomy.
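The threshold test applied to the second set can be sketched as follows. This is a minimal illustration only: the dictionary layout of the map, the function name and the threshold value of 0.8 are assumptions, and the destination codes used in any example call are hypothetical.

```python
def classify_via_map(items, source_to_destination, threshold=0.8):
    """Classify pre-categorised data items using a source-to-destination map.

    items: item identifier -> source classification code.
    source_to_destination: source code -> list of
        (destination code, confidence score) pairs.
    Returns (classified, needs_validation), where classified maps item
    identifiers to destination codes and needs_validation lists the
    identifiers whose best mapping did not exceed the threshold.
    """
    classified, needs_validation = {}, []
    for item_id, source_code in items.items():
        candidates = source_to_destination.get(source_code, [])
        # Pick the most correlated destination node, if any exists.
        best = max(candidates, key=lambda pair: pair[1], default=None)
        if best is not None and best[1] > threshold:
            classified[item_id] = best[0]
        else:
            needs_validation.append(item_id)
    return classified, needs_validation
```

Because the decision depends only on the source classification code, each source node is effectively resolved once and every data item under it inherits the result, which is where the reduced computational burden comes from.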
Additionally, the method may comprise applying a validation process to other data items of the second set if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is below the predetermined threshold. The validation process ideally results in the assignment of a destination classification code to the other data items.
Preferably, each data item comprises at least one of:
a source classifier relating to the location of the data item within a source taxonomy, (for example expressed as a classification key)
at least one text field, such as:
a title text field, the value of which is a title of a product to which the data item relates;
a description text field, the value of which is a description of a product to which the data item relates;
at least one image file; and
at least one numerical field.
Notably, data items need not be pre-categorised i.e. the invention is able to classify previously unclassified data items, and reclassify those that have been previously classified.
The method may comprise source-destination taxonomy analysis/mapping, such as destination taxonomy monitoring, with optionally, remapping occurring in response to a monitored change.
The method may comprise determining mappings between source and destination taxonomy by category matching with a confidence score, e.g. via a weighted bag-of-words counting technique.
The system may comprise an automated classifier generator for generating at least one proposed classifier for notional assignment to a data item, and a confidence score associated with each proposed classifier. The proposed classifier may be generated by text analysis and/or image analysis.
The system may comprise a classification validator, optionally carrying out the steps of: determining if the confidence score associated with a proposed classifier is below a predetermined threshold value, and:
if not, then automatically assigning the classifier to the data item;
if so, then running a manual validation routine comprising:
displaying a user prompt containing a set of options;
receiving a user input choosing one option; and
assigning a classification and/or a confidence score in dependence on the chosen option.
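The steps of the classification validator listed above might be sketched as below. The function and parameter names are hypothetical, the operator prompt is abstracted as a callable so that the example stays self-contained, and setting the confidence score to 1.0 after an operator choice is an assumption made for the illustration.

```python
def validate(item_id, proposals, threshold, ask_operator):
    """Route one data item through automatic or manual validation.

    proposals: list of (classifier, confidence score) pairs.
    ask_operator: callable taking (item_id, options) and returning the
        option chosen by the operator.
    Returns the assigned (classifier, confidence score).
    """
    ranked = sorted(proposals, key=lambda pair: pair[1], reverse=True)
    best_classifier, best_score = ranked[0]
    if not best_score < threshold:
        # Confidence is not below the threshold: assign automatically.
        return best_classifier, best_score
    # Otherwise present the ranked options and accept the operator's choice.
    options = [classifier for classifier, _ in ranked]
    chosen = ask_operator(item_id, options)
    return chosen, 1.0
```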
According to a second aspect of the present invention there is provided a system for classifying a plurality of data items within a destination taxonomy.
Moreover the second aspect of the invention may reside in a computer-implemented classification system for classifying data items according to a hierarchical destination taxonomy by assigning each data item with a destination classifier that is representative of a node in the destination taxonomy. The system may comprise means for carrying out the steps of the method according to the first aspect of the present invention.
For example, the system may comprise at least one of a database configured to store the destination hierarchy, and computing resources configured to carry out one or more method steps according to the first aspect of the invention. The computing resources may comprise a processor and a memory. The system may further comprise an operator interface.
The system may also comprise at least one of: a first interface for reading a taxonomy, such as a channel taxonomy of a channel database, the destination taxonomy being determined from the reading via the first interface of that taxonomy; and a second interface for reading and/or updating a taxonomy, such as a product taxonomy of a seller product database, the source taxonomy being determined from the reading via the second interface of that taxonomy.
The system may be configured to: determine the structure of the hierarchical destination taxonomy including parent-child relationships of each node with other nodes of the taxonomy; parse text components of the destination taxonomy, each text component being registered as originating from a respective nodal location within the destination taxonomy; generate a text component set for each node that includes those originating from the node, and those originating from at least one of a parent, a child and a sibling node; assign a weight to each text component of each set depending, at least in part, on the relative difference in nodal location between the node of the set, and that from which the text component originates; parse text components from a first set of the plurality of data items, each text component being assigned a weight for a respective data item; calculate the level of correlation between each data item of the first set and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; and/or classify data items of the first set of data items by assigning a destination classification code to them if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold.
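The generation of a weighted text component set for each node can be illustrated with the following sketch. It assumes a taxonomy held as a child-to-parent dictionary and uses node titles as the only text components; the specific weight values (1.0 for the node itself, 0.5 for a parent or child, 0.25 for a sibling) are assumptions chosen only to show weights falling with nodal distance.

```python
from collections import defaultdict


def build_component_sets(parents, titles, own=1.0, parent=0.5,
                         child=0.5, sibling=0.25):
    """Build a weighted text component set for every node.

    parents: node -> parent node (None for the root).
    titles: node -> title text.
    Returns node -> {word: weight}, keeping the highest weight per word.
    """
    children = defaultdict(list)
    for node, par in parents.items():
        if par is not None:
            children[par].append(node)

    def add(target, source, weight):
        # Register each word of the source node's title at the given weight.
        for word in titles[source].lower().split():
            target[word] = max(target.get(word, 0.0), weight)

    sets = {}
    for node, par in parents.items():
        words = {}
        add(words, node, own)
        if par is not None:
            add(words, par, parent)
            for sib in children[par]:
                if sib != node:
                    add(words, sib, sibling)
        for ch in children[node]:
            add(words, ch, child)
        sets[node] = words
    return sets
```

For a node "Mens" under "Clothing" with a sibling "Womens" and a child "Shoes", the resulting set carries "mens" at full weight and the surrounding titles at reduced weights, giving the node the contextual description relied upon during matching.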
The system may also apply a validation process to other data items of the first set if the level of correlation, as represented by an applicable confidence score, is below the predetermined threshold, the validation process resulting in the assignment of a destination classification code to the other data items.
The system may be configured to process a second set of data items that are preassigned with a source classification code that denotes a location of a respective data item within a source taxonomy, the processing comprising: loading the source taxonomy and the destination taxonomy into the database; comparing the source and destination taxonomies to generate a source-to-destination taxonomy map that includes a plurality of source-to-destination mappings each having a corresponding confidence score; classifying data items of the second set by assigning a destination classification code to them if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is above a predetermined threshold; and applying a validation process to other data items of the first set if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is below the predetermined threshold.
Naturally, an aspect of the invention extends to a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method of the first aspect of the invention.
It will be understood that features and advantages of different aspects of the present invention may be combined or substituted with one another where context allows.
For example, the features of the method described in relation to the first aspect of the present invention may be provided as part of the system described in relation to the second aspect of the present invention. Furthermore, such features may themselves constitute further aspects of the present invention.
Brief description of the drawings
In order for the invention to be more readily understood, embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 is a top-level schematic view of a classification system according to a first embodiment of the present invention;
Figure 2 represents an example extract from a product database for use by the classification system of Figure 1;
Figure 3 shows an example source taxonomy to destination taxonomy map for use by the classification system of Figure 1;
Figure 4 shows a top-level flow diagram of the method used by the classification system of Figure 1 to perform classification;
Figure 5 is a flow diagram of the steps of an example method of generating a source-to-destination taxonomy map, as carried out by the classification system of Figure 1; and
Figure 6 shows an example graphical representation of part of a product taxonomy for use by the classification system of Figure 1.
Specific description of the preferred embodiments
Figure 1 is a top-level schematic view of a classification system 1 according to a first embodiment of the present invention. Naturally, the classification system 1 is configured to implement a method of classification which also accords to a first embodiment of the present invention, and as will be described in greater detail below.
To aid contextual understanding, the classification system 1 of the present embodiment is a product classification system 1 for classifying/categorising data items that relate to products that are to be hierarchically listed on an electronic commerce platform.
Naturally, other embodiments of the invention may relate to the classification of other types of data items for other purposes, but using the same techniques.
Figure 1 also shows the other principal systems with which the classification system 1 is configured to interface, namely, a channel system 2 and a seller system 3. It should be noted that the product classification system 1 can interface with a variety of different channel and seller systems, but only one of each is depicted in Figure 1 in the interests of simplicity. By way of illustration, a channel system may be provided by an electronic commerce platform, such as Amazon® or eBay®. A seller system may be that of a retailer or reseller of a range of products.
The product classification system 1 comprises system computing resources 10, such as a processor and memory, that facilitate the computerised execution of the classification process described below. The product classification system 1 also comprises a database 11 in which data associated with the classification process is stored, and a user interface 15 via which an operator of the product classification system 1 can input data into and receive information from the system 1.
The product classification system 1 comprises a seller interface 13 for interfacing with a seller system 3, and specifically for reading information from, or writing information to a seller product database 30. For example, the seller product database 30 may be read by transferring a file (e.g. a product spreadsheet file) from the seller system 3 to the product classification system 1, processed, and then transmitted back to the seller system 3. Alternatively, a set of entries of the product database 30 may be updated via the product classification system 1 sending update requests to the seller system 3.
In the present example, the seller product database 30 has associated with it a hierarchical product taxonomy 30a, with each item of the seller product database 30 being classified according to (or having a location within) the taxonomy 30a. This can be represented by assigning each product data item with a classification code. Figure 2 represents an example extract from a product database, showing an example product record relating to a pair of shoes. The location of a data item within the seller product taxonomy 30a is notionally codified by the value of a seller classification code attribute. For the shown example, this is "CL/ME/SH", which relates to the category/node "shoes" which is a child node of the category/node "Mens", which itself is a child node of the category/node "Clothing".
The specific structure of the seller product taxonomy 30a, with an identification of each node (or category) and the relationship of each node with adjacent parent or child nodes, may be predefined, or alternatively may be generated by aggregating the classification code value of all products within the seller product database 30. It should also be noted that nodes of a taxonomy 30a may not necessarily be restricted to categories, but may include other attributes such as fields or values. For example, attributes such as the size of a shoe product may be defined as a node within a taxonomy, and the possible values such an attribute can take on as further associated nodes, especially if those values are restricted to a range of values, or a limited number of predetermined options - as opposed to free text, for example.
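Generating a taxonomy structure by aggregating classification code values might look like the sketch below. It assumes that codes are slash-separated paths in the style of "CL/ME/SH"; the function name and the child-to-parent dictionary representation are illustrative choices.

```python
def taxonomy_from_codes(codes, separator="/"):
    """Derive parent-child relationships from classification codes.

    Returns a dict mapping each node path to its parent path, with None
    as the parent of top-level nodes.
    """
    parents = {}
    for code in codes:
        parts = code.split(separator)
        for depth in range(1, len(parts) + 1):
            path = separator.join(parts[:depth])
            # An empty parent path means the node sits at the top level.
            parents[path] = separator.join(parts[:depth - 1]) or None
    return parents
```

Passing the code of every product record yields one entry per distinct node, so repeated codes do not enlarge the result.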
Figure 6 shows an example graphical representation of part of a product taxonomy, with a category node "Clothing" being shown in expanded form to display many direct child category nodes associated with it (e.g. Child, Men, Baby, Girls, Boys, Novelty & Special Use, and Women) as well as associated attributes (e.g. colour, season, collection, etc). The child node "Women" is also expanded to show its further sub-category child nodes (e.g. Knitwear, Swimwear, Jeans, Maternity, etc) as well as associated attributes. By way of example, the attribute "season" may be restricted to four predetermined options: "spring", "summer", "autumn", and "winter", and these value restrictions are specifically codified in the taxonomy (although not shown in Figure 6). By way of another example, a GB shoe size attribute may be restricted to whole numbers, or 0.5 increments between those whole numbers, within the range 3 to 13.5.
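The two value restrictions given as examples can be expressed as simple checks. This is a sketch only; the function names are hypothetical and the case-insensitive handling of season names is an assumption.

```python
SEASONS = {"spring", "summer", "autumn", "winter"}


def valid_season(value):
    """True if the value is one of the four predetermined season options."""
    return value.lower() in SEASONS


def valid_gb_shoe_size(value):
    """True for whole or half sizes within the range 3 to 13.5."""
    in_range = 3 <= value <= 13.5
    # Doubling a whole or half size always gives a whole number.
    on_half_step = (value * 2) == int(value * 2)
    return in_range and on_half_step
```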
Figure 4 shows a top-level flow diagram of the method used by the classification system 1 to perform classification.
A first step of the method 41 comprises loading the source and destination taxonomies into the database 11 of the system 1:
Referring back to Figure 1, via the seller interface 13, the seller product database and its associated taxonomy 30a can be ingested by the product classification system 1, and stored within the database 11 as a source taxonomy ST for subsequent computing operations. Taxonomies can be stored on the database 11 of the product classification system 1 in a variety of forms - for example as a relational database or a graph database. It should be noted however that the latter is preferred, due to its speed, scalability and suitability for representing hierarchical taxonomies.
In certain embodiments, the seller system 3 may provide a product database 30 without an accompanying predefined taxonomy 30a. In such examples, a taxonomy 30a may be automatically generated by the product classification system 1 as will be discussed later. Regardless, a source taxonomy ST derived from the seller product database 30 can be established on the database 11 of the product classification system 1.
The product classification system 1 also comprises a channel interface 12 for interfacing with a channel system 2, and reading from it the structure of the channel database 20 to derive from it a destination taxonomy DT which can also be obtained and stored in a similar way on the database 11 of the product classification system 1.
The channel interface 12 of the product classification system 1 is also configured to list or update a set of products on the channel database. The product classification system 1 therefore acts as an intermediary between a seller and a channel. It should be noted that the product classification system 1 will normally only have authorisation to list or update a set of products relating to the seller's products. It will not be able to change product listings that have been uploaded by other users of the channel - although it may be able to read data items relating to those products, providing they are publicly listed and accessible (e.g. via a web interface). Also, it should be noted that whilst the product classification system 1 is able to read the hierarchical structure of the channel database 20, it cannot change that structure.
Referring back to Figure 4, the second step 42 of the classification process is to compare the source and destination taxonomies. Assuming there are differences between a source taxonomy ST and destination taxonomy DT, the classification system 1 will need to perform a mapping operation that generates a source-to-destination map between the source taxonomy ST and the destination taxonomy DT.
Using this source-to-destination map, product data items classified under the source taxonomy ST can be more efficiently remapped to the destination taxonomy DT, and so be automatically listed within an appropriate location within a channel system 2.
Moreover, the product classification system 1 is configured to monitor the channel system 2 to determine when and how an update is made to restructure the channel database 20, and in response adjust the destination taxonomy DT, and consequently the source-to-destination map. The source-to-destination map generally comprises a list of nodes (or classifier codes representing those nodes) of the source taxonomy associated with a corresponding list of nodes of the destination taxonomy together with a normalised confidence score (from 0 to 1) that represents the determined strength of association between the two linked nodes. It should also be noted that, for each source taxonomy node, there may be many possible destination taxonomy nodes, each with its own confidence score. This can be codified in a variety of forms, but for ease of understanding could be represented in a table.
Figure 3 shows an example source taxonomy to destination taxonomy map in the form of a table, with one node of a source (seller) taxonomy mapped to two possible nodes of a destination (channel) taxonomy, a confidence score associated with each source to destination node mapping. It will be understood that more than two maps may be provided in practice. Each node is represented here in Figure 3 as a classification code.
The process of mapping itself is performed progressively and hierarchically, where more ancestral nodes of the source taxonomy - i.e. closer to the "root" of the hierarchy - are mapped at a higher priority than the nodes that are closer to the leaves of the hierarchy.
In a typical product database hierarchy, this correlates with "categories" (which are closer to the root) being mapped at a higher priority than fields, attributes and values.
There are several challenges associated with the way information is organised in each source and destination taxonomy, which make it difficult to perform mapping accurately and efficiently:
Source and destination taxonomies differ in terms of hierarchical category depth. Some split into very fine sub-categories while others have more coarse-grained sub-categories.
The order of the nodes or categories may be different across different taxonomies. For example, one taxonomy may have Clothing>Mens>Sports and another Clothing>Sports>Mens.
Some categories present in one taxonomy (e.g. the source taxonomy ST) may not be represented in another (e.g. the destination taxonomy DT). This necessitates registering the absence, and then establishing a map to a generic category in the destination taxonomy.
An approach to these challenges is to extract the category title of each node and then use "word-sense disambiguation" techniques to build a link between the source and destination taxonomies. However, this alone would typically generate a relatively poor confidence score for anything but the most direct associations. For example, an appropriate map between a source category entitled "shoes" and a destination category entitled "footwear" may not be attributed with a high confidence score. To address this, the product classification system 1, and the method it employs for classification, goes further, and parses from both the source and destination taxonomies not only the title, but information relating to the path of each node. In particular, textual components (such as titles) of other adjacent nodes are parsed together with attributes and values (or value restrictions) under a candidate category. This additional information provides a rich context which describes a range of the products which the category represents so a more reliable association can be established between nodes of the source and destination taxonomy.
Specifically, for each node, a set of textual components is established from both the source and destination taxonomies and compared with one another using a matching algorithm in order to determine the contextual similarity between a node in the source taxonomy and that of the destination taxonomy, and thereby a confidence score can be generated.
A "bag of words" model can be used as the basis for this comparison, using a "distance-based" matching algorithm (e.g. Levenshtein distance) to determine a similarity between each word in the set, and also the overall difference between the set of words. Whilst this can generate category matches with a confidence score, there are certain drawbacks to this approach. Notably, confidence scores will be low (and/or there would be incorrect classification) where:
attribute data is sparse in either the source or destination taxonomy;
attributes are disparate between the source and destination taxonomies;
values follow different conventions - e.g. US vs GB shoe sizes; and
there is a large number of valid or noise values in one or more of the attributes of a taxonomy. For example, one category on a channel database such as eBay® has more than 1000 brand names.
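The distance-based matching mentioned above can be illustrated with a standard dynamic-programming implementation of Levenshtein distance. The normalisation of the distance into a similarity between 0 and 1 is an assumption made for this sketch, not a formula given in the description.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def word_similarity(a, b):
    """Scale the edit distance into a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A purely character-based measure rates "shoe" and "shoes" as close but "shoes" and "footwear" as distant, which is consistent with the drawbacks listed above and with the need for additional context from neighbouring nodes.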
To account for these potential drawbacks, a pre-processing step is first performed by the product classification system 1 before matching, the pre-processing reducing noise levels and providing domain synonyms. For example, pre-processing can include:
removing common words like "a", "the", "and" which would otherwise falsely positively skew the match between two nodes; and
adding weights to words from category path, attributes and values so that text components originating from different locations within the taxonomy are not given an equal weighting. Generally, higher weights are assigned to text components originating from higher "category" nodes relative to lower "attribute" or "value" nodes.
Figure 5 is a flow diagram of the steps of an example method 50 of generating a source-to-destination taxonomy map.
A first step 51 comprises obtaining the text components from the source and destination taxonomies. This is achieved by traversing each taxonomy structure and parsing text components from each node. A prime example of a suitable text component is the category title (c.f. the value of the "title field" of Figure 2). However, other text components will be collected and assigned against each node of each taxonomy. For example, the text from particular fields or attributes can be used (e.g. "description", "season", "size", "colour"). Thus, each node can be assigned a set of parsed text components.
A second step 52 comprises removing low-semantic-value text components, especially those that are unlikely to form the basis of distinguishing one node from another. Examples are commonly-used words such as determiners ("a", "an", "the", "this", "that"... etc.). This is done to minimise the computational burden of processing such words during a comparison.
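The removal of low-semantic-value text components can be sketched as a simple stop-word filter. The stop-word list below is a small illustrative sample, and the punctuation stripping and lower-casing are assumptions about tokenisation that the description does not specify.

```python
# Illustrative sample only; a practical list would be considerably longer.
STOP_WORDS = {"a", "an", "the", "this", "that", "and"}


def filter_components(text):
    """Split text into lower-cased words and drop the stop words."""
    words = (word.strip(".,;:!?()\"'").lower() for word in text.split())
    return [word for word in words if word and word not in STOP_WORDS]
```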
A third step 53 comprises generating a filtered text component set for each node. This is achieved as a by-product of the second step 52, but additional filtering may be employed to further filter out (or in) additional text component terms.
A fourth step 54 comprises assigning a weight to each text component which represents its relative importance to the node to which it has been assigned. Notably, a higher weight is applied to the text component derived from the title of the respective node than to the title of adjacent nodes, which in turn will be given greater weight than text components derived from fields or attributes.
A fifth step 55 comprises calculating a level of correlation between each node of the source taxonomy, and a number of candidate nodes of the destination taxonomy. This is achieved by comparing all the text components (+ weights) associated with a node in the source taxonomy with all the text components (+ weights) associated with candidate nodes of the destination taxonomy. As mentioned, the basis of this comparison can be a matching algorithm which generates, as an output, a confidence score of the link between the source and destination nodes.
A sixth step 56 comprises populating a map with this information. Each node from the source taxonomy is stored in the map together with the 'X' most correlated nodes from the destination taxonomy, as indicated by a corresponding confidence score. Ideally 'X' is between 2 and 10, and more ideally 5, so as to achieve a balance between storing superfluous data, and providing a viable set of alternative nodes to choose from in the event that the node rated with the highest confidence score is later determined to be inappropriate via a validation process (e.g. user validation).
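One possible way of turning two weighted text component sets into a normalised confidence score is a weighted overlap measure, sketched below. This is an assumed stand-in for the matching algorithm, using exact word matches only; the description leaves the choice of algorithm open.

```python
def confidence(source, destination):
    """Weighted overlap of two {word: weight} sets, from 0.0 to 1.0.

    For every word, the smaller of the two weights counts as agreement and
    the larger as the maximum possible agreement; the score is their ratio.
    """
    words = set(source) | set(destination)
    overlap = sum(min(source.get(w, 0.0), destination.get(w, 0.0)) for w in words)
    total = sum(max(source.get(w, 0.0), destination.get(w, 0.0)) for w in words)
    return overlap / total if total else 0.0
```

Identical sets score 1.0 and disjoint sets score 0.0, so the result can be stored directly as a normalised confidence score.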
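Populating the map with the 'X' most correlated destination nodes can be sketched with the standard library's heapq.nlargest. The nested-dictionary input format and the function name are assumptions made for the example.

```python
import heapq


def build_map(scores, x=5):
    """Keep the x highest-scoring destination nodes for each source node.

    scores: source node -> {destination node: confidence score}.
    Returns source node -> list of (destination node, confidence score),
    ordered from most to least correlated.
    """
    return {
        source: heapq.nlargest(x, candidates.items(), key=lambda pair: pair[1])
        for source, candidates in scores.items()
    }
```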
In the present embodiment, a single node of the source taxonomy will be mapped on to up to five nodes of the destination taxonomy, each with a confidence score. It should be noted that there may not be a direct 1-to-1 map between a node of the source taxonomy and that of the destination taxonomy and so generalised, default or "catch-all" categories may be selected where a specific mapping is not possible. For example, the node-path Clothing>Mens>Sports>Bowling on a source taxonomy may get mapped to Clothing>Mens>Activewear on a destination (channel) taxonomy in the event that the channel doesn't support a Bowling specific category. Hence either a parent or a catch-all category under Activewear will be selected. Defaults/catch-all categories are typically identified during the taxonomy import process and may need to be set up as channel-specific logic.
In the situation where a node of the source taxonomy is mapped to multiple nodes in the destination taxonomy with very similar confidence scores for each, this can be used as a prompt to the product classification system 1 to conduct a more sophisticated comparison between the taxonomies. Moreover, this situation can be used by the system 1 to handle the reclassification of a product data item into an appropriate node in the destination taxonomy in a more nuanced way. This situation may occur where the source taxonomy has only a relatively generalised category node, whereas the destination taxonomy has more fine-grained category nodes, for example.
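Detecting that the leading candidates have very similar confidence scores might be done as in the sketch below. The margin of 0.05 is an assumed value, as the description does not quantify "very similar".

```python
def is_ambiguous(candidates, margin=0.05):
    """True if the two best confidence scores differ by less than margin.

    candidates: list of (destination node, confidence score) pairs.
    """
    scores = sorted((score for _, score in candidates), reverse=True)
    return len(scores) >= 2 and (scores[0] - scores[1]) < margin
```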
Thus, referring to Figure 4, the third step 43 of the classification process can be achieved - i.e. the application of a destination classification code where the source-to-destination confidence score is above a predetermined threshold.
Where the source-to-destination confidence score is below a predetermined threshold, the system 1 optionally performs the fourth step 44 of the process, which is a manual validation process. Here, each source node is presented via an operator interface 15 to an operator of the system 1 alongside multiple destination nodes. These are presented as operator-selectable options which, when selected, update the confidence score to significantly increase it (e.g. increase to 1) so that an unambiguous link between a source node and a destination node can be determined. From this it is possible to place products originally classified under a source classifier to a destination classifier.
The system 1 so far described attempts matching between the taxonomic structures of source and destination databases. This relies on the database 30 imported into the classification system 1 having an existing taxonomy - such as the product taxonomy 30a of the seller product database 30. However, a natural extension of the system 1 is to be able to process databases (or parts of them) that do not have an already-established taxonomy. This also includes individual database records, in the form of data items. Moreover, this can enable the combination of two or more sets of data items or records, one of which may not be part of a taxonomically structured database. Thus, the classification system 1 and associated method is able to classify previously unclassified data items.
One example where this may arise is when the seller interface 13 of the system 1 reads in a seller product database 30 that does not have an accompanying predefined taxonomy 30a at all. In this case, the seller product database 30 may not have a seller classification code attribute (i.e. the first column in Figure 2 is entirely missing). Another example is where the seller product database 30 may have a seller classification code attribute, but certain data items in the seller product database 30 have a null or zero value under that attribute.
In these situations, the classification system, typically via the computing resources 10, is configured to implement an automated classifier generator for generating at least one proposed classification code for assignment to a corresponding data item. This may simply be a classification code that is part of the destination taxonomy DT. Alternatively, the classification code may be part of an existing source taxonomy ST for subsequent mapping to a destination taxonomy DT.
To do this, the automated classifier generator is configured to estimate the proposed classifier code for a data item by processing other attribute-value pairs of that data item. Specifically, the automated classifier generator applies a process that is very similar to that already described in relation to Figure 5 in that attribute-value pairs of a data item are parsed to extract text components. Each text component is then attributed a weight for subsequent comparison against those of a destination taxonomy DT. As before, a higher weight is applied to text components derived from a title field of a data item (vs. text components of any other field), as this is more representative of the likely category and so classifier code of a data item. Ideally, the weighting of a text component derived from the title attribute is between 10 and 50 times greater compared to other text components, especially those from a description attribute.
Additionally, the automated classifier generator is configured to generate a confidence score associated with each proposed classifier. This represents the likelihood that the proposed classifier is appropriate for the data item. If the confidence score is below a threshold score, then the automated classifier generator is configured to take further actions, such as initiating a validation process. For example, a manual validation process similar to that described in relation to step 44 of Figure 4 can be initiated.
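The weighting and scoring performed by the automated classifier generator can be sketched as follows. This is a minimal illustration under stated assumptions, not the described implementation: cosine similarity is used here as a stand-in correlation measure, and the title weight of 20 (chosen from within the stated 10 to 50 range), the stop-word list, the threshold of 0.8, the node weights and all function names are hypothetical.

```python
import math
import re
from collections import Counter

TITLE_WEIGHT = 20.0  # assumed value within the 10-50x range stated above
OTHER_WEIGHT = 1.0
THRESHOLD = 0.8      # hypothetical "threshold score"
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "with"}  # illustrative only


def text_components(text):
    """Split text into lower-case word tokens, dropping low-value words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def weighted_components(item):
    """Weight title tokens far more heavily than tokens from other fields."""
    weights = Counter()
    for attribute, value in item.items():
        w = TITLE_WEIGHT if attribute == "title" else OTHER_WEIGHT
        for token in text_components(str(value)):
            weights[token] += w
    return weights


def confidence(a, b):
    """Cosine similarity between two weighted token sets (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def propose_classifier(item, node_components):
    """Return (classification code, confidence score) of the best-matching node."""
    item_weights = weighted_components(item)
    scores = {code: confidence(item_weights, w) for code, w in node_components.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical weighted text components for two destination nodes.
nodes = {
    "Clothing / Mens / Shoes": Counter(
        {"clothing": 1.0, "mens": 1.0, "shoes": 2.0, "shoe": 2.0, "loafer": 1.0}),
    "Clothing / Mens / Jackets": Counter(
        {"clothing": 1.0, "mens": 1.0, "jackets": 2.0, "jacket": 2.0}),
}
item = {"title": "Shoe", "description": "A stylish suede loafer for mens clothing"}
code, score = propose_classifier(item, nodes)
needs_validation = score < THRESHOLD  # below threshold: trigger a validation process
```

Because the single title token dominates the item's weights, the shoe node is proposed even though the description shares generic terms with both nodes; the resulting score still falls below the assumed threshold, so this item would be routed to validation.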
Specifically, identifiers of a data item are presented via an operator interface 15 to an operator of the system 1. In many cases, one or more values, or attribute-value pairs, of the data item are presented as descriptive identifiers of that data item. Therefore, identifiers are presented in a form and manner that allows the operator to decide what the subject matter of a data item concerns. For example, where the data item corresponds to a particular product record, title and description values may be provided (e.g. Shoe: A stylish suede loafer...). If a data item includes an image, the image may also be displayed as an identifier.
Such identifiers are presented via the operator interface 15 to the operator at the same time as, and alongside, a set of proposed classifiers. The proposed classifiers are ideally presented with a respective classification description (e.g. "Clothing / Mens / Shoes") - again to help a human operator understand which category is being proposed by the system 1. Moreover, the proposed classifiers are presented as operator-selectable options which, when selected, increase the confidence score of the proposed classifier for a data item.
It should be noted that whilst user intervention is required, the process of validating a set of options via simple user selection imposes a relatively small burden on the user. It is not necessary for the user/operator to think about and find a suitable category, and enter the category manually (e.g. in text form). User selection is ideally via the operation of a simple user interface element (e.g. clicking on a UI element using a mouse pointer, or touching the UI element presented on a touch-screen). Accordingly, the cognitive and operational burden on the user is significantly relieved, allowing the system 1 as a whole to operate more efficiently.
Accordingly, previously unclassified data items can be categorised in a newly-generated and operator-validated product taxonomy. This may be that of the seller system 3, a corresponding source taxonomy ST, or a destination taxonomy DT associated with a channel system 2. This is achieved via the automatic or system-assisted assignment of an appropriate classification code to a data item. The automated classifier generator generates an otherwise missing classification code value for a data item under the classification code attribute, allowing a previously unclassified data item to be classified.
In a further natural extension to the system 1 of the present embodiment, data items that are incomplete in other ways can be processed in a similar manner to restore information that would otherwise be missing. In other words, the above-described process can be generalised to encompass other attributes or values that may be missing. This may be as a result of the information being initially missing from the seller product database 30, or determined subsequently as being information that is useful or appropriate to include when mapping data items from the source taxonomy to the destination taxonomy.
This may occur when the product classification system 1 determines that a data item should or could have a classification code under which data items have a set of particular attributes not present in other similar classification codes. For example, this may arise when a product record is deemed to be alternatively located at a different location within a destination taxonomy than originally classified under the source taxonomy. A particular jacket, for example, may be classed within a source taxonomy ST under "Clothing / Mens / Jackets", but may be determined to be classifiable within a destination taxonomy DT under "Sports & Outdoor / Cycling Jackets", in which case attributes such as whether or not the jacket is waterproof, includes reflective details, etc. become useful and appropriate to include as attributes.
In these situations, the product classification system 1 is additionally configured to detect the absence of a respective attribute or value in the seller product database 30 and, in response, and utilising other information that is available, to generate restorative data (i.e. values, or attribute-value pairs) to take the place of the otherwise missing data. The restorative data can thus be added to recomplete otherwise incomplete data items.
The restorative data can be added by the product classification system 1 to the seller product database 30 via an update issued by the seller interface 13, and similarly transferred to the source taxonomy ST and destination taxonomy DT. Additionally, classification mappings can be determined on the basis of the restorative data as described above. The restorative data generated for a particular data item is generated by processing the data that already exists for that particular data item.
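The detection of missing attributes and the generation of restorative data from existing data can be sketched as below. This is a deliberately naive illustration: the per-category attribute lists, the attribute names `waterproof` and `reflective_details`, and the keyword-matching rule are all assumptions, standing in for whatever processing the system 1 actually applies.

```python
# Hypothetical attribute requirements per classification code (not from the source).
REQUIRED_ATTRIBUTES = {
    "Sports & Outdoor / Cycling Jackets": [
        "title", "description", "waterproof", "reflective_details"],
    "Clothing / Mens / Jackets": ["title", "description"],
}


def missing_attributes(item, classification_code):
    """List required attributes that are absent, None or empty for this item."""
    required = REQUIRED_ATTRIBUTES.get(classification_code, [])
    return [a for a in required if item.get(a) in (None, "")]


def infer_from_text(item, attribute):
    """Naive stand-in for the restorative-data generator: report True if the
    attribute's keyword appears in the item's existing text, else None."""
    text = " ".join(str(item.get(f, "")) for f in ("title", "description")).lower()
    keyword = attribute.replace("_details", "").replace("_", " ")
    return True if keyword in text else None


def recomplete(item, classification_code):
    """Return a copy of the item with any inferable restorative values added."""
    restored = dict(item)
    for attribute in missing_attributes(item, classification_code):
        value = infer_from_text(item, attribute)
        if value is not None:
            restored[attribute] = value
    return restored


jacket = {
    "title": "Waterproof cycling jacket",
    "description": "Lightweight shell with reflective piping",
}
gaps = missing_attributes(jacket, "Sports & Outdoor / Cycling Jackets")
restored = recomplete(jacket, "Sports & Outdoor / Cycling Jackets")
```

The sketch only fills an attribute when the existing text supports it, leaving other gaps open for a later strategy such as image recognition or manual validation.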
For example, many data items comprise one or more images that can be processed to generate restorative data - for example, turning image data into text data. This can be particularly effective as the one or more images typically depict the subject (or part of the subject) of the data item. For example, where the data item is a product record relating to a pair of shoes, images included as part of that product record may depict the pair of shoes, a single shoe, parts of a shoe, etc. as separate images. Accordingly, both the subject (i.e. a pair of shoes) and many other characteristics of the subject (colour, style, adornments, material, etc.) can be identified by the system computing resources 10 applying image recognition techniques to each of the images to extract and identify features of the images.
It should be noted that data items that comprise images may do so directly, such as storing the image data in a suitable data format as part of the data item. Alternatively, the data items may comprise images indirectly: for example via a link such as a Uniform Resource Locator to the image. Ideally, such linked-to images are held in another part of the database 11 of the system 1, or otherwise accessible to the system 1 - for example as part of the seller system 3.
Advantageously, the image recognition performed by the product classification system 1 is assisted by its connection to the channel system 2. This is because the channel system 2 typically has a complete database and taxonomy 20 that includes images, classifiers and other attributes.
Accordingly, image recognition involves comparing an image of an incomplete data item to corresponding images accessible via the database of the channel system 2, determining a suitable match, and generating the restorative data - such as descriptions, attributes and a classification code - on the basis of the data in the channel database 20 corresponding to the matched image. Advantageously, this can also provide a very high confidence mapping to a destination classification code.
Other image recognition techniques may also be applied instead of, or to complement, this approach. For example, an image of an incomplete data item may be passed by the product classification system 1 to a generic image recognition service (typically trained on a large generic image dataset), which returns in response, in text form, a generic estimate of the subject of the image (e.g. "shoe"). The generic estimate can then be used to query a specific subset of the database of the channel system 2 to determine a subset of images against which the original image of the incomplete data item can be compared, thereby to determine better restorative data, i.e. a more comprehensive set of attributes/values with which to recomplete otherwise incomplete data items.
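The two-stage approach (generic estimate first, then comparison against a narrowed subset of channel images) can be sketched as follows. Everything concrete here is an assumption for illustration: images are represented as short feature vectors, similarity is measured by cosine similarity, the generic label is taken as already returned by some external recognition service, and the record layout, the 0.9 similarity floor and the "Clothing / Womens / Shoes" node are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def restore_from_image(image_vec, generic_label, channel_records, min_similarity=0.9):
    """Narrow the channel records using the generic label, pick the most similar
    image, and copy that record's data as restorative data (None if no match)."""
    subset = [r for r in channel_records if generic_label in r["keywords"]]
    if not subset:
        return None
    best = max(subset, key=lambda r: cosine(image_vec, r["image_vec"]))
    if cosine(image_vec, best["image_vec"]) < min_similarity:
        return None
    return {"classification_code": best["classification_code"], **best["attributes"]}


# Hypothetical channel database records with precomputed image features.
channel_records = [
    {"keywords": {"shoe"}, "image_vec": [0.9, 0.1, 0.0],
     "classification_code": "Clothing / Mens / Shoes",
     "attributes": {"material": "suede", "style": "loafer"}},
    {"keywords": {"shoe"}, "image_vec": [0.1, 0.9, 0.2],
     "classification_code": "Clothing / Womens / Shoes",
     "attributes": {"material": "leather", "style": "boot"}},
    {"keywords": {"jacket"}, "image_vec": [0.9, 0.1, 0.0],
     "classification_code": "Clothing / Mens / Jackets",
     "attributes": {"material": "nylon"}},
]

# "shoe" stands for the text estimate returned by the generic service.
restorative = restore_from_image([0.88, 0.12, 0.01], "shoe", channel_records)
no_match = restore_from_image([0.88, 0.12, 0.01], "hat", channel_records)
```

Filtering on the generic label first keeps the jacket record out of the comparison even though its feature vector happens to be identical to that of the matching shoe.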
Naturally, in between generating restorative data, and applying an update to otherwise incomplete data items, a further optional step is to include a manual validation process as already described above - i.e. whereby an operator interface 15 presents operator-selectable options to enable manual validation of which of the automatically-determined attributes/values should be used to complete an otherwise incomplete data item.
It should be additionally noted that, in the examples described above, the operator or user interface 15 may be in the form of a graphical user interface that is hosted by the product classification system 1, but is accessed and controlled remotely - for example via a device of the seller system 3 using a web or mobile interface.
Thus, the system 1 described above enables a user (e.g. a seller) to action the ingestion by the product classification system 1 of a set of product records (i.e. data items) of a seller product database 30, with the images from those product records being used by the product classification system 1 to recognise the type of each one of those products.
Furthermore, via the operator interface 15 the user/seller can, taking the role of the operator, be provided with suggestions of attributes/values which are relevant characteristics of a product, and these can be based on inferences from information accessible via the channel system 2 (e.g. text, further images of other products). The user interacts with the suggestions, and selects an appropriate one of the automatically-determined suggestions.
It should also be noted that the product classification system 1 is configured to apply iterative classification strategies to data items, such as product records. For example, a first classification strategy may comprise the automated generation of restorative data. This can be followed by a second classification strategy of manually validating that the restorative data is appropriate to a data item, which can in turn be followed by a third classification strategy that involves updating the classification code (and/or the associated weight) of a data item in response to the manually-validated restorative data.
The product classification system 1 is configured to apply an additional performance-based classification strategy that may be used in complement with the others above-mentioned. The performance-based classification strategy is particularly apt for the classification of products that are to be sold via a channel system 2, but can be applied in other contexts as well.
Notably, the performance-based classification strategy comprises: classifying a data item under a first trial classification code within the destination taxonomy; monitoring performance characteristics of that data item whilst classified under the first trial classification code within the destination taxonomy (e.g. the number of times that data item is viewed, located in a search and/or an associated product sold whilst listed under that trial classification code); comparing those performance characteristics with other performance characteristics resulting from classifying the data item (or similar data items) under a second trial classification code; and reclassifying the data item under the classification code that has the most optimal performance characteristics, as determined by the performance characteristic comparison.
For example, a product listed under "Sports & Outdoor / Cycling Jackets" may sell better than the same product listed under "Clothing / Mens / Jackets", in which case the former classification is chosen by the product classification system 1 as the most appropriate for that product.
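The comparison step of the performance-based strategy can be sketched in a few lines. The scoring rule used here (sales per view) is an assumption, since the description lists views, search hits and sales as possible performance characteristics without prescribing how they are combined; the class and function names and the figures are likewise illustrative only.

```python
from dataclasses import dataclass


@dataclass
class TrialStats:
    """Performance characteristics gathered under one trial classification code."""
    views: int = 0
    search_hits: int = 0
    sales: int = 0

    def score(self) -> float:
        """Assumed scoring rule: conversion rate (sales per view)."""
        return self.sales / self.views if self.views else 0.0


def best_classification(trials):
    """Return the trial classification code with the highest score."""
    return max(trials, key=lambda code: trials[code].score())


# Illustrative, made-up figures for the same product under two trial codes.
trials = {
    "Clothing / Mens / Jackets": TrialStats(views=1000, search_hits=300, sales=12),
    "Sports & Outdoor / Cycling Jackets": TrialStats(views=800, search_hits=450, sales=24),
}
chosen = best_classification(trials)
```

A different combination of characteristics (for instance weighting search hits as well) would only require changing `score`; the selection logic stays the same.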
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art.

Claims

1. A computer-implemented method of classifying a plurality of data items within a hierarchical destination taxonomy by assigning each data item with a destination classifier that is representative of a node in the destination taxonomy, the method comprising: determining the structure of the hierarchical destination taxonomy including parent-child relationships of each node with other nodes of the taxonomy; parsing text components of the destination taxonomy, each text component being registered as originating from a respective nodal location within the destination taxonomy; generating a text component set for each node that includes those originating from the node, and those originating from at least one of a parent, a child and a sibling node; assigning a weight to each text component of each set depending, at least in part, on the relative difference in nodal location between the node of the set, and that from which the text component originates; parsing text components from a first set of the plurality of data items, each text component being assigned a weight for a respective data item; calculating the level of correlation between each data item of the first set and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; classifying data items of the first set of data items by assigning a destination classification code to them if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold; and applying a validation process to other data items of the first set if the level of correlation, as represented by an applicable confidence score, is below the predetermined threshold, the validation process resulting in the assignment of a destination classification code to the other data items.
2. The method of claim 1, wherein parsing text components of the destination taxonomy and/or from the first set of the plurality of data items further comprises pre-processing to remove predetermined low-semantic-value text components.
3. The method of claim 1 or claim 2, wherein, for a node of the destination taxonomy, the step of assigning a weight to each text component of each node set comprises assigning a higher weight to text components originating from the same or parent nodal locations within the destination taxonomy than text components originating from child or sibling nodal locations.
4. The method of any preceding claim, wherein the step of assigning a weight to each text component comprises assigning a higher weight to text components derived from a title attribute than a description attribute.
5. The method of claim 4, wherein the weight assigned to text components derived from a title attribute is between 10 and 50 times greater than the weight assigned to text components derived from the description attribute.
6. The method of any preceding claim, further comprising monitoring the destination taxonomy over time for changes, and reclassifying data items in response to detecting a change in the nodal structure of the destination taxonomy, wherein the reclassified data items belong to nodes descendant from a parent node itself having a descendant node that has been detected to have changed.
7. The method of any preceding claim, further comprising applying a performance-based classification strategy comprising: classifying a data item under a first trial classification code within the destination taxonomy; monitoring performance characteristics of that data item whilst classified under the first trial classification code within the destination taxonomy; comparing those performance characteristics with other performance characteristics resulting from classifying the data item and/or similar data items under a second trial classification code; and reclassifying the data item under the classification code that has the most optimal performance characteristics, as determined by the performance characteristic comparison.
8. The method of claim 7, wherein the step of monitoring the performance characteristic comprises monitoring the number of times that data item is viewed, located in a search and/or is subject to a transaction whilst listed under that trial classification code within the destination taxonomy.
9. The method of any preceding claim, wherein the step of applying a validation process to other data items includes applying a manual validation process comprising: presenting identifiers of each data item, via an operator interface, to an operator with a set of operator-selectable options each describing a respective proposed destination classifier selected from those calculated to have the highest level of correlation to that data item; and receiving, via the operator interface, a selection of one of the operator-selectable options, and in response assigning the corresponding destination classification code to that data item.
10. The method of any preceding claim, wherein the step of applying a validation process to other data items further comprises: determining an image that is associated with a corresponding one of the other data items; performing image recognition of that image to generate additional text components associated with that data item; assigning a weight to each additional text component; calculating the level of correlation between that data item and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; and classifying the other data items by assigning a destination classification code to them if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold.
11. The method of claim 10, wherein image recognition comprises: comparing and matching the image with at least one destination image provided on a database from which the destination taxonomy is derived, destination images being provided together with a corresponding destination text component; and processing the destination text components to generate additional text components for association with the data item.
12. The method of any preceding claim, further comprising: detecting an incomplete data item of the plurality of data items, the incomplete data item being detected to be absent of a set of missing information, such as one or more values, attributes and attribute-value pairs; generating and storing restorative data to recomplete the otherwise incomplete data item by processing data that already exists for that data item to generate text components for use as restorative data; and adding the recompleted data item to the first set of the plurality of data items for subsequent classification.
13. The method of claim 12, wherein the step of generating and storing restorative data further comprises processing image data that already exists for that data item by performing image recognition on said image data to generate text components for use as restorative data to recomplete that data item.
14. The method of any preceding claim, further comprising: determining a second set of the plurality of data items that each have a source classifier that represents a node within a source taxonomy; determining the structure of the source taxonomy, parsing its text components, generating a text component set for each node of the source taxonomy, and assigning a weight to each text component of each node set; calculating the level of correlation between nodes of the source and destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; generating a source-to-destination map comprising node mappings from each node of the source taxonomy to a plurality of the nodes of the destination taxonomy calculated to be the most correlated, each mapping being stored together with an associated confidence score; classifying data items of the second set by assigning a destination classification code to them if, according to the generated source-to-destination map, the source-to-destination confidence score of an applicable source-to-destination mapping is above a predetermined threshold; and applying a validation process to other data items of the second set if, according to the generated map, the source-to-destination confidence score of an applicable source-to-destination mapping is below the predetermined threshold, the validation process resulting in the assignment of a destination classification code to the other data items.
15. A computer-implemented classification system for classifying data items according to a hierarchical destination taxonomy by assigning each data item with a destination classifier that is representative of a node in the destination taxonomy, the system comprising a database configured to store the destination hierarchy, and computing resources configured to: determine the structure of the hierarchical destination taxonomy including parent-child relationships of each node with other nodes of the taxonomy; parse text components of the destination taxonomy, each text component being registered as originating from a respective nodal location within the destination taxonomy; generate a text component set for each node that includes those originating from the node, and those originating from at least one of a parent, a child and a sibling node; assign a weight to each text component of each set depending, at least in part, on the relative difference in nodal location between the node of the set, and that from which the text component originates; parse text components from a first set of the plurality of data items, each text component being assigned a weight for a respective data item; calculate the level of correlation between each data item of the first set and nodes of the destination taxonomy on the basis of their respective weighted text components, the level of correlation being represented by a confidence score; classify data items of the first set of data items by assigning a destination classification code to them if the calculated level of correlation, as represented by an applicable confidence score, is above a predetermined threshold; and apply a validation process to other data items of the first set if the level of correlation, as represented by an applicable confidence score, is below the predetermined threshold, the validation process resulting in the assignment of a destination classification code to the other data items.
PCT/GB2020/050820 2019-03-26 2020-03-26 Improved system and method for data classification WO2020193985A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1904183.9A GB201904183D0 (en) 2019-03-26 2019-03-26 Improved system and method for data classification
GB1904183.9 2019-03-26

Publications (1)

Publication Number Publication Date
WO2020193985A1 (en) 2020-10-01

Family

ID=66381492

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2020/050820 WO2020193985A1 (en) 2019-03-26 2020-03-26 Improved system and method for data classification

Country Status (2)

Country Link
GB (1) GB201904183D0 (en)
WO (1) WO2020193985A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120314941A1 (en) * 2011-06-13 2012-12-13 Microsoft Corporation Accurate text classification through selective use of image data
US20170103434A1 (en) * 2015-10-08 2017-04-13 Paypal, Inc. Automatic taxonomy alignment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DOMENICO BENEVENTANO ET AL: "Beneventano et al.: A Web Service based framework for the semantic mapping amongst product classification A WEB SERVICE BASED FRAMEWORK FOR THE SEMANTIC MAPPING AMONGST PRODUCT CLASSIFICATION SCHEMAS", JOURNAL OF ELECTRONIC COMMERCE RESEARCH, vol. 5, no. 2, 31 January 2004 (2004-01-31), pages 114 - 127, XP055706623, ISSN: 1526-6133 *

Also Published As

Publication number Publication date
GB201904183D0 (en) 2019-05-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20716898

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20716898

Country of ref document: EP

Kind code of ref document: A1