METHOD AND SYSTEM FOR DATA CURATION
PRIORITY DOCUMENTS
[0001] The present application claims priority from Australian Provisional Patent Application No. 2018900840 titled "METHOD AND SYSTEM FOR DATA CURATION" and filed on 14 March 2018, the content of which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] The present disclosure relates to the automated organising and processing of electronic data items. In a particular form, the present disclosure relates to automated curation of an electronic data stream consisting of multiple electronic data items.
BACKGROUND
[0003] The continuous improvement in network connectivity, electronic data storage, electronically enabled sensors and the electronic publication of material, including that related to both traditional and social media, has led to the generation of a deluge of electronic data. The field of big data analytics attempts to draw meaning from these large amounts of data. Initiatives in the field of big data analytics have been used to personalise advertisements in elections; improve government services; target marketing of products and services through social media; predict intelligence activities; unravel human trafficking activities; understand the impact of news and current events on stock markets; analyse financial risks; accelerate scientific discovery; as well as to improve national security and public health.
[0004] An important aspect of big data analytics is the data curation process. Referring now to Figure 1, there is shown a system overview diagram of an ideal data curation system 100 which functions to process and transform input raw electronic data streams 110, ie, DS1, DS2, DS3 ... DSN, which consist of unstructured, semi-structured and structured electronic data items, eg, text, video and image data sets, which are then curated in data curation process 120 to generate curated data streams 130, ie, CD1, CD2, CD3 ... CDM, each consisting of contextualised data and knowledge that is maintained and made available for use by end users and applications. For example, information extracted from tweets can be enriched with metadata on geo-location, and then linked to health issues or government policies to understand their impact on society in that geographical locale.
[0005] The data curation process 120 involves identifying relevant data sources, extracting data and knowledge, and cleaning, enriching and linking data and knowledge. For example, information extracted from Tweets originating from the Twitter™ computer software application is often enriched with metadata such as geo-location, in the absence of which the extracted information would be difficult to interpret and meaningfully utilise. Data curation thus acts as the glue between raw data and analytics, providing an abstraction layer that relieves users from time-consuming, tedious and error-prone curation tasks.
[0006] One category of approaches to the task of data curation is the adoption of computational machine-learning based algorithms for information extraction, item classification, record-linkage, clustering, and sampling. As an example, a machine learning-based algorithm can be used to extract named entities from Tweets (eg, "ISIS" and "Palmyra" in "There are 1800 ISIS terrorists in Palmyra, only 300 are Syrians"), link these named entities to equivalent entities in an electronic knowledge base (eg, Wikidata) and on this basis classify Tweets into a set of predefined topics (eg, using a Naive Bayes classifier).
[0007] Unfortunately, an algorithmic approach alone is limited for various reasons. Algorithms are typically designed for a specific data processing context and cannot be easily adapted to work in another context (eg, information extraction algorithms are developed for well-formed English texts whereas a significant amount of electronic data originating from social media is misspelled, ungrammatical, and often comprises poorly formatted short sentences with colloquial fragments). As such, algorithms are designed to solve a specific and concrete task. Moreover, machine-learning algorithms require training data, which may not be easily available in some situations, eg, where the relevant training data is sensitive or confidential.
[0008] Another approach that deals with many of the shortcomings inherent in purely algorithmic methodologies is the use of rule-based techniques, where electronic data is processed in accordance with a classification or curation rule designed to select for the data of interest. One of the benefits of this approach is that non-technical domain experts or even knowledge workers are able to generate rules much faster than designing an equivalent machine-learning algorithm. This is especially useful when big data feeds are continuous, or when rules are in some sense "obvious". Furthermore, rules may easily be modified to correct mistakes, in contrast to retraining machine-learning systems.
[0009] In addition, rule-based techniques can handle cases that machine-based learning cannot handle. In particular, user-defined curation rules maintain data quality as an essential success factor for sustaining any knowledge-intensive process, such as entity linking and data analytics, and can be used to detect data quality issues in the federated knowledge graph. This is because rules, unlike a machine-learning procedure, can be expressed independently. Therefore, when circumstances change, rules can easily be added or removed to meet changing requirements. For example, in the context of social media posts, entities who do not wish their posts to be identified and classified will change their hashtags very often to cheat learning-based classifiers. Additionally, new keywords, hashtags, products, etc, are introduced every day. In this context, creating training data for learning algorithms is either extremely time consuming or indeed may be impossible for extremely dynamic and constantly changing content. A rule-based approach can, however, address these problems. As an example, a rule that classifies Tweets as requiring 'Urgent Investigation' on the basis of the following rule [location = 'Syria' AND tweet-owner greater than 1000 followers AND text contains 'recruiting'] is not dependent on dynamically changing hashtags or keywords.
[0010] This flexibility is also one of the greatest disadvantages of current rule-based systems.
Typically, domain specialists or data analysts that have the highly specialised background knowledge related to the curation task are required in the first instance to devise and assess the performance of curation rules. As such, the formulation of these curation rules can be highly subjective, as the particular domain specialist's approach to devising and modifying rules can be coloured by their own individual experience. This can result in curation rules that are unduly focused on the areas of interest of the domain specialist, in the process ignoring other approaches to generating the curation rules. Another drawback is that, depending on the domain area, there may not be enough available specialists in that area, and as such only the simplest curation rules can be devised covering a particular area of interest.
[0011] There is therefore a need for a method for the automated generation or enhancement of curation rules which allows for the development of comprehensive curation rules for a particular area of interest without necessarily requiring the input of a domain specialist in that area.
SUMMARY
[0012] In a first aspect, the present disclosure provides a computer implemented method for curating an electronic data stream consisting of a plurality of electronic data items into a category, where the category relates to information contained in the electronic data items, comprising:
processing each of the electronic data items in accordance with a feature schema applicable to the electronic data stream to determine values of the features defined by the feature schema for each of the individual electronic data items;
adopting an initial curation rule that is operative on a selection of the features defined by the feature schema and their determined feature values for each of the individual electronic data items to determine whether an individual electronic item is in the category;
automatically processing the initial curation rule to generate an enriched curation rule, wherein the enriched curation rule is operative on an expanded selection of the determined values of the features defined by the feature schema for each of the individual electronic data items as compared to the initial curation rule; and
applying the enriched curation rule to the electronic data stream to curate electronic data items into the category based on determination by the enriched curation rule.
[0013] In another form, processing each of the electronic data items in accordance with the feature schema includes:
extracting data from the electronic data item;
classifying the extracted data to determine the feature values of the features defined by the feature schema for each of the individual electronic items; and
storing in a data store the determined feature values of the features defined by the feature schema for each of the individual electronic items.
[0014] In another form, processing the initial curation rule to generate the enriched curation rule comprises:
selecting a feature from the selection of features associated with the initial curation rule;
decomposing the selected feature into an expanded set of features from the feature schema; and
augmenting the initial curation rule to operate on the expanded set of features and their associated determined feature values for each of the individual electronic data items.
[0015] In another form, augmenting the initial curation rule includes generating component curation rules that operate on one or more of the expanded set of features and their associated determined feature values for each of the individual electronic data items.
[0016] In another form, the selected feature is expressible as text and wherein the expanded set of features correspond to features associated with constituent semantic elements of the text.
[0017] In another form, processing the initial curation rule to generate the enriched curation rule comprises:
selecting a feature from the selection of features associated with the initial curation rule;
generating one or more alternative feature values that the feature may adopt consistent with the initial curation rule; and
modifying the initial curation rule to also operate on the alternative feature values to determine whether an individual electronic item is in the category.
[0018] In another form, the selected feature is expressible as text and wherein the one or more alternative feature values are semantically equivalent to the selected feature.
[0019] In another form, the one or more alternative feature values are generated from a knowledge graph based on the selected feature.
[0020] In another form, the one or more alternative feature values includes a synonym to a feature value of the selected feature adopted in the initial curation rule.
[0021] In another form, the one or more alternative feature values includes a subject related term to a feature value of the selected feature adopted in the initial curation rule.
[0022] In another form, the one or more alternative feature values includes an alternative spelling, mispronunciation or abbreviation of a feature value of the selected feature adopted in the initial curation rule.
[0023] In another form, the method includes generating the one or more alternative feature values from a machine learning model.
[0024] In another form, generating the one or more alternative feature values from the machine learning model comprises:
adopting one or more text corpuses;
decomposing the one or more text corpuses into individual tokens;
processing the tokens to generate a vector space model;
selecting an initial feature value related to the determined feature values corresponding to the initial curation rule; and
generating the one or more alternative feature values based on the initial feature value and a similarity measure applied to the vector space model.
[0025] In another form, the vector space model is selected in accordance with one or more of the selection of features related to the initial curation rule.
[0026] In another form, the similarity measure is a cosine similarity measure.
[0027] In another form, the similarity measure is a Jaccard similarity measure.
[0028] In another form, the selected feature value is expressible as text and comprising generating the one or more alternative feature values by pattern matching.
[0029] In another form, pattern matching includes determining the one or more alternative feature values by whether they co-occur with the selected feature value relative to one or more text corpuses.
[0030] In another form, determining the one or more alternative feature values by whether they co-occur with the selected feature value includes generating a co-occurrence table based on the one or more text corpuses and the selected feature value.
[0031] In another form, the electronic data stream is dynamically updating in real time.
[0032] In a second aspect, the present disclosure provides a system for curating an electronic data stream consisting of a plurality of electronic data items into a category, where the category relates to information contained in the electronic data items, the system comprising:
a data classification and storage module comprising one or more processors and configured to process each of the electronic data items in accordance with a feature schema applicable to the electronic data stream to determine values of the features defined by the feature schema for each of the individual electronic data items and to adopt an initial curation rule that is operative on a selection of the features defined by the feature schema and their determined feature values for each of the individual electronic data items to determine whether an individual electronic item is in the category;
a curation rule processing module comprising one or more processors and communicatively coupled to the data classification and storage module and configured to automatically process the initial curation rule to generate an enriched curation rule, wherein the enriched curation rule is operative on an expanded selection of the determined values of the features defined by the feature schema for each of the individual electronic data items as compared to the initial curation rule; and
a rule application module comprising one or more processors and communicatively coupled to the data classification and storage module and the curation rule processing module and configured to apply the enriched curation rule to the electronic data stream to curate electronic data items into the category based on determination by the enriched curation rule.
[0033] In another form, the data classification and storage module comprises a data classification processor configured to process each of the electronic data items in accordance with the feature schema by:
extracting data from the electronic data item;
classifying the extracted data to determine the feature values of the features defined by the feature schema for each of the individual electronic items; and
storing in a data store a populated data structure comprising the determined feature values of the features defined by the feature schema for each of the individual electronic items.
[0034] In another form, processing the initial curation rule by the curation rule processing module to generate the enriched curation rule comprises:
selecting a feature from the selection of features associated with the initial curation rule;
decomposing the selected feature into an expanded set of features from the feature schema; and
augmenting the initial curation rule to operate on the expanded set of features and their associated determined feature values for each of the individual electronic data items.
[0035] In another form, augmenting the initial curation rule includes generating component curation rules that operate on one or more of the expanded set of features and their associated determined feature values for each of the individual electronic data items.
[0036] In another form, the selected feature is expressible as text and wherein the expanded set of features correspond to features associated with constituent semantic elements of the text.
[0037] In another form, processing the initial curation rule by the curation rule processing module to generate the enriched curation rule comprises:
selecting a feature from the selection of features associated with the initial curation rule;
generating one or more alternative feature values that the feature may adopt consistent with the initial curation rule; and
modifying the initial curation rule to also operate on the alternative feature values to determine whether an individual electronic item is in the category.
[0038] In another form, the selected feature is expressible as text and wherein the one or more alternative feature values are semantically equivalent to the selected feature.
[0039] In another form, the one or more alternative feature values are generated from a knowledge graph based on the selected feature.
[0040] In another form, the one or more alternative feature values includes a synonym to a feature value of the selected feature adopted in the initial curation rule.
[0041] In another form, the one or more alternative feature values includes a subject related term to a feature value of the selected feature adopted in the initial curation rule.
[0042] In another form, the one or more alternative feature values includes an alternative spelling, mispronunciation or abbreviation of a feature value of the selected feature adopted in the initial curation rule.
[0043] In another form, the curation rule processing module is configured to generate the one or more alternative feature values from a machine learning model.
[0044] In another form, generating the one or more alternative feature values from the machine learning model comprises:
adopting one or more text corpuses;
decomposing the one or more text corpuses into individual tokens;
processing the tokens to generate a vector space model;
selecting an initial feature value related to the determined feature values corresponding to the initial curation rule; and
generating the one or more alternative feature values based on the initial feature value and a similarity measure applied to the vector space model.
[0045] In another form, the vector space model is selected in accordance with one or more of the selection of features related to the initial curation rule.
[0046] In another form, the similarity measure is a cosine similarity measure.
[0047] In another form, the similarity measure is a Jaccard similarity measure.
[0048] In another form, the selected feature value is expressible as text and the curation rule processing module is configured to generate the one or more alternative feature values by pattern matching.
[0049] In another form, pattern matching includes determining the one or more alternative feature values by whether they co-occur with the selected feature value relative to one or more text corpuses.
[0050] In another form, determining the one or more alternative feature values by whether they co-occur with the selected feature value includes generating a co-occurrence table based on the one or more text corpuses and the selected feature value.
[0051] In another form, the electronic data stream is dynamically updating in real time.
BRIEF DESCRIPTION OF DRAWINGS
[0052] Embodiments of the present disclosure will be discussed with reference to the accompanying drawings wherein:
[0053] Figure 1 is a system overview diagram of an ideal data curation system;
[0054] Figure 2 is a system flowchart of a computer-implemented method for curating an electronic data stream in accordance with an illustrative embodiment;
[0055] Figure 3 is a system overview diagram of a curation system for implementing the method for curating an electronic data stream illustrated in Figure 2 in accordance with an illustrative embodiment;
[0056] Figure 4 is a system flowchart of a computer-implemented method for processing an individual electronic data item in accordance with a feature schema in accordance with an illustrative embodiment;
[0057] Figure 5 is a system overview diagram of a data classification and storage module in accordance with an illustrative embodiment;
[0058] Figure 6 is a system overview diagram of the data classification and storage module illustrated in Figure 5 as applied to a Twitter™ stream in accordance with an illustrative embodiment;
[0059] Figure 7 is a curation rule graph directed to classifying Tweets according to an illustrative embodiment;
[0060] Figure 8 is a system flowchart of a method for determining an expanded set of feature values for use in an enriched curation rule according to an illustrative embodiment;
[0061] Figure 9 is a system flowchart of a method for determining an expanded set of feature values for use in an enriched curation rule according to another illustrative embodiment; and
[0062] Figure 10 depicts a co-occurrence frequency table as employed in a pattern matching function for determining an expanded set of feature values for use in an enriched curation rule according to an illustrative embodiment.
[0063] In the following description, like reference characters designate like or corresponding parts throughout the figures.
DESCRIPTION OF EMBODIMENTS
[0064] Referring now to Figure 2, there is shown a system flowchart of a computer-implemented method 1000 for curating an electronic data stream according to an illustrative embodiment of the present disclosure. In this illustrative embodiment, the electronic data stream consists of a number of electronic data items and the curation task involves determining whether an individual electronic data item should fall within a given category where the category relates to the information contained in the electronic data items. By way of overview, the curation method 1000 comprises processing the electronic data stream in accordance with a feature schema at step 1010, adopting an initial curation rule at step 1020, automatically generating an enriched curation rule at step 1030 and then applying the enriched curation rule to the processed electronic data stream at step 1040.
[0065] Referring also to Figure 3, there is shown a system overview diagram of a computer-based curation system 2000 for implementing the method 1000 for curating an electronic data stream illustrated in Figure 2 according to an illustrative embodiment. By way of overview, curation system 2000 comprises a data classification and storage module 2100 which has as an input the electronic data stream and which provides as an output the processed data stream 2100A, a curation rule processing module 2200 which takes as its input an initial curation rule and processes this automatically to output an enriched curation rule 2200A, and a rule application module 2300 which applies the enriched curation rule 2200A to the processed data stream 2100A to provide a curated data stream 2300A.
[0066] Those of skill in the art would appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed may be
implemented as electronic hardware, computer software or instructions, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether the disclosed functionality is implemented as hardware or software will depend upon the requirements of the application and design constraints imposed on the overall system (eg, a system to process electronic data streams in the form of Twitter feeds).
[0067] As an example, data classification and storage module 2100, curation processing module 2200 and rule application module 2300 may be developed in PHP, .NET, Python, Ruby, Java or similar language and run on a digital computer such as a laptop, desktop, workstation, mobile device or any other appropriate computer or processor. In one example embodiment, the digital computer may be based on an x86 processor running Windows, Linux or other variations of UNIX. Data classification and storage module 2100, curation processing module 2200 and/or rule application module 2300 may be run as separate applications on a single digital computer or processor or be distributed over a number of digital computers or processors depending on requirements. In an alternative embodiment, any combination of data classification and storage module 2100, curation processing module 2200 and/or rule application module 2300 may be implemented as separate capabilities within a single application.
[0068] The systems and methods described here may be implemented in a computing system that includes a back end component (eg, a data server), or that includes a middleware component (eg, an application server), or that includes a front end component (eg, a client computer having a graphical user interface or a Web browser through which a user can interact with the system), or any combination of such back end, middleware, or front end components. The components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), peer-to-peer networks (having ad-hoc or static members), grid computing infrastructures, and the Internet.
[0069] The computing system may be implemented as a client-server arrangement where the clients and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In one example of a client-server
arrangement the server may be a web server running Apache or a similar Internet-enabled service to allow remote access by operators employing web browsers operable to access and interact with the described system.
[0070] At step 1010 of Figure 2, the electronic data items making up the electronic data stream are first processed in accordance with a feature schema applicable to the electronic data stream in order to determine feature values of the features defined by the feature schema for each of the individual electronic data items.
[0071] As used throughout this specification, a feature is an observable quantity that characterises or classifies data in the electronic data items that make up the electronic data stream. As would be appreciated, this classification may be at multiple levels of granularity and targeted to particular subject domains so that meaningful inferences may then be drawn from the electronic data stream.
[0072] At a generalised level, the term "feature" may be considered as some function Fi that is applied to an electronic data item to determine an associated feature value on the following basis:

Fi(electronic_data_item) -> feature_value_x

[0073] Similarly, compound features may also be defined as follows:

Fj(Fi(electronic_data_item)) -> feature_value_y
[0074] The output value type of Fi, ie of the feature value, will correspond to one of the standard data types found in common programming and scripting languages. This includes simple data-types, such as string, integer and Boolean, as well as complex data-types such as Set or List. This allows leveraging of the standard data-type operators that are available in programming and scripting languages. As an example, where a feature corresponds to an integer data type, standard operators such as equals, less-than or greater-than may be employed to deal with this feature. Where alternatively a feature corresponds to a complex data-type such as a Set, operators such as contains or isEmpty may be employed.
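To make this formalism concrete, the following is a minimal Python sketch (Python being one of the implementation languages mentioned above) of features as typed functions over an electronic data item; the Tweet class, the toy stop list and the feature names are illustrative assumptions rather than part of the specification.

from dataclasses import dataclass
from typing import List, Set

@dataclass
class Tweet:                        # illustrative electronic data item
    text: str
    hashtags: Set[str]

# Fi: a simple feature mapping an electronic data item to a string feature value
def text_feature(tweet: Tweet) -> str:
    return tweet.text

# Fj(Fi(...)): a compound feature composed from another feature; its feature
# value is a complex data-type (a List of keyword strings)
def keywords_feature(tweet: Tweet) -> List[str]:
    stopwords = {"the", "a", "in", "was", "of"}       # toy stop list
    return [w for w in text_feature(tweet).lower().split() if w not in stopwords]

# standard data-type operators may then be applied to the feature values
tweet = Tweet(text="John Kennedy was assassinated in Dallas", hashtags={"#JFK"})
print(keywords_feature(tweet))      # ['john', 'kennedy', 'assassinated', 'dallas']
print("assassinated" in keywords_feature(tweet))      # 'contains' operator -> True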
[0075] Throughout this specification, the term "feature schema" refers to a data structure that functions to map out the relationships or logical constraints between the various features of the electronic data items that will form the basis of the curation task that will be applied to the electronic data stream
and which allows ready interrogation of the data structure to determine the values of these features as determined by the initial processing of the electronic data items in accordance with the feature schema.
[0076] Referring now to Figure 4, there is shown a system flowchart of the method for processing an individual electronic data item 1010 as illustrated in Figure 2 according to an illustrative embodiment. At step 1011, data is extracted from the electronic data item. At step 1012, the extracted data is classified to determine the values of features defined by the feature schema for each of the electronic data items. At step 1013, the values of the features defined by the feature schema for each of the individual electronic items are stored in accordance with the feature schema. As would be appreciated, for many of the features defined by the feature schema no associated feature value may be determined by this processing step, depending on the content of the electronic data item.
[0077] Referring now to Figure 5, there is shown a generalised system overview diagram of the data classification and storage module 2100 in accordance with an illustrative embodiment which performs the initial processing of an input electronic data stream DS1 comprising a plurality of electronic data items DI1, DI2, DI3 ... DIM. In this illustrative embodiment, classification and storage module 2100 includes a data classification processor 2110 that extracts information from each individual electronic data item in turn and classifies this extracted data by determining the values of the features defined by the feature schema and then stores these values in accordance with the feature schema in electronic database or data store 2120.
[0078] In one example embodiment, electronic data stream DS1 consists of a stream of electronic data items in the form of "Tweets" as continuously generated by the Twitter™ social media platform, and provides one example of an electronic data stream that dynamically updates in real time. As would be appreciated, a "Tweet" is a short posting consisting of up to 280 characters which can be supplemented by photographs, videos and the like, and which can be shared publicly or within the activity stream of the followers of a particular Twitter user. At the time of writing, around 6,000 Tweets are generated per second on average, corresponding to over 350,000 Tweets sent per minute, 500 million Tweets per day and around 200 billion Tweets per year.
[0079] Referring now to Figure 6, there is shown a detailed system overview diagram of the data classification and storage module 2100 illustrated in Figure 5 for executing the processing step 1010 for the case where the electronic data stream is a Twitter stream and the individual electronic data items are Tweets.
[0080] In accordance with step 1011, data is first extracted and classified from the electronic data item, in the form of Tweet T1 from the Twitter stream DS1, by data classification processor 2110 which includes as an input feature schema 2112 that maps out the relationships or logical constraints between
the various features of the input Tweet. As would be appreciated, the feature schema applicable to a particular electronic data item will be governed by the content and type of the data in the electronic data item as well as the curation task.
[0081] In this example embodiment, where the electronic data item is a Tweet as referred to above, the determination of the values of features in the feature schema will range from the application of text extraction and classification functions that identify named entities, nouns, verbs, phrases, etc, in the text component of the Tweet, to the application of higher level functions such as determining an overall sentiment measure for the Tweet.
[0082] In this example, Twitter feature schema 2112 is applied to a Tweet to generate a Persistent Tweet T1, which is a populated data structure 2115 in accordance with feature schema 2112 containing the determined feature values of the features for the input Tweet T1, and which is then stored in data store 2120 for further analysis and curation.
[0083] As noted above, feature schema 2112 functions to map out the relationships or logical constraints between the various features of the persistent Tweet data structure 2115 whose feature values are then stored in data store 2120. In this manner, the Tweet T1 is converted to an associated populated data structure in accordance with the feature schema which as a result classifies and stores the constituent data of the Tweet in terms of features which may then be readily interrogated for further rule-based curation processing based on these determined feature values in accordance with step 1013. As would be appreciated, this process may be applied iteratively to a stream of individual Tweets whose feature values are then determined, identified as relating to a particular Tweet, and then stored in data store 2120 for analysis.
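As a concrete illustration of such a populated data structure, the following minimal Python sketch shows one possible shape for a persistent Tweet and an in-memory data store; the field names loosely mirror the fundamental features listed in the following paragraphs, and the class and function names are assumptions for illustration only.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PersistentTweet:
    text: str                                                  # Tweet.Text
    named_entities: List[str] = field(default_factory=list)   # Tweet.Text.Named-Entities
    keywords: List[str] = field(default_factory=list)         # Tweet.Text.keywords
    hashtags: List[str] = field(default_factory=list)         # Tweet.HashTag
    links: List[str] = field(default_factory=list)            # Tweet.Link
    location: Optional[str] = None                            # Tweet.Location
    lang: Optional[str] = None                                # Tweet.Lang
    sentiment: Optional[str] = None                           # Tweet.Sentiment

def persist(store: Dict[str, PersistentTweet], tweet_id: str, pt: PersistentTweet) -> None:
    # store the populated data structure, keyed by Tweet identifier, so that
    # feature values can later be interrogated by curation rules
    store[tweet_id] = pt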
[0084] In this illustrative example directed to Tweets, feature schema 2112 and corresponding data structure 2115 include the following fundamental features that are extracted from the electronic data item, which in this embodiment is a Tweet:
• Tweet.Text: defined as the text of the Tweet.
• Tweet.Text.Named-Entities: defined as the list of Named Entities that can be extracted from the text of the Tweet.
• Tweet.Text.keywords: defined as the list of Keywords that can be extracted from the text of the Tweet.
• Tweet.Text.phrase: defined as a list of phrases that can be extracted from the text of the Tweet.
• Tweet.HashTag: defined as a list of Twitter hashtags mentioned in the Tweet.
• Tweet.HashTag.Named-Entities: defined as the list of Named Entities extracted from the Twitter hashtags mentioned in the Tweet.
• Tweet.HashTag.Keywords: defined as the list of Keywords extracted from the Twitter hashtags mentioned in the Tweet.
• Tweet.Link: defined as the list of the links mentioned in the Tweet.
• Tweet.Link.Named-Entities: defined as the list of Named Entities extracted from the content of a link in the Tweet.
• Tweet.Link.Keywords: defined as the list of Keywords extracted from the content of a link mentioned in the Tweet.
• Tweet.Location: defined as the geographical location information for the Tweet.
• Tweet.Source: defined as the source the Tweet was sent from (eg, PC, laptop or mobile).
• Tweet.MediaType: defined as the type of media included in the Tweet (eg, Photo or Video).
• Tweet.Lang: defined as the language the Tweet was written in (eg, English or French).
• Tweet.User: defined as the user who sent the Tweet.
[0085] Furthermore, more granular features (not shown in Figure 6) may also be determined, such as the following examples relating to further classification of the feature Tweet.Text.Named-Entities:
• Tweet.Text.Named-Entities.Person: defined as the list of Named Entities that are subsequently identified as persons.
• Tweet.Text.Named-Entities.Topic: defined as the list of Named Entities that are subsequently identified as topics.
• Tweet.Text.Named-Entities.Location: defined as the list of Named Entities that are subsequently identified as placenames or locations.
[0086] The following additional features relate to identifying and classifying the words in the text of a Tweet as relating to a part of speech and then further as relating to the type of speech element:
• Tweet.Text.Part-of-Speech: defined as the list of words forming a part of speech that can be extracted from the text of the Tweet.
• Tweet.Text.Part-of-Speech.Quote: defined as the list of Part-of-Speech words that are subsequently identified as forming part of a quote.
• Tweet.Text.Part-of-Speech.Phrase: defined as the list of Part-of-Speech words that are subsequently identified as forming part of a phrase.
• Tweet.Text.Part-of-Speech.Keywords: defined as the list of Part-of-Speech words that are subsequently identified as keywords.
[0087] The following feature relates to determining an overall sentiment for a Tweet:
• Tweet.Sentiment: defined as the classification of the sentiment of the Tweet, selected from negative, positive or neutral, as determined by an algorithmic sentiment classifier which takes as input a set of keywords (eg, a sentence or the text of a Tweet) and classifies this text into a set of emotional categories.
[0088] As would be appreciated, the above list of features is not exhaustive for a Tweet and may vary in accordance with the curation task and the type of input stream.
[0089] There are many other examples of feature schemas for various other categories of electronic data streams such as news feeds, encyclopaedias and social media. In one example, the electronic data stream consists of online news articles published by a media entity where the individual electronic data item would be an article. In this illustrative example, the feature schema for this type of electronic data stream may consist of fundamental features such as:
• news.Age: defined as the number of seconds since the news article was published.
• news.Authors: defined as the list of authors of the news article.
• news.Category: defined as the subject category assigned to the news article.
• news.Text: defined as the text of the news article.
• news.Country: defined as the home country of the organisation that published the news article.
• news.MediaType: defined as the type of media included in the news article (eg, Photo or Video).
[0090] Additional text related features may then also be defined directed to the textual content of the article as referred to above in the example of text based features for a Tweet.
[0091] In another example, the electronic data stream consists of pages published by the online encyclopaedia, Wikipedia, and the feature schema for this type of electronic data stream may consist of the following fundamental features including:
• wikipedia.page.pageid: defined as the unique identifier of the Wikipedia page.
• wikipedia.page.title: defined as the title associated with the Wikipedia page.
• wikipedia.page.text: defined as the text of the Wikipedia page.
• wikipedia.page.images: defined as the list of images in the Wikipedia page.
• wikipedia.author.username: defined as the author's user name.
[0092] Again, additional text-related features may then be defined directed to the textual content of the Wikipedia page, as referred to above in the example of text-based features for a Tweet.
[0093] In another example, the electronic data stream consists of videos published by the online video sharing site YouTube™, and the feature schema for this type of electronic data stream may consist of the following fundamental features:
• youtube.age: defined as the number of seconds since the video was published.
• youtube.author.link: defined as the link to the author's profile page.
• youtube.category: defined as the category selected for the video when it was uploaded.
• youtube.caption: defined as the caption track for the selected video.
• youtube.comment: defined as the comment(s) associated with the selected video.
• youtube.subscription: defined as information about a user subscription.
[0094] Referring again to Figure 2, following the assignment of features to the electronic data items at step 1010, the next step is to adopt an initial curation rule in accordance with the particular curation task.
[0095] At a general level, a curation rule may be defined as follows:

<Rule> ::= <Dataset>.<feature>{.<feature>}(<string | integer | boolean>)
[0096] As can be seen from above, a Rule is expressed in terms of operations on features where, as discussed above, these features correspond to programmatic-like functions whose determined feature values have common input and output types, eg, string, integer, boolean, etc. As a result, any suitable functional or rule-expression language may be adopted to express the rule.
[0097] Examples of these languages include, but are not limited to: DEL (Data Extraction Language), AQL (query-based Information Extraction language), Cloud Dataflow from Google™, Kinesis from Amazon™, and QL.io from eBay™. General-purpose languages that may be adopted for the formulation of rules include, but are not limited to: R, Scala, Python and their extensions. As will be appreciated, a significant benefit of a rule-based approach to data curation is that it does not require the invention of a rule language per se.
[0098] As would be appreciated, use of curation rules enables a higher-level of abstraction that offers flexibility and customisation to deal with curation tasks. Once an input electronic data stream has been processed and stored in accordance with a feature schema to determine the values of the features defined by the feature schema for each electronic data item a curation rule may then be applied to these data based on these features to determine which of the electronic data items meet the criteria of the curation rule.
[0099] A composite rule may be defined as follows (using BNF format):
<Rule> ::= <Rule> [AND | OR | NOT <Rule>]
[0100] As such, composite rules may be composed over one or more other rules allowing for the definition of initial base level rules associated with coarser-grain curation tasks whose output results or data may then form the input for a subsequent rule for finer-grain curation tasks. In this way, curation
rules may be "chained" together to form a composite rule where the data output from one curation rule may then be used as the input to another curation rule and so on.
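As an illustration of this chaining, the following Python sketch shows one way rules in the above BNF form could be realised as composable predicates over the PersistentTweet structure sketched earlier; all names are illustrative assumptions, not part of the specification.

from typing import Callable, Dict, List

Rule = Callable[[PersistentTweet], bool]      # a rule is a predicate over a data item

def AND(r1: Rule, r2: Rule) -> Rule:
    return lambda item: r1(item) and r2(item)

def OR(r1: Rule, r2: Rule) -> Rule:
    return lambda item: r1(item) or r2(item)

def NOT(r: Rule) -> Rule:
    return lambda item: not r(item)

# base-level rules expressed over features of the schema
def keywords_contains(word: str) -> Rule:
    return lambda item: word in item.keywords

def sentiment_negative() -> Rule:
    return lambda item: item.sentiment == "negative"

def curate(store: Dict[str, PersistentTweet], rule: Rule) -> List[PersistentTweet]:
    # apply a (possibly composite) rule to curate matching items from the store
    return [pt for pt in store.values() if rule(pt)]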
[0101] Referring now to Figure 7, there is shown a system overview diagram of a curation rule graph 500 where nodes represent rules and edges are the conjunction of rules to form higher-level rules, in accordance with an illustrative embodiment. In this example, curation rule graph 500 comprises a number of successive curation rules expressed in terms of features from the feature schema that are applied to arrive at a curated data set. In this illustrative example, the initial data source is the data store of persistent Tweets PT 510, where these "Tweets" have been initially processed in accordance with a Twitter feature schema to provide a populated data structure consisting of a number of different features characterising the individual Tweets that have been processed, as has been described above.
[0102] In this example, the curation task is to identify health related Tweets that have a negative sentiment. In this case, two low-level curation rules 520, 530, based on respective features may be invoked and combined together to produce a composite rule that corresponds to a defined higher-level feature that expresses the effect of the composite rule.
[0103] The first curation rule, rulei 520, selects for "health" related Tweets and is defined as follows:

rulei := Tweets.keywords.contains("health")
[0104] Accordingly, following the formalism outlined above, rulei is an operation that operates over the database of persistent Tweets that have been stored in accordance with a feature schema which includes the extracted feature keywords to select for the sub-set of Tweets that fall into the category of those Tweets that include the keyword "health".
[0105] The second curation rule, rulej 530, is defined to select for Tweets where the feature schema includes an assignment of the feature sentiment, selecting for the sub-set of Tweets that fall into the category where the sentiment feature has been defined to be "negative" as determined in the original processing of the Tweet in accordance with the feature schema.
[0106] Accordingly, rulej is defined as follows:

rulej := Tweets.sentimentNegative(true)
[0107] A composite rule, rulek, may then be defined as follows:

rulek := rulei AND rulej
       = Tweets.keywords.contains("health") AND Tweets.sentimentNegative(true)
[0108] Composite rule rulek will accordingly curate all those Tweets that fall into the category that contain both the keyword "health" and which further have a negative sentiment. As would be appreciated, this might be an example of an initial curation rule that could be adopted which is operative on a limited selection of features defined by the feature schema and their determined feature values and that would not require any particular domain knowledge on the part of the rule formulator. As can be seen from the example above, an initial curation rule may include a composite rule based on the operation of successive curation rules operating with respect to individual features from the feature schema.
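Continuing the illustrative combinators sketched above, the composite rule rulek of this example could be expressed and applied as follows (illustrative data only):

store: dict = {}
persist(store, "t1", PersistentTweet(text="...", keywords=["health", "policy"], sentiment="negative"))
persist(store, "t2", PersistentTweet(text="...", keywords=["health"], sentiment="positive"))

rulei = keywords_contains("health")           # select "health" related Tweets
rulej = sentiment_negative()                  # select negative-sentiment Tweets
rulek = AND(rulei, rulej)                     # composite rule

print(len(curate(store, rulek)))              # 1 -- only "t1" satisfies both rules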
[0109] As would be appreciated, while the adoption of an initial curation rule may be relatively straightforward and in many instances may be achieved by a rule designer having no specific domain knowledge, the process of enriching the initial curation rule soon becomes extremely complex and increasingly reliant on the rule designer's domain knowledge. Additionally, as well as being time-consuming and tedious, as the complexity of the rule increases there is then more scope for error on the part of the rule formulator.
[0110] At step 1030, the initial curation rule, which may be automatically generated or generated by a rule designer, is automatically enriched to operate on an expanded selection of the determined values of the features defined by the feature schema as compared to the initial curation rule in order to generate the enriched curation rule.
[0111] In one example, processing the initial curation rule is achieved by an enrichment operator that functions to select a feature from the original selection of features that the initial curation rule is based on, decompose this feature into an expanded set of features and then augment the initial curation rule to operate on the expanded set of features and their associated determined feature values for each of the individual electronic items.
[0112] In one embodiment, where the original curation rule relates to determining whether an electronic data item contains a feature that relates to a portion of text or group of words, such as a phrase or similar, an enriched curation rule that instead determines whether an electronic data item contains words corresponding to the syntactic elements of the original phrase will be able to better perform the curation task based on the semantics of those syntactic elements.
[0113] As an example, consider the infoDecompose(...) enrichment operator, defined as:

infoDecompose({data})
[0114] This enrichment operator returns a set of tuples that splits the input text data into part-of-speech elements, such as person, keyword and location, these being features that form part of the original feature schema in accordance with which the original electronic data items have been classified. An example pseudo code embodiment of an enrichment operator applicable to the text of Tweets is shown below:
Algorithm: infoDecompose(Tweet.Text)
Input: Tweet.Text
Output: List of extracted named entities
# splits text into chunks of words (strings)
1: tokenizedText = tokenizer(Tweet.Text)
# adds grammatical tags to tokens (eg, nouns, verbs)
2: posTaggedText = posTagger(tokenizedText)
# finds entities from the list of tagged tokens (eg, person, location)
3: entitiesList = entityDetector(posTaggedText)
4: return entitiesList
[0115] In one example, the enrichment operator infoDecompose(...) is based on the Linguistic Inquiry and Word Count (LIWC) lexicon which at the time of writing consists of 62 syntactic categories (eg, present tense, verbs, pronouns, adjectives). In another embodiment, infoDecompose(...) is based on a conditional random field (CRF) classifier. In this manner, a phrase which forms the basis for an initial curation rule, eg, does the electronic data item contain the phrase, will be decomposed into nouns, verbs, adjectives, etc, which may then be further processed to determine whether a noun, as an example, is a person or location or some other keyword.
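By way of a runnable illustration, the following minimal Python sketch implements an infoDecompose-style operator using spaCy's tokenizer, part-of-speech tagger and named-entity recogniser as stand-ins for the LIWC- or CRF-based components described above; the en_core_web_sm model and the exact outputs shown are assumptions that depend on the installed model.

# A sketch of an infoDecompose-style enrichment operator. spaCy's pipeline is
# used here as a stand-in for the tokenizer/posTagger/entityDetector components
# of the pseudo code above; it is not the specification's implementation.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def info_decompose(text: str) -> dict:
    doc = nlp(text)                                   # tokenizes and POS-tags the text
    entities = [(ent.text, ent.label_) for ent in doc.ents]    # eg, PERSON, GPE
    keywords = [t.lemma_ for t in doc
                if t.pos_ in ("NOUN", "VERB") and not t.ent_type_]
    return {"entities": entities, "keywords": keywords}

print(info_decompose("John Kennedy was assassinated in Dallas"))
# model-dependent output, eg:
# {'entities': [('John Kennedy', 'PERSON'), ('Dallas', 'GPE')], 'keywords': ['assassinate']}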
[0116] In this manner, an initial curation rule based on determining whether an electronic data item contains a phrase, as an example, may be augmented to generate an enriched curation rule comprising component curation rules that individually operate on one or more of the expanded set of features associated with the constituent semantic elements obtained from the phrase, eg, person, location, etc. In this manner, the enriched curation rule by operating on an expanded set of features will in turn operate on an expanded selection of the determined values of the features defined by the feature schema which in
this example relate to semantic elements of the text as compared to the initial curation rule which only relates to matching phrases.
[0117] This process may be demonstrated by an illustrative example as set out below. Consider an initial curation rule designed to curate Tweets relating to the assassination of President John F. Kennedy from a Twitter stream that has been processed to determine and store the component features of individual Tweets as has been described above. This initial curation rule is based on the feature Tweet.Text.phrase as referred to above and is defined as:

rule = Tweets.Text.phrase.contains("John Kennedy was assassinated in Dallas")
[0118] As such, rule will select for those Tweets that are in the category that contain the specific phrase "John Kennedy was assassinated in Dallas". As would be appreciated, this curation rule is quite narrow and likely to fail to curate relevant Tweets that are related to the assassination of President John F. Kennedy.
[0119] Applying the infoDecompose({data}) enrichment operator as defined above, the initial curation rule is then decomposed as follows to generate:

rule = Tweets.Text.Named-Entities.Person.contains("John Kennedy") AND
       Tweets.Text.keywords.contains("assassinated") AND
       Tweets.Text.Named-Entities.Location.contains("Dallas")
[0120] In this manner, the original curation rule, based on determining whether the feature defined as "phrase" in the database of Tweets contains the text "John Kennedy was assassinated in Dallas", has now been processed to generate an enriched curation rule that comprises component curation rules that operate on the values of the expanded set of semantically relevant features, including:
• "person" for words that have been classified as the feature "person" in the original feature schema applied to the database of Tweets;
• "location" for words that have been classified as the feature "location" in the original feature schema; and
• "keywords" for words that have been classified as a "keyword" in the original feature schema, ie, non-extraneous words.
[0121] As would be appreciated, the enriched curation rule may then be applied to the set of Tweets to more sensitively curate for those Tweets that relate to the assassination of President John F. Kennedy.
[0122] In another example, processing the initial curation rule is achieved by an enrichment operator that functions to select a feature from the selection of features associated with the initial curation rule, generate one or more alternative values that the selected feature may adopt that are consistent with the original rule and then modify the initial curation rule to also operate on the one or more alternative values.
[0123] As an example, consider the entityEquivalent(...) enrichment operator defined as follows:

entityEquivalent('entity-mention')
[0124] This enrichment operator returns a list of entity-mentions that are semantically equivalent to the input feature entity-mention. Throughout this specification, use of the term "entity" refers to an object that represents a specific concept within a given domain or context. An example is the "entity" George Walker Bush, which at a general broad level is representative of the concept of a Person but more specifically relates to a specific Person, being the 43rd President of the United States. An "entity-mention" or feature value related to this concept would be George Bush. Other equivalent "entity-mentions" would be variations of this feature value which still represent the same concept in the given domain or context. As an example, for the entity-mention George Bush a non-exhaustive list of equivalent entity-mentions would include GWB, George Bush 2 and Dubya, all being equivalent representations of the "entity" George Walker Bush.
[0125] Another example is the "entity" Tomorrow, which at a general broad level is representative of the concept of a Date but more particularly relates to a specific date. An example "entity-mention" might be the feature value tomorrow, with equivalent "entity-mentions" being all the different ways this specific date could be expressed as feature values.
[0126] In one example, alternative values or "entity-mentions" for a feature value may be generated by an enrichment operator using information derived from a Knowledge Graph (eg, WordNet, ConceptNet, Google KG, as well as others which will be mentioned below) based on the original feature or feature value.
[0127] A Knowledge Graph (KG) in this context is a database structure consisting of a network of classified entities and their relationships configured to allow semantic queries of the knowledge graph. In one example, the database structure is based on organising the classified entities and their relations in a graph.
[0128] In one example, where the feature of a curation rule relates to an "entity-mention" or feature value such as a name of a person, location, organisation, quantity, etc, or a "topic" where text is classified as relating to a particular topic, then one or more knowledge graphs may be employed to generate an enriched curation rule, as illustrated by the query sketch following the list below.
[0129] Some example knowledge graphs that may be used include, but are not limited to:
• WordNet - WordNet is a lexical database for the English language that groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet may be viewed generally as a combination of dictionary and thesaurus.
• ConceptNet - ConceptNet is a semantic network based on the information in the Open Mind Common Sense (OMCS) database where ConceptNet is expressed as a directed graph whose nodes are concepts and whose edges are assertions of common sense about these concepts.
• Wikipedia - Wikipedia is an online encyclopaedia.
• WikiData - Wikidata is a structured database consisting of the source data for Wikipedia.
• Google™ Knowledge Graph - Google Knowledge Graph is a knowledge base based on a graph database to provide structured and detailed information about a topic in addition to a list of links to other sites related to the topic.
• BabelNet - BabelNet is a multilingual lexicalised semantic network and ontology.
• YAGO - YAGO is a large semantic knowledge base derived from Wikipedia, WordNet,
WikiData, GeoNames, and other data sources.
• Urban Dictionary - The Urban Dictionary is an online dictionary of slang words and phrases.
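As the query sketch referred to above, the following hedged Python example retrieves related terms for a feature value from ConceptNet's public REST API; the endpoint and response fields reflect the ConceptNet 5 API at api.conceptnet.io and should be verified against the current documentation.

# A sketch of fetching ConceptNet terms related to a feature value; the URL
# scheme and the 'related'/'@id' response fields are assumptions based on the
# public ConceptNet 5 API.
import requests

def conceptnet_related(term: str, limit: int = 10) -> list:
    slug = term.lower().replace(" ", "_")
    url = f"http://api.conceptnet.io/related/c/en/{slug}"
    resp = requests.get(url, params={"filter": "/c/en", "limit": limit})
    resp.raise_for_status()
    # each edge looks like {"@id": "/c/en/some_term", "weight": ...}
    return [edge["@id"].rsplit("/", 1)[-1] for edge in resp.json().get("related", [])]

print(conceptnet_related("assassination"))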
[0130] In another example, where a feature of a curation rule relates to a determined emotion, then knowledge graphs that assist in sentiment detection of a word or text may be employed to generate an enriched curation rule; a lookup sketch follows the list below.
[0131] Some example knowledge graphs that may be used include, but are not limited to:
• Emolex (NRC Word-Emotion Association Lexicon (aka EmoLex)) - Emolex is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
• ANEW - ANEW is a sentiment lexicon that provides a database of normative emotional ratings for a large number of words in the English language.
• SentiWordNet - SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity.
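As the lookup sketch referred to above, sentiment scores for a candidate feature value could, for example, be obtained through NLTK's SentiWordNet interface; this is a minimal sketch assuming the relevant NLTK corpora have been downloaded.

# A sketch of a SentiWordNet lookup that an enrichment operator could use to
# screen alternative feature values by sentiment; first sense only, for brevity.
# Assumes: pip install nltk, then nltk.download('sentiwordnet'); nltk.download('wordnet')
from nltk.corpus import sentiwordnet as swn

def sentiment_scores(word: str):
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return None
    s = synsets[0]
    return {"pos": s.pos_score(), "neg": s.neg_score(), "obj": s.obj_score()}

print(sentiment_scores("terrible"))     # expect a comparatively high 'neg' score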
[0132] In another example, where a feature of a curation rule relates to a place name or location, knowledge graphs based on geographical information such as GeoPedia may be employed to generate an enriched curation rule. GeoPedia is a geographical database that may be queried to provide Wikipedia articles around any location.
[0133] Considering again the entityEquivalent(...) enrichment operator, which functions to return a list of entity-mentions that are semantically equivalent to the input entity-mention, in one example a semantic network such as ConceptNet may be employed to generate one or more semantic equivalents in the form of a:
• “synonym” (ie, an alternative entity-mention with the same meaning),
• “related term” (ie, an entity-mention that is commonly related to the input entity-mention),
• “derived term” (ie, an entity-mention that is formed from the input entity-mention), or
• “word form” (ie, an alternative spelling or presentation of the input entity-mention).
[0134] The pseudo code description of an example entityEquivalent (...) enrichment operator in accordance with the above principles is set out below:
Algorithm: entityEquivalent(entitiesList = infoDecompose(Tweet.Text))
Input: List of Entities
Output: List of entity-mentions for each Entity
1: mentionList = []
2: for entity in entitiesList do
# mentions from ConceptNet
3:   cpnMentions = conceptNet.getRelated(entity.name)
# mentions from WordNet considering entity type (eg, person, location)
4:   wdnMentions = wordNet.getSynonym(entity.name, entity.type)
# synonym sets from BabelNet
5:   synsetList = babelNet.getSynset(entity.name)
6:   if synsetList not empty then
# find the synonym set closest to the entity type
7:     candidateSynset = findCandidate(synsetList, entity.type)
# mentions from BabelNet for the selected synonym set (candidateSynset)
8:     blnMentions = babelNet.getSenses(candidateSynset.id)
# add all the mentions of an entity into the list
9:   mentionList.append(entity.name, entity.type, cpnMentions, wdnMentions, blnMentions)
# remove duplicates
10: cleanedMentionList = duplicationRemover(mentionList)
11: return cleanedMentionList
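For illustration only, a minimal runnable Python sketch of an entityEquivalent-style lookup is set out below. It assumes NLTK's WordNet corpus and ConceptNet's public REST /related endpoint; the BabelNet step of the pseudo code is omitted here as it requires an API key, and the helper names are illustrative rather than part of the disclosed system:

# Minimal sketch of entityEquivalent using NLTK WordNet plus the public
# ConceptNet REST API; requires nltk.download('wordnet').
import requests
from nltk.corpus import wordnet as wn

def conceptnet_related(term, limit=10):
    # Query ConceptNet's /related endpoint for English terms.
    url = "http://api.conceptnet.io/related/c/en/" + term.lower().replace(" ", "_")
    data = requests.get(url, params={"filter": "/c/en"}).json()
    return [edge["@id"].split("/")[-1].replace("_", " ")
            for edge in data.get("related", [])[:limit]]

def wordnet_synonyms(term):
    # Collect lemma names across every WordNet synset of the term.
    return {lemma.name().replace("_", " ")
            for synset in wn.synsets(term.replace(" ", "_"))
            for lemma in synset.lemmas()}

def entity_equivalent(entities):
    # entities: list of (name, type) pairs, eg the output of infoDecompose.
    mention_list = []
    for name, etype in entities:
        mentions = set(conceptnet_related(name)) | wordnet_synonyms(name)
        mentions.discard(name)  # crude duplication removal
        mention_list.append({"name": name, "type": etype,
                             "mentions": sorted(mentions)})
    return mention_list

print(entity_equivalent([("Dallas", "Location"), ("John Kennedy", "Person")]))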
[0135] This process may be demonstrated by an illustrative example as set out below. Consider the initial curation rule directed to curating those Tweets relating to the assassination of President John F Kennedy from a Twitter stream that has been processed to determine and store the values of features of individual Tweets in accordance with the Twitter feature schema as has been described above. This initial curation rule is based on the features Tweets.Text.Named-Entities.Person and Tweets.Text.Named-Entities.Location as referred to above and is defined as:

rule1 = Tweets.Text.Named-Entities.Person.contains("John Kennedy")
        AND Tweets.Text.Named-Entities.Location.contains("Dallas")
[0136] Applying the entityEquivalent() enrichment operator as defined above, semantically equivalent alternative values corresponding to the features selected from the initial curation rule are adopted and the initial curation rule is processed to generate an enriched curation rule that operates on these alternative values as follows:

rule1 = Tweets.Text.Named-Entities.Person.contains("John Kennedy"
                                                   OR "John F. Kennedy"
                                                   OR "JFK")
        AND Tweets.Text.Named-Entities.Location.contains("Dallas"
                                                          OR "TX")
[0137] In this example, the alternative values for the Tweet.Text.Named-Entities.Person feature value, which was originally “John Kennedy”, now include the synonym “John F. Kennedy” and the alternative word form “JFK”, both of which are included as alternatives in the curation rule. Similarly, the alternative value generated for the Tweet.Text.Named-Entities.Location feature value, which was originally “Dallas”, now includes the related term “TX”.
[0138] In this manner, the original curation rule, based on a sought-after relationship between a given feature and an associated feature value, is enriched so that the given feature can potentially match a greater number of semantically equivalent feature values in an enriched curation rule consistent with the initial curation rule.
[0139] In another embodiment, a machine learning model is employed to generate an extended set of semantically related feature values that may be used to generate an enriched curation rule, over and above those provided by the use of knowledge graphs as referred to above. In this embodiment, feature values derived from the original knowledge graphs may be used as seeds for the deep learning model to generate additional semantically related feature values.
[0140] In one example, the machine learning model is a deep learning model that employs other information sources that are not part of knowledge graphs. In the example of textual processing, the other information sources may include, but are not limited to, simple text corpuses such as data dumps of Twitter tweets, Wikipedia pages, articles, news pages, etc.
[0141] Referring now to Figure 8, there is shown a system flowchart of a method 3000 for generating an expanded set of feature values according to an illustrative embodiment.
[0142] At step 3010, one or more text corpuses are adopted as relevant to the curation task. The choice of corpus is based on the specifics of the curation task. As such, in most cases general-purpose corpuses may be adopted initially, such as the corpus of Tweets or Wikipedia as referred to above, which can be further supplemented by domain-specific corpuses (eg, Stackoverflow and/or Github, which are online knowledge repositories related to computer programming, or AustLII, WorldLII or NZLII, which are online repositories related to case law and legislation) depending on the domain or context to which the curation rule belongs.
[0143] At step 3020, the text from the text corpus is decomposed into individual tokens. In one example embodiment, this process includes the following steps (a minimal sketch of this step follows the list):
• initial removal of stop words, URLs and hashtags;
• breaking of sentences into lines;
• conversion of lines to lowercase; and
• parsing of lines into individual tokens.
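A minimal Python sketch of this decomposition step, assuming NLTK's English stop-word list (nltk.download('stopwords')) and using illustrative, simplified regular expressions, is:

# Sketch of step 3020: decompose corpus text into token lines.
import re
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def decompose(text):
    # Initial removal of URLs and hashtags.
    text = re.sub(r"https?://\S+|#\w+", " ", text)
    # Break the text into sentence-like lines.
    lines = re.split(r"[.!?\n]+", text)
    token_lines = []
    for line in lines:
        # Convert to lowercase and parse into word tokens, dropping stop words.
        words = re.findall(r"[a-z']+", line.lower())
        kept = [w for w in words if w not in STOP_WORDS]
        if kept:
            token_lines.append(kept)
    return token_lines

print(decompose("Eating well matters! Read more at https://example.org #health"))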
[0144] At step 3030, the tokens are processed to generate a vector space model where each token is mapped to a vector in the vector space where words or tokens that are semantically similar are defined to be“close” to each other in the vector space as defined by a similarity measure. In one embodiment, this process is achieved through word embedding. In one example, a neural network embed generator (NNEG) is employed to carry out the word embedding and create a vector space model.
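For illustration, the following sketch uses gensim's Word2Vec as one possible neural network embed generator; the tiny token corpus below is purely illustrative and far too small to yield meaningful neighbours, and real embeddings would be trained on a much larger corpus:

# Sketch of step 3030: map tokens into a vector space via word embedding.
from gensim.models import Word2Vec

token_lines = [
    ["health", "diet", "gym", "cardio"],
    ["wellbeing", "fitness", "diet"],
    ["gym", "cardio", "fitness"],
]
model = Word2Vec(token_lines, vector_size=50, window=3, min_count=1, seed=1)

# Tokens defined to be "close" in the vector space under cosine similarity:
print(model.wv.most_similar("health", topn=3))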
[0145] Different vector space models may be created depending on the different types of tokens or words that correspond to the features adopted in the initial curation rule. In order to obtain features related to an input feature (eg, names related to George Bush), different vector space models may be created for different named entity types (refer to the output of the infoDecompose algorithm) as set out in the following non-exhaustive list of examples:
• where the word or token has no specific type other than being a keyword (eg, table, car or phone), a general vector space model of the words may be created (eg, “word2vec”);
• where the word or token may be classified as a Person entity type (eg, George W Bush, Barack Obama), a people based vector space model may be created (eg, “people2vec”);
• where the word or token may be classified as a Topic entity type (eg, health, safety), a topic based vector space model may be created (eg, “topic2vec”);
• where the word or token may be classified as a Location entity type (eg, Sydney, Canberra, NY), a location based vector space model may be created (eg, “loc2vec”); or
• where the word or token may be classified as an Organization entity type (eg, Shell, Samsung, Commonwealth), an organisation based vector space model may be created (eg, “org2vec”).
[0146] At step 3040, an input feature value is selected. As referred to above, the input feature value may be determined by a first stage knowledge graph enrichment based on an initial feature value originating from the initial curation rule. Alternatively, the input feature value may be an initial feature value from the initial curation rule.
[0147] At step 3050, the expanded set of feature values is determined in accordance with the vector space model and a similarity measure. In one example, the similarity measure is a cosine similarity measure, invoked as follows:

cosine(data, [threshold])
[0148] where this function returns a list of similar data entities that match the inputted list of entities. The input data may be a set of string keywords or a phrase, and threshold is a real number between 0 and 1 that defines the cut-off as to which similar entities to accept, and which may be tuned in accordance with processing requirements. In this example, the input data would consist of the input feature value and the cosine function will return one or more alternative feature values that are similar to the input feature value.
[0149] In another example, the similarity measure is defined as follows:

jaccard(data, [threshold])
[0150] where again the input data may be a set of string keywords or a phrase and threshold is a real number between 0 and 1 that defines the cut-off as to which similar entities to accept. As would be appreciated, other similarity measures may be adopted including, but not limited to, term frequency-inverse document frequency (“tf-idf”).
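By way of a non-limiting sketch, the cosine(...) and jaccard(...) filters described above might be realised as follows, where model is the Word2Vec model from the earlier sketch and the character-level Jaccard measure is one simple interpretation suited to misspelled hashtags (both interpretations are assumptions, as the disclosure does not fix the implementations):

# Sketch of cosine(data, [threshold]) over a word-embedding vector space.
def cosine(model, data, threshold=0.5, topn=20):
    # Collect vocabulary entries whose cosine similarity to any input
    # keyword meets the threshold (keywords must be in the vocabulary).
    results = set()
    for keyword in data:
        for word, score in model.wv.most_similar(keyword, topn=topn):
            if score >= threshold:
                results.add(word)
    return sorted(results - set(data))

# Sketch of jaccard(data, [threshold]), here at the character level.
def jaccard(data, candidates, threshold=0.5):
    def sim(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb)
    return [c for c in candidates
            if any(sim(k, c) >= threshold for k in data)]

print(jaccard(["#thighgap"], ["#thyhgapp", "#thygap", "#fitness"], 0.6))
# ['#thyhgapp', '#thygap']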
[0151] As would be appreciated, use of a deep learning model in accordance with the above principles allows alternative textual sources to be used apart from knowledge graphs.
[0152] In one example, an initial curation rule based on determining those documents that contain a particular hashtag is enhanced by a similarity measure applied over a text corpus that has been processed to generate an associated vector space model. As would be appreciated, hashtags are typically not standard language words and, further, will evolve over time in accordance with underlying trends, requiring modification of an original rule.
[0153] This example curation task relates to a community advocating drastic weight-loss measures for young women. Material such as social media posts and the like were initially circulated from this community using the hashtag #thighgap. Over time, health advocates, in an attempt to counteract these drastic and negative messages to young women relating to weight loss, would post material that promoted healthy weight choices also using the hashtag #thighgap. Staunch thigh-gap supporters were displeased with this and quickly evolved their hashtag into misspelled versions, such as #thyhgapp (or similar).
[0154] Consider an initial curation rule designed to curate Tweets relating to drastic weight loss based on the hashtag #thighgap:

rule = Tweets.HashTag.contains("#thighgap")
[0155] Based on the above scenario where the hashtag evolves, the effectiveness of this rule to curate social media posts from the drastic weight loss community would diminish over time.
[0156] In this example, the original curation rule is modified to include alternative spellings by adopting a similarity measure function, such as the cosine() or jaccard() functions described above, applied to a vector space model generated from a text corpus such as a dataset of Tweets, to first generate alternative feature values and then to define the enriched curation rule:

rule = Tweets.HashTag.contains("#thighgap"
                               OR "#thyhgapp"
                               OR "#thygap")
[0157] In another example, an initial curation rule based on determining those documents that contain a particular named entity, such as a person, is enhanced by a similarity measure applied over a text corpus that has been processed to generate an associated vector space model. As would be appreciated, a particular person will be referred to by a continually evolving range of nicknames and slang references over time. In one example, consider the initial curation rule designed to determine those Tweets that relate to President George Bush:

rule = Tweets.Text.Named-Entities.Person.contains("George Bush")
[0158] In this example, the original curation rule is modified to include alternative references by adopting a similarity metric function, such as the cosine() or jaccard() functions referred to above, to first generate alternative feature values and then to define the enriched curation rule:

rule = Tweets.Text.Named-Entities.Person.contains("George Bush"
                                                  OR "George dubya bush"
                                                  OR "g dub"
                                                  OR "George bush 2"
                                                  OR "g.w. dub")
[0159] As would be appreciated, the above described methods involve the use of deep learning techniques on additional textual sources to create vector space models that may be used to generate alternative feature values based on a similarity measure applied to the vector space model, providing an additional range of alternative feature values that would not necessarily be apparent from the use of knowledge graphs. Furthermore, the use of knowledge graphs in isolation considers feature values on an individual basis. As such, there may be many feature values that are related to an input feature but whose relationship can only be discerned from a larger corpus or corpuses of information which allow these relationships to be inferred. By using deep learning models that are trained on appropriate corpuses of information, relationships between feature values may be determined by forming an embedded vector space model, which then allows similarity measures to be applied to the derived vector space model to return these related feature values.
[0160] In one example, based on an initial feature value such as “George Bush”, use of a similarity measure on a vector space model produced by a deep learning approach will generate related feature values such as “Barack Obama”, “John Kennedy” and “Bill Clinton”. Note that these related features are not textual variations of the initial feature value but are related because they are also US Presidents, a relationship that would not generally be inferred by adopting a knowledge graph process alone.
[0161] In another illustrative embodiment, pattern-matching functions may be adopted to provide alternative feature values by determining, from additional textual sources, common feature values that co-occur with those specified. These alternatives to the initial feature value may then be used to generate the enriched curation rule.
[0162] Referring now to Figure 9, there is shown a system flowchart of a method 4000 for generating an expanded set of feature values according to an illustrative embodiment.
[0163] At step 4010, one or more text corpuses are adopted as relevant to the curation task. Similar to the approach above in relation to generating a deep learning based vector space model, in most cases general-purpose corpuses may be adopted initially, such as the corpus of Tweets or Wikipedia as referred to above, which can be further supplemented by domain-specific corpuses (eg, Stackoverflow and/or Github, which are online knowledge repositories related to computer programming, or AustLII, WorldLII or NZLII, which are online repositories related to case law and legislation) depending on the domain or context to which the curation rule belongs.
[0164] At step 4020, an initial feature value is selected from the initial curation rule.
[0165] At step 4030, a co-occurrence table is generated based on the initial feature value.
[0166] At step 4040, an expanded set of feature values is determined based on the initial feature value. In one example, a pattern matching function cooccur is used that returns a list of co-occurring feature values found in the input dataset, which in this illustrative embodiment is the one or more text corpuses adopted at step 4010.
[0167] The input initial_feature_value may be a set of words or a phrase, and threshold is defined as a real number between 0 and 1 that defines the cut-off to the degree of matching. When pattern matching is performed, the function returns a vector of feature values alongside a co-occurrence frequency expressed as a relative percentage. This may be set out formally as:

cooccur(dataset, {initial_feature_value}, [threshold])
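One possible realisation of the cooccur(...) function is sketched below, treating the co-occurrence frequency of a candidate token as the fraction of seed-bearing documents in which it appears; this per-document definition is an assumption, as the disclosure does not fix the frequency measure:

# Sketch of cooccur(dataset, {initial_feature_value}, [threshold]).
from collections import Counter

def cooccur(dataset, seeds, threshold=0.8):
    # dataset: iterable of token lists; seeds: set of initial feature values.
    seed_docs = 0
    counts = Counter()
    for tokens in dataset:
        if seeds & set(tokens):
            seed_docs += 1
            counts.update(set(tokens) - seeds)
    # Keep tokens whose co-occurrence rate across seed-bearing documents
    # meets the threshold, returned with their relative frequency.
    return {word: n / seed_docs for word, n in counts.items()
            if seed_docs and n / seed_docs >= threshold}

tweets = [["health", "diet", "gym"], ["fitness", "gym", "diet"],
          ["wellbeing", "diet"], ["weather", "rain"]]
print(cooccur(tweets, {"health", "wellbeing", "fitness"}, 0.8))
# {'diet': 1.0} — 'diet' appears in all three seed-bearing tweets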
[0168] The expanded set of feature values may then be used in the enriched curation rule as has been previously described.
[0169] In one illustrative example, the curation task is to curate those Tweets that relate to health.
An initial curation rule appropriate to this task may be:

rule = Tweets.keywords.contains("health"
                                OR "wellbeing"
                                OR "fitness")
[0170] As would be appreciated, the above initial curation rule is unlikely to identify a proportion of health related Tweets, on the basis that many of these Tweets would not necessarily contain the initial feature values “health”, “wellbeing” or “fitness”.
[0171] In accordance with method 4000, a pattern matching function is adopted to provide additional keywords. Referring now to Figure 10, there is shown a co-occurrence frequency table in accordance with an illustrative embodiment. From Figure 10, it can be seen that the three specified keywords (‘health’, ‘wellbeing’, ‘fitness’) frequently co-occur with other keywords (‘diet’, ‘gym’, ‘cardio’) based on an 80% frequency threshold.
[0172] An enriched curation rule may then be generated by adopting the high-frequency co-occurring words identified above as alternative feature values. Following the application of the pattern matching function (eg, using the cooccur() function referred to above), the modified rule is defined as:

rule1 = Tweets.keywords.contains("health"
                                 OR "wellbeing"
                                 OR "fitness"
                                 OR "diet"
                                 OR "gym"
                                 OR "cardio")
[0173] As would be appreciated, these seemingly unrelated keywords would not necessarily have been identified by the semantic-based similarity functions described above.
[0174] The above described embodiments provide for a system where an initial curation rule may be automatically enriched to operate on an expanded selection of feature values relevant to the curation task. Exemplary techniques for enriching curation rules range from leveraging knowledge that has already been curated by domain experts (eg, WordNet, BabelNet, ConceptNet) to leveraging knowledge that has been curated automatically (eg, Google Knowledge Graph, YAGO and other general knowledge graphs as set out above).
[0175] In other exemplary embodiments, machine learning approaches such as word embedding and the like may be employed to produce an extended set of semantically related feature values for generating an enriched curation rule, over and above those provided by the use of knowledge graphs. In these examples, corpus data may be leveraged (eg, Wikipedia articles, court and legislative databases), along with values derived from the original knowledge graphs, as input for a deep learning model to generate additional semantically related feature values relevant to the curation task.
[0176] As would be appreciated, the automated enrichment of curation rules provides the advantage of a data driven approach to improving curation rules that can rapidly evolve as required and which does not contain the inherent biases that may occur when domain specialists are involved.
[0177] Throughout the specification and the claims that follow, unless the context requires otherwise, the words “comprise” and “include” and variations such as “comprising” and “including” will be understood to imply the inclusion of a stated integer or group of integers, but not the exclusion of any other integer or group of integers.
[0178] The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement of any form of suggestion that such prior art forms part of the common general knowledge.
[0179] It will be appreciated by those skilled in the art that the invention is not restricted in its use to the particular application described. Neither is the present invention restricted in its preferred embodiment with regard to the particular elements and/or features described or depicted herein. It will be appreciated that the invention is not limited to the embodiment or embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope of the invention as set forth and defined by the following claims.