WO2019000051A1

WO2019000051A1 - Data analysis method and learning system

Info

Publication number: WO2019000051A1
Application number: PCT/AU2018/050677
Authority: WO
Inventors: Lee-Martin SEYMOUR; Timothy GRIFFITHS
Original assignee: Xref (Au) Pty Ltd
Priority date: 2017-06-30
Filing date: 2018-06-29
Publication date: 2019-01-03

Abstract

Disclosed herein are a data analysis method and learning system. The method builds a corpus, based on an initial set of answers associated with an initial set of job references, and trains a classification model, based on the corpus. The method then calculates a sentiment score for each answer in the corpus, and serialises the trained model. The method receives a new reference associated with a job candidate, and re-trains the classification model based on the new reference. The re-trained model then generates a sentiment score for the new reference and presents new sentiment analysis, based on the sentiment scores.

Description

DATA ANALYSIS METHOD AND LEARNING SYSTEM

Related Application

[0001] This application is related to Australian Provisional Patent Application No. 2017902552 titled "Data analysis method and learning system" and filed 30 June 2017 in the name of XREF (AU) Pty Ltd, the entire content of which is incorporated by reference as if fully set forth herein.

Technical Field

[0002] The present disclosure relates to a data analysis method and learning system. In particular, the present disclosure relates to a computer-implemented data analysis method and learning system for implementing a sentiment analysis engine.

Background

[0003] Sentiment analysis is a process that seeks to determine the attitude of a person with respect to a subject. The subject may be, for example, but is not limited to, a document, an interaction, an event, or a person. Such sentiment analysis may be used, for example, in the context of capturing the tone of voice and intent of a person.

[0004] Sentiment analysis may use one or more techniques to process data. Such techniques may include, for example, natural language processing, textual parsing, computational linguistics, biometric analysis, or any combination thereof in order to quantify and classify affective states and subjective information.

[0005] Traditionally, employment references are provided over the phone. The reference feedback collector relies in part on the 'tone of voice' when listening to the feedback. When completing a reference call, the collector makes an assumption on how positive the reference provider was. This can be incorrect, especially if the reference provider is in fact fraudulent. Such acquisition of employment references is a manually intensive process that is prone to inconsistency across different collectors and is susceptible to fraud or other manipulation.

[0006] Thus, a need exists to provide an improved data analysis method and learning system suitable for use in sentiment analysis.

Summary

[0007] The present disclosure relates to a data analysis method and learning system.

[0008] A first aspect of the present disclosure provides a data analysis method and learning system comprising the steps of:

building a corpus, based on an initial set of answers associated with an initial set of job references;

training a classification model, based on said corpus;

calculating a sentiment score for each answer in said corpus;

serialising said trained model;

receiving a new reference associated with a job candidate;

re-training said classification model based on said new reference;

generating, by said re-trained model, a sentiment score for said new reference; and

presenting new sentiment analysis, based on said sentiment scores.

[0009] A second aspect of the present disclosure provides a candidate referencing system comprising:

a communications network connection for coupling to a communications network; and

a cloud execution model that includes:

a sentiment analysis service for providing an interface to the cloud execution model, via said communications network connection,

a sentiment analysis engine, and

a data persistence environment for storing data associated with sentiment analysis;

wherein said cloud execution model:

receives references received via the candidate referencing system;

wherein said sentiment analysis engine:

pre-processes the received references;

classifies the pre-processed references to generate a classification model;

determines a prediction of an individual sentiment score for each answer in said received references, based on said classification model;

wherein said data persistence environment:

persists the sentiment score; and

wherein said sentiment analysis engine:

re-trains the classification model based on new references;

determines an overall sentiment for a received reference, based on an aggregate of the individual sentiment scores of the answers associated with that reference.

[0010] According to another aspect, the present disclosure provides an apparatus for implementing any one of the aforementioned methods.

[0011] According to another aspect, the present disclosure provides a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above.

[0012] Other aspects of the present disclosure are also provided.

Brief Description of the Drawings

[0013] One or more embodiments of the present disclosure will now be described by way of specific example(s) with reference to the accompanying drawings, in which:

[0014] Fig. 1 is a flow diagram illustrating a method of sentiment analysis;

[0015] Fig. 2 is a schematic representation of a system on which one or more embodiments of the present disclosure may be practised;

[0016] Fig. 3 is a schematic block diagram representation of a system that includes a general purpose computer on which one or more embodiments of the present disclosure may be practised;

[0017] Fig. 4 is a schematic block diagram representation of a system that includes a general smartphone on which one or more embodiments of the present disclosure may be practised;

[0018] Fig. 5 is a flow diagram illustrating a method for performing the pre-processing step of Fig. 1;

[0019] Fig. 6 is a flow diagram illustrating a method for performing the training and learning process step of Fig. 1;

[0020] Fig. 7 is a flow diagram illustrating a method for performing the prediction step of Fig. 1; and

[0021] Fig. 8 is a graphical representation of a sentiment score.

Detailed Description

[0022] The present disclosure provides a data analysis method and learning system. Throughout this specification, the data analysis method and learning system will be described in the context of processing data relating to one or more candidates for a job, whereby the data analysis method and learning system processes data provided by one or more referees associated with the candidate and produces a sentiment calculation for that data.

[0023] It will be appreciated by a person skilled in the art that whilst the examples provided in this specification relate to the human resources industry, the data analysis method and learning system of the present disclosure may be equally practised in other industries, without departing from the scope of the present application, including, for example, reviews, survey responses, customer service, and any other industry that generates feedback or scoring data that can be analysed with machine learning.

[0024] Method steps or features in the accompanying drawings that have the same reference numerals are to be considered to have the same function(s) or operation(s), unless the contrary intention is expressed or implied.

[0025] When processing job applications from prospective candidates, it is common for employers and employment agencies to request references from one or more referees who can vouch for the respective candidates. Such references may be provided in written form, over the telephone, or via online forms or surveys. It is useful to normalise the references provided by different referees, so that the references can be readily compared.

[0026] In one arrangement, the data analysis method and learning system of the present disclosure are applied to a candidate referencing process. In one implementation, the data analysis method and learning system provides a cloud-based platform coupled to a communications network, such as the Internet, to readily collect and process references for prospective candidates.

[0027] Traditionally, a job candidate provides contact details for one or more referees, when applying for a job. The employer or employment agency then contacts the referees to validate the candidate to determine the suitability of that candidate for the job and assist in identifying the best candidate for the available job position. It is common for employers and employment agencies to contact referees by telephone and conduct a telephone interview. Such telephonic interviews are a manually-intensive process driven by the employer.

[0028] The data analysis method and learning system of the present disclosure provides an automated platform for acquiring and analysing data to interpret the sentiment of any given statement and may be applied, as indicated above, to employment reference feedback provided by referees. In particular, the data analysis method and learning system are suitable for use in relation to written feedback provided as a reference from a referee in relation to a prospective job candidate. Such references generally include an opinion of the candidate's performance whilst known to the referee in one of the following capacities: employment, volunteering, academic, unemployment, or socially in the form of a character reference.

[0029] The data analysis method and learning system utilises a sentiment analysis engine implemented using a trained machine learning classification model. The data analysis method and learning system builds a corpus based on an initial set of answers from an initial set of references. In one implementation, the corpus is based on millions of answers obtained from thousands of references. The system then trains a classification model. Once the model has been trained, a sentiment score is calculated for each answer/reference belonging to the corpus. The method then builds the model and serialises the model for storage. In one implementation, the built model is serialised as a Python object.
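By way of illustration only, the serialisation of a built model as a Python object may be sketched as follows. The sketch assumes the standard-library pickle module as the serialisation mechanism, which the present disclosure does not name, and uses a hypothetical stand-in model (per-label word counts) in place of a trained classification model. The function names build_model, serialise_model and deserialise_model are likewise hypothetical.

```python
import pickle
from collections import Counter


def build_model(corpus):
    """Hypothetical stand-in for a trained model: word counts per label."""
    model = {}
    for text, label in corpus:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def serialise_model(model) -> bytes:
    """Serialise the built model to bytes suitable for storage."""
    return pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)


def deserialise_model(blob: bytes):
    """Restore a model; only unpickle data read from a trusted store."""
    return pickle.loads(blob)


# Tiny illustrative corpus of (answer text, label) pairs.
corpus = [("Very reliable", "positive"), ("Often late", "negative")]
blob = serialise_model(build_model(corpus))
restored = deserialise_model(blob)
```

The resulting bytes could then be written to a storage medium and later reloaded before re-training or prediction.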

[0030] Fig. 1 is a flow diagram illustrating a data analysis method and learning system 100 in accordance with the present disclosure. The method 100 begins at a Start step 110 and proceeds to a pre-processing step 120. The pre-processing step 120 processes a set of received data into a predefined format. In the example in which the data relates to information provided by one or more referees in association with a prospective candidate for a job, the pre-processing step 120 processes one or more answers provided by the referees into a predefined format, such that the answers may be presented in a consistent format. Such pre-processing may include, for example, parsing of the answers to remove or clean punctuation, conversion of answers to lower case letters, and the like.
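As a minimal sketch of the kind of pre-processing described for step 120, the following converts an answer to lower case and removes punctuation. The function name preprocess_answer is hypothetical, and the choice to strip only ASCII punctuation and to collapse whitespace is an assumption made for illustration rather than a feature of the disclosed method.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_answer(answer: str) -> str:
    """Lower-case an answer, strip punctuation, and collapse whitespace."""
    lowered = answer.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


cleaned = preprocess_answer("  She was GREAT -- always on time!  ")
# cleaned == "she was great always on time"
```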

[0031] Control passes from the pre-processing step 120 to a classification step 130. The classification step 130 receives the pre-processed data from step 120 and applies machine learning to the pre-processed data to generate a model. The machine learning may include, for example, but is not limited to, a Naive Bayes classifier or a neural network.

[0032] Control passes from the classification step 130 to a prediction step 140, which receives new data, such as an answer from a referee in relation to a candidate, and uses the model generated in the classification step 130 to generate a sentiment rating associated with the new data. Control passes to an End step 150 and the method 100 terminates.
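The classification step 130 and prediction step 140 may be illustrated, under stated assumptions, by a small multinomial Naive Bayes classifier with add-one smoothing. The class name NaiveBayesSentiment, the labels "positive" and "negative", the four training answers, and the use of the positive-class probability as the sentiment rating are all hypothetical choices made for this sketch. The model must be trained on at least one answer before a prediction is requested.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of answers
        self.vocabulary = set()

    def train(self, labelled_answers):
        """Update counts from (pre-processed text, label) pairs."""
        for text, label in labelled_answers:
            words = text.split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def _log_scores(self, text):
        """Return the unnormalised log-probability of each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def positive_probability(self, text):
        """Probability in [0, 1] that the text is positive."""
        scores = self._log_scores(text)
        peak = max(scores.values())  # subtract the peak for numerical stability
        exp = {label: math.exp(s - peak) for label, s in scores.items()}
        return exp.get("positive", 0.0) / sum(exp.values())


model = NaiveBayesSentiment()
model.train([
    ("reliable and hard working", "positive"),
    ("excellent team player", "positive"),
    ("often late and careless", "negative"),
    ("poor communication and unreliable", "negative"),
])
score = model.positive_probability("reliable team player")
```

Because train only accumulates counts, calling it again with further labelled answers updates the existing model, which is one simple way of approximating re-training on new references.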

[0033] Fig. 2 is a schematic representation of a networked computer system 200 on which one or more embodiments of the present disclosure may be practised. The system 200 includes a candidate referencing system 201 coupled to a communications network 205. The communications network 205 may include, for example, one or more wired or wireless connections, including a Local Area Network (LAN), Wide Area Network (WAN), a virtual private network (VPN), cellular telephony network, the Internet, or any combination thereof.

[0034] In this example, the candidate referencing system 201 is used to determine a sentiment calculation for answers provided by referees in references for job candidates. The candidate referencing system 201 includes a cloud execution model 240 that hosts a sentiment analysis service 242, a sentiment analysis engine 246 and a data persistence environment 270, which may be internal or external to the cloud execution model 240. The data persistence environment 270 may be used, for example, to store data for the sentiment analysis, including data relating to previous responses provided by one or more referees and models produced by the training and learning process step 130 of Fig. 1.

[0035] The system 200 also includes a set of customers 220, comprising in the example of Fig. 2 customer 1 280 and customer 2 290. Customer 1 280 and customer 2 290 represent corporate entities that require assistance in vetting potential candidates for vacant job positions. In large corporate entities, it is common to have a large turnover of staff and continually performing manual reference checks on a large number of candidates consumes an excessive amount of time.

[0036] In the example of Fig. 2, customer 1 280 and customer 2 290 subscribe to the candidate referencing system 201, which stores relevant customer details in a storage medium, such as a database (not shown). Such customer details may include, for example, contact details, current listings of job vacancies, and the like. The database may also store account information relating to the respective subscribed customers. In one implementation, the candidate referencing system 201 charges subscribed customers on a per reference check basis, a periodic fee, or a combination thereof. Other accounting practices may equally be implemented without departing from the spirit and scope of the present disclosure.

[0037] The system 200 also includes a first computing device 210 coupled to the communications network 205. The first computing device 210 may be implemented using a smartphone, laptop, desktop computer, server, or general purpose computer.

[0038] In the example of Fig. 2, the user 202 is a job candidate who nominates one or more referees for a job reference check initiated by one of customer 1 280 or customer 2 290 via the candidate referencing system 201 using the communications network 205.

[0039] In one arrangement, the candidate 202 accesses the candidate referencing system 201, which provides the candidate with a unique candidate identifier. The unique candidate identifier may be a composite identifier, based on the candidate and the job position for which the candidate is applying. In one implementation, the candidate identifier is a link, such as a Uniform Resource Locator (URL), that a user can use to access a relevant page of the candidate referencing system 201.
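A composite candidate identifier carried in a link may be sketched as follows. The base address, the query parameter name "cid", the ":" separator, and the function names are all hypothetical and are not taken from the present disclosure. The sketch further assumes that candidate and job identifiers are alphanumeric strings that do not themselves contain ":".

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical base address of the reference page.
BASE_URL = "https://referencing.example/reference"


def build_candidate_link(candidate_id: str, job_id: str) -> str:
    """Combine candidate and job position into one identifier inside a URL."""
    composite = f"{candidate_id}:{job_id}"
    return f"{BASE_URL}?{urlencode({'cid': composite})}"


def parse_candidate_link(link: str):
    """Recover the (candidate_id, job_id) pair from a link built above."""
    query = parse_qs(urlparse(link).query)
    candidate_id, job_id = query["cid"][0].split(":", 1)
    return candidate_id, job_id


link = build_candidate_link("cand42", "job7")
```

A production system would more likely use an opaque, unguessable token that is resolved on the server, so that identifiers are not exposed in the link.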

[0040] In the example of Fig. 2, the candidate 202 interacts with the candidate referencing system 201 using a browser executing on the computing device 210 or via a software application ("app") associated with the candidate referencing system 201 and executing on the computing device 210. In one arrangement, the candidate 202 provides contact details for one or more referees.

[0041] In the example of Fig. 2, the candidate 202 provides contact details in the form of email addresses for a first referee 230 and a second referee 235. The candidate referencing system 201 sends a communication to each of the first referee 230 and the second referee 235, indicating that the candidate 202 has nominated each of them as a referee. The communication may be a letter, email, Short Message Service (SMS) text message, Multimedia Message Service (MMS) text message, or the like, which contains the candidate identifier.

[0042] The first referee 230 uses a second computing device 250. The second computing device 250 may be implemented using a smartphone, laptop, desktop computer, server, or general purpose computer. In one arrangement, the first referee 230 uses a browser executing on the second computing device 250 to communicate with the candidate referencing system 201 via the communications network 205.

[0043] The candidate referencing system 201 receives the entered candidate identifier, retrieves relevant information pertaining to the candidate 202 and the job position in question and presents an interface to the second computing device 250 for display to the first referee 230. The interface may include, for example, but is not limited to, a questionnaire, template, form, survey, or the like. The first referee 230 provides a reference for the candidate 202 in the form of answers to the presented interface. The second computing device 250 transmits the entered answers to the candidate referencing system 201, which triggers the sentiment analysis engine 246.

[0044] A similar process is followed in relation to the second referee 235, who accesses a third computing device 260 to interact with the candidate referencing system 201. The third computing device 260 may be implemented using a smartphone, laptop, desktop computer, server, or general purpose computer.

[0045] As an alternative to an app or the candidate referencing system 201 presenting, to a computing device 250, 260 accessed by the referees 230, 235, an interface that requires a set of answers, the interface may be adapted to receive a prepared written reference from the respective referee 230, 235.

[0046] The sentiment analysis engine 246 parses the answers and/or written references to pre-process the received information, as described with reference to step 120 of Fig. 1. The sentiment analysis engine 246 then classifies the pre-processed data to generate a model, which is used to determine a prediction of the sentiment of the received answers, based on machine learning. Thus, the sentiment analysis engine 246 determines a sentiment calculation for each answer. The sentiment analysis engine 246 then determines an overall sentiment score for the overall reference provided by the respective referee 230, 235. The overall sentiment may be an aggregate of the individual sentiment calculations for the individual answers. Alternatively, individual sentiment calculations may be weighted in order to determine the overall sentiment. In order to avoid skewing the overall sentiment of a given reference, "weights" are applied to minimise the effect (typically extremely positive or extremely negative effect) of some answers.
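One possible way of weighting individual sentiment calculations so that extreme answers do not skew the overall sentiment is sketched below. The present disclosure does not specify a weighting formula; the score range of -1 to 1, the damping parameter, and the weight 1 - damping × |score| are assumptions made purely for illustration.

```python
def overall_sentiment(answer_scores, damping=0.5):
    """Weighted mean of per-answer scores assumed to lie in [-1, 1].

    Each weight shrinks as its score approaches +1 or -1, so extreme
    answers contribute less. damping=0.0 gives the plain average.
    """
    if not answer_scores:
        raise ValueError("at least one answer score is required")
    if not 0.0 <= damping < 1.0:
        raise ValueError("damping must be in [0, 1)")
    weights = [1.0 - damping * abs(s) for s in answer_scores]
    weighted_sum = sum(w * s for w, s in zip(weights, answer_scores))
    return weighted_sum / sum(weights)


scores = [0.2, 0.3, -1.0]           # one extremely negative answer
plain = overall_sentiment(scores, damping=0.0)
damped = overall_sentiment(scores)  # closer to zero than the plain average
```

With these example scores, the single extremely negative answer pulls the plain average further below zero than it pulls the damped result.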

[0047] In one arrangement, the sentiment analysis engine 246 uses neural networks. In alternative arrangements, the sentiment analysis engine 246 uses Naive Bayes machine learning, deep neural nets, or other machine learning functionality. In each case, the determined sentiment calculations and scores are persisted in the data persistence environment 270.

[0048] Each subscribed customer 280, 290 of the candidate referencing system 201 is presented with a personalised employer dashboard, on logging in to the candidate referencing system 201. The candidate referencing system 201 calls the sentiment analysis service 242 to provide dashboard information to the respective employer dashboard of the respective customers 280, 290. In one arrangement, each employer dashboard provides a graphical representation for each reference provided in relation to a job candidate. The graphical representation may be a numerical scale, bar graph, colour scale, or any combination thereof. For example, the graphical representation may be a colour from green to red, with green indicating a positive sentiment, blue indicating a neutral sentiment, and red indicating a negative sentiment.
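The colour representation described for the employer dashboard may be sketched as a simple mapping from a sentiment score to a colour name. The score range of -1 to 1, the width of the neutral band, and the function name sentiment_colour are assumptions for illustration only; the present disclosure does not specify thresholds.

```python
def sentiment_colour(score: float, neutral_band: float = 0.2) -> str:
    """Map a score assumed to lie in [-1, 1] to a dashboard colour.

    Scores within +/- neutral_band of zero are treated as neutral.
    """
    if score > neutral_band:
        return "green"   # positive sentiment
    if score < -neutral_band:
        return "red"     # negative sentiment
    return "blue"        # neutral sentiment


colours = [sentiment_colour(s) for s in (0.8, 0.05, -0.6)]
# colours == ["green", "blue", "red"]
```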

[0049] The data analysis method and learning system of the present disclosure may be practised using a computing device, such as a general purpose computer, a computer server or a cloud execution system. Fig. 3 is a schematic block diagram of a system 300 that includes a general purpose computer 310. The general purpose computer 310 includes a plurality of components, including: a processor 312, a memory 314, a storage medium 316, input/output (I/O) interfaces 320, and input/output (I/O) ports 322. Components of the general purpose computer 310 generally communicate using one or more buses 348.

[0050] The memory 314 may be implemented using Random Access Memory (RAM), Read Only Memory (ROM), or a combination thereof. The storage medium 316 may be implemented as one or more of a hard disk drive, a solid state "flash" drive, an optical disk drive, or other storage means. The storage medium 316 may be utilised to store one or more computer programs, including an operating system, software applications, and data. In one mode of operation, instructions from one or more computer programs stored in the storage medium 316 are loaded into the memory 314 via the bus 348. Instructions loaded into the memory 314 are then made available via the bus 348 or other means for execution by the processor 312 to implement a mode of operation in accordance with the executed instructions.

[0051] One or more peripheral devices may be coupled to the general purpose computer 310 via the I/O ports 322. In the example of Fig. 3, the general purpose computer 310 is coupled to each of a speaker 324, a camera 326, a display device 330, an input device 332, a printer 334, and an external storage medium 336. The speaker 324 may be implemented using one or more speakers, such as in a stereo or surround sound system.

[0052] The display device 330 may be a computer monitor, such as a cathode ray tube screen, plasma screen, or liquid crystal display (LCD) screen. The display 330 may receive information from the computer 310 in a conventional manner, wherein the information is presented on the display device 330 for viewing by a user. The display device 330 may optionally be implemented using a touch screen to enable a user to provide input to the general purpose computer 310. The touch screen may be, for example, a capacitive touch screen, a resistive touchscreen, a surface acoustic wave touchscreen, or the like.

[0053] The input device 332 may be a keyboard, a mouse, a stylus, drawing tablet, or any combination thereof, for receiving input from a user. The external storage medium 336 may include an external hard disk drive (HDD), an optical drive, a floppy disk drive, a flash drive, solid state drive (SSD), or any combination thereof and may be implemented as a single instance or multiple instances of any one or more of those devices. For example, the external storage medium 336 may be implemented as an array of hard disk drives.

[0054] The I/O interfaces 320 facilitate the exchange of information between the general purpose computing device 310 and other computing devices. The I/O interfaces may be implemented using an internal or external modem, an Ethernet connection, or the like, to enable coupling to a transmission medium. In the example of Fig. 3, the I/O interfaces 320 are coupled to a communications network 338 and directly to a computing device 342. The computing device 342 is shown as a personal computer, but may equally be practised using a smartphone, laptop, or a tablet device. Direct communication between the general purpose computer 310 and the computing device 342 may be implemented using a wireless or wired transmission link.

[0055] The communications network 338 may be implemented using one or more wired or wireless transmission links and may include, for example, a dedicated communications link, a local area network (LAN), a wide area network (WAN), the Internet, a telecommunications network, or any combination thereof. A telecommunications network may include, but is not limited to, a telephony network, such as a Public Switched Telephone Network (PSTN), a mobile telephone cellular network, a short message service (SMS) network, or any combination thereof. The general purpose computer 310 is able to communicate via the communications network 338 to other computing devices connected to the communications network 338, such as the mobile telephone handset 344, the touchscreen smartphone 346, the personal computer 340, and the computing device 342.

[0056] One or more instances of the general purpose computer 310 may be utilised to implement a server acting as a cloud execution model 240 hosting a sentiment analysis service 242, a sentiment analysis engine 246 and a data persistence environment 270 in accordance with the present disclosure. In such an embodiment, the memory 314 and storage 316 are utilised to store data relating to previous and current references and machine learning models. Software for implementing the data analysis method and learning system is stored in one or both of the memory 314 and storage 316 for execution on the processor 312. The software includes computer program code for implementing method steps in accordance with the method of data analysis, and in particular sentiment analysis, described herein.

[0057] One or more instances of the general purpose computer 310 may also be utilised to implement one or more of the computing devices 210, 250, 260 of Fig. 2.

[0058] Fig. 4 is a schematic block diagram of a system 400 on which one or more aspects of a data analysis method and learning system of the present disclosure may be practised. The system 400 includes a portable computing device in the form of a smartphone 410, which may be used by a registered user, candidate, or referee in relation to the data analysis method and learning system in Fig. 2. The smartphone 410 includes a plurality of components, including: a processor 412, a memory 414, a storage medium 416, a battery 418, an antenna 420, a radio frequency (RF) transmitter and receiver 422, a subscriber identity module (SIM) card 424, a speaker 426, an input device 428, a camera 430, a display 432, and a wireless transmitter and receiver 434. Components of the smartphone 410 generally communicate using one or more bus connections 448 or other connections therebetween. The smartphone 410 also includes a wired connection 445 for coupling to a power outlet to recharge the battery 418 or for connection to a computing device, such as the general purpose computer 310 of Fig. 3. The wired connection 445 may include one or more connectors and may be adapted to enable uploading and downloading of content from and to the memory 414 and SIM card 424.

[0059] The smartphone 410 may include many other functional components, such as an audio digital-to-analogue and analogue-to-digital converter and an amplifier, but those components are omitted for the purpose of clarity. However, such components would be readily known and understood by a person skilled in the relevant art.

[0060] The memory 414 may include Random Access Memory (RAM), Read Only Memory (ROM), or a combination thereof. The storage medium 416 may be implemented as one or more of a solid state "flash" drive, a removable storage medium, such as a Secure Digital (SD) or microSD card, or other storage means. The storage medium 416 may be utilised to store one or more computer programs, including an operating system, software applications, and data. In one mode of operation, instructions from one or more computer programs stored in the storage medium 416 are loaded into the memory 414 via the bus 448. Instructions loaded into the memory 414 are then made available via the bus 448 or other means for execution by the processor 412 to implement a mode of operation in accordance with the executed instructions.

[0061] The smartphone 410 also includes an application programming interface (API) module 436, which enables programmers to write software applications to execute on the processor 412. Such applications include a plurality of instructions that may be pre-installed in the memory 414 or downloaded to the memory 414 from an external source, via the RF transmitter and receiver 422 operating in association with the antenna 420 or via the wired connection 445.

[0062] The smartphone 410 further includes a Global Positioning System (GPS) location module 438. The GPS location module 438 is used to determine a geographical position of the smartphone 410, based on GPS satellites, cellular telephone tower triangulation, or a combination thereof. The determined geographical position may then be made available to one or more programs or applications running on the processor 412.

[0063] The wireless transmitter and receiver 434 may be utilised to communicate wirelessly with external peripheral devices via Bluetooth, infrared, or other wireless protocol. In the example of Fig. 4, the smartphone 410 is coupled to each of a printer 440, an external storage medium 444, and a computing device 442. The computing device 442 may be implemented, for example, using the general purpose computer 310 of Fig. 3.

[0064] The camera 430 may include one or more still or video digital cameras adapted to capture and record to the memory 414 or the SIM card 424 still images or video images, or a combination thereof. The camera 430 may include a lens system, a sensor unit, and a recording medium. A user of the smartphone 410 may upload the recorded images to another computer device or peripheral device using the wireless transmitter and receiver 434, the RF transmitter and receiver 422, or the wired connection 445.

[0065] In one example, the display device 432 is implemented using a liquid crystal display (LCD) screen. The display 432 is used to display content to a user of the smartphone 410. The display 432 may optionally be implemented using a touch screen, such as a capacitive touch screen or resistive touchscreen, to enable a user to provide input to the smartphone 410.

[0066] The input device 428 may be a keyboard, a stylus, or microphone, for example, for receiving input from a user. In the case in which the input device 428 is a keyboard, the keyboard may be implemented as an arrangement of physical keys located on the smartphone 410. Alternatively, the keyboard may be a virtual keyboard displayed on the display device 432.

[0067] The SIM card 424 is utilised to store an International Mobile Subscriber Identity (IMSI) and a related key used to identify and authenticate the user on a cellular network to which the user has subscribed. The SIM card 424 is generally a removable card that can be used interchangeably on different smartphone or cellular telephone devices. The SIM card 424 can be used to store contacts associated with the user, including names and telephone numbers. The SIM card 424 can also provide storage for pictures and videos. Alternatively, contacts can be stored on the memory 414.

[0068] The RF transmitter and receiver 422, in association with the antenna 420, enable the exchange of information between the smartphone 410 and other computing devices via a communications network 490. In the example of Fig. 4, the RF transmitter and receiver 422 enable the smartphone 410 to communicate via the communications network 490 with a cellular telephone handset 450, a smartphone or tablet device 452, a computing device 454 and the computing device 442. The computing devices 454 and 442 are shown as personal computers, but each may equally be practised using a smartphone, laptop, or a tablet device.

[0069] The communications network 490 may be implemented using one or more wired or wireless transmission links and may include, for example, a cellular telephony network, a dedicated communications link, a local area network (LAN), a wide area network (WAN), the Internet, a telecommunications network, or any combination thereof. A telecommunications network may include, but is not limited to, a telephony network, such as a Public Switched Telephone Network (PSTN), a cellular (mobile) telephone network, a short message service (SMS) network, or any combination thereof.

[0070] In one arrangement, one or more components of the system 200 of Fig. 2 are implemented using a cloud computing environment. A cloud computing environment uses a network of remote servers coupled by a communications network, such as the network 205, to store, manage, and process data. Such a cloud computing environment provides shared computing processing resources and data storage to enable flexible, elastic on-demand resources. Users may use the computing devices 210, 250, 260 to access and interact with services provided in the cloud computing environment.

[0071] In one implementation, the candidate referencing system 201 is implemented using a cloud computing environment, wherein the sentiment analysis service 242, sentiment analysis engine 246, and data persistence environment 270 may be physically co-located or distributed across shared computing resources.

[0072] In order to train a sentiment model, received text relating to a reference has to be pre-processed to make the text 'readable' for the machine learning engine. The pre-processing involves several cleaning and formatting steps. Fig. 5 is a flow diagram illustrating a method for performing the pre-processing step 120 of Fig. 1. The pre-processing step 120 begins at step 510, which receives a set of answers from a referee 230, 235. The answers are presented to step 515, which retrieves data. In one arrangement, the sentiment analysis engine 246 retrieves the data from the data persistence environment 270 or another storage medium. In one implementation, the sentiment analysis engine 246 uses "pandas", which is an open source Berkeley Software Distribution-licensed library providing high-performance, easy-to-use data structures and data analysis tools. "Pandas" is used to retrieve the text of the references.

[0073] Step 520 cleans the text of the references, such as by removing punctuation and symbols. One implementation uses the Natural Language Toolkit (NLTK) and "textblob". NLTK is a suite of libraries and programs for symbolic and statistical natural language processing for the English language. "textblob" is a library for processing textual data and provides a simple API for common natural language processing (NLP) tasks, such as part-of-speech tagging, noun-phrase extraction, sentiment analysis, classification, translation, and the like.

[0074] Control passes from step 520 to step 525, which cleans stop words using NLTK. Stop words generally refer to common words in a language that are excluded before performing processing of natural language data. Control then passes to step 530, which formats the responses as lower case and applies tag words. In one implementation, the sentiment analysis engine 246 applies the tag words using nltk.pos_tag, which is a utility for parsing text and classifying words into their respective parts of speech.

[0075] For example, syntactic and morphological tags are used to give one or more additional dimensions to the words in the specific context. Those tags collaborate with the model to make better predictions. Example: "candidate" alone can be 50% positive, which has no predictive importance to the model and possibly should be excluded. However, "candidate" as "subject" of a sentence can be 85% negative, whereas "candidate" as "object" can be 73% positive.

[0076] Control passes from step 530 to step 535, which formats the output of step 530 for training and/or prediction. Such formatting may be performed, for example, using pandas and sklearn (Scikit-learn), wherein Scikit-learn is a free software machine learning library. The output of step 535 is pre-processed data 540, being the output of the pre-processing step 120.
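The cleaning steps of Fig. 5 can be illustrated with a minimal sketch. It uses only the Python standard library rather than NLTK, "textblob", or pandas, so the small stop-word set and the function name preprocess are assumptions made for illustration. Lower-casing is applied before stop-word removal so that capitalised stop words are matched, and part-of-speech tagging is omitted.

```python
import string
from typing import List

# Hypothetical, deliberately small stop-word set; the described implementation
# uses the stop-word list supplied by NLTK instead.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "to", "of", "in", "on", "with"}


def preprocess(answer: str) -> List[str]:
    """Clean one referee answer: strip punctuation, lower-case, drop stop words."""
    # Step 520 analogue: remove punctuation and symbols.
    cleaned = answer.translate(str.maketrans("", "", string.punctuation))
    # Step 530 analogue: format as lower case, then split on whitespace.
    tokens = cleaned.lower().split()
    # Step 525 analogue: remove stop words.
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The candidate was reliable, punctual, and easy to work with!")
print(tokens)  # ['candidate', 'reliable', 'punctual', 'easy', 'work']
```

The resulting token list is the kind of input that a formatting step such as step 535 could convert into training or prediction data.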

[0077] Fig. 6 is a flow diagram illustrating a method for performing the learning and training process step 130 of Fig. 1. The learning and training process step 130 may be implemented using different machine learning techniques. The example of Fig. 6 provides two alternative implementations: a first implementation using a Naive Bayes machine learning classifier; and a second implementation using a Neural Net. The first implementation begins at step 605, which receives the pre-processed data 540 from Fig. 5. Control passes to step 610, which instantiates and trains a Naive Bayes classifier to implement a 'Bag of Words' model. This approach trains the classifier to identify words that appear predominantly in negative or positive sentences. The sentiment is subsequently calculated based on the overall positiveness/negativeness of the specific set of words (i.e., a sentence).

[0078] In this example, a Naive Bayes classifier from "textblob" is instantiated and trained to predict sentiment from job reference answers. Control passes from step 610 to step 615, which serialises the trained model using the "pickle" Python module. It will be appreciated that other serialising techniques may equally be practised, such as JSON, PMML, XML, etc.
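By way of illustration only, the 'Bag of Words' training and serialisation of steps 610 and 615 may be sketched as follows. The sketch is a hand-written multinomial Naive Bayes with add-one smoothing and is a stand-in for, not a reproduction of, the "textblob" classifier; the function names train_nb and classify and the toy training answers are assumptions.

```python
import math
import pickle
from collections import Counter


def train_nb(samples):
    """Count words per label; samples is an iterable of (tokens, label)."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for tokens, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["word_counts"].items():
        denom = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(model["doc_counts"][label] / total_docs)
        for token in tokens:
            if token in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical pre-processed answers with hand-assigned labels.
TRAIN = [
    (["reliable", "punctual", "excellent"], "pos"),
    (["great", "team", "player"], "pos"),
    (["often", "late", "unreliable"], "neg"),
    (["poor", "communication", "late"], "neg"),
]

model_nb = train_nb(TRAIN)
serialised = pickle.dumps(model_nb)   # step 615 analogue: serialise the model
restored = pickle.loads(serialised)   # e.g. later, in the prediction step
print(classify(restored, ["punctual", "excellent"]))  # pos
```

Storing the model as plain counts keeps the serialised form independent of any class definition, which is one possible design choice when the model is later loaded in a separate execution environment.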

[0079] The output of step 615 is presented as an output model in step 620. In this example, the model is labelled model_nb.

[0080] The second implementation relates to a neural network classifier that creates a network made of nodes, such as may be achieved using Python's NumPy. Each layer of nodes has its own behaviour, which can be described as a mathematical function that transforms the output that has been received from the previous layer. When this happens, a given weight is applied to fine-tune the outcome, which again is passed to the next layer. Eventually, the values reach the output node, where for example a sigmoid activation function is applied. After computing the cost/loss, the error is back-propagated, updating the weights.

[0081] The second implementation begins at step 625, which receives the pre-processed data 540 from Fig. 5. Control passes to step 630, which vectorises words in the pre-processed data 540 using NumPy, which is a scientific computing package in the Python programming language. Basically, this step converts the words into numbers and feeds those numbers to the first layer of the neural network (NN).

[0082] Control passes from step 630 to step 635, which multiplies the numbers by random weights that represent the first cognitive response to that stimulus (in this example 'words').

[0083] Control passes to step 640, which forwards the outcome of step 635 to the sequence of hidden layers, where another set of weights is applied and a further outcome is forwarded, using numpy.dot.

[0084] Step 645 applies an activation function and creates a prediction, which is compared to the real label/sentiment, which in this example has previously been acquired using Amazon Mechanical Turk. Amazon Mechanical Turk is a web service that provides an on-demand, scalable, human workforce to complete jobs that humans can do better than computers, such as recognizing objects in photographs.

[0085] The output of step 645 is presented to step 650, which calculates the distance between the sentiment prediction of step 645 and the actual sentiment and backpropagates the difference, updating the weights. Backpropagation is a known method of training a neural net, wherein an initial system output is compared to the desired output, and the system is adjusted until the difference between the two is minimised. A predefined number of iterations is set, with the output of step 650 being fed back to step 635. Once the number of iterations has been performed, the output of step 650 is presented as an output model, model_nn in step 655. Thus, the output of the classification step 130 in this example includes two models, model_nb corresponding to the Naive Bayes generated model and model_nn corresponding to the Neural Net generated model. Other approaches for the training and learning process are possible, which could either produce the mentioned models or other models.
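The loop of steps 630 to 650 may be sketched, under stated assumptions, as a network with a single hidden layer. Plain Python lists are used here in place of NumPy arrays so that the sketch has no dependencies; the constant bias units, the layer size, the learning rate, the iteration count, and the toy labelled answers are all assumptions and are not taken from the described arrangement.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vectorise(tokens, vocab):
    """Step 630 analogue: one number per vocabulary word, plus a constant bias input."""
    return [1.0 if word in tokens else 0.0 for word in vocab] + [1.0]


def forward(x, w_hidden, w_out):
    """Steps 635 to 645 analogue: weighted sums followed by sigmoid activations."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, row))) for row in w_hidden]
    hidden.append(1.0)  # constant bias unit for the output node
    output = sigmoid(sum(h * w for h, w in zip(hidden, w_out)))
    return hidden, output


def train(data, vocab, n_hidden=3, iterations=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Random initial weights for the hidden and output layers.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(len(vocab) + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(iterations):  # predefined number of iterations
        for tokens, target in data:
            x = vectorise(tokens, vocab)
            hidden, pred = forward(x, w_hidden, w_out)
            # Step 650 analogue: distance to the label, scaled by the sigmoid slope.
            delta_out = (pred - target) * pred * (1.0 - pred)
            for j in range(n_hidden):
                # Back-propagate to hidden node j using the not-yet-updated w_out.
                delta_h = delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for i, xi in enumerate(x):
                    w_hidden[j][i] -= lr * delta_h * xi
            for j, h in enumerate(hidden):
                w_out[j] -= lr * delta_out * h
    return w_hidden, w_out


# Hypothetical pre-processed answers; 1.0 marks positive, 0.0 negative sentiment.
DATA = [
    (["reliable", "punctual"], 1.0),
    (["excellent", "reliable"], 1.0),
    (["late", "unreliable"], 0.0),
    (["poor", "late"], 0.0),
]
VOCAB = sorted({word for tokens, _ in DATA for word in tokens})

w_hidden, w_out = train(DATA, VOCAB)
_, prediction = forward(vectorise(["reliable", "punctual"], VOCAB), w_hidden, w_out)
print(round(prediction, 3))
```

In a NumPy-based arrangement, the nested sums in forward would typically be replaced by matrix products such as numpy.dot.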

[0086] Fig. 7 is a flow diagram illustrating a method for performing the prediction step 140 of Fig. 1. The prediction step 140 receives a new answer from a referee 230, 235. The answer is provided to step 710, which in this example uses a serverless asynchronous architecture based on AWS Lambda to pre-process the received answer and calculate a sentiment associated with it, by using one or more of the models output from step 130. AWS Lambda is a serverless, customisable, trigger-based, auto-scalable function environment.

[0087] The output of step 710 is presented to step 715, which uses AWS Lambda to insert the new answers into a data storage medium, such as Amazon Redshift, which in this example houses all of the previously submitted answers.

[0088] A following step 720 fetches data from the data storage medium and displays a sentiment for a given reference using Lambda.
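The prediction flow of steps 710 to 720 may be sketched as a handler in the (event, context) form used by AWS Lambda Python functions. This is a sketch under stated assumptions only: the word-weight table MODEL stands in for a deserialised model from step 130, the in-memory list ANSWER_STORE stands in for a data storage medium such as Amazon Redshift, and the event field names reference_id and answer are hypothetical.

```python
import json

# Hypothetical stand-ins for the trained model and the data storage medium.
MODEL = {"reliable": 1.0, "excellent": 1.0, "late": -1.0, "unreliable": -1.0}
ANSWER_STORE = []


def score_answer(text):
    """Average the weights of known words; 0.0 when no known word occurs."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    hits = [MODEL[word] for word in words if word in MODEL]
    return sum(hits) / len(hits) if hits else 0.0


def lambda_handler(event, context):
    """Score one new answer, persist it, and return the stored record."""
    record = {
        "reference_id": event["reference_id"],
        "answer": event["answer"],
        "sentiment": score_answer(event["answer"]),  # step 710 analogue
    }
    ANSWER_STORE.append(record)                      # step 715 analogue
    return {"statusCode": 200, "body": json.dumps(record)}


response = lambda_handler(
    {"reference_id": "ref-001", "answer": "Reliable and excellent, never late."},
    None,
)
print(response["body"])
```

A separate function of the same shape could read the stored records back for display, corresponding to step 720.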

[0089] The output of step 720 is presented as a sentiment score associated with a reference for a job candidate 202. The sentiment score may be displayed graphically on an employer dashboard of the candidate referencing system 201 or associated app. Fig. 8 shows a graphical representation 800 of a sentiment score. In the example of Fig. 8, the sentiment score reflects a sentiment distribution across the reference of:

12% negative; 24% neutral; and 64% positive.
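One way in which such a distribution could be derived from per-answer sentiment labels is sketched below. The function name sentiment_distribution and the label counts are assumptions; the counts are chosen only so that the resulting shares match the example of Fig. 8.

```python
from collections import Counter


def sentiment_distribution(labels):
    """Return the percentage share of each sentiment label across a reference."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: round(100 * counts[label] / total)
        for label in ("negative", "neutral", "positive")
    }


# Hypothetical per-answer labels for one reference (25 answers in total).
labels = ["negative"] * 3 + ["neutral"] * 6 + ["positive"] * 16
print(sentiment_distribution(labels))
# {'negative': 12, 'neutral': 24, 'positive': 64}
```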

Industrial Applicability

[0090] The arrangements described are applicable to the recruitment, survey, data analysis, review, customer service, and HR industries, and to every other industry that generates feedback or scoring data that can be analysed with machine learning.

[0091] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

[0092] In the context of this specification, the word "comprising" and its associated grammatical constructions mean "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises" have correspondingly varied meanings.

[0093] As used throughout this specification, unless otherwise specified, the use of ordinal adjectives "first", "second", "third", "fourth", etc., to describe common or related objects, indicates that reference is being made to different instances of those common or related objects, and is not intended to imply that the objects so described must be provided or positioned in a given order or sequence, either temporally, spatially, in ranking, or in any other manner.

[0094] Although the invention has been described with reference to specific examples, it will be appreciated by those skilled in the art that the invention may be embodied in many other forms.

Claims

The claims defining the invention are as follows:

1. A data analysis method and learning system comprising the steps of:

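building a corpus, based on an initial set of answers associated with an initial set of job references;

training a classification model, based on said corpus;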

calculating a sentiment score for each answer in said corpus;

serialising said trained model;

receiving a new reference associated with a job candidate;

re-training the said classification model based on said new reference;

generating, by said re-trained model, a sentiment score for said new reference; and

presenting new sentiment analysis, based on said sentiment scores.

2. The method according to claim 1, wherein building said corpus includes pre-processing said initial set of answers.

3. The method according to claim 2, wherein pre-processing said initial set of answers includes at least one of removing punctuation, cleaning punctuation, and converting text to lower case letters.

4. The method according to any one of claims 1 to 3, wherein said classification model is a sentiment classification model.

5. The method according to claim 4, wherein said sentiment classification model is one of a Naive Bayes model and a neural net.

6. The method according to claim 5, wherein said sentiment classification model is a Naive Bayes model and the training of said Naive Bayes classification model includes the step of:

instantiating and training a Naive Bayes classifier to predict sentiment from said initial set of answers.

7. The method according to claim 5, wherein said classification model is a neural net and the training of said neural net classification model includes the steps of:

vectorising words in said initial set of answers;

feeding said vectorised words to an input layer of a neural network;

applying weights to nodes of said neural network;

forwarding an outcome of said input layer to a sequence of hidden layers to generate a hidden layers output;

generating a sentiment prediction, based on an activation function applied to said hidden layers output; and

calculating a distance between said sentiment prediction and an actual sentiment and backpropagating the distance to said neural network.

8. The method of any one of claims 1 to 7, wherein generating said sentiment analysis includes the steps of:

using an asynchronous processing environment to pre-process said reference and calculate said sentiment score, using said serialised trained model.

9. The method according to any one of claims 1 to 8, comprising the further step of:

displaying said sentiment score to an employer dashboard associated with a candidate referencing system.

10. The method according to claim 9, wherein said displayed sentiment score is a graphical representation indicating at least one of a negative sentiment, neutral sentiment, and positive sentiment associated with the reference.

11. A candidate referencing system comprising:

a communications network connection for coupling to a communications network; and

a cloud execution model that includes:

a sentiment analysis engine, and

wherein said cloud execution model:

receives references received via the candidate referencing system;

wherein said sentiment analysis engine:

pre-processes the received references;

wherein said data persistence environment:

persists the sentiment score; and

wherein said sentiment analysis engine:

re-trains the classification model based on new references; and

determines an overall sentiment for a received reference, based on an aggregate of the individual sentiment scores of the answers associated with that reference.