NL1037190A

NL1037190A - CLASSIFICATION MECHANISM USING A DYNAMIC QUESTIONNAIRE.

Info

Publication number: NL1037190A
Application number: NL1037190A
Authority: NL
Inventors: Andreas Theodorus Jong; Martijn Antonius Broenland
Original assignee: Hexon B V
Priority date: 2009-08-11
Filing date: 2009-08-11
Publication date: 2011-02-14
Also published as: NL1037190C2

Description

Classification mechanism using a dynamic questionnaire.

A classification questionnaire is a mechanism for assigning a category (class) to a person who answers the questions in a questionnaire (the participant). Examples are medical diagnosis, where the best-matching condition is found by answering questions about symptoms, and a voting advice tool, where the best-matching political party is found by answering questions about political preference.

A characteristic of classification questionnaires is that the list of questions is asked in a predetermined order.

The drawback is that not every question asked is equally relevant to a given participant. For example, if the questions 'Do you own a car?' and 'What is the travel distance between your home and your work?' are asked first, and the participant answers 'no' and 'more than 10 km' respectively, the question 'Do you use public transport?' becomes less relevant, because the probability that it will be answered 'yes' has increased.

An existing solution to this problem is to apply logic and skip questions in the questionnaire (skip-pattern questionnaires). Rules are used under which one or more questions can be skipped once a particular answer is given. An example is the question 'Do you have children?', where the answer 'no' is accompanied by the instruction 'Go to question ..'.

Using the computing power of a computer system, it is possible to go beyond this binary logic and to determine the relevance of a question dynamically and probabilistically. Instead of determining relevance in a binary way with logic, the model presented here calculates the relevance of the remaining questions in the questionnaire after each question, and the best follow-up question is put to the participant on the basis of this relevance. In this way a computer system can be developed that asks a participant questions through a simple interface and then advises the participant or a domain expert about the outcome candidates.

With an optimal question order, questioning can stop after a relevant part of the questions has been asked, with minimal loss of classification accuracy. Alternatively, by giving the system more questions than will actually be asked, more relevant questions can be asked, resulting in a better classification.

These two benefits can also be combined, resulting in a short, precise questionnaire.

The relevance of a question is determined in two ways: on the one hand by looking at which questions will discriminate well between the relevant candidates (in effect, maximizing the entropy of a question), and on the other hand by looking at how much non-overlapping information an answered question would add to the classification (with the aim of achieving high objectivity).

How well questions meet these criteria is calculated automatically by modeling, in a universal way, both the questions and the influence that the answers to those questions have on the various possible outcomes. With the invention, the above criteria can be determined in combination with an arbitrary classification model (for example probability theory, Bayesian logic, PCA and the like).

The invention can be placed in an automated questionnaire, which can be modeled as in Figure 1, showing schematically how an automated questionnaire is run. First, a question is selected using the invention (step 1 in Figure 1). This question is then put to the participant (step 2). Next, it is decided (step 3) whether more questions can be asked and whether they should be. If so, the process repeats from step 1. Once there are no more questions, or it is decided not to continue, a classification is performed by the classification inference machine described below (step 4). The result of the classification is then presented in a form appropriate to the context in which the questionnaire is held (step 5).
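The flow of Figure 1 can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function name `run_questionnaire` and the four callback parameters are hypothetical names introduced here, and step 5 (presentation) is left to the caller.

```python
from typing import Callable, Dict, List, Tuple

# A history is a list of <v, a> tuples: (question, answer).
History = List[Tuple[str, str]]


def run_questionnaire(
    questions: List[str],
    select_question: Callable[[List[str], History], str],   # step 1
    ask: Callable[[str], str],                               # step 2
    should_continue: Callable[[List[str], History], bool],  # step 3
    classify: Callable[[History], Dict[str, float]],        # step 4
) -> Dict[str, float]:
    """Run steps 1 to 4 of Figure 1 and return the classification result."""
    remaining = list(questions)
    history: History = []
    while remaining:
        v = select_question(remaining, history)  # step 1: pick a question
        a = ask(v)                               # step 2: put it to the participant
        history.append((v, a))
        remaining.remove(v)
        if not should_continue(remaining, history):  # step 3: stop early?
            break
    # Step 4: classify; step 5 (presentation) is up to the caller.
    return classify(history)
```

Passing the selection, stopping and classification logic as callbacks keeps the loop independent of the classification model, in line with the text's statement that an arbitrary model can be used.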

The invention works with an arbitrary classification model that can be modeled as shown schematically in Figure 2. Block A represents the classification inference machine mentioned earlier. Its first input h is a set of one or more question-answer tuples, each of the form <v, a>, where v is a question and a is an answer to question v. Its second input e is a set of outcome candidates. From these two inputs the inference machine generates an output set S of the same length as e, in which each element of S gives the relevance of the corresponding element of e, given that a participant has answered the questions in the question-answer tuples h with the corresponding answers.
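The interface of Figure 2 (inputs h and e, output S) can be illustrated with a deliberately simple stand-in model. The additive scoring below is an assumption made for this sketch only; the text allows any classification model. The name `infer` and the `influences` table are hypothetical.

```python
from typing import Dict, List, Tuple

QA = Tuple[str, str]                      # a <v, a> tuple
Influences = Dict[QA, Dict[str, float]]   # assumed: influence of each answer per candidate


def infer(h: List[QA], e: List[str], influences: Influences) -> List[float]:
    """Return S: one relevance score per outcome candidate in e, given history h.

    Assumption for illustration: relevance is the sum of the elemental
    influences of every answered <v, a> tuple on the candidate.
    """
    return [
        sum(influences.get(qa, {}).get(candidate, 0.0) for qa in h)
        for candidate in e
    ]
```

The output list has the same length and order as e, matching the description of the output set S.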

The question selection of step 1 in Figure 1 is carried out by the invention. The invention can be modeled as in Figure 3, in which an 'Information measure' (block A) and an 'Objectivity measure' (block B) each yield a score (S1 and S2). Combining these scores (block S in Figure 3) makes it possible to select a follow-up question (v).
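The text does not state how block S combines the two scores. One possible choice, shown here purely as an assumption, is a weighted sum followed by picking the highest-scoring candidate question. The function name `select_follow_up` and the `weight` parameter are hypothetical.

```python
from typing import Dict, Tuple


def select_follow_up(scores: Dict[str, Tuple[float, float]], weight: float = 0.5) -> str:
    """Pick the candidate question v with the best combined score.

    scores maps each candidate question to its (S1, S2) pair.
    Assumption: the combination is weight * S1 + (1 - weight) * S2;
    the source only says that the two scores are combined.
    """
    def combined(v: str) -> float:
        s1, s2 = scores[v]
        return weight * s1 + (1.0 - weight) * s2

    return max(scores, key=combined)
```

Since the two measures may live on different scales, a real system would probably normalize them before combining; that step is omitted here.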

The Information measure (block A in Figure 3) calculates how much new information a question can potentially provide, and is modeled as in Figure 4. First, an 'Important Set' is calculated from the answers already given (h) and the outcome candidates (e) (block IS in Figure 4). This process, described below, selects the outcome candidates that are most relevant at that moment given h (e' in Figure 4). Score S1 is then calculated for each candidate question (v) from the set of candidate questions V, using e' (Ent in Figure 4). The calculation starts by determining, for each possible answer (a) to a candidate question v, the set of scores S for the candidate outcomes in e' if a were added to the current history h. This is done with the classification inference machine modeled in Figure 2, where in this case its input h is the h of Figure 4 with the tuple <v, a> appended, and its input e is the e' of Figure 4.

The resulting scores for each outcome candidate are then compared and a new set M is created (M in Figure 4), in which the element for each answer to v is the number of outcome candidates from e' that have the highest relevance score for that answer. This set thus represents the expected distribution of outcome candidates given the questions. Calculating the entropy of this set, as defined in the literature by Shannon, gives a measure of uncertainty about the outcome (S1). The higher this number, the more the question contributes to the classification.
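The Information measure can be sketched as follows. The sketch rests on one reading of the text, stated here as an assumption: each outcome candidate in e' is attributed to the answer under which its relevance score is highest, M holds the resulting counts per answer, and S1 is the Shannon entropy (base 2) of M normalized to a distribution. Ties go to the first answer listed. The name `information_measure` and the `infer` callback signature are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

QA = Tuple[str, str]
Infer = Callable[[List[QA], Sequence[str]], List[float]]


def information_measure(
    v: str,
    answers: Sequence[str],
    h: List[QA],
    e_prime: Sequence[str],
    infer: Infer,
) -> float:
    """Return S1 for candidate question v, as the entropy of the set M."""
    if not answers or not e_prime:
        return 0.0
    # Scores S over e' for each hypothetical answer a, using history h + <v, a>.
    per_answer = [infer(h + [(v, a)], e_prime) for a in answers]
    # M: for each answer, how many candidates score highest under that answer.
    counts = [0] * len(answers)
    for j in range(len(e_prime)):
        best = max(range(len(answers)), key=lambda i: per_answer[i][j])
        counts[best] += 1
    total = sum(counts)
    # Shannon entropy of the normalized counts; empty bins contribute nothing.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

A question whose answers split the candidates evenly scores highest, while a question under which every candidate prefers the same answer scores zero.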

The 'Important Set' (block IS in Figure 4) is calculated by having the classification inference machine compute the relevance score S for each outcome candidate, given the current history h. By comparing the scores of all outcome candidates, a selection is then made consisting of a certain percentage g of the outcome candidates with the most relevant scores. The percentage g is chosen automatically on the basis of the number of questions already answered. The more questions have been answered, the lower the percentage should be, so that the 'Important Set' becomes progressively smaller and more specific.
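A possible sketch of the Important Set selection follows. The text says only that g decreases as more questions are answered; the geometric decay schedule and its parameters (`start`, `decay`, `floor`) are assumptions made for illustration, as are the function names.

```python
import math
from typing import List, Sequence


def fraction_to_keep(n_answered: int, start: float = 1.0,
                     decay: float = 0.8, floor: float = 0.1) -> float:
    """Assumed schedule for g: shrink geometrically, but never below floor."""
    return max(floor, start * decay ** n_answered)


def important_set(e: Sequence[str], scores: Sequence[float], n_answered: int) -> List[str]:
    """Return e': the top fraction g of outcome candidates by relevance score."""
    g = fraction_to_keep(n_answered)
    k = max(1, math.ceil(g * len(e)))  # always keep at least one candidate
    ranked = sorted(range(len(e)), key=lambda i: scores[i], reverse=True)
    # Preserve the original candidate order within the selected subset.
    return [e[i] for i in sorted(ranked[:k])]
```

With five candidates, the default schedule keeps all five before any question is answered, three after three answers, and a single candidate once g has reached its floor.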

The Objectivity measure of a potential question v (block B in Figure 3) determines the extent to which the potential question probes factors that have not yet been answered, and is modeled as in Figure 5. Objectivity is determined using the relevance score s in Figure 5, which is first calculated for the outcome candidates (e) and the current history (h) by the classification inference machine (block A in Figure 5). A score s' in Figure 5 is also determined for each answer (a) to the potential question v, by taking the elemental influences that answer (a) has on the outcome candidates and treating them as the history of a question, which can then be applied to the classification inference machine (block B in Figure 5). The scores s and s' thus calculated are the scores of the history and of the answers to the question; they are compared using the well-known Pearson correlation measure (block C in Figure 5). This results in a score for each answer. These scores are combined into a single score S2, which can then be used in the question selection model (Figure 3).
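The Objectivity measure can be sketched as below. The per-answer comparison uses Pearson correlation as described; the way the per-answer scores are combined into S2 is an assumption based on claim 4 (absolute correlation weighted by the probability of each answer), and returning one minus that overlap so that higher means more objective is a further assumption. The names `pearson`, `objectivity_measure` and `answer_probs` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

QA = Tuple[str, str]
Infer = Callable[[List[QA], Sequence[str]], List[float]]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def objectivity_measure(
    v: str,
    answer_probs: Dict[str, float],  # assumed: probability of each answer to v
    h: List[QA],
    e: Sequence[str],
    infer: Infer,
) -> float:
    """Return S2: high when v overlaps little with what h already covers."""
    s = infer(h, e)  # block A: scores for the current history
    overlap = 0.0
    for a, p in answer_probs.items():
        s_prime = infer([(v, a)], e)  # block B: the answer treated as a history
        overlap += p * abs(pearson(s, s_prime))  # block C, weighted per claim 4
    return 1.0 - overlap
```

A candidate question whose answers move the outcome candidates in the same pattern as the existing history, or in the exact opposite pattern, gets a score near 0, while a question whose influence is uncorrelated with the history gets a score near 1.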

Claims

1. The invention arranges a questionnaire in such a way that more relevant questions are asked earlier, in order to speed up a classification questionnaire and/or to improve the quality of the classification.

2. By means of information obtained from the answers that a participant gives to questions, a strategic change in the order of the questions is made in order to achieve the aim of claim 1.

3. By modeling outcome candidates and their elemental influences on the outcome candidates by means of an arbitrary classification model that is compatible with the model in Figure 2, in which a relevance score (S) is calculated for each outcome candidate from a question/answer history (h) and outcome candidates (e), insight is provided into the similarity between already answered questions and potential questions, and into the amount of relevant information that answers and questions can offer, in order to determine the strategic question order of claim 2.

4. By using the absolute value of the correlation between the answers already given and the elemental influences that a potential answer has, combined with the probabilities that a potential answer to a potential question is given, the similarity between a potential question and any previously asked questions is determined, in order to use it in determining the strategic question order of claims 2 and 3.

5. By using the classification model described in claim 3 and the well-known entropy measure, the amount of relevant information that a potential question could yield, given the question history (h) and the outcome candidates (e), is calculated for a subset of outcome candidates characterized by their high relevance score at that moment, in order to use it in determining the strategic question order described in claim 3.