SE532252C2

SE532252C2 - Method and apparatus for extracting information from a database

Info

Publication number: SE532252C2
Application number: SE0801708A
Authority: SE
Inventors: Haakan Wolge
Original assignee: Qliktech Internat Ab
Priority date: 2008-07-18
Filing date: 2008-07-18
Publication date: 2009-11-24
Also published as: SE0801708L; CN101635001A; ES2713097T3; CN101635001B; DK2146292T3

Description

mathematical function on the extracted subset, wherein the evaluation of the mathematical function is made on the basis of a selected set of calculation variables, and wherein the dimensions of the cube are given by a selected set of classification variables. Although the prior art algorithm is efficient, it still needs to perform a large number of operations to create the multidimensional cube, especially when large amounts of data are to be analyzed. In such situations, the algorithm may place undesirably high demands on the processing hardware and/or exhibit an undesirably long computation time.

Summary of the invention

It is an object of the invention to at least partially overcome one or more of the above limitations of the prior art.

This and other objects, which will become apparent from the following description, are at least partially achieved by a method and an apparatus according to the independent claims, embodiments thereof being defined by the dependent claims.

A first aspect of the invention relates to a computer-implemented method for extracting information from a database, the method comprising a sequential chain of main calculations comprising a first main calculation that operates a first selection object on a data set representing the database to produce a first result, and a second main calculation that operates a second selection object on the first result to produce a second result, the method further comprising caching the first and second results by: calculating a first selection identifier value as a function of at least the first selection object, and a second selection identifier value as a function of at least the second selection object and the first result; and storing the first selection identifier value and the first result, and the second selection identifier value and the second result, respectively, as associated objects in a data structure.

In one embodiment, the method further comprises using the data structure to find the second result based on the first selection object and the second selection object, the step of using comprising the sub-steps:
(a) calculating the first selection identifier value as a function of at least the first selection object;
(b) searching among the objects in the data structure based on the first selection identifier value to locate the first result;
(c) if the first result is found in sub-step (b), calculating the second selection identifier value as a function of the first result and the second selection object, and searching among the objects in the data structure based on the second selection identifier value to locate the second result;
(d) if the first result is not found in sub-step (b), executing the first main calculation to produce the first result, calculating the second selection identifier value as a function of the first result and the second selection object, and searching among the objects in the data structure based on the second selection identifier value to locate the second result; and
(e) if the second result is not found in sub-step (c) or (d), executing the second main calculation to produce the second result.
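Sub-steps (a) to (e) above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `fingerprint` helper (SHA-256 over a `repr` encoding), the plain `dict` used as the data structure, and the toy procedures `p1` and `p2`. The sketch also stores newly computed results in the cache, as described for the first aspect.

```python
import hashlib

def fingerprint(*parts):
    # Hypothetical identifier function f: SHA-256 over a textual encoding.
    # Assumes repr() of the arguments is deterministic (true for the toy data here).
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()

def find_second_result(cache, s1, s2, data, p1, p2):
    """Return the second result for (s1, s2), running p1/p2 only on cache misses."""
    id1 = fingerprint("first", s1)            # (a) first selection identifier value
    if id1 in cache:                          # (b) look for the first result
        r1 = cache[id1]
    else:                                     # (d) miss: run the first main calculation
        r1 = p1(s1, data)
        cache[id1] = r1
    id2 = fingerprint("second", s2, r1)       # (c)/(d) second selection identifier value
    if id2 in cache:
        return cache[id2]
    r2 = p2(s2, r1)                           # (e) miss: run the second main calculation
    cache[id2] = r2
    return r2

# Toy procedures that count how often they are executed.
calls = {"P1": 0, "P2": 0}

def p1(selection, data):
    calls["P1"] += 1
    return [row for row in data if row["year"] == selection]

def p2(selection, rows):
    calls["P2"] += 1
    return sum(row[selection] for row in rows)

data = [
    {"year": 2007, "sales": 10},
    {"year": 2008, "sales": 5},
    {"year": 2008, "sales": 7},
]
cache = {}
first = find_second_result(cache, 2008, "sales", data, p1, p2)
second = find_second_result(cache, 2008, "sales", data, p1, p2)
print(first, second, calls)
```

The second call is served entirely from the cache, so neither toy procedure runs again.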

In one embodiment, the method further comprises the step of calculating a first result identifier value as a function of the first result, the step of storing further comprising the steps of storing the first selection identifier value and the first result identifier value as associated objects in the data structure, and storing the first result identifier value and the first result as associated objects in the data structure.

In one embodiment, the method further comprises the step of using the data structure to find the second result based on the first selection object and the second selection object, the step of using comprising the sub-steps:
(a) calculating the first selection identifier value as a function of at least the first selection object;
(b) searching among the objects in the data structure based on the first selection identifier value to locate the first result identifier value, and searching among the objects in the data structure based on the first result identifier value to locate the first result;
(c) if the first result is found in sub-step (b), calculating the second selection identifier value as a function of the first result and the second selection object, and searching among the objects in the data structure based on the second selection identifier value to locate the second result;
(d) if the first result identifier value or the first result is not found in sub-step (b), executing the first main calculation to produce the first result, calculating the second selection identifier value as a function of the first result and the second selection object, and searching among the objects in the data structure based on the second selection identifier value to locate the second result; and
(e) if the second result is not found in sub-step (c) or (d), executing the second main calculation to produce the second result.

In one embodiment, the first result is represented, in the calculation of the second selection identifier value, by the first result identifier value.

In one embodiment, the method further comprises the step of using the data structure to find the second result based on the first selection object and the second selection object, the step of using comprising the sub-steps:
(a) calculating the first selection identifier value as a function of at least the first selection object;
(b) searching among the objects in the data structure based on the first selection identifier value to locate the first result identifier value;
(c) if the first result identifier value is found in sub-step (b), calculating the second selection identifier value as a function of the first result identifier value and the second selection object, and searching among the objects in the data structure based on the second selection identifier value to locate the second result;
(d) if the first result identifier value is not found in sub-step (b), executing the first main calculation to produce the first result, calculating the first result identifier value as a function of the first result, calculating the second selection identifier value as a function of the first result identifier value and the second selection object, and searching among the objects in the data structure based on the second selection identifier value to locate the second result;
(e) if the second result is not found in sub-step (c), searching among the objects in the data structure based on the first result identifier value to locate the first result, and executing the second main calculation to produce the second result;
(f) if the first result is not found in sub-step (e), executing the first main calculation to produce the first result, and executing the second main calculation to produce the second result; and
(g) if the second result is not found in sub-step (d), executing the second main calculation to produce the second result.
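The variant in sub-steps (a) to (g) keys the second lookup on the first result identifier value rather than on the first result itself, so the second result can be found even when the (typically large) first result is no longer cached. The following Python sketch is one possible reading of those steps. The helper `fp`, the single `dict` holding all three kinds of entries, and the toy procedures are assumptions for illustration; the sketch also assumes that the first main calculation never returns `None`.

```python
import hashlib

def fp(*parts):
    # Hypothetical identifier function: SHA-256 over a textual encoding.
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()

def find_r2(cache, s1, s2, data, p1, p2):
    sel_id1 = fp("sel", s1)                    # (a) first selection identifier value
    res_id1 = cache.get(sel_id1)               # (b) locate the first result identifier value
    r1 = None
    if res_id1 is None:                        # (d) miss: compute R1 and its identifier
        r1 = p1(s1, data)
        res_id1 = fp("res", r1)
        cache[sel_id1] = res_id1
        cache[res_id1] = r1
    sel_id2 = fp("sel", s2, res_id1)           # (c)/(d) second selection identifier value
    if sel_id2 in cache:
        return cache[sel_id2]
    if r1 is None:                             # (e) R2 missing: try to fetch R1 by its identifier
        r1 = cache.get(res_id1)
        if r1 is None:                         # (f) R1 missing as well: recompute it
            r1 = p1(s1, data)
            cache[res_id1] = r1
    r2 = p2(s2, r1)                            # (e)/(f)/(g) run the second main calculation
    cache[sel_id2] = r2
    return r2

calls = {"P1": 0, "P2": 0}

def p1(selection, data):
    calls["P1"] += 1
    return [row for row in data if row["year"] == selection]

def p2(selection, rows):
    calls["P2"] += 1
    return sum(row[selection] for row in rows)

data = [{"year": 2008, "sales": 5}, {"year": 2008, "sales": 7}, {"year": 2007, "sales": 1}]
cache = {}
first = find_r2(cache, 2008, "sales", data, p1, p2)

# Simulate eviction of the first result while its identifier stays cached.
del cache[cache[fp("sel", 2008)]]
second = find_r2(cache, 2008, "sales", data, p1, p2)
print(first, second, calls)
```

In this run the second call still hits the cache after the first result has been removed, because only the compact identifier chain is needed to reach the second result.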

In one embodiment, the method further comprises the step of calculating a second result identifier value as a function of the second result, the step of storing further comprising the steps of storing the second selection identifier value and the second result identifier value as associated objects in the data structure, and storing the second result identifier value and the second result as associated objects in the data structure.

In one embodiment, each of the identifier values is statistically unique.

In one embodiment, each of the identifier values is a digital fingerprint generated by a hash function. For example, the digital fingerprint may comprise at least 256 bits.
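As a hedged illustration of a 256-bit digital fingerprint, the sketch below uses SHA-256 from Python's standard library. The text does not name a specific hash function, so SHA-256, the JSON-based canonical encoding, and the field names are assumptions chosen for the example.

```python
import hashlib
import json

def identifier(selection):
    # Canonical encoding (sorted keys) so that equal selections yield equal bytes.
    canonical = json.dumps(selection, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()

a = identifier({"field": "Year", "condition": "= 2008"})
b = identifier({"condition": "= 2008", "field": "Year"})   # same selection, different key order
c = identifier({"field": "Year", "condition": "= 2009"})   # different selection

print(len(a) * 8, a == b, a == c)
```

Equal selections map to the same identifier regardless of key order, while a different condition yields a different identifier with overwhelming probability, which is the sense of "statistically unique" used above.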

In one embodiment, the method further comprises the step of selectively deleting data records containing associated objects in the data structure, based at least on the size of the data records. The step of selectively deleting may be designed to favor the deletion of data records that contain a first result. In such an embodiment, the method further comprises the step of associating each data record with a weight value, which is calculated as a function of a usage parameter for each data record, a calculation time parameter for each data record and a size parameter for each data record. The weight value may be calculated by evaluating a weight function given by W = U * T / M, where U is the usage parameter, T is the calculation time parameter and M is the size parameter. The value of the usage parameter may be incremented each time the data record is accessed, while the value is exponentially decreased as a function of time. The step of selectively deleting may be based on the weight values of the data records in the data structure. Furthermore, the step of selectively deleting may be triggered based on a comparison of the current size of the data structure with a threshold value.
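The weight-based deletion can be sketched as follows. This is an illustrative Python reading of W = U * T / M, not a definitive implementation: the `Record` class, the half-life constant, the explicit `now` argument, and the policy of deleting the lowest-weight records until the total size falls to the threshold are all assumptions.

```python
import math
from dataclasses import dataclass

HALF_LIFE = 3600.0  # hypothetical decay half-life for the usage parameter, in seconds

@dataclass
class Record:
    value: object
    calc_time: float          # T: time it took to compute the value
    size: int                 # M: memory footprint of the record
    usage: float = 1.0        # U: usage parameter as of last_access
    last_access: float = 0.0

def decayed_usage(rec, now):
    # Exponential decay of U as a function of the time since the last access.
    return rec.usage * math.exp(-math.log(2) * (now - rec.last_access) / HALF_LIFE)

def touch(rec, now):
    # Increment U on each access, after applying the decay.
    rec.usage = decayed_usage(rec, now) + 1.0
    rec.last_access = now

def weight(rec, now):
    # W = U * T / M
    return decayed_usage(rec, now) * rec.calc_time / rec.size

def evict(cache, now, max_size):
    """Delete lowest-weight records until the total size is at most max_size."""
    total = sum(rec.size for rec in cache.values())
    removed = []
    for key in sorted(cache, key=lambda k: weight(cache[k], now)):
        if total <= max_size:
            break
        total -= cache[key].size
        removed.append(key)
        del cache[key]
    return removed

cache = {
    "R1": Record(value="large intermediate result", calc_time=2.0, size=1000),
    "R2": Record(value="small final result", calc_time=5.0, size=10),
}
removed = evict(cache, now=0.0, max_size=500)
print(removed, list(cache))
```

Because M appears in the denominator, large records that were cheap to compute receive low weights and are deleted first, which is consistent with favoring the deletion of first (intermediate) results when these are the larger objects.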

In one embodiment, the database is a dynamic database, and the first selection identifier value is calculated as a function of at least the first selection object and the data set.

In one embodiment, the first selection object defines a set of fields in the data set and a condition for each field, the result of the first main calculation being representative of a subset of the data set, the second selection object defines a mathematical function, one or more calculation variables included in the subset and one or more classification variables included in the subset, and the result of the second main calculation is a multidimensional cube data structure containing the result of operating the mathematical function on said one or more calculation variables for each unique value of each classification variable.
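A small Python sketch may help to make the two main calculations concrete. It assumes an in-memory list of rows, predicates as field conditions, a single calculation variable, and hypothetical field names (`Year`, `Product`, `Region`, `Sales`); none of these details come from the text.

```python
from collections import defaultdict

def first_main_calculation(s1, rows):
    """s1 maps a field name to a condition; returns the subset satisfying every condition."""
    return [row for row in rows if all(cond(row[field]) for field, cond in s1.items())]

def second_main_calculation(s2, subset):
    """s2 = (function, calculation variable, classification variables).

    Returns a cube as a dict keyed by each unique combination of
    classification-variable values found in the subset.
    """
    func, calc_var, class_vars = s2
    groups = defaultdict(list)
    for row in subset:
        key = tuple(row[var] for var in class_vars)
        groups[key].append(row[calc_var])
    return {key: func(values) for key, values in groups.items()}

rows = [
    {"Year": 2008, "Product": "A", "Region": "N", "Sales": 10},
    {"Year": 2008, "Product": "A", "Region": "S", "Sales": 4},
    {"Year": 2008, "Product": "B", "Region": "N", "Sales": 7},
    {"Year": 2007, "Product": "A", "Region": "N", "Sales": 3},
]
s1 = {"Year": lambda year: year == 2008}          # first selection object
s2 = (sum, "Sales", ("Product", "Region"))        # second selection object

subset = first_main_calculation(s1, rows)
cube = second_main_calculation(s2, subset)
print(cube)
```

Here the classification variables `Product` and `Region` span a two-dimensional cube, and each cell holds the sum of `Sales` for the rows of the subset that fall into it.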

A second aspect of the invention is a computer-readable medium on which is stored a computer program which, when executed by a computer, is designed to carry out the method according to the first aspect.

A third aspect of the invention is an apparatus for extracting information from a database, the apparatus comprising means for executing a sequential chain of calculations comprising a first main calculation that operates a first selection object on a data set representing the database to produce a first result, and a second main calculation that operates a second selection object on the first result to produce a second result, the apparatus further comprising means for caching the first and second results by: calculating a first selection identifier value as a function of at least the first selection object, and a second selection identifier value as a function of at least the second selection object and the first result; and storing the first selection identifier value and the first result, and the second selection identifier value and the second result, respectively, as associated objects in a data structure.

The apparatus of the third aspect has the same advantages as the method of the first aspect and may comprise further features corresponding to any of the embodiments described above with reference to the first aspect.

Still other objects, features, aspects and advantages of the present invention will become apparent from the following detailed description, the appended claims and the drawings.

Brief description of the drawings

Embodiments of the invention will now be described in more detail with reference to the accompanying schematic drawings, in which corresponding elements are identified by the same reference numerals.

Fig. 1 shows a process involving a chain of calculations for extracting information from a database, in which identifiers and results are selectively stored in and retrieved from a computer memory.

Fig. 2 shows an embodiment of the process in Fig. 1.

Fig. 3 shows another embodiment of the process in Fig. 1.

Fig. 4 shows yet another embodiment of the process in Fig. 1.

Fig. 5 shows yet another embodiment of the process in Fig. 1.

Fig. 6 is an exemplary flow chart for the process in Fig. 5.

Fig. 7 is an overview of the process in Fig. 5 implemented in a specific context.

Fig. 8 is a block diagram of a computer-based environment for implementing embodiments of the invention.

Detailed description of exemplary embodiments

The present invention relates to techniques for extracting information from a database. To facilitate understanding, certain basic principles will first be discussed in relation to a generalized example. Thereafter, various aspects, features and advantages will be discussed in relation to a specific embodiment.

General

Fig. 1 shows an example of a computer-implemented process for extracting information from a database DB, which may, but need not, be stored externally to the computer that implements the process.

The extraction process involves extracting an initial data set or scope R0 from the database DB, typically by reading the initial data set R0 into the primary memory (typically RAM) of the computer. The initial data set R0 may comprise the entire contents of the database DB, or a subset thereof.

The process in Fig. 1 involves a sequence of main calculation procedures P1, P2, which are designed to generate a final result R2 on the basis of the initial data set R0. More specifically, a first procedure P1 operates on the initial data set R0 to produce an intermediate result R1, and the second procedure P2 operates on the intermediate result R1 to produce the final result R2.

The first procedure P1 is controlled via a first selection object S1, which may, but need not, originate from user-generated input. Similarly, the second procedure P2 is controlled via a second selection object S2, which may, but need not, originate from user-generated input. Each selection object S1, S2 may include any combination of variables and/or mathematical functions that define a refinement of the input data to the respective procedure, i.e. the data set R0 and the intermediate result R1, respectively.

Fig. 1 also indicates that the extraction process interacts with a computer memory 10 (typically RAM or cache memory), in that the first and second procedures P1, P2 are designed to store data objects in the memory 10 and retrieve data objects from the memory 10. In the example shown, the first procedure P1 is designed to store and retrieve identifiers, generally denoted ID, and intermediate results R1, and the second procedure P2 is designed to store and retrieve identifiers, generally denoted ID, intermediate results R1 and final results R2. In the following, the procedure of storing identifiers and results in the computer memory 10 is also referred to as "caching".

Different identifiers are typically generated by the procedures P1, P2 as a function of one or more process parameters, such as another identifier and/or a selection object S1, S2 and/or a result R1, R2. Different functions can, but need not, be used to generate different identifiers. The function or functions for generating an identifier may be a hashing algorithm that generates a digital fingerprint of the relevant process parameter(s). The function or functions are suitably designed such that each unique combination of parameter values results in an identifier value that is unique among all identifier values generated for all the different identifiers within the process. In this context, "unique" includes not only theoretically unique identifier values but also statistically unique identifier values. A non-limiting example of such a function is a hashing algorithm that generates a digital fingerprint of at least 256 bits.

In one embodiment, which is further illustrated in Fig. 2, the first procedure P1 is designed to calculate a first selection identifier value ID1 as a function of the first selection object S1, i.e. ID1 = f(S1), and the second procedure P2 is designed to calculate a second selection identifier value ID3 as a function of the second selection object S2 and the intermediate result R1, i.e. ID3 = f(S2, R1).

The first procedure P1 is also designed to store ID1 and the intermediate result R1 as associated objects in a data structure 12 in the computer memory 10, and the second procedure P2 is designed to store ID3 and the end result R2 as associated objects in the data structure 12. The data structure 12 is thus designed to store heterogeneous sets of objects, i.e. objects of different types.

This embodiment enables a reduction of the response time of the extraction process and/or of the data processing requirements on the computer implementing the extraction process, by reducing the need to execute the main calculation procedures P1, P2 for calculating the intermediate result R1 and the end result R2, respectively. For example, the extraction process may be designed to use the data structure 12 whenever possible to find the end result R2 on the basis of the first selection object S1 and the second selection object S2. Thus, when the process detects a need to calculate the end result R2 based on S1 and S2, it can generate ID1 = f(S1) and access the data structure 12 based on ID1. If an identical first selection object S1 has been used previously with the first procedure P1, it is likely that the generated value of ID1 is found in the data structure 12 and is associated with the corresponding intermediate result R1. The intermediate result R1 can then be retrieved from the data structure 12 instead of being calculated by the procedure P1. If the intermediate result R1 is not found in the data structure 12, the process can cause the first procedure P1 to calculate the intermediate result R1.

Furthermore, after obtaining the intermediate result R1, the process can generate ID3 = f(R1, S2) and access the data structure 12 based on ID3. Again, if the same operation has been executed previously by the procedure P2, it is likely that the generated value of ID3 is found in the data structure 12 and is associated with the corresponding end result R2. The end result R2 can thereby be retrieved from the data structure 12 instead of being calculated by the procedure P2.
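The two lookups of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the data set `R0`, the filtering procedure `p1`, the summing procedure `p2`, the call counter `calls` and the SHA-256-based function `f` are all hypothetical, and a Python dict stands in for the data structure 12.

```python
import hashlib
import json


def f(*params):
    # Hypothetical identifier function: SHA-256 over a canonical JSON dump.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


cache = {}                      # stands in for data structure 12
calls = {"P1": 0, "P2": 0}      # counts how often each procedure really runs

R0 = [                          # hypothetical data set
    {"year": 2006, "group": "Dairy", "amount": 10},
    {"year": 2007, "group": "Dairy", "amount": 20},
    {"year": 2007, "group": "Bread", "amount": 5},
    {"year": 2007, "group": "Dairy", "amount": 7},
]


def p1(s1):
    """First procedure: extract the records of R0 matching selection S1."""
    calls["P1"] += 1
    return [row for row in R0 if all(row[k] == v for k, v in s1.items())]


def p2(s2, r1):
    """Second procedure: aggregate the field named in S2 over R1."""
    calls["P2"] += 1
    return sum(row[s2["sum_of"]] for row in r1)


def extract(s1, s2):
    id1 = f(s1)                         # ID1 = f(S1)
    if id1 in cache:
        r1 = cache[id1]                 # R1 retrieved instead of calculated
    else:
        r1 = cache[id1] = p1(s1)
    id3 = f(s2, r1)                     # ID3 = f(S2, R1)
    if id3 in cache:
        return cache[id3]               # R2 retrieved instead of calculated
    r2 = cache[id3] = p2(s2, r1)
    return r2


first = extract({"year": 2007, "group": "Dairy"}, {"sum_of": "amount"})
second = extract({"year": 2007, "group": "Dairy"}, {"sum_of": "amount"})
```

The second call is answered entirely from the cache, so each procedure has run only once after both calls.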

In an embodiment, further illustrated in Fig. 3, the first procedure P1 is further designed to calculate a first result identifier value ID2 as a function of the intermediate result R1. The first procedure P1 is also designed to store ID1 and ID2 as associated objects in the data structure 12, and to store ID2 and the intermediate result R1 as associated objects in the data structure 12.

This embodiment makes it possible to reduce the size of the computer memory required by the process, since each intermediate result R1 is stored only once in the data structure 12, even if two or more first selection objects S1 yield identical intermediate results R1. This embodiment is particularly relevant when the intermediate results R1 are large, which is often the case when processing information from databases.
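The memory saving can be illustrated with a small sketch in which two different selections happen to yield the same intermediate result. The names `f`, `p1`, `cached_p1` and the sample data are hypothetical, and a dict again stands in for the data structure 12.

```python
import hashlib
import json


def f(*params):
    # Hypothetical identifier function: SHA-256 over a canonical JSON dump.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


cache = {}  # holds both ID1:ID2 pairs and ID2:R1 pairs (heterogeneous objects)

R0 = [
    {"year": 2007, "group": "Dairy", "amount": 20},
    {"year": 2006, "group": "Bread", "amount": 5},
]


def p1(s1):
    """First procedure: extract the records of R0 matching selection S1."""
    return [row for row in R0 if all(row[k] == v for k, v in s1.items())]


def cached_p1(s1):
    id1 = f(s1)
    if id1 in cache and cache[id1] in cache:
        return cache[cache[id1]]    # follow ID1 -> ID2 -> R1
    r1 = p1(s1)
    id2 = f(r1)                     # ID2 = f(R1)
    cache[id1] = id2                # associated pair ID1:ID2
    cache.setdefault(id2, r1)       # associated pair ID2:R1, stored only once
    return r1


a = cached_p1({"year": 2007})       # selects the single Dairy record
b = cached_p1({"group": "Dairy"})   # different selection, identical result
```

After both calls the cache holds two small ID1:ID2 pairs but only one copy of the shared intermediate result.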

The calculation of the first result identifier value ID2 also enables a further embodiment, shown in Fig. 4, in which the intermediate result R1 is represented by the first result identifier value ID2 in the calculation of the second selection identifier value ID3, i.e. ID3 = f(ID2, S2).

This embodiment results in a reduced need to keep the intermediate result R1 in the data structure 12, since the end result R2 can be retrieved from the data structure 12 based on ID3, which is generated based on ID2 and not on the intermediate result R1. This enables efficient calculation of the end result R2 even if the intermediate result R1 has been purged from the data structure 12. For example, the process may be designed to use the data structure 12 whenever possible to locate the end result R2 based on the first selection object S1 and the second selection object S2. Thus, when the process detects a need to calculate the end result R2 based on S1 and S2, it can generate ID1 = f(S1) and access the data structure 12 based on ID1 to retrieve the ID2 associated therewith, if an identical first selection object S1 has been used previously with the first procedure P1. The process can then generate ID3 = f(ID2, S2) and access the data structure 12 based on ID3 to retrieve the end result R2 associated therewith, if the second procedure P2 has previously operated on an identical intermediate result R1 and an identical second selection object S2. In this example, the end result R2 can thus be retrieved from the data structure 12 even if the intermediate result R1 has been deleted.
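The following sketch illustrates this behaviour: the end result is still served from the cache after the intermediate result has been purged. It is an illustration only; `f`, `p1`, `p2`, `calls`, the sample data and the manual purge are assumptions, and the fallback that recalculates a purged R1 when R2 is also missing is one possible design rather than something the description specifies.

```python
import hashlib
import json


def f(*params):
    # Hypothetical identifier function: SHA-256 over a canonical JSON dump.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


cache = {}
calls = {"P1": 0, "P2": 0}
R0 = [
    {"year": 2007, "group": "Dairy", "amount": 20},
    {"year": 2007, "group": "Bread", "amount": 5},
    {"year": 2007, "group": "Dairy", "amount": 7},
]


def p1(s1):
    calls["P1"] += 1
    return [row for row in R0 if all(row[k] == v for k, v in s1.items())]


def p2(s2, r1):
    calls["P2"] += 1
    return sum(row[s2["sum_of"]] for row in r1)


def extract(s1, s2):
    r1 = None
    id1 = f(s1)                      # ID1 = f(S1)
    id2 = cache.get(id1)             # try to retrieve ID2 via ID1
    if id2 is None:
        r1 = p1(s1)
        id2 = f(r1)                  # ID2 = f(R1)
        cache[id1] = id2
        cache[id2] = r1
    id3 = f(id2, s2)                 # ID3 = f(ID2, S2): R1 itself not needed
    if id3 in cache:
        return cache[id3]
    if r1 is None:
        r1 = cache.get(id2)
        if r1 is None:               # assumed fallback: R1 was purged
            r1 = cache[id2] = p1(s1)
    r2 = cache[id3] = p2(s2, r1)
    return r2


s1, s2 = {"group": "Dairy"}, {"sum_of": "amount"}
first = extract(s1, s2)
del cache[cache[f(s1)]]              # purge the (possibly large) R1
second = extract(s1, s2)             # still answered from the cache
```

Neither procedure runs during the second call, even though R1 is no longer cached.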

In one embodiment, shown in Fig. 5, the first procedure P1 is further designed to calculate a second result identifier value ID4 as a function of the end result R2. The first procedure P1 is also designed to store ID3 and ID4 as associated objects in the data structure 12 and to store ID4 and the end result R2 as associated objects in the data structure 12.

This embodiment makes it possible to reduce the size of the computer memory required by the process, since each end result R2 is stored only once in the data structure 12, even if two or more second selection objects S2 yield identical end results R2. This embodiment is particularly relevant when the end results R2 are large.

Fig. 6 is a flow chart of an exemplary implementation of the embodiment of Fig. 5. The process starts with input of the data set R0 (step 600), the first selection object S1 (step 602) and the second selection object S2 (step 604).

Then, a value of the first selection identifier ID1 is generated as a function of S1 and R0 (step 606). A lookup is made in the data structure based on ID1 (step 608). If the value of ID1 is found in the data structure, i.e. if it has been cached in a previous iteration, the process retrieves the value of the first result identifier ID2 associated therewith (step 610) and proceeds to step 612.

If the value of ID1 is not found in the data structure in step 608, the process causes the first procedure P1 to calculate R1 by operating S1 on R0 (step 614). Then, the value of ID2 is generated as a function of R1 (step 616), and the values of ID1, ID2 and R1 are stored in the data structure in associated pairs ID1:ID2 and ID2:R1 (step 618). The process then proceeds to step 612.

In step 612, the second selection identifier ID3 is generated as a function of S2 and ID2. Then, a lookup is made in the data structure based on ID3 (step 620). If the value of ID3 is found in the data structure, i.e. if it has been cached in a previous iteration, the process retrieves the value of the second result identifier ID4 associated therewith (step 622). A further lookup is made in the data structure based on ID4 (step 624). If the value of ID4 is found in the data structure, i.e. if it has been cached in a previous iteration, the process retrieves the end result R2 associated therewith (step 626).

If the value of ID3 is not found in the data structure in step 620, a further lookup is made in the data structure based on the value of ID2 determined in step 612 or step 616 (step 628). If the value of ID2 is found in the data structure, i.e. if it has been cached in a previous iteration, the process retrieves the intermediate result R1 associated therewith (step 630). The process then causes the second procedure P2 to calculate R2 by operating S2 on R1 (step 632). To update the data structure, the process also generates the value of ID4 as a function of R2 (step 634) and stores the values of ID3, ID4 and R2 in the data structure in associated pairs ID3:ID4 and ID4:R2 (step 636).

If the value of ID2 is not found in the data structure in step 628, the process causes the first procedure P1 to calculate R1 by operating S1 on R0 (step 638), and stores the values of ID2 and R1 in the data structure in an associated pair ID2:R1 (step 640). The process then proceeds to step 632. It should be realized, however, that steps 628, 630, 638 and 640 need not be performed if the intermediate result R1 was already calculated in step 614. In that case, if ID3 is not found in step 620, the process can proceed directly to step 632, in which the second procedure P2 is caused to calculate R2 by operating S2 on R1.

If the value of ID4 is not found in the data structure in step 622, the process causes the second procedure P2 to calculate R2 by operating S2 on R1 (step 642). To update the data structure, the process also generates the value of ID4 as a function of R2 (step 644) and stores the values of ID4 and R2 in the data structure in an associated pair ID4:R2 (step 646).
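The flow of Fig. 6 can be condensed into a single function, sketched below with the step numbers as comments. This is a hedged illustration: the function names `f`, `extract`, `p1` and `p2`, the sample data and the dict-based cache are assumptions, and the helper `get_r1` also covers the case where R1 must be obtained before step 642, which the description does not spell out.

```python
import hashlib
import json


def f(*params):
    # Hypothetical identifier function: SHA-256 over a canonical JSON dump.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def extract(r0, s1, s2, p1, p2, cache):
    """Return R2 for (R0, S1, S2), consulting the cache before P1 or P2."""
    r1 = None
    id1 = f(s1, r0)                  # step 606
    id2 = cache.get(id1)             # steps 608, 610
    if id2 is None:
        r1 = p1(s1, r0)              # step 614
        id2 = f(r1)                  # step 616
        cache[id1] = id2             # step 618: pair ID1:ID2
        cache[id2] = r1              # step 618: pair ID2:R1
    id3 = f(s2, id2)                 # step 612

    def get_r1():
        nonlocal r1
        if r1 is None:
            r1 = cache.get(id2)      # steps 628, 630
        if r1 is None:
            r1 = p1(s1, r0)          # step 638
            cache[id2] = r1          # step 640
        return r1

    id4 = cache.get(id3)             # steps 620, 622
    if id4 is None:
        r2 = p2(s2, get_r1())        # step 632
        id4 = f(r2)                  # step 634
        cache[id3] = id4             # step 636: pair ID3:ID4
        cache[id4] = r2              # step 636: pair ID4:R2
        return r2
    r2 = cache.get(id4)              # steps 624, 626
    if r2 is None:
        r2 = p2(s2, get_r1())        # step 642
        id4 = f(r2)                  # step 644
        cache[id4] = r2              # step 646: pair ID4:R2
    return r2


# Hypothetical data set and procedures used to exercise the flow.
R0 = [
    {"year": 2006, "group": "Dairy", "amount": 10},
    {"year": 2007, "group": "Dairy", "amount": 20},
    {"year": 2007, "group": "Dairy", "amount": 7},
]


def p1(s1, r0):
    return [row for row in r0 if all(row[k] == v for k, v in s1.items())]


def p2(s2, r1):
    return sum(row[s2["sum_of"]] for row in r1)


cache = {}
S1, S2 = {"year": 2007}, {"sum_of": "amount"}
result = extract(R0, S1, S2, p1, p2, cache)
```

After the first call the cache holds the four associated pairs ID1:ID2, ID2:R1, ID3:ID4 and ID4:R2; a later call with the same inputs follows the lookup branches only.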

The skilled person readily realizes that the embodiments of Figs. 2-4 result in corresponding processes for storing and retrieving, albeit using different combinations of identifiers. To keep the presentation brief, these processes are not shown in flow charts, but are given only as exemplary embodiments in the Summary of the Invention section.

So far, the database DB, and thus the data set R0, has been assumed to be static. If the database is dynamic, it may be appropriate to generate the first selection identifier ID1 as a function of the first selection object S1 and the data set R0, i.e. ID1 = f(S1, R0). With such a modification, all embodiments described with reference to Figs. 1-6 are equally applicable to a dynamic database, i.e. a database that may change at any time.

It should be realized that any data structure 12, linear or non-linear, can be used to store identifiers and results. With respect to processing speed, however, it may be preferable to use a data structure 12 with an efficient indexing system, such as a sorted list, a hash table or a binary tree, such as an AVL tree.
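As one illustration of an indexed data structure, the sketch below keeps identifiers in a sorted list and finds them by binary search. The class name `SortedIndex` and its methods are hypothetical; a hash table (such as a Python dict) or a balanced tree would serve the same purpose.

```python
import bisect


class SortedIndex:
    """Sorted list of identifiers with their associated objects."""

    def __init__(self):
        self._keys = []      # identifiers, kept in ascending order
        self._values = []    # associated objects, parallel to _keys

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # identifier already present
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key, default=None):
        # Binary search: O(log n) comparisons per lookup.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default


index = SortedIndex()
index.put("c3", [1, 2, 3])       # identifier -> result
index.put("a1", "c3")            # identifier -> identifier
found = index.get(index.get("a1"))
missing = index.get("zz")
```

The index stores identifiers and results side by side, so a chain such as ID1 to ID2 to R1 is resolved with two lookups.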

Specific Embodiments and Examples

In the following, embodiments of the invention are discussed and exemplified in more detail.

Embodiments of the invention make use of previous calculations and results when processing successive requests for new data and new calculations. For this purpose, the extraction process is designed to cache results during the processing of such data requests. When processing a subsequent request, the extraction process determines whether a suitable previous result has already been generated and cached. If so, the previous result is used in processing the subsequent request. Since the previous calculations need not be repeated, the processing time for the subsequent request can be reduced considerably.

In embodiments of the invention, digital identifiers (digital fingerprints) are used to identify the cached information, and in this way a cached result can be reused even when it has been reached in a different way than in the previous calculation.

In embodiments of the invention, the digital identifiers themselves are stored in the cache. More specifically, the identifier of the input to a calculation procedure is stored together with the digital identifier of the output of the calculation procedure. The end result of a multi-step operation can thus be obtained even when the necessary, complex intermediate results have been purged from the cache. Only the digital identifier of the intermediate result(s) is required.

In embodiments of the invention, the cache is implemented as a data structure that can store heterogeneous objects, such as tables, data subsets, vectors and digital identifiers.

Embodiments of the invention may thus serve to minimize, or at least reduce, the response times for a user who explores a data storage using a query that has recently been executed by the same or another user.

Embodiments of the invention may also serve to minimize, or at least reduce, the memory consumption of the cache by reusing the same cache entry for several different queries or calculations, in case two queries or calculations happen to give the same result.

Embodiments of the invention are also applicable to extracting any type of information from any type of known database, such as relational databases, post-relational databases, object-oriented databases, hierarchical databases, etc.

The Internet may also be regarded as a database within the context of the present invention.

Fig. 7 shows a specific embodiment of the invention, which is an extraction process or information search that involves a database query followed by a chart calculation based on the result of the query.

The result of the chart calculation, denoted Chart Result, is typically data that is aggregated, sorted or grouped in one, two or more dimensions, e.g. in the form of a multidimensional cube as discussed in the Background Art section.

In a first step, the Scope of the information search is defined. In the case of a database query, the scope is defined by the tables included in a SELECT statement (or equivalent) and how these are joined. For an Internet search, the scope may be an index of found web pages, usually also organized as one or more tables. The output of the first step is thus a data set (cf. R0 in Figs. 1-6).

In a second step, a user makes a Selection in the data set, causing an Inference Engine to evaluate a number of filters on the data set. The inference engine could, for example, be a database engine, a query tool or a business intelligence tool. For a query on a database that holds data on placed orders, this could for example be a request for the order year "2007" and the product group "Dairy products". The selection may thus be uniquely defined by a list of included fields and, for each field, a list of selected values or, more generally, a condition.

Based on the selection (cf. S1 in Figs. 1-6), the inference engine executes a calculation procedure (cf. P1 in Figs. 1-6) to generate a Data subset (cf. R1 in Figs. 1-6) that represents a part of the scope (cf. R0 in Figs. 1-6).

The data subset may thus contain a set of relevant data records from the scope, or a list of references (e.g. indices, pointers or binary numbers) to these relevant data records. In the above example, the relevant data records would be only those data records that pertain to the year "2007" and the product group "Dairy products".

If the selection has never been made before, the inference engine in Fig. 7 is operated to calculate the data subset. If, however, the calculation has been made before, the inference engine is instead operated to reuse the previous result by accessing a specific data structure: a "cache".

The next step is often to make some further calculations, e.g. aggregation and/or sorting and/or grouping, based on the data subset. In the example of Fig. 7, these subsequent calculations are made by a Chart Engine that calculates the Chart Result based on the data subset and a selected set of Chart Properties (cf. S2 in Figs. 1-6). The chart engine thus executes a chart calculation procedure (cf. P2 in Figs. 1-6) to generate the chart result (cf. R2 in Figs. 1-6). If these calculations have never been made before, the chart engine in Fig. 7 is operated to generate the chart result. If, however, these calculations have been made before, the chart engine is instead operated to reuse the previous result by accessing the above-mentioned cache.

The chart result can then be visualized to a user in pivot tables or graphically in 2D and 3D charts.

Fig. 7 also shows the process of using the cache, where f represents the hashing algorithm that is operated to generate digital identifiers, ID1-ID4 represent the digital identifiers thus generated, and solid line arrows represent the flow of data for generation of the identifiers ID1-ID4.

Furthermore, dashed arrows in Fig. 7 represent cache lookups.

When a user makes a new selection in Fig. 7, the inference engine calculates the data subset. In addition, the identifier ID1 for the selection together with the scope is generated based on the filters in the selection and the scope. Then, the identifier ID2 for the data subset is generated based on the data subset definition, typically a bit sequence that defines the content of the data subset. Finally, ID2 is put in the cache using ID1 as lookup identifier. Likewise, the data subset definition is put in the cache using ID2 as lookup identifier.

In Fig. 7, the chart calculation takes place in a similar way. Here, there are two sets of information: the data subset and the relevant chart properties. The latter are typically, but not exclusively, a mathematical function together with calculation variables and classification variables (dimensions). Both of these sets of information are used to calculate the chart result, and both are also used to generate the identifier ID3 for the input to the chart calculation. ID2 was already generated in the previous step, and ID3 is generated as the first step in the chart calculation procedure.

The identifier ID3 is formed from ID2 and the relevant chart properties.

ID3 can be regarded as an identifier for a specific chart generation instance, which includes all the information needed to calculate a specific chart result. In addition, a chart result identifier ID4 is created from the chart result definition, typically a bit sequence that defines the chart result.

Finally, ID4 is placed in the cache using ID3 as the lookup identifier.

Likewise, the chart result definition is placed in the cache using ID4 as the lookup identifier.

In this specific example, a two-step caching of the result is performed in both the inference procedure and the chart calculation procedure. In the inference procedure, ID1 and ID2 represent different things: the selection and the data subset definition, respectively. If two different selections yield the same data subset, which is quite possible, the two-step caching (ID1:ID2; ID2:data subset) causes the data subset to be cached only once. This is referred to as object folding in the following, i.e. several data objects in the cache share the same cache entry. Similarly, ID3 and ID4 represent different things in the chart calculation procedure: the chart generation instance and the chart result definition, respectively. If two different chart generation instances yield the same chart result, which is quite possible, the two-step caching (ID3:ID4; ID4:chart result) causes the chart result to be cached only once.

In addition, by caching ID3, the chart result can be recreated even if the data subset definition has been purged from the cache. This is a relevant advantage, since the data subset definition can be very large and is thus prone to being purged from the cache if a cache purging mechanism is implemented. A non-limiting example of such a mechanism is described further below.

During the extraction process, identifiers are calculated from the selection, the relevant chart properties, etc., and are used to look up any cached calculation results, as indicated by the dashed arrows in Fig. 7. If the identifier is found, the corresponding cached result is reused. If it is not found, the extraction process generates new identifiers and caches them with the respective results.
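The lookup flow and the two-step caching of the inference procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the toy table `ROWS`, the encoding of a selection as a dictionary, the representation of the data subset as one byte per row, and the use of SHA-256 as a stand-in for the hashing algorithm f are all hypothetical.

```python
import hashlib

# Hypothetical toy table: (customer, product_group)
ROWS = [("Anna", "Dairy"), ("Bo", "Dairy"), ("Cy", "Fruit")]
COLUMNS = {"customer": 0, "product_group": 1}

cache = {}          # maps ID1 -> ID2 and ID2 -> data subset
computations = 0    # counts how often the subset is actually calculated


def f(data: bytes) -> str:
    """Stand-in for the hashing algorithm f (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def infer_subset(selection: dict) -> bytes:
    """Calculate the data subset: one byte per row, 1 if the row matches all filters."""
    global computations
    computations += 1
    return bytes(
        all(row[COLUMNS[field]] in allowed for field, allowed in selection.items())
        for row in ROWS
    )


def get_subset(selection: dict) -> bytes:
    id1 = "ID1:" + f(repr(sorted(selection.items())).encode())
    id2 = cache.get(id1)                 # first lookup: selection -> ID2
    if id2 is not None:
        subset = cache.get(id2)          # second lookup: ID2 -> data subset
        if subset is not None:
            return subset                # cached result is reused
    subset = infer_subset(selection)     # not found: calculate in the normal way
    id2 = "ID2:" + f(subset)
    cache[id1] = id2
    cache.setdefault(id2, subset)        # object folding: subset is stored only once
    return subset


first = get_subset({"product_group": ("Dairy",)})
second = get_subset({"customer": ("Anna", "Bo")})   # different selection, same subset
repeat = get_subset({"product_group": ("Dairy",)})  # served from the cache
```

After these three calls the cache holds two ID1 entries pointing at a single shared ID2 entry, and the subset has been calculated twice rather than three times.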

To further exemplify the extraction process, consider the above-mentioned selection of order year "2007" and product group "Dairy products". The first step is to generate a digital identifier ID1 as a function of this selection, e.g. (written in hexadecimal notation): '31dca7ad013964891df428095ad9b78ad7a69eaaa1ca3886bcf05d8j8184e84a'.

To keep the presentation short, each identifier is represented by its initial 4 characters in the following examples. Thus, ID1 instead becomes '31dc'. Furthermore, for the sake of clarity, the illustrative tables below include identifier labels, e.g. 'ID1:', in front of the digital identifiers. This is not necessary in a real solution.

The subsequent extraction process is as follows: Once ID1 has been generated, it is looked up in the cache. The first time the selection is made, this identifier will not be found in the cache, so the resulting data subset must be calculated in the normal way. Once this is done, ID2 can be generated from the data subset, yielding e.g. 'd2b8'. Then ID1 is cached, pointing to ID2, and ID2 is cached, pointing to the bit sequence that defines the resulting data subset. This bit sequence can be of considerable size. The contents of the cache are shown in Table 1 below.

Table 1:
ID          Cached value
ID1:31dc    ID2:d2b8
ID2:d2b8    <data subset>

The next time the same selection is made, the process will be different: Now ID1 is found in the cache, pointing to 'ID2:d2b8', which in turn is used for a second lookup, whereupon the bit sequence of the resulting data subset is found, retrieved and used instead of a time-consuming calculation.

Now consider the case where a different selection has been made which nevertheless yields the same resulting data subset. For example, a user may select exactly those customers that have bought 'Dairy products' without explicitly requesting 'Dairy products', and these customers have bought nothing but dairy products. ID1 is now generated as e.g. 'f142' and will not be found in the cache. Thus, the resulting data subset must be calculated in the normal way. Once this is done, ID2 can be generated from the data subset and is found to be 'd2b8', which is already stored in the cache. Thus, the algorithm only needs to add one entry to the cache, the one in which 'ID1:f142' points to 'ID2:d2b8'. The contents of the cache are shown in Table 2 below.

Table 2:
ID          Cached value
ID1:f142    ID2:d2b8
ID1:31dc    ID2:d2b8
ID2:d2b8    <data subset>

No calculation time has been saved this time, but cache entries are reused to prevent the cache from growing unnecessarily. Both 'ID1:f142' and 'ID1:31dc' now point to the cache entry that contains the same resulting data subset, 'ID2:d2b8', and both can be used in later lookups.

This is thus an example of the above-mentioned "object folding".

A further advantage of caching digital identifiers becomes apparent when the subsequent chart calculation is performed. Assume that the above selections have been made and that the subsequent chart calculation has been performed. ID3 and ID4 have been generated as 'e40A' and '7505', respectively, and stored in the cache. The contents of the cache are shown in Table 3 below.

Table 3:
ID          Cached value
ID1:f142    ID2:d2b8
ID1:31dc    ID2:d2b8
ID2:d2b8    <data subset>
ID3:e40A    ID4:7505
ID4:7505    <chart result>

Among the five entries in Table 3, one entry is likely to be considerably larger than all the others: 'ID2:d2b8', which contains the entire bit sequence that defines the potentially large data subset. Its size makes it a candidate for purging when, or if, the cache is maintained, as described further below. After a while, the contents of the cache may thus be as shown in Table 4 below.

Table 4:
ID          Cached value
ID1:f142    ID2:d2b8
ID1:31dc    ID2:d2b8
ID3:e40A    ID4:7505
ID4:7505    <matrix of numbers representing chart result>

However, since the digital identifiers are cached, it is still possible to obtain the chart result without having to recalculate the intermediate data subset. Instead, when the selection is made, ID1 is calculated. Next, a lookup of ID1 is made in the cache, which results in ID2 being retrieved. Then ID3 is generated from the combination of the relevant chart properties and ID2.

A lookup of ID3 is made in the cache, and ID4 is retrieved. Finally, a lookup of ID4 is made in the cache, and the chart result is retrieved. Thus, the chart result is obtained without any heavy calculations, relying instead on digital identifiers that can be generated by fast and processing-efficient operations.
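The retrieval path just described, in which the chart result is obtained although the data subset has been purged, can be sketched as follows. The functions `compute_subset` and `compute_chart`, their return values, the byte encodings of the selection and the chart properties, and the use of SHA-256 in place of f are hypothetical stand-ins chosen only to make the sketch runnable.

```python
import hashlib

cache = {}
calls = {"subset": 0, "chart": 0}   # counts the heavy calculations


def f(*parts: bytes) -> str:
    """Stand-in for the hashing algorithm f (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))  # length prefix keeps parts unambiguous
        h.update(part)
    return h.hexdigest()


def compute_subset(selection: bytes) -> bytes:
    calls["subset"] += 1
    return b"\x01\x01\x00"            # hypothetical bit sequence


def compute_chart(subset: bytes, props: bytes) -> bytes:
    calls["chart"] += 1
    return b"[[2007, 1250.0]]"        # hypothetical chart result


def chart_result(selection: bytes, props: bytes) -> bytes:
    id1 = "ID1:" + f(selection)
    id2 = cache.get(id1)
    subset = None
    if id2 is None:                               # selection not seen before
        subset = compute_subset(selection)
        id2 = "ID2:" + f(subset)
        cache[id1] = id2
        cache.setdefault(id2, subset)
    id3 = "ID3:" + f(id2.encode(), props)         # ID3 from ID2 and chart properties
    id4 = cache.get(id3)
    if id4 is not None and id4 in cache:
        return cache[id4]                         # the subset itself is never needed
    if subset is None:
        subset = cache.get(id2)
    if subset is None:                            # subset was purged: recalculate
        subset = compute_subset(selection)
        cache[id2] = subset
    result = compute_chart(subset, props)
    id4 = "ID4:" + f(result)
    cache[id3] = id4
    cache.setdefault(id4, result)
    return result


sel = b"OrderYear=2007;ProductGroup=Dairy products"
props = b"Sum(Amount) by Customer"
first = chart_result(sel, props)
del cache["ID2:" + f(b"\x01\x01\x00")]   # simulate purging of the large subset entry
again = chart_result(sel, props)         # answered from ID1 -> ID2 -> ID3 -> ID4
```

The second call returns the cached chart result without invoking either of the two heavy functions, mirroring the state shown in Table 4.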

It follows from the above that the digital identifiers should be unique, so that each identifier in the cache has an unambiguous meaning. In one embodiment, the digital identifiers are generated using a hashing algorithm or hash function. Hashing algorithms are transformations that take an input of arbitrary size (the message) and return a string of fixed size, which is called the hash value (message digest). The algorithm typically chops and mixes the input, e.g. by substitution or transposition, to create a digital fingerprint of it. The simplest and oldest hashing algorithms are plain modulo-prime operations. Hashing algorithms are used for a variety of computational purposes, including cryptography. Generally speaking, a hashing algorithm should behave as much as possible like a random function, generating any possible string of the given size with the same "probability", while in reality being deterministic.

There are several well-known and widely used hashing algorithms that can be used to generate the above-mentioned digital identifiers. Different hashing algorithms are optimized for different purposes: some are optimized for efficient and fast calculation of the hash value, while others are designed for high cryptographic security. An algorithm with high cryptographic security is designed to make it difficult, within reasonable time, to calculate a message that matches a given hash value, and to find a second message that generates the same hash value as a first given message. Such hashing algorithms include SHA (Secure Hash Algorithm) and MD5 (Message-Digest algorithm 5).

Computationally efficient hashing algorithms typically exhibit lower cryptographic security. Such hashing algorithms include FNV (Fowler/Noll/Vo) algorithms, which are designed to be fast while generally giving a very low collision rate. An FNV algorithm typically starts with an offset basis, which in principle could be any random string of values, but which by tradition is typically the signature of the inventor in hexadecimal code run through the original FNV-0 algorithm. To generate a 256-bit FNV hash value, the following offset basis is commonly used: '0xdd268dbcaac550362d98c384c4e576ccc8b1536847b6bbb31023b4c8caee0535'.

For each byte in the input to the hashing algorithm, the offset is first multiplied by a large prime number and then combined with the byte from the input by calculating the bitwise symmetric difference (XOR), which forms the hash value for the next loop. Suitable prime numbers can be found in the literature. Any large prime number will work, but some are more collision resistant than others.
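The loop described above can be sketched as follows. To keep the example short it uses the 64-bit FNV-1 variant with the published 64-bit offset basis and prime rather than the 256-bit parameters mentioned in the text; the function name `fnv1_64` is chosen here for illustration.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # published 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # published 64-bit FNV prime
MASK64 = (1 << 64) - 1                   # keeps the hash within 64 bits


def fnv1_64(data: bytes) -> int:
    """FNV-1: multiply by the prime, then XOR in the next input byte."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h = (h * FNV64_PRIME) & MASK64   # multiplication modulo 2**64
        h ^= byte                        # bitwise symmetric difference (XOR)
    return h


digest = fnv1_64(b"OrderYear=2007")      # hypothetical encoding of a selection
```

An empty input returns the offset basis unchanged, which is why the choice of basis matters.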

The digital identifiers can be generated using any hashing algorithm that is reasonably collision resistant. In one embodiment, the identifiers are generated using a fast hashing algorithm with high collision resistance and low cryptographic security.

In one specific embodiment, a 256-bit identifier can be created by concatenating four 64-bit FNV hashes, each of which is generated using its own prime multiplier. By using four shorter hashes and concatenating them, the identifier can be generated faster.
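A minimal sketch of such a concatenated identifier is given below. The four multipliers in `LANE_PRIMES` are primes picked purely for illustration; they are not the multipliers of the described embodiment and have not been vetted for collision resistance. Reusing the published 64-bit offset basis for every lane is likewise an assumption.

```python
MASK64 = (1 << 64) - 1
OFFSET_BASIS = 0xCBF29CE484222325        # published 64-bit FNV offset basis

# Hypothetical per-lane multipliers (all prime, all odd); illustration only.
LANE_PRIMES = (0x100000001B3, 2**61 - 1, 2**64 - 59, 2**32 - 5)


def fnv1_lane(data: bytes, prime: int) -> int:
    """One 64-bit FNV-1 style hash using the given prime multiplier."""
    h = OFFSET_BASIS
    for byte in data:
        h = (h * prime) & MASK64
        h ^= byte
    return h


def make_identifier(data: bytes) -> str:
    """Concatenate four 64-bit hashes into one 256-bit identifier (64 hex digits)."""
    return "".join(f"{fnv1_lane(data, p):016x}" for p in LANE_PRIMES)


id_2007 = make_identifier(b"2007|Dairy")   # hypothetical encoding of a selection
id_2008 = make_identifier(b"2008|Dairy")
```

Each lane works on ordinary 64-bit arithmetic, which is the reason the text gives for the speed-up over a single 256-bit multiplication per byte.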

To further speed up the generation of the identifier, the algorithm can be modified to use not just one byte of the input per loop, but four bytes. This may result in reduced cryptographic security, while the collision resistance remains largely the same.
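One way to read this modification is sketched below: the input is consumed as 32-bit words instead of single bytes. The little-endian word order and the zero padding of the final word are assumptions made for the sketch; the text does not specify how the four bytes are combined.

```python
MASK64 = (1 << 64) - 1
OFFSET_BASIS = 0xCBF29CE484222325   # published 64-bit FNV offset basis
PRIME = 0x100000001B3               # published 64-bit FNV prime


def fnv1_64_words(data: bytes) -> int:
    """FNV-1 style hash that mixes in four input bytes per loop iteration."""
    # Zero-pad to a multiple of 4 bytes (assumption). Note that inputs differing
    # only in trailing zero bytes then collide; a real design would have to
    # handle the tail differently, e.g. by also hashing the length.
    padded = data + b"\x00" * (-len(data) % 4)
    h = OFFSET_BASIS
    for i in range(0, len(padded), 4):
        word = int.from_bytes(padded[i:i + 4], "little")
        h = (h * PRIME) & MASK64
        h ^= word                    # XOR a 32-bit word instead of one byte
    return h
```

The loop runs a quarter as many multiplications as the byte-wise version for the same input length.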

Identifiers with a length of at least 256 bits can give favorable collision resistance. A 256-bit hash value means that there are approximately 1E+77 possible identifier values. This number can be compared with the number of atoms in the universe, which has been estimated at 1E+80. This means that the risk of collisions, i.e. the risk that two different selections/data subsets/chart properties/chart results yield the same identifier, is not just extremely small but negligible. Thus, the risk of collisions can safely be considered acceptably small. This means that even though the hashing algorithm does not generate theoretically unique identifiers, it does generate statistically unique identifiers. It should be appreciated, however, that identifiers of shorter bit length, such as 64 or 128 bits, may be sufficiently statistically unique for a particular application.

As mentioned above, a purging mechanism may be implemented to purge the cache of old or unused entries. One strategy may be to eliminate the entry or entries with the lowest usage in the cache. A more advanced purging mechanism may, however, be implemented to support optimization of both processor usage and memory usage. One embodiment of such an advanced purging mechanism operates on three parameters: Usage, Calculation time and Memory need.

The usage parameter is a numeric value that preferably takes into account both whether an entry has been accessed "recently but not often" and whether it has been accessed "often but not recently". This can be achieved by associating each entry with a usage parameter U, which is increased by, for example, one unit each time the entry is accessed, but whose value decreases exponentially over time. In one implementation, all values of U in the cache are periodically reduced by a fixed amount. Thus, the usage parameter has a half-life, just as in radioactive decay. The value of U will then reflect how often and how recently the entry has been accessed.
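A decaying usage parameter of this kind can be sketched as follows. The class name `UsageCounter`, the explicit `now` timestamps and the lazy application of the decay at access time are assumptions of the sketch. It also interprets the periodic reduction as multiplication by a fixed factor, since that is what yields a half-life.

```python
import math


class UsageCounter:
    """Usage parameter U: +1 per access, exponential decay with a given half-life."""

    def __init__(self, half_life: float):
        self.decay_rate = math.log(2) / half_life   # per time unit
        self.value = 0.0
        self.last_update = 0.0

    def _decay_to(self, now: float) -> None:
        elapsed = now - self.last_update
        self.value *= math.exp(-self.decay_rate * elapsed)
        self.last_update = now

    def touch(self, now: float) -> None:
        """Register one access at time `now`."""
        self._decay_to(now)
        self.value += 1.0

    def read(self, now: float) -> float:
        """Return U as seen at time `now`."""
        self._decay_to(now)
        return self.value


u = UsageCounter(half_life=10.0)
u.touch(now=0.0)
after_one_half_life = u.read(now=10.0)   # about half of the initial value
```

An entry accessed many times long ago and an entry accessed once very recently can thus end up with comparable values of U.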

If the processor time needed to calculate an entry is considerable, the entry should be kept longer in the cache. Conversely, if the processor time needed for the calculation is small, the cost of recalculation is small and the benefit of keeping the entry in the cache is also small. Thus, each entry is associated with a time parameter T that reflects the estimated calculation time.

If the memory space needed to store an entry is considerable, keeping it costs a lot of cache resources, and it should be purged from the cache earlier than an entry that needs less memory space. Conversely, an entry that needs little memory space can be kept longer in the cache. Thus, each entry is associated with a memory parameter M that reflects the estimated memory need.

For each entry in the cache, the values of the U, T and M parameters are evaluated by means of a weight function W given by: W = U * T / M.

A large value of W for an entry indicates that there are good reasons to keep this entry in the cache. Thus, the entries with large W values should be kept in the cache, and those with small W values should be purged.
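As a small worked example of the weight function, the sketch below compares a large data subset entry with a small chart result entry. The parameter values are invented for illustration; only the formula W = U * T / M comes from the text.

```python
def weight(usage: float, calc_time: float, memory: float) -> float:
    """Weight function W = U * T / M; larger values argue for keeping the entry."""
    return usage * calc_time / memory


# Hypothetical entries: equal usage, similar calculation time, very different size.
w_subset = weight(usage=3.0, calc_time=2.0, memory=50_000_000)  # large bit sequence
w_chart = weight(usage=3.0, calc_time=2.5, memory=4_000)        # small result matrix
```

With these numbers the data subset gets a far smaller W than the chart result, so it would be purged first, which matches the transition from Table 3 to Table 4.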

An efficient purging mechanism may involve sorting the cache according to the W values and purging the sorted cache from one end, i.e. the entries with the smallest W values. A suitable, but not necessary, way of maintaining a sorted cache would be to store the identifiers, the results and the U, T, M and W values in an AVL (Adelson-Velsky and Landis) tree, i.e. a self-balancing binary search tree.

The purging mechanism may intermittently purge all entries with a W value below a predetermined threshold value.

Alternatively, the purging mechanism may be controlled by the amount of available memory, or by the ratio of available memory to total memory. Thus, whenever the size of the cache reaches a memory threshold value, the purging mechanism removes entries from the cache based on their respective W values. By setting the memory threshold, it is possible to adapt the cache size to the local hardware conditions, e.g. to trade processing power against memory.
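A memory-controlled purge of this kind can be sketched as follows. The `Entry` fields, the example values and the function name `purge` are hypothetical, and a plain sort is used here in place of the AVL tree mentioned above, which is simpler but less efficient for a large, frequently updated cache.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    key: str
    usage: float       # U
    calc_time: float   # T, estimated calculation time
    memory: int        # M, estimated memory need in bytes

    @property
    def weight(self) -> float:
        return self.usage * self.calc_time / self.memory   # W = U * T / M


def purge(cache: Dict[str, Entry], memory_threshold: int) -> List[str]:
    """Remove entries with the smallest W until the cache fits the threshold."""
    removed = []
    total = sum(e.memory for e in cache.values())
    for entry in sorted(cache.values(), key=lambda e: e.weight):
        if total <= memory_threshold:
            break
        del cache[entry.key]
        total -= entry.memory
        removed.append(entry.key)
    return removed


# Hypothetical cache contents modelled on Table 3.
entries = [
    Entry("ID1:31dc", usage=2.0, calc_time=0.5, memory=64),
    Entry("ID2:d2b8", usage=2.0, calc_time=0.5, memory=1_000_000),
    Entry("ID3:e40A", usage=1.0, calc_time=1.0, memory=64),
    Entry("ID4:7505", usage=1.0, calc_time=1.0, memory=2_048),
]
cache = {e.key: e for e in entries}
removed = purge(cache, memory_threshold=100_000)
```

Raising `memory_threshold` keeps more entries and saves recalculation; lowering it frees memory at the cost of processor time.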

For example, it is possible to compensate for a slower processor in a computer by adding more primary memory to the computer and increasing the memory threshold. As a result, more results will be kept in the cache and the need for processing will be reduced.

Embodiments of the invention also relate to an apparatus for performing any of the algorithms, methods, processes and procedures described above. This apparatus may be specially constructed for the intended purpose, or it may comprise a general-purpose computer that is selectively activated or reconfigured by a computer program stored in the computer.

Fig. 8 is a block diagram of a computer-based environment for implementing any of the embodiments of the invention. A user 1 interacts with a data processing system 2, which includes a processor 3 that executes operating system software as well as one or more application programs that implement the invention. The user enters information into the data processing system 2 using one or more well-known input devices 4, such as a mouse, a keyboard, a touch pad, etc.

Alternatively, the information may be entered, with or without the intervention of the user, via some other type of input device, such as a card reader, an optical reader or another computer system. Visual feedback can be given to the user by displaying characters, graphical symbols, windows, buttons, etc. on a display 5.

The data processing system further comprises the aforementioned memory 10. The software executed by the processor 3 stores operation-related information in the memory 10 and retrieves appropriate information from the memory 10. The memory 10 typically includes a primary memory (such as RAM, cache memory, etc.) and a non-volatile secondary memory (hard disk, flash memory, removable medium). The database may be stored in the memory 10 of the data processing system, or may be accessed on an external storage device via a communication interface 6 of the data processing system 2.

The invention has mainly been described above with reference to a few embodiments. However, a person skilled in the art will readily appreciate that embodiments other than those disclosed above are equally possible within the scope and spirit of the invention, which is defined and limited only by the appended claims.

For example, the present invention is not only applicable to the calculation of multi-dimensional cubes, but may be useful in any situation where information is extracted from a database using a chain of calculations.

Furthermore, the inventive extraction process can be applied to a chain of calculations involving more than two consecutive calculations.

For example, each of two or more intermediate results in a chain of calculations can be cached and later retrieved in the same way as the intermediate result described above.
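Caching every intermediate result in a longer chain can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: each step is keyed by a digital fingerprint of its selection object and the result it operates on, SHA-256 over `repr()` is an assumed fingerprinting choice that is only suitable for simple, deterministically printable data, and the names `fingerprint`, `run_chain`, `select_rows`, `total` and `scale` are hypothetical. Keying the first step on the data set as well corresponds to the dynamic-database variant in the claims.

```python
import hashlib


def fingerprint(*parts):
    """Return a 256-bit hex digest identifying the given parts (assumed scheme)."""
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


def run_chain(data, steps, cache):
    """Evaluate the steps in order, reusing cached intermediate results.

    steps is a list of (selection, function) pairs, where function takes
    (selection, previous_result). Returns (final_result, number_of_steps_executed).
    """
    result = data
    executed = 0
    for selection, func in steps:
        # Identifier value: a function of the selection and the preceding result.
        key = fingerprint(selection, result)
        if key in cache:
            result = cache[key]
        else:
            result = func(selection, result)
            cache[key] = result
            executed += 1
    return result, executed


# Three consecutive calculations: select rows, aggregate, then scale.
def select_rows(selection, rows):
    return [row for row in rows if row[0] == selection]


def total(_selection, rows):
    return sum(row[1] for row in rows)


def scale(selection, value):
    return value * selection


data = [("A", 1), ("B", 2), ("A", 3)]
cache = {}
steps = [("A", select_rows), ("sum", total), (10, scale)]
first = run_chain(data, steps, cache)    # all three steps are executed
second = run_chain(data, steps, cache)   # every step is served from the cache
changed = run_chain(data, [("A", select_rows), ("sum", total), (100, scale)], cache)
```

In the last call only the final selection differs, so the two earlier intermediate results are retrieved from the cache and only the last calculation is executed.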

Furthermore, the inventive extraction process need not cache and later retrieve the final result, but may instead operate only to cache and retrieve intermediate results in a chain of calculations.

It should also be appreciated that the initial step of extracting an initial data set or scope from the database may be omitted, and that the extraction process may instead operate directly on the database.

Claims


A computer-implemented method for extracting information from a database, the method comprising a sequential chain of main calculations, which comprises a first main calculation (P1) operating a first selection object (S1) on a data set (R0) representing the database to produce a first result (R1), and a second main calculation (P2) operating a second selection object (S2) on the first result (R1) to produce a second result (R2), the method further comprising caching the first and second results (R1, R2) by: calculating a first selection identifier value (ID1) as a function of at least the first selection object (S1), and a second selection identifier value (ID3) as a function of at least the second selection object (S2) and the first result (R1); and storing the first selection identifier value (ID1) and the first result (R1), and the second selection identifier value (ID3) and the second result (R2), respectively, as associated objects in a data structure.

The method of claim 1, further comprising the step of using the data structure to find the second result (R2) based on the first selection object (S1) and the second selection object (S2), the step of using comprising the sub-steps of: (a) calculating the first selection identifier value (ID1) as a function of at least the first selection object (S1); (b) searching among the objects in the data structure based on the first selection identifier value (ID1) to locate the first result (R1); (c) if the first result (R1) is found in sub-step (b), calculating the second selection identifier value (ID3) as a function of the first result (R1) and the second selection object (S2), and searching among the objects in the data structure based on the second selection identifier value (ID3) to locate the second result (R2); (d) if the first result (R1) is not found in sub-step (b), executing the first main calculation (P1) to produce the first result (R1), calculating the second selection identifier value (ID3) as a function of the first result (R1) and the second selection object (S2), and searching among the objects in the data structure based on the second selection identifier value (ID3) to locate the second result (R2); and (e) if the second result (R2) is not found in sub-step (c) or (d), executing the second main calculation (P2) to produce the second result (R2).

The method of claim 1, further comprising the step of calculating a first result identifier value (ID2) as a function of the first result (R1), the step of storing further comprising the steps of storing the first selection identifier value (ID1) and the first result identifier value (ID2) as associated objects in the data structure, and storing the first result identifier value (ID2) and the first result (R1) as associated objects in the data structure.

The method of claim 3, further comprising the step of using the data structure to find the second result (R2) based on the first selection object (S1) and the second selection object (S2), the step of using comprising the sub-steps of: (a) calculating the first selection identifier value (ID1) as a function of at least the first selection object (S1); (b) searching among the objects in the data structure based on the first selection identifier value (ID1) to locate the first result identifier value (ID2), and searching among the objects in the data structure based on the first result identifier value (ID2) to locate the first result (R1); (c) if the first result (R1) is found in sub-step (b), calculating the second selection identifier value (ID3) as a function of the first result (R1) and the second selection object (S2), and searching among the objects in the data structure based on the second selection identifier value (ID3) to locate the second result (R2); (d) if the first result identifier value (ID2) or the first result (R1) is not found in sub-step (b), executing the first main calculation (P1) to produce the first result (R1), calculating the second selection identifier value (ID3) as a function of the first result (R1) and the second selection object (S2), and searching among the objects in the data structure based on the second selection identifier value (ID3) to locate the second result (R2); and (e) if the second result (R2) is not found in sub-step (c) or (d), executing the second main calculation (P2) to produce the second result (R2).

The method of claim 3, wherein the first result (R1), in the calculation of the second selection identifier value (ID3), is represented by the first result identifier value (ID2).

The method of claim 5, further comprising the step of using the data structure to obtain the second result (R2) based on the first selection object (S1) and the second selection object (S2), the step of using comprising the sub-steps of: (a) calculating the first selection identifier value (ID1) as a function of at least the first selection object (S1); (b) searching among the objects in the data structure based on the first selection identifier value (ID1) to locate the first result identifier value (ID2); (c) if the first result identifier value (ID2) is found in sub-step (b), calculating the second selection identifier value (ID3) as a function of the first result identifier value (ID2) and the second selection object (S2), and searching among the objects in the data structure based on the second selection identifier value (ID3) to locate the second result (R2); (d) if the first result identifier value (ID2) is not found in sub-step (b), executing the first main calculation (P1) to produce the first result (R1), calculating the first result identifier value (ID2) as a function of the first result (R1), calculating the second selection identifier value (ID3) as a function of the first result identifier value (ID2) and the second selection object (S2), and searching among the objects in the data structure based on the second selection identifier value (ID3) to locate the second result (R2); (e) if the second result (R2) is not found in sub-step (c), searching among the objects in the data structure based on the first result identifier value (ID2) to locate the first result (R1), and executing the second main calculation (P2) to produce the second result (R2); (f) if the first result (R1) is not found in sub-step (e), executing the first main calculation (P1) to produce the first result (R1), and executing the second main calculation (P2) to produce the second result (R2); and (g) if the second result (R2) is not found in sub-step (d), executing the second main calculation (P2) to produce the second result (R2).

The method of claim 1, 3 or 5, further comprising the step of calculating a second result identifier value (ID4) as a function of the second result (R2), the step of storing further comprising the steps of storing the second selection identifier value (ID3) and the second result identifier value (ID4) as associated objects in the data structure, and storing the second result identifier value (ID4) and the second result (R2) as associated objects in the data structure.

A method according to any one of the preceding claims, wherein each of the identifier values is statistically unique.

A method according to any one of the preceding claims, wherein each of the identifier values is a digital fingerprint generated by means of a hash function.

The method of claim 9, wherein the digital fingerprint comprises at least 256 bits.

A method according to any one of the preceding claims, further comprising the step of selectively deleting data records containing associated objects in the data structure, based at least on the size of the data records.

The method of claim 11, wherein the step of selectively deleting is designed to promote deletion of data records containing a first result.

The method of claim 11 or 12, further comprising the step of associating each data record with a weight value, which is calculated as a function of a usage parameter for each data record, a calculation time parameter for each data record and a size parameter for each data record.

A method according to claim 13, wherein the weight value is calculated by evaluating a weight function given by W = U * T / M, where U is the usage parameter, T is the calculation time parameter and M is the size parameter.

A method according to claim 13 or 14, wherein the value of the usage parameter is incremented each time the data record is accessed, while the value is exponentially reduced as a function of time.

A method according to any one of claims 13-15, wherein the step of selectively deleting is based on the weight value of the data records in the data structure.

A method according to any one of claims 11 to 16, wherein the step of selectively deleting is triggered based on a comparison between a current size of the data structure and a threshold value.

A method according to any one of the preceding claims, wherein the database is a dynamic database, and wherein the first selection identifier value (ID1) is calculated as a function of at least the first selection object (S1) and the data set (R0).

A method according to any one of the preceding claims, wherein said information comprises a grouping, sorting or aggregation of data in the database.

A method according to any one of the preceding claims, wherein the first selection object (S1) defines a set of fields in the data set (R0) and a condition for each field, the result (R1) of the first main calculation (P1) being representative of a subset of the data set (R0), wherein the second selection object (S2) defines a mathematical function, one or more calculation variables included in the subset (R1) and one or more classification variables included in the subset (R1), and wherein the result (R2) of the second main calculation (P2) is a multi-dimensional cube data structure which contains the result of operating the mathematical function on the one or more calculation variables for each unique value of each classification variable.

A computer-readable medium on which is stored a computer program which, when executed by a computer, is designed to perform the method according to any one of claims 1-20.

An apparatus for extracting information from a database, the apparatus comprising a means for executing a sequential chain of calculations, comprising a first main calculation (P1) operating a first selection object (S1) on a data set (R0) representing the database to produce a first result (R1), and a second main calculation (P2) operating a second selection object (S2) on the first result (R1) to produce a second result (R2), the apparatus further comprising a means for caching the first and second results (R1, R2) by: calculating a first selection identifier value (ID1) as a function of at least the first selection object (S1), and a second selection identifier value (ID3) as a function of at least the second selection object (S2) and the first result (R1); and storing the first selection identifier value (ID1) and the first result (R1), and the second selection identifier value (ID3) and the second result (R2), respectively, as associated objects in a data structure.