EP1620807A1 - System for querying a database using a statistical model of the database to give approximate answers to the query - Google Patents

System for querying a database using a statistical model of the database to give approximate answers to the query

Info

Publication number
EP1620807A1
Authority
EP
European Patent Office
Prior art keywords
database
query
data
database query
compressed image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP03785583A
Other languages
German (de)
English (en)
Inventor
Michael Haft
Reimar Hofmann
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panoratio Database Images GmbH
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG
Publication of EP1620807A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Definitions

  • the invention relates to a database query system and a method for computer-aided query of a database.
  • CRM systems customer relationship management systems
  • supply chain management systems supply chain management systems
  • OLAP On-Line Analytical Processing
  • a simple query option is provided by the use of database queries which are known per se, for example in the form of a database query language, preferably the Structured Query Language (SQL).
  • SQL Structured Query Language
  • ROLAP relational on-line analytical processing
  • Multidimensional On-Line Analytical Processing is a technology in which many aggregate
  • a multidimensional cube also referred to as a "cube"
  • the required information can either be read directly from the cube or calculated relatively quickly from a few aggregates found there according to MOLAP. Due to the abundance of possible aggregates, MOLAP cubes have a very strong limitation with regard to the number of dimensions that can be taken into account in the MOLAP.
  • the multidimensional cubes can become very large, which is why a very powerful computer as a server computer is required to carry out the database queries. Furthermore, even a very powerful server computer can often not provide sufficient computing power for a large number of requests from several users arriving at the same time.
  • Many OLAP systems offer an open interface: Microsoft, for example, provides the ODBO standard, and the JOLAP interface is defined for the Java environment. In contrast to SQL, interfaces at this level are less strongly standardized.
  • If a database query according to ROLAP or, for example, a simple database query using SQL is used, processing the query can take a long time for a large database with a complex structure. The considerable time until a database query is answered is particularly frustrating for a user if the result shows that the query was not specified meaningfully enough or contained errors, or that no hits were found in the database.
  • a telecommunications company wants to select a suitable set of customers for an advertising campaign from its stored electronic customer database. For this purpose, a database query is sent to the customer database of the telecommunications company, which reads, for example, as follows:
  • the customer database is filtered according to the procedure outlined above for the corresponding customers according to the database query, some depending on the size of the database
  • Assume that the result of the database query is that only 800 customer records match the conditions specified in the query. A dedicated advertising campaign does not make sense for such a small set of customers. The filter criteria of the database query are therefore changed and a new database query is started, which in turn can take from a few minutes to several hours. This procedure is usually continued iteratively until a set of hits of the desired size has been determined.
  • the invention is therefore based on the problem of creating a database query system and a method for computer-aided query of a database, in which the time required for processing database queries is reduced on statistical average.
  • a database query system has at least one first device.
  • a database is stored in the first device, the database holding a large amount of data.
  • at least one second device is provided, in which at least one compressed image of at least part of the contents of the database is stored.
  • a query unit is provided which is coupled to the first device and to the second device and is set up in such a way that it can query the contents of the compressed image and query the contents of the database.
  • the compressed image represents a content-compressed representation of the data stored in the database.
  • a statistical image of the contents of the database, particularly preferably a statistical model of the contents of the database, which is stored in the second device, is preferably used as the compressed image.
  • the query unit according to the invention opens up the possibility that the entire database does not have to be searched for each database query, but rather that the compressed image of the database can be accessed first and the compressed image can first be queried.
  • This first query of the compressed image can lead to an approximate result, which may be sufficient for the respective database query or may provide sufficient information for a possible reformulation of the database query, which is then used to query the database itself.
  • a statistical model is to be understood as any model that represents all statistical relationships or the joint frequency distribution of the data in a database (exactly or approximately), for example a Bayesian (or causal) network, a Markov network or generally a graphical probabilistic model, a "latent variable model", a statistical clustering model or a trained artificial neural network.
  • the statistical model can thus be understood as a complete, exact or approximate, but compressed image of the statistics of the database.
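  • As an illustration of how such a model can answer a counting query approximately, the following minimal sketch assumes a naive Bayesian cluster model with two clusters over two fields. The field names, probabilities, record count and the function `estimate_count` are hypothetical and are not taken from the patent.

```python
from math import prod

# Hypothetical toy model: two clusters over two fields (all values assumed).
PRIOR = [0.6, 0.4]  # a priori weight of each cluster
COND = [            # P(field = value | cluster)
    {"tariff": {"prepaid": 0.9, "contract": 0.1},
     "region": {"north": 0.5, "south": 0.5}},
    {"tariff": {"prepaid": 0.2, "contract": 0.8},
     "region": {"north": 0.25, "south": 0.75}},
]
N_RECORDS = 1_000_000  # number of records in the full database (assumed)


def estimate_count(conditions):
    """Approximate COUNT(*) for an AND of equality conditions.

    The probability of the conditions is summed over the clusters and
    scaled by the number of records; the database itself is not touched.
    """
    probability = sum(
        weight * prod(cluster[field][value] for field, value in conditions.items())
        for weight, cluster in zip(PRIOR, COND)
    )
    return probability * N_RECORDS


print(estimate_count({"tariff": "contract", "region": "south"}))
```

  • Only the small probability tables are read here, which is why such an estimate can be produced far faster than a scan of the full database.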
  • a database query is formed, preferably by a client computer.
  • a compressed image of the database which was previously formed using the database, is queried in accordance with the database query.
  • Using the result of the query of the compressed image, it is checked whether the result is sufficient with regard to the question, i.e. with regard to the database query or other specifiable criteria.
  • this check can also be carried out by the user of the client computer: the result of the query of the compressed image is transmitted to the client computer and presented to the user, who checks whether the result already provides the desired information.
  • a corresponding instruction is transmitted to the query unit.
  • This instruction can consist of a message to the query unit that more specific information is required using the original database query, whereupon the database is queried in accordance with the original database query.
  • a new database query can be formed and optionally sent to the query unit together with the information to directly access the database itself, whereupon the compressed image and / or the database is queried in accordance with the new database query.
  • the result of the query of the compressed image and / or the result of the query of the database is made available for further processing, for example transmitted to the client computer sending the database query.
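  • The two-stage procedure described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the function `answer_query` and its callback parameters are hypothetical names introduced only for illustration.

```python
def answer_query(query, query_image, query_database, is_sufficient):
    """Query the compressed image first; query the database only if needed.

    query_image and query_database are callables that take the query and
    return a result; is_sufficient decides whether the approximate result
    already answers the question. All three are assumed, caller-supplied hooks.
    """
    approximate = query_image(query)
    if is_sufficient(query, approximate):
        return {"result": approximate, "exact": False}
    # The approximate result was not good enough: run the full query.
    return {"result": query_database(query), "exact": True}


# Usage with stand-in callables: the image estimates 800 hits, and the
# caller accepts any estimate, so the database is never queried.
reply = answer_query(
    "SELECT COUNT(*) FROM customers WHERE ...",
    query_image=lambda q: 800.0,
    query_database=lambda q: 812,
    is_sufficient=lambda q, approx: True,
)
print(reply)
```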
  • the compressed image is first queried in accordance with the database query and thus an approximate result is determined very quickly and made available to a user, which may already be sufficient for the particular question in order to answer the database query.
  • the approximate result often contains at least good indications of the meaning and the prospects of success and the scope of an exact result of the database query.
  • the configurations described below relate both to the database query system and to the method for computer-aided query of a database.
  • the database query system can have at least one client computer coupled to the query unit, which is set up in such a way that it can generate one or more database queries.
  • At least some of the data stored in the database is stored in compressed form in the second device.
  • the client computer or computers are usually coupled to the server computer and, moreover, to the database via a telecommunications network, for example a telephone network, generally a wide area network (WAN) or a local area network (LAN). Communication via the communication network is preferably carried out according to the Internet protocols Transmission Control Protocol (TCP) and Internet Protocol (IP).
  • TCP Transmission Control Protocol
  • IP Internet Protocol
  • the query unit can be set up in accordance with the quasi-standard Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC). Communication can also take place via (proprietary) OLAP interfaces (ODBO, JOLAP).
  • ODBC Open Database Connectivity
  • JDBC Java Database Connectivity
  • the database queries are preferably formulated in the Structured Query Language (SQL), in which case the query unit is set up to process the database queries in accordance with SQL.
  • the database can have any number of databases, which can be distributed over several computers, the databases being coupled to the query unit.
  • the database or the databases has or have a plurality of database segments.
  • each database segment is assigned a compressed image, which has been formed via the respective database segment.
  • This embodiment of the invention has the particular advantage that, if a query of the compressed image of a database segment shows that the segment will with high probability yield no hits (or, in an approximate procedure, only very few), a detailed database query (i.e. a full search in the respective database segment) can be skipped for that segment.
  • If the database query is also carried out on the database itself, it is carried out only for those database segments which, with sufficient probability, provide results that meet the query criteria of the database query.
  • Another advantage is that if the compressed image already contains enough information to generate a complete, exact result, a detailed database query (i.e. a full search in the respective database segment) can likewise be skipped for that segment. Overall, only a few additional detailed queries for a few segments then still have to be started.
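  • The segment-wise pruning can be sketched as below, assuming that each database segment has its own compressed image that can estimate the number of hits for a query. The function `segments_to_search`, the threshold `min_expected_hits` and the segment names are hypothetical.

```python
def segments_to_search(query, segment_images, min_expected_hits=0.5):
    """Return the names of the segments that still need a full search.

    segment_images maps a segment name to a callable that estimates the
    number of hits for the query from that segment's compressed image.
    Segments whose estimate falls below the (assumed) threshold are skipped.
    """
    return [
        name
        for name, estimate_hits in segment_images.items()
        if estimate_hits(query) >= min_expected_hits
    ]


# Usage with stand-in estimators for three yearly segments.
images = {
    "2001": lambda q: 0.0,
    "2002": lambda q: 41.7,
    "2003": lambda q: 0.2,
}
print(segments_to_search("SELECT ...", images))
```

  • The threshold expresses how much risk of missing a rare hit is acceptable; a threshold of zero would only skip segments whose image rules out hits entirely.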
  • This embodiment of the invention can also be provided in a corresponding manner for the further development that several databases are contained in the database query system.
  • a compressed image of the respective database is formed for each database.
  • the query unit and the second device can be implemented together in one computer, preferably in a client computer.
  • the use of a compressed image of a database according to the invention makes it possible for the image, which has a significantly smaller amount of data (preferably a few megabytes, compared with a few gigabytes to terabytes for a complete database), to be transferred to the client computer in a simple manner via a conventional communication network.
  • the first query can be made to the compressed image to determine an approximate query result, without the need for a communication link to the actual database. This also enables offline operation of a client computer as long as an approximate result of the database query is sufficient.
  • an additional reduction in the required computing capacity of the server computer is achieved and the bandwidth requirement of the communication network for the transmission of database queries and database query results is further reduced.
  • the second device can be provided in a separate computer that is independent of the client computer and the server computer and can be coupled to it via the communication network. Furthermore, it can be integrated in the server computer, preferably together with the query unit.
  • a decision unit which checks whether the approximate result is sufficient according to a predeterminable quality criterion. In the event that the approximate result is not sufficient, the database query is automatically forwarded to the database management system of the database itself and thus a database query of the complete database is started.
  • the existence of a compressed image is transparent to the user and the user-friendliness is further increased, since the user no longer has to be involved in the decision-making process as to whether the database itself is to be queried or not.
  • information is provided with the database query that indicates whether an exact result of the database query is desired or whether an approximate result is sufficient. If, according to the information additionally given in the database query, a fast but approximate result is accepted, a quality criterion can also be specified up to which degree of statistical reliability the result may be approximate, for example up to which decimal place the approximation may have an impact.
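  • One possible shape for such a decision is sketched below. It assumes that the compressed image can supply an estimate of its own relative error, which the text does not specify; the function name and parameters are hypothetical.

```python
def needs_full_query(exact_requested, estimated_rel_error, max_rel_error=0.05):
    """Decide whether the database itself has to be queried.

    exact_requested:      the query asks for an exact result.
    estimated_rel_error:  relative error attributed to the approximate
                          result (assumed to be available from the model).
    max_rel_error:        quality criterion supplied with the query.
    """
    return exact_requested or estimated_rel_error > max_rel_error


# An approximate result with 1 % estimated error is accepted under a
# 5 % criterion, but not under a 0.1 % criterion.
print(needs_full_query(False, 0.01))
print(needs_full_query(False, 0.01, max_rel_error=0.001))
```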
  • the server computer and the client computer can be coupled to one another via any communication network, for example via a fixed network or via a mobile radio network, for the transmission of the respective data and for the transmission of the statistical model.
  • the statistical models can be formed by the server computers, alternatively also by other, possibly specially designed computers which are coupled to the databases.
  • the statistical models formed are transmitted to the respective query unit, which can be arranged in a separate computer, in the server computer or in one or each of the client computers, via the communication network.
  • the statistical models can thus be made available in a very simple manner worldwide in a heterogeneous communication network, for example on the Internet.
  • At least one of the statistical models can be formed by means of a scalable method with which the degree of compression of the statistical model can be set compared to the data elements contained in the respective database.
  • At least one of the statistical models can furthermore be formed by means of an EM learning method or by means of variants thereof or by means of a gradient-based learning method.
  • the so-called APN learning method adaptive probabilistic network learning method
  • all likelihood-based learning methods or Bayesian learning methods can be used, as described for example in [1].
  • the structure of the joint probability models can be specified in the form of a graphical probabilistic model (a Bayesian network, a Markov network or a combination thereof).
  • a special case of this general formalism corresponds to so-called latent variable models or statistical clustering models.
  • any method for learning not only the parameters but also the structure of graphical probabilistic models from available data elements can be used, for example any structure learning method, as described for example in [2] and [3].
  • parts of the data can be saved with the models in various resolutions (e.g. a numerical value roughly represented by just one byte).
  • the statistics of the data recorded by the model are preferably used to present the data in compressed form. The more information is stored in the compressed image, the greater the storage requirement and the more complex the evaluation. It is therefore possible to choose a compromise, starting with a very small, approximate statistical model up to an already very detailed, exact representation of the statistics of the contents of a database.
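  • Storing a numerical value at a coarse resolution of one byte can be sketched as follows. Uniform quantization over a known value range is an assumption made for this example; the text does not say how the one-byte representation is formed.

```python
def quantize(value, lo, hi):
    """Map a value from the range [lo, hi] to a one-byte code (0..255)."""
    value = min(max(value, lo), hi)  # clamp to the known range
    return round((value - lo) / (hi - lo) * 255)


def dequantize(code, lo, hi):
    """Recover an approximate value from its one-byte code."""
    return lo + code / 255 * (hi - lo)


code = quantize(37.2, 0.0, 100.0)
print(code, dequantize(code, 0.0, 100.0))
```

  • The reconstruction error is at most half a quantization step, here (hi - lo) / 510, which illustrates the trade-off between storage requirement and accuracy.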
  • FIG. 1 shows a block diagram of a database query system in accordance with a first exemplary embodiment of the invention
  • Figure 2 is a flowchart showing the individual steps of processing a database query according to a first embodiment of the invention
  • FIG. 3 shows a message flow diagram in which the messages exchanged between a client computer and a server computer according to the first exemplary embodiment of the invention are shown;
  • FIG. 4 is a flowchart showing the individual steps of processing a database query according to a second embodiment of the invention
  • FIG. 5 shows a message flow diagram in which the messages exchanged between a client computer and a server computer according to the second exemplary embodiment of the invention are shown;
  • FIG. 6 shows a database query system according to another exemplary embodiment of the invention.
  • Figure 7 is a block diagram of the database query system according to another embodiment of the invention.
  • FIG. 1 shows a database query system 100 according to a first exemplary embodiment of the invention.
  • the database query system 100 has a client computer 101, a server computer 102 and a database 103.
  • the client computer 101 and the server computer 102 are coupled to one another via a telecommunication network 104, according to an exemplary embodiment of the invention by means of the Internet.
  • the client computer 101 has an input / output interface 105, a processor unit 106 and a memory unit 107.
  • the input / output interface 105, the processor unit 106 and the memory unit 107 are coupled to one another via a computer bus 108.
  • the client computer 101 is coupled to the telecommunication network 104 by means of the input / output interface 105. Furthermore, the client computer 101 is coupled to a screen 110 for displaying data to a user via a first cable 109 or a first radio connection (for example according to Bluetooth). Furthermore, a keyboard 111 is coupled to the input / output interface 105 via a second cable 112 or a second radio connection. Furthermore, a computer mouse 113 is provided, which is coupled to the input / output interface 105 of the client computer 101 via a third cable 114 or by means of a third radio connection.
  • the server computer 102 also has an input / output interface 115, which is coupled to the telecommunications network 104.
  • a processor unit 116, a first storage unit 117, a second storage unit 118 and a database interface 119 are provided in the server computer 102, which are coupled to one another and to the input / output interface 115 by means of a computer bus 120.
  • the programs which are executed by the processor unit 116 are stored in the first memory unit 117.
  • the second storage unit 118 which serves as the second device according to the invention, contains a statistical model 121, explained in more detail below, of the data stored in the database 103.
  • the query unit is implemented in the form of a computer program which is stored in the first memory unit 117 and is carried out by the processor unit 116.
  • the server computer 102 is coupled to the database 103 via a database connection 122 by means of the database interface 119.
  • a database management system (DBMS) (not shown), which can be implemented in the database 103 or in the server computer 102, is provided for managing the database 103, in particular for controlling queries and entries of data from or into the database 103.
  • DBMS database management system
  • the server computer 102 and the client computer 101 are set up for communication in accordance with the Internet communication protocols Transmission Control Protocol (TCP) and Internet Protocol (IP).
  • the server computer 102, the database 103 and the client computer 101 are set up for communication in accordance with the ODBC standard and, for the formulation of the database queries themselves, in accordance with the Structured Query Language standard (SQL standard).
  • SQL standard Structured Query Language standard
  • In a first step, the server computer 102 forms a statistical model 121 of the data stored in the database 103.
  • the statistical model 121 is formed in accordance with this exemplary embodiment of the invention using the EM learning method known per se. Other alternative methods for forming the statistical model 121, which are preferably used, are described in detail below.
  • the statistical model 121 is automatically formed again at regular, predefinable time intervals, in each case based on the most current data which are stored in the database 103.
  • the statistical model 121 is stored in the second storage unit 118 (step 202).
  • an SQL query is entered into the client computer 101 (step 203) and transmitted from the client computer 101 to the server computer 102.
  • a browser computer program can be installed in the client computer 101, which interacts with a web server program installed on the server side.
  • the user is shown an HTML page on the screen 110 of the client computer 101 with a prompt for entering database search criteria, which the user would like to use to query the database 103.
  • the user has the option of formulating the query directly in the database query language to be used in each case, or he can formulate a database query in normal language and / or using keywords, in which case the database query is converted into an SQL database query by a conversion program provided for this purpose.
  • the SQL query is embedded in an SQL database query message 301 in accordance with the communication protocol used in each case (compare message flow diagram 300 in FIG. 3), and the SQL database query message 301 is transmitted from the client computer 101 to the server computer 102.
  • the server computer 102 queries the statistical model 121 according to the SQL database query 302, i.e. it searches the statistical model 121 using the SQL database query 302. After a result for the SQL database query 302 has been determined for the statistical model 121, which represents an approximate result with regard to the overall content of the database 103, the approximate result is passed to the server computer 102 as an SQL response 303.
  • the query of the statistical model 121 according to the SQL database query 302 is thus completed (step 204).
  • the server computer 102 uses the SQL response 303 to check whether hits are to be expected at all with regard to the SQL database query 302 when the database 103 is “fully queried” (step 205).
  • a hit is to be understood as a result of a database query in which at least one data element of the database 103 is ascertained which meets the query criteria specified in the SQL database query 302.
  • If no hits are to be expected, the server computer 102 sends a corresponding result message to the client computer 101 (not shown in FIG. 3), in which it is stated that, based on the query of the statistical model 121, no hits are to be expected when the entire database 103 is queried (step 206). However, if it is determined in step 205 that hits are to be expected with sufficient probability when the entire database 103 is queried (check step 207), the approximate result, for example an indication of the number of likely hits in the database 103, is communicated to the client computer 101 in another result message (step 208).
  • the result of the complete search is transferred to the server computer 102 as an exact SQL query result 304, with which the query of the database 103 according to the SQL database query 302 is completed (step 209).
  • the server computer 102 forms an SQL result message 305, which contains the approximate and / or the exact result.
  • the SQL result message 305 is transmitted from the server computer 102 to the client computer 101 (step 210).
  • the method is ended in a last method step (step 211).
  • FIGS. 4 and 5 show the individual method steps (flow diagram 400 in FIG. 4) and the message flow (message flow diagram 500 in FIG. 5) for the execution of a database query according to a second exemplary embodiment of the invention. This method is carried out by a database query system that is structurally the same as the one shown in FIG. 1.
  • Steps 201, 202, 203 and 204 are identical to the procedure according to the first exemplary embodiment.
  • an SQL response message 501 is automatically generated, which contains the approximate query result of the SQL database query 302 and is transmitted to the client computer 101 (step 401).
  • After receiving the first SQL response message 501, the client computer 101 forms, according to the information provided by the user of the client computer 101, a second SQL database query message 502 which contains a second SQL database query 503.
  • the second SQL database query 503 can be identical to the first SQL database query 302 or modified, preferably made more specific, in relation to the first SQL database query 302 (step 402).
  • the second SQL database query message 502 is transmitted from the client computer 101 to the server computer 102, where the second SQL database query 503 is passed to the database 103, and a full search of the entire database 103 is performed on the basis of the second SQL database query 503 contained in the second SQL database query message 502 (step 403).
  • the result of the complete database query is passed to the server computer 102 as an exact SQL result 504, whereupon the server computer 102 forms an SQL response message 505 containing the exact SQL result 504 and transmits it to the client computer 101 (step 404).
  • the statistical model 121 can be implemented and stored in a separate computer 601, the computer 601 having an input / output interface 602, by means of which the computer 601 is coupled to the communication network 104.
  • the computer 601 also has a processor unit 603, a first memory unit 604 for storing the programs that are executed by the processor unit 603, and a second memory unit 605 in which the statistical model 121 is stored.
  • the remaining elements of the database query system 600 are identical to those of the database query system 100 according to FIG. 1, which is why no further explanation is given.
  • This exemplary embodiment can be viewed as a distributed database query system 600, in which the client computers 101, the server computers 102 and the computers 601 in which the statistical models 121 are stored are independent computers, which are coupled to one another by means of the communication network 104.
  • FIG. 7 shows a database query system 700 according to a further embodiment of the invention.
  • the statistical model 121 is in each case stored in a second storage unit 701 in the respective client computer 101.
  • the first database queries for determining an approximate result can take place off-line, i.e. without an activated communication link with a server computer 102.
  • the statistical model 121 is usually considerably smaller than the entire database 103 and can therefore easily be transmitted by electronic mail (e-mail) or by means of a corresponding communication protocol, for example the File Transfer Protocol (FTP), without using too much bandwidth for data transmission.
  • FTP File Transfer Protocol
  • Scalable learning methods that generate highly compressed images are desired. At the same time, it should be possible to fuse (merge) the images efficiently, which requires that missing information can be handled very efficiently.
  • Known learning methods are particularly slow when many of the field assignments are missing from the data.
  • X ⁇ X]
  • the states of the variables are identified with small letters.
  • $L_i$ is the number of states of the variable $X_i$.
  • there is a hidden variable or a cluster variable, which is referred to below as $\Omega$, whose states are $\{\omega_i,\ i = 1, \ldots, N\}$. So there are $N$ clusters.
  • $P(\Omega)$ describes an a priori distribution
  • $P(\omega_i)$ is the a priori weight of the i-th cluster
  • $P(X \mid \omega_i)$ describes the structure of the i-th cluster
  • the a priori distribution and the conditional distributions for each cluster parameterize a joint probability model on $X \cup \Omega$ and on $X$, respectively.
  • the model's parameters, i.e. the a priori distribution $P(\Omega)$ and the conditional ones, are aimed at
  • a corresponding EM learning process consists of a series of iteration steps, with an improvement of the model (in the sense of a so-called likelihood) being achieved in each iteration step.
  • new parameters are estimated on the basis of the current or "old" parameters.
  • Each EM step begins with the E-step, in which the "sufficient statistics" are determined in the tables provided for this purpose. It starts with probability tables whose entries are initialized with zero values. In the course of the E-step, the fields of the tables are filled with the so-called sufficient statistics $s(\Omega)$ and $s(X, \Omega)$: for each data point, the missing information (in particular the assignment of each data point to the clusters) is supplemented by expected values. In order to calculate expected values for the cluster variable $\Omega$, the a posteriori distribution $p(\omega_i \mid x)$ of the clusters for the data point $x$ under the current ("old") parameters must be determined. This step is also referred to as the "inference step".
  • When dependency structures other than a naive Bayesian network are assumed, the inference step is similarly complex and often more complex, and it thus accounts for the essential numerical effort of EM learning.
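  • The inference step for a naive Bayesian cluster model can be sketched as follows. This is a minimal sketch under stated assumptions: fields are treated as conditionally independent given the cluster, missing fields are marked with `None` and simply skipped, and only the statistics for the cluster variable are accumulated. The names `posterior`, `e_step`, `prior` and `cond` are hypothetical.

```python
def posterior(record, prior, cond):
    """Return the a posteriori cluster weights for one data point.

    prior[i] is the a priori weight of cluster i, and cond[i][field][value]
    is the conditional probability of that field value in cluster i.
    Fields whose value is None (missing) contribute no factor.
    """
    weights = []
    for weight, cluster in zip(prior, cond):
        for field, value in record.items():
            if value is not None:
                weight *= cluster[field][value]
        weights.append(weight)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("record has zero probability under the model")
    return [w / total for w in weights]


def e_step(records, prior, cond):
    """Accumulate the expected cluster counts over all data points."""
    s_omega = [0.0] * len(prior)  # table initialized with zero values
    for record in records:
        for i, p in enumerate(posterior(record, prior, cond)):
            s_omega[i] += p
    return s_omega


prior = [0.5, 0.5]
cond = [{"a": {"x": 0.8, "y": 0.2}}, {"a": {"x": 0.2, "y": 0.8}}]
print(e_step([{"a": "x"}, {"a": None}], prior, cond))
```

  • A full E-step would also accumulate the joint statistics of field values and clusters in the same loop; that part is omitted here to keep the sketch short.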
  • membership probabilities for certain classes are only calculated up to a value close to 0 in an iterative process, and the classes with membership probabilities below a selectable value are no longer used in the iterative process.
  • a sequence of the factors to be calculated is determined in such a way that the factor that belongs to a rarely occurring state of a variable is processed first.
  • the rarely occurring values can be stored in an ordered list before the formation of the product begins, so that the variables are ranked in the list according to the frequency of their appearance.
  • the clusters which have a weight other than zero can be stored in a list, the data stored in the list being pointers to the corresponding clusters.
  • the method can also be an expectation maximization learning process in which, in the event that a cluster is assigned an a posteriori weight “zero” for a data point, this cluster receives zero weight for this data point in all further steps of the EM method and that this cluster no longer has to be considered in all further steps.
  • the method can run over only those clusters that have a non-zero weight.
  • the a posteriori weight belonging to the cluster is then set to zero.
  • it can first be checked whether at least one of the factors in the product is zero. All multiplications for the formation of the overall product are only carried out if all factors are different from zero.
  • a logarithmic representation of the tables is preferably used in order, for example, to avoid underflow problems.
  • This function can be used to replace zero elements with a positive value, for example. This means that complex processing or separations of values that are almost zero and differ from one another by a very small distance are no longer necessary.
  • non-zero clusters in a list, an array or a similar data structure which allows only the non-zero elements to be stored.
  • clusters which are given an a posteriori weight of zero by multiplication by zero are excluded from all further calculations in order to save numerical effort. In this example, intermediate results regarding the cluster affiliations of individual data points (which clusters are already excluded or still permissible) are also stored, from one EM step to the next, in additionally necessary data structures.
  • a list or a similar data structure can first be saved that contains references to the relevant clusters that have been given a weight that is different from zero for this data point.
  • a statistical model contains variables that describe what rating a cinema-goer has given to a film.
  • There is a variable for each film with each variable being assigned a plurality of states, each state representing an evaluation value.
  • with the new variant of the EM learning process it is now possible to carry out the EM learning process only with the films known up to then until the new film appears, i.e. to ignore the new film (generally, the new node in the directed graph) initially. Only when the new film is released is a new variable (a new node) dynamically added to the statistical model, and the ratings of the new film are taken into account. The convergence of the process in terms of log likelihood is still guaranteed; the process converges even faster.
  • H is a hidden node.
  • O = {O_1, O_2, ..., O_M} denotes a set of M observable nodes in the directed graph of the statistical model.
  • the statistical model estimates are accumulated according to the following rules:
  • the parameters for all nodes are updated according to the following rules:
  • the expected values for the non-existent nodes Y are calculated and updated according to the sufficient statistics values for these nodes in accordance with regulation (7).
  • the composite distribution corresponds essentially to these random numbers in the first step.
  • the initial random numbers are taken into account in the sufficient statistics values according to the ratio of the missing information to the available information.
  • the initial random numbers in each table are only "deleted" according to the ratio of the missing information to the available information.
  • Regulation (7) is not necessary and can therefore be omitted or skipped.
  • N H [P, B] ⁇ ⁇ ß (h
  • xi) (14) i l h
  • H [P, P] - H [P, B] represents the non-negative cross entropy between p (h
  • the current statistical model is designated P^(t).
  • a new statistical model P^(t+1) is constructed such that:
  • the first line applies generally to all B (see regulation (15)).
  • the second line of regulation (18) applies in particular in the event that:
  • the third line applies due to regulation (16).
  • the last line of regulation (18) again corresponds to regulation (15).
  • $R^{\mathrm{Standard}}[P, B] = \sum_{i} \sum_{h, y_i} B(y_i, h \mid x_i) \log p(x_i, y_i, h)$
  • ⁇ p (t) ] p (t) ( h
  • the unobserved nodes X_i are divided into two subsets H_i and Y_i in such a way that none of the nodes in the sets X_i and H_i is a dependent, i.e. subsequent, node (“child” node) of a node in the set Y_i.
  • Y_i corresponds to a branch in a Bayesian network for which there is no information in the data.
  • the invention can clearly be seen in the fact that broad and simple (but generally approximate) access to the statistics of a database (preferably via the Internet) is made possible by creating statistical models for the content of the database.
  • parts of the data can be stored with the models in a compressed form in order to obtain more precise access to details of the statistics of the contents of the database.
  • the statistical models are automatically sent via a communication network for “remote diagnosis”, so-called “remote assistance” or “remote research”.
  • “knowledge” is communicated and sent in the form of a statistical model. Knowledge is often knowledge about the relationships and interdependencies in a domain, for example about the dependencies in a process.
  • a statistical model of a domain, which is formed from the data in the database, reflects all of these relationships.
  • the models represent a joint probability distribution of the dimensions of the database, so they are not restricted to a specific task but represent any dependencies between the dimensions. Compressed to the statistical model, knowledge of a domain can very easily be handled, sent, made available to any user, etc.
  • the resolution of the image or the statistical model can be selected according to the requirements of data protection or the needs of the partners.
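
The EM-related items above (zero-initialized sufficient statistics, the zero-factor check, logarithmic tables, and per-data-point lists of clusters with non-zero weight) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: it assumes a naive Bayesian mixture over discrete variables, and the function name `em_sparse`, the threshold parameter `min_weight`, the random initialization and the renormalization after pruning are all hypothetical choices made for this example.

```python
import math
import random


def em_sparse(data, n_states, n_clusters, n_iter=20, min_weight=1e-6, seed=0):
    """Fit a naive Bayes mixture by EM, pruning clusters per data point.

    data[d][k] is the state (0 .. n_states[k]-1) of variable k in data point d.
    All names and defaults here are assumptions made for illustration.
    """
    rng = random.Random(seed)
    n_vars = len(n_states)
    prior = [1.0 / n_clusters] * n_clusters
    cond = []  # cond[c][k][s] approximates P(X_k = s | cluster c)
    for _ in range(n_clusters):
        tables = []
        for k in range(n_vars):
            row = [rng.random() + 0.1 for _ in range(n_states[k])]
            tables.append([v / sum(row) for v in row])
        cond.append(tables)
    # For each data point, the clusters that still carry non-zero weight.
    active = [list(range(n_clusters)) for _ in data]

    for _ in range(n_iter):
        # E-step: the sufficient statistics start from zero-filled tables.
        s_prior = [0.0] * n_clusters
        s_cond = [[[0.0] * n_states[k] for k in range(n_vars)]
                  for _ in range(n_clusters)]
        for d, x in enumerate(data):
            log_w = {}
            for c in active[d]:
                factors = [prior[c]] + [cond[c][k][x[k]] for k in range(n_vars)]
                if min(factors) > 0.0:  # one zero factor zeroes the product
                    log_w[c] = sum(math.log(f) for f in factors)
            # Normalize in log space to avoid underflow.
            top = max(log_w.values())
            post = {c: math.exp(v - top) for c, v in log_w.items()}
            norm = sum(post.values())
            post = {c: w / norm for c, w in post.items()}
            # Drop clusters below the threshold, but always keep the best one.
            best = max(post, key=post.get)
            kept = {c: w for c, w in post.items()
                    if w >= min_weight or c == best}
            kept_norm = sum(kept.values())
            active[d] = list(kept)
            for c, w in kept.items():
                w /= kept_norm
                s_prior[c] += w
                for k in range(n_vars):
                    s_cond[c][k][x[k]] += w
        # M-step: new parameters from the accumulated sufficient statistics.
        total = sum(s_prior)
        prior = [s / total for s in s_prior]
        for c in range(n_clusters):
            for k in range(n_vars):
                row_total = sum(s_cond[c][k])
                if row_total > 0.0:
                    cond[c][k] = [v / row_total for v in s_cond[c][k]]
    return prior, cond, active
```

A call such as `em_sparse([(0, 0, 0)] * 20 + [(1, 1, 1)] * 20, [2, 2, 2], 3)` returns the a priori weights, the conditional tables and, for each data point, the list of clusters still considered. Keeping that list between iterations is what lets later E-steps skip excluded clusters entirely.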

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A database query is formed and used to query a compressed image of the database to be queried. Depending on the result of querying this compressed image, and if that result is not sufficient, the database itself is queried in accordance with the database query.
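
The two-stage flow of the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the class `CompressedImage` stands in for the compressed image with simple per-column marginals under an independence assumption (the description refers to richer statistical models), the functions `exact_count` and `answer` are hypothetical, and the test for whether a result is "sufficient" is left to a caller-supplied predicate because the abstract does not specify it.

```python
from collections import Counter


class CompressedImage:
    """Hypothetical stand-in for the compressed image: per-column marginals."""

    def __init__(self, rows):
        self.n = len(rows)
        self.marginals = {}
        for row in rows:
            for col, val in row.items():
                self.marginals.setdefault(col, Counter())[val] += 1

    def estimate_count(self, query):
        """Approximate row count for an equality query, assuming independence."""
        estimate = float(self.n)
        for col, val in query.items():
            estimate *= self.marginals[col][val] / self.n
        return estimate


def exact_count(rows, query):
    """Answer the same query against the database itself."""
    return sum(all(row[c] == v for c, v in query.items()) for row in rows)


def answer(query, image, rows, is_sufficient):
    """Query the image first; fall back to the database if needed."""
    approx = image.estimate_count(query)
    if is_sufficient(approx):
        return approx, "image"
    return exact_count(rows, query), "database"
```

With a predicate such as `lambda est: est >= 2`, estimates for frequent combinations are served from the image, while rare combinations, where an approximate count is least reliable, are forwarded to the database.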
EP03785583A 2003-05-07 2003-12-17 Systeme pour interroger une banque de donnees en utilisant un modele statistique de la banque de donnees pour donner des reponses approximatives a la consultation Withdrawn EP1620807A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE10320419A DE10320419A1 (de) 2003-05-07 2003-05-07 Datenbank-Abfragesystem und Verfahren zum rechnergestützten Abfragen einer Datenbank
PCT/DE2003/004175 WO2004100017A1 (fr) 2003-05-07 2003-12-17 Systeme pour interroger une banque de donnees en utilisant un modele statistique de la banque de donnees pour donner des reponses approximatives a la consultation

Publications (1)

Publication Number Publication Date
EP1620807A1 (fr) 2006-02-01

Family

ID=33426695

Family Applications (1)

Application Number Title Priority Date Filing Date
EP03785583A Withdrawn EP1620807A1 (fr) 2003-05-07 2003-12-17 Systeme pour interroger une banque de donnees en utilisant un modele statistique de la banque de donnees pour donner des reponses approximatives a la consultation

Country Status (4)

Country Link
US (1) US20070168329A1 (fr)
EP (1) EP1620807A1 (fr)
DE (1) DE10320419A1 (fr)
WO (1) WO2004100017A1 (fr)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070016545A1 (en) * 2005-07-14 2007-01-18 International Business Machines Corporation Detection of missing content in a searchable repository
US7849025B2 (en) * 2008-01-21 2010-12-07 Microsoft Corporation Modification of relational models
US8930344B2 (en) * 2011-02-04 2015-01-06 Hewlett-Packard Development Company, L.P. Systems and methods for holding a query
US20130117257A1 (en) * 2011-11-03 2013-05-09 Microsoft Corporation Query result estimation
US20130144812A1 (en) * 2011-12-01 2013-06-06 Microsoft Corporation Probabilistic model approximation for statistical relational learning
US10685062B2 (en) * 2012-12-31 2020-06-16 Microsoft Technology Licensing, Llc Relational database management
US10063445B1 (en) * 2014-06-20 2018-08-28 Amazon Technologies, Inc. Detecting misconfiguration during software deployment
WO2019035860A1 (fr) * 2017-08-14 2019-02-21 Sisense Ltd. Système et procédé d'approximation de résultats d'interrogation
US11256985B2 (en) 2017-08-14 2022-02-22 Sisense Ltd. System and method for generating training sets for neural networks
US11216437B2 (en) * 2017-08-14 2022-01-04 Sisense Ltd. System and method for representing query elements in an artificial neural network
CN116089501B (zh) * 2023-02-24 2023-08-22 萨科(深圳)科技有限公司 一种数字化共享平台订单数据统计查询方法

Family Cites Families (16)

Publication number Priority date Publication date Assignee Title
US5574906A (en) * 1994-10-24 1996-11-12 International Business Machines Corporation System and method for reducing storage requirement in backup subsystems utilizing segmented compression and differencing
US5946692A (en) * 1997-05-08 1999-08-31 At & T Corp Compressed representation of a data base that permits AD HOC querying
US6263334B1 (en) * 1998-11-11 2001-07-17 Microsoft Corporation Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
US6549907B1 (en) * 1999-04-22 2003-04-15 Microsoft Corporation Multi-dimensional database and data cube compression for aggregate query support on numeric dimensions
AU5782900A (en) * 1999-06-30 2001-01-31 Stephen Billester Secure, limited-access database system and method
US6842758B1 (en) * 1999-07-30 2005-01-11 Computer Associates Think, Inc. Modular method and system for performing database queries
US6898603B1 (en) * 1999-10-15 2005-05-24 Microsoft Corporation Multi-dimensional data structure caching
US6611834B1 (en) * 2000-01-12 2003-08-26 International Business Machines Corporation Customization of information retrieval through user-supplied code
US20020029207A1 (en) * 2000-02-28 2002-03-07 Hyperroll, Inc. Data aggregation server for managing a multi-dimensional database and database management system having data aggregation server integrated therein
US20020103793A1 (en) * 2000-08-02 2002-08-01 Daphne Koller Method and apparatus for learning probabilistic relational models having attribute and link uncertainty and for performing selectivity estimation using probabilistic relational models
US6795825B2 (en) * 2000-09-12 2004-09-21 Naphtali David Rishe Database querying system and method
EP1395924A2 (fr) * 2001-06-08 2004-03-10 Siemens Aktiengesellschaft Modeles statistiques permettant d'augmenter la performance d'operations dans une banque de donnees
US7113936B1 (en) * 2001-12-06 2006-09-26 Emc Corporation Optimizer improved statistics collection
US7266541B2 (en) * 2002-04-12 2007-09-04 International Business Machines Corporation Adaptive edge processing of application data
US7110997B1 (en) * 2003-05-15 2006-09-19 Oracle International Corporation Enhanced ad-hoc query aggregation
US7089266B2 (en) * 2003-06-02 2006-08-08 The Board Of Trustees Of The Leland Stanford Jr. University Computer systems and methods for the query and visualization of multidimensional databases

Non-Patent Citations (1)

Title
See references of WO2004100017A1 *

Also Published As

Publication number Publication date
US20070168329A1 (en) 2007-07-19
DE10320419A9 (de) 2005-04-14
DE10320419A1 (de) 2004-12-09
WO2004100017A1 (fr) 2004-11-18

Similar Documents

Publication Publication Date Title
DE69938339T2 (de) Ein skalierbares system zum gruppieren von grossen datenbänken
DE60004385T2 (de) Verfahren und systeme um olap hierarchien zusammenfassbar zu machen
DE202017007517U1 (de) Aggregatmerkmale für maschinelles Lernen
EP1783633B1 (fr) Moteur de recherche pour une recherche relative à une position
WO2006066556A2 (fr) Images de base de donnees comprimees relationnelle (permettant une interrogation acceleree de bases de donnees)
DE112007000053T5 (de) System und Verfahren zur intelligenten Informationsgewinnung und -verarbeitung
DE102006047499A1 (de) Datenerweiterbarkeit mit Hilfe externer Datenbanktabellen
CH704497B1 (de) Verfahren zum Benachrichtigen, Speichermedium mit Prozessoranweisungen für ein solches Verfahren.
DE112017006106T5 (de) Erzeugen von, Zugreifen auf und Anzeigen von Abstammungsmetadaten
EP1620807A1 (fr) Systeme pour interroger une banque de donnees en utilisant un modele statistique de la banque de donnees pour donner des reponses approximatives a la consultation
DE60032258T2 (de) Bestimmen ob eine variable numerisch oder nicht numerisch ist
EP1166228B1 (fr) Utilisation de reseaux semantiques fractals pour tous types d'applications de base de donnees
WO2002101581A2 (fr) Modeles statistiques permettant d'augmenter la performance d'operations dans une banque de donnees
WO2004044772A2 (fr) Procede et systeme informatique destines a fournir des informations de base de donnees d'une premiere base de donnees et procede de production assistee par ordinateur d'une image statistique d'une base de donnees
DE112021001743T5 (de) Vektoreinbettungsmodelle für relationale tabellen mit null- oder äquivalenten werten
WO2021104608A1 (fr) Procédé de génération d'une proposition d'ingénierie pour un dispositif ou une installation
EP1264253B1 (fr) Procede et dispositif pour la modelisation d'un systeme
DE112021005210T5 (de) Indexieren von Metadaten zum Verwalten von Informationen
DE102021203300A1 (de) Computerimplementiertes Verfahren für Schlüsselwortsuche in einem Wissensgraphen
EP1324218A1 (fr) Système de categoriser des objets de donnés et procédé de vérifier la consistance des categories designees aux objets d'information
EP2423830A1 (fr) Procédé de recherche dans une multitude d'ensembles de données et machine de recherche
DE10014757B4 (de) Warehousing-Verfahren und verteiltes Computer-Datenbanksystem für das Warehousing
DE102015008607A1 (de) Adaptives Anpassen von Netzwerk-Anforderungen auf Client-Anforderungen in digitalen Netzwerken
DE102009037848A1 (de) Verfahren zum rechnergestützten Verarbeiten von digitalen semantisch annotierten Informationen
EP3076343A1 (fr) Procédé d'affectation d'entrées vocales

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20051107

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: PANORATIO DATABASE IMAGES GMBH

17Q First examination report despatched

Effective date: 20060216

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20080822