WO2004044772A2 - Method and computer arrangement for providing database information of a first database and method for computer-aided formation of a statistical image of a database - Google Patents
- Publication number
- WO2004044772A2 (PCT/EP2003/011655)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- database
- statistical model
- statistical
- computer
- client computer
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
Definitions
- the invention relates to a method and a computer arrangement for providing database information of a first database and a method for computer-aided formation of a statistical image of a database.
- a call center usually records in detail when each call was received, when the respective incoming call was processed by an employee of the call center, to which other employee of the call center the call may have been forwarded, etc.
- in process automation, log files are commonly formed in which data about the individual processes are stored.
- a third area of application can be seen in telecommunications; for example, log data about the data traffic occurring in the switches of a mobile radio network are determined and stored in those switches. Finally, log data about the data traffic, for example about the frequency of access to information provided by a web server computer, are also frequently formed in the web server computer.
- to find the cause of a problem, the manufacturer must access the logged process data, generally the recorded log data of the system.
- a log file containing the log data is currently of considerable size, often on the order of a few dozen GBytes. For this reason, it is difficult to transfer such a log file to the manufacturer of the system, for example using FTP (File Transfer Protocol).
- the database data can be data from (public)
- a known way of providing information from a database via a communication network from a server computer to a client computer is to install diagnostic or statistical tools for analyzing the data contained in the database directly on the server side; these tools can then be used, for example, by means of a web server installed on the server computer and a web browser program installed on a client computer.
- So-called OLAP tools (online analytical processing tools)
- their operation is very complex and expensive. With some OLAP tools, the amount of data to be processed has even grown so large that the OLAP tools fail.
- the invention addresses the problem of efficient access to the content of a database via a communication network while maintaining the confidentiality of the data contained in the database.
- the problem is solved by a method and a computer arrangement for providing database information of a first database and by a method for computer-aided formation of a statistical model of a database with the features according to the independent patent claims.
- the general scenario addressed by the invention is characterized as follows: At a first location A, a large amount of data stored in a database is available. At a second location B, someone wants to use this available data. The user at location B is less interested in individual data sets than in the statistics characterizing the database data.
- a first statistical image is formed for the first database, for example in the form of a joint probability model.
- This image or model represents the statistical relationships of the data elements contained in the first database.
- the first statistical image is stored in a server computer. Furthermore, the first statistical image is transmitted from the server computer to a client computer via a communication network, and the received first statistical image is processed further by the client computer.
- a computer arrangement for computer-aided provision of database information of a first database has a server computer and a client computer, which are coupled to one another by means of a communication network.
- a first statistical image, which is formed for a first database, is stored in the server computer.
- the first statistical image describes the statistical relationships between the data elements contained in the first database.
- the client computer is set up in such a way that it can be used for further processing, for example an analysis, of the first statistical image transmitted from the server computer via the communication network to the client computer.
- Probability models can be defined within the general formalism of the Bayesian networks (synonymously also causal networks or general graphical probabilistic networks).
- the structure is determined by a directed graph.
- the directed graph has nodes and edges connecting the nodes to one another, the nodes describing predeterminable dimensions of the model or of the image in accordance with the values available in the database. Some nodes can also correspond to unobservable quantities (so-called latent variables, as described for example in [1]).
- missing or unobservable quantities are replaced by expected values or expected distributions. In the context of the improved EM learning method according to the invention, only the expected values are determined for the missing variables, the parent nodes of which are observable values from the database.
- a statistical model is preferably used as the statistical image.
- a statistical model should be understood to mean any model that represents all statistical relationships or the joint frequency distribution of the data in a database (exactly or approximately), for example a Bayesian (or causal) network, a Markov network or, generally, a graphical probabilistic model, a "latent variable model", a statistical clustering model or a trained artificial neural network.
- the statistical model can thus be understood as a complete, exact or approximate image of the statistics of the database.
- This procedure according to the invention has the following advantages in particular: compared with the database itself, the statistical model is very small, since it is a compressed image of the statistics of the database (not of the individual entries in the database), comparable to a digital image compressed according to the JPEG standard, which is a compressed but approximate image of the original digital image;
- the compressed statistical models can thus be transmitted very easily, for example by means of electronic mail (e-mail), FTP (File Transfer Protocol) or other communication protocols for data transmission from the server computer to the client computer.
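To illustrate why a statistical image can be far smaller than the database it describes, the following minimal sketch builds the simplest possible model, a full joint frequency table, from a synthetic log. The attribute names, the value ranges and the helper `joint_frequency_table` are assumptions made for this illustration only; the models named in the text (Bayesian networks, cluster models and so on) would typically compress further and only approximate the statistics.

```python
from collections import Counter
import random

def joint_frequency_table(records):
    """Count how often each attribute combination occurs and normalize."""
    counts = Counter(records)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}

# Hypothetical call-center log: (weekday, call_reason, answered_in_time)
random.seed(0)
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri"]
reasons = ["billing", "fault", "order"]
records = [
    (random.choice(weekdays), random.choice(reasons), random.random() < 0.8)
    for _ in range(100_000)
]

# The model has at most 5 * 3 * 2 = 30 entries, however many records exist.
model = joint_frequency_table(records)
print(len(records), "records ->", len(model), "table entries")
```

The table size depends only on the number of attribute combinations, not on the number of logged records, which is what makes it cheap to send by e-mail or FTP.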
- the transmitted statistical model can thus be used on the client side for the subsequent statistical analysis.
- the server computer and the client computer can be coupled to one another for transmission of the statistical model via any communication network, for example via a fixed network or via a mobile radio network.
- the invention is suitable for use in any area in which it is desirable not to transmit the entire data of a large database, but rather to transmit only the smallest possible amount of data while maintaining the greatest possible information content of the transmitted data with respect to the database that the transmitted data describe.
- An advantage of the invention can be seen, in particular, in the fact that it is possible to ensure to a high degree the confidentiality of individual entries in the database, since not all data elements of the database itself are transmitted, but only a statistical representation of the data elements of the database, which enables a statistical analysis of the database on the client side without the concrete, possibly confidential data being available on the client side.
- an operator, for example of a technical system, can provide the statistical content of the database he manages to a user of a client computer in an uncomplicated manner and, as a rule, without violating data protection guidelines, for example by means of a web server installed on the server computer, in which case the statistical models can be accessed by means of a web browser program installed on the client computer.
- the invention can be implemented by means of software, that is to say by means of a computer program, in hardware, that is to say by means of a special electronic circuit, or in any hybrid form, that is to say partly in software and partly in hardware.
- the client computer uses the first statistical model and data elements of a second database stored in the client computer to form an overall statistical model or an overall statistical image, which has at least part of the statistical information contained in the first statistical image and in the second database.
- a second statistical image or a second statistical model is formed for a second database and represents the statistical relationships of the data elements contained in the second database.
- the second statistical image is transmitted via the communication network to the client computer, and using the first statistical image and the second statistical image, the client computer forms an overall statistical image which has at least part of the statistical information contained in the first statistical image and in the second statistical image.
- the statistical models are stored in different server computers and in each case transmitted from there to the client computer via a communication network.
- the statistical models can be formed by the server computer(s), or alternatively by other, possibly specially configured computers, in which case the statistical models formed must still be transferred to the server computer(s), for example via a local network.
- the statistical models can thus be made available in a very simple manner worldwide in a heterogeneous network, for example on the Internet.
- At least one of the statistical models can be formed using a scalable method with which the degree of compression of the statistical model relative to the data elements contained in the respective database can be adjusted.
- At least one of the statistical models can also be formed using an EM learning method or variants thereof (as described, for example, in [2]) or using a gradient-based learning method.
- the so-called APN learning method (adaptive probabilistic network learning method)
- all likelihood-based learning methods or Bayesian learning methods can be used, as described for example in [3].
- the structure of the common probability models can be in the form of a graphical probabilistic model (a Bayesian network, a Markov network or a
- Probabilistic models from available data elements can be used, for example any structure learning method [4] and [5].
- the first database and / or the second database can have data elements which describe at least one technical system.
- the data elements describing the at least one technical system can at least partially represent values measured on the technical system which describe the operating behavior of the technical system.
- a second database with data elements is stored in the client computer.
- the client computer has a unit for forming an overall statistical model using the first statistical model and the data elements of the second database, the overall statistical model having at least part of the statistical information contained in the first statistical model and in the second database.
- a second server computer is provided, in which a second statistical model, which is formed for a second database, is stored, the second statistical model representing the statistical relationships of the data elements contained in the second database.
- the client computer is also coupled to the second server computer by means of the communication network. The client computer has a unit for forming an overall statistical model, the overall statistical model having at least part of the statistical information contained in the first statistical model and in the second statistical model.
- FIG. 1 shows a block diagram of a computer arrangement according to a first exemplary embodiment of the invention
- FIG. 2 shows a block diagram of a computer arrangement according to a second exemplary embodiment of the invention
- FIG. 3 shows a block diagram of a computer arrangement according to a third exemplary embodiment of the invention.
- FIG. 4 shows a block diagram of a computer arrangement according to a fourth exemplary embodiment of the invention.
- Figure 5 is a block diagram of a computer arrangement according to a fifth embodiment of the invention.
- FIG. 1 shows a computer arrangement 100 according to a first exemplary embodiment of the invention.
- the computer arrangement 100 is used in a call center.
- the computer arrangement 100 has a multiplicity of telephone terminals 101 which are connected to a call center computer 103, 104, 105 by means of telephone lines 102.
- in the call center, the telephone calls are answered by employees of the call center, and the processing of the incoming telephone calls, in particular the time of the incoming call, the duration, an indication of the employee who answered the call, an indication of the reason for the call and the type of processing of the call, or any other information, is recorded by the call center computers 103, 104, 105.
- Each call center computer 103, 104, 105 has
- the components of each call center computer 103, 104, 105 are coupled to one another by means of a computer bus 118, 119, 120.
- the call center computers 103, 104, 105 are coupled to a server computer 122 by means of the local network 121.
- the server computer 122 has a first input / output interface 123 to the local network 121, a memory 124, a processor 127 and one
- the server computer 122 serves, according to this embodiment, as a web server computer, as will be explained in more detail below.
- the data recorded by the call center computers 103, 104, 105 are transmitted to the server computer 122 via the local network 121 and stored there in a database 126.
- a statistical model 125 is also stored in the memory 124, which represents the statistical relationships of the data elements contained in the database 126.
- the statistical model 125 is formed using the EM learning method known per se. Other alternative, preferably used methods for forming the statistical model 125 are described in detail below.
- the statistical model 125 is automatically formed again at regular time intervals, based in each case on the most current data from the database 126.
- the statistical model 125 is automatically provided by the server computer 122 for transmission to one or more client computers 132.
- the client computer 132 is coupled to the second input / output interface 128 of the server computer 122 via a second communication connection 131, for example a communication connection which enables communication in accordance with the TCP / IP communication protocol.
- the client computer 132 also has an input / output interface 133, configured for communication in accordance with the TCP / IP communication protocol, and a processor 134 and a memory 135.
- the statistical model 125 transmitted in an electronic message 130 from the server computer 122 to the client computer 132 is stored in the memory 135 of the client computer 132.
- the user of the client computer 132 now carries out any user-specific statistical analysis on the statistical model 125 and thus “indirectly” on the data in the database 126, without the large database 126 having to be transferred to the client computer 132.
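A minimal sketch of such a client-side analysis, assuming the transmitted model is a joint probability table over two hypothetical attributes (`call_reason`, `forwarded`). The table values and the helper names `marginal` and `conditional` are illustrative assumptions, not part of the described arrangement; a Bayesian network or cluster model would answer the same queries by inference instead of table lookups.

```python
def marginal(model, keep):
    """Sum the joint table over all attributes whose index is not in `keep`."""
    out = {}
    for combo, p in model.items():
        key = tuple(combo[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional(model, target, given, given_value):
    """P(target | given = given_value), computed from the joint table."""
    selected = {c: p for c, p in model.items() if c[given] == given_value}
    norm = sum(selected.values())  # assumed > 0, i.e. the condition occurs
    dist = {}
    for combo, p in selected.items():
        dist[combo[target]] = dist.get(combo[target], 0.0) + p / norm
    return dist

# Hypothetical model over (call_reason, forwarded) received from the server
model = {
    ("billing", True): 0.10, ("billing", False): 0.30,
    ("fault", True): 0.30, ("fault", False): 0.30,
}

reason_dist = marginal(model, keep=[0])
forwarded_given_fault = conditional(model, target=1, given=0, given_value="fault")
print(reason_dist)
print(forwarded_given_fault)
```

Both queries run entirely on the client and touch only the model, never the individual call records held by the server.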
- the client-side statistical analysis can aim to optimize the call center.
- analyses are carried out in particular with regard to answering the following questions:
- the analyses to answer the above questions are performed by the user of the client computer 132.
- the operator of the call center is then given suitable measures to optimize the operation of the call center based on the analysis results.
- FIG. 2 shows a computer arrangement 200 according to a second exemplary embodiment of the invention.
- the computer arrangement 200 is used in the field of biotechnology.
- the computer arrangement 200 has a server computer 201 which has a memory 202, a processor 203 and an input / output interface 204 which is set up for communication in accordance with the TCP / IP protocols.
- the components are coupled to one another by means of a computer bus 205.
- a database 206 with genetic sequences or amino acid sequences, together with the additional information associated with the sequences, is stored in the memory 202.
- a statistical model 207 has been formed in the same manner as in the first exemplary embodiment and stored there.
- Each client computer 209, 210, 211 has
- An input / output interface 212, 213, 214 set up for communication in accordance with the TCP / IP protocols
- a memory 218, 219, 220 is provided.
- upon request from a client computer 209, 210, 211, the server computer 201 transmits the statistical model 207 to the client computer 209, 210, 211 in an electronic message 221, 222, 223.
- the user of the client computer 209, 210, 211 compares the sequence to be examined with the statistical model 207.
- the result of a statistical analysis is an indication of how many sufficiently similar sequences exist in the database 206 and what properties these sequences are characterized by.
- FIG. 3 shows a computer arrangement 300 according to a third exemplary embodiment of the invention.
- the computer arrangement 300 has a first computer 301 and a second computer 309.
- the first computer 301 has a memory 302, a processor 303 and an input device configured for communication in accordance with the TCP / IP communication protocols.
- the first computer 301 is a computer of a car dealership; the customer database stored in the memory 302 contains information on the customer's first name and last name, place of residence and type of vehicle used, but not on age, marital status or salary.
- the second computer 309 has an input / output interface 310 set up for communication in accordance with the TCP / IP communication protocols, a memory 311 and a processor 312, which are coupled to one another by means of a computer bus 313.
- the second computer 309 is a computer of a bank cooperating with the dealership.
- the memory 311 of the second computer 309 stores a second customer database 314.
- the second customer database 314 contains information about the customer's first name and last name, place of residence, marital status, age and salary, but not about the vehicle type used by the respective customer.
- the bank is therefore unable to determine from its stored data which families with which wages typically use which cars.
- taken together, the two databases contain, at least approximately, the knowledge needed to establish a connection, for example, between vehicle type and salary.
- a statistical model 306 is formed in the first computer 301 over the customer database using the EM learning method.
- the statistical model 306 compressed with respect to the database is transmitted to the second computer 309, which is bidirectionally coupled to the first computer 301 via the Internet 308, in an electronic message 307.
- this is merged by the second computer 309 with the second customer database 314 to form an overall statistical model 315.
- Partner A has the attributes W, X, Y, which stand symbolically for a variety of arbitrary attributes.
- Partner B has the attributes X, Y, Z.
- Partner B (according to this exemplary embodiment the car dealership) provides partner A (according to this exemplary embodiment the bank) with a statistical model of its data, which is referred to below as $P_B(X, Y, Z)$.
- the aim of partner A is to create a statistical overall model $P(W, X, Y, Z)$ from this model together with the data from his own database.
- Partner A derives a conditional model $P_B(Z \mid X, Y)$
- For each customer, the variable Z is assigned (as an entry in an additional column in the database) the value that is most likely according to the probability distribution $P_B(Z \mid X, Y)$
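The assignment step can be sketched as follows, assuming partner B's model arrives as a joint table over (X, Y, Z) and partner A's records are (W, X, Y) tuples. The attribute values and the function names `conditional_z_given_xy` and `impute_most_likely_z` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

def conditional_z_given_xy(p_b):
    """Turn a joint table P_B(X, Y, Z) into the conditional P_B(Z | X, Y)."""
    norm = defaultdict(float)
    for (x, y, z), p in p_b.items():
        norm[(x, y)] += p
    cond = defaultdict(dict)
    for (x, y, z), p in p_b.items():
        cond[(x, y)][z] = p / norm[(x, y)]
    return cond

def impute_most_likely_z(rows, cond):
    """Append the most probable Z to each (w, x, y) row of partner A.

    Assumes every (x, y) combination in `rows` also occurs in the model.
    """
    return [
        (w, x, y, max(cond[(x, y)], key=cond[(x, y)].get))
        for (w, x, y) in rows
    ]

# Hypothetical model from partner B over (city, area_type, vehicle_type)
p_b = {
    ("Munich", "urban", "coupe"): 0.3, ("Munich", "urban", "van"): 0.1,
    ("Munich", "rural", "coupe"): 0.1, ("Munich", "rural", "van"): 0.5,
}
# Hypothetical records of partner A: (salary_band, city, area_type)
rows = [("high", "Munich", "urban"), ("medium", "Munich", "rural")]

cond = conditional_z_given_xy(p_b)
extended = impute_most_likely_z(rows, cond)
print(extended)
```

Taking only the most likely value discards the uncertainty in Z, which is the motivation for working with expected values instead.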
- partner A can now apply standard statistical analysis methods to all four attributes or form a joint statistical model, the overall model $P(W, X, Y, Z)$, which illustratively represents an image of a virtual shared database.
- the EM learning method is used. In each learning step of the iterative EM learning process, estimates (expected sufficient statistics) are generated, based on the current parameters, for the missing quantities; these estimates take the place of the missing quantities.
- the conditional model $P_B(Z \mid X, Y)$ can also be used to determine expected values or expected sufficient statistics for the variable Z, thus consistently extending this learning process to create a joint model of distributed data.
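A minimal sketch of accumulating expected sufficient statistics, assuming the conditional distribution of Z given X and Y is already available as a nested dictionary. Each record of partner A is spread fractionally over all values of Z instead of being assigned a single value. The names `expected_counts`, `rows` and `cond`, and all values, are assumptions for illustration; a full EM procedure would repeat such a step with updated parameters.

```python
from collections import defaultdict

def expected_counts(rows, cond):
    """Accumulate expected counts for (W, X, Y, Z) from (W, X, Y) records.

    Each record contributes weight P(z | x, y) to every possible z.
    """
    counts = defaultdict(float)
    for (w, x, y) in rows:
        for z, p in cond[(x, y)].items():
            counts[(w, x, y, z)] += p
    return dict(counts)

# Hypothetical conditional model P_B(Z | X, Y) derived from partner B
cond = {
    ("Munich", "urban"): {"coupe": 0.75, "van": 0.25},
    ("Berlin", "urban"): {"coupe": 0.5, "van": 0.5},
}
# Hypothetical records of partner A: (salary_band, city, area_type)
rows = [
    ("high", "Munich", "urban"),
    ("low", "Munich", "urban"),
    ("low", "Berlin", "urban"),
]

counts = expected_counts(rows, cond)
# Normalizing the expected counts gives an estimate of P(W, X, Y, Z).
joint = {k: v / len(rows) for k, v in counts.items()}
print(joint)
```

The expected counts always sum to the number of records, so normalizing them yields a proper probability table.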
- the bank now has all the statistical information available and can carry out corresponding analyzes of the data.
- the bank creates a statistical model via the second customer database and transmits it to the dealership, which in turn forms an overall statistical model.
- the car dealership it would be desirable for the car dealership to know the age of its customers, their marital status and their salary, or at least an estimate of their age, marital status and age
- suitable products can be offered to customers in a much more targeted manner, for example, a young family with an average salary is certainly to be offered a different car than a single with a high salary.
- FIG. 4 shows a computer arrangement 400 according to a fourth exemplary embodiment of the invention.
- n computers 401, 413, 420 are provided, each in 23 computer bus 424 are coupled together.
- a statistical model 425 is also formed via the customer database in the nth computer 420 by means of the EM learning method and is stored in the memory 421 of the nth computer 420.
- the computers 401, 413, 420 are connected to a client computer 409 by means of a respective communication connection 408.
- the client computer 409 has a memory 411, one
- Processor 412 and an input / output interface 410 set up for communication in accordance with the TCP / IP communication protocols, which are coupled to one another by means of a computer bus 426.
- the computers 401, 413, 420 transmit the statistical models 406, 418, 425 to the client computer 409 in respective electronic messages 407, 419, 427; the client computer 409 stores them in its memory 411.
- the exemplary embodiment is explained in more detail below only taking into account the first statistical model 406 and the second statistical model 418.
- any number of statistical models can be combined to form an overall model, for example by repeatedly performing the method steps described below.
- the aim of the fourth exemplary embodiment is to combine a plurality of statistical models with one another to form an overall model.
- partner A also creates a statistical model $P_A(W, X, Y)$, and the models $P_A(W, X, Y)$ and $P_B(X, Y, Z)$ are then combined to form a statistical overall model $P(W, X, Y, Z)$.
- the overall model can be written as $P(W, X, Y, Z) = P_A(W, X, Y) \cdot P_B(Z \mid X, Y)$ or as $P(W, X, Y, Z) = P_B(X, Y, Z) \cdot P_A(W \mid X, Y)$
- Z) e.g. a distribution over or an affinity for vehicle types for a given salary receipt.
- the variables X and Y are marginalized.
- the variable W is used to infer the common variables X and Y based on the model $P_A(W, X, Y)$.
- the model $P_B(Z \mid X, Y)$ (prediction of the variable Z from the variables X and Y) is then used to determine the distribution of the variable Z over all combinations allowed for the variables X and Y.
- the overall model 426 $P(W, X, Y, Z)$ can be handled numerically with ease if the overlap between the statistical models is not too large, preferably fewer than 10 common variables. In the case of a large "overlap space", additional approximations can be used to accelerate the evaluation of the following sums, which according to the above exemplary embodiments have to be formed over all joint states of the common variables X and Y:
- $P(W, Z) = \sum_{X, Y} P_A(W, X, Y) \cdot P_B(Z \mid X, Y)$
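The sum over the common variables can be sketched directly, assuming partner A's model is a joint table over (W, X, Y) and partner B's model has been reduced to a conditional table for Z given (X, Y). The function name `merge_marginal_wz` and all values are illustrative assumptions.

```python
from collections import defaultdict

def merge_marginal_wz(p_a, p_b_cond):
    """P(w, z) = sum over (x, y) of P_A(w, x, y) * P_B(z | x, y)."""
    p_wz = defaultdict(float)
    for (w, x, y), p in p_a.items():
        for z, pz in p_b_cond[(x, y)].items():
            p_wz[(w, z)] += p * pz
    return dict(p_wz)

# Hypothetical model of partner A over (salary_band, city, area_type)
p_a = {
    ("high", "Munich", "urban"): 0.2, ("high", "Munich", "rural"): 0.2,
    ("low", "Munich", "urban"): 0.1, ("low", "Munich", "rural"): 0.5,
}
# Hypothetical conditional model of partner B: vehicle type given (city, area_type)
p_b_cond = {
    ("Munich", "urban"): {"coupe": 0.75, "van": 0.25},
    ("Munich", "rural"): {"coupe": 0.25, "van": 0.75},
}

p_wz = merge_marginal_wz(p_a, p_b_cond)
print(p_wz)
```

The loop visits every joint state of the common variables once, so its cost grows with the size of the overlap space, which is why the text recommends keeping the number of common variables small.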
- H) or the form of the dependency between X, Y and H on the one hand and H and Z on the other hand is chosen so that the above sums are easy to carry out.
- H) are determined in such a way that the approximate overall distribution $P_{\text{approx}}(W, X, Y, Z)$ approximates the desired distribution as closely as possible
- $P(W, X, Y, Z) = P_A(W, X, Y) \cdot P_B(Z \mid X, Y)$
- the log likelihood or the Kullback-Leibler distance can be used as a cost function.
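As a minimal sketch of such a cost function, the Kullback-Leibler distance between a desired distribution and an approximation can be computed as below. Both distributions are assumed to be dictionaries over the same states, and the example values are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two distributions given as dicts over the same states.

    Assumes q[s] > 0 wherever p[s] > 0; states with p[s] == 0 contribute 0.
    """
    return sum(pv * math.log(pv / q[s]) for s, pv in p.items() if pv > 0.0)

# Hypothetical desired distribution and approximation over two states
p_target = {"coupe": 0.5, "van": 0.5}
p_approx = {"coupe": 0.25, "van": 0.75}

cost = kl_divergence(p_target, p_approx)
print(cost)
```

The value is zero only when the two distributions agree, so minimizing it over the parameters of the approximation drives the approximation toward the desired distribution.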
- An EM learning method or a gradient-based learning method are therefore again suitable as optimization methods.
- Finding the optimal parameters may well be computationally expensive. Once the two probability models have been "merged" into an overall model, however, the overall model can be used in a very efficient manner.
- variable H is a hidden variable, i.e. to parameterize the distribution P (W, X, Y, H) as
- instead of a single hidden variable H, several variables can also be introduced.
- a hidden variable K can also be introduced for the model PB to simplify the numerics.
- An approximation of the overall model P (W, X, Y, Z) takes e.g. the shape
- Tre, e procedure can be carried out.
- H) has to be determined by known learning methods.
- FIG. 5 shows a computer arrangement 500 according to a fifth exemplary embodiment of the invention.
- the computer arrangement 500 is used for the exchange of customer information, in accordance with this exemplary embodiment for the exchange of address information for customers.
- the computer arrangement 500 has a server computer 501 and one or more client computers 503 connected to it via a telecommunications network 502.
- the server computer 501 has a memory 504, a processor 505 and an input / output interface 506 set up for communication via the Internet, which components are coupled to one another by means of a computer bus 507.
- the server computer 501 serves as a web server computer, as will be explained in more detail below.
- a large customer database 508 (in particular with address information about the customers and information describing the buying behavior of the customers) is stored in the memory 504. Furthermore, a statistical model 509, which was formed by the server computer 501 via the customer database 508, is also stored in the memory 504 and represents the statistical relationships of the data elements contained in the customer database 508.
- the statistical model 509 is formed using the known EM learning method. Other alternative, preferably used methods for forming the statistical model 509 are described in detail below.
- the statistical model 509 is automatically formed again at regular, predetermined time intervals, based in each case on the most current data from the customer database 508.
- the statistical model 509 is automatically provided by the server computer 501 for transmission to the one or more client computers 503.
- the client computer 503 also has an input/output interface 510 set up for communication in accordance with the TCP/IP communication protocol, as well as a processor 511 and a memory 512.
- the components of the client computer are coupled to one another by means of a computer bus 513.
- the statistical model 509 transmitted in an electronic message 514 from the server computer 501 to the client computer 503 is stored in the memory 512 of the client computer 503.
- the statistical model 509 does not contain the details of the customer database 508, in particular the actual addresses of the customers. However, the statistical model 509 contains statistical information about the behavior, in particular about the purchasing behavior of the customers.
- the user of the client computer 503 now selects a group of customers of interest, i.e. a part 515 of the statistical model 509 that describes a buying behavior relevant to the user's company.
- the client computer 503 transmits the information 515 about the selected part of the statistical model 509 in a second electronic message 516 to the server computer 501.
- the server computer 501 uses the received information to read from the customer database 508 the customers designated by the part 515 of the statistical model 509, together with the associated customer detail information 517 (in particular the customers' addresses), and transmits the customer detail information 517 that was read to the client computer 503 in a third electronic message 518.
- this transmission takes place against payment.
- in this way, a very efficient form of so-called "on-line list broking" is realized.
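The three-message exchange described above (model 509 to the client, selection 515 back to the server, detail information 517 to the client) can be illustrated with a minimal sketch. The class names, the `purchase_rate` field and the segment-based model are hypothetical stand-ins chosen for illustration only; the source does not specify data structures or an API, and payment handling is omitted.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    address: str
    segment: int  # hypothetical: cluster index assigned by the statistical model


class Server:
    def __init__(self, customers, segment_profiles):
        self._customers = customers        # detailed data never leave the server unasked
        self._profiles = segment_profiles  # statistical model: contains no addresses

    def publish_model(self):
        """First message (514): the statistical model only."""
        return dict(self._profiles)

    def fetch_details(self, selected_segments):
        """Third message (518): details for the selected part of the model."""
        chosen = set(selected_segments)
        return [(c.name, c.address) for c in self._customers if c.segment in chosen]


class Client:
    def choose_segments(self, model, min_purchase_rate):
        """Second message (516): the part of the model that is of interest."""
        return [seg for seg, profile in model.items()
                if profile["purchase_rate"] >= min_purchase_rate]


server = Server(
    customers=[Customer("A. Example", "1 Sample St", 0),
               Customer("B. Example", "2 Sample St", 1),
               Customer("C. Example", "3 Sample St", 1)],
    segment_profiles={0: {"weight": 1 / 3, "purchase_rate": 0.1},
                      1: {"weight": 2 / 3, "purchase_rate": 0.7}},
)
model = server.publish_model()
selection = Client().choose_segments(model, min_purchase_rate=0.5)
details = server.fetch_details(selection)
```

The point of the design is that the client explores only aggregate statistics until it commits to a selection; individual addresses are read from the database only for the selected part.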
- the states of the variables are denoted by lowercase letters.
- $L_i$ is the number of states of the variable $X_i$.
- An entry in a data record (of a database) consists of values for all variables, where $x^\pi \equiv (x_1^\pi, x_2^\pi, \ldots)$ denotes the $\pi$-th data record.
- the variable $X_1$ is in the state $x_1^\pi$,
- the variable $X_2$ is in the state $x_2^\pi$, etc.
- $P(\Omega)$ describes an a priori distribution
- $P(\omega_i)$ is the a priori weight of the i-th cluster
- $P(X \mid \omega_i)$ describes the structure of the i-th cluster
- the distributions for each cluster together parameterize a joint probability model on $X \cup \Omega$ or on $X$.
- the aim is to determine the parameters of the model, i.e. the a priori distribution $p(\omega)$ and the conditional probability tables $p(x \mid \omega)$
- a corresponding EM learning process consists of a series of iteration steps, with an improvement of the model (in the sense of a so-called likelihood) being achieved in each iteration step.
- new parameters $p^{\text{new}}$ are estimated based on the current or "old" parameters $p^{\text{old}}$.
- Each EM step begins with the E-step, in which the "sufficient statistics" are determined in the tables provided for this purpose. It starts with probability tables whose entries are initialized with zero values. In the course of the E-step, the fields of the tables are filled with the so-called sufficient statistics $s(\omega)$ and $s(x, \omega)$ by supplementing, for each data point, the missing information (in particular the assignment of each data point to the clusters) with expected values.
- the a posteriori distribution $p^{\text{old}}(\omega_i \mid x^\pi)$ must be determined. This step is also referred to as an "inference step".
- new parameters $p^{\text{new}}(\omega)$ and $p^{\text{new}}(x \mid \omega)$ are calculated for the statistical model.
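The EM iteration outlined above can be sketched for a cluster model over discrete variables as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `em_step` and `log_likelihood`, the nested-list table layout and the toy data are all chosen here for illustration.

```python
import math


def em_step(data, prior, tables):
    """One EM step. prior[c] = p(omega_c); tables[c][v][s] = p(X_v = s | omega_c)."""
    n_clusters = len(prior)
    # E-step: the sufficient-statistics tables start with zero entries.
    ss_prior = [0.0] * n_clusters
    ss_tables = [[[0.0] * len(col) for col in tables[c]] for c in range(n_clusters)]
    for record in data:
        # Inference step: a posteriori weight of every cluster for this record.
        joint = []
        for c in range(n_clusters):
            p = prior[c]
            for v, s in enumerate(record):
                p *= tables[c][v][s]
            joint.append(p)
        total = sum(joint)
        if total == 0.0:
            continue  # record is impossible under the current model; skipped here
        for c in range(n_clusters):
            w = joint[c] / total
            ss_prior[c] += w
            for v, s in enumerate(record):
                ss_tables[c][v][s] += w
    # New parameters: normalise the accumulated sufficient statistics.
    weight_sum = sum(ss_prior)
    new_prior = [s / weight_sum for s in ss_prior]
    new_tables = [
        [[x / ss_prior[c] for x in col] if ss_prior[c] > 0.0 else list(tables[c][v])
         for v, col in enumerate(ss_tables[c])]
        for c in range(n_clusters)
    ]
    return new_prior, new_tables


def log_likelihood(data, prior, tables):
    """Log likelihood of the data under the cluster model."""
    result = 0.0
    for record in data:
        p = 0.0
        for c in range(len(prior)):
            q = prior[c]
            for v, s in enumerate(record):
                q *= tables[c][v][s]
            p += q
        result += math.log(p)
    return result


# Toy data: two binary variables, two clusters, strictly positive start values.
data = [[0, 0]] * 4 + [[1, 1]] * 4 + [[0, 1]]
prior = [0.5, 0.5]
tables = [[[0.6, 0.4], [0.6, 0.4]], [[0.4, 0.6], [0.4, 0.6]]]
history = [log_likelihood(data, prior, tables)]
for _ in range(10):
    prior, tables = em_step(data, prior, tables)
    history.append(log_likelihood(data, prior, tables))
```

The recorded `history` makes the improvement "in the sense of the likelihood" visible: each step should leave the log likelihood unchanged or higher.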
- membership probabilities for certain classes are calculated in an iterative process only down to a value close to 0, and classes with membership probabilities below a selectable value are no longer used in the iterative process.
- a sequence of the factors to be calculated is determined in such a way that the factor that belongs to a rarely occurring state of a variable is processed first.
- the rarely occurring values can be stored in an ordered list before the formation of the product begins, so that the variables are ordered according to how frequently a zero appears for them in the list.
- the clusters which have a non-zero weight can be stored in a list, the data stored in the list being pointers to the corresponding clusters.
- the method can also be an expectation maximization learning process in which, if a cluster is assigned an a posteriori weight of zero for a data point, this cluster receives zero weight for this data point in all further steps of the EM method and no longer has to be considered in those steps.
- the method can run only over clusters that have a non-zero weight.
- the formation of an overall product is carried out. As soon as the first zero occurs among the associated factors, which can be read out, for example, from a memory, an array or a pointer list, the formation of the overall product is terminated.
- the a posteriori weight belonging to the cluster is then set to zero.
- it can first be checked whether at least one of the factors in the product is zero. All multiplications for the formation of the overall product are only carried out if all factors are different from zero.
- a clever order is chosen such that, if a factor in the product is zero, this factor is very likely to appear among the first factors of the product. This means that the formation of the overall product can be stopped very soon.
- the new order can be defined according to the frequency with which the states of the variables appear in the data.
- a factor that belongs to a very rare state of a variable is processed first. The order in which the factors are processed can thus be determined once before the start of the learning process by storing the values of the variables in a correspondingly ordered list.
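The early termination of the product and the "rare states first" ordering can be sketched as below. The helper names `order_records` and `cluster_product` are hypothetical, and ordering each record's (variable, state) pairs by the overall frequency of that state is one plausible reading of the description, not a confirmed detail of the method.

```python
from collections import Counter


def order_records(data):
    """Computed once before learning: per record, (variable, state) pairs, rarest first."""
    counts = Counter((v, s) for record in data for v, s in enumerate(record))
    return [sorted(enumerate(record), key=lambda pair: counts[pair]) for record in data]


def cluster_product(prior_c, table_c, ordered_record):
    """Return (product, multiplications performed); stop at the first zero factor."""
    p = prior_c
    multiplications = 0
    for v, s in ordered_record:
        factor = table_c[v][s]
        if factor == 0.0:
            return 0.0, multiplications  # a posteriori weight of this cluster is zero
        p *= factor
        multiplications += 1
    return p, multiplications


data = [[0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
ordered = order_records(data)
# Hypothetical cluster table: state 1 of variables 0 and 2 has probability zero.
table_c = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]]

with_order = cluster_product(0.5, table_c, ordered[2])
natural_order = cluster_product(0.5, table_c, list(enumerate(data[2])))
```

In this toy case the reordered record hits the zero factor immediately, while the natural variable order performs two multiplications before reaching it.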
- a logarithmic representation of the tables is preferably used, for example to avoid underflow problems.
- This function can be used, for example, to replace zero elements with a positive value. Complex processing or separation of values that are almost zero and differ from one another only by a very small amount is then no longer necessary.
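Why a logarithmic representation helps against underflow can be shown with a short sketch. The names `safe_log` and `posterior_from_logs` and the constant `LOG_FLOOR` are assumptions made for this illustration; in particular, mapping a zero probability to a large negative constant is only one way to handle zero entries and is not taken from the source.

```python
import math

LOG_FLOOR = -1.0e30  # assumed finite stand-in for log(0); not from the source


def safe_log(p, floor=LOG_FLOOR):
    """Logarithm that maps a zero probability to a finite floor value."""
    return math.log(p) if p > 0.0 else floor


def posterior_from_logs(log_prior, log_tables, record):
    """A posteriori cluster weights computed from log tables (log-sum-exp)."""
    scores = [
        log_prior[c] + sum(log_tables[c][v][s] for v, s in enumerate(record))
        for c in range(len(log_prior))
    ]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(x - top) for x in scores))
    return [math.exp(x - log_norm) for x in scores]


# 300 variables with small probabilities: the plain product underflows to 0.0.
n_vars = 300
record = [0] * n_vars
tables = [[[1e-3, 1.0 - 1e-3]] * n_vars, [[2e-3, 1.0 - 2e-3]] * n_vars]
prior = [0.5, 0.5]

direct = prior[0]
for v, s in enumerate(record):
    direct *= tables[0][v][s]

log_prior = [safe_log(p) for p in prior]
log_tables = [[[safe_log(p) for p in col] for col in cluster] for cluster in tables]
posterior = posterior_from_logs(log_prior, log_tables, record)
```

With plain multiplication both clusters would end up with weight zero and the posterior would be undefined; in the log domain the ratio between the clusters is preserved.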
- clusters which are given an a posteriori weight of zero through multiplication by zero are excluded from all further calculations in order to save numerical effort. In this example, intermediate results regarding the cluster affiliations of individual data points (which clusters are already excluded or still permissible) are also stored from one EM step to the next in additionally required data structures.
- for each data point, a list or a similar data structure can first be stored which contains references to the relevant clusters that have been given a non-zero weight for this data point.
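The per-data-point bookkeeping can be sketched as follows: each record keeps a list of the clusters that are still permissible, and a cluster that once receives zero weight for a record is never visited again for that record. The function name `e_step_with_pruning` and the use of plain index lists in place of pointers are assumptions made for this illustration.

```python
def e_step_with_pruning(data, prior, tables, active):
    """Inference over permissible clusters only.

    active[n] lists the cluster indices still allowed for record n and is
    updated in place, so the exclusions carry over to the next EM step.
    """
    posteriors = []
    for n, record in enumerate(data):
        joint = {}
        for c in active[n]:
            p = prior[c]
            for v, s in enumerate(record):
                p *= tables[c][v][s]
                if p == 0.0:
                    break  # first zero factor: stop forming the product
            if p > 0.0:
                joint[c] = p
        total = sum(joint.values())
        active[n] = list(joint)  # zero-weight clusters are dropped for good
        posteriors.append({c: p / total for c, p in joint.items()})
    return posteriors


data = [[0], [1]]
prior = [0.5, 0.5]
# Hypothetical tables: cluster 0 never produces state 1 of the only variable.
tables = [[[1.0, 0.0]], [[0.5, 0.5]]]
active = [[0, 1], [0, 1]]
posteriors = e_step_with_pruning(data, prior, tables, active)
```

After the call, the second record refers only to cluster 1, so later EM steps skip cluster 0 for it entirely.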
- missing information is not supplemented for all quantities.
- part of the missing information can be "ignored". In other words, no attempt is made to learn something about a random variable Y (a node Y) from data in which there is no information about Y, and no attempt is made to learn something about the relationships between two random variables Y and X (two nodes Y and X) from data that contain no information about Y and X.
- a statistical model contains variables which describe what rating a cinema-goer has given a film.
- There is a variable for each film with each variable being assigned a plurality of states, each state representing an evaluation value.
- There is a record for each customer, in which it is stored which film has received which evaluation value. If a new film is offered, the rating values for this film are initially missing.
- with the new variant of the EM learning method it is now possible to carry out the EM learning method only with the films known up to that point, i.e. to ignore the new film (in general, the new node in the directed graph) until it appears. Only when the new film is released is the statistical model dynamically supplemented by a new variable (a new node), and the ratings of the new film are taken into account. The convergence of the process in terms of log likelihood is still guaranteed; the process even converges faster.
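Ignoring missing information can be sketched as an E-step that uses only the observed entries of each record, both for inference and for accumulating sufficient statistics. The function name `accumulate_observed`, the use of `None` for a missing rating and the toy numbers are assumptions made for this illustration.

```python
def accumulate_observed(data, prior, tables):
    """E-step that skips missing entries (None) instead of filling them in."""
    n_clusters = len(prior)
    ss_prior = [0.0] * n_clusters
    ss_tables = [[[0.0] * len(col) for col in tables[c]] for c in range(n_clusters)]
    for record in data:
        observed = [(v, s) for v, s in enumerate(record) if s is not None]
        joint = []
        for c in range(n_clusters):
            p = prior[c]
            for v, s in observed:
                p *= tables[c][v][s]
            joint.append(p)
        total = sum(joint)
        for c in range(n_clusters):
            w = joint[c] / total
            ss_prior[c] += w
            for v, s in observed:
                ss_tables[c][v][s] += w  # nothing is added for missing entries
    return ss_prior, ss_tables


# Variable 0: ratings of a known film; variable 1: a new film without ratings yet.
data = [[0, None], [1, None], [1, None]]
prior = [0.5, 0.5]
tables = [[[0.7, 0.3], [0.5, 0.5]], [[0.2, 0.8], [0.5, 0.5]]]
ss_prior, ss_tables = accumulate_observed(data, prior, tables)
```

The table of the unrated film stays at zero, so nothing is learned about it until ratings arrive; a table updated this way would have to be normalised by its own column total rather than by the cluster weight.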
- H is a hidden node.
- $O = \{O_1, O_2, \ldots, O_M\}$ denotes a set of M observable nodes in the directed graph of the statistical model.
- a data record with N data record elements, indexed by $i = 1, \ldots, N$
- the statistical model estimates are accumulated according to the following rules:
- the parameters for all nodes are updated according to the following rules:
- Probability tables must be normalized when transferring the sufficient statistics (SS) to P.
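The normalization step mentioned above can be sketched in a few lines. The function name `normalize_table` is hypothetical, and the uniform fallback for an all-zero row is an assumption of this sketch; the source does not say how such rows are treated.

```python
def normalize_table(ss_row):
    """Turn accumulated sufficient statistics into probabilities that sum to 1."""
    total = sum(ss_row)
    if total == 0.0:
        # Assumption of this sketch: fall back to a uniform distribution.
        return [1.0 / len(ss_row)] * len(ss_row)
    return [x / total for x in ss_row]


probabilities = normalize_table([2.0, 6.0])
empty_row = normalize_table([0.0, 0.0])
```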
- the expected values for the nodes $Y_i$ that are not present are calculated, and the sufficient statistics values for these nodes are updated accordingly, following rule (7).
- the composite distribution P(H, …) essentially these random numbers in the first step. This means that the initial random numbers are taken into account in the sufficient statistics values according to the ratio of the missing information to the available information. In other words, the initial random numbers in each table are only "deleted" according to the ratio of the missing information to the existing information.
- Node H are normalized for the log likelihood:
- the sum $\sum_h$ denotes the sum over all states h of the node H.
- the first line applies generally to all B (see rule (14)).
- the third line applies due to rule (15).
- the last line of rule (17) again corresponds to rule (14).
- a sequence of EM iterations is formed such that:
- the unobserved nodes $X_i$ are divided into two subsets $H_i$ and $Y_i$ in such a way that none of the nodes in the sets $X_i$ and $H_i$ is a dependent, i.e. subsequent, node ("child" node) of a node in the set $Y_i$.
- $Y_i$ corresponds to a branch in a Bayesian network for which there is no information in the data.
- the invention can clearly be seen in the fact that broad and simple (but generally approximate) access to the statistics of a database (preferably via the Internet) is created by forming statistical models of the content of the database.
- the statistical models for "remote diagnosis", for so-called "remote assistance" or for "remote research" are thus automatically sent via a communication network.
- “knowledge” is communicated and sent in the form of a statistical model.
- Knowledge is often knowledge about the relationships and interdependencies in a domain, for example about the dependencies in a process.
- a statistical model of a domain, which is formed from the data in the database, reflects all of these relationships.
- the models represent a joint probability distribution over the dimensions of the database, so they are not restricted to a specific task but represent any dependencies between the dimensions. Compressed into the statistical model, knowledge about a domain can very easily be handled, sent, made available to any users, etc.
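Because the model represents a joint distribution, approximate answers to arbitrary count queries can be read off it without access to the database. The sketch below assumes a cluster model with a prior and per-cluster tables; the function name `marginal`, the toy parameters and the database size of 1000 records are hypothetical.

```python
def marginal(prior, tables, query):
    """P(X_v = s for every (v, s) in query) under the cluster model.

    Variables not mentioned in the query are summed out implicitly,
    because each table column sums to one.
    """
    total = 0.0
    for c, p_c in enumerate(prior):
        p = p_c
        for v, s in query.items():
            p *= tables[c][v][s]
        total += p
    return total


prior = [0.5, 0.5]
tables = [[[0.8, 0.2], [0.5, 0.5]], [[0.2, 0.8], [0.9, 0.1]]]

p_single = marginal(prior, tables, {0: 0})
p_pair = marginal(prior, tables, {0: 0, 1: 0})
# Approximate number of matching records in an assumed database of 1000 records.
estimated_count = round(1000 * p_pair)
```

The same model answers the one-variable and the two-variable query, which is the sense in which it is "not restricted to a specific task"; the answers are approximate to the extent that the model is.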
- the resolution of the image or the statistical model can be selected according to the requirements of data protection or the needs of the partners.
- the following publications are cited in this document:
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/534,510 US20060129580A1 (en) | 2002-11-12 | 2003-10-21 | Method and computer configuration for providing database information of a first database and method for carrying out the computer-aided formation of a statistical image of a database |
AU2003279305A AU2003279305A1 (en) | 2002-11-12 | 2003-10-21 | Method and computer configuration for providing database information of a first database and method for carrying out the computer-aided formation of a statistical image of a database |
EP03772243A EP1561173A2 (de) | 2002-11-12 | 2003-10-21 | Verfahren und computer-anordnung zum bereitstellen von datenbankinformation einer ersten datenbank und verfahren zum rechnergestützten bilden eines statistischen abbildes einer datenbank |
JP2004550701A JP2006505858A (ja) | 2002-11-12 | 2003-10-21 | 第1データベースにおけるデータベース情報を提供する提供方法及びコンピュータ構造、並びにデータベースにおける統計イメージのコンピュータ援用形成方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10252445.9 | 2002-11-12 | ||
DE10252445A DE10252445A1 (de) | 2002-11-12 | 2002-11-12 | Verfahren und Computer-Anordnung zum Bereitstellen von Datenbankinformation einer ersten Datenbank und Verfahren zum rechnergestützten Bilden eines statistischen Abbildes einer Datenbank |
Publications (3)
Publication Number | Publication Date |
---|---|
WO2004044772A2 (de) | 2004-05-27 |
WO2004044772A9 (de) | 2004-08-19 |
WO2004044772A3 (de) | 2004-12-16 |
Family
ID=32185484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2003/011655 WO2004044772A2 (de) | 2002-11-12 | 2003-10-21 | Verfahren und computer-anordnung zum bereitstellen von datenbankinformation einer ersten datenbank und verfahren zum rechnergestützten bilden eines statistischen abbildes einer datenbank |
Country Status (6)
Country | Link |
---|---|
US (1) | US20060129580A1 (de) |
EP (1) | EP1561173A2 (de) |
JP (1) | JP2006505858A (de) |
AU (1) | AU2003279305A1 (de) |
DE (1) | DE10252445A1 (de) |
WO (1) | WO2004044772A2 (de) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7873724B2 (en) * | 2003-12-05 | 2011-01-18 | Microsoft Corporation | Systems and methods for guiding allocation of computational resources in automated perceptual systems |
US7761474B2 (en) * | 2004-06-30 | 2010-07-20 | Sap Ag | Indexing stored data |
US7623651B2 (en) * | 2004-09-10 | 2009-11-24 | Microsoft Corporation | Context retention across multiple calls in a telephone interaction system |
WO2006066556A2 (de) * | 2004-12-24 | 2006-06-29 | Panoratio Database Images Gmbh | Relationale komprimierte datenbank-abbilder (zur beschleunigten abfrage von datenbanken) |
US7512617B2 (en) * | 2004-12-29 | 2009-03-31 | Sap Aktiengesellschaft | Interval tree for identifying intervals that intersect with a query interval |
US20060159339A1 (en) * | 2005-01-20 | 2006-07-20 | Motorola, Inc. | Method and apparatus as pertains to captured image statistics |
JP5510127B2 (ja) * | 2010-06-30 | 2014-06-04 | 株式会社ニコン | 統計情報提供システム、統計情報提供サーバ、移動端末、会員端末及びプログラム |
US20150347421A1 (en) * | 2014-05-29 | 2015-12-03 | Avaya Inc. | Graph database for a contact center |
JP7212103B2 (ja) * | 2021-05-20 | 2023-01-24 | ヤフー株式会社 | 情報処理装置、情報処理方法及び情報処理プログラム |
JP7354181B2 (ja) * | 2021-05-20 | 2023-10-02 | ヤフー株式会社 | 情報処理装置、情報処理方法、及び情報処理プログラム |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000065479A1 (en) * | 1999-04-22 | 2000-11-02 | Microsoft Corporation | Multi-dimensional database and data cube compression for aggregate query support on numeric dimensions |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US623337A (en) * | 1899-04-18 | Birger isidor rydberg | ||
US6449612B1 (en) * | 1998-03-17 | 2002-09-10 | Microsoft Corporation | Varying cluster number in a scalable clustering system for use with large databases |
US6012058A (en) * | 1998-03-17 | 2000-01-04 | Microsoft Corporation | Scalable system for K-means clustering of large databases |
US6728713B1 (en) * | 1999-03-30 | 2004-04-27 | Tivo, Inc. | Distributed database management system |
US20020129038A1 (en) * | 2000-12-18 | 2002-09-12 | Cunningham Scott Woodroofe | Gaussian mixture models in a data mining system |
2002
- 2002-11-12 DE: application DE10252445A, published as DE10252445A1 (de), not active (Ceased)
2003
- 2003-10-21 EP: application EP03772243A, published as EP1561173A2 (de), not active (Withdrawn)
- 2003-10-21 JP: application JP2004550701A, published as JP2006505858A (ja), active (Pending)
- 2003-10-21 US: application US10/534,510, published as US20060129580A1 (en), not active (Abandoned)
- 2003-10-21 AU: application AU2003279305A, published as AU2003279305A1 (en), not active (Abandoned)
- 2003-10-21 WO: application PCT/EP2003/011655, published as WO2004044772A2 (de), active (Application Filing)
Non-Patent Citations (4)
Title |
---|
CHAN P K, STOLFO S J: "Sharing learned models among remote database partitions by local meta-learning" KDD-96 PROCEEDINGS. SECOND INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, PORTLAND, OR, USA, 2-4 AUGUST 1996, 1996, XP002292366 AAAI PRESS, MENLO PARK, CA, USA Retrieved from the Internet: URL:http://citeseer.ist.psu.edu/chan96sharing.html [retrieved on 2004-08-13] * |
CHEN R ET AL: "Distributed Web mining using Bayesian networks from multiple data streams" DATA MINING, 2001. ICDM 2001, PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON SAN JOSE, CA, USA 29 NOV.-2 DEC. 2001, LOS ALAMITOS, CA, USA, IEEE COMPUT. SOC, US, 29 November 2001 (2001-11-29), pages 75-82, XP010583262 ISBN: 0-7695-1119-8 * |
KARGUPTA H ET AL: "Collective data mining: A new perspective toward distributed data analysis" IN KARGUPTA H AND CHAN P, EDITORS, ADVANCES IN DISTRIBUTED AND PARALLEL KNOWLEDGE DISCOVERY, 2000, XP002292368 MIT, AAAI PRESS Retrieved from the Internet: URL:http://www.cs.umbc.edu/~hillol/PUBS/bc.pdf [retrieved on 2004-08-13] * |
PRODROMIDIS A L, STOLFO S J: "Mining databases with different schemas: integrating incompatible classifiers" KDD-98 PROCEEDINGS. FOURTH INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, NEW YORK, NY, USA, 27-31 AUGUST 1998, 1998, XP002292367 AAAI PRESS, MENLO PARK, CA, USA Retrieved from the Internet: URL:http://citeseer.ist.psu.edu/106070.html [retrieved on 2004-08-13] * |
Also Published As
Publication number | Publication date |
---|---|
WO2004044772A9 (de) | 2004-08-19 |
US20060129580A1 (en) | 2006-06-15 |
JP2006505858A (ja) | 2006-02-16 |
WO2004044772A3 (de) | 2004-12-16 |
EP1561173A2 (de) | 2005-08-10 |
DE10252445A1 (de) | 2004-05-27 |
AU2003279305A1 (en) | 2004-06-03 |
AU2003279305A8 (en) | 2004-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE112021004197T5 (de) | Semantisches Lernen in einem System für ein föderiertes Lernen | |
DE102019129050A1 (de) | Systeme und verfahren zur gemeinsamen nutzung von fahrzeugen über peer-to-peer-netzwerke | |
DE112018005205T5 (de) | Komprimierung von vollständig verbundenen / wiederkehrenden Schichten von einem oder mehreren tiefen Netzen durch Durchsetzen von räumlicher Lokalität für Gewichtsmatrizen und erwirken von Frequenzkomprimierung | |
WO2004044772A2 (de) | Verfahren und computer-anordnung zum bereitstellen von datenbankinformation einer ersten datenbank und verfahren zum rechnergestützten bilden eines statistischen abbildes einer datenbank | |
DE102020215650A1 (de) | Ontologiebewusste klangklassifizierung | |
Goplerud | A Multinomial Framework for Ideal Point Estimation | |
EP1620807A1 (de) | Datenbank-abfragesystem unter verwendung eines statistischen modells der datenbank zur approximativen abfragebeantwortung | |
DE112021005925T5 (de) | Domänenverallgemeinerter spielraum über metalernen zur tiefen gesichtserkennung | |
DE112018006438T5 (de) | Clustering von facetten auf einem zweidimensionalen facettenwürfel für text-mining | |
EP1264253B1 (de) | Verfahren und anordnung zur modellierung eines systems | |
EP3507943B1 (de) | Verfahren zur kommunikation in einem kommunikationsnetzwerk | |
DE102021127398A1 (de) | Beziehungserkennung und -quantifizierung | |
WO2021190715A1 (de) | Computerimplementiertes verfahren und verteiltes speichersystem zum bereitstellen vertrauenswürdiger datenobjekte | |
DE112021001492T5 (de) | Verfahren und systeme zur graphdatenverarbeitung mit hybridem schlussfolgern | |
DE102015008607A1 (de) | Adaptives Anpassen von Netzwerk-Anforderungen auf Client-Anforderungen in digitalen Netzwerken | |
DE10233609A1 (de) | Verfahren zur Ermittlung einer in vorgegebenen Daten vorhandenen Wahrscheinlichkeitsverteilung | |
DE112021005531T5 (de) | Verfahren und vorrichtung zur erzeugung von trainingsdaten für ein graphneuronales netzwerk | |
DE102011077611A1 (de) | Verfahren zum rechnergestützten Erkennen von Angriffen auf ein Computernetz | |
DE112022000630T5 (de) | Abgleichen von informationen durch verwenden von untergraphen | |
DE102014118401A1 (de) | Dezentralisiertes Expertensystem für netzwerkbasiertes Crowdfunding | |
DE102022118244A1 (de) | System, Verfahren und Computerprogrammprodukt zur optimierten Testplanung für das Prototypenmanagement einer Entität | |
DE202022100198U1 (de) | Ein wolkenbasiertes System zur Graphenberechnung | |
WO2023139130A1 (de) | Computer-implementierte datenstruktur, verfahren und system zum betrieb eines technischen geräts mit einem modell auf basis föderierten lernens | |
EP3913567A1 (de) | Server-computersystem sowie bewertungsverfahren | |
CN117952232A (zh) | 数据处理方法、装置、电子设备及存储介质 |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
COP | Corrected version of pamphlet | Free format text: PAGE 22, DESCRIPTION, ADDED |
WWE | Wipo information: entry into national phase | Ref document number: 2003772243 Country of ref document: EP Ref document number: 2004550701 Country of ref document: JP |
WWP | Wipo information: published in national office | Ref document number: 2003772243 Country of ref document: EP |
ENP | Entry into the national phase | Ref document number: 2006129580 Country of ref document: US Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 10534510 Country of ref document: US |
WWP | Wipo information: published in national office | Ref document number: 10534510 Country of ref document: US |